M.F. Guest and P. Sherwood,
Department for Computation and Information,
CCLRC Daresbury Laboratory,
Warrington WA4 4AD, Cheshire, UK
J.A. Nichols,
High Performance Computational Chemistry Group,
Environmental Molecular Sciences Laboratory,
Pacific Northwest National Laboratory,
PO Box 999, Mail Stop K190, Richland,
WA 99352, USA
[1] This paper is based on an invited presentation at the meeting "Supercomputing, Collision Processes and Applications" that was held at The Queen's University of Belfast from the 14th to 16th September 1998 to mark the retirement of Professor P.G. Burke CBE FRS. The Proceedings will be published by Plenum Publishing Corporation.
1. Introduction
Computational chemistry covers a wide spectrum of activities ranging from quantum mechanical calculations of the electronic structure of molecules, to classical mechanical simulations of the dynamical properties of many-atom systems, to the mapping of both structure-activity relationships and reaction synthesis steps. Although chemical theory and insight play important roles in this work, the prediction of physical observables is almost invariably bounded by the available computer capacity.
The potential of massively parallel computers, with hundreds to thousands of processors (MPPs), to significantly outpace conventional supercomputers in both capacity and price-performance has long been recognised. While increases in raw computing power alone will greatly expand the range of problems that can be treated by theoretical chemistry methods, it is now apparent that a significant investment in new algorithms is needed to fully exploit this potential. Merely porting presently available software to these parallel computers does not provide the efficiency required, with many existing parallel applications showing a deterioration in performance as greater numbers of processors are used. New algorithms must be developed that exhibit parallel scalability (i.e., show a near linear increase in performance with the processor count). Such improvements promise substantial scientific and commercial gains by increasing the number and complexity of chemical systems that can be studied.
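The penalty for leaving serial content in a ported code can be made concrete with Amdahl's law (a textbook illustration, not an analysis from this paper): even a 1% serial fraction caps the achievable speedup below 100, however many processors are used.

```python
# Amdahl's law: speedup attainable on p processors when a fraction f of
# the work remains serial. Illustrates why merely ported codes, with even
# a small serial fraction, cannot show near linear scaling.

def amdahl_speedup(p, serial_fraction):
    """Speedup on p processors with the given unparallelised fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

for p in (16, 64, 256, 1024):
    print(p, round(amdahl_speedup(p, 0.01), 1))
# With f = 0.01 the speedup on 1024 processors is still below 100.
```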
Organised in five sections, this paper aims to analyse just how far the potential of MPPs has been realised, and what is needed in terms of software development to provide the appropriate applications. Following a consideration of the computational requirements for molecular computations as a function of the level of theory and required accuracy (section 2), we present an analysis of the cost effectiveness of current top-end MPP solutions against both desktop and midrange departmental resources. We identify two distinct categories of application, Grand Challenge and Throughput. The former is taken to include those chemical systems of sufficient complexity that neither desktop nor midrange resources provide a computationally viable solution. Throughput applications are defined to include chemical systems that are amenable to treatment by more modest resources, but where time to solution demands a top-end capability. We define an MPP performance metric applicable to both application categories that emerges from this consideration of price-performance.
Section 3 considers Grand Challenge applications, and the need to address parallel scalability from the outset in developing molecular modelling software applications. In assessing the algorithmic impact of this requirement, we overview the development of the NWChem package at the Pacific Northwest National Laboratory [1] as an example of the provision of capabilities designed to facilitate ``new science'', but in cost-effective fashion. Section 4 turns to the treatment of Throughput applications, and considers the efforts required to migrate existing applications to MPP platforms so as to achieve the performance metric of section 2. We consider the parallel development of the GAMESS-UK package [2] at the CCLRC Daresbury Laboratory as a typical example of what is achievable from this more modest approach to exploiting MPP platforms. We present performance figures on both Cray T3E and IBM SP systems for a number of Grand Challenge and Throughput calculations on representative molecular systems using a variety of methods routinely employed by the computational chemist, including self-consistent field Hartree-Fock (SCF) and Density Functional Theory (DFT), second-order Møller-Plesset perturbation theory (MP2), and SCF second derivatives. While the major focus of this paper lies at the top-end, and exploiting the potential of MPP-class systems, we also consider in section 4 the potential of Beowulf-class PC-based systems in providing a cost-effective resource for the computational chemist.
2. Methods and Cost-Performance Issues
The complexity of computational chemistry, and the demands that the discipline imposes on any computational resource, are illustrated in Figure 1, where the relative computing power required for molecular computations at four different levels of theory (and accuracy) is shown: configuration interaction (CI), Hartree-Fock (HF), density functional, and molecular dynamics. These levels of theory scale formally from N^{2} to N^{6} with respect to computational requirements. While standard screening techniques, which in essence ignore interactions between atoms far apart, can bring these scalings down significantly, it is clear that the major challenge facing the computational chemist is the development of methods that exhibit superior scaling properties to those shown in Figure 1. We should not, of course, lose sight of this goal in our efforts to maximally exploit current methodology on present and future hardware.
Figure 1: The relative computing power required for molecular computations at four levels of theory. In the absence of screening techniques, the formal scaling for configuration interaction, Hartree-Fock, density functional, and molecular dynamics is N^{6}, N^{4}, N^{3} and N^{2}, respectively.
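The practical force of these scalings is easily quantified (a back-of-the-envelope illustration, not a result from the paper): for an N^{k} method, a given increase in computing power enlarges the treatable system size only by its k-th root.

```python
# For an N**k method, a power increase of P lets the system size N grow
# by P**(1/k): a 1000-fold increase in computing power stretches a CI
# calculation far less than a molecular dynamics one.
scalings = {"CI": 6, "Hartree-Fock": 4, "density functional": 3, "molecular dynamics": 2}
power_increase = 1000.0
for method, k in scalings.items():
    growth = power_increase ** (1.0 / k)
    print(f"{method} (N^{k}): treatable system size grows by {growth:.1f}x")
```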
In assessing alternative hardware platforms for computational chemistry, we need to quantify in some fashion the cost effectiveness of a top-end MPP solution, given that such a solution should be able to offer capabilities not possible on either desktop or midrange resources, i.e., a solution that will facilitate ``new science'', but in cost-effective fashion. Table 1 compares the cost of three HPC solutions, associated with the desktop, with a departmental or midrange resource (e.g., a 16-processor Origin 2000 from Silicon Graphics, with 4 GByte RAM), and with a top-end MPP solution comprising perhaps 500 or 1000 nodes (e.g., an IBM SP P2SC/160 or Cray T3E/1200, with 64-128 GByte RAM). We also consider the associated capabilities, defined in terms of CPU, memory and Input/Output (I/O) rate. In each case we have normalised these attributes relative to that available on the desktop. We have assumed the most cost effective desktop solution to be a 450 MHz Pentium II PC, and have costed that to include 256 MByte RAM and 9 GByte disk. Allowing for a factor of two in this, or for local variations in costs, does not change the final conclusion; the cost ratios of Table 1 lend little or no support to the repeated claims that MPP solutions are cost effective. This is clear when we compare the actual increase in capability as we move from desktop to the top-end national resource. We see factors of 500/1000 increase in CPU capability, 250/500 increase in available memory, and 20/100 in achievable I/O bandwidth, each factor far below the associated cost ratios.
Specification                       Usage           Units of Cost   CPU        Memory    I/O
Top-end MPP, 500-1000 nodes
  (Cray T3E, IBM P2SC)              HPC Community   3000            500-1000   250-500   20-100
Midrange SMP, 16-processor
  SGI Origin 2000                   Departmental    60              20         10-20     20-50
Desktop, 450 MHz Pentium II         Single User     1               1          1         1

Table 1: Relative costs and capabilities of three HPC solutions (hardware attributes normalised to the desktop)
What is clear from this rather crude analysis is that scalability, with the exploitation of not only CPU, but also memory and I/O bandwidth in scalable fashion, is crucial to the twin goals of cost-effectiveness and the solving of Grand Challenge problems. Without that scalability, there is nothing to justify a top-end resource against, say, a number of SGI Origin 2000s; funding for the latter is often easier to find, and provides the principal investigator with the satisfaction of local ownership and control. Such midrange provision would, we suspect, be the preferred route unless the top-end resource really does provide the opportunity for Grand Challenge science not possible at the departmental or desktop level. So what level of scalability justifies a top-end resource? A pragmatic approach suggests that the latter should provide a number of groups with simultaneous access to at least a two orders of magnitude improvement over the desktop, and an order of magnitude improvement over computation available at the departmental level. Such comparisons should allow for the rapid evolution of desktop and midrange technology, with some form of technology refresh at the top-end enabling these factors of 100 and 10 to be sustained throughout the lifetime of the machine. Another pointer comes from considering time-to-solution for Throughput applications. Perhaps the longest viable desktop run would be of the order of 1 month; a user is unlikely to dedicate his own machine for longer periods. Given a 100-fold performance improvement, such a run could be accomplished overnight.
Thus in our analysis of the parallel developments conducted to date, we will be looking for scalability and performance at the above levels. Furthermore, we require these levels to be available from some fraction of the top-end resource, rather than the whole machine; this would enable several groups to be simultaneously delivering that level of performance. If a given group has to wait a week to access that level, rather than overnight, the value of the top-end resource will rapidly diminish.
3. Grand Challenge Applications: the NWChem Project
The Environmental Molecular Sciences Laboratory (EMSL [3]) is a new research laboratory constructed by the U.S. Department of Energy (DOE) at PNNL, in Richland, Washington. EMSL's primary mission is to develop an understanding of the fundamental molecular processes required to develop efficient and cost-effective technologies to remediate, process and provide long-term storage of hazardous wastes at DOE sites.
Computational simulations and modelling to be performed within the EMSL will further a molecular-level understanding of the physical, chemical and biological processes that underlie environmental remediation, waste processing and related health and ecological effects. EMSL researchers aim to model the migration of contaminants below ground, on the earth's surface and in the air. Computations can predict the fate of those contaminants, and simulations can help in the design and execution of any required remedial actions. A 512-processor IBM RS/6000 SP is being used to carry out large-scale computations to aid in the design of new materials and to perform environmental molecular science research that is essential to cleaning up the environment faster and more cost effectively. These calculations will help design compounds to separate radionuclides from other wastes, engineer enzymes to enhance the biodegradation of organic wastes, and develop new materials to contain radioactive wastes for short- and long-term storage.
The High Performance Computational Chemistry (HPCC) group within the EMSL is developing a new generation of molecular modelling software to take full advantage of the parallel computing power of MPPs. In addition to the development of new algorithms that exhibit parallel scalability (i.e., show a near linear increase in performance with the number of processors), work has focused on creating the high-level data and control structures needed to make parallel programs easier to write, maintain, and extend. These efforts have culminated in a package called NWChem that includes a broad range of electronic structure and molecular dynamics functionality. Version 1.0 of NWChem provides the following electronic structure capabilities: HF self-consistent field (SCF) energies, gradients and second derivatives [4, 5], multiconfiguration SCF energies and gradients [6], several forms of many-body perturbation theory (MP2-MP4) energies and gradients [7, 8], density functional theory (DFT) energies and gradients, coupled cluster (CCSD and CCSD(T)) energies, and effective core potential energies, gradients and second derivatives. A wide range of properties are also available: gradients, Hessians, electrical multipoles, NMR shielding tensors, etc. Molecular dynamics calculations can be performed with a variety of empirical force fields for the simulation of macromolecular and solution systems, with functionality including energy minimisation, molecular dynamics simulation and free energy calculation.
In outlining these developments, we focus below on the design philosophy, structure, and tools that make NWChem an effective environment for the development of computational chemistry applications. Although this focus has been on the efficient use of parallel computers in chemistry, almost all of the ideas and tools presented here are generally applicable. We provide, by way of example, an outline of the development and performance of a highly efficient and scalable algorithm for conducting SCF-DFT studies on molecular systems, illustrated through DFT calculations on a number of zeolite fragments.
3.1 Design Philosophy and Software Development Tools
The philosophy and structure of NWChem evolved in response to experience gained with other computational chemistry packages, and to a number of constraints. Initial experiments led to the conclusion that while it was possible to modify existing software to implement parallel algorithms on a ``proof of principle'' basis, the effort involved in turning them into fully scalable (distributed- rather than replicated-data) codes was tantamount to rewriting the codes from scratch. At the same time, as in most software development efforts, the project was constrained in both the personnel and the time available. Its success relied on having a system that was easy to program, and in particular one in which new parallel algorithms could be readily prototyped.
In order to meet the twin requirements of both maintainability (the ability to transition from locally developed parallel programming tools to vendor-supported, standards-based implementations) and portability (finding a programming model which can be used effectively across the variety of MPP platforms), ideas were adopted from the object-oriented (OO) style of software development [9]. Examples of the adoption of the basic principles of this methodology (abstraction, encapsulation, modularity, and hierarchy) have been presented elsewhere [11]. While there are OO programming languages, such as C++ [10], which allow the formal aspects of the methodology to be taken straight into the actual code, the project chose a compromise approach in the development of NWChem. Object-oriented ideas were used at the design stage, while the implementation remained a mixture of Fortran and C, languages much more familiar to current computational chemists than C++. Experience suggested that while it may take extra time, a thorough, careful design phase is probably the most important aspect of the OO methodology, and this effort is quickly recouped in easier implementation and fewer instances of modification of code after implementation. A design based on OO principles can be implemented in a non-OO programming language with only a small amount of discipline on the programmer's part.
A number of low-level tools were selected or developed to facilitate the development of NWChem. Although some were designed specifically to meet the needs of computational chemistry applications, they remain generally applicable. We focus attention below on the tools that target the development of distributed-data applications and distributed linear algebra, noting that a discussion of the building blocks behind these tools, notably message passing and memory allocation, has appeared elsewhere [11].
3.1.1 Global Arrays
A vital consideration in using MPPs is how data are stored. So-called replicated-data schemes require that a copy of each data item in the program be stored on each processor, so that the size of the problem that can be handled is limited by the memory of a single processor. In distributed-data applications each processor holds only a part of the total data; in such cases, the problem size is limited only by the total memory of the machine, allowing much larger problems to be treated. Efforts at PNNL have focused on distributed-data applications; relying on a replicated-data implementation of the DFT module, for example, would limit the size of calculation to perhaps 2,000 basis functions (see section 4), far short of the project's requirement of treating molecular systems with 10,000 basis functions.
Emerging standards for parallel programming languages (notably High Performance Fortran, HPF-1) provide little support for multiple instruction multiple data (MIMD) programming [12]. The only truly portable MIMD programming model is message passing, for which a standard interface is now established [13, 14]. It is, however, very difficult to develop applications with fully distributed data structures using the message-passing model [15, 16]. The shared-memory programming model offers increased flexibility and programming ease, but is less portable and provides less control over the interprocessor transfer cost. What is needed is support for one-sided asynchronous access to data structures (here limited to one- and two-dimensional arrays) in the spirit of shared memory. With some effort this can be done portably [17]; in return for this investment, a much easier programming environment is achieved, speeding code development and improving extensibility and maintainability. A significant performance enhancement also results from the increased asynchrony of execution of processes [18]; with a one-sided communication mechanism, where each process can access what it needs without the explicit participation of another process, all processes can operate independently. This approach eliminates unnecessary synchronization and naturally leads to the interleaving of computation and communication. Most programs contain multiple algorithms, some of which may naturally be task-parallel (e.g., Fock matrix construction), and others that may be efficiently and compactly expressed as data-parallel operations (e.g., evaluating the trace of a matrix product). Both types of parallelism must be efficiently supported.
Consideration of the requirements of the SCF algorithm, the parallel COLUMBUS configuration interaction program [19], second-order many-body perturbation theory [8] and parallel coupled-cluster methods [20] led to the design and implementation of the Global Array (GA) toolkit [17] to support one-sided asynchronous access to globally addressable, distributed one- and two-dimensional arrays.
This toolkit provides an efficient and portable ``shared-memory'' programming interface for distributed-memory computers. Each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, without the need for explicit cooperation by other processes. Unlike other shared-memory environments, the GA model exposes the programmer to the non-uniform memory access (NUMA) timing characteristics of parallel computers and acknowledges that access to remote data is slower than access to local data. From the user's perspective, a global array can be used as if it were stored in shared memory, except that explicit library calls are required to access it. Information on the actual data distribution can be obtained and exploited whenever data locality is important. Each process is assumed to have fast access to some ``local'' portion of each distributed matrix, and slower access to the remaining ``remote'' portion. Remote data can be accessed through operations like ``get'', ``put'' or ``accumulate'' (floating point sum-reduction) that involve copying the globally accessible data to/from process-private buffer space. The toolkit provides operations for (i) the creation and destruction of distributed arrays, (ii) the synchronization of all processes, (iii) inquiries about arrays and their distribution, and (iv) primitive operations such as get, put, accumulate, atomic read-and-increment, gather and scatter, and direct access to the local portion of an array. A number of BLAS-like data-parallel operations have been developed on top of these primitives.
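The get/put/accumulate access model described above can be sketched as a single-process toy (this is not the real GA API; the class and method names are invented purely for illustration):

```python
# Toy sketch of the Global Array access model: logical blocks of a
# "distributed" matrix are read and updated via explicit get/put/accumulate
# calls rather than direct indexing. Single-process stand-in only.
import numpy as np

class ToyGlobalArray:
    def __init__(self, n, m):
        self.data = np.zeros((n, m))   # stands in for physically distributed storage

    def get(self, rows, cols):
        """Copy a logical block into process-private buffer space."""
        return self.data[rows[0]:rows[1], cols[0]:cols[1]].copy()

    def put(self, rows, cols, buf):
        """Overwrite a logical block from a local buffer."""
        self.data[rows[0]:rows[1], cols[0]:cols[1]] = buf

    def accumulate(self, rows, cols, buf):
        """Floating point sum-reduction into a logical block."""
        self.data[rows[0]:rows[1], cols[0]:cols[1]] += buf

ga = ToyGlobalArray(4, 4)
ga.put((0, 2), (0, 2), np.eye(2))
ga.accumulate((0, 2), (0, 2), np.eye(2))
print(ga.get((0, 2), (0, 2)))   # the 2x2 block now holds 2*I
```

In the real toolkit the same three verbs move data between a process-private buffer and remote memory without the remote process's participation; the toy preserves only the calling discipline.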
Additional functionality is provided through a variety of third party libraries made available by using the GA primitives to perform the necessary data rearrangement. These include standard and generalized real symmetric eigensolvers (PeIGS, see below) and linear equation solvers (ScaLAPACK) [21]. The O(N^{2}) cost of data rearrangement is observed to be negligible in comparison with that of the O(N^{3}) linear-algebra operations. These libraries may internally use any form of parallelism appropriate to the host computer system, such as cooperative message passing or shared memory.
The GA interface has been designed in the light of emerging standards. In particular, HPF-1 and subsequent revisions will provide the basis for future standards definitions of distributed arrays in Fortran. A long-term goal must be to migrate to full language support, and to eliminate as far as possible the practice of parallel programming through subroutine libraries. The basic functionality described above (create, fetch, store, accumulate, etc.) may be expressed as single statements using Fortran 90 array notation and the data-distribution directives of HPF. However, HPF currently precludes the use of such operations on shared data in MIMD parallel code. There is reason to believe that future versions of the HPF standard will rectify this problem, as well as provide for irregular distributions, which have been found to lead to a significant increase in performance in computational chemistry applications.
3.1.2 Distributed Matrices and Linear Algebra
Many electronic structure computations are formulated in terms of dense matrices of size roughly N by N, where N is the number of basis functions. Two distinct classes of operations are performed on these matrices: random access to small blocks, for constructing matrices as a function of many one- and two-electron integrals, and linear algebra operations on the entire matrix, such as eigensolving, Cholesky decomposition, linear system solution, inversion, and matrix-matrix multiplication. Both types of operations must work on distributed matrices if the resulting application is to be truly scalable.
A particular focus of our distributed linear algebra work has been the development of a scalable, fully parallel eigensolver whose numerical properties satisfy the needs of the chemistry applications. This package, called PeIGS, solves dense real symmetric standard (Ax = λx) and generalized (Ax = λBx) eigenproblems. The numerical method used is multisection for eigenvalues [22] and repeated inverse iteration with orthogonalization for eigenvectors [23]. Accuracy and orthogonality are similar to those of LAPACK's DSPGV and DSPEV [24]. Unlike other parallel inverse iteration eigensolvers, PeIGS guarantees orthogonality of eigenvectors even for arbitrarily large clusters that span processors. Internally, PeIGS uses a conventional message-passing programming model and column-distributed matrices. However, it is more commonly accessed through an interface provided by the GA toolkit. The necessary data reorganization is handled by the interface, and is very fast compared with the O(N^{3}/P) linear algebra times.
PeIGS is both fast and scalable: on a single processor it is competitive with LAPACK, and parallel efficiency remains high even for large processor counts. For example, in one of our SCF applications, the standard eigenproblem Ax = λx was solved for all eigenpairs of a 2053-by-2053 matrix. This computation required only 110 seconds on 150 processors of an Intel Paragon computer, a time-to-solution estimated to be 87 times faster than LAPACK for the same problem on one processor.
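A serial analogue of the standard problem PeIGS solves can be run with LAPACK through NumPy (this illustrates the numerical task and the properties being guaranteed, not the PeIGS implementation itself):

```python
# Solve the standard real symmetric eigenproblem A x = λx serially with
# LAPACK (via numpy.linalg.eigh) and check the properties PeIGS guarantees
# in parallel: small residuals and orthogonal eigenvectors.
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
A = (A + A.T) / 2.0                      # symmetrize the test matrix
w, V = np.linalg.eigh(A)                 # all eigenpairs, eigenvalues ascending

residual = np.max(np.abs(A @ V - V @ np.diag(w)))       # ||A V - V diag(w)||
orthogonality = np.max(np.abs(V.T @ V - np.eye(n)))     # ||V^T V - I||
print(residual < 1e-10, orthogonality < 1e-10)
```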
3.2 Chemistry Applications on MPPs: The DFT-SCF Module
A wide range of methods has been implemented in NWChem, representing the core functionality required of any general-purpose computational chemistry package. The novel aspects of their implementation are the parallel algorithms employed, along with other ideas for reducing the scaling of the resource requirements of calculations as a function of their size. Complete descriptions of all of the algorithms employed in NWChem are not possible in the space available here, and are being published separately, e.g., QC-SCF [25], MP2 [8, 26, 27], RI-SCF [28, 29, 30], RI-MP2 [7]. In order to demonstrate how the structure and tools described above are used by a high-level application, we sketch below the operation of the DFT module.
An essential piece of core functionality in any electronic structure program is the direct self-consistent field (SCF) module. While the application of direct SCF to large molecules is well suited to MPPs, targeting systems in excess of 1,000 atoms and 10,000 basis functions requires a re-examination of the conventional algorithm and its assumed memory capacities. The majority of previous parallel implementations use a replicated-data approach [12], which is limited in scope since the size of these arrays will eventually exhaust the available single-processor memory. The parallel direct DFT within NWChem (as in the HF module) distributes these arrays across the aggregate memory using the GA tools. This ensures that the size of systems treated scales with the size of the MPP and is not constrained by single-processor memory.
The software developed is an MPP implementation of the Hohenberg-Kohn-Sham formalism [31] of DFT. This method yields results similar to those from correlated ab initio methods, at substantially reduced cost. It assumes a charge density, and approximations are made to the Hamiltonian (the exchange-correlation functional), in contrast with traditional molecular orbital methods that assume an exact Hamiltonian and choose approximations to the wavefunction [32]. The Gaussian-basis DFT method in NWChem breaks the Hamiltonian down into the same basic one- and two-electron components as traditional HF methods, with the latter component further reduced to a Coulomb and an exchange-correlation term. The treatment of the former can be accomplished in identical fashion to that used in traditional SCF methods, from a fitted expression similar to that found in RI-SCF [28], or from the commonly used Dunlap fit [33]. DFT is distinguished from other traditional methods, however, by the treatment of the exchange-correlation (XC) term. This term is typically integrated numerically on a grid, or fitted to a Gaussian basis and subsequently integrated analytically.
Figure 2: Parallel scaling of the NWChem DFT module for a number of zeolite fragments (see text) on a 512-processor IBM SP.
It is instructive to elaborate on the computation/communication requirements of these time-consuming components. The computationally dominant step, the construction of the Fock matrix, is readily parallelized, as the integrals can be computed concurrently. A strip-mined approach is used in which the integral contributions to small blocks of the Fock matrix are computed locally and accumulated asynchronously into the distributed matrix. By choosing blocking over atoms, the fall-off in interaction between distant atoms can be exploited, while simultaneously satisfying local memory constraints. The three time-consuming steps in the construction of the Fock matrix are the fit of the charge density (FitCD), the calculation of the Coulomb potential (VCoul), and the evaluation of the XC potential (VXC). FitCD and VCoul typically consume similar numbers of floating point cycles, primarily evaluating three-center two-electron integrals at a rate of about 300-400 flops each. Few communications are required beyond a shared counter and global-array accumulate; these components are easily parallelized and display high efficiencies. VXC requires far fewer flops (evaluating Gaussian functions numerically on a grid, as opposed to analytical integration) but approximately the same communication effort. This results in a higher communication/computation ratio than for the preceding two components.
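The shared-counter task allocation underlying this scheme can be sketched with threads standing in for MPP processes (an illustrative toy, not the NWChem code; all names here are invented):

```python
# Sketch of dynamic load balancing via a shared counter: integral blocks
# are handed out to whichever worker asks next, mimicking the "shared
# counter plus global-array accumulate" pattern described in the text.
import itertools
import threading

n_blocks = 32                       # atom-blocked batches of integral work
counter = itertools.count()         # stands in for the shared global counter
lock = threading.Lock()
done = []                           # stands in for accumulates into the distributed Fock matrix

def worker():
    while True:
        with lock:                  # "atomic read-and-increment"
            task = next(counter)
        if task >= n_blocks:
            return
        with lock:
            done.append(task)       # compute block `task`, then accumulate it

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(done) == list(range(n_blocks)))   # every block handled exactly once
```

Because workers claim blocks on demand rather than by static assignment, an expensive block slows only the worker that drew it, which is the load-balancing property the text relies on.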
We illustrate these effects by considering the performance of the DFT module in calculations on a number of different fragment models of a zeolite cluster, conducted on the 512-processor IBM SP at the EMSL's Molecular Sciences Computing Facility (MSCF). These fragments range in size from Si_{8}O_{7}H_{18}, with 347 basis functions and 832 CD fitting functions, to Si_{28}O_{67}H_{30}, with 1687 basis functions and 3928 fitting functions. Figure 2 shows the speedups achieved using up to 256 processors for each of the fragment DFT calculations, with total times to solution given in Table 2 [1]. The curves show that even on the smaller fragments, e.g., Si_{8}O_{25}H_{18}, a speedup in excess of 100 is observed on 128 processors; for the largest zeolite we find excellent efficiency, with a measured speedup of ca. 190 on 256 processors of the IBM SP. We note in passing that higher efficiency has been demonstrated on Cray T3E systems (see section 4).
Zeolite Fragment       Basis Set (AO / CD)   Number of Nodes   Total Solution Time (seconds)
Si_{8}O_{7}H_{18}      347 / 832             64                238
Si_{8}O_{25}H_{18}     617 / 1444            128               364
Si_{26}O_{37}H_{36}    1199 / 2818           256               1137
Si_{28}O_{67}H_{30}    1687 / 3928           256               2766

Table 2: Time in wall clock seconds for NWChem DFT calculations on a variety of zeolite fragments conducted on an IBM SP
4. Throughput Applications: The GAMESS-UK Project
GAMESS-UK [2] represents a typical established electronic structure code, comprising some 500K lines of Fortran, and permits a wide range of computational methodology to be applied to molecular systems; other such codes include the GAUSSIAN series [34], CADPAC [35], GAMESS-US [36], MOLPRO [37], Turbomole [38], HONDO [39], etc. Present functionality lies at varying levels of approximation, including SCF, CASSCF, MCSCF, GVB, MP2/3/4, CCSD(T), MRD-CI, Direct-CI, Full-CI, and analytic second derivatives. The package is available on vector hardware (e.g., Cray C90 and J90, and Fujitsu VPP/300), on workstations from all the leading vendors, and on parallel hardware from Cray/SGI (Cray T3D and T3E, and the SGI Origin series) and from IBM (the SP series). There are over 110 academic and industrial user sites.
4.1 SCF, DFT and MP2 Modules: Parallel Implementations
Initial parallel SCF implementations of GAMESS-UK included those for the Intel iPSC/2 and iPSC/860 hypercubes, with more recent developments undertaken within the Esprit-funded EUROPORT 2 project, IMMP (Interactive Modelling through Parallelism). EUROPORT's principal aim was to provide exemplars of commercial codes that demonstrated the potential of parallel processing to industry; the primary goal was to establish a broad spectrum of standards-based medium-scale parallel application software rather than to target the highest levels of scalability and performance.
In contrast to NWChem, both the SCF and DFT modules are parallelised in a replicated-data fashion, with each node maintaining a copy of all data structures present in the serial version. While this structure limits the treatment of molecular systems beyond a certain size, experience suggests that it is possible on machines with 256 MByte nodes to handle systems of up to 2,000 basis functions. The main source of parallelism in the SCF module is the computation of the one- and two-electron integrals and their summation into the Fock matrix, with the more costly two-electron quantities allocated dynamically using a shared global counter. The result of parallelism implemented at this level is a code scalable to a modest number of processors (around 32), at which point the cost of other components of the SCF procedure starts to become relatively significant. The first of these to be addressed was the diagonalisation, which is now based on the PeIGS module from NWChem.
Figure 3: Parallel scaling of the GAMESS-UK direct-SCF, DFT, SCF second derivative and MP2 gradient modules in calculations on morphine, cyclosporin, di(trifluoromethyl)biphenyl and Mn(CO)_{5}H on the Cray T3E/1200 (see text).
Once the capability for GA [17] is added, some distribution of the linear algebra becomes trivial. As an example, the SCF convergence acceleration algorithm (DIIS, direct inversion in the iterative subspace) is distributed using GA storage for all matrices, together with parallel matrix multiply and dot-product functions. This not only reduces the time to perform the step, but the use of distributed-memory storage (instead of disk) reduces the need for I/O during the SCF process. Timings for a number of direct-SCF and DFT calculations conducted on an IBM SP/P2SC-160, Cray T3E/900 and Cray T3E/1200 are shown in Table 3. The direct-SCF calculations on morphine used a 6-31G** basis of 410 functions, those on cyclosporin a 3-21G basis of 1000 functions, while the DFT calculations on cyclosporin were conducted in a 6-31G basis, using both the BLYP (on the IBM SP and Cray T3E/900) and B3LYP functionals (on the Cray T3E/1200). Note that the DFT calculations did not exploit CD fitting, but evaluated the Coulomb matrix explicitly. Figure 3 shows the speed-ups achieved using up to 128 Cray T3E/1200 processors; speed-ups of 108, 95 and 106 are obtained on 128 nodes for the morphine 6-31G** SCF, the cyclosporin 3-21G SCF and the cyclosporin 6-31G DFT calculation, respectively. Considering the total time to solution, we see that the IBM SP2 is typically faster than the Cray T3E/900 for up to 64 processors, although the T3E/900 exhibits significantly higher efficiency than the SP and is some 10% faster with 128 processors. The Cray T3E/1200 outperforms the IBM SP2 at all processor counts, and is 30-40% faster with 128 processors.
Machine            Nodes   Morphine      Cyclosporin   Cyclosporin   Mn(CO)_{5}H
                           Direct-SCF,   Direct-SCF,   DFT,          MP2 gradient,
                           6-31G**       3-21G         6-31G         TZVP

IBM SP2/P2SC-160     16      715           3627          9008          6955
                     32      379           1946          4711          3967
                     64      212           1121          2552          2372
                    128      136            735          1430          1482

Cray T3E/900         16      873           3869         10305          8035
                     32      447           2068          5489          4229
                     64      238           1102          2899          2390
                    128      127            653          1456          1368

Cray T3E/1200        16      687           3165          8082          6713
                     32      356           1678          4305          3533
                     64      181            899          2261          1923
                    128      102            532          1219          1158

Table 3: Time in wall clock seconds for a variety of direct-SCF, DFT and MP2-gradient calculations using GAMESS-UK
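As a check on the quoted scaling figures, the Figure 3 speed-ups can be reproduced from the Table 3 wall-clock times for the Cray T3E/1200, taking the 16-node run as the baseline and assuming ideal scaling up to that point:

```python
# Wall-clock seconds from Table 3 (Cray T3E/1200 rows)
t3e_1200 = {
    "morphine 6-31G** SCF":    {16: 687,  128: 102},
    "cyclosporin 3-21G SCF":   {16: 3165, 128: 532},
    "cyclosporin 6-31G DFT":   {16: 8082, 128: 1219},
}

def speedup(times, p, base=16):
    # speed-up on p processors, assuming the base-processor run scales ideally
    return base * times[base] / times[p]

for name, times in t3e_1200.items():
    print(name, round(speedup(times, 128)))
# -> 108, 95 and 106, matching the values quoted in the text
```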
Substantial modifications were required to enable the MP2 gradient to be computed in parallel [40]. Specifically, the conventional integral transformation step has been omitted; the SCF step is performed in direct fashion, and the MO integrals are generated by recomputation of the AO integrals and stored in the global memory of the parallel machine. This storage and subsequent access is managed by the GA tools. The basic principle by which the subsequent steps are parallelised is that each node computes a contribution to the current term from the MO integrals resident on that node. For some steps, however, more substantial changes to the algorithms are required. In the construction of the Lagrangian (the right-hand side of the coupled Hartree-Fock (CPHF) equations), for example, MO integrals with three virtual orbital indices are required. Given the size of this class of integrals, they are not stored; the required terms of the Lagrangian are instead constructed directly from AO integrals. A second departure from the serial algorithm concerns the MP2 two-particle density matrix. This quantity, which is required in the AO basis, is of a similar size to the two-electron integrals; it is stored on disk in the conventional algorithm, but here is generated as required during the two-electron derivative integral evaluation, from intermediates stored in the GAs.
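The owner-computes principle used here can be illustrated with a short sketch, in which plain list slices stand in for Global Array storage and the per-node term is a trivial placeholder rather than the actual MP2 expressions (all values and names are invented for the example):

```python
NNODES = 4
# illustrative stand-in for the MO integrals held in Global Array storage
mo_integrals = [0.1 * k for k in range(20)]

def local_slice(rank, data, nnodes=NNODES):
    """The block of MO integrals resident on this node (block distribution)."""
    n = len(data)
    lo = rank * n // nnodes
    hi = (rank + 1) * n // nnodes
    return data[lo:hi]

def local_term(block):
    # placeholder for a contribution (e.g. a pair energy) built only from
    # integrals this node owns; no remote data is touched
    return sum(v * v for v in block)

# each node works independently on its resident data; a global add
# (a reduction, in GA/MPI terms) combines the per-node contributions
total = sum(local_term(local_slice(r, mo_integrals)) for r in range(NNODES))
print(total)
```

The point of the pattern is that communication is confined to the final reduction; the bulk of the arithmetic references only locally resident integrals.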
We consider the performance of the MP2-gradient module in an MP2 geometry optimisation of the Mn(CO)_{5}H molecule (with 217 basis functions) on the IBM SP/P2SC-160, Cray T3E/900 and Cray T3E/1200. Figure 3 shows a speed-up of 93 achieved using 128 T3E/1200 processors to perform the complete optimisation, involving 5 energy and 5 gradient calculations. Considering the total time to solution (Table 3), we again see that the IBM SP2 is faster than the Cray T3E/900 for up to 64 processors, although the T3E exhibits significantly higher efficiency than the SP.
Finally, we turn to the parallelisation and performance of the SCF second derivative module within GAMESS-UK. As with the MP2 gradient, the computation includes a number of terms constructed from two-electron integrals in the MO basis, and the algorithm is adapted in a similar way. The conventional transformation is again replaced by a direct transformation, with integrals written to Global Array storage and subsequently used in the coupled Hartree-Fock (CPHF) step and in the construction of the perturbed Fock matrices. These steps are therefore parallelised according to the distribution of the MO integrals. The most costly step in the serial second derivative algorithm is the computation of the second derivative two-electron integrals. This step is trivially parallelised through an approach similar to that adopted in the direct-SCF scheme, using dynamic load balancing based on a shared global counter. We consider the performance of the second derivative SCF module in a calculation of the frequencies of 2,2'-di(trifluoromethyl)biphenyl using a 6-31G basis of 196 functions. Figure 3 shows a speed-up of 86 achieved using 128 T3E/1200 processors to perform the complete frequency calculation. Considering the total time to solution (499 seconds on 128 processors), we find that the construction of the perturbed Fock matrices dominates the computation. It seems almost certain that these matrices would be more efficiently computed in the AO basis, rather than from the MO integrals as in the current implementation, thus enabling more effective use of sparsity when dealing with systems comprising more than 25 atoms.
4.2 Exploring the Potential of Beowulf-Class Systems
While the focus of the preceding discussion has been on the top end, exploiting the potential of MPP-class systems, we now return to the issue of cost-effectiveness, with a brief consideration of advances at the desktop and mid-range, departmental level of resource. Rapid developments in the power and price of commodity microprocessors and networking have provided an opportunity to obtain very high price/performance by harnessing the power of PCs and low-end workstations in modestly sized parallel computers.
Networks of personal computers (so-called Beowulf systems), composed of fast PCs configured with large quantities of RAM and hard disk and running the Linux operating system, are becoming increasingly attractive as cheap and efficient platforms for distributed applications [41]. The main drawback of a standard Beowulf architecture is the poor performance of the conventional inter-process communication mechanisms (RPC and sockets over TCP/IP and Ethernet), which fall short both in throughput and in message latency.
To investigate the potential of this class of system, we have recently installed a modest prototype Beowulf configuration at Daresbury [42]. The current system comprises a total of 10 Pentium II 266 MHz CPUs: a master node (with 128 MB SDRAM and a 9 GB Ultra-Wide SCSI disk) plus 8 compute nodes. Six of these nodes feature a single Pentium II/266 CPU, each with 128 MB SDRAM and a 4 GB Ultra-Wide SCSI disk, while two of the nodes are dual Pentium II/266 machines, each with 256 MB SDRAM and a 4 GB Ultra-Wide SCSI disk. All nodes are connected via a Fast Ethernet switch (3COM SuperStack II 3000, 12 port), with job submission etc. handled by the Lobosq software provided courtesy of the NIH LoBoS project. Software available includes a number of Fortran compilers (Absoft Pro Fortran v5.0, GNU Fortran (EGCS 1.1b), NAG Fortran 95 and Portland Group Fortran v1.75), together with the BLAS libraries from the ASCI Red programme.
The present implementation of the GAMESS-UK code on this Beowulf system is limited to those modules that do not rely on the GA tools; the message latency and bandwidth are not sufficient to support the required levels of performance. Using instead an MPI-based implementation (LAM MPI 6.1) of the direct-SCF and DFT modules, we show in Table 4 the performance of both modules in calculations on the morphine molecule (6-31G** basis with 410 functions) on the Cray T3E/900, Cray T3E/1200 and the Beowulf system.
Calculation   Nodes   Cray T3E/900   Cray T3E/1200   Beowulf System

Direct-SCF      4         3434            2718            4775
                8         1740            1380            2617
               10         1369            1079            2179
               16          868             688               -

DFT B3LYP       4         5128            4236            7700
                8         2581            2129            4187
               10         2036            1678            3448
               16         1300            1082               -

Table 4: Time in wall clock seconds for direct-SCF and DFT calculations on the morphine molecule using GAMESS-UK
Considering the total times to solution, we see that the Cray T3E/1200 is only ca. twice as fast as the Beowulf system for the 10-node calculations. This promising level of performance should be considered in light of (i) the scalability of the Beowulf system being, as expected, much worse than that observed on the T3E, and not sustainable beyond perhaps the current CPU count, and (ii) the treatment of larger molecular systems demanding use of the GA tools, with the parallel matrix algebra certain to dominate the computation times. Nevertheless these results are most encouraging; to be competitive with ten nodes of the T3E on a system costing less than a mid-range workstation, even given the constraints above, points to the clear potential of this class of system. The advent of relatively inexpensive interconnects (e.g., Myrinet [43]), together with software developments that promise a 10-fold reduction in latency (e.g., the GENOA project [44]), suggests that this class of system will prove highly cost-effective for the computational chemist, at least as a departmental resource.
5 Conclusions
We have described work in progress to develop the molecular modelling software applications NWChem and GAMESS-UK for both present and future generations of MPPs. The emphasis throughout the development of NWChem, targeting the Grand Challenge class of application, has been on scalability and the distribution, as opposed to the replication, of key data structures. An important part of this effort has been the careful design and implementation of a system of supporting libraries and modules using ideas from OO methodology. The DFT module provides just one example of the development of efficient parallel algorithms and of the high level of performance that can be achieved through this approach. A performance analysis of calculations on a number of zeolite fragments suggests that our goal of scalability, and hence the ability to treat more complex species in cost-effective and routine fashion, is well in hand. Based on experience gained with the GAMESS-UK code, we have shown that the less demanding treatment of Throughput applications may be accomplished by judicious porting to MPP platforms, with our cost-performance metric realised by adoption of the GA tools.
While the major focus of this paper has been on the top end, exploiting the potential of MPP-class systems, we have also demonstrated the potential of Beowulf-class systems in providing a cost-effective resource for computational chemists. While this potential is at present limited to the mid-range, or departmental, level of resource, recent advances in network hardware and software promise to expand the realm of applicability of such systems.
With the next generation of MPP systems likely to provide an improvement by a factor of three in the key hardware attributes (e.g., CPU performance, communication latency and bandwidth), we are confident that our ongoing software development strategy will allow the type of calculations described above to be performed interactively, a far cry from the present situation on conventional supercomputers, coupled with the ability to handle larger and more complex species. Important problems in biotechnology, pharmaceuticals and materials await just such advances. As stressed at the outset, however, raw computing power alone will not be sufficient to achieve these goals; continuing theoretical and computational innovation will be the real key to realising the full potential of the teraFLOPs computers scheduled for the millennium.
References
1. D.A. Dixon, T.H. Dunning, Jr., M. Dupuis, D. Feller, D. Gracio, R.J. Harrison, D.R. Jones, R.A. Kendall, J.A. Nichols, K. Schuchardt and T. Straatsma, Computational Science in the Environmental Molecular Sciences Laboratory, in: High Performance Computing, R.J. Allan, M.F. Guest, A.D. Simpson, D.S. Henty and D.A. Nicole, eds., Proceedings of the HPCI'98 Conference, UMIST, 12-14/1/98, Plenum Publishing (1998), ISBN 0-306-46034-3.

2. GAMESS-UK is a package of ab initio programs written by M.F. Guest, J.H. van Lenthe, J. Kendrick, K. Schoffel and P. Sherwood, with contributions from R.D. Amos, R.J. Buenker, M. Dupuis, N.C. Handy, I.H. Hillier, P.J. Knowles, V. Bonacic-Koutecky, W. von Niessen, R.J. Harrison, A.P. Rendell, V.R. Saunders and A.J. Stone. The package is derived from the original GAMESS code due to M. Dupuis, D. Spangler and J. Wendoloski, NRCC Software Catalog, Vol. 1, Program No. QG01 (GAMESS), 1980; M.F. Guest, R.J. Harrison, J.H. van Lenthe and L.C.H. van Corler, Theor. Chim. Acta 71:117 (1987); M.F. Guest et al., Computing for Science Ltd., CCLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, UK.

3. Additional information on the Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Battelle Memorial Institute, and the U.S. Department of Energy is available via the Pacific Northwest home page at http://www.pnl.gov:2080/

4. I.T. Foster, J.L. Tilson, A.F. Wagner, R. Shepard, D.E. Bernholdt, R.J. Harrison, R.A. Kendall, R.J. Littlefield and A.T. Wong, Toward high-performance computational chemistry: I. Scalable Fock matrix construction algorithms, J. Comp. Chem. 17:109-123 (1996).

5. R.J. Harrison, M.F. Guest, R.A. Kendall, D.E. Bernholdt, A.T. Wong, M. Stave, J.L. Anchell, A.C. Hess, R.J. Littlefield, G.I. Fann, J. Nieplocha, G.S. Thomas, D. Elwood, J. Tilson, R.L. Shepard, A.F. Wagner, I.T. Foster, E. Lusk and R. Stevens, Toward high-performance computational chemistry: II. A scalable self-consistent field program, J. Comp. Chem. 17:124-132 (1996).

6. A.T. Wong, R.J. Harrison and A.P. Rendell, Parallel direct four-index transformations, Theor. Chim. Acta 93:317-31 (1996).

7. D.E. Bernholdt and R.J. Harrison, Large-scale correlated electronic structure calculations: the RI-MP2 method on parallel computers, Chem. Phys. Lett. 250:477-84 (1996).

8. D.E. Bernholdt and R.J. Harrison, Orbital invariant second-order many-body perturbation theory on parallel computers. An approach for large molecules, J. Chem. Phys. 102:9582-89 (1995).

9. G. Booch, Object-Oriented Analysis and Design, 2nd edition, Benjamin/Cummings Publishing Co., Inc. (1994).

10. B. Stroustrup, The C++ Programming Language, 2nd edition, Addison-Wesley Publishing Company (1991).

11. M.F. Guest, E. Apra, D.E. Bernholdt, H.A. Fruechtl, R.J. Harrison, R.A. Kendall, R.A. Kutteh, X. Long, J.B. Nicholas, J.A. Nichols, H.L. Taylor, A.T. Wong, G.I. Fann, R.J. Littlefield and J. Nieplocha, Future Generation Computer Systems 12:273-289 (1996).

12. R.A. Kendall, R.J. Harrison, R.J. Littlefield and M.F. Guest, in: Reviews in Computational Chemistry, K.B. Lipkowitz and D.B. Boyd, eds., VCH Publishers, Inc., New York (1994).

13. The MPI Forum, MPI: A Message-Passing Interface Standard, University of Tennessee (1994).

14. The MPI Forum, A message passing interface, in: Supercomputing '93, pp. 878-83, IEEE Computer Society Press, Los Alamitos, California, Portland, OR (1993).

15. M.E. Colvin, C.L. Janssen, R.A. Whiteside and C.H. Tong, Theor. Chim. Acta 84:301-14 (1993).

16. T.R. Furlani and H.F. King, J. Comp. Chem. (1995).

17. J. Nieplocha, R.J. Harrison and R.J. Littlefield, Global Arrays: a portable shared memory programming model for distributed memory computers, in: Supercomputing '94, IEEE Computer Society Press, Washington, D.C. (1994).

18. M. Arango, D. Berndt, N. Carriero, D. Gelernter and D. Gilmore, Supercomputer Review 3(10) (1990).

19. M. Schuler, T. Kovar, H. Lischka, R. Shepard and R.J. Harrison, Theor. Chim. Acta 84:489-509 (1993).

20. A.P. Rendell, M.F. Guest and R.A. Kendall, J. Comp. Chem. 14:1429-39 (1993).

21. J. Choi, J.J. Dongarra, L.S. Ostrouchov, A.P. Petitet, R.C. Whaley, J. Demmel, I. Dhillon and K. Stanley, LAPACK Working Note: Installation Guide for ScaLAPACK, Department of Computer Science, University of Tennessee, Knoxville, TN (1995). Available by anonymous ftp from ftp.netlib.org.

22. S.S. Lo, B. Phillipps and A. Sameh, SIAM J. Sci. Stat. Comput. 8(2) (1987).

23. G. Fann and R.J. Littlefield, Parallel inverse iteration with reorthogonalization, in: Sixth SIAM Conference on Parallel Processing for Scientific Computing, SIAM, pp. 409-13 (1993).

24. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J.J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov and D. Sorensen, LAPACK Users' Guide, SIAM (1992).

25. G.B. Bacskay, Chem. Phys. 61:385 (1982).

26. P. Pulay, Chem. Phys. Lett. 100:151-4 (1983).

27. P. Pulay and S. Saebo, Theor. Chim. Acta 69:357-68 (1986); S. Saebo and P. Pulay, Annu. Rev. Phys. Chem. 44:213-236 (1993).

28. O. Vahtras, J. Almlof and M. Feyereisen, Chem. Phys. Lett. 213:514 (1993).

29. M. Feyereisen, G. Fitzgerald and A. Komornicki, Chem. Phys. Lett. 208:359-63 (1993).

30. H.A. Fruechtl and R.A. Kendall, A scalable implementation of the RI-SCF algorithm, Int. J. Quant. Chem. 64:63-69 (1997).

31. P. Hohenberg and W. Kohn, Phys. Rev. 136:B864-71 (1964); W. Kohn and L.J. Sham, Phys. Rev. 140:A1133-38 (1965).

32. E. Wimmer, in: Density Functional Methods in Chemistry, J.K. Labanowski and J.W. Andzelm, eds., pp. 7-31, Springer-Verlag, Berlin (1991).

33. B.I. Dunlap, J.W.D. Connolly and J.R. Sabin, J. Chem. Phys. 71:3396, 4993 (1979).

34. M. Frisch et al., Gaussian, Inc., 4415 Fifth Avenue, Pittsburgh, PA 15213, USA (1992).

35. R.D. Amos, I.L. Alberts, J.S. Andrews, S.M. Colwell, N.C. Handy, D. Jayatilaka, P.J. Knowles, R. Kobayashi, N. Koga, K.E. Laidig, P.E. Maslen, C.W. Murray, J.E. Rice, J. Sanz, E.D. Simandiras, A.J. Stone and M.D. Su, CADPAC, Issue 6, University of Cambridge (1995).

36. M.W. Schmidt et al., QCPE Bulletin 7:115 (1987).

37. MOLPRO is a package of ab initio programs written by H.-J. Werner and P.J. Knowles, with contributions from R.D. Amos, A. Berning, D.L. Cooper, M.J.O. Deegan, A.J. Dobbyn, F. Eckert, C. Hampel, T. Leininger, R. Lindh, A.W. Lloyd, W. Meyer, M.E. Mura, A. Nicklass, P. Palmieri, K. Peterson, R. Pitzer, P. Pulay, G. Rauhut, M. Schütz, H. Stoll, A.J. Stone and T. Thorsteinsson.

38. R. Ahlrichs, M. Bär, M. Häser, H. Horn and C. Kölmel, Electronic structure calculations on workstation computers: the program system TURBOMOLE, Chem. Phys. Lett. 162:165 (1989); R. Ahlrichs and M. von Arnim, TURBOMOLE, parallel implementation of SCF, density functional, and chemical shift modules, in: Methods and Techniques in Computational Chemistry: METECC-95, E. Clementi and G. Corongiu, eds.

39. M. Dupuis, J.D. Watts, H.O. Villar and G.J.B. Hurst, The general atomic and molecular electronic structure system HONDO: version 7.0, Comput. Phys. Commun. 52:415 (1989); M. Dupuis, A. Farazdel, S.P. Karna and S.A. Maluendes, HONDO: a general atomic and molecular electronic structure system, in: MOTECC, E. Clementi, ed., ESCOM, Leiden (1990).

40. G.D. Fletcher, A.P. Rendell and P. Sherwood, A parallel second-order Moller-Plesset gradient, Molec. Phys. 91:431-38 (1997).

41. D. Ridge, D. Becker, P. Merkey and T. Sterling, Beowulf: Harnessing the Power of Parallelism in a Pile-of-PCs, Proceedings, IEEE Aerospace (1997).

42. Additional information on the Daresbury Beowulf system is available via the World-Wide Web URL http://www.dci.clrc.ac.uk/Activity/DISCO+926.

43. Additional information on Myrinet is available via the World-Wide Web URL http://www.myri.com/.

44. G. Chiola and G. Ciaccio, Active Ports: a performance-oriented operating system support to fast LAN communications, in: Proceedings of Euro-Par'98, Southampton, UK, September 1998, LNCS 1470, Springer.