M.F. Guest and P. Sherwood,
Department for Computation and Information,
CCLRC Daresbury Laboratory,
Warrington WA4 4AD, Cheshire, UK

and

J.A. Nichols,
High Performance Computational Chemistry Group,
Environmental Molecular Sciences Laboratory,
Pacific Northwest National Laboratory,
PO Box 999, Mail Stop K1-90, Richland,
WA 99352, USA

[1]  This paper is based on an invited presentation at the meeting "Supercomputing, Collision Processes and Applications", held at The Queen's University of Belfast from the 14th to 16th September 1998 to mark the retirement of Professor P.G. Burke CBE FRS. The Proceedings will be published by Plenum Publishing Corporation.

1. Introduction

Computational chemistry covers a wide spectrum of activities ranging from quantum mechanical calculations of the electronic structure of molecules, to classical mechanical simulations of the dynamical properties of many-atom systems, to the mapping of both structure-activity relationships and reaction synthesis steps. Although chemical theory and insight play important roles in this work, the prediction of physical observables is almost invariably bounded by the available computer capacity.

The potential of massively parallel computers, with hundreds to thousands of processors (MPPs), to significantly outpace conventional supercomputers in both capacity and price-performance has long been recognised. While increases in raw computing power alone will greatly expand the range of problems that can be treated by theoretical chemistry methods, it is now apparent that a significant investment in new algorithms is needed to fully exploit this potential. Merely porting presently available software to these parallel computers does not provide the efficiency required, with many existing parallel applications showing a deterioration in performance as greater numbers of processors are used. New algorithms must be developed that exhibit parallel scalability (i.e., show a near linear increase in performance with the processor count). Such improvements promise substantial scientific and commercial gains by increasing the number and complexity of chemical systems that can be studied.

Organised in five sections, this paper aims to analyse just how far the potential of MPPs has been realised, and what is needed in terms of software development to provide the appropriate applications. Following a consideration of the computational requirements for molecular computations as a function of the level of theory and required accuracy (section 2), we present an analysis of the cost effectiveness of current top-end MPP solutions against both desk-top and mid-range departmental resources. We identify two distinct categories of application, Grand Challenge and Throughput. The former is taken to include those chemical systems of sufficient complexity that neither desk top nor mid-range resources provide a computationally viable solution. Throughput applications are defined to include chemical systems that are amenable to treatment by more modest resources, but where time to solution demands a top-end capability. We define an MPP performance metric applicable to both application categories that emerges from this consideration of price-performance.

Section 3 considers Grand Challenge applications, and the need to address parallel scalability from the outset in developing molecular modelling software. In assessing the algorithmic impact of this requirement, we overview the development of the NWChem package at the Pacific Northwest National Laboratory [1] as an example of the provision of capabilities designed to facilitate ``new science'' in cost-effective fashion. Section 4 turns to the treatment of Throughput applications, and considers the effort required to migrate existing applications to MPP platforms so as to achieve the performance metric of section 2. We consider the parallel development of the GAMESS-UK package [2] at the CCLRC Daresbury Laboratory as a typical example of what is achievable from this more modest approach to exploiting MPP platforms. We present performance figures on both Cray T3E and IBM SP systems for a number of Grand Challenge and Throughput calculations on representative molecular systems, using a variety of methods routinely employed by the computational chemist, including self-consistent field Hartree-Fock (SCF) and Density Functional Theory (DFT), second-order Moller-Plesset perturbation theory (MP2), and SCF second derivatives. While the major focus of this paper lies at the top-end, and in exploiting the potential of MPP-class systems, we also consider in section 4 the potential of Beowulf-class PC-based systems in providing a cost-effective resource for the computational chemist.

2. Methods and Cost-Performance Issues

The complexity of computational chemistry, and the demands that the discipline imposes on any computational resource, are illustrated in Figure 1, where the relative computing power required for molecular computations at four different levels of theory (and accuracy) is shown: configuration interaction (CI), Hartree-Fock (HF), density functional theory, and molecular dynamics. These levels of theory formally scale as N² to N⁶ in their computational requirements. While standard screening techniques, which in essence ignore interactions between atoms far apart, can bring these scalings down significantly, it is clear that the major challenge facing the computational chemist is the development of methods that exhibit superior scaling properties to those shown in Figure 1. We should not, of course, lose sight of this goal in our efforts to maximally exploit current methodology on present and future hardware.

 
Figure 1: The relative computing power required for molecular computations at four levels of theory. In the absence of screening techniques, the formal scaling for configuration interaction, Hartree-Fock, density functional, and molecular dynamics is N⁶, N⁴, N³ and N², respectively.

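To put these formal scalings in concrete terms, the short C fragment below (an illustration of our own, not code from either of the packages discussed in this paper) tabulates the relative cost of doubling or quadrupling the system size N for each level of theory in the absence of screening: a factor of 64 or 4096 for CI (N⁶), 16 or 256 for Hartree-Fock (N⁴), 8 or 64 for DFT (N³), and 4 or 16 for molecular dynamics (N²).

    /* Illustrative only: relative cost growth for the formal scalings of
       Figure 1 in the absence of screening.  Doubling N multiplies an
       N^p method's cost by 2^p.                                          */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const char  *method[]   = { "CI", "HF", "DFT", "MD" };
        const double exponent[] = {  6.0,  4.0,   3.0,  2.0 };

        printf("%-6s %12s %12s\n", "Method", "cost(2N)", "cost(4N)");
        for (int i = 0; i < 4; i++)
            printf("%-6s %12.0f %12.0f\n", method[i],
                   pow(2.0, exponent[i]), pow(4.0, exponent[i]));
        return 0;
    }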

In assessing alternative hardware platforms for computational chemistry, we need to quantify in some fashion the cost effectiveness of a top-end MPP solution, given that such a solution should offer capabilities not possible on either desk-top or mid-range resources, i.e. a solution that will facilitate ``new science'' in cost-effective fashion. Table 1 compares the cost of three HPC solutions: the desk-top, a departmental or mid-range resource (e.g., a 16-processor Origin 2000 from Silicon Graphics, with 4 GByte RAM), and a top-end MPP solution comprising perhaps 500 or 1000 nodes (e.g., an IBM SP P2SC/160 or Cray T3E/1200, with 64-128 GByte RAM). We also consider the associated capabilities, defined in terms of CPU, memory and Input/Output (I/O) rate. In each case we have normalised these attributes relative to those available on the desk-top. We have assumed the most cost-effective desk-top solution to be a 450 MHz Pentium II PC, costed to include 256 MByte RAM and 9 GByte disk. Allowing for a factor of two in this figure, or for local variations in costs, does not change the final conclusion; the cost ratios of Table 1 lend little or no support to the repeated claims that MPP solutions are cost effective. This is clear when we compare the actual increase in capability in moving from the desk-top to the top-end national resource: factors of 500-1000 in CPU capability, 250-500 in available memory, and 20-100 in achievable I/O bandwidth, each far below the cost ratio of 3000 shown in Table 1.

Specification                    Usage            Units of   Hardware Attributes
                                                  Cost       CPU         Memory      I/O

Top-end MPP, 500-1000 nodes      HPC Community    3000       500-1000    250-500     20-100
(Cray T3E, IBM P2SC)

Mid-range SMP, 16-processor      Departmental     60         20          10-20       20-50
(SGI Origin 2000)

Desk-top, 450 MHz Pentium II     Single User      1          1           1           1


Table 1: Relative costs and capabilities of three HPC solutions
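The conclusion drawn above from Table 1 can be made explicit by dividing each hardware attribute by the corresponding cost ratio. The fragment below is a simple check of our own (taking mid-points of the ranges in Table 1); per unit of cost the top-end MPP delivers roughly a quarter of the CPU, an eighth of the memory and a fiftieth of the I/O of the desk-top, while the mid-range SMP fares considerably better.

    /* Illustrative only: capability delivered per unit of cost, taking the
       mid-points of the ranges quoted in Table 1 (desk-top = 1 by definition). */
    #include <stdio.h>

    int main(void)
    {
        struct { const char *name; double cost, cpu, mem, io; } sys[] = {
            { "Top-end MPP",   3000.0, 750.0, 375.0, 60.0 },
            { "Mid-range SMP",   60.0,  20.0,  15.0, 35.0 },
            { "Desk-top",         1.0,   1.0,   1.0,  1.0 },
        };

        printf("%-14s %10s %10s %10s\n", "System", "CPU/cost", "Mem/cost", "I/O/cost");
        for (int i = 0; i < 3; i++)
            printf("%-14s %10.3f %10.3f %10.3f\n", sys[i].name,
                   sys[i].cpu / sys[i].cost,
                   sys[i].mem / sys[i].cost,
                   sys[i].io  / sys[i].cost);
        return 0;
    }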

What is clear from this rather crude analysis is that scalability, with the exploitation of not only CPU but also memory and I/O bandwidth in scalable fashion, is crucial to the twin goals of cost-effectiveness and the solving of Grand Challenge problems. Without that scalability, there is little to justify a top-end resource over, say, a number of SGI Origin 2000s; funding for the latter is often easier to find, and they provide the principal investigator with the satisfaction of local ownership and control. Such mid-range provision would, we suspect, be the preferred route unless the top-end resource really does provide the opportunity for Grand Challenge science not possible at the departmental or desk-top level. So what level of scalability justifies a top-end resource? A pragmatic approach suggests that the latter should provide a number of groups with simultaneous access to at least a two orders of magnitude improvement over the desk-top, and an order of magnitude improvement over the computation available at the departmental level. Such comparisons should allow for the rapid evolution of desk-top and mid-range technology, with some form of technology refresh at the top-end enabling these factors of 100 and 10 to be sustained throughout the lifetime of the machine. Another pointer comes from considering time-to-solution for Throughput applications. Perhaps the longest viable desk-top run would be of the order of one month; a user is unlikely to dedicate his own machine for longer periods. Given a 100-fold performance improvement, such a run could be accomplished overnight.
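The turnaround argument is easily quantified; the fragment below (again our own arithmetic, included purely for illustration) converts a one-month desk-top run into the elapsed times implied by the factors of 100 and 10 proposed above.

    /* Illustrative only: elapsed time of a one-month desk-top job under the
       proposed sustained speedup factors of 10 (departmental) and 100 (top-end). */
    #include <stdio.h>

    int main(void)
    {
        const double month_hours = 30.0 * 24.0;          /* ~720 hours */
        printf("desk-top     : %6.1f hours (about one month)\n", month_hours);
        printf("departmental : %6.1f hours (factor of 10)\n",  month_hours / 10.0);
        printf("top-end MPP  : %6.1f hours (factor of 100, i.e. overnight)\n",
               month_hours / 100.0);
        return 0;
    }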

Thus in our analysis of the parallel developments conducted to date, we will be looking for scalability and performance at the above levels. Furthermore, we require these levels to be available from some fraction of the top-end resource, rather than the whole machine; this would enable several groups to be simultaneously delivering that level of performance. If a given group has to wait a week to access that level, rather than overnight, the value of the top-end resource will rapidly diminish.

3. Grand Challenge Applications; the NWChem Project

The Environmental Molecular Sciences Laboratory (EMSL [3]) is a new research laboratory constructed by the U.S. Department of Energy (DOE) at PNNL in Richland, Washington. EMSL's primary mission is to develop an understanding of the fundamental molecular processes that underpin efficient and cost-effective technologies for the remediation, processing and long-term storage of hazardous wastes at DOE sites.

Computational simulations and modelling to be performed within the EMSL will further a molecular-level understanding of the physical, chemical and biological processes that underlie environmental remediation, waste processing and related health and ecological effects. EMSL researchers aim to model the migration of contaminants below ground, on the earth's surface and in the air. Computations can predict the fate of those contaminants, and simulations can help in the design and execution of any required remedial actions. A 512 processor IBM RS/6000 SP is being used to carry out large-scale computations to aid in the design of new materials and to perform environmental molecular science research that is essential to cleaning up the environment faster and more cost effectively. These calculations will help design compounds to separate radionuclides from other wastes, engineer enzymes to enhance the biodegradation of organic wastes and develop new materials to contain radioactive wastes for short and long term storage.

The High Performance Computational Chemistry (HPCC) group within the EMSL is developing a new generation of molecular modelling software to take full advantage of the parallel computing power of MPPs. In addition to the development of new algorithms that exhibit parallel scalability (i.e., show a near linear increase in performance with the number of processors), work has focused on creating the high-level data and control structures needed to make parallel programs easier to write, maintain, and extend. These efforts have culminated in a package called NWChem that includes a broad range of electronic structure and molecular dynamics functionality. Version 1.0 of NWChem provides the following electronic structure capabilities: HF self-consistent field (SCF) energies, gradients and second derivatives [4, 5], multiconfiguration SCF energies and gradients [6], several forms of many-body perturbation theory (MP2-MP4) energies and gradients [7, 8], density functional theory (DFT) energies and gradients, coupled cluster (CCSD and CCSD(T)) energies, and effective core potential energies, gradients and second derivatives. A wide range of properties are also available: gradients, Hessians, electrical multipoles, NMR shielding tensors, etc. Molecular dynamics calculations can be performed with a variety of empirical force fields for the simulation of macromolecular and solution systems, with functionality including energy minimisation, molecular dynamics simulation and free energy calculation.

In outlining these developments, we focus below on the design philosophy, structure, and tools which make NWChem an effective environment for the development of computational chemistry applications. Although this focus has been on the efficient use of parallel computers in chemistry, almost all of the ideas and tools presented here are generally applicable. We provide, by way of example, an outline of the development and performance of a highly efficient and scalable algorithm for conducting SCF-DFT studies on molecular systems, illustrated through DFT calculations of a number of zeolite fragments.

3.1 Design Philosophy and Software Development Tools

The philosophy and structure of NWChem evolved in response to experience gained with other computational chemistry packages, and to a number of constraints. Initial experiments led to the conclusion that while it was possible to modify existing software to implement parallel algorithms on a ``proof of principle'' basis, the effort involved in turning such software into fully scalable (distributed rather than replicated data) codes was tantamount to rewriting the codes from scratch. At the same time, as in most software development efforts, both the personnel and the time available to the project were limited. Success therefore relied on having a system which was easy to program, and in particular one in which new parallel algorithms could be readily prototyped.

In order to meet the twin requirements of maintainability (the ability to transition from locally developed parallel programming tools to vendor-supported, standards-based implementations) and portability (a programming model which can be used effectively across the variety of MPP platforms), ideas were adopted from the object-oriented (OO) style of software development [9]. Examples of the adoption of the basic principles of this methodology (abstraction, encapsulation, modularity, and hierarchy) have been presented elsewhere [11]. While there are OO programming languages, such as C++ [10], which allow the formal aspects of the methodology to be carried straight into the actual code, the project chose a compromise approach in the development of NWChem: object-oriented ideas were used at the design stage, while the implementation remained a mixture of Fortran and C, languages much more familiar to current computational chemists than C++. Experience suggested that, while it may take extra time, a thorough, careful design phase is probably the most important aspect of the OO methodology, and this effort is quickly recouped in easier implementation and fewer instances of modification of code after implementation. A design based on OO principles can be implemented in a non-OO programming language with only a small amount of discipline on the programmer's part.

A number of low-level tools were selected or developed to facilitate the development of NWChem. Although some were designed specifically to meet the needs of computational chemistry applications, they remain generally applicable. We focus attention below on the tools that target the development of distributed data applications and distributed linear algebra, noting that a discussion of the building blocks behind these tools, notably message passing and memory allocation, has appeared elsewhere [11].

3.1.1 Global Arrays

A vital consideration in using MPPs is how data are stored. So-called replicated-data schemes require that a copy of each data item in the program be stored on each processor, so that the size of the problem that can be handled is limited by the memory of a single processor. In distributed-data applications each processor holds only a part of the total data; in such cases, the problem size is limited only by the total memory of the machine, allowing much larger problems to be treated. Efforts at PNNL have focused on distributed-data applications; relying on a replicated data implementation of the DFT module, for example, would limit the size of calculation to perhaps 2000 basis functions (see section 4), far short of the project's requirements of treating molecular systems with 10,000 basis functions.
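The memory argument is readily made concrete: a single replicated N x N double-precision matrix occupies 8N² bytes on every processor, around 30 MByte at N = 2,000 but around 760 MByte at N = 10,000, well beyond a 256 MByte node, whereas distributed over P processors the per-node requirement falls to roughly 8N²/P bytes. The C fragment below is a minimal sketch of this arithmetic (our own illustration, not code from NWChem).

    /* Illustrative only: per-node memory for one N x N double-precision matrix,
       replicated on every node versus distributed over P nodes.               */
    #include <stdio.h>

    int main(void)
    {
        const double MBYTE = 1024.0 * 1024.0;
        const int    nbf[] = { 2000, 10000 };      /* basis set sizes        */
        const int    P     = 256;                  /* number of processors   */

        for (int i = 0; i < 2; i++) {
            double bytes = 8.0 * nbf[i] * (double)nbf[i];
            printf("N = %5d : replicated %7.1f MByte/node, "
                   "distributed over %d nodes %5.2f MByte/node\n",
                   nbf[i], bytes / MBYTE, P, bytes / (P * MBYTE));
        }
        return 0;
    }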

Emerging standards for parallel programming languages (notably High Performance Fortran, HPF-1) do not yet provide extensive support for multiple instruction multiple data (MIMD) programming [12]. The only truly portable MIMD programming model is message passing, for which a standard interface is now established [13, 14]. It is, however, very difficult to develop applications with fully distributed data structures using the message-passing model [15, 16]. The shared-memory programming model offers increased flexibility and programming ease, but is less portable and provides less control over the inter-processor transfer cost. What is needed is support for one-sided asynchronous access to data structures (here limited to one- and two-dimensional arrays) in the spirit of shared memory. With some effort this can be done portably [17]; in return for this investment, a much easier programming environment is achieved, speeding code development and improving extensibility and maintainability. A significant performance enhancement also results from the increased asynchrony of execution of processes [18]; with a one-sided communication mechanism, where each process can access what it needs without the explicit participation of another process, all processes can operate independently. This approach eliminates unnecessary synchronization and naturally leads to the interleaving of computation and communication. Most programs contain multiple algorithms, some of which may naturally be task-parallel (e.g., Fock matrix construction), and others that may be efficiently and compactly expressed as data-parallel operations (e.g., evaluating the trace of a matrix product); both types of parallelism must be efficiently supported. Consideration of the requirements of the SCF algorithm, the parallel COLUMBUS configuration interaction program [19], second order many-body perturbation theory [8] and parallel coupled-cluster methods [20] led to the design and implementation of the Global Array (GA) toolkit [17], which supports one-sided asynchronous access to globally addressable, distributed one- and two-dimensional arrays.

This toolkit provides an efficient and portable ``shared-memory'' programming interface for distributed-memory computers. Each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, without need for explicit co-operation by other processes. Unlike other shared-memory environments, the GA model exposes the programmer to the non-uniform memory access (NUMA) timing characteristics of parallel computers and acknowledges that access to remote data is slower than access to local data. From the user perspective, a global array can be used as if it were stored in shared memory, except that explicit library calls are required to access it. The information on the actual data distribution can be obtained and exploited whenever data locality is important. Each process is assumed to have fast access to some ``local'' portion of each distributed matrix, and slower access to the remaining ``remote'' portion. Remote data can be accessed through operations like ``get'', ``put'' or ``accumulate'' (floating point sum-reduction) that involve copying the globally accessible data to/from process-private buffer space. The toolkit provides operations for (i) the creation and destruction of distributed arrays, (ii) the synchronization of all processes, (iii) inquiries about arrays and their distribution, and (iv) primitive operations, such as get, put, accumulate, atomic read and increment, gather and scatter, and direct access to the local portion of an array. A number of BLAS-like data-parallel operations have been developed on top of these primitives.
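To give a flavour of this programming model, the fragment below sketches the creation of a distributed matrix and a one-sided accumulate of a locally computed block into it. The calls follow the present-day C interface to the GA toolkit (GA_Initialize, NGA_Create, NGA_Acc and so on); the toolkit of [17] exposes essentially the same operations through Fortran, so the exact names and signatures here should be read as an assumption rather than a definitive record of the interface.

    /* Sketch only: creating a distributed matrix and performing a one-sided,
       atomic accumulate of a local block into it with GA-style calls.  The
       names follow the later C interface to the toolkit of [17]; the original
       toolkit offered equivalent Fortran routines.                            */
    #include <stdio.h>
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        GA_Initialize();
        MA_init(C_DBL, 1000000, 1000000);          /* local scratch for GA      */

        int n = 400;
        int dims[2]  = { n, n };
        int chunk[2] = { -1, -1 };                 /* let GA choose the layout  */
        int g_f = NGA_Create(C_DBL, 2, dims, "fock", chunk);
        GA_Zero(g_f);

        /* each process contributes a small block; no explicit co-operation is
           needed from the processes that actually hold the patch              */
        double block[10 * 10];
        for (int i = 0; i < 100; i++) block[i] = 1.0;
        int lo[2] = { 0, 0 }, hi[2] = { 9, 9 }, ld = 10;
        double alpha = 1.0;
        NGA_Acc(g_f, lo, hi, block, &ld, &alpha);  /* one-sided accumulate      */

        GA_Sync();                                 /* all contributions visible */
        if (GA_Nodeid() == 0)
            printf("accumulate completed on %d processes\n", GA_Nnodes());

        GA_Destroy(g_f);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }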

Additional functionality is provided through a variety of third-party libraries, made available by using the GA primitives to perform the necessary data rearrangement. These include standard and generalized real symmetric eigensolvers (PeIGS, see below) and linear equation solvers (ScaLAPACK) [21]. The O(N²) cost of data rearrangement is observed to be negligible in comparison to that of the O(N³) linear-algebra operations. These libraries may internally use any form of parallelism appropriate to the host computer system, such as co-operative message passing or shared memory.

The GA interface has been designed in the light of emerging standards. In particular, HPF-1 and subsequent revisions will provide the basis for future standard definitions of distributed arrays in Fortran. A long-term goal must be to migrate to full language support, and to eliminate as far as possible the practice of parallel programming through subroutine libraries. The basic functionality described above (create, fetch, store, accumulate etc.) may be expressed as single statements using Fortran90 array notation and the data-distribution directives of HPF. However, HPF currently precludes the use of such operations on shared data in MIMD parallel code. There is reason to believe that future versions of the HPF standard will rectify this problem, as well as provide for irregular distributions, which have been found to lead to significant increases in performance in computational chemistry applications.

3.1.2 Distributed Matrices and Linear Algebra

Many electronic structure computations are formulated in terms of dense matrices of size roughly N by N, where N is the number of basis functions. Two distinct classes of operations are performed on these matrices: random access to small blocks, for constructing matrices as a function of many one- and two-electron integrals, and linear algebra operations on the entire matrix, such as eigensolving, Cholesky decomposition, linear system solution, inversion, and matrix-matrix multiplication. Both types of operations must work on distributed matrices if the resulting application is to be truly scalable.

A particular focus of our distributed linear algebra work has been the development of a scalable, fully parallel eigensolver whose numerical properties satisfy the needs of the chemistry applications. This package, called PeIGS, solves dense real symmetric standard (Ax = λx) and generalized (Ax = λBx) eigenproblems. The numerical method used is multisection for eigenvalues [22] and repeated inverse iteration and orthogonalization for eigenvectors [23]. Accuracy and orthogonality are similar to LAPACK's DSPGV and DSPEV [24]. Unlike other parallel inverse iteration eigensolvers, PeIGS guarantees orthogonality of eigenvectors even for arbitrarily large clusters that span processors. Internally, PeIGS uses a conventional message passing programming model and column-distributed matrices. However, it is more commonly accessed through an interface provided by the GA toolkit. The necessary data reorganization is handled by the interface, and is very fast compared to the O(N³/P) linear algebra times.

PeIGS is both fast and scalable - on a single processor it is competitive with LAPACK, and parallel efficiency remains high even for large processor counts. For example, in one of our SCF applications, the standard eigenproblem Ax = λx was solved for all eigenpairs of a 2053 by 2053 matrix. This computation required only 110 seconds on 150 processors of an Intel Paragon computer, a time-to-solution estimated to be 87 times faster than LAPACK for the same problem on one processor.

3.2 Chemistry Applications on MPPs - The DFT-SCF Module

A wide range of methods have been implemented in NWChem, representing the core functionality required of any general-purpose computational chemistry package. The novel aspects of their implementation are the parallel algorithms employed, along with other ideas for reducing the scaling of the resource requirements of calculations as a function of their size. Complete descriptions of all of the algorithms employed in NWChem are not possible in the space available here, and are being published separately e.g., QCSCF [25], MP2 [8, 26, 27], RISCF [28, 29, 30], RIMP2 [7]. In order to demonstrate how the structure and tools described above are used by a high-level application, we sketch below the operation of the DFT module.

An essential core functionality in any electronic structure program is the direct self consistent field (SCF) module. While the application of direct SCF to large molecules is well suited to MPPs, targeting systems in excess of 1,000 atoms and 10,000 basis functions requires a re-examination of the conventional algorithm and assumed memory capacities. The majority of previous parallel implementations use a replicated-data approach [12] which is limited in scope since the size of these arrays will eventually exhaust the available single processor memory. The parallel direct DFT within NWChem (as in the HF module) distributes these arrays across the aggregate memory using the GA tools. This ensures that the size of systems treated scales with the size of the MPP and is not constrained by single processor memory.

The software developed is an MPP implementation of the Hohenberg-Kohn-Sham formalism [31] of DFT. This method yields results similar to those from correlated ab initio methods, at substantially reduced cost. It works with the charge density, making approximations to the Hamiltonian (the exchange-correlation functional), in contrast with traditional molecular orbital methods, which assume an exact Hamiltonian and choose approximations to the wavefunction [32]. The Gaussian basis DFT method in NWChem breaks the Hamiltonian down into the same basic one- and two-electron components as traditional HF methods, with the latter component further reduced to a Coulomb and an exchange-correlation term. The former can be treated in identical fashion to traditional SCF methods, from a fitted expression similar to that found in RI-SCF [28], or from the commonly used Dunlap fit [33]. DFT is distinguished from other traditional methods, however, by the treatment of the exchange-correlation (XC) term. This term is typically integrated numerically on a grid, or fit to a Gaussian basis and subsequently integrated analytically.
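To make the decomposition explicit, the Kohn-Sham matrix assembled at each iteration may be written, in conventional notation (our own summary, consistent with [31, 32] rather than reproduced from the NWChem documentation), as

    F_{\mu\nu} = h_{\mu\nu} + \sum_{\lambda\sigma} P_{\lambda\sigma}\,(\mu\nu|\lambda\sigma) + V^{xc}_{\mu\nu},
    \qquad
    V^{xc}_{\mu\nu} = \int \chi_\mu(\mathbf{r})\, v_{xc}[\rho](\mathbf{r})\, \chi_\nu(\mathbf{r})\, d\mathbf{r},

where h collects the one-electron terms, the Coulomb sum over the density matrix P corresponds to the FitCD/VCoul work described below, and the numerical grid integration of the exchange-correlation potential v_xc constitutes the VXC step.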

 
Figure 2: Parallel scaling of the NWChem DFT module for a number of zeolite fragments (see text) on a 512-processor IBM SP.


It is instructive to elaborate on the computation and communication requirements of these time-consuming components. The computationally dominant step, the construction of the Fock matrix, is readily parallelised, as the integrals can be computed concurrently. A strip-mined approach is used in which the integral contributions to small blocks of the Fock matrix are computed locally and accumulated asynchronously into the distributed matrix. By choosing blocking over atoms, the fall-off in interaction between distant atoms can be exploited, while simultaneously satisfying local memory constraints. The three time-consuming steps in the construction of the Fock matrix are the fit of the charge density (FitCD), the calculation of the Coulomb potential (VCoul), and the evaluation of the XC potential (VXC). FitCD and VCoul typically consume similar numbers of floating point cycles, primarily in evaluating three-center two-electron integrals at a rate of about 300-400 flops each. Few communications are required beyond a shared counter and a global array accumulate; these components are easily parallelised and display high efficiencies. VXC requires far fewer flops (Gaussian functions are evaluated numerically on a grid, as opposed to the analytical integrations above) but approximately the same communication effort, resulting in a higher communication-to-computation ratio than for the preceding two components.
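The task distribution just described, with dynamic load balancing driven by a shared counter and contributions accumulated asynchronously into the distributed Fock matrix, is sketched below. The GA-style calls (NGA_Read_inc for the atomic counter, NGA_Acc for the accumulate) again follow the later C interface, and the integral work is replaced by a trivial stand-in, so this is an outline of the pattern rather than the NWChem code itself.

    /* Sketch only: dynamically load-balanced Fock build using a shared global
       counter (NGA_Read_inc) and one-sided accumulates (NGA_Acc).  g_counter
       is a one-element integer global array initialised to zero; the integral
       work is replaced by a trivial stand-in.                                  */
    #include "ga.h"

    #define BLK 16                        /* block size in basis functions     */

    void fock_build(int g_fock, int g_counter, int nbf)
    {
        int    nblk    = nbf / BLK;       /* assume BLK divides nbf, for brevity */
        long   ntasks  = (long)nblk * nblk;
        int    zero[1] = { 0 };
        double buf[BLK * BLK], one = 1.0;
        long   task;

        /* claim tasks by atomically incrementing the shared counter           */
        while ((task = NGA_Read_inc(g_counter, zero, 1)) < ntasks) {
            int ib = (int)(task / nblk), jb = (int)(task % nblk);
            int lo[2] = { ib * BLK, jb * BLK };
            int hi[2] = { lo[0] + BLK - 1, lo[1] + BLK - 1 };
            int ld    = BLK;

            /* stand-in for the integral contributions to this block of F      */
            for (int i = 0; i < BLK * BLK; i++) buf[i] = 1.0e-3;

            /* one-sided, atomic accumulate into the distributed Fock matrix   */
            NGA_Acc(g_fock, lo, hi, buf, &ld, &one);
        }
        GA_Sync();                        /* wait for all contributions        */
    }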

We illustrate these effects by considering the performance of the DFT module in calculations on a number of different fragment models of a zeolite cluster, conducted on the 512-processor IBM SP at the EMSL's Molecular Sciences Computing Facility (MSCF). These fragments range in size from Si8O7H18, with 347 basis functions and 832 CD fitting functions, to Si28O67H30, with 1687 basis functions and 3928 fitting functions. Figure 2 shows the speedups achieved using up to 256 processors for each of the fragment DFT calculations, with total times to solution given in Table 2 [1]. The curves show that even for the smaller fragments, e.g. Si8O25H18, a speedup in excess of 100 is observed on 128 processors; for the largest zeolite we find excellent efficiency, with a measured speedup of ca. 190 on 256 processors of the IBM SP. We note in passing that higher efficiencies have been demonstrated on Cray T3E systems (see section 4).



Zeolite Fragment    Basis Set (AO / CD)    Number of Nodes    Total Solution Time (seconds)

Si8O7H18            347 / 832              64                 238
Si8O25H18           617 / 1444             128                364
Si26O37H36          1199 / 2818            256                1137
Si28O67H30          1687 / 3928            256                2766


Table 2: Time in wall clock seconds for NWChem DFT Calculations on a Variety of Zeolite Fragments Conducted on an IBM SP

4. Throughput Applications: The GAMESS-UK Project

GAMESS-UK [2] represents a typical established electronic structure code, comprising some 500K lines of Fortran that permits a wide range of computational methodology to be applied to molecular systems; other such codes include the GAUSSIAN series [34], CADPAC [35], GAMESS-US [36], MOLPRO [37], Turbomole [38], HONDO [39] etc. Present functionality lies at varying levels of approximation, including SCF, CASSCF, MCSCF, GVB, MP2/3/4, CCSD(T), MRDCI, Direct-CI, Full-CI, and analytic second derivatives. The package is available on vector hardware (e.g., Cray C90 and J90, and Fujitsu VPP/300), on workstations from all the leading vendors, and on parallel hardware from Cray/SGI (Cray T3D and T3E, and SGI Origin series) and from IBM (the SP series). There are over 110 academic and industrial user sites.

4.1 Parallel Implementation of the SCF, DFT and MP2 Modules

Initial parallel SCF implementations of GAMESS-UK included those for the Intel iPSC-2 and iPSC/860 hypercubes, with more recent developments undertaken within the Esprit-funded EUROPORT 2 project, IMMP (Interactive Modelling through Parallelism). EUROPORT's principal aim was to provide exemplars of commercial codes that demonstrated the potential of parallel processing to industry; the primary goal was to establish a broad spectrum of standards-based, medium-scale parallel application software rather than to target the highest levels of scalability and performance.

In contrast to NWChem, both SCF and DFT modules are parallelised in a replicated data fashion, with each node maintaining a copy of all data structures present in the serial version. While this structure limits the treatment of molecular systems beyond a certain size, experience suggests that it is possible on machines with 256 MByte nodes to handle systems of up to 2,000 basis functions. The main source of parallelism in the SCF module is the computation of the one- and two-electron integrals and their summation into the Fock matrix, with the more costly two-electron quantities allocated dynamically using a shared global counter. The result of parallelism implemented at this level is a code scalable to a modest number of processors (around 32), at which point the cost of other components of the SCF procedure starts to become relatively significant. The first of these addressed was the diagonalisation, which is now based on the PeIGS module from NWChem.

 
Figure 3: Parallel scaling of the GAMESS-UK direct-SCF, DFT, SCF second derivatives and MP2 gradient modules in calculations on Morphine, Cyclosporin, di(trifluoromethyl)-biphenyl and Mn(CO)5H on the Cray T3E/1200 (see text).


Once the GA capability [17] is added, some distribution of the linear algebra becomes trivial. As an example, the SCF convergence acceleration algorithm (DIIS - direct inversion in the iterative subspace) is distributed using GA storage for all matrices, together with parallel matrix multiply and dot-product functions. This not only reduces the time to perform the step, but the use of distributed memory storage (instead of disk) also reduces the need for I/O during the SCF process. Timings for a number of Direct-SCF and DFT calculations conducted on an IBM SP/P2SC-160, Cray T3E/900 and Cray T3E/1200 are shown in Table 3. The Direct-SCF calculations on morphine used a 6-31G** basis of 410 functions, those on cyclosporin a 3-21G basis of 1000 functions, while the DFT calculations on cyclosporin were conducted in a 6-31G basis, using both the B-LYP (on the IBM SP and Cray T3E/900) and B3LYP functionals (on the Cray T3E/1200). Note that the DFT calculations did not exploit CD fitting, but evaluated the Coulomb matrix explicitly. Figure 3 shows the speedups achieved using up to 128 Cray T3E/1200 processors; speedups of 108, 95, and 106 are obtained on 128 nodes for the morphine 6-31G** SCF, the cyclosporin 3-21G SCF, and the cyclosporin 6-31G DFT calculations, respectively. Considering the total time to solution, we see that the IBM SP2 is typically faster than the Cray T3E/900 for up to 64 processors, although the T3E/900 exhibits significantly higher efficiency than the SP and is some 10% faster with 128 processors. The Cray T3E/1200 is seen to outperform the IBM SP2 at all processor counts, and is 30-40% faster with 128 processors.
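As an indication of how the distributed DIIS step looks in terms of GA operations, the fragment below forms the usual error matrix e = FDS - SDF and its norm entirely from distributed matrices, using the GA parallel matrix multiply and dot-product. The call names follow the later GA C interface, and the routine is a sketch of the kind of operation involved, not a transcription of the GAMESS-UK code.

    /* Sketch only: forming the DIIS error matrix e = FDS - SDF and its squared
       norm from distributed matrices, using GA parallel matrix multiply and
       dot-product (names follow the later GA C interface).                    */
    #include "ga.h"
    #include "macdecls.h"

    double diis_error(int g_f, int g_d, int g_s, int n)
    {
        int dims[2] = { n, n }, chunk[2] = { -1, -1 };
        int g_t  = NGA_Create(C_DBL, 2, dims, "tmp", chunk);    /* scratch      */
        int g_e  = NGA_Create(C_DBL, 2, dims, "fds", chunk);
        int g_e2 = NGA_Create(C_DBL, 2, dims, "sdf", chunk);

        GA_Dgemm('n', 'n', n, n, n, 1.0, g_f, g_d, 0.0, g_t);   /* T  = F D     */
        GA_Dgemm('n', 'n', n, n, n, 1.0, g_t, g_s, 0.0, g_e);   /* E  = F D S   */
        GA_Dgemm('n', 'n', n, n, n, 1.0, g_s, g_d, 0.0, g_t);   /* T  = S D     */
        GA_Dgemm('n', 'n', n, n, n, 1.0, g_t, g_f, 0.0, g_e2);  /* E2 = S D F   */

        double one = 1.0, minus_one = -1.0;
        GA_Add(&one, g_e, &minus_one, g_e2, g_e);               /* E = FDS - SDF */

        double err2 = GA_Ddot(g_e, g_e);                        /* parallel dot  */

        GA_Destroy(g_t); GA_Destroy(g_e); GA_Destroy(g_e2);
        return err2;                      /* squared Frobenius norm of the error */
    }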


Machine             Nodes    Morphine         Cyclosporin      Cyclosporin    Mn(CO)5H
                             Direct-SCF,      Direct-SCF,      DFT,           MP2 gradient,
                             6-31G**          6-31G            6-31G          TZVP

IBM SP2/P2SC-160    16       715              3627             9008           6955
                    32       379              1946             4711           3967
                    64       212              1121             2552           2372
                    128      136              735              1430           1482

Cray T3E/900        16       873              3869             10305          8035
                    32       447              2068             5489           4229
                    64       238              1102             2899           2390
                    128      127              653              1456           1368

Cray T3E/1200       16       687              3165             8082           6713
                    32       356              1678             4305           3533
                    64       181              899              2261           1923
                    128      102              532              1219           1158


Table 3: Time in wall clock seconds for a variety of Direct-SCF, DFT and MP2-gradient Calculations using GAMESS-UK

Substantial modifications were required to enable the MP2 gradient to be computed in parallel [40]. Specifically, the conventional integral transformation step has been omitted; the SCF step is performed in direct fashion and the MO integrals are generated by recomputation of the AO integrals and stored in the global memory of the parallel machine, with this storage and subsequent access managed by the GA tools. The basic principle by which the subsequent steps are parallelised is that each node computes a contribution to the current term from the MO integrals resident on that node. For some steps, however, more substantial changes to the algorithms are required. In the construction of the Lagrangian (the right-hand side of the coupled Hartree-Fock (CPHF) equations), for example, MO integrals with three virtual orbital indices are required. Given the size of this class of integrals, they are not stored; the required terms of the Lagrangian are instead constructed directly from AO integrals. A second departure from the serial algorithm concerns the MP2 two-particle density matrix. This quantity, which is required in the AO basis, is of a similar size to the two-electron integrals and is stored on disk in the conventional algorithm; here it is generated as required during the two-electron derivative integral evaluation from intermediates stored in the GAs.

We consider the performance of the MP2-gradient module in an MP2 geometry optimisation of the Mn(CO)5H molecule (with 217 basis functions) on the IBM SP/P2SC-160, Cray T3E/900 and Cray T3E/1200. Figure 3 shows a speedup of 93 achieved using 128 T3E/1200 processors to perform the complete optimisation, involving 5 energy and 5 gradient calculations. Considering the total time to solution (Table 3), we again see that the IBM SP2 is faster than the Cray T3E/900 for up to 64 processors, although the T3E exhibits significantly higher efficiency than the SP.

Finally, we turn to the parallelisation and performance of the SCF second derivative module within GAMESS-UK. As with the MP2 gradient, the computation includes a number of terms constructed from two-electron integrals in the MO basis, and the algorithm is adapted in a similar way. The conventional transformation is again replaced by a direct transformation, with integrals written to Global Array storage and subsequently used in the coupled Hartree-Fock (CPHF) step and in the construction of the perturbed Fock matrices. These steps are therefore parallelised according to the distribution of the MO integrals. The most costly step in the serial second derivative algorithm is the computation of the second derivative two-electron integrals. This step is trivially parallelised through an approach similar to that adopted in the direct SCF scheme - dynamic load-balancing based on a shared global counter. We consider the performance of the second derivative SCF module in a calculation of the frequencies of 2,2'-di(trifluoromethyl)-biphenyl using a 6-31G basis of 196 functions. Figure 3 shows a speedup of 86 achieved using 128 T3E/1200 processors to perform the complete frequency calculation. Considering the total time to solution (499 seconds on 128 processors), we find that the construction of the perturbed Fock matrices dominates the computation. It seems almost certain that these matrices would be more efficiently computed in the AO basis, rather than from the MO integrals as in the current implementation, thus enabling more effective use of sparsity when dealing with systems comprising more than about 25 atoms.

4.2 Exploring the Potential of Beowulf Class Systems

While the focus of the preceding discussion has been on the top-end, and on exploiting the potential of MPP-class systems, we now return to the issue of cost-effectiveness, with a brief consideration of advances at the desk-top and mid-range, departmental level of resource. Rapid developments in the power and price of commodity microprocessors and networking have provided an opportunity to obtain very high price-performance by harnessing the power of PCs and low-end workstations in modestly sized parallel computers.

Networks of personal computers (so-called Beowulf systems), composed of fast PCs configured with large quantities of RAM and hard disk and running the Linux operating system, are becoming more and more attractive as cheap and efficient platforms for distributed applications [41]. The main drawback of a standard Beowulf architecture is the poor performance of the conventional inter-process communication mechanisms based on RPC, sockets, TCP/IP and Ethernet; these standard mechanisms perform poorly in terms of both throughput and message latency.
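These limitations are easily measured with a standard MPI ping-pong test between two nodes; the sketch below (a conventional benchmark, not part of GAMESS-UK) reports the one-way message time for small messages, which is dominated by latency, and the bandwidth obtained for large ones.

    /* Sketch only: a standard two-process MPI ping-pong, reporting the one-way
       message time and the corresponding bandwidth (run with at least 2 ranks). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, reps = 100;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = malloc(1 << 20);                 /* up to 1 MByte messages */

        for (int nbytes = 1; nbytes <= (1 << 20); nbytes *= 16) {
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
                } else if (rank == 1) {
                    MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                    MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
                }
            }
            double one_way = (MPI_Wtime() - t0) / (2.0 * reps);
            if (rank == 0)
                printf("%8d bytes: %9.1f microseconds, %8.2f MByte/s\n",
                       nbytes, one_way * 1.0e6, nbytes / one_way / 1.0e6);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }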

To investigate the potential of this class of system, we have recently installed a modest, prototype Beowulf configuration at Daresbury [42]. The current system comprises a total of 10 Pentium II 266 MHz CPUs: a master node (with 128 MByte SDRAM and 9 GByte Ultra-Wide SCSI disk) plus 8 compute nodes. Six of these nodes feature a single Pentium II-266 CPU, each with 128 MByte SDRAM and 4 GByte Ultra-Wide SCSI disk, while two of the nodes are dual Pentium II-266 systems, each with 256 MByte SDRAM and 4 GByte Ultra-Wide SCSI disk. All nodes are connected via a Fast Ethernet switch (3COM SuperStack II 3000, 12 port), with job submission etc. handled by the Lobosq software provided courtesy of the NIH LoBoS project. Software available includes a number of Fortran compilers (Absoft Pro Fortran v5.0, GNU Fortran (EGCS 1.1b), NAG Fortran 95, and Portland Group Fortran v1.7-5), together with the BLAS libraries from the ASCI Red programme.

The present implementation of the GAMESS-UK code on this Beowulf system is limited to those modules that do not rely on the GA tools; the message latency and bandwidth are not sufficient to support the required levels of performance. Using instead an MPI-based implementation (LAM MPI 6.1) of the Direct-SCF and DFT modules, we show in Table 4 the performance of both modules in calculations of the morphine molecule (6-31G** basis with 410 functions) on the Cray T3E/900, Cray T3E/1200 and Beowulf system.


Calculation    Nodes    Cray T3E/900    Cray T3E/1200    Beowulf System

Direct-SCF     4        3434            2718             4775
               8        1740            1380             2617
               10       1369            1079             2179
               16       868             688              -

DFT B3LYP      4        5128            4236             7700
               8        2581            2129             4187
               10       2036            1678             3448
               16       1300            1082             -


Table 4: Time in Wall Clock Seconds for Direct-SCF and DFT Calculations on the Morphine molecule using GAMESS-UK

Considering the total times to solution, we see that the Cray T3E/1200 is only ca. twice as fast as the Beowulf system for the 10-node calculations. This promising level of performance should be considered in the light of (i) the scalability of the Beowulf system being, as expected, much worse than that observed on the T3E, and probably not sustainable beyond the current CPU count, and (ii) the treatment of larger molecular systems demanding use of the GA tools, with the parallel matrix algebra then certain to dominate the computation times. Nevertheless these results are most encouraging; being competitive with ten nodes of the T3E on a system costing less than a mid-range workstation, even given the constraints above, points to the clear potential of this class of system. The advent of relatively inexpensive interconnects (e.g. Myrinet [43]), together with software developments that promise a 10-fold reduction in latency (e.g. the GENOA project [44]), suggests that this class of system will prove highly cost effective for the computational chemist, at least as a departmental resource.

5 Conclusions

We have described work in progress to develop the molecular modelling software applications, NWChem and GAMESS-UK, for both present and future generations of MPPs. The emphasis throughout the development of NWChem, targeting the Grand Challenge class of application, has been on scalability and the distribution, as opposed to the replication, of key data structures. An important part of this effort has been the careful design and implementation of a system of supporting libraries and modules using ideas from OO methodology. The DFT module provides just one example of the development of efficient parallel algorithms and the high level of performance which can be achieved through this approach. A performance analysis of calculations on a number of zeolite fragments suggests that our goal of scalability, and hence the ability to treat more complex species in cost-effective and routine fashion, is well in hand. Based on experience gained with the GAMESS-UK code, we have shown that the less demanding treatment of Throughput applications may be accomplished by judicious porting to MPP platforms, with our cost-performance metric realised by adoption of the GA tools.

While the major focus of this paper has been on the top-end, and exploiting the potential of MPP-class systems, we have also demonstrated the potential of Beowulf class systems in providing a cost-effective resource for computational chemists. While this potential is at present limited to the mid-range, or departmental level of resource, recent advances in network hardware and software promise to expand the realm of applicability of such systems.

With the next generation of MPP systems likely to provide an improvement by a factor of three in the key hardware attributes, e.g. CPU, communication latencies and bandwidth, we are confident that our on-going software development strategy will lead both to the type of calculations described above being performed interactively, a far cry from the present situation on conventional supercomputers, and to the ability to handle larger and more complex species. Important problems in biotechnology, pharmaceuticals and materials await just such advances. As stressed at the outset, however, raw computing power alone will not be sufficient to achieve these goals; continuing theoretical and computational innovation will be the real key to realising the full potential of the teraFLOPs computers scheduled for the millennium.

References

1
D.A. Dixon, T.H. Dunning, Jr., M. Dupuis, D. Feller, D. Gracio, R.J. Harrison, D.R. Jones, R.A. Kendall, J.A. Nichols, K. Schuchardt and T. Straatsma, Computational Science in the Environmental Sciences Laboratory, in High Performance Computing, eds: R.J. Allan, M.F. Guest, A.D. Simpson, D.S. Henty and D.A. Nichole, Proceedings of the HPCI'98 Conference, UMIST 12-14/1/98, Plenum Publishing (1998) ISBN 0-306-46034-3.

2
GAMESS-UK is a package of ab initio programs written by M.F. Guest, J.H. van Lenthe, J. Kendrick, K. Schoffel and P. Sherwood, with contributions from R.D. Amos, R.J. Buenker, M. Dupuis, N.C. Handy, I.H. Hillier, P.J. Knowles, V. Bonacic-Koutecky, W. von Niessen, R.J. Harrison, A.P. Rendell, V.R. Saunders, and A.J. Stone. The package is derived from the original GAMESS code due to M. Dupuis, D. Spangler and J. Wendoloski, NRCC Software Catalog, Vol. 1, Program No. QG01 (GAMESS), 1980;
M.F. Guest, R.J. Harrison, J.H. van Lenthe and L.C.H. van Corler, Theo. Chim. Acta 71:117 (1987);
M.F. Guest et al., Computing for Science Ltd., CCLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, UK.

3
Additional information on the Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Battelle Memorial Institute, and the U.S. Department of Energy is available via the Pacific Northwest Home Page at http://www.pnl.gov:2080/

4
I.T. Foster, J.L. Tilson, A.F. Wagner, R. Shepard, D.E. Bernholdt, R.J. Harrison, R.A. Kendall, R.J. Littlefield and A.T. Wong, Toward high-performance computational chemistry: I. Scalable Fock matrix construction algorithms, J. Comp. Chem. 17:109-123 (1996).

5
R.J. Harrison, M.F. Guest, R.A. Kendall, D.E. Bernholdt, A.T. Wong, M. Stave, J.L. Anchell, A.C. Hess, R.J. Littlefield, G.I. Fann, J. Nieplocha, G.S. Thomas, D. Elwood, J. Tilson, R.L. Shepard, A.F. Wagner, I.T. Foster, E. Lusk and R. Stevens, Toward high-performance computational chemistry: II. A scalable self-consistent field program, J. Comp. Chem. 17:124-132 (1996).

6
A.T. Wong, R.J. Harrison and A.P. Rendell, Parallel Direct Four Index Transformations, Theor. Chim. Acta 93:317-31 (1996).

7
D.E. Bernholdt and R.J. Harrison, Large-Scale Correlated Electronic Structure Calculations: the RI-MP2 Method on Parallel Computers, Chem. Phys. Lett. 250:477-84 (1996).

8
D.E. Bernholdt and R.J. Harrison, Orbital invariant second-order many-body perturbation theory on parallel computers. An approach for large molecules, J. Chem. Phys. 102:9582-89 (1995).

9
G. Booch, Object-Oriented Analysis and Design, 2nd edition, Benjamin/Cummings Publishing Co., Inc. (1994).

10
B. Stroustrup, The C++ Programming Language, 2nd edition, Addison-Wesley Publishing Company (1991).

11
M.F. Guest, E. Apra, D.E. Bernholdt, H.A. Fruechtl, R.J. Harrison, R.A. Kendall, R.A. Kutteh, X. Long, J.B. Nicholas, J.A. Nichols, H.L. Taylor, A.T. Wong, G.I. Fann, R.J. Littlefield and J. Nieplocha, Future Generation Computer Systems 12:273-289 (1996).

12
R.A. Kendall, R.J. Harrison, R.J. Littlefield and M.F. Guest, in: Reviews in Computational Chemistry, K.B. Lipkowitz and D.B. Boyd eds., VCH Publishers, Inc., New York (1994).

13
The MPI Forum, MPI: A Message-Passing Interface Standard, University of Tennessee (1994).

14
The MPI Forum, A message passing interface, in: Supercomputing '93, pp878-83, IEEE Computer Society Press, Los Alamitos, California, Portland, OR (1993).

15
M.E. Colvin, C.L. Janssen, R.A. Whiteside and C.H. Tong, Theoretica Chimica Acta 84:301-14 (1993).

16
T.R. Furlani and H.F. King, J. Comp. Chem. (1995).

17
J. Nieplocha, R.J. Harrison and R.J. Littlefield, Global arrays; A portable shared memory programming model for distributed memory computers, in: Supercomputing '94, IEEE Computer Society Press, Washington, D.C. (1994).

18
M. Arango, D. Berndt, N. Carriero, D. Gelernter and D. Gilmore, Supercomputer Review 3(10) (1990).

19
M. Schuler, T. Kovar, H. Lischka, R. Shepard and R.J. Harrison, Theoretica Chimica Acta 84:489-509 (1993).

20
A.P. Rendell, M.F. Guest and R.A. Kendall, J. Comp. Chem. 14:1429-39 (1993).

21
J. Choi, J.J. Dongarra, L.S. Ostrouchov, A.P. Petitet, R.C. Whaley, J. Demmel, I. Dhillon and K. Stanley, LAPACK Working Note: Installation Guide for ScaLAPACK, Department of Computer Science, University of Tennessee, Knoxville, TN (1995). Available by anonymous ftp from ftp.netlib.org.

22
S.S. Lo, B. Phillipps and A. Sameh, SIAM J. Sci. Stat. Comput. 8(2) (1987).

23
G. Fann and R.J. Littlefield, Parallel inverse iteration with reorthogonalization, in: Sixth SIAM Conference on Parallel Processing for Scientific Computing (SIAM), pp409-13 (1993).

24
E. Anderson, Z. Bai, C. Bischof, J. Demmel, J.J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov and D. Sorensen, LAPACK User's Guide, SIAM (1992).

25
G.B. Bacskay, Chem. Phys. 61:385 (1982).

26
P. Pulay, Chem. Phys. Lett. 100:151-4 (1983).

27
P. Pulay and S. Saebo, Theor. Chim. Acta 69:357-68 (1986);
S. Saebo and P. Pulay, Annu. Rev. Phys. Chem. 44:213-236 (1993).

28
O. Vahtras, J. Almlof and M. Feyereisen, Chem. Phys. Lett. 213:514 (1993).

29
M. Feyereisen, G. Fitzgerald and A. Komornicki, Chem. Phys. Lett. 208:359-63 (1993).

30
H.A. Fruechtl and R.A. Kendall, A Scalable Implementation of the RI-SCF Algorithm, Int. J. Quant. Chem. 64:63-69 (1997).

31
P. Hohenberg and W. Kohn, Phys. Rev. B. 136:864-71 (1964);
W. Kohn and L.J. Sham, Phys. Rev. A. 140:1133-38 (1965).

32
E. Wimmer, Density Functional Methods in Chemistry, J.K. Labanowski and J.W. Andzelm, eds. pp7-31 Springer-Verlag, Berlin (1991).

33
B.I. Dunlap, J.W.D. Connolly and J.R. Sabin, J. Chem. Phys. 71:3396, 4993 (1979).

34
M. Frisch et al., Gaussian, Inc., 4415 Fifth Avenue, Pittsburgh, PA 15213, USA (1992).

35
R.D. Amos, I.L. Alberts, J.S. Andrews, S.M. Colwell, N.C. Handy, D. Jayatilaka, P.J. Knowles, R. Kobayashi, N. Koga, K.E. Laidig, P.E. Maslen, C.W. Murray, J.E. Rice, J. Sanz, E.D. Simandiras, A.J. Stone and M.-D. Su, CADPAC, Issue 6, University of Cambridge, (1995).

36
M.W. Schmidt, et al., QCPE Bulletin 7:115 (1987).

37
MOLPRO is a package of ab initio programs written by H.-J. Werner and P.J. Knowles, with contributions from R.D. Amos, A. Berning, D.L. Cooper, M.J.O. Deegan, A.J. Dobbyn, F. Eckert, C. Hampel, T. Leininger, R. Lindh, A.W. Lloyd, W. Meyer, M.E. Mura, A. Nicklass, P. Palmieri, K. Peterson, R. Pitzer, P. Pulay, G. Rauhut, M. Schütz, H. Stoll, A.J. Stone and T. Thorsteinsson.

38
R. Ahlrichs, M. Bär, M. Häser, H. Horn and C. Kölmel, Electronic structure calculations on workstation computers: The program system TURBOMOLE, Chem. Phys. Letters 162:165 (1989);
R. Ahlrichs and M.v. Arnim, TURBOMOLE, parallel implementation of SCF, density functional, and chemical shift modules, in: Methods and Techniques in Computational Chemistry: METECC-95, E. Clementi and G. Corongiu, eds.

39
M. Dupuis, J.D. Watts, H.O. Villar and G.J.B. Hurst, The general atomic and molecular electronic structure system HONDO: version 7.0., Comput. Phys. Commun. 52:415 (1989);
M. Dupuis, A. Farazdel, S.P. Karma and S.A. Maluendes, HONDO: A general atomic and molecular electronic structure system, in: MOTECC, E. Clementi ed., ESCOM, Leiden (1990).

40
G.D. Fletcher, A.P. Rendell and P. Sherwood, A parallel second-order Moller-Plesset gradient, Molec. Phys. 91:431-38 (1997).

41
D. Ridge, D. Becker, P. Merkey and T. Sterling, Beowulf: Harnessing the Power of Parallelism in a Pile-of-PCs, Proceedings, IEEE Aerospace (1997).

42
Additional information on the Daresbury Beowulf System is available via the World Wide Web URL http://www.dci.clrc.ac.uk/Activity/DISCO+926.

43
Additional information on Myrinet is available via the World Wide Web URL http://www.myri.com/.

44
G. Chiola and G. Ciaccio, Active Ports: A Performance-oriented Operating System Support to Fast LAN Communications, in proceedings of Euro-Par'98, Southampton, UK, September 1-4, 1998. LNCS 1470, Springer.