We present a Graphics Processing Unit (GPU)-accelerated version of the real-space SPARC electronic structure code for performing Kohn–Sham density functional theory calculations within the local density and generalized gradient approximations. In particular, we develop a modular math-kernel based implementation for NVIDIA architectures wherein the computationally expensive operations are carried out on the GPUs, with the remainder of the workload retained on the central processing units (CPUs). Using representative bulk and slab examples, we show that relative to CPU-only execution, GPUs enable speedups of up to 6× and 60× in node and core hours, respectively, bringing time to solution down to less than 30 s for a metallic system with over 14 000 electrons and enabling significant reductions in computational resources required for a given wall time.
I. INTRODUCTION
Over the past few decades, Kohn–Sham density functional theory (DFT)1,2 has established itself as a cornerstone of materials and chemical sciences research. In particular, due to its high accuracy-to-cost ratio relative to other ab initio methods, it has seen widespread use for understanding as well as predicting material properties and chemical phenomena from the first principles of quantum mechanics.3,4 Despite significant advances in numerical/computational algorithms as well as high-performance computing architectures, bringing down the time to solution of the Kohn–Sham problem remains a challenging task. In particular, the computational cost and memory requirements scale cubically and quadratically with system size, respectively, restricting the range and types of systems that can be investigated, particularly in ab initio molecular dynamics (AIMD) simulations, where reaching time scales of interest may require the solution of the Kohn–Sham equations tens or hundreds of thousands of times.3
The planewave pseudopotential method,5 which employs the complete, orthogonal, Laplacian-diagonalizing, periodic, and atom-position independent Fourier basis for discretization, is among the most widely used techniques for the solution of the Kohn–Sham equations.6–13 In particular, the planewave method is accurate, relies on a single parameter for convergence with basis, and is highly efficient on small to moderate computational resources through the use of efficient preconditioning schemes and well-optimized Fast Fourier Transforms (FFTs). However, the planewave method is restricted to periodic boundary conditions, wherein artificial periodicity has to be introduced through large vacuum regions for systems that are finite in one or more directions. Moreover, the global nature of the Fourier basis makes the development of linear-scaling methods14–16 difficult and limits the parallel scalability of the planewave method on large-scale computational resources, severely restricting the system sizes and time scales accessible to a rigorous first-principles Kohn–Sham DFT investigation.
Motivated by the limitations of the planewave method, a number of alternate solution strategies based on systematically improvable, localized representations have been developed,17–37 among which the real-space finite-difference method38,39 is perhaps the most mature and widely used to date. In this method, computational locality is maximized by discretizing all spatial quantities on a uniform, atom-position independent real-space grid, wherein convergence is controlled by a single parameter, i.e., the grid spacing. The method naturally accommodates both periodic and Dirichlet boundary conditions, allowing for the accurate and efficient treatment of systems with different dimensionalities, i.e., finite, semi-infinite, and bulk, as well as those with nontraditional symmetries.40,41 Moreover, the localized real-space representation allows for the development of linear-scaling methods; and being free of communication-intensive transforms such as FFTs, the method can efficiently leverage large-scale parallel computational resources.22,33,42–45
SPARC34,46,47 is a recently developed open source electronic structure code that incorporates a number of the developments in real-space DFT made over the past decade, allowing for efficient utilization of modest as well as large-scale computational resources. Its accuracy and performance have been extensively verified and benchmarked against established planewave codes, during which it has been found to be an order of magnitude faster for local, semilocal, and hybrid exchange–correlation functionals, with increasing advantages as the number of processors is increased.45,46 However, it has heretofore been unable to exploit the acceleration provided by Graphics Processing Units (GPUs), which have been shown to provide substantial speedups in the context of electronic structure calculations,29,48–60 providing the motivation for the current work. In particular, we develop a modular math-kernel based GPU-accelerated version of SPARC for local and semilocal Kohn–Sham DFT calculations, wherein the computationally expensive operations are carried out on the GPUs, with the remainder of the workload retained on the central processing units (CPUs). Using representative bulk and slab examples, we show that GPU acceleration provides speedups of up to 6× and 60× in node and core hours, respectively, bringing time to solution down to less than 30 s for a metallic system with over 14 000 electrons and enabling significant reductions in computational resources required for a given wall time.
II. GPU ACCELERATION OF LOCAL AND SEMILOCAL DFT CALCULATIONS IN SPARC
The electronic ground state in SPARC34,46,47 is determined using the self-consistent field (SCF) method,5 which represents a fixed-point iteration with respect to either the density or potential. In each SCF iteration, a Schrödinger-type linear eigenproblem is solved for the eigenvectors/orbitals and the Poisson equation is solved for the electrostatic potential. Given the large prefactor and scaling with system size, the overall computational cost of Kohn–Sham DFT calculations is primarily determined by the solution of the eigenproblem, especially when the exchange–correlation functional is approximated using either the local density approximation (LDA) or generalized gradient approximation (GGA),5 as is the focus of the current work.
SPARC employs the Chebyshev-filtered subspace iteration (CheFSI)61,62 to perform partial diagonalization of the Hamiltonian during each SCF iteration, as summarized in Algorithm 1. The CheFSI algorithm consists of two main steps: Chebyshev filtering and Rayleigh–Ritz. In Chebyshev filtering, the rapid growth of Chebyshev polynomials outside the interval [−1, 1] is used to filter out the unwanted part of the Hamiltonian’s spectrum, i.e., the unoccupied subspace. In Rayleigh–Ritz—which consists of projection of the Hamiltonian onto the filtered subspace, diagonalization of the resulting subspace Hamiltonian, and rotation of the filtered basis—approximations to the eigenvectors and eigenvalues of the Hamiltonian are then calculated. As the SCF iteration proceeds toward self-consistency, these eigenvectors converge to the Kohn–Sham orbitals.
Algorithm 1. Chebyshev-filtered subspace iteration (CheFSI) for partial diagonalization of the Hamiltonian.
Input:
H: Hamiltonian for a given spin, Bloch wavevector, and SCF iteration, a matrix of dimension Nd × Nd applied as an operator
X: guess for the eigenvectors/orbitals, a matrix of dimension Nd × Ns
Nd: number of finite-difference nodes, Ns: number of orbitals
Steps:
Filtering: X̃ ← pm(H) X, pm being a Chebyshev polynomial of degree m, scaled and shifted to damp the unwanted (unoccupied) part of the spectrum
Projection: Hs ← X̃† H X̃, Ms ← X̃† X̃
Diagonalization: solve the subspace generalized eigenproblem Hs Q = Ms Q Λ for the eigenvectors Q and eigenvalues Λ
Rotation: X ← X̃ Q
In the CPU implementation of SPARC, parallelization is achieved using the Message Passing Interface (MPI) standard. In particular, an eigensolver topology is implemented for CheFSI in which the MPI_COMM_WORLD communicator is split into two spin groups, each spin group is split into multiple Bloch wavevector groups, each wavevector group is split into multiple orbital groups, and finally, each orbital group is embedded with a Cartesian topology.46 In the current GPU-accelerated implementation, we neglect spin and employ only wavevector and orbital parallelization, i.e., no domain decomposition, which translates to each orbital group no longer being embedded with a Cartesian topology. Note that it is relatively straightforward to include spin polarization, given that the eigenproblems for different spins are essentially independent. Also note that in the default SPARC operation, parallelization over all the orbitals occurs first, and only then is domain decomposition activated, i.e., domain decomposition is important in the strong scaling limit, but not in regular operation where a moderate number of processors is used, motivating the current choice.
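As a rough illustration of this parallelization layout, the following minimal C/MPI sketch splits MPI_COMM_WORLD into wavevector groups, distributes the orbitals over the ranks of each group, and binds one GPU per MPI rank (as in the runs of Sec. III); the function and variable names are illustrative, and the actual SPARC routines differ.

#include <mpi.h>
#include <cuda_runtime.h>

/* Illustrative sketch (not the SPARC code): split MPI_COMM_WORLD into
 * n_kpt_groups Bloch-wavevector groups, assign each rank of a group a
 * contiguous block of the Ns orbitals, and bind one GPU per MPI rank. */
void setup_topology(int n_kpt_groups, int Ns,
                    MPI_Comm *kpt_comm, int *band_start, int *band_end)
{
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* wavevector groups: assumes world_size is a multiple of n_kpt_groups */
    int per_group = world_size / n_kpt_groups;
    int color = world_rank / per_group;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, kpt_comm);

    /* distribute the Ns orbitals over the ranks of this wavevector group
     * (no further domain decomposition, as in the GPU-accelerated build) */
    int krank, ksize;
    MPI_Comm_rank(*kpt_comm, &krank);
    MPI_Comm_size(*kpt_comm, &ksize);
    int chunk = (Ns + ksize - 1) / ksize;
    *band_start = krank * chunk;
    *band_end = (krank + 1) * chunk < Ns ? (krank + 1) * chunk : Ns;

    /* one GPU per MPI rank, e.g., 4 ranks and 4 GPUs per node (Sec. III) */
    int n_dev = 1;
    cudaGetDeviceCount(&n_dev);
    cudaSetDevice(world_rank % n_dev);
}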
In this work, we propose a strategy that ensures maximum portability across diverse and ever-evolving GPU architectures and their corresponding programming interfaces, code separation of CPU and GPU CheFSI modules that allows their independent development and optimization, minimum data transfer between host and device, and minimum peak memory requirement on a GPU. In what follows, we describe how the key computational kernels in each of the aforementioned CheFSI steps are accelerated on NVIDIA GPUs using the cuBLAS and cuSOLVER libraries via the CUDA parallel programming platform. The vectors/matrices are transferred from CPU to GPU and GPU to CPU using the cublasSetVector and cublasGetVector routines, respectively. Note that for isolated systems/Γ-point calculations, real-valued computations are performed, whereas for other choices of Brillouin zone integration, complex-valued computations are performed, with all operations carried out in double-precision arithmetic.
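To illustrate how these libraries fit together, the following is a minimal, single-GPU sketch of the Rayleigh–Ritz operations of Secs. II B–II D (projection, diagonalization, and rotation) for the real-valued case, given the filtered basis X̃ and its image HX̃ already resident on the device. It is a sketch under simplifying assumptions rather than the SPARC implementation: in the code, the basis is distributed over the GPUs, the subspace Hamiltonian and overlap matrices are assembled via reductions across the MPI ranks, and the diagonalization is restricted to a single GPU (Sec. III). The function and variable names are illustrative, the cuBLAS/cuSOLVER handles are assumed to have been created elsewhere, and error checking is omitted.

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

/* Illustrative single-GPU Rayleigh-Ritz sketch (real-valued case):
 * d_Xf  : filtered basis, Nd x Ns, column-major on the device
 * d_HXf : H applied to the filtered basis, Nd x Ns, on the device
 * d_Xnew: rotated basis (output), Nd x Ns, on the device
 * h_eig : eigenvalues returned to the host */
void rayleigh_ritz_sketch(cublasHandle_t cb, cusolverDnHandle_t cs,
                          int Nd, int Ns, const double *d_Xf,
                          const double *d_HXf, double *d_Xnew, double *h_eig)
{
    const double one = 1.0, zero = 0.0;
    double *d_Hs, *d_Ms, *d_W, *d_work;
    int *d_info, lwork = 0;
    cudaMalloc((void **)&d_Hs, sizeof(double) * Ns * Ns);
    cudaMalloc((void **)&d_Ms, sizeof(double) * Ns * Ns);
    cudaMalloc((void **)&d_W, sizeof(double) * Ns);
    cudaMalloc((void **)&d_info, sizeof(int));

    /* Projection: Hs = Xf^T (H Xf) and Ms = Xf^T Xf */
    cublasDgemm(cb, CUBLAS_OP_T, CUBLAS_OP_N, Ns, Ns, Nd,
                &one, d_Xf, Nd, d_HXf, Nd, &zero, d_Hs, Ns);
    cublasDgemm(cb, CUBLAS_OP_T, CUBLAS_OP_N, Ns, Ns, Nd,
                &one, d_Xf, Nd, d_Xf, Nd, &zero, d_Ms, Ns);

    /* Diagonalization: generalized eigenproblem Hs Q = Ms Q Lambda;
     * on exit, the eigenvectors Q overwrite d_Hs */
    cusolverDnDsygvd_bufferSize(cs, CUSOLVER_EIG_TYPE_1, CUSOLVER_EIG_MODE_VECTOR,
                                CUBLAS_FILL_MODE_UPPER, Ns, d_Hs, Ns, d_Ms, Ns,
                                d_W, &lwork);
    cudaMalloc((void **)&d_work, sizeof(double) * lwork);
    cusolverDnDsygvd(cs, CUSOLVER_EIG_TYPE_1, CUSOLVER_EIG_MODE_VECTOR,
                     CUBLAS_FILL_MODE_UPPER, Ns, d_Hs, Ns, d_Ms, Ns,
                     d_W, d_work, lwork, d_info);

    /* Rotation: Xnew = Xf Q */
    cublasDgemm(cb, CUBLAS_OP_N, CUBLAS_OP_N, Nd, Ns, Ns,
                &one, d_Xf, Nd, d_Hs, Ns, &zero, d_Xnew, Nd);

    /* Return the eigenvalues to the host, using the transfer routine named above */
    cublasGetVector(Ns, sizeof(double), d_W, 1, h_eig, 1);

    cudaFree(d_Hs); cudaFree(d_Ms); cudaFree(d_W);
    cudaFree(d_work); cudaFree(d_info);
}

The generalized eigenproblem with the overlap matrix Ms appears because the filtered basis is not explicitly orthonormalized (Algorithm 1).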
It is worth noting that some of the routines developed for the implementation of Chebyshev filtering (Sec. II A), i.e., stencil operations and nonlocal projector multiplications, can be used to accelerate the computation of the nonlocal component of the Hellmann–Feynman atomic forces and stresses, where the key computational kernels are the application of the gradient operator on the Kohn–Sham orbitals and application of the nonlocal pseudopotential operator on the resultant quantity.34,47,63 For the stresses, the gradient of the orbitals so computed can be used for the calculation of the electronic kinetic energy component of the stress. Indeed, these nonlocal components of the forces and stresses are explicitly dependent on the orbitals and therefore significantly more expensive than the local components, motivating acceleration through GPU computations.
A. Chebyshev filtering
The application of the finite-difference stencil for the Laplacian on each column of X(k) by GPUk, k ∈ {1, 2, …, p}, proceeds as follows:64 (i) Group threads into 2D threadblocks of size (p, q) to match the data tiling in (x, y) and assign one thread per output element; (ii) allocate shared memory for a (p + n0) × (q + n0) array, n0 being the finite-difference order; (iii) load the corresponding (x, y) tile of the column of X(k) into the shared memory; (iv) compute the 2D stencil in each threadblock by fetching data from the shared memory; and (v) compute the 1D stencil in the z-direction in each threadblock and add to the 2D stencil result. This algorithm ensures minimum read redundancy by collecting the data corresponding to the extended region of the (x, y) tile in the shared memory of a threadblock. In addition, all GPU threads work in parallel, each performing only 3n0 + 1 computations, thus enabling very fast and accurate stencil computations.
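To make the scheme concrete, the following is a minimal CUDA sketch of such a kernel for a single column, written under simplifying assumptions that do not hold in the SPARC kernel itself: zero values outside the domain (Dirichlet boundaries), equal grid spacing in all three directions, a compile-time finite-difference order (N0 = 12 here, purely for illustration), and z-direction neighbors fetched directly from global memory rather than kept in registers. The block dimensions (BX, BY) play the role of the (p, q) tile above, and all names are illustrative.

#define N0 12         /* finite-difference order (illustrative) */
#define R  (N0 / 2)   /* stencil radius */
#define BX 16         /* threadblock tile size in x, i.e., p above */
#define BY 16         /* threadblock tile size in y, i.e., q above */

__constant__ double c[R + 1];  /* FD weights: c[0] center, c[1..R] off-center */

/* y = Laplacian(x) for one orbital stored x-fastest as an nx*ny*nz array */
__global__ void lap_stencil(const double *x, double *y, int nx, int ny, int nz)
{
    __shared__ double tile[BY + N0][BX + N0];

    int i = blockIdx.x * BX + threadIdx.x;           /* global x index */
    int j = blockIdx.y * BY + threadIdx.y;           /* global y index */
    int ti = threadIdx.x + R, tj = threadIdx.y + R;  /* indices into the tile */

    for (int k = 0; k < nz; k++) {                   /* march through z-planes */
        /* (iii) load the (x, y) tile of this z-plane, including the halo */
        for (int jj = threadIdx.y; jj < BY + N0; jj += BY)
            for (int ii = threadIdx.x; ii < BX + N0; ii += BX) {
                int gi = blockIdx.x * BX + ii - R;
                int gj = blockIdx.y * BY + jj - R;
                tile[jj][ii] = (gi >= 0 && gi < nx && gj >= 0 && gj < ny)
                               ? x[(size_t)k * nx * ny + gj * nx + gi] : 0.0;
            }
        __syncthreads();

        if (i < nx && j < ny) {
            /* (iv) 2D stencil in (x, y) from shared memory */
            double v = 2.0 * c[0] * tile[tj][ti];
            for (int r = 1; r <= R; r++)
                v += c[r] * (tile[tj][ti - r] + tile[tj][ti + r]
                           + tile[tj - r][ti] + tile[tj + r][ti]);

            /* (v) 1D stencil in z, fetched from global memory, and accumulate */
            v += c[0] * x[(size_t)k * nx * ny + j * nx + i];
            for (int r = 1; r <= R; r++) {
                if (k - r >= 0) v += c[r] * x[(size_t)(k - r) * nx * ny + j * nx + i];
                if (k + r < nz) v += c[r] * x[(size_t)(k + r) * nx * ny + j * nx + i];
            }
            y[(size_t)k * nx * ny + j * nx + i] = v;
        }
        __syncthreads();  /* before the tile is overwritten for the next plane */
    }
}

The finite-difference weights would be copied into c beforehand, e.g., with cudaMemcpyToSymbol, and the kernel launched with a dim3(BX, BY) block over a ((nx + BX - 1)/BX) × ((ny + BY - 1)/BY) grid.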
The effective potential is multiplied pointwise with each column of X(k) by GPUk, k ∈ {1, 2, …, p}.
The nonlocal projectors for each atom, which are stored as a dense matrix in both the CPU-only and GPU-accelerated implementations, are applied to the appropriate components of X(k) by GPUk, k ∈ {1, 2, …, p}, by performing a dense matrix–matrix multiplication using the cublasZgemm/cublasDgemm routine. Note that the dense matrix associated with each atom has dimensions (number of grid points in the compact support of its largest projector) × (number of projectors), whereby the associated memory, even when considering all the atoms, is negligible compared to the overall requirements.
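As an illustration of this step, the following is a minimal sketch, for the real-valued case, of applying one atom's projectors to the local block of orbitals with two cublasDgemm calls. The names are illustrative rather than the actual SPARC routines, the gather/scatter of the rows of X(k) on the projector support is omitted, and the pseudopotential coefficients are assumed, for simplicity, to have been folded into the projector matrix.

#include <cublas_v2.h>

/* Illustrative sketch: apply one atom's nonlocal projectors on the GPU.
 * d_P    : nJ x np dense projector matrix for the atom (nJ: grid points in the
 *          compact support of its largest projector, np: number of projectors)
 * d_Xsub : rows of the local orbital block restricted to those nJ points (nJ x ns)
 * d_W    : np x ns workspace
 * d_Ysub : accumulated result on the support points (nJ x ns) */
void apply_nonlocal_atom(cublasHandle_t cb, int nJ, int np, int ns,
                         const double *d_P, const double *d_Xsub,
                         double *d_W, double *d_Ysub)
{
    const double one = 1.0, zero = 0.0;

    /* W = P^T Xsub: inner products of the projectors with the orbitals */
    cublasDgemm(cb, CUBLAS_OP_T, CUBLAS_OP_N, np, ns, nJ,
                &one, d_P, nJ, d_Xsub, nJ, &zero, d_W, np);

    /* Ysub += P W: expand back onto the support grid points */
    cublasDgemm(cb, CUBLAS_OP_N, CUBLAS_OP_N, nJ, ns, np,
                &one, d_P, nJ, d_W, np, &one, d_Ysub, nJ);
}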
Once the filtering is complete, the filtered basis X̃ is transferred from the GPUs to the CPUs. Note that since HX̃ is needed as part of the projection step, it is also calculated as described above and transferred from the GPUs to the CPUs.
B. Projection
C. Subspace diagonalization
D. Rotation
III. RESULTS AND DISCUSSION
We now study the performance of the GPU-accelerated SPARC implementation through representative examples, namely, bulk molybdenum (Mo) and a 12-layer (001) slab of titanium dioxide (TiO2).65 Specifically, we consider 250-, 686-, and 1024-atom unit cells of Mo, with the LDA2,66 exchange–correlation functional and Γ-point Brillouin zone integration; and 144-, 324-, and 576-atom unit cells of TiO2 with the Perdew–Burke–Ernzerhof (PBE)67 exchange–correlation functional and 4 × 4, 3 × 3, and 2 × 2 Monkhorst–Pack68 grids for Brillouin zone integration, respectively. We perform isokinetic ensemble (NVK) ab initio molecular dynamics (AIMD) with the Gaussian thermostat69 at temperatures of 3000 and 300 K, with electronic and ionic temperatures set equal, and time steps of 1 and 2 fs for the Mo and TiO2 systems, respectively. In particular, we perform a number of AIMD steps before collecting timings, i.e., timings are collected only after the computational time per MD step has stabilized.
In all calculations, we employ optimized norm-conserving Vanderbilt (ONCV) pseudopotentials70 with nonlinear core corrections (NLCCs) from the SPMS set,71 which has 14, 12, and 6 electrons in valence for Mo, Ti, and O, respectively. In addition, we employ the restarted Periodic Pulay mixing scheme,72,73 real-space Kerker preconditioning,74,75 and the Alternating Anderson–Richardson (AAR)76,77 linear solver for the Poisson equation. The Poisson equation is solved entirely on the CPUs since it takes a very small fraction of the total time. Indeed, it can be immediately ported to the GPUs using the Laplacian-vector product routine described in Sec. II A, but this has not been done in order to maximize code simplicity. The numbers of orbitals chosen for the Mo systems, Mo250, Mo686, and Mo1024, are Ns = 2105, 5767, and 8606, respectively; and for the TiO2 systems, (TiO2)48, (TiO2)108, and (TiO2)192, Ns = 696, 1560, and 2769, respectively, as automatically determined by the SPARC code. The grid spacings used for the Mo and TiO2 systems are 0.372 and 0.3 bohr, respectively, which translate to Nd = 80 × 80 × 80, 112 × 112 × 112, and 144 × 144 × 144 finite-difference nodes for the Mo250, Mo686, and Mo1024 systems, respectively; and Nd = 58 × 37 × 225, 88 × 56 × 225, and 117 × 75 × 225 for the (TiO2)48, (TiO2)108, and (TiO2)192 systems, respectively. The Chebyshev filtering polynomial degrees for the Mo and TiO2 systems are 21 and 25, respectively, which are the SPARC defaults for the chosen grid spacings. All numerical parameters, including grid spacing and SCF tolerances, are chosen to provide chemical accuracy (10−3 Ha/atom), as is typical in AIMD simulations and as targeted in recent TiO2 applications in particular.65 All simulations are carried out on the Lassen supercomputer at the Lawrence Livermore National Laboratory (LLNL),78 wherein each computational node has 4 NVIDIA Volta V100 GPUs with 16 GB of memory each and 40 IBM POWER9 CPU cores with a total of 256 GB of memory. We use all 40 CPU cores per computational node in CPU-only runs, with one CPU thread (MPI rank) per CPU core, and 4 CPU cores and 4 GPUs per computational node in GPU-accelerated runs, with one CPU thread (MPI rank) per CPU core—the configuration found to be most efficient. This translates to speedups in terms of core hours that are a factor of 10 larger than those in terms of node hours, which are the focus here.
In Fig. 1, we present the strong scaling results obtained for the chosen Mo and TiO2 systems. In particular, we report the variation in the total wall time per MD step—which includes three SCF iterations and calculation of the Hellmann–Feynman atomic forces—with the number of computational nodes. It is clear that the GPU implementation demonstrates good parallel strong scaling, with a continuous decrease in the time to solution as the number of nodes is increased. The parallel scaling for the TiO2 systems is especially good by virtue of parallelization over wavevectors in the Brillouin zone in addition to parallelization over orbitals. In particular, GPU-accelerated execution provides significant speedup over CPU-only execution—with the same number of SCF iterations in both instances, as generally found for a given SCF tolerance—with maximum speedups of 3.3×, 6.0×, and 6.2× for Mo250, Mo686, and Mo1024, respectively; and 2.7×, 4.4×, and 6.4× for (TiO2)48, (TiO2)108, and (TiO2)192, respectively. Furthermore, the minimum MD step times are well within half a minute: 2.7, 12.9, and 28.3 s for Mo250, Mo686, and Mo1024; and 3.8, 8.9, and 13.0 s for (TiO2)48, (TiO2)108, and (TiO2)192, respectively, demonstrating the attractiveness of GPU-accelerated SPARC for AIMD. The figure also indicates that the rate of decrease in the time to solution reduces as the number of computational nodes is increased, with the speedup having an inverse correlation with the number of computational nodes and a direct correlation with the problem size. This is due to the fact that, within memory constraints, a GPU is able to simultaneously process much larger amounts of data than a CPU, and therefore a reduction in the computational workload per GPU does not translate into a proportionate reduction in time. We have verified this behavior by performing the Mo1024 simulation with a grid spacing of 0.22 bohr, as would correspond to, e.g., a harder pseudopotential or higher accuracy, and found a speedup and timing of 5.6× and 86 s on 64 computational nodes, respectively. The corresponding numbers for the 0.372 bohr mesh (Fig. 1) are 3.9× and 33.3 s, respectively. Indeed, though the number of finite-difference nodes increased by a factor of 4.8, the wall time increased by only a factor of 2.6, even with the Chebyshev polynomial degree increasing from 23 to 33. It is worth noting that, since we have considered three system sizes each for both Mo and TiO2, the weak scaling of the implementation can also be inferred from the curves, consistent with scaling estimates for the chosen system sizes. It is also worth noting that since the largest speedups occur on the smallest computational resources, the reduction in wall time is especially useful in real-world production runs where resources are generally limited.
To get further insight into the performance of the GPU-accelerated SPARC code, we determine the timings for each of the main CheFSI steps: Chebyshev filtering, projection, subspace diagonalization, and rotation, the details of which are available in Sec. II. Note that with the GPU implementation of the nonlocal forces, the time taken in the calculation of the atomic forces is less than 1% of the total time in GPU-accelerated execution, similar to CPU-only execution, which motivates its exclusion from the analysis here. In Figs. 2 and 3, we present the breakdown of the timings for GPU-accelerated and CPU-only executions on the minimum and maximum number of computational nodes used in the strong scaling study for each system (Fig. 1). It is clear that other than subspace diagonalization, the speedups for each of the steps on the smallest number of nodes are significantly larger than on the largest number of nodes, for the reasons discussed above with regard to the processing capability of the GPU. There is no noticeable change in the timing of the subspace diagonalization since it is restricted to a single GPU in GPU-accelerated execution, while the number of CPU threads on which it is performed in CPU-only execution is large enough in all cases that the timing remains relatively unchanged in the strong scaling study. As is to be expected, for both GPU-accelerated and CPU-only executions, the Rayleigh–Ritz steps, i.e., projection, subspace diagonalization, and rotation, become more dominant as the system size increases. The strong scaling efficiency of the different CheFSI steps in GPU-accelerated execution is in the following order: Chebyshev filtering > rotation > projection > subspace diagonalization. The efficiency of Chebyshev filtering is the highest, given that it employs orbital parallelization, whereby all the computations happen independently on the GPUs, without the need for communication between the GPUs or CPUs during the whole step. The relatively large amount of global communication required for forming the subspace Hamiltonian and overlap matrices during the projection step makes its scaling worse than that of the rotation step, which would otherwise be similar. The subspace diagonalization timings remain unchanged, by virtue of being run on the same number of processors for the whole strong scaling study, as discussed above. Note that the ordering of the strong scaling efficiency of the different steps in CPU-only execution mirrors that in GPU-accelerated execution, for reasons similar to those discussed above.
IV. CONCLUDING REMARKS
We have presented a GPU-accelerated implementation of the real-space SPARC electronic structure code for performing Kohn–Sham DFT calculations with LDA/GGA exchange–correlation functionals. In particular, we have developed a modular math-kernel based implementation for NVIDIA architectures in which the computationally intensive operations are carried out on the GPUs, while the remainder of the workload is retained on the CPUs. Through representative bulk and slab examples, we have shown that relative to CPU-only execution, GPUs enable speedups of up to 6× and 60× in node and core hours, respectively, bringing time to solution down to less than 30 s for a metallic system with over 14 000 electrons and enabling significant reductions in computational resources required for a given wall time. Opportunities for further reductions in wall time include (i) performing all-reduce operations in the projection step on GPUs through NVLink; (ii) having the CPUs, currently idle during GPU operation, perform some of the computations; and (iii) using more advanced NVIDIA H100 GPUs, possibly with mixed-precision computations (e.g., Ref. 29) that can take advantage of their tensor cores.
The modular yet general nature of the developed implementation allows for its relatively simple extension to other GPU architectures, e.g., AMD and Intel, which is currently being pursued by the authors. A GPU-accelerated Parallel Computation Engine (libPCE) is also in development, which targets problem sizes that do not fit on a single GPU and reduces the number of CPU-GPU and GPU-CPU transfers. It uses a distinct orbital + domain data distribution and uses CA3DMM79 to provide optimal or near-optimal communication for matrix–matrix products (used in the CheFSI projection and rotation steps). Other worthy subjects of research include extending the implementation to enable GPU acceleration for advanced semilocal/hybrid exchange–correlation functionals, which are significantly more computationally expensive than LDA/GGA; and GPU acceleration of the Spectral Quadrature (SQ) method80,81 in SPARC,44,82 which will enable the study of systems of a million atoms45 and more as larger-scale parallel computing platforms become available.
ACKNOWLEDGMENTS
J.E.P., A.S., and P.S. gratefully acknowledge support from U.S. Department of Energy (DOE), National Nuclear Security Administration (NNSA): Advanced Simulation and Computing (ASC) Program at LLNL, and computational resources provided under the Multiprogrammatic and Institutional Computing programs at LLNL. P.S., L.E., and E.C. gratefully acknowledge support from the U.S. Department of Energy, Office of Science under Grant No. DE-SC0019410. This work was performed in part under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Department of Energy, or the U.S. Government.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Abhiraj Sharma: Conceptualization (equal); Data curation (lead); Investigation (lead); Methodology (equal); Software (equal); Validation (lead); Writing – original draft (lead); Writing – review & editing (supporting). Alfredo Metere: Conceptualization (equal); Data curation (supporting); Investigation (supporting); Methodology (equal); Software (equal); Validation (supporting); Writing – review & editing (supporting). Phanish Suryanarayana: Conceptualization (supporting); Data curation (equal); Funding acquisition (lead); Investigation (equal); Methodology (equal); Project administration (equal); Supervision (supporting); Validation (supporting); Writing – original draft (supporting); Writing – review & editing (lead). Lucas Erlandson: Conceptualization (supporting); Investigation (equal); Methodology (equal); Software (equal); Validation (supporting); Writing – review & editing (supporting). Edmond Chow: Conceptualization (supporting); Funding acquisition (equal); Investigation (supporting); Methodology (supporting); Project administration (supporting); Software (supporting); Supervision (supporting); Writing – review & editing (supporting). John E. Pask: Conceptualization (lead); Data curation (supporting); Funding acquisition (lead); Investigation (equal); Methodology (lead); Project administration (lead); Software (supporting); Supervision (lead); Validation (supporting); Writing – original draft (supporting); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.