Quantum mechanical calculations for material modeling using Kohn–Sham density functional theory (DFT) involve the solution of a nonlinear eigenvalue problem for the N smallest eigenvalue–eigenvector pairs, with N proportional to the number of electrons in the material system. These calculations are computationally demanding and have asymptotic cubic scaling complexity with the number of electrons. Large-scale matrix eigenvalue problems arising from the discretization of the Kohn–Sham DFT equations employing a systematically convergent basis traditionally rely on iterative orthogonal projection methods, which have been shown to be computationally efficient and scalable on massively parallel computing architectures. However, as the size of the material system increases, these methods are known to incur dominant computational costs through the Rayleigh–Ritz projection step of the discretized Kohn–Sham Hamiltonian matrix and the subsequent subspace diagonalization of the projected matrix. This work explores the potential of polynomial expansion approaches based on recursive Fermi-operator expansion as an alternative to the subspace diagonalization of the projected Hamiltonian matrix to reduce the computational cost. Subsequently, we perform a detailed comparison of various recursive polynomial expansion approaches to the traditional approach of explicit diagonalization on both multi-node central processing unit and graphics processing unit architectures and assess their relative performance in terms of accuracy, computational efficiency, scaling behavior, and energy efficiency.

A well-known and challenging application of eigenvalue problems is in the area of quantum modeling of materials using Kohn–Sham density functional theory (DFT),1 which has been immensely successful in providing critical insights into various ground-state material properties. To compute the ground-state electronic structure in DFT, one is confronted with solving a large-scale nonlinear eigenvalue problem using a self-consistent field (SCF) iteration procedure for the N smallest eigenvalue/eigenvector pairs, with N being proportional to the number of electrons in the material system. This results in asymptotic cubic complexity O(N^3) with the number of electrons for DFT, making these calculations computationally demanding and often restrictive in terms of the system sizes that can be handled using widely used DFT codes. Many of these codes employ systematically convergent plane-wave basis sets, restricting the simulation domains to be periodic, or employ atomic-orbital (AO) type basis sets with very few basis functions per atom, but AO basis sets are not complete. Furthermore, the extended nature of plane-wave basis sets restricts their parallel scalability. To extend the range of system sizes that can be studied, numerous past efforts have focused on developing systematically convergent real-space computational methodologies2–6,28,29 that rely on reducing the prefactor associated with the cubic computational complexity alongside improving the parallel scalability, thereby enabling large-scale DFT calculations up to 100 000 electrons. These real-space DFT discretization approaches result in large sparse Hermitian eigenvalue problems to be solved for the N smallest eigenvalue/eigenvector pairs.

DFT codes employing systematically convergent basis sets either in Fourier space or in real space require a large number of basis functions per atom to achieve chemical accuracy compared to approaches employing AO basis sets. This results in large Hamiltonian matrices of dimension M with M ≈ 10^5–10^8, whose N smallest eigenvalue/eigenvector pairs we seek. We note that N is usually 0.1%–0.5% of M, the total number of basis functions [degrees of freedom (DoFs)] used in the simulation. The most popular eigensolver strategies employed to solve these large-scale DFT eigenvalue problems include the Davidson approach,7 the Locally Optimal Block Pre-conditioned Conjugate Gradient (LOBPCG) method,8 or the Chebyshev filtered subspace iteration (ChFSI)9,28 approach. These eigensolvers belong to the category of iterative orthogonal projection (IOP) methods, wherein the discretized Kohn–Sham Hamiltonian matrix H of size M × M is orthogonally projected onto a carefully constructed subspace rich in the wanted eigenvectors (Rayleigh–Ritz step), and subsequently, the resulting smaller dense projected Hamiltonian Hp of dimension k (k ≪ M) is explicitly diagonalized (subspace diagonalization) to approximate the desired eigenvalue/eigenvector pairs of the H matrix. The cubic scaling computational cost of this subspace diagonalization step dominates for medium- to large-scale material systems (N > 20 000) compared to the costs associated with the subspace construction and Rayleigh–Ritz steps in IOP methods. For instance, Das et al.6, employing the ChFSI approach, have reported that the subspace diagonalization constitutes roughly 30% of the total ChFSI cost for N ≈ 30 000, whereas it accounts for around 56% of the total cost for N ≈ 50 000. To this end, the current work explores recursive Fermi-operator expansion approaches on HPC architectures as an alternative to the subspace diagonalization procedure to improve computational efficiency, thereby reducing the computational prefactor associated with the cubic complexity of the DFT problem. Furthermore, the accuracy, computational efficiency, parallel performance, and energy efficiency of these approaches are examined on both multi-node central processing unit (CPU) and graphics processing unit (GPU) architectures and compared with explicit diagonalization of the projected Hamiltonian matrix.

Recursive polynomial expansion (RPE) approaches rely on the key idea that constructing a density matrix (the projector matrix corresponding to the N smallest eigenvectors) suffices to compute the ground-state electronic structure in DFT at zero temperature, without the necessity of computing explicit eigenvalues and eigenvectors. In the past, RPE approaches10–13 have been used for ground-state DFT calculations with an atomic-orbital basis. Most of these methods aim to achieve linear scaling complexity using sparse-matrix algebra employing numerical thresholds. Furthermore, these recursive expansion schemes have also been shown to achieve high throughput performance using dense matrix algebra14 on a single Nvidia M2070 GPU, compared to traditional diagonalization approaches. More recently, the computational structure of RPE approaches has been leveraged in the form of a deep neural network architecture15 to take advantage of the tensor cores in Nvidia A100 GPUs. These techniques have been applied to ab initio molecular dynamics16 and quantum response calculations17 using atomic-orbital basis sets. Furthermore, Google’s tensor processing units (TPUs) have been successfully used for large-scale DFT calculations18 employing density matrix purification approaches, which are similar in spirit to RPE and involve dense matrix–matrix multiplications. However, to the best of our knowledge, the computational efficiency, scaling behavior on distributed systems, and energy efficiency of these RPE approaches have not been explored in comparison to subspace diagonalization procedures for their use in iterative orthogonal projection methods on multi-node CPU and GPU architectures. The evolving computing architectures in today’s exascale era demand scalable methodologies focusing on reduced data movement and increased arithmetic intensity on massively parallel computing architectures, with an equal emphasis on employing energy-efficient algorithms. The current work exploring recursive polynomial expansion approaches as an alternative to subspace diagonalization is a step in this direction, enabling the solution of large-scale eigenvalue problems arising from the discretization of DFT problems with systematically convergent basis sets using IOP methods. To this end, the key contributions of our current work, as described in the subsequent sections, include—(a) efficient implementation strategies for various recursive polynomial expansion (RPE) techniques based on Fermi-operator expansion on both multi-node CPU and GPU architectures for both the zero-temperature case and the finite-temperature case of Fermi–Dirac smearing of the occupancy function; (b) mixed-precision strategies in conjunction with RPE to reduce compute and data access costs; and (c) assessing the accuracy, computational efficiency, parallel scaling behavior, and energy efficiency of the proposed implementation procedures by comparing with the explicit diagonalization algorithms provided by the state-of-the-art ELPA library.19,20

This section discusses key ideas central to recursive polynomial expansion approaches among the various purification schemes30,31,33–41 employed in the literature to approximate the density matrix.

At zero electronic temperature, the density matrix (D) can be defined as the projector matrix corresponding to the lowest occupied (Nocc ≤ N) eigenvectors of the discretized Kohn–Sham Hamiltonian matrix H. Mathematically, it takes the form of a shifted Heaviside step function θ(·), given by D = θ[μI − H]. The density matrix (D) at finite temperature is a smeared version of the zero-temperature density matrix and is mathematically represented by the Fermi-operator matrix function D = [e^{β(H − μI)} + I]^{−1}, where I denotes the identity matrix, β = 1/(kB Te) is the inverse electronic temperature, and μ is the Fermi energy. Note that the eigenvalues fi of D are referred to as occupancies; fi is either 0 or 1 in the zero-temperature case, whereas fi ∈ [0, 1] in the finite-temperature case.
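To make these definitions concrete, the following minimal NumPy sketch builds both density matrices from an explicit eigendecomposition of a small symmetric test Hamiltonian. It is purely illustrative (the RFOE schemes below avoid exactly this diagonalization), and the choice of μ midway between the HOMO and LUMO eigenvalues as well as the value of kB·Te are assumptions of this example.

```python
# Illustrative (serial, dense) construction of zero- and finite-temperature
# density matrices from an eigendecomposition; not the RFOE schemes themselves.
import numpy as np

def density_matrices(H, n_occ, k_B_Te=0.01):
    """Return (D_zero, D_finite) for a real symmetric Hamiltonian H."""
    eigvals, eigvecs = np.linalg.eigh(H)           # ascending eigenvalues
    # Fermi energy placed between HOMO and LUMO (an assumption of this sketch)
    mu = 0.5 * (eigvals[n_occ - 1] + eigvals[n_occ])
    f_zero = (eigvals < mu).astype(float)          # occupancies are 0 or 1
    beta = 1.0 / k_B_Te
    # Fermi-Dirac occupancies, written via tanh to avoid overflow in exp
    f_finite = 0.5 * (1.0 - np.tanh(0.5 * beta * (eigvals - mu)))
    D_zero = eigvecs @ np.diag(f_zero) @ eigvecs.T
    D_finite = eigvecs @ np.diag(f_finite) @ eigvecs.T
    return D_zero, D_finite

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
H = 0.5 * (A + A.T)
D0, DT = density_matrices(H, n_occ=20)
print(np.trace(D0), np.trace(DT))  # both close to Nocc = 20 (finite-T trace only approximate,
                                   # since mu is not re-adjusted here)
```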

Two types of polynomial expansion schemes can be used to approximate the density matrix—(a) serial Fermi-operator expansion schemes (the Chebyshev Fermi-operator expansion scheme21,22 and the Green’s function expansion scheme23,32); and (b) recursive Fermi-operator expansion schemes.10–13 In this work, we employ the recursive Fermi-operator expansion schemes, as they have been shown to reach a very high polynomial order in the approximation with only a few iterations (i.e., a few matrix–matrix multiplications) and can approximate the density matrix for both the zero-temperature and finite-temperature cases.

1. Recursive Fermi-operator expansion for zero-temperature density matrix (Z-RFOE)

The recursive Fermi-operator expansion10 involves successive projections of a matrix Xn that begins with X0 = H and Xn+1 = Fn(Xn). The functions Fn(Xn) are chosen to project the eigenvalue spectrum of Xn to eigenvalues closer either to 1 or to 0. Mathematically, this can be represented as

D = θ[μI − H] = lim_{n→∞} Fn(Fn−1(⋯F0(X0)⋯)).   (1)

One of the most efficient techniques in Z-RFOE is to use the second-order projection (SP2) polynomials11 given by Xn+1 = Fn(Xn) = Xn ± (Xn − Xn^2). Each SP2 polynomial maps the interval [0, 1] monotonically onto itself, pushing eigenvalues toward either 0 or 1. The ± sign is chosen in each projection so as to adjust the trace of Xn+1 such that it converges to Nocc.
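The following is a minimal dense-algebra sketch of this SP2 recursion for a small symmetric test matrix. The Gershgorin-based spectral bounds, the convergence tolerance, and the trace-based sign-selection rule are assumptions of this illustration; the authors' production implementation operates on distributed SLATE matrices rather than NumPy arrays.

```python
# Minimal serial sketch of the SP2 (Z-RFOE) recursion on a dense symmetric matrix.
import numpy as np

def sp2_density(H, n_occ, tol=5e-8, max_iter=100):
    # Gershgorin estimates of the spectral bounds (avoids any diagonalization)
    r = np.sum(np.abs(H), axis=1) - np.abs(np.diag(H))
    eps_min = np.min(np.diag(H) - r)
    eps_max = np.max(np.diag(H) + r)
    # Map the spectrum of H into [0, 1], reversed, so that the lowest (occupied)
    # eigenvalues are pushed toward 1 by the recursion.
    X = (eps_max * np.eye(H.shape[0]) - H) / (eps_max - eps_min)
    tr_prev = np.trace(X)
    for _ in range(max_iter):
        X2 = X @ X
        tr_X, tr_X2 = np.trace(X), np.trace(X2)
        # X_{n+1} = X_n ± (X_n - X_n^2): pick the sign whose trace is closer to Nocc
        if abs(2.0 * tr_X - tr_X2 - n_occ) < abs(tr_X2 - n_occ):
            X = 2.0 * X - X2          # "+" branch: pushes eigenvalues toward 1
        else:
            X = X2                    # "-" branch: pushes eigenvalues toward 0
        tr_new = np.trace(X)
        if abs(tr_new - tr_prev) / max(abs(tr_new), 1.0) < tol:  # RelTr-style criterion
            break
        tr_prev = tr_new
    return X

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 100)); H = 0.5 * (A + A.T)
D = sp2_density(H, n_occ=40)
print(np.trace(D))                 # ≈ 40
print(np.linalg.norm(D @ D - D))   # small: D is nearly idempotent (a projector)
```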

2. Accelerated recursive polynomial expansion for zero-temperature density matrix (A-Z-RFOE)

This technique is based on shifting and scaling the SP2 projection polynomials. In contrast to Z-RFOE, which uses fixed SP2 polynomials of the form F(X) = X^2 or F(X) = 2X − X^2, the A-Z-RFOE39,42 approach offers the flexibility to choose the expansion polynomials based on the HOMO (highest occupied molecular orbital) and LUMO (lowest unoccupied molecular orbital) eigenvalues, so that the eigenvalues are driven toward 1 or 0 at a faster rate.

3. Recursive Fermi-operator expansion scheme for finite-temperature cases (T-RFOE)

A finite-temperature density matrix has occupancies fi ∈ [0, 1], and the SP2 recursion scheme discussed above is not well suited for approximating density matrices with fractional occupancies. To this end, an intermediate function generated in Z-RFOE, obtained before the algorithm converges to the projector matrix, is used instead; it serves as a smeared version of the zero-temperature density matrix (Heaviside step function). The truncated expansion for computing the density matrix D can thus be written as

D ≈ Fm(Fm−1(⋯F0(X0)⋯)).   (2)
The lower the electronic temperature Te, the larger the value of β (refer to Sec. II A), and the more recursion steps m are required to approximate the density matrix.12,13
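As a rough illustration of this truncation idea, the sketch below stops an SP2-style recursion after m steps instead of iterating to idempotency, leaving fractional occupancies that resemble a smeared step function. The depth m used here is arbitrary and is not calibrated to a specific β; in practice the recursion depth and sign sequence are tied to the electronic temperature, as in the T-RFOE scheme of Refs. 12 and 13.

```python
# Illustrative truncated recursion (T-RFOE flavor): stopping after m steps
# yields fractional occupancies, i.e., a smeared zero-temperature density matrix.
import numpy as np

def truncated_sp2(H, n_occ, m):
    r = np.sum(np.abs(H), axis=1) - np.abs(np.diag(H))      # Gershgorin bounds
    eps_min = np.min(np.diag(H) - r); eps_max = np.max(np.diag(H) + r)
    X = (eps_max * np.eye(H.shape[0]) - H) / (eps_max - eps_min)
    for _ in range(m):                                       # truncate after m recursions
        X2 = X @ X
        if abs(2 * np.trace(X) - np.trace(X2) - n_occ) < abs(np.trace(X2) - n_occ):
            X = 2 * X - X2
        else:
            X = X2
    return X

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 100)); H = 0.5 * (A + A.T)
D_smeared = truncated_sp2(H, n_occ=40, m=8)
occ = np.linalg.eigvalsh(D_smeared)                          # occupancies lie in [0, 1]
print(occ.min(), occ.max(), np.sum((occ > 0.01) & (occ < 0.99)))  # some are fractional
```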

The accuracy24 of the aforementioned polynomial expansion procedures can be related to the degree npl of the polynomial needed to approximate the density matrix, and we note that npl ∝ (εN − ε1), with ε1, εN being spectral bound estimates of the discretized DFT Hamiltonian H. Since the spectral width of H arising in DFT calculations employing a systematically convergent basis is usually of O(10^3–10^5), a large value of npl is required to approximate the density matrix. Consequently, RPE approaches applied directly to H would involve a large number of computationally expensive matrix–matrix multiplications of huge sizes on distributed architectures and are therefore never used in practice.

As discussed before, we consider iterative orthogonal projection (IOP) based methods for solving the large-scale eigenvalue problem associated with the matrix H of dimension M × M. In these approaches, the orthogonal projection of the matrix H is carried out onto a carefully constructed subspace of dimension k ≪ M rich in the wanted eigenvectors. We choose to work with this smaller projected Hamiltonian Hp of dimension k × k and employ the recursive polynomial expansion procedures on Hp to approximate the density matrix in the subspace29 as an alternative to explicit subspace diagonalization of Hp. Owing to the reduced dimension of Hp and the spectral width of Hp being commensurate with that of the occupied Kohn–Sham eigenspectrum, the proposed approach is computationally efficient, as demonstrated subsequently.
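The sketch below shows where RFOE sits in such an IOP workflow: project H onto a subspace basis V, build the density matrix of the small projected Hamiltonian Hp with an SP2-type recursion in place of diagonalization, and map it back. The random orthonormal basis used here is only a stand-in for a properly constructed (e.g., Chebyshev-filtered) subspace, and the sketch reuses the sp2_density routine from the earlier Z-RFOE example.

```python
# Sketch of RFOE replacing subspace diagonalization inside an IOP method.
import numpy as np

def subspace_density(H, V, n_occ, build_density):
    Hp = V.T @ H @ V                 # Rayleigh-Ritz projected Hamiltonian (k x k)
    Dp = build_density(Hp, n_occ)    # RFOE replaces explicit diagonalization of Hp
    return V @ Dp @ V.T              # density matrix back in the original M x M space

M, k = 400, 60
rng = np.random.default_rng(3)
A = rng.standard_normal((M, M)); H = 0.5 * (A + A.T)
V, _ = np.linalg.qr(rng.standard_normal((M, k)))     # illustrative orthonormal basis
D = subspace_density(H, V, n_occ=30, build_density=sp2_density)  # from earlier sketch
print(np.trace(D))                                   # ≈ 30
```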

Figure 1 describes the overall algorithmic details of implementing the recursive Fermi-operator expansion (RFOE) approaches discussed in this work (Z-RFOE, A-Z-RFOE, and T-RFOE) employing Hp. As indicated in the figure, the trace change in the trial density matrix (Xn) for a given nth iteration relative to the previous iteration (RelTr) is used as the criterion for RFOE convergence. Furthermore, we also explore mixed-precision strategies in conjunction with the Z-RFOE and T-RFOE schemes implemented in this work. To this end, we rely on the fact that, far away from RFOE convergence to the density matrix, the floating point operations can be performed in single precision (FP32) before switching to double precision (FP64) operations thereafter. The criterion for deciding the number of initial FP32 iterations is linked to RelTr.
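A minimal sketch of how such a precision switch could look is given below, using the RelTr thresholds quoted later in the results section (5 × 10^−4 for FP32, 5 × 10^−8 for FP64). The exact switching logic is an assumption of this illustration and need not match the authors' implementation.

```python
# Mixed-precision SP2 sketch: iterate in FP32 until the relative trace change
# drops below tol_fp32, then continue in FP64 until tol_fp64 is reached.
import numpy as np

def sp2_mixed(H, n_occ, tol_fp32=5e-4, tol_fp64=5e-8, max_iter=100):
    r = np.sum(np.abs(H), axis=1) - np.abs(np.diag(H))      # Gershgorin bounds
    eps_min = np.min(np.diag(H) - r); eps_max = np.max(np.diag(H) + r)
    X = ((eps_max * np.eye(H.shape[0]) - H) / (eps_max - eps_min)).astype(np.float32)
    tol, tr_prev = tol_fp32, np.trace(X)
    for _ in range(max_iter):
        X2 = X @ X
        if abs(2 * np.trace(X) - np.trace(X2) - n_occ) < abs(np.trace(X2) - n_occ):
            X = 2 * X - X2
        else:
            X = X2
        tr_new = np.trace(X)
        rel_tr = abs(tr_new - tr_prev) / max(abs(tr_new), 1.0)
        tr_prev = tr_new
        if rel_tr < tol:
            if X.dtype == np.float32:      # coarse FP32 phase done: switch precision
                X, tol = X.astype(np.float64), tol_fp64
            else:
                break
    return X

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 100)); H = 0.5 * (A + A.T)
print(np.trace(sp2_mixed(H, n_occ=40)))   # ≈ 40
```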

FIG. 1. General implementation details flowchart for all the RFOE codes.

The C++ implementation of the RFOE codes employs the Message Passing Interface (MPI) library to achieve multi-node parallelism. To handle the parallel matrices encountered during the RFOE process, we utilize the SLATE (Software for Linear Algebra Targeting Exascale) library.25 SLATE stores these matrices in a 2-D block-cyclic fashion on both CPUs and GPUs within a node. The tile size, which determines the size of the sub-matrices (blocks) distributed in a 2-D block-cyclic manner across the processor grid, plays a crucial role in the parallel performance of matrix–matrix multiplication. Therefore, we conduct numerical experiments to explore the impact of varying the tile size on the overall performance. These experiments aim to identify the optimal tile size for a given dimension of Hp.
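For readers unfamiliar with the layout, the toy sketch below illustrates how a 2-D block-cyclic distribution assigns tiles of a matrix to a process grid. The process-grid shape and tile size are hypothetical values chosen for illustration; no SLATE API calls are used here.

```python
# Toy illustration of 2-D block-cyclic tile ownership (the layout SLATE uses).
def tile_owner(i_tile, j_tile, p_rows, p_cols):
    """Process-grid coordinates owning tile (i_tile, j_tile)."""
    return (i_tile % p_rows, j_tile % p_cols)

matrix_dim, tile_size = 65536, 512          # e.g., Hp of dimension 65 536
n_tiles = (matrix_dim + tile_size - 1) // tile_size
grid = (8, 6)                               # 48 processes, as on one PRAVEGA node
counts = {}
for i in range(n_tiles):
    for j in range(n_tiles):
        owner = tile_owner(i, j, *grid)
        counts[owner] = counts.get(owner, 0) + 1
print(n_tiles, min(counts.values()), max(counts.values()))  # near-even tile distribution
```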

The computation of the matrix trace, required at the beginning and end of the RFOE iterations, involves traversing the diagonal elements of the global matrix and utilizing MPI collectives. Additionally, we compute the trace of the squares of the symmetric matrices arising during the course of the RFOE iterations by evaluating the square of the Frobenius norm of the given parallel distributed SLATE matrix (Tr(A^2) = ‖A‖_F^2). This is facilitated through a SLATE API call, which grants users access to the Frobenius norm. Among the various steps involved in the RFOE algorithms, matrix–matrix multiplication (MM) is the computationally dominant step. To handle this step efficiently across multiple CPUs and GPUs, we leverage the capabilities of the SLATE library, which enables parallel execution of the underlying MM. Furthermore, we explored the performance of alternative approaches, namely the Communication-optimal Matrix Multiplication (COSMA)26 and cuBLASMg (provided by the CUDA Math Library Early Access Program27) libraries, for computing parallel matrix–matrix multiplications on CPUs and GPUs. Our studies indicate that COSMA is slower than the SLATE library in terms of computational time. Additionally, it is worth noting that the cuBLASMg library is limited to utilizing multiple GPUs within a single node, which constrains its applicability.
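The identity behind this trick is easy to verify on a small serial example: for a real symmetric matrix A, Tr(A·A) equals the squared Frobenius norm of A, so one matrix–matrix product can be avoided when only the trace of the square is needed.

```python
# Quick serial check of the identity used for Tr(A^2) with a symmetric A.
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((200, 200))
A = 0.5 * (B + B.T)                       # symmetric test matrix
print(np.isclose(np.trace(A @ A), np.linalg.norm(A, 'fro') ** 2))   # True
```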

For accuracy benchmarking of the RFOE methods implemented in this work, we compute two error metrics—(a) the relative error (ε1) in the Frobenius norm between the exact density matrix (Dp_ref = f(Hp)) and the approximate density matrix (Dp) computed via RFOE, i.e., ε1 = ‖Dp − Dp_ref‖_F / ‖Dp_ref‖_F, and (b) the relative error (ε2) between the traces of the exact and approximate values of f(Hp)Hp, a measure of the band-energy error, i.e., ε2 = [Tr(Dp Hp) − Tr(Dp_ref Hp)] / Tr(Dp_ref Hp). Dp_ref is computed by explicit diagonalization using the ELPA library.19,20 We note that ‖Dp_ref‖_F is of O(Nocc) and, hence, the electron density computed using RFOE approaches has an absolute error ερ = O(ε1 Nocc) with respect to the true (ground-state) electron density. Since the Kohn–Sham energy functional is variational with respect to the ground-state electron density, the ground-state energy error is of O(ερ^2), and the forces, being derivatives of the ground-state energy with respect to the atomic positions, have an error of O(ερ).
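The following sketch spells out these two metrics for the zero-temperature case, with NumPy's dense eigensolver standing in for ELPA as the reference; the finite-temperature variant would simply use Fermi–Dirac occupancies in place of the 0/1 occupancies when forming the reference density matrix.

```python
# Sketch of the two accuracy metrics eps1 and eps2 (zero-temperature reference).
import numpy as np

def error_metrics(Dp, Hp, n_occ):
    evals, evecs = np.linalg.eigh(Hp)                        # stands in for ELPA
    f = (np.arange(Hp.shape[0]) < n_occ).astype(float)       # lowest n_occ states occupied
    D_ref = evecs @ np.diag(f) @ evecs.T
    eps1 = np.linalg.norm(Dp - D_ref) / np.linalg.norm(D_ref)           # Frobenius norms
    eps2 = (np.trace(Dp @ Hp) - np.trace(D_ref @ Hp)) / np.trace(D_ref @ Hp)  # band energy
    return eps1, eps2
```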

To assess the accuracy and performance of the proposed methods, we employ synthetic matrices representative of the subspace projected Hamiltonians (Hp) arising in pseudopotential DFT calculations. To this end, the matrix Hp is constructed so that its spectral width is small and remains constant with an increase in the matrix size (indicative of an increase in the system size of a given material system). We choose the matrix elements of Hp such that Hp_ij = Hp_ji = e^{−0.5|i−j|} sin(i + 1), and the matrix sizes investigated in this study are 8192 × 8192, 16 384 × 16 384, 32 768 × 32 768, and 65 536 × 65 536. The number of occupied states (Nocc) is chosen to be around 90% of the dimension of the matrix in the case of the zero-temperature density matrix, while in the case of the finite-temperature density matrix, the number of occupied states (including fractionally occupied) is chosen to be 85% of the dimension of the matrix. We compare the performance of the different implementations of RFOE (Z-RFOE, A-Z-RFOE, and T-RFOE) and of explicit diagonalization of Hp using the ELPA19,20 library employing a two-stage diagonalization algorithm (CPU-ELPA2 and GPU-ELPA2) on both multi-node CPUs and GPUs.
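A small-scale version of this synthetic matrix can be generated as below. Two details are assumptions of this sketch: the exponent is taken as negative (exponentially decaying off-diagonal entries, which is what keeps the spectral width bounded), and the formula is applied to the lower triangle and mirrored to enforce symmetry.

```python
# Small-scale construction of the synthetic projected Hamiltonian Hp.
import numpy as np

def synthetic_Hp(k):
    idx = np.arange(k)
    M = np.exp(-0.5 * np.abs(idx[:, None] - idx[None, :])) * np.sin(idx[:, None] + 1.0)
    return np.tril(M) + np.tril(M, -1).T        # symmetrize: Hp_ij = Hp_ji

for k in (256, 512):
    w = np.linalg.eigvalsh(synthetic_Hp(k))
    print(k, w[-1] - w[0])    # spectral width stays of the same order as k grows
```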

The multi-node CPU study used the PARAM PRAVEGA supercomputer, equipped with 584 nodes. Each node has two Intel Xeon Cascade-Lake 8268 2.9 GHz processors with 48 physical cores per node. The nodes are integrated with a BullSequana XH200 Mellanox HDR 100 Gbps Infiniband interconnection using a FAT-Tree topology for MPI. On PRAVEGA, we have compiled our codes using the libraries—GCC/11.2.0, INTEL-MKL/2020, and INTEL-MPI/2019.10.317. The multi-node GPU study was conducted on the SUMMIT (Oak Ridge Leadership Computing Facility) supercomputer equipped with 4608 IBM Power System AC922 nodes with two IBM POWER9 processors (42 physical cores) and six NVIDIA Volta V100 GPUs in each node. Summit nodes are connected to a dual-rail EDR InfiniBand network, providing a node injection bandwidth of 23 GB/s. On SUMMIT, we have compiled our codes using the libraries NVIDIA CUDA/11.0.3, GCC/9.1.0, IBM Spectrum-MPI/10.4.0.3, and OPENBLAS/0.3.15-omp.

Throughout our discussion of the results in this section, it is important to note that a relative trace tolerance RelTr of 5 × 10^−8 is utilized as the convergence criterion for all double-precision (FP64) RFOE implementations described here (see Fig. 1). As a result, the number of RFOE iterations (parallel matrix–matrix multiplications) is observed to be around 50 in the case of Z-RFOE and around 30 in the case of A-Z-RFOE. For the mixed-precision strategy implemented in the case of Z-RFOE and T-RFOE, the value of RelTr is chosen to be around 5 × 10^−4 for the FP32 RFOE iterations, and the subsequent FP64 RFOE iterations are continued until a RelTr of 5 × 10^−8 is reached. In the case of the finite-temperature density matrix approximation using T-RFOE, the smearing value of β corresponding to a temperature of 500 K is chosen, and we observe the number of T-RFOE iterations to be around 30. Furthermore, we note that the tolerance criterion RelTr mentioned above results in errors ε1 and ε2 (refer to Sec. III C) of approximately O(10^−10) and O(10^−9), respectively, for the double-precision implementations of Z-RFOE and A-Z-RFOE. For the mixed-precision implementation employing Z-RFOE, the errors ε1 and ε2 are of O(10^−7) and O(10^−9), respectively. In the case of T-RFOE, the error ε1 is of O(10^−3) and ε2 is of O(10^−6). The higher errors of the finite-temperature density matrix approximated by T-RFOE relative to the exact Fermi–Dirac density matrix are expected, as the T-RFOE method13 can be considered an alternative approach to smearing the zero-temperature density matrix. This viewpoint still renders T-RFOE useful for approximating the finite-temperature density matrix in DFT, since material properties are typically computed from energy differences.

We now list the performance metrics used for our comparative study in this work:

  • CPU Node-hrs ⇒ Execution time (in hours) × the number of nodes (chosen in a good scaling regime). It gives a measure of the computational efficiency on CPUs.

  • GPU-hrs ⇒ Execution time (in hours) × the number of GPUs (chosen in a good scaling regime). It gives a measure of the computational efficiency on GPUs.

  • Minimum wall time ⇒ Least possible time obtained using as many resources (CPUs/GPUs) as possible. It is a measure of the scaling efficiency of the implementation.

  • Energy consumption ⇒ Upper bound of energy consumption in kWh for executing a given algorithm on CPUs/GPUs. Indicative of the dollar cost required for the calculations on the supercomputer. For the energy-consumption estimate, we use the Thermal Design Power (TDP) ratings of both the CPUs and the GPUs, as illustrated in the sketch below.
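The TDP-based bound amounts to multiplying the device count, the per-device TDP, and the wall time. The sketch below illustrates this arithmetic; the TDP values used (205 W per Xeon 8268 socket, 300 W per V100 SXM2 GPU) are nominal vendor ratings, and the device counts and run times are hypothetical examples.

```python
# Sketch of the TDP-based upper bound on energy consumption (kWh).
def energy_kwh(wall_time_hours, n_devices, tdp_watts):
    """Upper-bound energy (kWh) = TDP (kW) x number of devices x hours."""
    return tdp_watts / 1000.0 * n_devices * wall_time_hours

# Example: 64 CPU nodes (2 x 205 W Xeon 8268 sockets per node) running for 0.5 h,
# and 96 V100 GPUs (300 W each) running for 0.25 h.
print(energy_kwh(0.5, 64 * 2, 205.0))   # CPU-side bound in kWh
print(energy_kwh(0.25, 96, 300.0))      # GPU-side bound in kWh
```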

Figures 2 and 3 compare ELPA’s subspace diagonalization procedure with the various RFOE approaches that utilize parallel dense matrix–matrix multiplications based on the SLATE library, in terms of CPU node-hrs, scalability, and minimum wall times on multi-node CPU architectures. On CPUs, Fig. 2(a) demonstrates that the RFOE implementations for approximating the zero-temperature density matrix are computationally more efficient than subspace diagonalization in terms of CPU node-hrs. Among the RFOE implementations, A-Z-RFOE is the closest competitor to ELPA’s diagonalization approach and is faster than ELPA by a factor of approximately 2× for all the Hp matrix dimensions (8192, 16 384, 32 768, and 65 536) considered in this work. This can be attributed to the smaller computational prefactor of the RFOE approach (only dense matrix–matrix multiplications) compared to explicit diagonalization (reduction to a tridiagonal matrix followed by its reduction to diagonal (Schur) form).

Further, Fig. 2(b) illustrates the multi-node CPU scaling behavior of the subspace diagonalization procedure using ELPA and of the A-Z-RFOE approach employing the SLATE library for the Hp matrix of dimension 65 536. These results indicate the superior scaling behavior of ELPA over SLATE. For instance, relative to 48 cores (1 node), the smallest configuration on which the problem fits, the scaling efficiency of ELPA diagonalization is close to 75% at 3072 cores (64 nodes) and drops to around 60% at 12 288 cores (256 nodes). In contrast, the scaling efficiency of the A-Z-RFOE approach using SLATE drops to 37% at 3072 cores (64 nodes) and flattens out around 6144 cores (128 nodes). Insights into minimum wall times, i.e., the least possible execution time obtained with increasing CPU cores, are a valuable outcome of the scaling behavior of the various implementations considered in this work on multi-node CPU architectures. To this end, Fig. 2(c) demonstrates that the minimum wall times obtained by ELPA’s subspace diagonalization are lower by a factor of around 1.7×–2× compared to A-Z-RFOE, the closest competitor, which is consistent with the observation that ELPA’s scaling remains close to ideal up to a much larger number of CPU cores than SLATE’s. For instance, in the case of the Hp matrix of dimension 65 536, the minimum wall time with the explicit diagonalization approach employing the ELPA library is reached at around 12 000 CPU cores, whereas the best RFOE implementation (A-Z-RFOE) based on SLATE gives a 1.7× higher minimum wall time at around 6000 CPU cores.

Finally, Fig. 3 illustrates the comparisons for building the finite-temperature density matrix on multi-node CPUs using the T-RFOE and diagonalization approaches. We observe from Fig. 3(a) that the T-RFOE approaches are computationally more efficient than ELPA’s diagonalization in terms of CPU node-hrs. In particular, the mixed-precision T-RFOE implementation is observed to be faster by a factor of around 3×–4× compared to ELPA’s diagonalization for the matrix dimensions considered in this work. In addition, in the minimum wall time comparisons illustrated in Fig. 3(b), mixed-precision T-RFOE is found to have wall times nearly identical to those of diagonalization using ELPA for all the matrix sizes considered in this work.

FIG. 2. (a) CPU Node-hrs vs matrix size plot, (b) Scaling study involving Hp of size 65 536, and (c) Min. wall time vs matrix size plot. Case study: ELPA diagonalization vs different implementations of RFOE used to approximate the zero-temperature density matrix on multi-node CPUs. The bar plots in (c) indicate the number of CPU cores at which the minimum wall time is obtained for each matrix size.
FIG. 3. (a) CPU Node-hrs vs matrix size plot and (b) Min. wall time vs matrix size plot for different implementations of RFOE used to approximate the finite-temperature density matrix on multi-node CPUs. The bar plots in (b) indicate the number of CPU cores at which the minimum wall time is obtained for each matrix size.

Figures 4 and 5 illustrate a comparative study of GPU-hrs, scalability, and minimum wall times on multi-node GPU architectures between ELPA’s subspace diagonalization procedure and the different RFOE approaches implemented using SLATE. On V100 GPUs, Fig. 4(a) demonstrates that the RFOE implementations for approximating the zero-temperature density matrix are computationally more efficient than subspace diagonalization in terms of GPU-hrs. Among the RFOE implementations, A-Z-RFOE is the closest competitor to ELPA’s diagonalization procedure and is faster than ELPA by factors of approximately 4×, 2.7×, 2×, and 1.2× for Hp matrix dimensions of 8192, 16 384, 32 768, and 65 536, respectively. These results indicate that the speed-up over ELPA reduces with increasing matrix size. We note that the total computation time observed for RFOE or ELPA diagonalization is the cumulative sum of data-movement costs and floating-point operation costs on GPUs; the reduction in the speed-up of RFOE over ELPA with increasing matrix size can be attributed to the decreasing fraction of data-movement cost in the total cost of ELPA’s two-phase diagonalization approach.

Furthermore, Fig. 4(b) shows the scalability data for both the subspace diagonalization procedure using ELPA and A-Z-RFOE implemented using SLATE for the Hp matrix of dimension 65 536. These results indicate the superior scaling behavior of ELPA over SLATE. For instance, relative to two nodes (12 GPUs), the smallest configuration on which the problem fits, the scaling efficiency of the RFOE approaches using SLATE quickly drops to 60% at 4 nodes (24 GPUs), 20% at 16 nodes (96 GPUs), 10% at 128 nodes (768 GPUs), and around 1% at 2048 nodes (12 288 GPUs). In contrast, the scaling efficiency of ELPA is around 93%, 50%, 20%, and 14% at 4, 16, 128, and 256 nodes, respectively. As discussed previously for the CPU comparisons, a practically valuable outcome of the scaling behavior is the minimum wall time, i.e., the least possible execution time one can obtain with increasing GPUs. Figure 4(c) shows these minimum wall time comparisons between the various approaches considered in this work. The minimum wall times obtained using A-Z-RFOE are lower by a factor of 1.2×–1.4× compared to ELPA’s subspace diagonalization for Hp matrix dimensions of 8192 and 16 384. However, owing to the superior scaling of ELPA over SLATE, ELPA’s diagonalization reaches its minimum wall time on far fewer GPUs (48 GPUs) than the RFOE approaches considered in this work (768 and 1536 GPUs for matrix dimensions of 8192 and 16 384, respectively). As the matrix dimension increases to 32 768, the minimum wall times of diagonalization and of the A-Z-RFOE implementation using SLATE are almost identical, while the number of GPUs required to achieve this wall time is larger by a factor of 16× for the A-Z-RFOE approach. For the 65 536 matrix size, ELPA’s diagonalization achieves a 2× lower minimum wall time with far fewer GPUs than A-Z-RFOE.

Finally, Fig. 5 illustrates the comparative studies carried out in building the finite-temperature density matrix on GPUs using the T-RFOE and diagonalization approaches. We observe that both the double- and mixed-precision variants of the T-RFOE implementation are computationally more efficient than the explicit diagonalization procedure using ELPA in the GPU-hrs regime, with the mixed-precision variant being the closest competitor to the diagonalization approach. In particular, we see computational gains of around 4.4×, 3.8×, 3×, and 2.3× in terms of GPU-hrs over ELPA’s diagonalization for Hp matrix dimensions of 8192, 16 384, 32 768, and 65 536, respectively. Figure 5(b) illustrates the minimum wall time comparisons: the wall times obtained using mixed-precision T-RFOE are lower by a factor of 1.3×–2× compared to ELPA’s subspace diagonalization for matrix dimensions of 8192, 16 384, and 32 768, although the superior scaling of ELPA over SLATE means these minimum wall times are reached on fewer GPUs with ELPA. Furthermore, for the Hp matrix size of 65 536, the minimum wall time obtained via ELPA’s diagonalization approach is lower than that of T-RFOE and is obtained on far fewer GPUs.

FIG. 4. (a) GPU-hrs vs matrix size plot, (b) Scaling study, and (c) Min. wall time vs matrix size plot. Case study: ELPA diagonalization vs different implementations of RFOE used to approximate the zero-temperature density matrix on multi-node GPUs.
FIG. 5. (a) GPU-hrs vs matrix size plot, and (b) Min. wall time vs matrix size plot for different implementations of RFOE used to approximate the finite-temperature density matrix on multi-node GPUs.

The comparisons reported in this study were performed on V100 SXM2 GPUs. However, it is important to note that newer GPU architectures can potentially provide even greater computational improvements in terms of GPU-hrs, because the dominant computational costs of the RFOE approaches examined in this study arise primarily from dense matrix–matrix multiplications. For instance, accounting for the data-movement costs associated with dense matrix–matrix multiplications, we anticipate a speed-up of ∼1.5–2× in terms of GPU-hrs for the RFOE approaches on A100 GPUs compared to V100 SXM2 GPUs. Furthermore, for H100 GPUs, we anticipate speed-ups in GPU-hrs of around ∼5–6× relative to V100 GPUs, since H100 GPUs have close to 8× higher FP64 throughput. These advanced architectures can also benefit ELPA’s diagonalization, but not to the extent that they benefit the RFOE dense matrix–matrix multiplications, since explicit diagonalization methods do not involve high-arithmetic-intensity operations to the same degree as RFOE. Diagonalization relies on a two-phase approach: the first phase reduces the dense matrix to a tridiagonal matrix using Householder reflectors, which are constructed from vectors and do not involve high-arithmetic-intensity operations like dense matrix–matrix multiplications; the subsequent phase reduces this tridiagonal matrix to diagonal form, which involves manipulating sparse matrices and incurs higher data-movement costs.

In the case of minimum wall time comparisons on newer-architecture GPUs, the speed-ups gained by the RFOE approach over ELPA, or vice versa, may not change much from what is reported in this work, since communication becomes the dominant cost at larger GPU counts. As of today, this is handled better by ELPA’s diagonalization implementation than by the parallel dense matrix–matrix multiplication (pgemm) routines implemented in state-of-the-art libraries.

In order to estimate the upper bound on energy consumption in units of kWh as a function of matrix dimension, we utilize the TDP ratings for executing both the RFOE implementations and the explicit diagonalization approach using the ELPA library. To plot the energy consumption against the matrix dimension, we consider both the CPU node-hrs/GPU-hrs and minimum wall time regimes, using the execution times of the best-performing RFOE implementation and of ELPA diagonalization involving Hp. Figures 6 and 7 summarize the relevant data. As the matrix size increases, we observe a higher energy consumption for explicit diagonalization using ELPA on CPUs, both in the regime of CPU node-hrs (∼2×) and minimum wall time (∼1.2×), compared to the best-performing RFOE implementations. On GPUs, we also observe a higher energy consumption for explicit diagonalization using ELPA in the GPU-hrs regime. However, in the regime of minimum wall time, the energy consumption is significantly higher for the best RFOE implementation using SLATE than for ELPA, owing to the excellent scaling behavior of ELPA on multi-node GPUs.

FIG. 6. Energy consumption (kWh) vs matrix size plot in terms of (a) Node-hrs/GPU-hrs and (b) Min. wall time for the best-performing implementation of RFOE used to approximate a zero-temperature density matrix.
FIG. 7. Energy consumption (kWh) vs matrix size plot in terms of (a) Node-hrs/GPU-hrs and (b) Min. wall time for the best-performing implementation of RFOE in the case of approximating a finite-temperature density matrix.

Our current work investigates recursive Fermi-operator expansion (RFOE) strategies involving subspace projected Hamiltonians as an alternative to explicit subspace diagonalization in iterative orthogonal projection methods for solving large-scale eigenvalue problems. We explore the potential of these strategies on both multi-node CPU and GPU architectures by employing the ELPA and SLATE libraries. As demonstrated by the various studies conducted in this work, we find that RFOE schemes involving dense matrix–matrix multiplications have a smaller computational prefactor associated with the cubic scaling complexity and, hence, are computationally more efficient (∼2×–4×) than explicit diagonalization of the subspace projected Hamiltonian in the CPU node-hrs/GPU-hrs regime. Further, we found that RFOE strategies can be more memory efficient than explicit diagonalization procedures on GPUs. However, our results also demonstrate that the scaling efficiency of the explicit subspace diagonalization procedure carried out using ELPA is superior to that of the RFOE strategies, resulting in lower minimum wall times on both CPUs and GPUs. Finally, we conclude that RFOE strategies can be considered effective alternatives to subspace diagonalization, particularly in a good scaling regime where computational efficiency is measured in terms of the CPU node-hrs/GPU-hrs consumed. However, these strategies may not perform as well in the extreme-scaling regime, as demonstrated in the current work.

The authors gratefully acknowledge the seed grant from the Indian Institute of Science (IISc) and the SERB Startup Research Grant from the Department of Science and Technology, India (Grant No. SRG/2020/002194) for the purchase of a GPU cluster, which provided computational resources for this work. This research used the resources of PARAM Pravega at the Indian Institute of Science, supported by the National Supercomputing Mission (NSM) R&D exa-scale grant (DST/NSM/R&D_Exascale/2021/14.02). This research also used the resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

The authors have no conflicts to disclose.

Sameer Khadatkar: Data curation (lead); Formal analysis (lead); Methodology (equal); Writing – original draft (equal). Phani Motamarri: Conceptualization (lead); Methodology (equal); Project administration (lead); Supervision (lead); Writing – original draft (equal); Writing – review & editing (lead).

The data that support the findings of this study are available within the article.

1. W. Kohn and L. J. Sham, “Self-consistent equations including exchange and correlation effects,” Phys. Rev. 140, A1133–A1138 (1965).
2. L. E. Ratcliff, W. Dawson, G. Fisicaro, D. Caliste, S. Mohr, A. Degomme, B. Videau, V. Cristiglio, M. Stella, M. D’Alessandro, S. Goedecker, T. Nakajima, T. Deutsch, and L. Genovese, “Flexibilities of wavelets as a computational basis set for large-scale electronic structure calculations,” J. Chem. Phys. 152, 194110 (2020).
3. S. Ghosh and P. Suryanarayana, “SPARC: Accurate and efficient finite-difference formulation and parallel implementation of density functional theory: Isolated clusters,” Comput. Phys. Commun. 212, 189–204 (2017).
4. S. Das, P. Motamarri, V. Gavini, B. Turcksin, Y. W. Li, and B. Leback, “Fast, scalable and accurate finite-element based ab initio calculations using mixed precision computing: 46 PFLOPS simulation of a metallic dislocation system,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019.
5. P. Motamarri, S. Das, S. Rudraraju, K. Ghosh, D. Davydov, and V. Gavini, “DFT-FE—A massively parallel adaptive finite-element code for large-scale density functional theory calculations,” Comput. Phys. Commun. 246, 106853 (2020).
6. S. Das, P. Motamarri, V. Subramanian, D. M. Rogers, and V. Gavini, “DFT-FE 1.0: A massively parallel hybrid CPU-GPU density functional theory code using finite-element discretization,” Comput. Phys. Commun. 280, 108473 (2022).
7. E. R. Davidson, “The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices,” J. Comput. Phys. 17, 87–94 (1975).
8. A. V. Knyazev, “Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method,” SIAM J. Sci. Comput. 23, 517–541 (2001).
9. Y. Zhou, Y. Saad, M. L. Tiago, and J. R. Chelikowsky, “Self-consistent-field calculations using Chebyshev-filtered subspace iteration,” J. Comput. Phys. 219, 172–184 (2006).
10. A. M. N. Niklasson, “Expansion algorithm for the density matrix,” Phys. Rev. B 66, 155115 (2002).
11. G. Beylkin, N. Coult, and M. J. Mohlenkamp, “Fast spectral projection algorithms for density-matrix computations,” J. Comput. Phys. 152, 32–54 (1999).
12. S. M. Mniszewski, R. Perriot, E. H. Rubensson, C. F. A. Negre, M. J. Cawkwell, and A. M. N. Niklasson, “Linear scaling pseudo Fermi-operator expansion for fractional occupation,” J. Chem. Theory Comput. 15, 190–200 (2019).
13. A. M. N. Niklasson, “A note on the Pulay force at finite electronic temperatures,” J. Chem. Phys. 129, 244107 (2008).
14. M. J. Cawkwell, E. J. Sanville, S. M. Mniszewski, and A. M. N. Niklasson, “Computing the density matrix in electronic structure theory on graphics processing units,” J. Chem. Theory Comput. 8, 4094–4101 (2012).
15. J. Finkelstein, J. S. Smith, S. M. Mniszewski, K. Barros, C. F. A. Negre, E. H. Rubensson, and A. M. N. Niklasson, “Mixed precision Fermi-operator expansion on tensor cores from a machine learning perspective,” J. Chem. Theory Comput. 17, 2256–2265 (2021).
16. J. Finkelstein, J. S. Smith, S. M. Mniszewski, K. Barros, C. F. A. Negre, E. H. Rubensson, and A. M. N. Niklasson, “Quantum-based molecular dynamics simulations using tensor cores,” J. Chem. Theory Comput. 17, 6180–6192 (2021).
17. J. Finkelstein, E. H. Rubensson, S. M. Mniszewski, C. F. A. Negre, and A. M. N. Niklasson, “Quantum perturbation theory using tensor cores and a deep neural network,” J. Chem. Theory Comput. 18, 4255–4268 (2022).
18. R. Pederson, J. Kozlowski, R. Song, J. Beall, M. Ganahl, M. Hauru, A. G. M. Lewis, Y. Yao, S. B. Mallick, V. Blum, and G. Vidal, “Large scale quantum chemistry with tensor processing units,” J. Chem. Theory Comput. 19, 25–32 (2023).
19. A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke, H.-J. Bungartz, and H. Lederer, “The ELPA library: Scalable parallel eigenvalue solutions for electronic structure theory and computational science,” J. Phys.: Condens. Matter 26, 213201 (2014).
20. V. W.-z. Yu, J. Moussa, P. Kůs, A. Marek, P. Messmer, M. Yoon, H. Lederer, and V. Blum, “GPU-acceleration of the ELPA2 distributed eigensolver for dense symmetric and Hermitian eigenproblems,” Comput. Phys. Commun. 262, 107808 (2021).
21. S. Goedecker and L. Colombo, “Efficient linear scaling algorithm for tight-binding molecular dynamics,” Phys. Rev. Lett. 73, 122–125 (1994).
22. A. Weiße, G. Wellein, A. Alvermann, and H. Fehske, “The kernel polynomial method,” Rev. Mod. Phys. 78, 275–306 (2006).
23. R. Zeller, J. Deutz, and P. H. Dederichs, “Application of complex energy integration to selfconsistent electronic structure calculations,” Solid State Commun. 44, 993–997 (1982).
24. S. Goedecker, “Linear scaling electronic structure methods,” Rev. Mod. Phys. 71, 1085–1123 (1999).
25. M. Gates, J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, “SLATE: Design of a modern distributed and accelerated linear algebra library,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019.
26. G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà, and T. Hoefler, “Red-blue pebbling revisited: Near optimal parallel matrix-matrix multiplication,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019.
27. NVIDIA, CUDA Math Library Early Access Program.
28. P. Motamarri, M. R. Nowak, K. Leiter, J. Knap, and V. Gavini, “Higher-order adaptive finite-element methods for Kohn–Sham density functional theory,” J. Comput. Phys. 253, 308–343 (2013).
29. P. Motamarri and V. Gavini, “Subquadratic-scaling subspace projection method for large-scale Kohn–Sham density functional theory calculations using spectral finite-element discretization,” Phys. Rev. B 90, 115127 (2014).
30. R. McWeeny, “Some recent advances in density matrix theory,” Rev. Mod. Phys. 32, 335–369 (1960).
31. A. H. R. Palser and D. E. Manolopoulos, “Canonical purification of the density matrix in electronic-structure theory,” Phys. Rev. B 58, 12704–12711 (1998).
32. T. Ozaki, “Continued fraction representation of the Fermi-Dirac function for large-scale electronic structure calculations,” Phys. Rev. B 75, 035123 (2007).
33. K. Németh and G. E. Scuseria, “Linear scaling density matrix search based on sign matrices,” J. Chem. Phys. 113, 6035–6041 (2000).
34. A. Holas, “Transforms for idempotency purification of density matrices in linear-scaling electronic-structure calculations,” Chem. Phys. Lett. 340, 552–558 (2001).
35. A. M. N. Niklasson, “Implicit purification for temperature-dependent density matrices,” Phys. Rev. B 68, 233104 (2003).
36. D. K. Jordan and D. A. Mazziotti, “Comparison of two genres for linear scaling in density functional theory: Purification and density matrix minimization methods,” J. Chem. Phys. 122, 084114 (2005).
37. E. Rudberg and E. H. Rubensson, “Assessment of density matrix methods for linear scaling electronic structure calculations,” J. Phys.: Condens. Matter 23, 075502 (2011).
38. P. Suryanarayana, “Optimized purification for density matrix calculation,” Chem. Phys. Lett. 555, 291–295 (2013).
39. E. H. Rubensson and A. M. N. Niklasson, “Interior eigenvalues from density matrix expansions in quantum mechanical molecular dynamics,” SIAM J. Sci. Comput. 36, B147–B170 (2014).
40. D. R. Bowler and M. J. Gillan, “Density matrices in O(N) electronic structure calculations: Theory and applications,” Comput. Phys. Commun. 120, 95–108 (1999).
41. L. A. Truflandier, R. M. Dianzinga, and D. R. Bowler, “Communication: Generalized canonical purification for density matrix minimization,” J. Chem. Phys. 144, 091102 (2016).
42. E. H. Rubensson, “Nonmonotonic recursive polynomial expansions for linear scaling calculation of the density matrix,” J. Chem. Theory Comput. 7, 1233–1236 (2011).