Quantum mechanical calculations for material modeling using Kohn–Sham density functional theory (DFT) involve the solution of a nonlinear eigenvalue problem for the N smallest eigenvalue/eigenvector pairs, with N proportional to the number of electrons in the material system. These calculations are computationally demanding and have asymptotic cubic scaling complexity with the number of electrons. Large-scale matrix eigenvalue problems arising from the discretization of the Kohn–Sham DFT equations employing a systematically convergent basis traditionally rely on iterative orthogonal projection methods, which have been shown to be computationally efficient and scalable on massively parallel computing architectures. However, as the size of the material system increases, these methods are known to incur dominant computational costs through the Rayleigh–Ritz projection step of the discretized Kohn–Sham Hamiltonian matrix and the subsequent subspace diagonalization of the projected matrix. This work explores the potential of polynomial expansion approaches based on recursive Fermi-operator expansion as an alternative to the subspace diagonalization of the projected Hamiltonian matrix to reduce the computational cost. Subsequently, we perform a detailed comparison of various recursive polynomial expansion approaches to the traditional approach of explicit diagonalization on both multi-node central processing unit and graphics processing unit architectures and assess their relative performance in terms of accuracy, computational efficiency, scaling behavior, and energy efficiency.
I. INTRODUCTION
A well-known and challenging application of eigenvalue problems is in the area of quantum modeling of materials using Kohn–Sham density functional theory (DFT),1 which has been immensely successful in providing critical insights into various ground-state material properties. To compute the ground-state electronic structure in DFT, one is confronted with solving a large-scale nonlinear eigenvalue problem using a self-consistent field iteration procedure (SCF) for the N smallest eigenvalue/eigenvector pairs, with N being proportional to the number of electrons in the material system. This results in asymptotic cubic complexity with the number of electrons for DFT, making these calculations computationally demanding and often restrictive in terms of system sizes that can be handled using widely used DFT codes. Many of these codes employ systematically convergent plane-wave basis sets, which restrict the simulation domains to be periodic, or atomic-orbital (AO) type basis sets, which use very few basis functions per atom but are not systematically convergent. Furthermore, the extended nature of these basis sets restricts their parallel scalability. To extend the range of system sizes to be studied, numerous past efforts have focused on developing systematically convergent real-space computational methodologies2–6,28,29 that have relied on reducing the prefactor associated with the cubic computational complexity alongside improving the parallel scalability, thereby enabling large-scale DFT calculations up to 100 000 electrons. These real-space DFT discretization approaches result in large sparse Hermitian eigenvalue problems to be solved for the N smallest eigenvalue/eigenvector pairs.
DFT codes employing systematically convergent basis sets either in Fourier-space or in real-space require a large number of basis functions per atom to achieve chemical accuracy compared to approaches employing AO basis sets. This results in large Hamiltonian matrices of dimension M with M ≈ 10^5–10^8, whose N smallest eigenvalue/eigenvector pairs we seek. We note that N is usually 0.1%–0.5% of M, the total number of basis functions [degrees of freedom (DoFs)] used in the simulation. The most popular eigensolver strategies employed to solve these large-scale DFT eigenvalue problems include the Davidson approach,7 the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method,8 and the Chebyshev filtered subspace iteration (ChFSI)9,28 approach. These eigensolvers belong to the category of iterative orthogonal projection methods (IOP), wherein the discretized Kohn–Sham Hamiltonian matrix H of size M × M is orthogonally projected onto a carefully constructed subspace rich in the wanted eigenvectors (Rayleigh–Ritz step), and subsequently, the resulting smaller, dense projected Hamiltonian Hp of dimension k (≪M) is explicitly diagonalized (subspace diagonalization) to approximate the desired eigenvalue/eigenvector pairs of the H matrix. The cubic scaling computational cost of this subspace diagonalization step dominates for medium- to large-scale material systems (N > 20 000) compared to the costs associated with the subspace construction and Rayleigh–Ritz steps in IOP methods. For instance, Das et al.6 employing the ChFSI approach have reported that the subspace diagonalization constitutes roughly 30% of the total ChFSI cost for N ≈ 30 000, whereas it accounts for around 56% of the total cost for N ≈ 50 000. To this end, the current work explores recursive Fermi-operator expansion approaches on HPC architectures as an alternative to the subspace diagonalization procedure to improve computational efficiency, thereby reducing the computational prefactor associated with the cubic complexity of the DFT problem. Furthermore, the accuracy, computational efficiency, parallel performance, and energy efficiency of these approaches are examined on both multi-node central processing unit (CPU) and graphics processing unit (GPU) architectures and compared with explicit diagonalization of the projected Hamiltonian matrix.
Recursive polynomial expansion (RPE) approaches rely on the key idea that constructing a density matrix (the projector matrix corresponding to the N smallest eigenvectors) suffices to compute the ground-state electronic structure in DFT at zero temperature without the necessity of computing explicit eigenvalues and eigenvectors. In the past, RPE approaches10–13 have been used for ground-state DFT calculations with an atomic-orbital basis. Most of these methods aim to achieve linear scaling complexity using sparse-matrix algebra employing numerical thresholds. Furthermore, these recursive expansion schemes have also been shown to achieve high throughput performance using dense matrix algebra14 on a single Nvidia M2070 GPU, compared to traditional diagonalization approaches. More recently, the computational structure of RPE approaches has been leveraged in the form of a deep neural network architecture15 to take advantage of the tensor cores in Nvidia A100 GPUs. These techniques have been applied for ab initio molecular dynamics16 and quantum response calculations17 using atomic-orbital basis sets. Furthermore, Google's tensor processing units (TPUs) have been successfully used for large-scale DFT calculations18 employing density matrix purification approaches, which are similar in spirit to RPE and involve dense matrix–matrix multiplications. However, to the best of our knowledge, the computational efficiency, scaling behavior on distributed systems, and energy efficiency of these RPE approaches have not been compared against subspace diagonalization procedures for their use in iterative orthogonal projection methods on multi-node CPU and GPU architectures. The evolving computing architectures in today's exascale era demand scalable methodologies focusing on reduced data movement and increased arithmetic intensity on massively parallel computing architectures, with an equal emphasis on employing energy-efficient algorithms. The current work, which explores recursive polynomial expansion approaches as an alternative to subspace diagonalization, is a step in this direction, enabling the solution of large-scale eigenvalue problems arising from the discretization of DFT problems with systematically convergent basis sets via IOP methods. To this end, the key contributions of our current work, as described in the subsequent sections, include—(a) efficient implementation strategies for various RPE techniques based on Fermi-operator expansion on both multi-node CPU and GPU architectures, covering both the zero-temperature case and the finite-temperature case of Fermi–Dirac smearing of the occupancy function; (b) mixed-precision strategies in conjunction with RPE to reduce compute and data access costs; and (c) assessing the accuracy, computational efficiency, parallel scaling behavior, and energy efficiency of the proposed implementation procedures by comparing with explicit diagonalization algorithms provided by the state-of-the-art ELPA library.19,20
II. RELATED WORK AND BACKGROUND
This section discusses key ideas central to recursive polynomial expansion approaches among the various purification schemes30,31,33–41 employed in the literature to approximate the density matrix.
A. Density matrix
At zero electronic temperature, the density matrix (D) can be defined as the projector matrix corresponding to the lowest occupied (Nocc ≤ N) eigenvectors of the discretized Kohn–Sham Hamiltonian matrix H. Mathematically, it takes the form of a shifted Heaviside step function, θ(.), given by D = θ[μI − H]. At finite temperature, the density matrix (D) is a smeared version of the zero-temperature density matrix and is mathematically represented by the Fermi-operator matrix function D = [I + exp(β(H − μI))]^{-1}, where I denotes the identity matrix, β = 1/(kBTe) is the inverse electronic temperature, and μ is the Fermi energy. Note that the eigenvalues fi of D are referred to as occupancies; fi is either 0 or 1 in the zero-temperature case, whereas fi ∈ [0, 1] in the finite-temperature case.
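The two definitions above can be made concrete with a small reference routine that builds D from an explicit eigendecomposition. This is our own NumPy illustration (the function name and test values are placeholders), not how the density matrix is obtained in the large-scale setting considered here.

```python
import numpy as np

def density_matrix(H, mu, beta=None):
    """Reference density matrix via explicit eigendecomposition (illustration only).

    beta=None  -> zero-temperature projector  D = theta(mu*I - H)
    beta=1/kT  -> Fermi-Dirac smeared matrix  D = [I + exp(beta*(H - mu*I))]^{-1}
    """
    w, V = np.linalg.eigh(H)                       # eigenpairs of the symmetric H
    if beta is None:
        f = (w < mu).astype(float)                 # occupancies are exactly 0 or 1
    else:
        f = 1.0 / (1.0 + np.exp(beta * (w - mu)))  # occupancies lie in [0, 1]
    return (V * f) @ V.T                           # D = V diag(f) V^T

# tiny check: Tr(D) equals the number of (fractionally) occupied states
H = np.diag(np.linspace(-1.0, 1.0, 8))
print(np.trace(density_matrix(H, mu=0.1)),
      np.trace(density_matrix(H, mu=0.1, beta=20.0)))
```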
B. Recursive polynomial expansion techniques to approximate the density matrix
Two types of polynomial expansion schemes can be used to approximate the density matrix—(a) serial Fermi-operator expansion schemes (the Chebyshev Fermi-operator expansion scheme21,22 and the Green's function expansion scheme23,32) and (b) recursive Fermi-operator expansion schemes.10–13 In this work, we employ the recursive Fermi-operator expansion schemes, as they are shown to reach a very high polynomial order in the approximation within a few iterations (i.e., a few matrix–matrix multiplications) and can be used to approximate the density matrix in both the zero-temperature and finite-temperature cases.
1. Recursive Fermi-operator expansion for zero-temperature density matrix (Z-RFOE)
One of the most efficient techniques in Z-RFOE is to use the second-order projection (SP2) polynomials11 given by X_{n+1} = X_n ± (X_n − X_n^2), i.e., X_{n+1} = X_n^2 or X_{n+1} = 2X_n − X_n^2. Each SP2 polynomial maps the interval [0, 1] onto itself, pushing the eigenvalues of X_n toward either 0 or 1. The ± sign is chosen in each projection step to adjust the trace of X_{n+1} such that it converges to Nocc.
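To make the recursion concrete, the following minimal NumPy sketch (our illustration, not the authors' distributed implementation) implements the branch selection that keeps the trace close to Nocc. The function name, the convergence test based on the relative trace change, and the linear initial mapping of the spectrum of H into [0, 1] with occupied states near 1 are standard choices in SP2-type schemes and should be read as assumptions.

```python
import numpy as np

def sp2_density_matrix(H, n_occ, eps_min, eps_max, tol=5e-8, max_iter=100):
    """Sketch of the SP2 (Z-RFOE) recursion on a dense symmetric matrix H.

    eps_min/eps_max are spectral bound estimates of H; n_occ is the number of
    occupied states.  Returns an approximation to the zero-temperature density
    matrix without computing eigenvalues or eigenvectors explicitly.
    """
    k = H.shape[0]
    # Linear map of the spectrum into [0, 1], occupied states near 1.
    X = (eps_max * np.eye(k) - H) / (eps_max - eps_min)
    prev_trace = np.trace(X)
    for _ in range(max_iter):
        X2 = X @ X                               # dominant cost: dense matrix-matrix product
        tr_sq = np.trace(X2)                     # trace of X^2
        tr_lin = 2.0 * np.trace(X) - tr_sq       # trace of 2X - X^2
        # Choose the branch whose trace stays closest to n_occ.
        X = X2 if abs(tr_sq - n_occ) < abs(tr_lin - n_occ) else 2.0 * X - X2
        trace = np.trace(X)
        if abs(trace - prev_trace) / n_occ < tol:  # relative trace change
            break
        prev_trace = trace
    return X
```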
2. Accelerated recursive polynomial expansion for zero-temperature density matrix (A-Z-RFOE)
This technique works on the concept of shifting and scaling the SP2 projection polynomials. In contrast to Z-RFOE, which uses a fixed SP2 polynomial of the form F(X) = X^2 or F(X) = 2X − X^2, the A-Z-RFOE39,42 approach offers the flexibility to choose the expansion polynomials based on the HOMO (Highest Occupied Molecular Orbital) and LUMO (Lowest Unoccupied Molecular Orbital) eigenvalues such that the eigenvalues are moved closer to either 1 or 0 at a rapid rate.
3. Recursive Fermi-operator expansion scheme for finite-temperature cases (T-RFOE)
The T-RFOE scheme13 extends the recursive expansion to finite electronic temperatures and can be viewed as an alternative way of smearing the zero-temperature density matrix, thereby providing an approximation to the Fermi–Dirac density matrix defined in Sec. II A (its accuracy relative to the exact Fermi–Dirac density matrix is assessed in Sec. IV).
C. Accuracy and performance of polynomial expansion procedures
The accuracy24 of the aforementioned polynomial expansion procedures can be related to the degree npl of the polynomial needed to approximate the density matrix, and we note that npl ∝ (ɛN − ɛ1), with ɛ1, ɛN being spectral bound estimates of the discretized DFT Hamiltonian H. Since the spectral width of H arising in DFT calculations employing a systematically convergent basis is usually very large, a large value of npl is required to approximate the density matrix. Consequently, applying the above RPE approaches directly to H would involve a large number of computationally expensive matrix–matrix multiplications of very large matrices on distributed architectures and is therefore never done in practice.
III. COMPUTATIONAL METHODOLOGY AND IMPLEMENTATION
A. Methodology and algorithmic details
As discussed before, we consider iterative orthogonal projection (IOP) based methods for solving the large-scale eigenvalue problem associated with matrix H of dimension M × M. In these approaches, the orthogonal projection of the matrix H is carried out onto a carefully constructed subspace of dimension k ≪ M rich in the wanted eigenvectors. We choose to work with this smaller projected Hamiltonian Hp of dimension k × k and employ the recursive polynomial expansion procedures on Hp to approximate the density matrix in the subspace29 as an alternative to explicit subspace diagonalization of Hp. Owing to the reduced dimension of Hp and the spectral width of Hp being commensurate with that of the occupied Kohn–Sham eigenspectrum, the proposed approach is computationally efficient, as demonstrated subsequently.
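To make the algebra of this subspace route concrete, the following minimal NumPy sketch (our illustration, not the authors' implementation; the matrices, the subspace basis Q, and the sizes are placeholders) contrasts the two routes for obtaining the subspace density matrix Dp and recovers the full-space density matrix as D = Q Dp Q^T.

```python
import numpy as np

# Illustrative workflow: given an M x k orthonormal basis Q of a subspace rich
# in the wanted eigenvectors, form Hp = Q^T H Q and approximate the density
# matrix in that subspace; the full-space density matrix is D = Q Dp Q^T.
M, k, n_occ = 1000, 100, 90
rng = np.random.default_rng(0)
A = rng.standard_normal((M, M))
H = 0.5 * (A + A.T)                               # stand-in for the discretized Hamiltonian

Q, _ = np.linalg.qr(rng.standard_normal((M, k)))  # stand-in orthonormal subspace basis
Hp = Q.T @ H @ Q                                  # k x k projected Hamiltonian

# Route 1: explicit subspace diagonalization (ELPA's role in this work)
w, C = np.linalg.eigh(Hp)
Dp = C[:, :n_occ] @ C[:, :n_occ].T                # projector onto the lowest n_occ Ritz vectors

# Route 2 (the alternative explored here): a recursive Fermi-operator expansion
# applied to Hp, e.g., the sp2_density_matrix sketch shown earlier:
# Dp = sp2_density_matrix(Hp, n_occ, w.min(), w.max())

D = Q @ Dp @ Q.T                                  # density matrix in the original space
print(np.trace(Dp))                               # equals n_occ
```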
Figure 1 describes the overall algorithmic details of implementing the recursive Fermi-operator expansion (RFOE) approaches discussed in this work (Z-RFOE, A-Z-RFOE, and T-RFOE) employing Hp. As indicated in the figure, the change in the trace of the trial density matrix (X_n) at a given iteration n relative to the previous iteration is used as the convergence criterion for RFOE. Furthermore, we also explore mixed-precision strategies in conjunction with the Z-RFOE and T-RFOE schemes implemented in this work. To this end, we rely on the fact that, far away from RFOE convergence to the density matrix, the floating-point operations can be performed in single precision (FP32), switching to double-precision (FP64) operations thereafter. The criterion deciding the number of initial FP32 iterations is likewise linked to the relative trace change of X_n, using a looser tolerance than the final FP64 convergence tolerance (the specific values are given in Sec. IV).
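A hedged sketch of this mixed-precision logic is shown below, assuming the SP2 branch selection of Sec. II B 1 and the FP32/FP64 tolerances quoted later in Sec. IV (5 × 10^−4 and 5 × 10^−8). The function name and looping structure are our assumptions and not the SLATE-based production code.

```python
import numpy as np

def rfoe_mixed_precision(Hp, n_occ, eps_min, eps_max,
                         tol_fp32=5e-4, tol_fp64=5e-8, max_iter=200):
    """Hypothetical mixed-precision Z-RFOE driver (a sketch, not the SLATE code).

    Iterations start in FP32; once the relative trace change drops below
    tol_fp32, the iterate is promoted to FP64 and the recursion continues
    until the tighter tolerance tol_fp64 is met.
    """
    k = Hp.shape[0]
    X = ((eps_max * np.eye(k) - Hp) / (eps_max - eps_min)).astype(np.float32)
    prev_trace = float(np.trace(X))
    in_fp32 = True
    for _ in range(max_iter):
        X2 = X @ X                                   # dominant cost in either precision
        tr_x, tr_sq = float(np.trace(X)), float(np.trace(X2))
        X = X2 if abs(tr_sq - n_occ) < abs(2 * tr_x - tr_sq - n_occ) else 2 * X - X2
        trace = float(np.trace(X))
        dtrace = abs(trace - prev_trace) / n_occ     # relative trace change
        prev_trace = trace
        if in_fp32 and dtrace < tol_fp32:
            X = X.astype(np.float64)                 # promote to double precision
            in_fp32 = False
        elif not in_fp32 and dtrace < tol_fp64:
            break
    return X
```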
B. Implementation strategies
The C++ implementation of the RFOE codes employs the Message Passing Interface (MPI) library to achieve multi-node parallelism. To handle the parallel matrices encountered during the RFOE process, we utilize the SLATE (Software for Linear Algebra Targeting Exascale) library.25 SLATE stores these matrices in a 2-D block-cyclic fashion on both CPUs and GPUs within a node. The tile size, which determines the size of the sub-matrices (blocks) distributed in a 2-D block-cyclic manner across the processor grid, plays a crucial role in the parallel performance of matrix–matrix multiplication. Therefore, we conduct numerical experiments to explore the impact of varying the tile size on the overall performance and to identify the optimal tile size for a given dimension of Hp.
The computation of the matrix trace, required at the beginning and end of the RFOE iterations, involves traversing the diagonal elements of the global matrix and utilizing MPI collectives. Additionally, we compute the trace of the square of a symmetric matrix X arising during the course of the RFOE iterations by evaluating the square of the Frobenius norm of the given parallel distributed SLATE matrix, i.e., Tr(X^2) = ‖X‖_F^2. This is facilitated through a SLATE API call, which grants users access to the Frobenius norm. Among the various steps involved in the RFOE algorithms, matrix–matrix multiplication (MM) is the computationally dominant step. To handle this step efficiently across multiple CPUs and GPUs, we leverage the capabilities of the SLATE library, which enables parallel execution of the underlying MM. Furthermore, we explored the performance of alternative approaches, namely the Communication-Optimal Matrix Multiplication (COSMA)26 and cuBLASMg (provided by the CUDA Math Library Early Access Program27) libraries, for computing parallel matrix–matrix multiplications on CPUs and GPUs. Our studies indicate that COSMA is slower than the SLATE library in terms of computational time. Additionally, it is worth noting that the cuBLASMg library is limited to utilizing multiple GPUs within a single node, which constrains its applicability.
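The identity used here, Tr(X^2) = ‖X‖_F^2 for a symmetric matrix X, can be checked with a few lines of NumPy (a standalone illustration; it does not involve the SLATE API).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 500))
X = 0.5 * (A + A.T)                    # symmetric test matrix

tr_X2 = np.trace(X @ X)                # direct evaluation of Tr(X^2)
fro_sq = np.linalg.norm(X, 'fro')**2   # ||X||_F^2
print(abs(tr_X2 - fro_sq))             # the two agree to round-off
```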
C. Metrics for accuracy benchmarking of RFOE
For accuracy benchmarking of the RFOE methods implemented in this work, we compute two error metrics—(a) the relative error (ɛ1) in the Frobenius norm between the exact density matrix (Dref) and the approximate density matrix (Dp) computed via RFOE, i.e., ɛ1 = ‖Dp − Dref‖_F/‖Dref‖_F, and (b) the relative error (ɛ2) between the traces of the exact and approximate values of f(Hp)Hp, a measure of the band-energy error, given by ɛ2 = (Tr(DpHp) − Tr(DrefHp))/Tr(DrefHp). Dref is computed by explicit diagonalization using the ELPA library.19,20 We note that Tr(Dref) equals Nocc and, hence, the electron density computed using the RFOE approaches will have an absolute error ɛρ = O(ɛ1Nocc) with respect to the true ground-state electron density. Since the Kohn–Sham energy functional is variational with respect to the ground-state electron density, the ground-state energy error will be of O(ɛρ^2), and the forces, which are derivatives of the ground-state energy with respect to the atomic positions, will have an error of O(ɛρ).
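For reference, the two metrics translate directly into code; the sketch below is a dense NumPy illustration with an assumed function name, not part of the benchmarked implementation.

```python
import numpy as np

def rfoe_error_metrics(Dp, Dref, Hp):
    """Error metrics of Sec. III C (a sketch using dense NumPy arrays)."""
    # (a) relative Frobenius-norm error in the density matrix
    eps1 = np.linalg.norm(Dp - Dref, 'fro') / np.linalg.norm(Dref, 'fro')
    # (b) relative band-energy error Tr(Dp Hp) vs Tr(Dref Hp)
    eps2 = (np.trace(Dp @ Hp) - np.trace(Dref @ Hp)) / np.trace(Dref @ Hp)
    return eps1, eps2
```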
IV. RESULTS AND DISCUSSION
To assess the accuracy and performance of the proposed methods, we employ synthetic matrices representative of the subspace projected Hamiltonians (Hp) arising in pseudopotential DFT calculations. To this end, the matrix Hp is constructed so that its spectral width is small and remains constant with an increase in the matrix size (indicative of an increase in the system size of a given material system). We choose the matrix elements of Hp such that its spectrum lies within this fixed, narrow range, and the matrix sizes investigated in this study are 8192 × 8192, 16 384 × 16 384, 32 768 × 32 768, and 65 536 × 65 536. The number of occupied states (Nocc) is chosen to be around 90% of the dimension of the matrix in the case of the zero-temperature density matrix, while in the case of the finite-temperature density matrix, the number of occupied states (including fractionally occupied ones) is chosen to be 85% of the dimension of the matrix. We compare the performance of different implementations of RFOE (Z-RFOE, A-Z-RFOE, and T-RFOE) against explicit diagonalization of Hp using the ELPA19,20 library employing a two-stage diagonalization algorithm (CPU-ELPA2 and GPU-ELPA2) on both multi-node CPUs and GPUs.
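Since the exact entries of the synthetic Hp are not specified here, the sketch below only illustrates the stated requirements (a symmetric matrix with a fixed spectral width and Nocc set to a fixed fraction of the dimension); the construction via a random orthogonal basis and the smaller test dimension are our assumptions.

```python
import numpy as np

def synthetic_hp(k, width=1.0, seed=0):
    """Assumed construction of a synthetic projected Hamiltonian (illustration only):
    a dense symmetric matrix whose eigenvalues lie in a band of fixed width,
    independent of the matrix dimension k."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.5 * width, 0.5 * width, size=k)   # fixed spectral width
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))     # random orthogonal basis
    return (Q * w) @ Q.T                                  # Hp = Q diag(w) Q^T

# The study uses dimensions 8192-65 536; a smaller size keeps this sketch cheap.
Hp = synthetic_hp(2048)
n_occ_zero_T = int(0.90 * Hp.shape[0])    # ~90% occupied states (zero temperature)
n_occ_finite_T = int(0.85 * Hp.shape[0])  # ~85% occupied states (finite temperature)
```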
The multi-node CPU study used the PARAM PRAVEGA supercomputer, equipped with 584 nodes. Each node has two Intel Xeon Cascade-Lake 8268 2.9 GHz processors with 48 physical cores per node. The nodes are integrated with a BullSequana XH200 Mellanox HDR 100 Gbps Infiniband interconnection using a FAT-Tree topology for MPI. On PRAVEGA, we have compiled our codes using the libraries—GCC/11.2.0, INTEL-MKL/2020, and INTEL-MPI/2019.10.317. The multi-node GPU study was conducted on the SUMMIT (Oak Ridge Leadership Computing Facility) supercomputer equipped with 4608 IBM Power System AC922 nodes with two IBM POWER9 processors (42 physical cores) and six NVIDIA Volta V100 GPUs in each node. Summit nodes are connected to a dual-rail EDR InfiniBand network, providing a node injection bandwidth of 23 GB/s. On SUMMIT, we have compiled our codes using the libraries NVIDIA CUDA/11.0.3, GCC/9.1.0, IBM Spectrum-MPI/10.4.0.3, and OPENBLAS/0.3.15-omp.
Throughout our discussion of the results in this section, it is important to note that a relative trace tolerance of 5 × 10^−8 is utilized as the convergence criterion for all double-precision (FP64) RFOE implementations described here (see Fig. 1). As a result, the number of RFOE iterations (parallel matrix–matrix multiplications) is observed to be around 50 in the case of Z-RFOE and around 30 in the case of A-Z-RFOE. For the mixed-precision strategy implemented in the case of Z-RFOE and T-RFOE, the relative trace tolerance is chosen to be around 5 × 10^−4 for the FP32 RFOE iterations and, subsequently, the RFOE iterations in FP64 precision are continued until a relative trace tolerance of 5 × 10^−8 is reached. In the case of the finite-temperature density matrix approximation using T-RFOE, the smearing value of β corresponding to a temperature of 500 K is chosen. We observe the number of T-RFOE iterations to be around 30. Furthermore, we note that the tolerance criteria mentioned above resulted in errors ɛ1 and ɛ2 (refer to Sec. III C) of approximately O(10^−10) and O(10^−9), respectively, in the double-precision implementations of Z-RFOE and A-Z-RFOE. In the case of the mixed-precision implementation employing Z-RFOE, the errors ɛ1 and ɛ2 are of O(10^−7) and O(10^−9), respectively. In the case of T-RFOE, we remark that the error ɛ1 is of O(10^−3) and ɛ2 is of O(10^−6). We note that the higher errors associated with the finite-temperature density matrix approximated by T-RFOE relative to the exact Fermi–Dirac density matrix are expected, as the T-RFOE method13 can be considered an alternative approach to smearing the zero-temperature density matrix. This viewpoint still renders T-RFOE useful for approximating the finite-temperature density matrix in DFT, since material properties are typically computed from energy differences, in which such smearing-related errors largely cancel.
We now list the performance metrics used for our comparative study in this work:
CPU Node-hrs ⇒ Execution time (in hours) × the number of nodes (chosen in a good scaling regime). It gives a measure of the computational efficiency on CPUs.
GPU-hrs ⇒ Execution time (in hours) × the number of GPUs (chosen in a good scaling regime). It gives a measure of the computational efficiency on GPUs.
Minimum wall time ⇒ Least possible time obtained using as many resources (CPUs/GPUs) as possible. It is a measure of the scaling efficiency of the implementation.
Energy consumption ⇒ Upper bound of the energy consumed (in kWh) for executing a given algorithm on CPUs/GPUs. Indicative of the dollar cost of the calculations on the supercomputer. For this estimate, we use the Thermal Design Power (TDP) ratings of the CPUs and GPUs, as illustrated in the sketch following this list.
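The energy estimate defined above reduces to a simple product of TDP, device count, and execution time; the sketch below illustrates the arithmetic with assumed TDP values and run times (not numbers taken from this study).

```python
def energy_kwh(wall_time_hours, n_units, tdp_watts):
    """Upper-bound energy estimate: TDP x number of devices x execution time."""
    return tdp_watts * n_units * wall_time_hours / 1000.0

# Illustrative inputs (TDP values and run times are assumptions):
# a 2-socket Cascade-Lake node at ~2 x 205 W, a V100 SXM2 GPU at ~300 W.
cpu_run = energy_kwh(wall_time_hours=0.5, n_units=64, tdp_watts=2 * 205)   # 64 CPU nodes
gpu_run = energy_kwh(wall_time_hours=0.1, n_units=96, tdp_watts=300)       # 96 GPUs
print(f"{cpu_run:.1f} kWh (CPU), {gpu_run:.1f} kWh (GPU)")
```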
A. Multi-node CPU comparisons
Figures 2 and 3 provide a comparison between ELPA's subspace diagonalization procedure and the various RFOE approaches that utilize parallel dense matrix–matrix multiplications based on the SLATE library. The comparison is based on CPU node-hrs, scalability, and minimum wall times on multi-node CPU architectures. On CPUs, Fig. 2(a) demonstrates that the RFOE implementations for approximating the zero-temperature density matrix are more computationally efficient than subspace diagonalization in terms of CPU node-hrs. Among the RFOE implementations, A-Z-RFOE is the closest competitor to ELPA's diagonalization approach and is faster than ELPA by a factor of approximately 2× for all the matrix Hp dimensions (8192, 16 384, 32 768, and 65 536) considered in this work. This can be attributed to the reduced computational prefactor of the RFOE approach (only dense matrix–matrix multiplications) compared to exact diagonalization (reduction to a tridiagonal matrix followed by reduction to diagonal form). Further, Fig. 2(b) illustrates the multi-node CPU scaling behavior of the subspace diagonalization procedure using ELPA and of the A-Z-RFOE approach employing the SLATE library for the Hp matrix of dimension 65 536. These results indicate the superior scaling behavior of ELPA over SLATE. For instance, relative to 48 cores (1 node), the minimum number of nodes on which the problem can be accommodated, we find that the scaling efficiency of ELPA diagonalization is close to 75% at 3072 cores (64 nodes) and drops to around 60% at 12 288 cores (256 nodes). In contrast, the scaling efficiency of the A-Z-RFOE approach using SLATE drops to 37% at 3072 cores (64 nodes) and flattens around 6144 cores (128 nodes). The minimum wall time, i.e., the least possible execution time obtained with increasing CPU cores, is a practically valuable outcome of the scaling efficiency of the various implementations considered in this work on multi-node CPU architectures. To this end, Fig. 2(c) demonstrates that the minimum wall times obtained by ELPA's subspace diagonalization are lower by a factor of around 1.7×–2× compared to A-Z-RFOE, the closest competitor, which is consistent with the observation that ELPA's scaling behavior remains close to ideal up to a much larger number of CPU cores than SLATE's. For instance, in the case of the matrix Hp of dimension 65 536, the minimum wall time is observed at around 12 000 CPU cores when using the explicit diagonalization approach employing the ELPA library, while the best RFOE implementation (A-Z-RFOE) based on SLATE gave a 1.7× higher minimum wall time at around 6000 CPU cores. Finally, Fig. 3 illustrates the comparisons for building the finite-temperature density matrix on multi-node CPUs using T-RFOE and diagonalization approaches. We observe from Fig. 3(a) that the T-RFOE approaches are more computationally efficient than ELPA's diagonalization in terms of CPU node-hrs. In particular, the mixed-precision T-RFOE implementation is observed to be faster by a factor of around 3×–4× compared to ELPA's diagonalization for the matrix dimensions considered in this work. In addition, in the minimum wall time comparisons illustrated in Fig. 3(b), mixed-precision T-RFOE is found to have nearly the same wall times as diagonalization using ELPA for all the matrix sizes considered in this work.
(a) CPU Node-hrs vs matrix size plot, (b) Scaling study involving Hp of size 65 536, and (c) Min. wall time vs matrix size plot. Case study: ELPA diagonalization vs different implementations of RFOE used to approximate the zero-temperature density matrix on multi-node CPUs. The bar plots in (c) indicate the number of CPU cores at which the minimum wall time is obtained for each matrix size.
(a) CPU Node-hrs vs matrix size plot and (b) Min. wall time vs matrix size plot for different implementations of RFOE used to approximate the finite-temperature density matrix on multi-node CPUs. The bar plots in (b) indicate the number of CPU cores at which the minimum wall time is obtained for each matrix size.
B. Multi-node GPU comparisons
Figures 4 and 5 illustrate a comparative study of GPU-hrs, scalability, and minimum wall times on multi-node GPU architectures between ELPA's subspace diagonalization procedure and the different RFOE approaches implemented using SLATE. On V100 GPUs, Fig. 4(a) demonstrates that the RFOE implementations for approximating the zero-temperature density matrix are computationally more efficient than subspace diagonalization in terms of GPU-hrs. Among the RFOE implementations, A-Z-RFOE is the closest competitor to ELPA's diagonalization procedure and is faster than ELPA by factors of approximately 4×, 2.7×, 2×, and 1.2× for matrix Hp dimensions of 8192, 16 384, 32 768, and 65 536, respectively. These results indicate that the speed-up over ELPA decreases with an increase in the size of the matrix. We note that the total computation time observed for RFOE or ELPA diagonalization is the cumulative sum of data-movement costs and floating-point operation costs on GPUs. The reduction in the speed-ups of RFOE over ELPA with increasing matrix size can be attributed to the decreasing fraction of data-movement cost in the total computational cost of ELPA's two-stage diagonalization approach. Furthermore, Fig. 4(b) shows the scalability data for both the subspace diagonalization procedure using ELPA and A-Z-RFOE implemented using SLATE for the Hp matrix of dimension 65 536. These results indicate the superior scaling behavior of ELPA over SLATE. For instance, relative to two nodes (12 GPUs), the minimum number of nodes on which the problem can be accommodated, we find that the scaling efficiency of the RFOE approaches using SLATE quickly drops to 60% at 4 nodes (24 GPUs), 20% at 16 nodes (96 GPUs), 10% at 128 nodes (768 GPUs), and around 1% at 2048 nodes (12 288 GPUs). In contrast, the scaling efficiency of ELPA is around 93%, 50%, 20%, and 14% at 4 nodes, 16 nodes, 128 nodes, and 256 nodes, respectively. As discussed previously in the CPU comparisons, a practically valuable outcome of the scaling efficiency is the minimum wall time, i.e., the least possible execution time one can obtain with increasing GPUs. Figure 4(c) shows these minimum wall time comparisons between the various approaches considered in this work. The minimum wall times obtained using A-Z-RFOE are lower by a factor of 1.2×–1.4× compared to ELPA's subspace diagonalization for the Hp matrix dimensions of 8192 and 16 384. However, the superior scaling of ELPA over SLATE means that ELPA's diagonalization reaches its minimum wall time on far fewer GPUs (48 GPUs) than the RFOE approaches considered in this work (768 and 1536 GPUs for matrix dimensions of 8192 and 16 384, respectively). As the matrix dimension increases to 32 768, the minimum wall time is nearly the same for diagonalization and the A-Z-RFOE implementation using SLATE, while the number of GPUs required to achieve this wall time is larger by a factor of 16× for the A-Z-RFOE approach in comparison to diagonalization. For the 65 536 matrix size, a 2× lower minimum wall time is observed for ELPA's diagonalization, obtained with far fewer GPUs than A-Z-RFOE. Finally, Fig. 5 illustrates the comparative studies carried out in building the finite-temperature density matrix on GPUs using T-RFOE and diagonalization approaches. We observe that both the double- and mixed-precision variants of the T-RFOE implementation are more computationally efficient than the explicit diagonalization procedure using ELPA in terms of GPU-hrs.
We also observe that the mixed-precision variant of T-RFOE is the closest competitor to the diagonalization approach. In particular, we see computational gains of around 4.4×, 3.8×, 3×, and 2.3× in terms of GPU-hrs over ELPA's diagonalization for the Hp matrix dimensions of 8192, 16 384, 32 768, and 65 536, respectively. Figure 5(b) illustrates the minimum wall time comparisons; the minimum wall times obtained using mixed-precision T-RFOE are lower by a factor of 1.3×–2× compared to ELPA's subspace diagonalization for matrix dimensions of 8192, 16 384, and 32 768. However, the superior scaling of ELPA over SLATE means that ELPA reaches these minimum wall times on fewer GPUs. Furthermore, for the Hp matrix size of 65 536, the minimum wall time obtained via ELPA's diagonalization approach is lower than that of T-RFOE and is obtained on far fewer GPUs.
(a) GPU-hrs vs matrix size plot, (b) Scaling study, and (c) Min. wall time vs matrix size plot. Case study: ELPA Diagonalization vs different implementations of RFOE used to approximate the zero-temperature density matrix on multi-node GPUs.
(a) GPU-hrs vs matrix size plot, and (b) Min. wall time vs matrix size plot for different implementations of RFOE used to approximate the finite-temperature density matrix on multi-node GPUs.
The comparisons reported in this study were performed on V100 SXM2 GPUs. However, it is important to note that newer GPU architectures can potentially provide even greater computational improvements in terms of GPU-hrs. This is because the dominant computational costs associated with the RFOE approaches examined in this study arise primarily from dense matrix–matrix multiplications. For instance, we anticipate a speed-up of ∼1.5–2× in terms of GPU-hrs for RFOE approaches on A100 GPUs in comparison to V100 SXM2 GPUs, taking into account the data-movement costs associated with dense matrix–matrix multiplications. Furthermore, for H100 GPUs, we anticipate the speed-ups in GPU-hrs over V100 GPUs to be around ∼5–6×, since the H100 GPUs have close to 8× higher FP64 throughput. These advanced architectures can also benefit ELPA's diagonalization, but not to the extent that they benefit the dense matrix–matrix multiplications in RFOE, since explicit diagonalization methods usually do not involve high-arithmetic-intensity operations to the same degree as RFOE. This is because diagonalization relies on a two-phase approach, with the first phase involving the reduction of a dense matrix to a tridiagonal matrix using Householder reflectors, which are constructed from vectors and do not involve high-arithmetic-intensity operations like dense matrix–matrix multiplications. The subsequent phase reduces this tridiagonal matrix to diagonal form, which involves manipulating sparse matrices and hence incurs higher data-movement costs.
In the case of minimum wall time comparisons on the newer-architecture GPUs, the speed-ups gained by the RFOE approach over ELPA, or vice versa, may not change much compared to what is reported here, as the communication cost becomes the dominant cost at larger numbers of GPUs. As of today, this is better handled in ELPA's diagonalization implementation than in the parallel dense matrix–matrix multiplication (pgemm) routines implemented in state-of-the-art libraries.
C. Energy consumption comparisons
In order to estimate the upper bound for the energy consumption (in kWh) as a function of the matrix dimension, we utilize the TDP ratings of the CPUs and GPUs for executing both the RFOE implementations and the explicit diagonalization approach using the ELPA library. To plot the energy consumption against the matrix dimension, we consider both the CPU node-hrs/GPU-hrs and minimum wall time regimes, using the execution times of the best-performing RFOE implementation and of ELPA diagonalization involving Hp. Figures 6 and 7 summarize the relevant data. As the matrix size increases, we observe a higher energy consumption for explicit diagonalization using ELPA on CPUs, both in the regime of CPU node-hrs (∼2×) and in that of minimum wall time (∼1.2×), compared to the best-performing RFOE implementations. In the case of GPUs, we also observe a higher energy consumption for explicit diagonalization using ELPA in the GPU-hrs regime. However, in the regime of minimum wall time, the energy consumption is significantly higher for the best RFOE implementation using SLATE compared to ELPA, owing to the excellent scaling behavior of ELPA on multi-node GPUs.
Energy consumption (kWh) vs matrix size plot in terms of (a) Node-hrs/GPU-hrs and (b) Min. wall time for the best-performing implementation of RFOE used to approximate a zero-temperature density matrix.
Energy consumption (kWh) vs matrix size plot in terms of (a) Node-hrs/GPU-hrs and (b) Min. wall time for the best-performing implementation of RFOE in the case of approximating a finite-temperature density matrix.
V. CONCLUSIONS
Our current work investigates recursive Fermi-operator expansion (RFOE) strategies involving subspace projected Hamiltonians as an alternative to explicit subspace diagonalization in iterative orthogonal projection methods for solving large-scale eigenvalue problems. We explore the potential of these strategies on both multi-node CPU and GPU architectures by employing the ELPA and SLATE libraries. As demonstrated by the various studies conducted in this work, we find that the RFOE schemes, which involve only dense matrix–matrix multiplications, have a smaller computational prefactor associated with the cubic scaling complexity and, hence, are more computationally efficient (∼2×–4×) than explicit diagonalization of the subspace projected Hamiltonian in the CPU node-hrs/GPU-hrs regime. Further, we found that the RFOE strategies can be more memory efficient than explicit diagonalization procedures on GPUs. However, our results also demonstrate that the scaling efficiency of the explicit subspace diagonalization procedure carried out using ELPA is superior to that of the RFOE strategies, resulting in lower minimum wall times on both CPUs and GPUs. Finally, we conclude that RFOE strategies can be considered effective alternatives to subspace diagonalization, particularly in a good scaling regime where computational efficiency is measured in terms of the CPU node-hrs/GPU-hrs consumed. However, it is worth noting that these strategies may not perform as well in the extreme scaling regime, as demonstrated in the current work.
ACKNOWLEDGMENTS
The authors gratefully acknowledge the seed grant from Indian Institute of Science (IISc) and SERB Startup Research Grant from the Department of Science and Technology India (Grant No. SRG/2020/002194) for the purchase of a GPU cluster, which provided computational resources for this work. The research used the resources of PARAM Pravega at Indian Institute of Science, supported by National Supercomputing Mission (NSM) R&D for an exa-scale grant (DST/NSM/R&D_Exascale/2021/14.02). This research also used the resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Sameer Khadatkar: Data curation (lead); Formal analysis (lead); Methodology (equal); Writing – original draft (equal). Phani Motamarri: Conceptualization (lead); Methodology (equal); Project administration (lead); Supervision (lead); Writing – original draft (equal); Writing – review & editing (lead).
DATA AVAILABILITY
The data that support the findings of this study are available within the article.