The development of novel double-hybrid density functionals offers new levels of accuracy and is leading to fresh insights into the fundamental properties of matter. Hartree–Fock exact exchange and correlated wave function methods, such as second-order Møller–Plesset (MP2) and the direct random phase approximation (dRPA), are usually required to build such functionals. Their high computational cost is a concern, and their application to large and periodic systems is, therefore, limited. In this work, low-scaling methods for Hartree–Fock exchange (HFX), SOS-MP2, and direct RPA energy gradients are developed and implemented in the CP2K software package. The use of the resolution-of-the-identity approximation with a short range metric and atom-centered basis functions leads to sparsity, which is exploited through sparse tensor contractions. These operations are efficiently performed with the newly developed Distributed Block-sparse Tensors (DBT) and Distributed Block-sparse Matrices (DBM) libraries, which scale to hundreds of graphics processing unit (GPU) nodes. The resulting methods, resolution-of-the-identity (RI)-HFX, SOS-MP2, and dRPA, were benchmarked on large supercomputers. They exhibit favorable sub-cubic scaling with system size, good strong scaling performance, and GPU acceleration up to a factor of 3. These developments will allow double-hybrid level calculations of large and periodic condensed phase systems to be performed on a more regular basis.
I. INTRODUCTION
In the quest for more accurate density functionals, double-hybrid functionals1 have emerged as the best option to date.2–4 In this approach, fractions of exact exchange and wave function correlation energy are linearly combined with a local density functional. Most of the time, the wave function correlation method applied is the second-order Møller–Plesset (MP2)5 or direct RPA6 based on variational orbitals from a converged SCF calculation with the corresponding hybrid functional. Although dRPA and MP2 do not have the unfavorable scaling of highly accurate wave function methods, they still exhibit a steep N^4 or N^5 scaling with system size N. In addition, as with all post-Hartree–Fock methods, the convergence with respect to basis set size is slow.7,8 Double-hybrid functionals inherit both these aspects from wave function theory. For condensed matter calculations, the need for large basis sets and the steep dependence on system size are especially computationally demanding. Although double-hybrid functionals have been applied to solid state systems,9–13 more routine applications to the condensed phase require the development of new algorithms and efficient implementations on high performance computers. Recent developments include range separated double-hybrids to lessen basis set dependence;14–17 reduced scaling methods for MP2,18–21 SOS-MP2,22–25 and direct RPA26–33 in order to allow for larger systems; and massively parallel implementations on graphics processing unit (GPU) accelerated hardware.34–37
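As a point of reference, a generic double-hybrid exchange–correlation energy combines these ingredients as (notation and mixing parameters vary between functionals)

```latex
E_{xc}^{\mathrm{DH}} = (1 - a_x)\,E_x^{\mathrm{DFT}} + a_x\,E_x^{\mathrm{HF}}
                     + (1 - a_c)\,E_c^{\mathrm{DFT}} + a_c\,E_c^{\mathrm{WFT}},
```

where E_c^WFT is the MP2 or dRPA correlation energy evaluated with the converged SCF orbitals, and a_x and a_c are the exact exchange and wave function correlation fractions.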
We demonstrate a combination of the above achievements for the calculation of energy and gradients of large condensed matter systems. Low-scaling versions of SOS-MP2 and dRPA energy calculations are extended to include nuclear forces and stress tensors. The resulting methods can be used for stand-alone correlated wave function calculations or seamlessly combined with density functional theory (DFT) in the framework of double-hybrid functionals. The exact exchange part of the energy, force, and Z-vector equation38 is efficiently modeled with the auxiliary density matrix method39 (ADMM). The resolution-of-the-identity40–42 (RI) technique is systematically used for the post-HF methods and optionally for the Hartree–Fock exact exchange. All tensor contractions resulting from the use of the RI approximation are handled through a newly implemented, massively parallel, and GPU accelerated sparse tensor library. These developments allow for calculations of large systems at high accuracy, enabling statistical sampling via molecular dynamics as well as geometry and cell optimizations.
This work reports on the efficient implementation of ADMM-RI-HF, cubic scaling SOS-MP2, and cubic scaling direct RPA energy gradients in the CP2K software package.43 All three methods rely on sparse tensor contractions, using the Distributed Block-sparse Matrices (DBM) and Distributed Block-sparse Tensors (DBT) libraries described in the first part of this work. In the second part, we present the relevant gradient equations that were derived and implemented. Finally, performance benchmark results are presented with a strong focus on periodic condensed matter systems. Scaling with respect to system size and computing resources is investigated.
II. TENSOR LIBRARY
The central quantities in RI-HF and post-HF methods can be conveniently expressed as combinations of tensor contractions. Since the underlying Gaussian basis set is local, the involved tensors become sparse for larger system sizes. In addition, because the basis set is atom centered, this sparsity exhibits a blocked pattern. To take advantage of the blocked sparsity, the Distributed Block-sparse Tensors (DBT) library was developed.31,44
A. Tensor contractions via matrix multiplications
A central design decision of the DBT library was to implement the sparse tensor contractions via sparse matrix multiplication. This allowed us to build upon CP2K’s mature DBCSR library,45,46 which was originally developed for linear scaling DFT.47 Recently, the DBCSR library has been succeeded by the new Distributed Block-sparse Matrices (DBM) library, which offers better OpenMP and GPU performance in the context of tensor contractions.
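To make the mapping concrete, a contraction such as T(μν, Q) = Σ_P (μν|P) K(P, Q) can be cast as an ordinary matrix–matrix multiplication by fusing the (μ, ν) pair into a single row index. The dense numpy sketch below illustrates the principle (array names and sizes are made up; the DBT/DBM libraries apply the same idea block by block on sparse data):

```python
import numpy as np

n_ao, n_ri = 8, 6  # illustrative basis set sizes
eri_3c = np.random.rand(n_ao, n_ao, n_ri)   # rank-3 tensor, e.g., (mu nu | P)
kernel = np.random.rand(n_ri, n_ri)         # two-center matrix K(P, Q)

# Matricize: fuse (mu, nu) into one index -> a tall-and-skinny matrix.
eri_mat = eri_3c.reshape(n_ao * n_ao, n_ri)

# The contraction T(mu, nu, Q) = sum_P (mu nu | P) K(P, Q) becomes a plain GEMM.
t = (eri_mat @ kernel).reshape(n_ao, n_ao, n_ri)

# Same result as a direct index contraction.
assert np.allclose(t, np.einsum("mnp,pq->mnq", eri_3c, kernel))
```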
The DBM library distributes matrix blocks by arranging the processors in a 2D grid and then assigning each row/column of the blocked matrix to one row/column of that 2D processor grid. Typically, this is a many-to-one mapping, as the blocked matrix has (many) more rows/columns than the processor grid. For tensors of rank N, the scheme is generalized by arranging the processors in an N-dimensional processor grid.
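A minimal sketch of such a block-to-processor mapping is given below (a simple round-robin assignment is used purely for illustration; the actual DBM distribution may differ):

```python
import numpy as np

n_block_rows, n_block_cols = 10, 7   # blocked matrix dimensions (illustrative)
p_rows, p_cols = 3, 2                # 2D processor grid (illustrative)

# Many-to-one mapping: block row i -> grid row i % p_rows, and likewise for columns.
row_owner = np.arange(n_block_rows) % p_rows
col_owner = np.arange(n_block_cols) % p_cols

def owner_of_block(i: int, j: int) -> tuple[int, int]:
    """Processor-grid coordinates owning block (i, j)."""
    return int(row_owner[i]), int(col_owner[j])

print(owner_of_block(4, 5))  # -> (1, 1)

# For a rank-N tensor, the same idea generalizes to an N-dimensional processor
# grid: each tensor dimension gets its own index-to-grid-coordinate mapping.
```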
B. Multiplying tall-and-skinny matrices
The DBM library employs the SRUMMA algorithm48 for matrix multiplications, which does not perform well for tall-and-skinny matrices.49 The DBT library, therefore, borrows a page from the CARMA algorithm50 and splits the two largest matrices along their shared dimension into s_split submatrices. The processor grid is also split into s_split corresponding subgroups, and the submatrices are then multiplied independently within their processor subgroups. The trade-off for splitting the larger matrices is the additional memory and communication required for replicating the smallest matrix across all subgroups. Furthermore, if the two larger matrices do not have compatible distributions, then the mid-sized matrix has to be redistributed, which can dominate the overall communication cost of the tensor contraction. The choice of the smallest matrix and the optimal splitting factor s_split depends on the sparsity of all three matrices. To estimate the sparsity of the result matrix, a multiplication of the block norms is quickly performed upfront.
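The splitting idea can be illustrated with a small serial sketch: for the case where the two large matrices share the long (contracted) dimension and the small result matrix is the one replicated, each of the s_split submatrix pairs is multiplied independently and the partial results are summed. Sizes and the splitting factor below are arbitrary:

```python
import numpy as np

M, N, K = 64, 64, 12_000   # small output dimensions, long shared dimension
s_split = 4                # splitting factor (illustrative)

a = np.random.rand(M, K)   # tall-and-skinny factor
b = np.random.rand(K, N)   # tall-and-skinny factor

# Split the two large matrices along their shared dimension K.
a_parts = np.array_split(a, s_split, axis=1)
b_parts = np.array_split(b, s_split, axis=0)

# Each processor subgroup would compute one partial product independently;
# the small results are then summed (a reduction over the subgroups).
c = sum(ap @ bp for ap, bp in zip(a_parts, b_parts))

assert np.allclose(c, a @ b)
```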
III. METHOD DEVELOPMENT
In this work, the implementation in CP2K of atomic forces and stress tensors for Hartree–Fock, as well as the correlated wave function theories SOS-MP251,52 and direct RPA,6 is reported. The focus is on the development of methods with favorable scaling with respect to system and basis set size such that large systems can be simulated within periodic boundary conditions (PBCs). The resulting code should scale efficiently over thousands of central processing unit (CPU) cores and be GPU accelerated. There exist multiple ways to approach each of the three methods of interest, with various scalings and prefactors for the calculation of energies (HF,53–58 SOS-MP2,22–25 and RPA26–33). To fully exploit the highly efficient sparse tensor DBT library, we make the choice of implementing all three methods within the resolution-of-the-identity (RI) approximation40–42 while working in a local atomic orbital (AO) basis made of Gaussian type orbitals (GTOs).
There are both memory and computational efficiency advantages to working within the RI approximation and the AO basis. The number of two- and three-center electron repulsion integrals (ERIs) to evaluate and store is small compared to the number of four-center ERIs in a conventional calculation. Moreover, using a short range RI metric, such as the overlap or a truncated Coulomb operator,59,60 leads to a lower occupancy of the integral tensors and to more efficient sparse tensor contractions. Finally, keeping three-center quantities in the local AO basis rather than the molecular orbital (MO) basis helps retain this sparsity, as MOs are typically much less localized. All three methods discussed in this work, RI-HFX, low-scaling SOS-MP2, and dRPA, benefit from these choices.
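For reference, one commonly used form of the RI approximation with a metric operator m (the overlap or a truncated Coulomb operator) expands the four-center ERIs over the auxiliary RI basis {P} as

```latex
(\mu\nu|\lambda\sigma) \;\approx\; \sum_{PQRS} (\mu\nu|m|P)\,[\mathbf{m}^{-1}]_{PQ}\,(Q|R)\,[\mathbf{m}^{-1}]_{RS}\,(S|m|\lambda\sigma),
\qquad \mathbf{m}_{PQ} = (P|m|Q),
```

where (Q|R) involves the Coulomb operator and (·|m|·) denotes two- and three-center integrals over the chosen metric. The shorter the range of m, the sparser the three-center quantities, which is the sparsity exploited by the tensor contractions discussed above.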
A. RI-HFX
The RI-HFX forces were implemented in the CP2K software package according to the above equations on top of the existing RI-HFX energy code.44 The two-, three-, and four-center integrals and derivatives are calculated analytically with the LIBINT63 library. The RI basis sets can be provided by the user as part of the calculation input, but the default behavior is to generate them automatically according to the scheme of Stoychev and co-workers.64 The storage and contraction of all three-center and most two-center quantities are done via the block sparse tensor library DBT.
B. Low-scaling post-HF
1. SOS-MP2
The low-scaling SOS-MP2 gradients were implemented in CP2K according to the above equations. The Laplace quadrature is obtained via a minimax scheme.74,75 The quadrature is kept fixed by default, while the option to adapt it depending on the gap is available. The two- and three-center integrals and derivatives involving the RI metric, where a short-range potential is used for sparsity, are calculated analytically using the LIBINT library. The two-center integrals and derivatives requiring the Coulomb operator, (P|Q) and (P|Q)(x), are calculated numerically with the Gaussian and plane waves (GPW) method,76,77 which intrinsically takes care of PBCs. Most two- and three-center quantities are stored and processed as DBT block sparse tensors. The iterative construction of the Y_1 and Y_2 matrices, as well as of the RHS of the Z-vector equation O(τ_j), is done with the DBCSR library.45 As for RI-HFX, the RI basis is automatically generated by default. The only non-trivial contribution to the stress tensor stems from the GPW numerical integrals since the integration grids change with cell size. The procedure is described in Ref. 78. More details about the CP2K implementation can be found in Appendix B, and more details about the derivation of the equations can be found in the supplementary material.
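For context, the minimax quadrature mentioned above approximates the Laplace transform that removes the orbital energy denominator from the opposite-spin MP2 expression,

```latex
\frac{1}{\varepsilon_a + \varepsilon_b - \varepsilon_i - \varepsilon_j}
  = \int_0^{\infty} e^{-(\varepsilon_a + \varepsilon_b - \varepsilon_i - \varepsilon_j)\,\tau}\,\mathrm{d}\tau
  \;\approx\; \sum_{k=1}^{n} w_k\, e^{-(\varepsilon_a + \varepsilon_b - \varepsilon_i - \varepsilon_j)\,\tau_k},
```

where i, j label occupied and a, b virtual orbitals, so that the exponent is negative and the integral converges. The minimax points τ_k and weights w_k minimize the maximum quadrature error over the range of denominators, which is set by the gap and the spectral width of the system.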
2. Direct RPA
The requirements to calculate low-scaling RPA forces are very similar to SOS-MP2. Therefore, the discussion about the various libraries and methods used in the CP2K implementation at the end of Sec. III B 1 applies here as well. Note that the RPA forces are more complex due to the presence of the double time and frequency integration grids. The final expression is also lengthier because of the Fermi energy in the definition of the pseudo-densities. The details about the implementation can be found in Appendix B, while a detailed derivation of the gradient formula can be found in the supplementary material.
C. Double-hybrid functionals
In most instances, the RPA correlation energy is calculated on top of a DFT or hybrid DFT level SCF rather than HF.81 At the end of the calculation, the DFT exchange–correlation energy is replaced by HF exchange, evaluated using the SCF orbitals. In terms of forces, this requires the addition of the HF Hamiltonian to the RHS of the Z-vector equation.73 Such a double-hybrid functional is referred to as RPA@DFT (e.g., RPA@PBE or RPA@PBE0).
The cost and scaling of a double-hybrid calculation are always dominated by the correlated wave function method of choice. Similarly, the basis set convergence of double-hybrids is limited by that of the wave function part, which typically converges more slowly than DFT.7,8 The low-scaling methods presented in this work should help reach double-hybrid DFT accuracy in large systems.
IV. RESULTS AND BENCHMARKS
The main focus of this work is on the performance and scalability of the new sparse tensor based implementations. For all three methods of interest, benchmark calculations were run on the GPU partitions of two large supercomputers: Piz Daint at CSCS in Switzerland and LUMI-G at CSC in Finland. A Piz Daint node is made of an Intel Xeon E5-2690 v3 processor with 12 cores and 64 GB of memory, together with a 16 GB NVIDIA Tesla P100 GPU. The latest update to the machine hardware was made in November 2016. LUMI-G, on the other hand, is a state-of-the-art installation still in its pilot phase as of December 2022. A LUMI-G node is made of an AMD EPYC 7A53 processor with 64 cores and 512 GB of memory, together with four AMD Instinct MI250X GPUs with 128 GB of memory each. The following benchmarks showcase the code performance on these two machines, with different manufacturers and architectures. Note that LUMI-G was still in pre-production mode at the time of the benchmarks and that improvements can be expected. All calculations were run with 4 MPI ranks and 3 OMP threads per node on Piz Daint, and with 16 MPI ranks and 3 OMP threads per node on LUMI-G. These configurations have been shown to be optimal for sparse tensor contractions. Note that all GPUs of a LUMI-G node are used, but not all CPU cores. This is due to the machine temporarily running in low-noise mode, where one processor is reserved for the operating system. All LUMI-G calculations reported in this work were run in the same configuration. This is not expected to change the scaling behaviors observed in this work.
A. RI-HFX
The benchmarks for RI-HFX are organized as follows: we first discuss the choice of RI metric and its impact on performance and accuracy, with respect to standard four-center HFX. Then, we measure the scaling of RI-HFX and 4c-HFX with respect to system size, and finally, we study strong scaling performance. We conclude with a discussion about the advantages of both methods and when to apply them.
Two Γ-point periodic systems were selected for these benchmarks: liquid water and bulk silicon. Because sparsity is expected to have a strong impact on performance, both for integral screening with 4c-HFX and for sparse contraction with RI-HFX, this choice is meant to be representative. Indeed, liquid water is a much sparser system than crystalline silicon. For comparison purposes, all calculations were executed with and without the ADMM approximation.
1. RI metric
We are interested in the performance and accuracy of RI-HFX for energy and forces, as a function of the RI metric range. Four metrics are considered: the overlap and the truncated Coulomb potential with cutoff radii of 1.0, 2.0, and 3.0 Å. Because relative energies are more important than their absolute counterparts, the accuracy of RI-HFX energies was tested by taking the energy difference between two different configurations of the same system. For water, two uncorrelated MD frames were considered. For bulk silicon, the optimized geometry and a random perturbation of it (with atoms moved by up to 0.15 Å) were taken. To test the accuracy of RI-HFX forces, the component-wise root mean square error of the atomic force was calculated with respect to the 4c-HFX reference, only on systems with non-vanishing forces. All calculations reported in this section were done on the Piz Daint supercomputer, at the HF level of theory. The double-zeta ccGRB-D and admm-dzp basis sets, distributed as part of the CP2K software package, were used.
The energy results are displayed in Fig. 1. For liquid water, accuracy seems to converge non-monotonically with the RI metric range, regardless of the use of ADMM. Note that this behavior is also observed in larger water systems and with various other basis sets (see the supplementary material). For bulk silicon, accuracy is well converged already with the overlap RI metric. Moreover, using longer range operators seems to only increase noise levels. This different behavior with respect to the RI metric range is most likely due to system density. Because water is much sparser, it probably requires longer ranged RI metrics to capture sufficient interactions with the rest of the system. In terms of performance, the cost of RI-HFX increases linearly with the RI metric range in all cases. This correlates exactly with the occupancy of the three-center ERI tensor, as summarized in Table I. Compared to the timings of standard 4c-HFX, RI-HFX is not systematically faster. In the particular case of water within the ADMM approximation, RI-HFX is at best 1.5 times more expensive and at least twice as expensive for sufficient accuracy. Without ADMM, the two methods are roughly equivalent for water. In the denser bulk silicon system, however, RI-HFX performs the best overall. Using the overlap RI metric, converging the SCF comes at 65% and 38% of the cost of 4c-HFX with and without ADMM, respectively. We observe that 4c-HFX performs comparatively better within the ADMM approximation. This is to be expected, since the ADMM basis sets are optimized for performance when calculating four-center ERIs of the type (λσ|μν). The spread of the auxiliary basis functions is limited in order to maximize the screening of the non-overlapping λ, σ and μ, ν pairs. Moreover, they are typically kept as uncontracted bases, which is the optimal format for the LIBINT library within CP2K. Note that RI-HFX also benefits from ADMM, both for the evaluation and the overall sparsity of the three-center tensor. The benefits are, however, not as dramatic since the tensor contractions are the main computational bottleneck.
FIG. 1. Error on the energy difference between two perturbed structures of the same periodic system, as a function of the RI metric range. The reference energy is that of a standard four-center HF calculation using the truncated Coulomb potential, with a range of 6 Å. Ovlp stands for the overlap metric, and TC R_C stands for the truncated Coulomb metric with a truncation radius of R_C Å. The relative timing of the reference 4c-HFX calculation is always set to 1. The water system is made of 64 molecules with the ccGRB-D basis set and the admm-dzp auxiliary basis set when ADMM is used. The silicon system is made of 128 atoms with the ccGRB-D/admm-dzp basis sets. RI basis sets were automatically generated.
TABLE I. Occupancy of the three-center ERI tensor as a function of the RI metric range, for RI-HFX calculations of 64 water molecules and 128 atoms of bulk silicon. Basis sets are ccGRB-D, admm-dzp, and automatically generated RI bases.

| RI metric | Liquid water, ADMM (%) | Liquid water, no ADMM (%) | Bulk silicon, ADMM (%) | Bulk silicon, no ADMM (%) |
|---|---|---|---|---|
| Overlap | 6.5 | 20.8 | 21.2 | 69.4 |
| TC 1.0 Å | 12.2 | 33.3 | 33.4 | 81.2 |
| TC 2.0 Å | 20.7 | 47.2 | 47.3 | 88.0 |
| TC 3.0 Å | 31.1 | 60.2 | 59.8 | 91.7 |
Figure 2 reports the accuracy and performance of RI-HFX forces, as a function of the RI metric range. The trends observed for the energies are the same here: accuracy converges with the range for water, whereas it is already converged with the overlap metric for bulk Si. The same noisy behavior can also be observed for the truncated Coulomb metric with longer ranges in silicon. Note that the component-wise RMSE on the forces for the denser system is particularly low. RI-HFX force timings also increase linearly with the RI metric range and, therefore, with tensor occupancy. Note that the sparsity levels of the three-center tensors containing the derivatives are very close to that of the integrals displayed in Table I. In most cases, the cost of RI-HFX compared to that of 4c-HFX is competitive, in particular for silicon with the overlap metric, where the force evaluation takes 45% and 3% of the time required by the standard method, with and without ADMM, respectively. On the other hand, for the sparser water system with an ADMM basis optimized for four-center integrals and derivatives, RI-HFX is far less efficient.
FIG. 2. Component-wise RMSE on atomic forces for periodic systems, as a function of the RI metric range. The reference force is that of a standard four-center HF calculation using the truncated Coulomb potential, with a range of 6 Å. Ovlp stands for the overlap metric, and TC R_C for the truncated Coulomb metric with a truncation radius of R_C Å. The relative timing of the reference 4c-HFX calculation is always set to 1. The water system is made of 64 molecules with the ccGRB-D basis set and the admm-dzp auxiliary basis set when ADMM is used. The silicon system is made of 128 atoms with the ccGRB-D/admm-dzp basis sets. RI basis sets were automatically generated.
The performance of standard 4c-HFX and RI-HFX depends strongly on the choice of some threshold values passed in the CP2K input file. For the former, EPS_SCHWARZ and EPS_SCHWARZ_FORCES control the Cauchy–Schwarz pre-screening of the four-center ERIs (λσ|μν) and their derivatives (λσ|μν)(x). For RI-HFX, the EPS_FILTER parameter controls the sparsity of all two- and three-center tensors, discarding elements smaller than the given threshold. The 4c-HFX calculations reported in this section used the default value of 10^-10 for the integrals and 10^-8 for the derivatives. The RI-HFX calculations systematically used the default filtering threshold of 10^-9. These choices should ensure high enough accuracy for meaningful comparisons between the two methods. Changing these values would impact the exact timings reported in this work but not the trends observed. The same holds if different basis sets were used or if the calculations were run on a different computer.
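For reference, the screening controlled by EPS_SCHWARZ (and by EPS_SCHWARZ_FORCES for the derivatives) relies on the standard Cauchy–Schwarz bound,

```latex
|(\lambda\sigma|\mu\nu)| \;\le\; (\lambda\sigma|\lambda\sigma)^{1/2}\,(\mu\nu|\mu\nu)^{1/2},
```

and integrals whose estimate falls below the threshold are skipped.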
2. Method scaling
The scaling of the RI-HFX and four-center HFX energy and forces with respect to system size was measured. Calculations with simulation cells of increasing size were run on both Piz Daint and LUMI-G, for liquid water and bulk silicon. The computational resources required are reported in Fig. 3. All calculations were executed with ADMM and the ccGRB-D/admm-dzp basis sets. The RI metrics of choice were the truncated Coulomb operator with a cutoff radius of 1.5 Å for water and the overlap for silicon. Note that all simulation cells are cubic for water but not for silicon. In the latter case, the 64 atom system is cubic, and larger cells are built by duplication along one or more directions (2 × 1 × 1, 2 × 2 × 1, 2 × 2 × 2). The filtering threshold for the RI-HFX calculation was kept at its default value of 10^-9, and so was the Schwarz screening threshold for the ERIs (10^-10). The Schwarz screening threshold for the derivatives was set to the default value of 10^-6 for water and to 10^-7 for silicon.
FIG. 3. Scaling of the RI-HFX and four-center HFX energy and force calculations with respect to system size. The systems considered are liquid water and bulk silicon, with both ADMM-PBE0 (using a fixed range truncated Coulomb potential) and the ccGRB-D/admm-dzp basis sets. All systems were calculated on both the GPU partition of Piz Daint and LUMI-G. Sub-figures (a) and (c) correspond to liquid water on Piz Daint and LUMI-G, respectively, while sub-figures (b) and (d) correspond to bulk silicon.
When performing calculations in periodic boundary conditions (PBCs), the slowly decaying 1/r_12 Coulomb potential leads to spurious exchange interactions between electrons and their periodic images.60 This can be avoided by using short range exchange potentials, such as the truncated Coulomb operator or the error function.82 Therefore, in the limit of large periodic systems with a fixed range potential, standard four-center HFX is expected to scale linearly with system size. RI-HFX, on the other hand, still involves the contraction of three-center tensors, thus keeping its cubic scaling behavior. The log–log fits in Fig. 3 confirm these expectations, with a near-linear effective scaling for four-center HFX and a sub-cubic effective scaling for RI-HFX. The latter performs better than its nominal cubic scaling because of the high levels of sparsity achieved in large systems with a short range RI metric. For dense systems such as bulk silicon, the prefactor of the linear scaling four-center HFX is rather high. Therefore, calculating smaller systems with the steeper scaling RI-HFX method is still advantageous. Bear in mind that the exact crossover point between the two HFX methods could vary widely depending on the screening thresholds used. The small variations in the fitted scaling across the different computers come from calculations being run at slightly different levels of parallel efficiency.
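The effective scaling exponents quoted in this section are obtained from log–log fits of the total wall time against system size. A minimal sketch of such a fit is given below (the timings are made up for illustration):

```python
import numpy as np

# Hypothetical wall times (s) for systems of increasing size (illustrative only).
n_atoms = np.array([64, 128, 256, 512])
wall_time = np.array([120.0, 610.0, 3100.0, 15800.0])

# Fit log(t) = k*log(N) + c; the slope k is the effective scaling exponent.
k, c = np.polyfit(np.log(n_atoms), np.log(wall_time), 1)
print(f"effective scaling: O(N^{k:.2f})")
```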
3. Strong scaling
Figures 4–7 summarize the strong scaling experiments conducted with HFX for this work. Several steps of molecular dynamics at the PBE083 level were taken for a liquid water system of 64 molecules, with and without the ADMM approximation. The same calculation was run with an increasing amount of computational resources, and the average wall time per MD step was reported. For bulk silicon, a PBE0 geometry optimization was run instead for a system of 128 atoms. The ccGRB-D/admm-dzp basis sets and an automatically generated RI basis were used throughout. In all cases, the time related to the first step, where the SCF takes longer to converge, was left out of the average. The screening and filtering thresholds used here are the same as those in Sec. IV A 2.
FIG. 4. Strong scaling of the four-center and RI-HFX methods for molecular dynamics on Piz Daint and LUMI-G. The benchmark system is 64 molecules of liquid water, with the ADMM-PBE0 hybrid functional and the ccGRB-D/admm-dzp basis sets. The average wall time per MD step is reported (the initial step is omitted from the mean).
FIG. 5. Strong scaling of the four-center and RI-HFX methods for molecular dynamics on Piz Daint and LUMI-G. The benchmark system is 64 molecules of liquid water, with the PBE0 hybrid functional and the ccGRB-D basis sets. The average wall time per MD step is reported (the initial step is omitted from the mean).
FIG. 6. Strong scaling of the four-center and RI-HFX methods for geometry optimization on Piz Daint and LUMI-G. The benchmark system is 128 atoms of bulk silicon, with the ADMM-PBE0 hybrid functional and the ccGRB-D/admm-dzp basis sets. The average wall time per GEO_OPT step is reported (the initial step is omitted from the mean).
FIG. 7. Strong scaling of the four-center and RI-HFX methods for geometry optimization on Piz Daint and LUMI-G. The benchmark system is 128 atoms of bulk silicon, with the PBE0 hybrid functional and the ccGRB-D basis sets. The average wall time per GEO_OPT step is reported (the initial step is omitted from the mean).
Both RI-HFX and four-center HFX implementations exhibit good strong scaling behavior, regardless of the machine on which the calculation was run, and the strong scaling efficiency is high across the board. The GPU acceleration of the RI-HFX method is good overall. On Piz Daint, the average speedup due to GPUs is 1.64 and 1.95 for ADMM and non-ADMM runs, respectively. On LUMI-G, the average GPU acceleration is a factor of 2.37 for ADMM and of 2.95 for non-ADMM calculations. Tables displaying parallel efficiencies and speedups, as well as an in-depth discussion comparing Piz Daint and LUMI-G performances, are available in the supplementary material.
4. Discussion
RI-HFX offers a good alternative to standard four-center HFX in periodic systems, but only under certain circumstances. For sparse systems with basis sets optimized for the screening of (λσ|μν) ERIs, such as ADMM liquid water, four-center HFX is likely to outperform RI-HFX. However, for denser systems of small to medium size, the newer implementation is expected to be superior. This is especially true when the ADMM approximation is not used, as illustrated by the silicon tests. It is important to note that when both RI and ADMM are used at the same time, the accuracy will, at best, be that of ADMM. Pure RI, while slower, can still be used to accelerate HFX calculations at a higher accuracy.62 RI-HFX can also potentially outperform 4c-HFX when memory is scarce and the four-center ERIs (λσ|μν) cannot all be stored. In such a case, a fraction of the integrals must be recalculated at each SCF step, which can significantly slow down the procedure. In contrast, the RI integrals require far less memory for storage. When non-periodic calculations are performed with the 1/r_12 Coulomb operator, the quartic scaling of four-center HFX with the basis set size is a major concern. RI-HFX offers an efficient alternative, should large basis sets be required (see the supplementary material for an example). Finally, RI-HFX is GPU accelerated, while 4c-HFX is not. The former, therefore, has more room for improvement, as GPU hardware keeps improving and the low-level DBT and DBM code can be optimized accordingly.
B. Low-scaling post-HF
The low-scaling post-HF benchmarks cover scaling tests for the cost of the methods with respect to system size, as well as strong scaling measurements. The selected systems are liquid water and bulk rutile TiO2. This choice is meant to be representative because of the varying densities and basis set sizes, especially with respect to the l quantum numbers of the Ti atoms. The two variants of correlated wave function methods studied in this section are standard SOS-MP2 with c_OS = 1.3 on top of an HF ground state, and RPA@PBE0. The RI metrics used with each system are the same throughout this section: the overlap for titania and the truncated Coulomb operator with a 1.5 Å cutoff for water.
1. Method scaling
Figure 8 illustrates the scaling of the sparse tensor based SOS-MP2 and RPA methods as a function of system size for energy and force calculations. All SCF and Z-vector equation solutions were converged within the ADMM approximation, with 4c-HFX for water and RI-HFX for TiO2 (overlap metric). The admm-dzp auxiliary basis set was used, as well as cc-DZ and RI_DZ77 for water and ccGRB-D for titania. The post-HF RI basis set for the latter system is an uncontracted version of the automatically generated one, for optimal LIBINT performance when calculating three-center ERIs. Note that double-zeta basis sets are generally considered to be of too low quality for production level post-HF calculations.84 However, since the scaling of the method is independent of basis set size, this allows for the benchmarking of large systems at reasonable cost. All calculations were run with a six point minimax grid. All liquid water cells are cubic. The smallest titania cell is orthorhombic with dimensions of 9.33 × 9.33 × 6.07 Å^3, and all larger cells are built by duplicating it along one or more dimensions (1 × 1 × 2, 1 × 2 × 2, 2 × 2 × 2). Note that the figure lacks data points for the largest TiO2 system on LUMI-G. This is due to stability issues preventing the completion of long calculations on a large number of nodes.
FIG. 8. Scaling of the sparse tensor based SOS-MP2 and RPA@PBE0 energy and force calculations with respect to system size. The systems considered are liquid water and bulk rutile TiO2. Water is simulated with the cc-DZ and the optimized RI_DZ basis sets, while TiO2 uses ccGRB-D and an automatically generated and decontracted RI basis. Note that the SCF is converged within the ADMM approximation using the admm-dzp basis in all cases, with four-center HFX for water and RI-HFX for TiO2. All systems were calculated on both the GPU partitions of Piz Daint and LUMI-G. Sub-figures (a) and (c) correspond to liquid water on Daint and LUMI-G, respectively, while sub-figures (b) and (d) correspond to rutile titania.
The nominal scaling of these AO and sparse tensor based implementations is cubic. However, the log–log fits of Fig. 8 yield effective scalings below cubic for both liquid water and bulk TiO2 (the fitted exponents are collected in Table III). As previously mentioned in the context of RI-HFX, this is due to the increasing levels of sparsity with system size. Table II summarizes the occupancy of the three-center ERIs for all systems considered. Note that despite the use of a shorter ranged RI metric for TiO2, occupancy is generally higher. These numbers account for most of the difference in cost between the water and titania calculations, i.e., the offset between the scaling curves. The calculation of the three-center integrals and derivatives is also much more expensive for the latter system, due to the large basis sets and high angular momenta associated with titanium atoms. However, their relative cost is directly proportional to occupancy, ranging from 37% of the total cost for the 1 × 1 × 1 system to 2% for 2 × 2 × 2 (including ADMM RI-HFX integrals). Note that the RPA and SOS-MP2 curves are virtually identical. This is to be expected since most of the tensor contraction code is shared between the two methods, and both rely on a hybrid-DFT level SCF and Z-vector equation.
TABLE II. Occupancy and number of non-zero elements of the post-HF three-center ERI tensor as a function of system size. The values are identical for SOS-MP2 and RPA. The sparsity levels of the three-center derivatives are similar.

| Liquid water: size | Occupancy (%) | Non-zero | Bulk TiO2: size | Occupancy (%) | Non-zero |
|---|---|---|---|---|---|
| 32 | 45.7 | 5.9 × 10^8 | 48 | 60.1 | 1.5 × 10^9 |
| 64 | 14.7 | 1.5 × 10^9 | 96 | 39.0 | 7.5 × 10^9 |
| 128 | 3.7 | 3.1 × 10^9 | 192 | 16.1 | 2.5 × 10^10 |
| 256 | 0.9 | 6.1 × 10^9 | 384 | 6.2 | 7.6 × 10^10 |
TABLE III. Detailed fitted scaling of the energy and force timings for the post-HF calculations. The total times are considered, including those from the SCF.

| Method, machine | Liquid water, energy | Liquid water, forces | Bulk TiO2, energy | Bulk TiO2, forces |
|---|---|---|---|---|
| SOS-MP2, Daint | 1.9 | 2.6 | 2.6 | 2.7 |
| SOS-MP2, LUMI | 1.9 | 2.4 | 2.4 | 2.8 |
| RPA@PBE0, Daint | 1.8 | 2.6 | 2.5 | 2.6 |
| RPA@PBE0, LUMI | 1.9 | 2.5 | 2.5 | 2.7 |
Regardless of system sparsity, operations such as matrix inversion or matrix square roots are necessary for both energy and forces. In the current implementation, these are performed on dense matrices via the ScaLAPACK library.85 Therefore, in the limit of very large systems, all post-HF calculations are expected to scale cubically. Note that the scaling numbers reported in Table III are specific to the systems studied in this work and not a general result of the implementation. For the same range of systems, the cost of calculating the gradients ranges from two to eight times the cost of an energy calculation, as illustrated in Table IV for SOS-MP2. Similar numbers are observed for RPA.
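As a serial illustration of the kind of dense operation involved (and of why it is intrinsically cubic), a symmetric matrix square root can be obtained from an eigendecomposition; the numpy sketch below stands in for what is done in parallel with ScaLAPACK, with an arbitrary matrix size:

```python
import numpy as np

def sym_matrix_sqrt(m: np.ndarray) -> np.ndarray:
    """Square root of a symmetric positive-definite matrix via eigendecomposition.

    The eigendecomposition is a dense O(N^3) operation, which is why such steps
    eventually dominate for very large systems.
    """
    eigval, eigvec = np.linalg.eigh(m)
    return (eigvec * np.sqrt(eigval)) @ eigvec.T

# Example on a random symmetric positive-definite matrix.
a = np.random.rand(100, 100)
spd = a @ a.T + 100.0 * np.eye(100)
root = sym_matrix_sqrt(spd)
assert np.allclose(root @ root, spd)
```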
TABLE IV. Relative timing of the SOS-MP2 force calculation with respect to energy. The total times are considered, including those from the SCF.

| Liquid water: size | Daint | LUMI | Bulk TiO2: size | Daint | LUMI |
|---|---|---|---|---|---|
| 32 | 2.0 | 1.9 | 48 | 3.4 | 3.7 |
| 64 | 2.5 | 2.6 | 96 | 3.2 | 3.2 |
| 128 | 4.0 | 4.7 | 192 | 3.1 | 3.4 |
| 256 | 8.2 | 5.7 | 384 | 4.2 | ⋯ |
2. Strong scaling
The strong scaling behavior of the SOS-MP2 and RPA@PBE0 forces was measured on Piz Daint and LUMI-G. The results are displayed in Figs. 9–12. All calculations use the triple-zeta cc-TZ and RI_TZ basis sets to emulate production level conditions. The HF exchange contribution to the SCF and the Z-vector equation is calculated within the ADMM approximation with the admm-tzp auxiliary basis. The standard four-center HFX method was selected for liquid water, and the RI-HFX implementation with an overlap metric was selected for TiO2.
FIG. 9. Strong scaling of the sparse tensor based SOS-MP2 implementation for energy and forces on Piz Daint and LUMI-G. The benchmark system is 64 molecules of liquid water, with the cc-TZ and optimized RI_TZ basis sets. The SCF is converged at the pure ADMM-HF level with the admm-tzp basis set.
FIG. 10. Strong scaling of the sparse tensor based SOS-MP2 implementation for energy and forces on Piz Daint and LUMI-G. The benchmark system is 72 atoms of bulk rutile TiO2, with the cc-TZ and optimized RI_TZ basis sets. The SCF is converged at the pure RI-ADMM-HF level with the admm-tzp basis set.
FIG. 11. Strong scaling of the sparse tensor based RPA@PBE0 implementation for energy and forces on Piz Daint and LUMI-G. The benchmark system is 64 molecules of liquid water, with the cc-TZ and optimized RI_TZ basis sets. The SCF is converged at the ADMM-PBE0 level with the admm-tzp basis set.
FIG. 12. Strong scaling of the sparse tensor based RPA@PBE0 implementation for energy and forces on Piz Daint and LUMI-G. The benchmark system is 72 atoms of bulk rutile TiO2, with the cc-TZ and optimized RI_TZ basis sets. The SCF is converged at the RI-ADMM-PBE0 level with the admm-tzp basis set.
Strong scaling performance is good overall, at least up to 256 nodes on Daint and 32 nodes on LUMI-G. High speedups and favorable parallel efficiencies are observed with an increasing amount of computational resources. GPU acceleration is good on both machines: the average speedup is 1.50 on Piz Daint and 2.79 on LUMI-G. Note that these numbers are similar to those observed for GPU accelerated RI-HFX. Tables displaying parallel efficiencies and speedups, as well as an in-depth discussion comparing Piz Daint and LUMI-G performances, are available in the supplementary material.
3. Discussion
The implementations of low-scaling post-HF forces in CP2K show good performance. Although they do not scale as favorably as the energies with system size, their effective scaling remains sub-cubic because of sparsity. Good strong scaling behavior is also observed across the board, and there is significant GPU acceleration, especially on the LUMI-G machine.
Despite their advantageous scaling, these AO based methods come with a high prefactor. This was observed for RPA energies,31 and the same is expected for forces. Therefore, there is a place for the more standard quartic scaling MO based implementations of forces, which should outperform the methods discussed in this work for small to medium systems.
All calculations reported in this section used user provided, uncontracted RI basis sets. Even the TiO2 method scaling tests, where no optimized double-zeta basis was available for titanium atoms, were based on a manually uncontracted version of the automatically generated basis. Uncontracted basis sets are optimal for three-center ERI calculations in CP2K with the LIBINT library. This is relevant when larger basis sets, and especially high angular momenta, are required. Moreover, higher levels of sparsity can be achieved in AO based tensors than with contracted basis sets. However, all of this comes with a trade-off; uncontracted basis sets typically have more elements, leading to larger tensors overall. All non-standard basis sets used in this work are available in the supplementary material.
The default filtering threshold of 10^-9 for the sparse tensors was used throughout. Using looser thresholds can potentially speed up calculations; however, careful testing is advised to make sure accuracy is retained. The same goes for the choice of the RI metric. It was selected to be the truncated Coulomb potential with a 1.5 Å cutoff radius for liquid water. Using a shorter range would lead to greater sparsity and more efficient calculations, but tests should ideally be conducted to ensure accuracy.
The default three-center tensor block size of 5 × 5 × 5 was consistently used. The ideal block size is a trade-off between the sparsity for smaller blocks and the efficiency of block-wise matrix–matrix multiplications for larger ones. As discussed previously, one of the bottlenecks of the low-scaling post-HF force implementation is due to tensor densification [Eq. (33)]. As energy and forces scale differently, the latter will overwhelmingly dominate timings for very large systems. In such cases, calculations can be accelerated by using larger block sizes such that denser tensors are more efficiently contracted.
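The trade-off can be made concrete with a small sketch: for a given element-wise sparsity, larger blocks keep more zero elements inside non-zero blocks, so the block occupancy grows, while each surviving block multiplication becomes a larger and more efficient GEMM. The sizes and the random sparsity pattern below are purely illustrative:

```python
import numpy as np

def block_occupancy(t: np.ndarray, block: int, eps: float = 1e-9) -> float:
    """Fraction of (block x block x block) blocks whose Frobenius norm exceeds eps."""
    nb = t.shape[0] // block
    kept = 0
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                sub = t[i*block:(i+1)*block, j*block:(j+1)*block, k*block:(k+1)*block]
                if np.linalg.norm(sub) > eps:
                    kept += 1
    return kept / nb**3

# Illustrative rank-3 tensor with ~1% non-zero elements.
rng = np.random.default_rng(0)
tensor = rng.random((60, 60, 60)) * (rng.random((60, 60, 60)) < 0.01)

for b in (5, 10, 15):
    print(f"block size {b}: occupancy {block_occupancy(tensor, b):.2f}")
```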
V. CONCLUSION
In this paper, we reported the efficient implementation of the cubic-scaling RI-HFX, SOS-MP2, and direct RPA energy gradients for periodic systems in CP2K. All methods rely on the resolution-of-the-identity approximation in the AO basis, leading to sparse tensor contractions being the computational bottleneck of each calculation. This is efficiently taken care of with the recently implemented DBT and DBM libraries, presented in this work as well. The performance benchmark calculations performed for a variety of systems show a good strong scaling behavior and parallel efficiency up to a large number of nodes. The implementations benefit from robust GPU acceleration, reaching a factor of 2.95 speedup on the state-of-the-art LUMI-G supercomputer.
All three methods enjoy sub-cubic scaling because of tensor sparsity. In the case of RI-HFX, this still compares unfavorably with the standard four-center implementation, which scales linearly with system size for large periodic systems. However, for small to medium size dense systems and molecules, RI-HFX is more efficient. The scaling of the AO based SOS-MP2 and RPA energy gradients is advantageous compared to their quartic scaling MO based counterparts. However, their large prefactor is expected to make them competitive for larger systems only.
This work enables the calculation of large systems at double-hybrid DFT accuracy. While the code scales well on large supercomputers, the calculation of many molecular dynamics frames remains expensive. These methods can, however, be used for the generation of machine learning training data.86,87 Compared to the MO based RPA and SOS-MP2 methods, using low-scaling post-HF implementations speeds up the overall fitting process for large systems. This should allow for long MD trajectories of chemically interesting systems to be produced at an even lower cost while keeping high accuracy.
SUPPLEMENTARY MATERIAL
See the supplementary material for the detailed derivation of the low-scaling post-HF energy gradient equations, RI-HFX supporting figures, and custom basis sets.
ACKNOWLEDGMENTS
This work was funded by the Swiss Platform for Advanced Scientific Computing (PASC) as part of the 2021–2024 “Ab Initio Molecular Dynamics at the Exa-Scale” project. Computing time on Piz Daint was provided by the CSCS Swiss National Supercomputing Centre under Account Nos. d110 and c18. LUMI-G computing time was made available as part of its pilot phase, under Project No. 465000151.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Augustin Bussy: Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Ole Schütt: Software (equal); Writing – original draft (equal). Jürg Hutter: Funding acquisition (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal).
DATA AVAILABILITY
The data that support the findings of this study, namely CP2K input and output files, are openly available in the Materials Cloud at http://doi.org/10.24435/materialscloud:kc-ex.
APPENDIX A: IMPLEMENTATION OF RI-HFX FORCES
The algorithm used in the code is outlined in Table V. At step 4, a loop over batches of AO indices is initiated. This ensures that the intermediate tensor fits in memory, since the contraction of the three-center ERIs with the density matrix leads to a higher occupancy. At steps 7 and 8, an intermediate tensor is calculated and its trace is then taken with the three-center derivatives. In principle, this intermediate would be fairly dense. However, we can force it to adopt the sparsity pattern of the three-center derivatives, which leads to savings in memory and computation time.
TABLE V. Pseudocode for RI-HFX forces.

1. Pre-calculate and store the two-center derivatives (Q|R)(x) and …
2. Pre-calculate and store the 3c derivatives …
3. Perform the matrix product …
4. Loop over batches of μ and ν AOs:
5. Contract the density matrices with the 3c integrals: …
6. Loop over batches of S RI basis elements:
7. Perform the contraction …
8. Calculate the trace …
9. Perform the contraction …
10. Perform the matrix product …
11. Calculate the trace …
12. Perform the matrix product …
13. Calculate the trace …
APPENDIX B: IMPLEMENTATION OF LOW-SCALING POST-HF FORCES
The implementation of the RPA energy gradient in CP2K is summarized as pseudocode in Tables VI–VIII. It is assumed that certain quantities, such as the two- and three-center integrals and the matrices P(τ_j) and Q(τ_j), are available from the energy calculation. The considerations about two- and three-center derivatives made in the RI-HFX appendix apply here as well. Note that the SOS-MP2 algorithm is easily recovered as a special case, in particular by setting ε_F = 0. At step 3, an intermediate tensor is calculated whose occupancy is typically significantly higher than that of the three-center integrals. To save memory, it is stored in a compressed format described in Ref. 88. Note that the three-center operations described in Table VIII are performed over batches of AOs and RI basis elements. Because tensor densification happens as the algorithm progresses, this scheme ensures that all relevant three-center quantities can be stored in memory. This is especially critical on machines with low-memory GPUs. The number of batches along each tensor dimension can be set in the CP2K input under the MEMORY_CUT keyword. A high value leads to a lower memory footprint at the cost of more overhead. Finally, the sparsity pattern of the tensor contracted at step 13 has to match that of the three-center derivatives because of the trace at step 14. This can be exploited to save memory and accelerate the execution of step 13. Note that the R matrices calculated at steps 8 and 9 are passed on to the rest of the code.
TABLE VI. Pseudocode for post-HF forces.

1. Pre-calculate and store the two-center derivatives (P|R)(x) and …
2. Pre-calculate and store the 3c derivatives …
3. Calculate …
4. Loop over τ_j quadrature points:
5. Perform two-center operations: Table VII
6. Perform three-center operations: Table VIII
7. Perform the matrix multiplications A_1 = τ_j P_occ(F − ε_F S), A_2 = −τ_j P_virt(F − ε_F S)
8. Iteratively calculate the matrices Y_1, Y_2, …, and …
9. Perform the matrix multiplications: …
10. Calculate the trace …
11. Calculate the trace …
12. Add to the Z-vector equation RHS: …
13. Solve the Z-vector equation and update the forces
TABLE VII. Post-HF two-center operations.

1. Perform the matrix multiplications …
2. Calculate the trace …
3. Calculate the trace …
TABLE VIII. Post-HF three-center operations.

1. Calculate pseudo-densities … and …
2. Perform the matrix multiplication …
3. Perform the tensor contraction …
4. Loop over batches of σ AOs:
5. Calculate …
6. Calculate …
7. Loop over batches of T RI basis elements:
8. Calculate …
9. Calculate …
10. Loop over batches of λ AOs:
11. Contract …
12. Loop over batches of T RI basis elements:
13. Contract …
14. Calculate the trace …