The development of novel double-hybrid density functionals offers new levels of accuracy and is leading to fresh insights into the fundamental properties of matter. Hartree–Fock exact exchange and correlated wave function methods, such as second-order Møller–Plesset perturbation theory (MP2) and the direct random phase approximation (dRPA), are usually required to build such functionals. Their high computational cost is a concern, and their application to large and periodic systems is, therefore, limited. In this work, low-scaling methods for Hartree–Fock exchange (HFX), SOS-MP2, and direct RPA energy gradients are developed and implemented in the CP2K software package. The use of the resolution-of-the-identity approximation with a short-range metric and atom-centered basis functions leads to sparsity, which can be exploited by sparse tensor contractions. These operations are performed efficiently with the newly developed Distributed Block-sparse Tensors (DBT) and Distributed Block-sparse Matrices (DBM) libraries, which scale to hundreds of graphics processing unit (GPU) nodes. The resulting methods, resolution-of-the-identity (RI) HFX, SOS-MP2, and dRPA, were benchmarked on large supercomputers. They exhibit favorable sub-cubic scaling with system size, good strong scaling performance, and GPU acceleration of up to a factor of 3. These developments will allow double-hybrid level calculations of large and periodic condensed phase systems to be performed on a more regular basis.

In the quest for more accurate density functionals, double-hybrid functionals1 have emerged as the best option to date.2–4 In this approach, fractions of exact exchange and wave function correlation energy are linearly combined with a local density functional. Most of the time, the wave function correlation method applied is second-order Møller–Plesset perturbation theory (MP2)5 or the direct random phase approximation (dRPA),6 based on variational orbitals from a converged SCF calculation with the corresponding hybrid functional. Although dRPA and MP2 do not have the unfavorable scaling of highly accurate wave function methods, they still exhibit a steep O(N^4) or O(N^5) scaling with system size N. In addition, as with all post-Hartree–Fock methods, the convergence with respect to basis set size is slow.7,8 Double-hybrid functionals inherit both of these aspects from wave function theory. For condensed matter calculations, the need for large basis sets and the steep dependence on system size are especially computationally demanding. Although double-hybrid functionals have been applied to solid state systems,9–13 more routine application to the condensed phase requires the development of new algorithms and efficient implementations on high performance computers. Recent developments include range-separated double-hybrids to lessen basis set dependence;14–17 reduced scaling methods for MP2,18–21 SOS-MP2,22–25 and direct RPA26–33 in order to allow for larger systems; and massively parallel implementations on graphics processing unit (GPU) accelerated hardware.34–37

We demonstrate a combination of the above achievements for the calculation of energy and gradients of large condensed matter systems. Low-scaling versions of SOS-MP2 and dRPA energy calculations are extended to include nuclear forces and stress tensors. The resulting methods can be used for stand-alone correlated wave function calculations or seamlessly combined with density functional theory (DFT) in the framework of double-hybrid functionals. The exact exchange part of the energy, force, and Z-vector equation38 is efficiently modeled with the auxiliary density matrix method39 (ADMM). The resolution-of-the-identity40–42 (RI) technique is systematically used for the post-HF methods and optionally for the Hartree–Fock exact exchange. All tensor contractions resulting from the use of the RI approximation are handled through a newly implemented, massively parallel, and GPU accelerated sparse tensor library. These developments allow for calculations of large systems at high accuracy, enabling statistical sampling via molecular dynamics as well as geometry and cell optimizations.

This work reports on the efficient implementation of ADMM-RI-HF, cubic scaling SOS-MP2, and cubic scaling direct RPA energy gradients in the CP2K software package.43 All three methods rely on sparse tensor contractions, using the Distributed Block-sparse Matrices (DBM) and Distributed Block-sparse Tensors (DBT) libraries described in the first part of this work. In the second part, we present the relevant gradient equations that were derived and implemented. Finally, performance benchmark results are presented, with a strong focus on periodic condensed matter systems, and the scaling with respect to system size and computing resources is investigated.

The central quantities in RI-HF and post-HF methods can be conveniently expressed as combinations of tensor contractions. Since the underlying Gaussian basis set is local, the involved tensors become sparse for larger system sizes. In addition, because the basis set is atom centered, this sparsity exhibits a blocked pattern. To take advantage of the blocked sparsity, the Distributed Block-sparse Tensors (DBT) library was developed.31,44

A central design decision of the DBT library was to implement the sparse tensor contractions via sparse matrix multiplication. This allowed us to build upon CP2K’s mature DBCSR library,45,46 which was originally developed for linear scaling DFT.47 Recently, the DBCSR library has been succeeded by the new Distributed Block-sparse Matrices (DBM) library, which offers a better OpenMP and GPU performance in the context of tensor contraction.

The DBM library distributes matrix blocks by arranging the processors in a 2D grid and then assigning each row/column of the blocked matrix to one row/column of that 2D processor grid. Typically, it is a many-to-one mapping as the blocked matrix has (much) more rows/columns than the processor grid. For tensors of rank N, the scheme is generalized by arranging the processors in an N-dimensional processor grid.
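As a concrete illustration, the following minimal Python sketch (hypothetical, not actual DBM code) shows one possible many-to-one mapping of blocked-matrix rows and columns onto a 2D processor grid; a simple round-robin assignment is used here for illustration.

```python
import numpy as np

# Toy illustration of a DBM-style block distribution: every blocked-matrix
# row/column is assigned to one row/column of a 2D processor grid.
n_block_rows, n_block_cols = 10, 8   # blocked matrix dimensions
p_rows, p_cols = 3, 2                # 2D processor grid dimensions

row_owner = np.arange(n_block_rows) % p_rows   # block row -> grid row (round-robin)
col_owner = np.arange(n_block_cols) % p_cols   # block col -> grid col (round-robin)

# Block (i, j) is owned by processor (row_owner[i], col_owner[j]).
print(row_owner[7], col_owner[5])   # owner coordinates of block (7, 5)
```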

The contraction of two tensors involves the pairing of some of their indices and subsequent summation. The indices that get paired are called dummy or summation indices. The remaining indices are called free indices and are carried over to the result tensor. To compute a tensor contraction via a matrix multiplication, the input tensors first have to be converted to matrices through index folding. All summation indices of the contraction have to be folded into one matrix dimension, and all of the free indices have to be folded into the other matrix dimension. Given an N-dimensional index (i1, i2, …, iN) of a rank-N tensor with shape L1 × L2 ×⋯× LN, the folded index I can be defined as
I = i_1 + (i_2 − 1) L_1 + (i_3 − 1) L_1 L_2 + ⋯ + (i_N − 1) L_1 L_2 ⋯ L_{N−1}    (1)
Note that this folding operation is applied to the individual elements within each tensor block, the indices of the tensor blocks, and the N-dimensional processor grid. The folding of tensors into matrices is a local operation that does not require message passing interface (MPI) communication. Since the folding combines multiple tensor indices into one matrix dimension, the resulting matrices are often tall-and-skinny. For example, a rank-3 tensor of shape L × L × L can be folded into the following four matrix shapes: L2 × L, L × L2, L3 × 1, and 1 × L3.
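As an illustration of the folding, the following numpy sketch (hypothetical, not DBT code) produces the four matrix shapes quoted above for a rank-3 tensor and shows that a contraction over the folded summation index becomes an ordinary matrix product.

```python
import numpy as np

# Index folding of a rank-3 tensor of shape (L, L, L); numpy's reshape
# performs the folding locally, mirroring the communication-free folding
# described in the text.
L = 4
t = np.random.rand(L, L, L)

m1 = t.reshape(L * L, L)   # fold (i1, i2) into the rows, i3 into the columns
m2 = t.reshape(L, L * L)   # fold i1 into the rows, (i2, i3) into the columns
m3 = t.reshape(L**3, 1)    # fold all indices into the rows
m4 = t.reshape(1, L**3)    # fold all indices into the columns

# A contraction over the last index, sum_k t[i,j,k] * v[k], becomes a plain
# matrix-vector product on the folded matrix m1:
v = np.random.rand(L)
assert np.allclose(np.einsum('ijk,k->ij', t, v), (m1 @ v).reshape(L, L))
```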

The DBM library employs the SRUMMA algorithm48 for matrix multiplications, which does not perform well for tall-and-skinny matrices.49 The DBT library, therefore, borrows a page from the CARMA algorithm50 and splits the two largest matrices along their shared dimension into s_split submatrices. The processor grid is also split into s_split corresponding subgroups, and the submatrices are then multiplied independently within their processor subgroups. The trade-off for splitting the larger matrices is the additional memory and communication required for replicating the smallest matrix across all subgroups. Furthermore, if the two larger matrices do not have compatible distributions, then the mid-sized matrix has to be redistributed, which can dominate the overall communication cost of the tensor contraction. The choice of the smallest matrix and the optimal splitting factor s_split depends on the sparsity of all three matrices. To estimate the sparsity of the result matrix, a multiplication of the block norms is quickly performed upfront.
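The upfront sparsity estimate can be pictured with a toy block-sparse example (block structure and threshold below are hypothetical): every block is replaced by its norm, and the product of the small norm matrices bounds the block norms of the result.

```python
import numpy as np

def block_norms(blocks, shape):
    """blocks: dict {(i, j): ndarray}; returns the dense matrix of block norms."""
    norms = np.zeros(shape)
    for (i, j), blk in blocks.items():
        norms[i, j] = np.linalg.norm(blk)
    return norms

na, nb, nc = 6, 6, 6
A = {(i, j): np.random.rand(5, 5) for i in range(na) for j in range(nb) if (i + j) % 3 == 0}
B = {(i, j): np.random.rand(5, 5) for i in range(nb) for j in range(nc) if i == j}

# By the triangle and submultiplicative inequalities, this product is an
# upper bound on the norm of each block of the result matrix.
est = block_norms(A, (na, nb)) @ block_norms(B, (nb, nc))
eps_filter = 1e-9
occupancy = np.count_nonzero(est > eps_filter) / est.size
print(f"estimated result occupancy: {occupancy:.1%}")
```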

In this work, the implementation in CP2K of atomic forces and stress tensors for Hartree–Fock, as well as the correlated wave function theories SOS-MP251,52 and direct RPA,6 is reported. The focus is on the development of methods with favorable scaling with respect to system and basis set size such that large systems can be simulated within periodic boundary conditions (PBCs). The resulting code should scale efficiently over thousands of central processing unit (CPU) cores and be GPU accelerated. There exist multiple ways to approach each of the three methods of interest, with various scalings and prefactors for the calculation of energies (HF,53–58 SOS-MP2,22–25 and RPA26–33). To fully exploit the highly efficient sparse tensor DBT library, we make the choice of implementing all three methods within the resolution-of-the-identity (RI) approximation40–42 while working in a local atomic orbital (AO) basis made of Gaussian type orbitals (GTOs).

The periodic four-center electron repulsion integrals (ERIs) are a central quantity in both HF and post-HF theories. In the current work, only the Γ-point is considered, and periodicity is taken care of by summing contributions from periodic images,
(λσ|μν)_PBC = Σ_{a,b,c} (λ^0 σ^a | μ^b ν^c)    (2)
where a, b, and c denote the translations of the unit cell. From now on, the explicit PBC dependence is dropped, and all ERIs are assumed to be periodic. The four-center ERIs can be rewritten as contractions of two- and three-center integrals within the RI approximation,
(λσ|μν) ≈ (λσ∥P) (P∥Q)^−1 (Q|R) (R∥S)^−1 (S∥μν)    (3)
where the Mulliken and Einstein notations are used for the integrals and the summation of repeating indices, respectively. λ, σ, μ, and ν denote Gaussian type AOs, while P, Q, R, and S label Gaussian type RI basis functions. The latter are optimized such that they span most of the space defined by all pairwise products of AOs, while the size of their set is kept minimal. The symbol ∥ represents the RI metric, which simplifies the above equation if chosen to be the same as the main operator represented by the vertical bar | (typically, the 1/r12 Coulomb operator).
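To make the structure of Eq. (3) concrete, the following numpy sketch assembles RI-approximated four-center integrals from random placeholder tensors (all names, shapes, and values are hypothetical; this is not CP2K code).

```python
import numpy as np

nao, nri = 6, 10
T3 = np.random.rand(nao, nao, nri)   # stands in for the three-center (λσ∥P)
M = np.random.rand(nri, nri)
M = M @ M.T + nri * np.eye(nri)      # stands in for the SPD metric matrix (P∥Q)
V = np.random.rand(nri, nri)
V = V @ V.T + nri * np.eye(nri)      # stands in for the SPD Coulomb matrix (Q|R)

Minv = np.linalg.inv(M)
# (λσ|μν) ≈ (λσ∥P) (P∥Q)^-1 (Q|R) (R∥S)^-1 (S∥μν), with Einstein summation:
eri4 = np.einsum('lsP,PQ,QR,RS,mnS->lsmn', T3, Minv, V, Minv, T3, optimize=True)
print(eri4.shape)   # (6, 6, 6, 6)
```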

There are both memory and computational efficiency advantages when working within the RI approximation and the AO basis. The number of two- and three-center ERIs to evaluate and store is small compared to that of a four-center based calculation. Moreover, using a short range RI metric, such as the overlap or a truncated Coulomb operator,59,60 leads to a lower occupancy of the (λσP) integrals and more efficient subsequent sparse tensor contractions. Finally, keeping three-center quantities in the local AO basis rather than the molecular orbital (MO) basis helps retain that sparsity, as MOs are typically much less localized. All three methods discussed in this work, RI-HF, low-scaling SOS-MP2, and dRPA, benefit from these choices.

The Hartree–Fock exchange (HFX) energy is expressed in terms of the density matrix Pλμ in the AO basis and the four-center ERIs (λσ|μν),
E_x^HFX = −(1/2) P_{λμ} P_{σν} (λσ|μν)    (4)
which can be straightforwardly rewritten in the RI approximation,
E_x^RI-HFX = −(1/2) P_{λμ} P_{σν} (λσ∥P) (P∥Q)^−1 (Q|R) (R∥S)^−1 (S∥μν)    (5)
From the above equations, it is clear that RI-HFX ERI evaluation scales cubically with the number of basis elements, while the standard four-center ERIs formally scale as O(N^4). Note that integral screening61 helps reduce the cost of the latter, while tensor sparsity improves the efficiency of the former. The RI approximation in the context of HFX allows us to shift the computational burden from ERI evaluation to tensor contractions, which are better suited for GPU acceleration. Because the scaling of HFX with respect to the basis set size is so steep, the auxiliary density matrix method (ADMM) was developed.39 In this approximation, the HFX energy is expressed as
E_x^ADMM[P] = E_x^HFX[P̂] + (E_x^DFT[P] − E_x^DFT[P̂])    (6)
where P̂ is the electronic density matrix expressed in an auxiliary basis, which is usually taken to be smaller and/or less diffuse than the primary basis. The difference in exchange energy between the primary and the auxiliary density matrix is approximated with a local exchange functional. This latter step is very efficient compared to HF exchange, while assumed to be accurate, provided that P ≈ P̂. Depending on the system and the choice of primary and auxiliary bases, the speedup due to ADMM can reach orders of magnitude.62 The ADMM approximation can also be easily combined with RI, accelerating the evaluation of the three-center integrals, now computed in the auxiliary basis. More importantly, because the bottleneck of RI-HFX calculations is expected to be tensor contractions rather than ERI evaluation, the dimensions of all relevant two- and three-center tensors reduce to the size of the smaller auxiliary basis and the corresponding RI basis.
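The bookkeeping of Eq. (6) is simple enough to be spelled out directly; the callables hfx_energy and dft_x_energy below are hypothetical stand-ins for the exact and local exchange evaluations.

```python
# Minimal sketch of the ADMM energy expression of Eq. (6): exact exchange is
# evaluated with the (smaller) auxiliary density matrix only, and the
# primary/auxiliary difference is approximated by a cheap local functional.
def admm_hfx_energy(P, P_hat, hfx_energy, dft_x_energy):
    return hfx_energy(P_hat) + (dft_x_energy(P) - dft_x_energy(P_hat))
```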
The HFX forces acting on the nuclei are obtained by taking the derivative of the energy with respect to atomic positions, which we denote with the superscript (x),
E_x^(x) = −(1/2) P_{λμ} P_{σν} [ 2 (λσ∥P)^(x) (P∥Q)^−1 (Q|R) (R∥S)^−1 (S∥μν) − 2 (λσ∥P) (P∥Q)^−1 (Q∥R)^(x) (R∥S)^−1 (S|T) (T∥U)^−1 (U∥μν) + (λσ∥P) (P∥Q)^−1 (Q|R)^(x) (R∥S)^−1 (S∥μν) ]    (7)
where symmetries were exploited to reduce the number of terms to three for the RI-HFX forces. In the standard four-center implementation, derivatives are calculated and contracted with the relevant elements of the density matrices on the fly. In the RI case, three-center derivatives are pre-calculated and stored before the sparse tensor contractions can take place. Because the AOs λ and σ can be swapped symmetrically in (λσ∥P)^(x), six three-center derivative tensors are required (x, y, and z directions, for the AO and RI bases). More details about the implementation and efficiency considerations of RI-HFX forces can be found in Appendix A. Note that all the scaling considerations previously mentioned for the HFX energy apply here as well.

The RI-HFX forces were implemented in the CP2K software package according to the above equations on top of the existing RI-HFX energy code.44 The two-, three-, and four-center integrals and derivatives are calculated analytically with the LIBINT63 library. The RI basis sets can be provided by the user as part of the calculation input, but the default behavior is to generate them automatically according to the scheme of Stoychev and co-workers.64 The storage and contraction of all three-center and most two-center quantities are done via the block sparse tensor library DBT.

To go beyond the accuracy of DFT, HF, or hybrid DFT, correlated wave function methods are required. These approaches capture dynamical electron correlation effects, which are completely lacking in HF and only approximated in DFT.65 Long-range dispersion is also inherently included,66,67 in contrast to DFT and HF, where ad hoc correction schemes are necessary.68–70 A common feature of these post-HF methods is that the correlation energy is computed in a single shot, as post-processing of a converged SCF calculation,
E_c = E_c({φ_n}, {ε_n})    (8)
where {φ_n, ε_n} represents the collection of MOs and orbital energies, occupied and virtual, resulting from the SCF calculation. This makes the energy non-variational with respect to the wave function parameters, leading to non-trivial expressions for the atomic forces. Indeed, the solution of a Z-vector equation38 is systematically required.

1. SOS-MP2

Within the second-order Møller–Plesset (MP2) correlated wave function method,5 the scaled opposite-spin contribution to the correlation energy is expressed as
E_c^SOS-MP2 = −c_OS Σ_{i j̄ a b̄} (ia|j̄b̄)² / (ε_a + ε_b̄ − ε_i − ε_j̄)    (9)
where the i, j indices denote occupied MOs and a, b virtual MOs, resulting from a converged SCF calculation at the HF level. The bar above MO indices distinguishes between α and β spin-orbitals. The opposite-spin scaling factor c_OS is typically set to 1.3.52 SOS-MP2 is itself an approximation to canonical MP2, but dropping the same-spin component of the latter allows for a reduction of the method's scaling from an initial O(N^5) to O(N^4), and ultimately to O(N^3). This can be achieved by combining the RI approximation, a Laplace transform with numerical integration, and the exclusive use of the AO basis.23 The resulting expression for the SOS-MP2 correlation energy reads
E_c^SOS-MP2 = −c_OS Σ_j ω_j Tr[Q^α(τ_j) Q^β(τ_j)]    (10)
where τj denotes Laplace quadrature points with weight ωj. The matrix Q(τj) is symmetric and defined in the RI basis as
Q(τ_j) = K^T P(τ_j) K,    K = (P∥Q)^−1 (Q|R)^{1/2}    (11)
with the symmetric matrix P(τj) obtained by the two-index contraction of two three-center quantities
P_{PQ}(τ_j) = Σ_{μν} (μνP)^occ(τ_j) (μνQ)^virt(τ_j)    (12)
themselves defined as
(μνP)^occ(τ) = P^occ_{μλ}(τ) (λν∥P),    (μνP)^virt(τ) = P^virt_{νλ}(τ) (μλ∥P)    (13)
with the occupied and virtual pseudo-densities
P^occ_{μν}(τ) = Σ_i C_{μi} C_{νi} e^{ε_i τ},    P^virt_{μν}(τ) = Σ_a C_{μa} C_{νa} e^{−ε_a τ}    (14)
where Cμi and Cνa are the occupied and virtual MO coefficients, respectively. The cubic scaling SOS-MP2 energy was implemented in CP2K.31,43 The atomic forces are highly non-trivial, and the final expression written below was derived based on the work of Schweizer and co-workers:71 
(15)
(16)
with
(17)
The term stemming from the second spin combination, E_βα^(x), is easy to recover. The matrices S and F are the overlap and the Fock matrix in the AO basis, respectively. P^occ and P^virt are the occupied and virtual density matrices such that
P^occ_{μν} = Σ_i C_{μi} C_{νi},    P^virt_{μν} = Σ_a C_{μa} C_{νa}    (18)
The symmetric R matrices are a shorthand notation for the contraction of the usual two- and three-center quantities
(19)
Finally, matrices Y1 and Y2 are defined iteratively such that the following sum converges:
(20)
where R = R^virt, P = P^occ, and A = τ_j P^occ F for Y_1, while R = R^occ, P = P^virt, and A = −τ_j P^virt F for Y_2. The quantity PR̲^(n) is itself defined recursively, with PR̲^(0) = PR and
(21)
The value of n is determined such that the norm of A/2^n is sufficiently small for the scaling and squaring algorithm72 to converge for the e^A matrix exponential. The last term in Eq. (16) is related to orbital relaxation, where P^occ_x contains the density response to an atomic displacement. Its contribution to the total force is evaluated through the solution of a Z-vector equation, of which the O(τ_j) operator defined in Eq. (17) is the RHS. For efficiency, the terms stemming from the different Laplace quadrature points and spin configurations can be summed before a single Z-vector equation is solved. The MP2 correction to the density matrix, necessary to calculate properties such as the dipole, is given by
(22)
with P^Z-vec the solution of the Z-vector equation. Note that contributions to the energy gradient coming from a change in the Laplace quadrature are ignored. To ensure consistency of energy and forces, the Laplace quadrature is fixed at the start of the calculation and kept constant upon changes in the system geometry (as would happen during a molecular dynamics run, for example). This has little effect on accuracy, provided that the quadrature is converged well enough to start with. Because the Laplace quadrature depends on the HOMO–LUMO gap of the system, keeping it fixed during molecular dynamics runs where the gap varies significantly could lead to inaccuracies. In such cases, a different quadrature can be used for each frame. In principle, this leads to inconsistencies, although they are known to be small.73
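To illustrate the Laplace-transform step that underlies the low-scaling SOS-MP2 expressions above, the sketch below approximates the denominator 1/x by a short exponential quadrature; a simple Gauss–Legendre rule on a log-mapped axis is used here as an illustrative substitute for the minimax grids actually employed in CP2K.

```python
import numpy as np

# 1/x = integral_0^inf exp(-x*tau) dtau ~ sum_j w_j exp(-x*tau_j) for x > 0;
# in SOS-MP2, x plays the role of the denominator eps_a + eps_b - eps_i - eps_j.
def laplace_quadrature(n, x_min=0.2):
    t, w = np.polynomial.legendre.leggauss(n)   # nodes/weights on (-1, 1)
    t, w = 0.5 * (t + 1.0), 0.5 * w             # shift to (0, 1)
    tau = -np.log1p(-t) / x_min                 # substitution tau = -ln(1-t)/x_min
    return tau, w / (x_min * (1.0 - t))         # weights include the Jacobian

tau, wj = laplace_quadrature(8)
for x in (0.3, 1.0, 5.0):   # the smallest relevant x is set by the HOMO-LUMO gap
    print(x, np.sum(wj * np.exp(-x * tau)), 1.0 / x)   # quadrature vs exact
```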

The low-scaling SOS-MP2 gradients were implemented in CP2K according to the above equations. The Laplace quadrature is obtained via a minimax scheme.74,75 The quadrature is kept fixed by default, while the option to adapt it depending on the gap is available. The two- and three-center integrals and derivatives involving the RI metric, where a short-range potential is used for sparsity, are calculated analytically using the LIBINT library. The two-center integrals and derivatives requiring the Coulomb operator, (P|Q) and (P|Q)(x), are calculated numerically with the Gaussian and plane waves (GPW) method,76,77 which intrinsically takes care of PBCs. Most two- and three-center quantities are stored and processed as DBT block sparse tensors. The iterative construction of the Y1,2 matrices, as well as the RHS of the Z-vector equation O(τj), is done with the DBCSR library.45 As for RI-HFX, the RI basis is automatically generated by default. The only non-trivial contribution to the stress tensor stems from the GPW numerical integrals since the integration grids change with cell size. The procedure is described in Ref. 78. More details about the CP2K implementation can be found in  Appendix B, and more details about the derivation of the equations can be found in the supplementary material.
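For reference, the scaling and squaring evaluation of the matrix exponential mentioned above can be sketched in a few lines; this toy version uses a plain Taylor series after scaling, whereas production implementations (Ref. 72) combine the squaring step with Padé approximants.

```python
import numpy as np
from scipy.linalg import expm   # reference implementation, for comparison only

def expm_scaling_squaring(A, taylor_order=12):
    # Choose n such that ||A / 2^n|| <= 1, as described in the text.
    n = max(0, int(np.ceil(np.log2(max(np.linalg.norm(A, 1), 1e-16)))))
    B = A / 2.0**n
    E = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, taylor_order + 1):   # Taylor series of exp(B)
        term = term @ B / k
        E = E + term
    for _ in range(n):                     # undo the scaling: exp(A) = exp(B)^(2^n)
        E = E @ E
    return E

A = 2.0 * np.random.rand(5, 5)
print(np.max(np.abs(expm_scaling_squaring(A) - expm(A))))   # small residual
```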

2. Direct RPA

Within the RI approximation, the cubic-scaling RPA correlation energy is expressed as31 
E_c^RPA = (1/2π) ∫_0^∞ Tr[ln(1 + Q(ω)) − Q(ω)] dω    (23)
where Q(ω) is a square matrix in the RI basis,
Q(ω) = 2 ∫_0^∞ Q(τ) cos(ωτ) dτ    (24)
with the time-dependent Q(τ) matrix defined the same way as in Eq. (11). The only difference lies in the expressions for the occupied and virtual pseudo-densities, where the Fermi energy ɛF is used,
P^occ_{μν}(τ) = Σ_i C_{μi} C_{νi} e^{(ε_i − ε_F)τ},    P^virt_{μν}(τ) = Σ_a C_{μa} C_{νa} e^{−(ε_a − ε_F)τ}    (25)
There are two integrals to evaluate: one over frequency in Eq. (23) and one over time in Eq. (24). Both are solved numerically such that
E_c^RPA ≈ (1/2π) Σ_k σ_k ( log det[1 + Q(ω_k)] − Tr[Q(ω_k)] ),    Q(ω_k) = Σ_j λ_{kj} Q(τ_j)    (26)
with grid points ωk and weights σk for the frequency grid and τj, λkj for the time grid. The identity Tr[log(A)] = log[det(A)] for positive-definite matrices was used. From there, the gradient of the energy can be taken,
E_c^(x) = (1/2π) Σ_k σ_k Tr[ ((1 + Q(ω_k))^−1 − 1) Q^(x)(ω_k) ] = Σ_j Tr[ B_j Q^(x)(τ_j) ]    (27)
where the integration over the frequency grid can be performed early, defining the symmetric Bj matrix in the process,
B_j = (1/2π) Σ_k σ_k λ_{kj} [ (1 + Q(ω_k))^−1 − 1 ]    (28)
The above operation comes at no extra cost, provided that the collection of Q(ω_k) matrices is available from the energy calculation. More importantly, it makes the RPA energy gradient expression very similar to that of low-scaling SOS-MP2 [see Eq. (15)], allowing large parts of the code to be shared between the two methods. The final formula reads
(29)
with
(30)
The R matrices are adapted from Eq. (19),
(31)
while the Y matrices are defined the same way as in SOS-MP2 [see Eq. (20)], with R = R^virt, P = P^occ, and A = τ_j P^occ (F − ε_F S) for Y_1, and R = R^occ, P = P^virt, and A = −τ_j P^virt (F − ε_F S) for Y_2. The O(τ_j) contributions from the different grid points can be summed to form the RHS of a Z-vector equation. Its solution, P^Z-vec, is used to evaluate the orbital relaxation contribution to the RPA atomic forces. It also allows us to build the RPA correction to the density matrix, in exactly the same way as in Eq. (22).
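The log-det identity invoked for Eq. (26) is easy to verify numerically; the following self-contained check (illustrative only) uses the spectral decomposition of a random symmetric positive-definite matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))
A = X @ X.T + 6 * np.eye(6)           # symmetric positive-definite by construction

tr_log = np.sum(np.log(np.linalg.eigvalsh(A)))   # Tr[log(A)] via the eigenvalues
sign, log_det = np.linalg.slogdet(A)             # numerically robust log-determinant
assert sign > 0 and np.isclose(tr_log, log_det)  # Tr[log(A)] = log[det(A)]
```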

The requirements to calculate low-scaling RPA forces are very similar to SOS-MP2. Therefore, the discussion about the various libraries and methods used in the CP2K implementation at the end of Sec. III B 1 applies here as well. Note that the RPA forces are more complex due to the presence of the double time and frequency integration grids. The final expression is also lengthier because of the Fermi energy in the definition of the pseudo-densities. The details about the implementation can be found in  Appendix B, while a detailed derivation of the gradient formula can be found in the supplementary material.

Double-hybrid density functionals combine elements of DFT and wave function theory (WFT) in a “best of both worlds” approach.1,4,12,79,80 Indeed, while DFT offers a good description of short-range interactions, wave function methods are better at treating long-range interactions. In its most general formulation, the exchange–correlation energy of a double-hybrid is as follows:
E_xc^DH = a_x E_x^HF + (1 − a_x) E_x^DFT + a_c E_c^WFT + (1 − a_c) E_c^DFT    (32)
where DFT and HF exchange, as well as DFT and WFT correlation, are mixed as a linear combination. In CP2K, any of the methods described in this work can be used to calculate both double-hybrid level energies and forces. In the latter case, the Fock matrix in Eqs. (16)–(31) has to be replaced by the Kohn–Sham matrix. Extra care has to be taken with its derivative because of the appearance of a second-order exchange–correlation energy kernel (in particular, when dealing with the solution of the Z-vector equation13).

In most instances, the RPA correlation energy is calculated on top of a DFT or hybrid DFT level SCF rather than HF.81 At the end of the calculation, the DFT exchange–correlation energy is replaced by HF exchange, evaluated using the SCF orbitals. In terms of forces, this requires the addition of the HF Hamiltonian to the RHS of the Z-vector equation.73 Such a double-hybrid functional is referred to as RPA@DFT (e.g., RPA@PBE or RPA@PBE0).

The cost and scaling of a double-hybrid calculation are always dominated by the correlated wave function method of choice. Similarly, the basis set convergence of double-hybrids is limited by that of WFT, which is typically slower than DFT.7,8 The low-scaling methods presented in this work should help reach double-hybrid DFT accuracy in large systems.

The main focus of this work is on the performance and scalability of the new sparse tensor based implementations. For all three methods of interest, benchmark calculations were run on the GPU partition of two large supercomputers: Piz Daint at CSCS in Switzerland, and LUMI-G at CSC in Finland. A Piz Daint node is made of an Intel Xeon E5-2690 v3 processor with 12 cores and 64 GB of memory, together with a 16 GB NVIDIA Tesla P100 GPU. The latest update to the machine hardware was made in November 2016. On the other hand, LUMI-G is a state-of-the-art installation still in its pilot phase as of December 2022. A LUMI-G node is made of an AMD EPYC 7A53 processor with 64 cores and 512 GB of memory, with four AMD Instinct MI250X GPUs with 128 GB of memory each. The following benchmarks showcase the code performance on these two machines, with different manufacturers and architectures. Note that LUMI-G was still in pre-production mode at the time of the benchmarks and that improvements can be expected. All calculations were run with 4 MPI ranks and 3 OMP threads per node on Piz Daint and 16 MPI ranks with 3 OMP threads on LUMI-G. These configurations have been shown to be optimal for sparse tensor contractions. Note that all GPUs of the LUMI-G node are used, but not all CPUs. This is due to the machine temporarily running in low-noise mode, where one processor is reserved for the operating system. All LUMI-G calculations reported in this work were run in the same configuration. This is not expected to change the scaling behaviors observed in this work.

The benchmarks for RI-HFX are organized as follows: we first discuss the choice of RI metric and its impact on performance and accuracy, with respect to standard four-center HFX. Then, we measure the scaling of RI-HFX and 4c-HFX with respect to system size, and finally, we study strong scaling performance. We conclude with a discussion about the advantages of both methods and when to apply them.

Two Γ-point periodic systems were selected for these benchmarks: liquid water and bulk silicon. Because sparsity is expected to have a strong impact on performance, both for integral screening with 4c-HFX and for sparse contraction with RI-HFX, this choice is meant to be representative. Indeed, liquid water is a much sparser system than crystalline silicon. For comparison purposes, all calculations were executed with and without the ADMM approximation.

1. RI metric

We are interested in the performance and accuracy of RI-HFX for energy and forces, as a function of the RI metric range. Four metrics are considered: the overlap and the truncated Coulomb potential with cutoff radii of 1.0, 2.0, and 3.0 Å. Because relative energies are more important than their absolute counterparts, the accuracy of RI-HFX energies was tested by taking the energy difference between two different configurations of the same system. For water, two uncorrelated MD frames were considered. For bulk silicon, the optimized geometry and a random perturbation of it (with atoms moved by up to 0.15 Å) were taken. To test the accuracy of RI-HFX forces, the component-wise root mean square error of the atomic force was calculated with respect to the 4c-HFX reference, only on systems with non-vanishing forces. All calculations reported in this section were done on the Piz Daint supercomputer, at the HF level of theory. The double-zeta ccGRB-D and admm-dzp basis sets, distributed as part of the CP2K software package, were used.

The energy results are displayed in Fig. 1. For liquid water, accuracy converges non-monotonically with the RI metric range, regardless of the use of ADMM. Note that this behavior is also observed in larger water systems and with various other basis sets (see the supplementary material). For bulk silicon, accuracy is already well converged with the overlap RI metric, and using longer ranged operators only seems to increase noise. This different behavior with respect to the RI metric range is most likely due to system density: because water is much sparser, it probably requires a longer ranged RI metric to have sufficient interactions with the rest of the system. In terms of performance, the cost of RI-HFX increases linearly with the RI metric range in all cases. This correlates exactly with the occupancy of the three-center ERI tensor (λσP), as summarized in Table I. Compared to standard 4c-HFX, RI-HFX is not systematically faster. In the particular case of water within the ADMM approximation, RI-HFX is at best 1.5 times more expensive, and at least twice as expensive if the metric range is long enough for sufficient accuracy. Without ADMM, the two methods are roughly equivalent for water. In the denser bulk silicon system, however, RI-HFX performs best overall. Using the overlap RI metric, converging the SCF comes at 65% and 38% of the cost of 4c-HFX with and without ADMM, respectively. We observe that 4c-HFX performs comparatively better within the ADMM approximation. This is to be expected, since ADMM basis sets are optimized for performance when calculating four-center ERIs of the type (λσ|μν): the spread of the auxiliary basis functions is limited in order to maximize the screening of the non-overlapping λ, σ and μ, ν pairs. Moreover, they are typically kept as uncontracted bases, which is the optimal format for the LIBINT library within CP2K. Note that RI-HFX also benefits from ADMM, both for the evaluation and the overall sparsity of the three-center (λσP) tensor. The benefits are, however, not as dramatic, since tensor contractions remain the main computational bottleneck.

FIG. 1. Error on the energy difference between two perturbed structures of the same periodic system, as a function of the RI metric range. The reference energy is that of a standard four-center HF calculation using the truncated Coulomb potential, with a range of 6 Å. Ovlp stands for the overlap metric, and TC RC stands for the truncated Coulomb metric with a truncation radius of RC Å. The relative timing of the reference 4c-HFX calculation is always set to 1. The water system is made of 64 molecules with the ccGRB-D basis set and the admm-dzp auxiliary basis set when ADMM is used. The silicon system is made of 128 atoms with the ccGRB-D/admm-dzp basis sets. RI basis sets were automatically generated.
TABLE I. Occupancy of the three-center (λσP) ERI tensor as a function of the RI metric range, for RI-HFX calculations of 64 water molecules and 128 atoms of bulk silicon. Basis sets are ccGRB-D, admm-dzp, and automatically generated RI bases.

                 Liquid water              Bulk silicon
RI metric     ADMM (%)  No ADMM (%)    ADMM (%)  No ADMM (%)
Overlap          6.5       20.8           21.2      69.4
TC 1.0 Å        12.2       33.3           33.4      81.2
TC 2.0 Å        20.7       47.2           47.3      88.0
TC 3.0 Å        31.1       60.2           59.8      91.7

Figure 2 reports the accuracy and performance of RI-HFX forces, as a function of the RI metric range. The trends observed for the energies are the same here: accuracy converges with the range for water, whereas it is already converged with the overlap metric for bulk Si. The same noisy behavior can also be observed for the truncated Coulomb metric with longer ranges in silicon. Note that the component-wise RMSE on the forces for the denser system is particularly low. RI-HFX force timings also increase linearly with the RI metric range and, therefore, with tensor occupancy. Note that the sparsity levels of the three-center tensors containing the derivatives (λσP)^(x) are very close to that of the integrals displayed in Table I. In most cases, the cost of RI-HFX compared to that of 4c-HFX is competitive, in particular for silicon with the overlap metric, where the force evaluation takes 45% and 3% of the time required by the standard method, with and without ADMM, respectively. On the other hand, for the sparser water system with an ADMM basis optimized for four-center integrals and derivatives, RI-HFX is far less efficient.

FIG. 2. Component-wise RMSE on atomic forces for periodic systems, as a function of the RI metric range. The reference force is that of a standard four-center HF calculation using the truncated Coulomb potential, with a range of 6 Å. Ovlp stands for the overlap metric, and TC RC for the truncated Coulomb metric with a truncation radius of RC Å. The relative timing of the reference 4c-HFX calculation is always set to 1. The water system is made of 64 molecules with the ccGRB-D basis set and the admm-dzp auxiliary basis set when ADMM is used. The silicon system is made of 128 atoms with the ccGRB-D/admm-dzp basis sets. RI basis sets were automatically generated.

The performance of standard 4c-HFX and RI-HFX depends strongly on the choice of some threshold values passed in the CP2K input file. For the former, EPS_SCHWARZ and EPS_SCHWARZ_FORCES control the Cauchy–Schwarz pre-screening of the four-center ERIs (λσ|μν) and their derivatives (λσ|μν)^(x). For RI-HFX, the EPS_FILTER parameter controls the sparsity of all two- and three-center tensors, discarding elements smaller than the given threshold. The 4c-HFX calculations reported in this section used the default value of 10^−10 for the integrals and 10^−8 for the derivatives. All RI-HFX calculations used the default filtering threshold of 10^−9. These choices should ensure high enough accuracy for meaningful comparisons between the two methods. Changing these values would affect the exact timings reported in this work, but not the observed trends. The same holds if different basis sets were used or if the calculations were run on a different computer.
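For orientation, these thresholds appear in the CP2K input roughly as in the fragment below. This is a hedged sketch showing only the keywords discussed here; the exact section nesting should be checked against the CP2K reference manual for the version in use.

```
&XC
  &HF
    &SCREENING
      EPS_SCHWARZ        1.0E-10   ! Cauchy-Schwarz screening of 4c ERIs
      EPS_SCHWARZ_FORCES 1.0E-8    ! screening of the ERI derivatives
    &END SCREENING
    &RI                            ! activates RI-HFX (assumed nesting)
      EPS_FILTER 1.0E-9            ! block-sparse tensor filtering threshold
    &END RI
  &END HF
&END XC
```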

2. Method scaling

The scaling of the RI-HFX and four-center HFX energy and forces with respect to system size was measured. Calculations with simulation cells of increasing size were run on both Piz Daint and LUMI-G, for liquid water and bulk silicon. The computational resources required are reported in Fig. 3. All calculations were executed with ADMM and the ccGRB-D/admm-dzp basis sets. The RI metrics of choice were the truncated Coulomb operator with a cutoff radius of 1.5 Å for water and the overlap for silicon. Note that all simulation cells are cubic for water but not for silicon. In the latter case, the 64 atom system is cubic, and larger cells are built by duplication along one direction (2 × 1 × 1, 2 × 2 × 1, 2 × 2 × 2). The filtering threshold for the RI-HFX calculation was kept at its default value of 10^−9, and so was the Schwarz screening threshold for the ERIs (10^−10). The Schwarz screening threshold for the derivatives was set to the default value of 10^−6 for water and to 10^−7 for silicon.

FIG. 3. Scaling of the RI-HFX and four-center HFX energy and force calculations with respect to system size. The systems considered are liquid water and bulk silicon, with both ADMM-PBE0 (using a fixed range truncated Coulomb potential) and the ccGRB-D/admm-dzp basis sets. All systems were calculated on both the GPU partition of Piz Daint and LUMI-G. Sub-figures (a) and (c) correspond to liquid water on Piz Daint and LUMI-G, respectively, while sub-figures (b) and (d) correspond to bulk silicon.

When performing calculations in periodic boundary conditions (PBCs), the slowly decaying 1/r12 Coulomb potential leads to spurious exchange between electrons and their periodic images.60 This can be avoided by using short range exchange potentials, such as the truncated Coulomb operator or the error function.82 Therefore, in the limit of large periodic systems with a fixed range potential, standard four-center HFX is expected to scale linearly with system size. RI-HFX, on the other hand, still involves the contraction of three-center tensors, thus keeping its cubic scaling behavior. The log–log fits in Fig. 3 confirm these expectations, with a scaling of O(N^1.0–1.4) for four-center HFX and O(N^2.2–2.6) for RI-HFX. Note that the latter performs better than its predicted cubic scaling. This is due to the high levels of sparsity achieved in large systems because of the use of a short range RI metric. For dense systems such as bulk silicon, the prefactor of the linear scaling four-center HFX is rather high. Therefore, calculating smaller systems with the steeper scaling RI-HFX method is still advantageous. Bear in mind that the exact crossover point between the two HFX methods can vary widely depending on the screening thresholds used. The small variations in the fitted scaling across the different computers come from calculations being run at slightly different levels of parallel efficiency.

3. Strong scaling

Figures 4–7 summarize the strong scaling experiments conducted with HFX for this work. Several steps of molecular dynamics at the PBE083 level were taken for a liquid water system of 64 molecules, with and without the ADMM approximation. The same calculation was run with an increasing amount of computational resources, and the average wall time per MD step was reported. For bulk silicon, a PBE0 geometry optimization was run instead for a system of 128 atoms. The ccGRB-D/admm-dzp basis sets and an automatically generated RI basis were used throughout. In all cases, the time related to the first step, where the SCF takes longer to converge, was left out of the average. The screening and filtering thresholds used here are the same as those in Sec. IV A 2.

FIG. 4. Strong scaling of the four-center and RI-HFX methods for molecular dynamics on Piz Daint and LUMI-G. The benchmark system is 64 molecules of liquid water, with the ADMM-PBE0 hybrid functional and the ccGRB-D/admm-dzp basis sets. The average wall time per MD step is reported (the initial step is omitted from the mean).
FIG. 5. Strong scaling of the four-center and RI-HFX methods for molecular dynamics on Piz Daint and LUMI-G. The benchmark system is 64 molecules of liquid water, with the PBE0 hybrid functional and the ccGRB-D basis sets. The average wall time per MD step is reported (the initial step is omitted from the mean).
FIG. 6. Strong scaling of the four-center and RI-HFX methods for geometry optimization on Piz Daint and LUMI-G. The benchmark system is 128 atoms of bulk silicon, with the ADMM-PBE0 hybrid functional and the ccGRB-D/admm-dzp basis sets. The average wall time per GEO_OPT step is reported (the initial step is omitted from the mean).
FIG. 7. Strong scaling of the four-center and RI-HFX methods for geometry optimization on Piz Daint and LUMI-G. The benchmark system is 128 atoms of bulk silicon, with the PBE0 hybrid functional and the ccGRB-D basis sets. The average wall time per GEO_OPT step is reported (the initial step is omitted from the mean).

Both RI-HFX and four-center HFX implementations exhibit good strong scaling behavior, regardless of the machine on which the calculation was run, and the strong scaling efficiency is high across the board. The GPU acceleration of the RI-HFX method is good overall. On Piz Daint, the average speedup due to GPUs is 1.64 and 1.95 for ADMM and non-ADMM runs, respectively. On LUMI-G, the average GPU acceleration is a factor of 2.37 for ADMM and of 2.95 for non-ADMM calculations. Tables displaying parallel efficiencies and speedups, as well as an in-depth discussion comparing Piz Daint and LUMI-G performances, are available in the supplementary material.

4. Discussion

RI-HFX offers a good alternative to standard four-center HFX in periodic systems, but only under certain circumstances. For sparse systems with basis sets optimized for (λσ|μν) ERI screening, such as ADMM liquid water, four-center HFX is likely to outperform RI-HFX. However, for denser systems of small to medium size, the newer implementation is expected to be superior. This is especially true when the ADMM approximation is not used, as illustrated by the silicon tests. It is important to note that when both RI and ADMM are used at the same time, the accuracy will, at best, be that of ADMM. Pure RI, while slower, can still be used to accelerate HFX calculations with a higher accuracy.62 RI-HFX can also potentially outperform 4c-HFX when memory is scarce and not all four-center ERIs (λσ|μν) can be stored. In such a case, a fraction of the integrals must be recalculated at each SCF step, which can significantly slow down the procedure. In contrast, the (λσP) RI integrals require far less memory for storage. When non-periodic calculations are performed with the 1/r12 Coulomb operator, the quartic scaling of four-center HFX with the basis set size is a major concern; RI-HFX offers an efficient alternative, should large basis sets be required (see the supplementary material for an example). Finally, RI-HFX is GPU accelerated, while 4c-HFX is not. There is, therefore, more room for improvement for the former, as GPU hardware keeps improving and the low-level DBT and DBM code can be optimized accordingly.

The low-scaling post-HF benchmarks cover scaling tests for the cost of the methods with respect to system size, as well as strong scaling measurements. The selected systems are liquid water and bulk rutile TiO2. This choice is meant to be representative because of the varying densities and basis set sizes, especially with respect to the angular momentum quantum numbers l of the Ti atoms. The two variants of correlated wave function methods studied in this section are standard SOS-MP2 with c_OS = 1.3 on top of a HF ground state, and RPA@PBE0. The RI metrics used with each system are the same throughout this section: the overlap metric for titania and the truncated Coulomb metric with a 1.5 Å cutoff for water.

1. Method scaling

Figure 8 illustrates the scaling of the sparse tensor based SOS-MP2 and RPA methods as a function of system size for energy and force calculations. All SCF and Z-vector equations were converged within the ADMM approximation, with 4c-HFX for water and RI-HFX for TiO2 (overlap metric). The admm-dzp auxiliary basis set was used, as well as cc-DZ and RI_DZ77 for water and ccGRB-D for titania. The post-HF RI basis set for the latter system is an uncontracted version of the automatically generated one, for optimal LIBINT performance when calculating three-center ERIs. Note that double-zeta basis sets are generally considered to be of too low quality for production level post-HF calculations.84 However, since the scaling of the methods with system size is independent of the basis set, this allows for the benchmarking of large systems at reasonable cost. All calculations were run with a six-point minimax grid. All liquid water cells are cubic. The smallest titania cell is orthorhombic with dimensions of 9.33 × 9.33 × 6.07 Å^3, and all larger cells are built by duplicating it along one or more dimensions (1 × 1 × 2, 1 × 2 × 2, 2 × 2 × 2). Note that the figure lacks data points for the largest TiO2 system on LUMI-G, due to stability issues preventing the completion of long calculations on a large number of nodes.

FIG. 8. Scaling of the sparse tensor based SOS-MP2 and RPA@PBE0 energy and force calculations with respect to system size. The systems considered are liquid water and bulk rutile TiO2. Water is simulated with the cc-DZ and the optimized RI_DZ basis sets, while TiO2 uses ccGRB-D and an automatically generated and decontracted RI basis. Note that the SCF is converged within the ADMM approximation using the admm-dzp basis in all cases, with four-center HFX for water and RI-HFX for TiO2. All systems were calculated on both the GPU partitions of Piz Daint and LUMI-G. Sub-figures (a) and (c) correspond to liquid water on Daint and LUMI-G, respectively, while sub-figures (b) and (d) correspond to rutile titania.

The nominal scaling of these AO and sparse tensor based implementations is cubic. However, the log–log fits of Fig. 8 yield effective scalings of O(N^2.3–2.5) for liquid water and O(N^2.5–2.7) for bulk TiO2. As previously mentioned in the context of RI-HFX, this is due to the increasing levels of sparsity with system size. Table II summarizes the occupancy of the three-center ERIs (λσP) for all systems considered. Note that despite the use of a shorter ranged RI metric for TiO2, occupancy is generally higher. These numbers account for most of the difference in cost between the water and titania calculations, i.e., the offset between the scaling curves. The calculation of the three-center integrals and derivatives is also much more expensive for the latter system, due to the large basis sets and high angular momenta associated with titanium atoms. However, their relative cost is directly proportional to occupancy, ranging from 37% of the total cost for the 1 × 1 × 1 system to 2% for 2 × 2 × 2 (including ADMM RI-HFX integrals). Note that the RPA and SOS-MP2 curves are virtually identical. This is to be expected since most of the tensor contraction code is shared between the two methods, and they both rely on a hybrid-DFT level SCF and Z-vector equation.

TABLE II. Occupancy and number of non-zero elements of the post-HF three-center ERI tensor (λσP) as a function of system size. The values are identical for SOS-MP2 and RPA. The sparsity levels of the three-center derivatives (λσP)^(x) are similar.

        Liquid water                       Bulk TiO2
Size   Occupancy (%)   Non-zero      Size   Occupancy (%)   Non-zero
 32        45.7        5.9 × 10^8      48       60.1        1.5 × 10^9
 64        14.7        1.5 × 10^9      96       39.0        7.5 × 10^9
128         3.7        3.1 × 10^9     192       16.1        2.5 × 10^10
256         0.9        6.1 × 10^9     384        6.2        7.6 × 10^10
In the paper describing the implementation of the low-scaling RPA energy method,31 the reported fitted scaling for liquid water is O(N^2.2), which is more advantageous than the scaling observed in this work. For a better comparison, the specific timings due to energy and forces were isolated, and their scaling was fitted. A summary is available in Table III. We observe that the energy part of the calculation scales more favorably and that the overall behavior is dominated by the forces. This is because of the following component of the correlation force from Eq. (29), due to the three-center derivatives:
Σ_j (μνT)^(x) [K B_j K^T]_{TR} T_{μνR}    (33)
where some densification of the T_{μνR} tensor is unavoidable. The cost of the final contraction with the two-center tensor K B_j K^T can be kept low, because the sparsity of the resulting three-center quantity is restricted to that of the (μνT)^(x) derivatives. The densification of T_{μνR} happens because of the exponential in the definition of the pseudo-densities [Eq. (14)], which leads to matrices with a lower sparsity than the standard density matrix. In the limit of very large systems, where the pseudo-densities become sparse, the scaling of energy and forces is expected to be the same. In this work, the scaling of the RPA energy for water is O(N^1.8–1.9). This improvement with respect to the original implementation is likely due to the refactoring of the code with the new sparse tensor library, although other factors, such as the basis set and RI metric choices, may have some influence as well.
TABLE III. Detailed fitted scaling of the energy and force timings for the post-HF calculations. The total times are considered, including those from the SCF.

                      Liquid water        Bulk TiO2
                     Energy  Forces     Energy  Forces
SOS-MP2, Daint        1.9     2.6        2.6     2.7
SOS-MP2, LUMI         1.9     2.4        2.4     2.8
RPA@PBE0, Daint       1.8     2.6        2.5     2.6
RPA@PBE0, LUMI        1.9     2.5        2.5     2.7

Regardless of system sparsity, operations such as matrix inversion and square roots are necessary for both energy and forces. In the current implementation, these are performed on dense matrices via the ScaLAPACK library.85 Therefore, in the limit of very large systems, all post-HF calculations are expected to scale cubically. Note that the scaling numbers reported in Table III are specific to the systems studied in this work and not a general result of the implementation. For the same range of systems, the cost of calculating the gradients ranges from two to eight times the cost of an energy calculation, as illustrated in Table IV for SOS-MP2. Similar numbers are observed for RPA.

TABLE IV. Relative timing of the SOS-MP2 force calculation with respect to energy. The total times are considered, including those from the SCF.

      Liquid water            Bulk TiO2
Size   Daint   LUMI     Size   Daint   LUMI
 32     2.0    1.9       48     3.4    3.7
 64     2.5    2.6       96     3.2    3.2
128     4.0    4.7      192     3.1    3.4
256     8.2    5.7      384     4.2    ⋯

2. Strong scaling

The strong scaling behavior of the SOS-MP2 and RPA@PBE0 forces was measured on Piz Daint and LUMI-G. The results are displayed in Figs. 9–12. All calculations use the triple-zeta cc-TZ and RI_TZ basis sets to emulate production level conditions. The HFX exchange contribution to the SCF and the Z-vector equation is calculated within the ADMM approximation with the admm-tzp auxiliary basis. The standard four-center HFX method was selected for liquid water, and the RI-HFX implementation with an overlap metric was selected for TiO2.

FIG. 9. Strong scaling of the sparse tensor based SOS-MP2 implementation for energy and forces on Piz Daint and LUMI-G. The benchmark system is 64 molecules of liquid water, with the cc-TZ and optimized RI_TZ basis sets. The SCF is converged at the pure ADMM-HF level with the admm-tzp basis set.
FIG. 10. Strong scaling of the sparse tensor based SOS-MP2 implementation for energy and forces on Piz Daint and LUMI-G. The benchmark system is 72 atoms of bulk rutile TiO2, with the cc-TZ and optimized RI_TZ basis sets. The SCF is converged at the pure RI-ADMM-HF level with the admm-tzp basis set.
FIG. 11. Strong scaling of the sparse tensor based RPA@PBE0 implementation for energy and forces on Piz Daint and LUMI-G. The benchmark system is 64 molecules of liquid water, with the cc-TZ and optimized RI_TZ basis sets. The SCF is converged at the ADMM-PBE0 level with the admm-tzp basis set.
FIG. 12. Strong scaling of the sparse tensor based RPA@PBE0 implementation for energy and forces on Piz Daint and LUMI-G. The benchmark system is 72 atoms of bulk rutile TiO2, with the cc-TZ and optimized RI_TZ basis sets. The SCF is converged at the RI-ADMM-PBE0 level with the admm-tzp basis set.

Strong scaling performance is good overall, at least up to 256 nodes on Piz Daint and 32 nodes on LUMI-G. High speedups and favorable parallel efficiencies are observed with increasing computational resources. GPU acceleration is good on both machines: on Piz Daint, the average speedup is 1.50, while on LUMI-G, it amounts to 2.79. Note that these numbers are similar to those observed for GPU accelerated RI-HFX. Tables displaying parallel efficiencies and speedups, as well as an in-depth discussion comparing Piz Daint and LUMI-G performance, are available in the supplementary material.

3. Discussion

The implementations of low-scaling post-HF forces in CP2K show good performance. Although forces do not scale as favorably with system size as energies, sparsity keeps their effective scaling below the cubic regime. Good strong scaling behavior is also observed across the board, and GPU acceleration is significant, especially on the LUMI-G machine.

Despite their advantageous scaling, these AO based methods come with a high prefactor. This was observed for RPA energies,31 and the same is expected for forces. Therefore, there remains a place for the more standard quartic scaling MO based force implementations, which should outperform the methods discussed in this work for small to medium systems.
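As a concrete illustration of this crossover, consider the following toy cost model in Python. The prefactors are purely hypothetical and serve only to show the shape of the argument; real prefactors depend on the system, basis sets, and hardware.

# Toy cost model: cubic-scaling AO based method with a large prefactor
# vs a quartic-scaling MO based method with a small one.
# Prefactors c_ao and c_mo are hypothetical, for illustration only.
c_ao, c_mo = 50.0, 1.0
cost_ao = lambda n: c_ao * n**3
cost_mo = lambda n: c_mo * n**4
# The costs cross where c_ao * N^3 == c_mo * N^4, i.e., N = c_ao / c_mo.
print(f"AO based variant is cheaper beyond N = {c_ao / c_mo:.0f}")
for n in (10, 50, 100):
    print(n, cost_ao(n), cost_mo(n))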

All calculations reported in this section used user-provided, uncontracted RI basis sets. Even the TiO2 method scaling tests, for which no optimized double-zeta RI basis was available for titanium, were based on a manually uncontracted version of the automatically generated basis. Uncontracted basis sets are optimal for three-center ERI calculations in CP2K with the LIBINT library, which is relevant when larger basis sets, and especially high angular momenta, are required. Moreover, higher levels of sparsity can be achieved in AO based tensors than with contracted basis sets. However, this comes with a trade-off: uncontracted basis sets typically have more elements, leading to larger tensors overall. All non-standard basis sets used in this work are available in the supplementary material.

The default filtering threshold of 10⁻⁹ for sparse tensor contractions was used throughout. Looser thresholds can speed up calculations, but careful testing is advised to make sure accuracy is retained. The same goes for the choice of the RI metric: for liquid water, it was selected to be the truncated Coulomb potential with a 1.5 Å cutoff radius. A shorter range would lead to greater sparsity and more efficient calculations, but tests should be conducted to ensure that accuracy is preserved.
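The role of the filtering threshold can be illustrated with a toy block tensor whose block norms decay with the distance between centers. This sketch is not the DBT implementation, and all numbers are synthetic; it only shows how the threshold trades a small, controllable error for higher sparsity.

# Toy norm-based block filtering: blocks whose Frobenius norm falls
# below the threshold are dropped from the sparse representation.
import numpy as np
rng = np.random.default_rng(0)
nb, bs = 64, 5
t = rng.standard_normal((nb, nb, bs, bs))
# Mimic the decay of integrals between increasingly distant atom pairs.
i = np.arange(nb)
t *= np.exp(-0.3 * np.abs(i[:, None] - i[None, :]))[:, :, None, None]
norms = np.linalg.norm(t, axis=(2, 3))
for eps in (1e-9, 1e-6, 1e-3):
    print(f"eps = {eps:.0e}: block occupancy {(norms > eps).mean():.1%}")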

The default three-center tensor block size of 5 × 5 × 5 was used consistently. The ideal block size is a trade-off between the higher sparsity achievable with smaller blocks and the efficiency of block-wise matrix–matrix multiplications with larger ones. As discussed previously, one of the bottlenecks of the low-scaling post-HF force implementation is tensor densification [Eq. (33)]. Because energy and forces scale differently, the latter overwhelmingly dominate timings for very large systems. In such cases, calculations can be accelerated by using larger block sizes, such that denser tensors are contracted more efficiently.
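This trade-off can be sketched with a toy element-wise sparsity pattern re-blocked at different block sizes. The data are synthetic (real occupancy patterns are structured by the atomic geometry), but the trend is generic: larger blocks raise the fraction of non-empty blocks, leaving less sparsity to exploit, while making each kept block a larger, more GPU-friendly multiplication.

# Toy block size trade-off: fixed element-wise sparsity, varying blocking.
import numpy as np
rng = np.random.default_rng(1)
n = 300
mask = rng.random((n, n)) < 0.05  # toy pattern: 5% nonzero elements
for bs in (5, 15, 30):
    nb = n // bs
    occ = mask.reshape(nb, bs, nb, bs).any(axis=(1, 3)).mean()
    print(f"block size {bs:2d}: {occ:.1%} of blocks non-empty")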

In this paper, we reported the efficient implementation of cubic-scaling RI-HFX, SOS-MP2, and direct RPA energy gradients for periodic systems in CP2K. All methods rely on the resolution-of-the-identity approximation in the AO basis, making sparse tensor contractions the computational bottleneck of each calculation. These are efficiently handled by the recently implemented DBT and DBM libraries, also presented in this work. Performance benchmark calculations on a variety of systems show good strong scaling behavior and parallel efficiency up to a large number of nodes. The implementations benefit from robust GPU acceleration, reaching a speedup factor of 2.95 on the state-of-the-art LUMI-G supercomputer.

All three methods enjoy sub-cubic effective scaling because of tensor sparsity. In the case of RI-HFX, this compares unfavorably with the standard four-center integrals implementation, which scales linearly with system size. However, for small to medium size dense systems and molecules, RI-HFX is more efficient. The scaling of the AO based SOS-MP2 and RPA energy gradients is advantageous compared to their quartic scaling MO based counterparts, although their large prefactor is expected to make them competitive for larger systems only.

This work enables the calculation of large systems at double-hybrid DFT accuracy. While the code scales well on large supercomputers, the calculation of many molecular dynamics frames remains expensive. These methods can, however, be used to generate machine learning training data.86,87 Compared to the MO based RPA and SOS-MP2 methods, the low-scaling post-HF implementations speed up the overall fitting process for large systems. This should allow for long MD trajectories of chemically interesting systems to be produced at an even lower cost while keeping high accuracy.

See the supplementary material for the detailed derivation of the low-scaling post-HF energy gradient equations, RI-HFX supporting figures, and custom basis sets.

This work was funded by the Swiss Platform for Advanced Scientific Computing (PASC) as part of the 2021–2024 “Ab Initio Molecular Dynamics at the Exa-Scale” project. Computing time on Piz Daint was provided by the CSCS Swiss National Supercomputing Centre under Account Nos. d110 and c18. LUMI-G computing time was made available as part of its pilot phase, under Project No. 465000151.

The authors have no conflicts to disclose.

Augustin Bussy: Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Ole Schütt: Software (equal); Writing – original draft (equal). Jürg Hutter: Funding acquisition (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal).

The data that support the findings of this study, namely CP2K input and output files, are openly available in the Materials Cloud at http://doi.org/10.24435/materialscloud:kc-ex.

In CP2K, the RI-HFX gradients are calculated on top of a converged SCF energy calculation. Therefore, it is assumed that certain quantities are already available. This is the case for the three-center ERIs (μν|S), the two-center ERIs in the RI basis (P|Q) and V^-1_PQ, and the density matrix in the AO basis P_μν. The derivatives of the ERIs need to be calculated, and certain properties can be exploited for optimal storage and efficiency. For example, translational invariance allows for the calculation and storage of the two-center derivatives with respect to one center only,

(P|Q)^(x_P) = -(P|Q)^(x_Q),  (A1)

where x_P denotes the x coordinate of the P center. Therefore, only three matrices (for x, y, and z) are required for the storage of (P|Q)^(x) and V^(x)_PQ, respectively. Similarly, since μ and ν can be symmetrically swapped in the three-center ERIs (μν|S), a total of six tensors are required for their derivatives (x, y, and z, with respect to the AO and RI centers). Note that the derivative of the inverse RI metric V^-1_PQ is also required. Since V V^-1 = 1, it is calculated as

(V^-1)^(x) = -V^-1 V^(x) V^-1.  (A2)
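Equation (A2) is straightforward to verify numerically. The following sketch checks the identity by finite differences; the symmetric positive definite matrix v and the symmetric perturbation vx are random stand-ins for the RI metric and its derivative.

# Finite-difference check of Eq. (A2): d(V^-1)/dx = -V^-1 V^(x) V^-1.
import numpy as np
rng = np.random.default_rng(2)
n = 6
a = rng.standard_normal((n, n))
v = a @ a.T + n * np.eye(n)      # SPD stand-in for the RI metric V
vx = rng.standard_normal((n, n))
vx = 0.5 * (vx + vx.T)           # symmetric stand-in for V^(x)
h = 1e-6
fd = (np.linalg.inv(v + h * vx) - np.linalg.inv(v - h * vx)) / (2 * h)
an = -np.linalg.inv(v) @ vx @ np.linalg.inv(v)
print(np.max(np.abs(fd - an)))   # agreement to finite-difference accuracy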

The algorithm used in the code is outlined in Table V. At step 4, a loop over batches of AO indices is initiated. This ensures that the tensor T^1_μνP fits in memory, since the contraction of the three-center ERIs with the density matrix leads to higher occupancy. At steps 7 and 8, the tensor T^2_μνS is calculated before its trace with the three-center derivatives is taken. In principle, T^2_μνS would be a fairly dense tensor. However, we can force it to take the sparsity pattern of (μν|S)^(x), which leads to savings in both memory and computation time.

TABLE V. Pseudocode for RI-HFX forces.

1   Pre-calculate and store the two-center derivatives (Q|R)^(x) and V_QS^(x)
2   Pre-calculate and store the 3c derivatives (μν|S)^(x)
3   Perform the matrix product M^1_PS = V^-1_PQ (Q|R) V^-1_RS
4   Loop over batches of μ and ν AOs:
5     Contract the density matrices with the 3c integrals: T^1_μνP = P_λμ P_σν (λσ|P)
6     Loop over batches of S RI basis elements:
7       Perform the contraction T^2_μνS = T^1_μνP M^1_PS
8       Calculate the trace E_x^HF,(x) -= T^2_μνS (μν|S)^(x)
9     Perform the contraction R_PS += T^1_μνP (μν|S)
10  Perform the matrix product M^2_QR = V^-1_QP R_PS V^-1_SR
11  Calculate the trace E_x^HF,(x) -= (1/2) M^2_QR (Q|R)^(x)
12  Perform the matrix product M^3_QS = M^2_QR (R|P) V^-1_PS
13  Calculate the trace E_x^HF,(x) -= M^3_QS V_QS^(x)

The implementation of the RPA energy gradient in CP2K is summarized as pseudocode in Tables VI–VIII. It is assumed that certain quantities, such as the two- and three-center integrals and the matrices P(τ_j) and Q(τ_j), are available from the energy calculation. The considerations about two- and three-center derivatives made in the RI-HFX appendix apply here as well. Note that the SOS-MP2 algorithm is easily recovered by setting B^j_PQ = Q̄_PQ(τ_j) and ε_F = 0. At step 3 of Table VIII, the tensor T^1_σμT is calculated. Its occupancy is typically significantly higher than that of the three-center integrals. To save memory, it is stored in the compressed format described in Ref. 88. Note that the three-center operations described in Table VIII are performed over batches of AOs and RI basis elements. Because tensor densification happens as the algorithm progresses, this scheme ensures that all relevant three-center quantities can be stored in memory, which is especially critical on machines with low-memory GPUs. The number of batches along each tensor dimension can be set in the CP2K input via the MEMORY_CUT keyword; a high value leads to a lower memory footprint at the cost of more overhead. Finally, the sparsity pattern of the tensor T^3_σλT has to match that of the three-center derivatives because of the trace at step 14. This can be exploited to save memory and accelerate the execution of step 13. Note that the R matrices calculated at steps 8 and 9 are passed on to the rest of the code.
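The effect of this batching on the memory footprint can be sketched with a dense toy analogue. The array shapes and the number of batches are illustrative only; the production code operates on block sparse tensors.

# Sketch of the MEMORY_CUT batching idea: splitting a contraction over
# one index bounds the size of the densified intermediate held at once.
import numpy as np
rng = np.random.default_rng(4)
nao, nri, memory_cut = 12, 24, 3
eri3 = rng.standard_normal((nao, nao, nri))  # (σμ|R), toy dense stand-in
m1 = rng.standard_normal((nri, nri))         # M^1_RT of Table VIII
t1 = np.empty((nao, nao, nri))               # T^1_σμT
for batch in np.array_split(np.arange(nao), memory_cut):
    # Only one slab of the dense intermediate is formed at a time.
    t1[batch] = np.einsum("smR,RT->smT", eri3[batch], m1)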

TABLE VI. Pseudocode for post-HF forces.

1   Pre-calculate and store the two-center derivatives (P|R)^(x) and V_PR^(x)
2   Pre-calculate and store the 3c derivatives (σλ|T)^(x)
3   Calculate B^j_PQ = (1/π) Σ_k σ_k λ_kj cos(τ_j ω_k) {[1 + Q(ω_k)]^-1_PQ - δ_PQ}
4   Loop over τ_j quadrature points:
5     Perform two-center operations: Table VII
6     Perform three-center operations: Table VIII
7     Perform the matrix multiplications A_1 = τ_j P^occ (F - ε_F S), A_2 = -τ_j P^virt (F - ε_F S)
8     Iteratively calculate the matrices Y_1, Y_2, e^A_1, and e^A_2
9     Perform the matrix multiplications:
        M^1_μν = [S^-1 (e^{τ_j P^virt (F - ε_F S)})^T R^occ S^-1]_μν,
        M^2_μν = τ_j [S^-1 Y_2^T (F - ε_F S) S^-1]_μν,
        M^3_μν = τ_j [P^occ Y_1^T - P^virt Y_2^T]_μν
10    Calculate the trace E^RPA,(x) -= [M^1 - M^2 + ε_F M^3]_μν S^(x)_μν
11    Calculate the trace E^RPA,(x) += M^3_μν F^(x)_μν
12    Add to the Z-vector equation RHS:
        O += R^virt e^A_1 - R^occ e^A_2 + (M^3)^T F[P^occ] P^occ + τ_j (F - ε_F S)(Y_1 + Y_2)
13  Solve the Z-vector equation and update the forces
TABLE VII. Post-HF two-center operations.

1   Perform the matrix multiplications
      M^1_PQ = [V^-1 P(τ_j) K B^j]_PQ,
      M^2_PR = [M^1 (Q|R)^{1/2}]_PR,
      M^3_PR = [M^1 K^T]_PR
2   Calculate the trace E^RPA,(x) += M^2_PR (P|R)^(x)
3   Calculate the trace E^RPA,(x) -= 2 M^3_PR V^(x)_PR
TABLE VIII. Post-HF three-center operations.

1   Calculate the pseudo-densities D^occ_μλ(τ_j) and D^virt_νσ(τ_j)
2   Perform the matrix multiplication M^1_RT = [K B^j K^T]_RT
3   Perform the tensor contraction T^1_σμT = (σμ|R) M^1_RT
4   Loop over batches of σ AOs:
5     Calculate M^occ_σνT = (λν|T) D^occ_σλ(τ_j)
6     Calculate M^virt_σνT = (λν|T) D^virt_σλ(τ_j)
7     Loop over batches of T RI basis elements:
8       Calculate R^occ_μν += T^1_σμT M^occ_σνT
9       Calculate R^virt_μν += T^1_σμT M^virt_σνT
10    Loop over batches of λ AOs:
11      Contract T^2_σλR = M^occ_σνR D^virt_λν(τ_j)
12      Loop over batches of T RI basis elements:
13        Contract T^3_σλT = T^2_σλR M^1_RT
14        Calculate the trace E^RPA,(x) += 2 T^3_σλT (σλ|T)^(x)
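The three-center operations of Table VIII admit a similar dense np.einsum transcription. Again, this is only a sketch with random stand-in arrays; the implementation batches every loop and keeps all three-center quantities block sparse.

# Dense einsum transcription of the Table VIII contraction pattern (sketch).
import numpy as np
rng = np.random.default_rng(5)
nao, nri = 8, 16
eri3 = rng.standard_normal((nao, nao, nri))    # (σμ|R)
eri3_x = rng.standard_normal((nao, nao, nri))  # (σλ|T)^(x)
d_occ = rng.standard_normal((nao, nao))        # D^occ(τ_j)
d_virt = rng.standard_normal((nao, nao))       # D^virt(τ_j)
m1 = rng.standard_normal((nri, nri))           # M^1 = K B^j K^T

t1 = np.einsum("smR,RT->smT", eri3, m1)          # step 3
m_occ = np.einsum("lnT,sl->snT", eri3, d_occ)    # step 5
m_virt = np.einsum("lnT,sl->snT", eri3, d_virt)  # step 6
r_occ = np.einsum("smT,snT->mn", t1, m_occ)      # step 8
r_virt = np.einsum("smT,snT->mn", t1, m_virt)    # step 9
t2 = np.einsum("snR,ln->slR", m_occ, d_virt)     # step 11
t3 = np.einsum("slR,RT->slT", t2, m1)            # step 13
de = 2.0 * np.einsum("slT,slT->", t3, eri3_x)    # step 14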
1. S. Grimme, "Semiempirical hybrid density functional with perturbative second-order correlation," J. Chem. Phys. 124, 034108 (2006).
2. L. Goerigk, A. Hansen, C. Bauer, S. Ehrlich, A. Najibi, and S. Grimme, "A look at the density functional theory zoo with the advanced GMTKN55 database for general main group thermochemistry, kinetics and noncovalent interactions," Phys. Chem. Chem. Phys. 19, 32184–32215 (2017).
3. G. Santra, N. Sylvetsky, and J. M. L. Martin, "Minimally empirical double-hybrid functionals trained against the GMTKN55 database: revDSD-PBEP86-D4, revDOD-PBE-D4, and DOD-SCAN-D4," J. Phys. Chem. A 123, 5129–5143 (2019).
4. J. M. L. Martin and G. Santra, "Empirical double-hybrid density functional theory: A 'third way' in between WFT and DFT," Isr. J. Chem. 60, 787–804 (2020).
5. C. Møller and M. S. Plesset, "Note on an approximation treatment for many-electron systems," Phys. Rev. 46, 618 (1934).
6. F. Furche, "Molecular tests of the random phase approximation to the exchange-correlation energy functional," Phys. Rev. B 64, 195120 (2001).
7. C. Schwartz, "Importance of angular correlations between atomic electrons," Phys. Rev. 126, 1015 (1962).
8. O. Franck, B. Mussard, E. Luppi, and J. Toulouse, "Basis convergence of range-separated density-functional theory," J. Chem. Phys. 142, 074107 (2015).
9. K. Sharkas, J. Toulouse, L. Maschio, and B. Civalleri, "Double-hybrid density-functional theory applied to molecular crystals," J. Chem. Phys. 141, 044105 (2014).
10. G. Sansone, B. Civalleri, D. Usvyat, J. Toulouse, K. Sharkas, and L. Maschio, "Range-separated double-hybrid density-functional theory applied to periodic systems," J. Chem. Phys. 143, 102811 (2015).
11. J. Cheng and J. VandeVondele, "Calculation of electrochemical energy levels in water using the random phase approximation and a double hybrid functional," Phys. Rev. Lett. 116, 086402 (2016).
12. F. Stein, J. Hutter, and V. V. Rybkin, "Double-hybrid DFT functionals for the condensed phase: Gaussian and plane waves implementation and evaluation," Molecules 25, 5174 (2020).
13. F. Stein and J. Hutter, "Double-hybrid density functionals for the condensed phase: Gradients, stress tensor, and auxiliary-density matrix method acceleration," J. Chem. Phys. 156, 074107 (2022).
14. J. Toulouse, W. Zhu, J. G. Angyán, and A. Savin, "Range-separated density-functional theory with the random-phase approximation: Detailed formalism and illustrative applications," Phys. Rev. A 82, 032502 (2010).
15. F. Bruneval, "Range-separated approach to the RPA correlation applied to the van der Waals bond and to diffusion of defects," Phys. Rev. Lett. 108, 256403 (2012).
16. É. Brémond, M. Savarese, Á. J. Pérez-Jiménez, J. C. Sancho-García, and C. Adamo, "Range-separated double-hybrid functional from nonempirical constraints," J. Chem. Theory Comput. 14, 4052–4062 (2018).
17. P. D. Mezei and M. Kállay, "Construction of a range-separated dual-hybrid direct random phase approximation," J. Chem. Theory Comput. 15, 6678–6687 (2019).
18. H.-J. Werner, F. R. Manby, and P. J. Knowles, "Fast linear scaling second-order Møller-Plesset perturbation theory (MP2) using local and density fitting approximations," J. Chem. Phys. 118, 8149–8160 (2003).
19. C. Hättig, D. P. Tew, and B. Helmich, "Local explicitly correlated second- and third-order Møller–Plesset perturbation theory with pair natural orbitals," J. Chem. Phys. 136, 204105 (2012).
20. S. A. Maurer, L. Clin, and C. Ochsenfeld, "Cholesky-decomposed density MP2 with density fitting: Accurate MP2 and double-hybrid DFT energies for large systems," J. Chem. Phys. 140, 224112 (2014).
21. T. Schäfer, B. Ramberger, and G. Kresse, "Quartic scaling MP2 for solids: A highly parallelized algorithm in the plane wave basis," J. Chem. Phys. 146, 104101 (2017).
22. Y. Jung, Y. Shao, and M. Head-Gordon, "Fast evaluation of scaled opposite spin second-order Møller–Plesset correlation energies using auxiliary basis expansions and exploiting sparsity," J. Comput. Chem. 28, 1953–1964 (2007).
23. S. A. Maurer, J. Kussmann, and C. Ochsenfeld, "Communication: A reduced scaling J-engine based reformulation of SOS-MP2 using graphics processing units," J. Chem. Phys. 141, 051106 (2014).
24. C. Song and T. J. Martínez, "Atomic orbital-based SOS-MP2 with tensor hypercontraction. I. GPU-based tensor construction and exploiting sparsity," J. Chem. Phys. 144, 174111 (2016).
25. A. Förster, M. Franchini, E. van Lenthe, and L. Visscher, "A quadratic pair atomic resolution of the identity based SOS-AO-MP2 algorithm using Slater type orbitals," J. Chem. Theory Comput. 16, 875–891 (2020).
26. H. Eshuis, J. Yarkony, and F. Furche, "Fast computation of molecular random phase approximation correlation energies using resolution of the identity and imaginary frequency integration," J. Chem. Phys. 132, 234114 (2010).
27. D. Neuhauser, E. Rabani, and R. Baer, "Expeditious stochastic calculation of random-phase approximation energies for thousands of electrons in three dimensions," J. Phys. Chem. Lett. 4, 1172–1176 (2013).
28. M. Kaltak, J. Klimeš, and G. Kresse, "Low scaling algorithms for the random phase approximation: Imaginary time and Laplace transformations," J. Chem. Theory Comput. 10, 2498–2507 (2014).
29. M. Kállay, "Linear-scaling implementation of the direct random-phase approximation," J. Chem. Phys. 142, 204105 (2015).
30. H. F. Schurkus and C. Ochsenfeld, "Communication: An effective linear-scaling atomic-orbital reformulation of the random-phase approximation using a contracted double-Laplace transformation," J. Chem. Phys. 144, 031101 (2016).
31. J. Wilhelm, P. Seewald, M. Del Ben, and J. Hutter, "Large-scale cubic-scaling random phase approximation correlation energy calculations using a Gaussian basis," J. Chem. Theory Comput. 12, 5851–5859 (2016).
32. D. Graf, M. Beuerle, H. F. Schurkus, A. Luenser, G. Savasci, and C. Ochsenfeld, "Accurate and efficient parallel implementation of an effective linear-scaling direct random phase approximation method," J. Chem. Theory Comput. 14, 2505–2515 (2018).
33. I. Duchemin and X. Blase, "Separable resolution-of-the-identity with all-electron Gaussian bases: Application to cubic-scaling RPA," J. Chem. Phys. 150, 174120 (2019).
34. J. Yan, L. Li, and C. O'Grady, "Graphics processing unit acceleration of the random phase approximation in the projector augmented wave method," Comput. Phys. Commun. 184, 2728–2733 (2013).
35. M. Del Ben, J. Hutter, and J. VandeVondele, "Electron correlation in the condensed phase from a resolution of identity approach based on the Gaussian and plane waves scheme," J. Chem. Theory Comput. 9, 2654–2671 (2013).
36. D. Bykov and T. Kjaergaard, "The GPU-enabled divide-expand-consolidate RI-MP2 method (DEC-RI-MP2)," J. Comput. Chem. 38, 228–237 (2017).
37. M. Kabić, S. Pintarelli, A. Kozhevnikov, and J. VandeVondele, "COSTA: Communication-optimal shuffle and transpose algorithm with process relabeling," in International Conference on High Performance Computing (Springer, 2021), pp. 217–236.
38. N. C. Handy and H. F. Schaefer III, "On the evaluation of analytic energy derivatives for correlated wave functions," J. Chem. Phys. 81, 5031–5033 (1984).
39. M. Guidon, J. Hutter, and J. VandeVondele, "Auxiliary density matrix methods for Hartree–Fock exchange calculations," J. Chem. Theory Comput. 6, 2348–2364 (2010).
40. J. L. Whitten, "Coulombic potential energy integrals and approximations," J. Chem. Phys. 58, 4496–4501 (1973).
41. B. I. Dunlap, J. W. D. Connolly, and J. R. Sabin, "On some approximations in applications of Xα theory," J. Chem. Phys. 71, 3396–3402 (1979).
42. O. Vahtras, J. Almlöf, and M. W. Feyereisen, "Integral approximations for LCAO-SCF calculations," Chem. Phys. Lett. 213, 514–518 (1993).
43. T. D. Kühne, M. Iannuzzi, M. Del Ben, V. V. Rybkin, P. Seewald, F. Stein, T. Laino, R. Z. Khaliullin, O. Schütt, F. Schiffmann et al., "CP2K: An electronic structure and molecular dynamics software package—Quickstep: Efficient and accurate electronic structure calculations," J. Chem. Phys. 152, 194103 (2020).
44. P. O. Seewald, "Low-scaling electronic structure methods based on sparse tensor contraction," Ph.D. thesis, University of Zurich, 2020.
45. U. Borštnik, J. VandeVondele, V. Weber, and J. Hutter, "Sparse matrix multiplication: The distributed block-compressed sparse row library," Parallel Comput. 40, 47–58 (2014).
46. O. Schütt, P. Messmer, J. Hutter, and J. VandeVondele, "GPU-accelerated sparse matrix–matrix multiplication for linear scaling density functional theory," in Electronic Structure Calculations on Graphics Processing Units (John Wiley & Sons, Ltd., 2016), Chap. 8, pp. 173–190.
47. J. VandeVondele, U. Borštnik, and J. Hutter, "Linear scaling self-consistent field calculations with millions of atoms in the condensed phase," J. Chem. Theory Comput. 8, 3565–3573 (2012).
48. M. Krishnan and J. Nieplocha, "SRUMMA: A matrix multiplication algorithm suitable for clusters and scalable shared memory systems," in 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings (IEEE, 2004), p. 70.
49. G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà, and T. Hoefler, "Red-blue pebbling revisited: Near optimal parallel matrix-matrix multiplication," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19 (Association for Computing Machinery, New York, 2019).
50. J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger, "Communication-optimal parallel recursive rectangular matrix multiplication," in 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IEEE, 2013), pp. 261–272.
51. S. Grimme, "Improved second-order Møller–Plesset perturbation theory by separate scaling of parallel- and antiparallel-spin pair correlation energies," J. Chem. Phys. 118, 9095–9102 (2003).
52. Y. Jung, R. C. Lochan, A. D. Dutoi, and M. Head-Gordon, "Scaled opposite-spin second order Møller–Plesset correlation energy: An economical electronic structure method," J. Chem. Phys. 121, 9793–9802 (2004).
53. A. Sodt and M. Head-Gordon, "Hartree–Fock exchange computed using the atomic resolution of the identity approximation," J. Chem. Phys. 128, 104106 (2008).
54. F. Weigend, M. Kattannek, and R. Ahlrichs, "Approximated electron repulsion integrals: Cholesky decomposition versus resolution of the identity methods," J. Chem. Phys. 130, 164106 (2009).
55. F. Neese, F. Wennmohs, A. Hansen, and U. Becker, "Efficient, approximate and parallel Hartree–Fock and hybrid DFT calculations. A 'chain-of-spheres' algorithm for the Hartree–Fock exchange," Chem. Phys. 356, 98–109 (2009).
56. P. Merlot, T. Kjaergaard, T. Helgaker, R. Lindh, F. Aquilante, S. Reine, and T. B. Pedersen, "Attractive electron–electron interactions within robust local fitting approximations," J. Comput. Chem. 34, 1486–1496 (2013).
57. A. C. Ihrig, J. Wieferink, I. Y. Zhang, M. Ropo, X. Ren, P. Rinke, M. Scheffler, and V. Blum, "Accurate localized resolution of identity approach for linear-scaling hybrid density functionals and for many-body perturbation theory," New J. Phys. 17, 093020 (2015).
58. J. Kalinowski, F. Wennmohs, and F. Neese, "Arbitrary angular momentum electron repulsion integrals with graphical processing units: Application to the resolution of identity Hartree–Fock method," J. Chem. Theory Comput. 13, 3160–3170 (2017).
59. J. Spencer and A. Alavi, "Efficient calculation of the exact exchange energy in periodic systems using a truncated Coulomb potential," Phys. Rev. B 77, 193110 (2008).
60. M. Guidon, J. Hutter, and J. VandeVondele, "Robust periodic Hartree–Fock exchange for large-scale simulations using Gaussian basis sets," J. Chem. Theory Comput. 5, 3010–3021 (2009).
61. D. L. Strout and G. E. Scuseria, "A quantitative study of the scaling properties of the Hartree–Fock method," J. Chem. Phys. 102, 8448–8452 (1995).
62. E. Rebolini, R. Izsák, S. S. Reine, T. Helgaker, and T. B. Pedersen, "Comparison of three efficient approximate exact-exchange algorithms: The chain-of-spheres algorithm, pair-atomic resolution-of-the-identity method, and auxiliary density matrix method," J. Chem. Theory Comput. 12, 3514–3522 (2016).
63. E. F. Valeev, "Libint: A library for the evaluation of molecular integrals of many-body operators over Gaussian functions," http://libint.valeyev.net/, 2022, version 2.8.0.
64. G. L. Stoychev, A. A. Auer, and F. Neese, "Automatic generation of auxiliary basis sets," J. Chem. Theory Comput. 13, 554–562 (2017).
65. D. Cremer, "Density functional theory: Coverage of dynamic and non-dynamic electron correlation effects," Mol. Phys. 99, 1899–1940 (2001).
66. K. E. Riley and P. Hobza, "Assessment of the MP2 method, along with several basis sets, for the computation of interaction energies of biologically relevant hydrogen bonded and dispersion bound complexes," J. Phys. Chem. A 111, 8257–8263 (2007).
67. H. Eshuis and F. Furche, "A parameter-free density functional that works for noncovalent interactions," J. Phys. Chem. Lett. 2, 983–989 (2011).
68. M. Dion, H. Rydberg, E. Schröder, D. C. Langreth, and B. I. Lundqvist, "Van der Waals density functional for general geometries," Phys. Rev. Lett. 92, 246401 (2004).
69. A. Tkatchenko and M. Scheffler, "Accurate molecular van der Waals interactions from ground-state electron density and free-atom reference data," Phys. Rev. Lett. 102, 073005 (2009).
70. S. Grimme, J. Antony, S. Ehrlich, and H. Krieg, "A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H-Pu," J. Chem. Phys. 132, 154104 (2010).
71. S. Schweizer, B. Doser, and C. Ochsenfeld, "An atomic orbital-based reformulation of energy gradients in second-order Møller–Plesset perturbation theory," J. Chem. Phys. 128, 154101 (2008).
72. C. Moler and C. Van Loan, "Nineteen dubious ways to compute the exponential of a matrix," SIAM Rev. 20, 801 (1978).
73. M. Beuerle and C. Ochsenfeld, "Low-scaling analytical gradients for the direct random phase approximation using an atomic orbital formalism," J. Chem. Phys. 149, 244111 (2018).
74. A. Takatsuka, S. Ten-No, and W. Hackbusch, "Minimax approximation for the decomposition of energy denominators in Laplace-transformed Møller–Plesset perturbation theories," J. Chem. Phys. 129, 044112 (2008).
75. D. Kats, D. Usvyat, S. Loibl, T. Merz, and M. Schütz, "Comment on 'Minimax approximation for the decomposition of energy denominators in Laplace-transformed Møller–Plesset perturbation theories' [J. Chem. Phys. 129, 044112 (2008)]," J. Chem. Phys. 130, 127101 (2009).
76. G. Lippert, J. Hutter, and M. Parrinello, "A hybrid Gaussian and plane wave density functional scheme," Mol. Phys. 92, 477–488 (1997).
77. M. Del Ben, J. Hutter, and J. VandeVondele, "Second-order Møller–Plesset perturbation theory in the condensed phase: An efficient and massively parallel Gaussian and plane waves approach," J. Chem. Theory Comput. 8, 4177–4188 (2012).
78. M. Del Ben, J. Hutter, and J. VandeVondele, "Forces and stress in second order Møller-Plesset perturbation theory for condensed phase systems within the resolution-of-identity Gaussian and plane waves approach," J. Chem. Phys. 143, 102803 (2015).
79. I. Y. Zhang and X. Xu, "Doubly hybrid density functional for accurate description of thermochemistry, thermochemical kinetics and nonbonded interactions," Int. Rev. Phys. Chem. 30, 115–160 (2011).
80. É. Brémond, I. Ciofini, J. C. Sancho-García, and C. Adamo, "Nonempirical double-hybrid functionals: An effective tool for chemists," Acc. Chem. Res. 49, 1503–1513 (2016).
81. X. Ren, P. Rinke, V. Blum, J. Wieferink, A. Tkatchenko, A. Sanfilippo, K. Reuter, and M. Scheffler, "Resolution-of-identity approach to Hartree–Fock, hybrid density functionals, RPA, MP2 and GW with numeric atom-centered orbital basis functions," New J. Phys. 14, 053020 (2012).
82. Y. Tawada, T. Tsuneda, S. Yanagisawa, T. Yanai, and K. Hirao, "A long-range-corrected time-dependent density functional theory," J. Chem. Phys. 120, 8425–8433 (2004).
83. C. Adamo and V. Barone, "Toward reliable density functional methods without adjustable parameters: The PBE0 model," J. Chem. Phys. 110, 6158–6170 (1999).
84. T. Helgaker, P. Jorgensen, and J. Olsen, Molecular Electronic-Structure Theory (John Wiley & Sons, 2013).
85. L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide (Society for Industrial and Applied Mathematics, Philadelphia, 1997).
86. J. Behler and M. Parrinello, "Generalized neural-network representation of high-dimensional potential-energy surfaces," Phys. Rev. Lett. 98, 146401 (2007).
87. J. Lan, V. Kapil, P. Gasparotto, M. Ceriotti, M. Iannuzzi, and V. V. Rybkin, "Simulating the ghost: Quantum dynamics of the solvated electron," Nat. Commun. 12, 766 (2021).
88. M. Guidon, F. Schiffmann, J. Hutter, and J. VandeVondele, "Ab initio molecular dynamics using hybrid density functionals," J. Chem. Phys. 128, 214104 (2008).
