We present a low-prefactor, cubically scaling scaled-opposite-spin second-order Møller-Plesset perturbation theory (SOS-MP2) method which is highly suitable for massively parallel architectures such as graphics processing units (GPUs). The scaling is reduced from the quartic behavior of the original MO-based SOS-RI-MP2 formulation to cubic with system size.
I. INTRODUCTION
Second-order Møller-Plesset perturbation theory (MP2)1 is an important method in the field of ab initio quantum chemistry; however, the steep increase of its computational cost with molecular size limits its application to large systems.
Our method is based on the Laplace scaled-opposite-spin MP2 approach within the resolution-of-the-identity (RI) approximation (SOS-RI-MP2) of Jung et al.,9 whose essential idea is a reversed order of summation compared to a conventional RI-MP2 calculation, i.e., the molecular orbital (MO) indices are contracted before the summations over auxiliary functions are carried out. In this work, we use an atomic orbital-based reformulation which leads to an asymptotically cubic scaling, one power of the system size lower than in the original MO-based SOS-RI-MP2 approach, and which allows for efficient evaluation on graphics processing units (GPUs). In contrast to the asymptotically cubic scaling local SOS-RI-MP2 variant of Jung et al.,10 our method does not rely on a local metric or orbital localization, but uses local atomic orbitals (AOs).
In recent years, several methods using GPUs for post-Hartree-Fock (HF) calculations have been published,11–15 which, however, employ the GPUs only for linear algebra operations. In this work, our reformulation of SOS-RI-MP2 allows the computationally demanding contraction step to be replaced by a modified J-engine algorithm, which has already proven to be highly efficient on GPUs within self-consistent field (SCF) calculations.16,17 This integral-direct approach also allows disk I/O to be avoided completely.
II. THEORY
The conventional RI-MP2 expression for the opposite-spin MP2 term is the standard sum over pairs of occupied and virtual MOs with RI-factorized two-electron integrals (a sketch is given below), where the matrix J contains the two-center two-electron integrals in the auxiliary basis (R, S). Following the Laplace AO-MP2 approach,2–4 this expression can be transformed into the AO basis by a numerical Laplace quadrature of the energy denominator.
The sum over α corresponds to the quadrature of the Laplace transform of the energy denominator, for which usually 5-6 points have been found to provide good accuracy.4,18,19 The pseudo-densities are formed from the occupied and virtual MO coefficients, weighted at each quadrature point by exponential factors of the corresponding orbital energies.
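In standard notation (occupied indices i, j; virtual indices a, b; MO coefficients C; orbital energies ε; quadrature weights wα and points tα), these quantities can be sketched as

$$E_{\mathrm{OS}} = -\sum_{iajb} \frac{(ia|jb)\,(ia|jb)}{\varepsilon_a+\varepsilon_b-\varepsilon_i-\varepsilon_j}, \qquad (ia|jb) \approx \sum_{RS} (ia|R)\,\bigl(\mathbf{J}^{-1}\bigr)_{RS}\,(S|jb),$$

$$\frac{1}{\varepsilon_a+\varepsilon_b-\varepsilon_i-\varepsilon_j} \approx \sum_{\alpha} w_\alpha\, e^{-(\varepsilon_a+\varepsilon_b-\varepsilon_i-\varepsilon_j)\,t_\alpha},$$

$$\underline{P}^{(\alpha)}_{\mu\nu} = \sum_{i} C_{\mu i}\, e^{+\varepsilon_i t_\alpha}\, C_{\nu i}, \qquad \overline{P}^{(\alpha)}_{\mu\nu} = \sum_{a} C_{\mu a}\, e^{-\varepsilon_a t_\alpha}\, C_{\nu a}.$$

These expressions are given here as an illustration only; the precise conventions (e.g., the common absorption of the quadrature weights into the pseudo-densities) follow Refs. 2–4 and 9.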
For an efficient evaluation, we define intermediates
$$Z_{RS} = \sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})\,(\mu\nu|S),$$
which allow the OS-MP2 energy to be calculated via
$$\tilde{\mathbf{Z}} = \mathbf{Z}\,\mathbf{J}^{-1}$$
and the contraction
$$\sum_{RS} \tilde{Z}_{RS}\,\tilde{Z}_{SR}$$
evaluated at each Laplace quadrature point.
The crucial step is the formation of the intermediate matrix Z, which can be evaluated with cubic-scaling cost as shown below. It should be noted that the matrix Z differs from the matrices X and Y used in the methods of Jung et al. (see Eqs. (12) and (19) in Ref. 10), in which the inverse, or the inverse square root, respectively, of the matrix J is included.
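A minimal dense NumPy sketch of this contraction pattern for a single Laplace point is given below (array names are hypothetical; the Laplace weights, the opposite-spin scaling factor, sparsity, and the GPU evaluation are omitted, cf. Table I below):

```python
import numpy as np

def os_mp2_contraction(B_t, B, J_inv):
    """Dense sketch of the Z-based contraction for one Laplace point.

    B_t[R, mu, nu] : transformed three-center integrals (R|mu_ nu^-)
    B[R, mu, nu]   : untransformed three-center integrals (R|mu nu)
    J_inv[R, S]    : inverse of the two-center Coulomb metric J
    """
    # Z_RS = sum_{mu nu} (R|mu_ nu^-) (mu nu|S)   (cubic-scaling step)
    Z = np.einsum('rmn,smn->rs', B_t, B)
    # Z~ = Z J^{-1}
    Z_tilde = Z @ J_inv
    # sum_{RS} Z~_RS Z~_SR
    return np.einsum('rs,sr->', Z_tilde, Z_tilde)
```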
The formation of the Z matrix requires transformed three-center integrals
$$(R|\underline{\mu}\overline{\nu}) = \sum_{\lambda\sigma} \underline{P}_{\mu\lambda}\,(R|\lambda\sigma)\,\overline{P}_{\sigma\nu}.$$
For their efficient evaluation, the occupied pseudo-density is Cholesky decomposed, which provides a coefficient matrix $\mathbf{L}$ of occupied pseudo-MOs with
$$\underline{\mathbf{P}} = \mathbf{L}\,\mathbf{L}^{T},$$
so that the transformation can be carried out as three consecutive matrix multiplications.
It should be noted that the number of occupied Cholesky pseudo-MOs is equal to or, in some cases, slightly smaller than the number of occupied MOs (see Ref. 19 for an in-depth discussion), so that the dimension of the matrices is largely reduced by the first transformation with the Cholesky coefficients. In each of the three multiplications, two matrix dimensions correspond to the number of atomic orbitals and one corresponds to the number of occupied pseudo-MOs. Further savings in the final transformation step are possible if one exploits the restriction of the subsequent contraction Eq. (4) to significant basis function products: the final multiplication in Eq. (9) can be restricted to matrix blocks which contain significant basis function products.
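A dense NumPy sketch of this three-step transformation for a single auxiliary function is shown below (illustrative names; the block-sparse BCSR handling and screening discussed below are omitted):

```python
import numpy as np

def transform_one_aux(B_R, L, P_bar):
    """Sketch of (R|mu nu) -> (R|mu_ nu^-) for one auxiliary function R.

    B_R   : (n_ao, n_ao)  untransformed integrals (R|mu nu)
    L     : (n_ao, n_occ) Cholesky factor of the occupied pseudo-density
    P_bar : (n_ao, n_ao)  virtual pseudo-density
    """
    half = L.T @ B_R      # (n_occ, n_ao): first transformation, reduced dimension
    half = half @ P_bar   # (n_occ, n_ao): contraction with the virtual pseudo-density
    return L @ half       # (n_ao, n_ao): back-transformation yields P_ B_R P^-
```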
An outline of the steps in our method is given in Table I, and timings for linear alkanes and DNA systems are given in Tables II and III. Steps (1)–(4), (7), and (8) have negligible cost (<10% in all calculations, 2% for the largest DNA system), and the effective scaling of these steps is at most cubic with system size. The cubic-scaling steps (3), (4), and (7) are performed using highly optimized linear algebra routines. The significant steps are the transformation of the three-center integrals, step (5), and the contraction with the untransformed integrals in step (6), which is performed on the GPU; details of our algorithm are presented in Sec. III. The scaling of the transformation step (5) is asymptotically quadratic. This behavior is observed as the general trend for large linear alkanes, but it is superimposed with considerable fluctuations due to the blocking in the BCSR format with the chosen block size of around 100 × 100. At the present stage of development, we do not adjust the block structure for the rectangular Cholesky matrices, which leads to some variable overhead in the multiplications that can be significant for smaller systems. For the DNA systems, the transformation matrices are less sparse; the quadratic scaling regime is not reached and an effective cubic scaling is observed instead.
TABLE I. Outline of the individual steps of the presented AO-based SOS-RI-MP2 algorithm and their formal scaling with system size N.

Step | Operation | Scaling
---|---|---
(1) | Calculation of (R|μν) | $\mathcal{O}(\mathrm{N}^2)$
(2) | Calculation of $J_{RS} = (R|S)$ | $\mathcal{O}(\mathrm{N}^2)$
(3) | Calculation of $\mathbf{J}^{-1}$ | $\mathcal{O}(\mathrm{N}^3)$
(4) | Calculation of pseudo-densities | $\mathcal{O}(\mathrm{N}^3)$
(5) | Transformation of (R|μν) to $(R|\underline{\mu}\overline{\nu})$ | $\mathcal{O}(\mathrm{N}^2)$
(6) | Contraction $\sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})(\mu\nu|S)$ (on GPU) | $\mathcal{O}(\mathrm{N}^3)$
(7) | Multiplication $\mathbf{Z}\,\mathbf{J}^{-1}$ | $\mathcal{O}(\mathrm{N}^3)$
(8) | Contraction $\sum_{RS} \tilde{Z}_{RS} \tilde{Z}_{SR}$ | $\mathcal{O}(\mathrm{N}^2)$
TABLE II. Timings (in s) and effective scaling exponents $\mathcal{O}(\mathrm{N^x})$ for linear alkanes with the def2-SVP and def2-TZVP basis sets.

def2-SVP:

System | # Basis functions | Total [s] | $\mathcal{O}(\mathrm{N^x})$ | Transformation (5) [s] | $\mathcal{O}(\mathrm{N^x})$ | Contraction (6) on GPU [s] | $\mathcal{O}(\mathrm{N^x})$ | Steps (1)–(4), (7), (8) [s] | $\mathcal{O}(\mathrm{N^x})$
---|---|---|---|---|---|---|---|---|---
C20H42 | 490 | 98 | … | 51 | … | 39 | … | 8 | …
C40H82 | 970 | 575 | 2.59 | 374 | 2.92 | 171 | 2.16 | 30 | 1.94
C80H162 | 1930 | 2248 | 1.98 | 1051 | 1.50 | 1048 | 2.64 | 149 | 2.33
C160H322 | 3850 | 14 134 | 2.66 | 6206 | 2.57 | 7127 | 2.78 | 801 | 2.44
C320H642 | 7690 | 95 765 | 2.77 | 33 881 | 2.45 | 56 358 | 2.99 | 5526 | 2.79

def2-TZVP:

System | # Basis functions | Total [s] | $\mathcal{O}(\mathrm{N^x})$ | Transformation (5) [s] | $\mathcal{O}(\mathrm{N^x})$ | Contraction (6) on GPU [s] | $\mathcal{O}(\mathrm{N^x})$ | Steps (1)–(4), (7), (8) [s] | $\mathcal{O}(\mathrm{N^x})$
---|---|---|---|---|---|---|---|---|---
C20H42 | 872 | 406 | … | 261 | … | 120 | … | 25 | …
C40H82 | 1732 | 1831 | 2.19 | 932 | 1.85 | 784 | 2.74 | 115 | 2.22
C80H162 | 3452 | 11 800 | 2.70 | 5790 | 2.65 | 5445 | 2.81 | 565 | 2.31
TABLE III. Timings (in s) and effective scaling exponents $\mathcal{O}(\mathrm{N^x})$ for DNA fragments; the last two columns give reference timings obtained with a conventional SOS-RI-MP2 implementation.^a

System | # Basis functions | Total [s] | $\mathcal{O}(\mathrm{N^x})$ | Transformation (5) [s] | $\mathcal{O}(\mathrm{N^x})$ | Contraction (6) on GPU [s] | $\mathcal{O}(\mathrm{N^x})$ | Steps (1)–(4), (7), (8) [s] | $\mathcal{O}(\mathrm{N^x})$ | Reference^a [s] | $\mathcal{O}(\mathrm{N^x})$
---|---|---|---|---|---|---|---|---|---|---|---
DNA1 | 625 | 238 | … | 143 | … | 82 | … | 13 | … | 438 | …
DNA2 | 1332 | 2096 | 2.88 | 1279 | 2.90 | 721 | 2.87 | 96 | 2.64 | 8290 | 3.88
DNA4 | 2746 | 19 601 | 3.09 | 13 289 | 3.24 | 5750 | 2.87 | 562 | 2.44 | 146 000 | 3.96
DNA8 | 5574 | 185 659 | 3.18 | 126 060 | 3.18 | 55 882 | 3.21 | 3717 | 2.67 | … | …
^a Reference calculation with conventional SOS-RI-MP2 (1 core @ Intel Xeon E5-2620; only serial version available).
It should be noted that the scaling of the computational cost given in Table I relies on the increased sparsity in large molecular systems for a fixed choice of basis set. More generally, one can also express the scaling behavior in terms of the number of atomic centers (Nc) and the number of AOs per center (Nao/c). The number of auxiliary functions per center can usually be expressed as a constant multiple of Nao/c, i.e., as caux × Nao/c. Thus, the costs of the two rate-determining steps (5) and (6) can be expressed in terms of Nc and Nao/c, with prefactors determined by caux and by the analogous ratio cocc of occupied pseudo-MOs per center.
In order to provide some measure of comparison for the efficiency of our algorithm, we performed conventional SOS-RI-MP2 calculations with a development version of Q-Chem22 for the first three DNA fragments (1 core, Intel Xeon E5-2620), showing a speed-up of approximately 7.5 for DNA4. It has to be stressed that a fair and balanced comparison of two different algorithms aimed at different computing architectures is very difficult. For larger systems, however, the speed-up of the presented algorithm will be far larger due to the less favorable scaling of the conventional SOS-RI-MP2 algorithm (cf. the reference timings in Table III).
In our present implementation, the untransformed three-center integrals are saved to disk once and read in the transformation step for every Laplace point. An implementation without any I/O can easily be realized by performing the integral evaluation, the transformation, and the contraction in batches of the auxiliary index R: for each batch, all three-center integrals can be kept in memory, the transformation is performed in memory, and in the contraction step (6) the untransformed integrals are calculated on the fly and directly contracted, so that all operations can be carried out without any hard disk access. It also has to be stressed that the current implementation is strictly sequential with respect to the individual steps given in Table I, i.e., the GPU kernels block the further execution of the code in order to provide meaningful timings for the single steps of the algorithm. The overall computation time can, however, be reduced significantly by executing the GPU kernels in a non-blocking fashion, so that the contraction step (6) and the transformation step (5) can be executed simultaneously.
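A schematic Python sketch of such an I/O-free organization is given below; the helper callables are placeholders for the corresponding integral, transformation, and GPU routines and do not correspond to the actual implementation:

```python
import numpy as np

def os_mp2_batched(aux_batches, laplace_points, J_inv, n_aux,
                   build_pseudo_densities, compute_3c_integrals,
                   transform, contract_on_gpu):
    """Batched, I/O-free sketch of the dominant steps (placeholder callables)."""
    e_os = 0.0
    for t_alpha, w_alpha in laplace_points:
        P_occ, P_virt = build_pseudo_densities(t_alpha)        # step (4)
        Z = np.zeros((n_aux, n_aux))
        for batch in aux_batches:                              # slice of R indices
            B = compute_3c_integrals(batch)                    # (R|mu nu), kept in memory
            B_t = transform(B, P_occ, P_virt)                  # step (5)
            # step (6): (mu nu|S) generated on the fly and contracted on the GPU
            Z[batch, :] += contract_on_gpu(B_t)
        Z_tilde = Z @ J_inv                                    # step (7)
        e_os += w_alpha * np.einsum('rs,sr->', Z_tilde, Z_tilde)  # step (8)
    # Sign and opposite-spin scaling factor are omitted in this sketch.
    return e_os
```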
III. DETAILS: GPU-BASED ALGORITHM
For an efficient evaluation of the contraction step (6), a modification of the J-engine23,24 based GPU algorithm for the evaluation of the Coulomb matrix in SCF calculations17 is employed. The algorithm is based on the 1 thread/1 primitive integral (1T1PI) approach and the shell-pair ordering proposed by Ufimtsev and Martínez.16,25 To be consistent with this work with respect to the choice of bra and ket distributions and to make the analogy to the Coulomb matrix construction most clear, we rewrite the definition of the Z intermediates, Eq. (4), for real basis functions as
$$Z_{RS} = \sum_{\mu\nu} (S|\mu\nu)\, D^{R}_{\mu\nu}$$
with the density-like matrices
$$D^{R}_{\mu\nu} = (R|\underline{\mu}\overline{\nu}).$$
In the regular algorithm for determining the Coulomb matrix in SCF calculations, the ket shell-pairs are contracted with a single density matrix only. Here, however, Naux density-like matrices have to be contracted, so that a direct application of the conventional algorithm would lead to a large prefactor, and some adjustments of the algorithm were required.
Crucial to the performance of the algorithm is the use of shared memory to hold the intermediate values J[pq]; this memory, however, is strictly limited on GPUs. We therefore retain the two-dimensional thread-block setup but extend the grid by a further dimension which represents the offset into the corresponding indices of the current R-batch, thereby strongly reducing the communication between host and GPU devices. It should be noted that in this case the bra data do not represent shell-pairs but only the shells of the auxiliary basis.
The complete algorithm is also depicted in Fig. 1. Note that the elements of Z are accumulated directly on the GPU for all R of the current batch.
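As a rough illustration of this work decomposition, a CPU-side Python emulation is sketched below (illustrative only; the actual implementation is a CUDA kernel following the 1T1PI J-engine layout of Refs. 16 and 17, and all names are hypothetical):

```python
import numpy as np

def contract_R_batch(R_batch, aux_shell_blocks, ket_pair_blocks, ket_integral, D):
    """Emulates the grid layout: blocks over auxiliary (bra) shells and AO (ket)
    shell-pairs, with a third grid dimension running over the current R-batch.

    ket_integral(S, pq) : value of (S|mu nu) for ket shell-pair index pq
    D[r][pq]            : density-like values (R|mu_ nu^-) for R = R_batch[r]
    """
    n_S = sum(len(block) for block in aux_shell_blocks)
    Z_part = np.zeros((len(R_batch), n_S))
    for r in range(len(R_batch)):               # third grid dimension: R offset
        for bra_block in aux_shell_blocks:      # grid dimension 1: bra shells S
            for ket_block in ket_pair_blocks:   # grid dimension 2: ket shell-pairs
                for S in bra_block:
                    acc = 0.0                   # shared-memory-like accumulator
                    for pq in ket_block:
                        acc += ket_integral(S, pq) * D[r][pq]
                    Z_part[r, S] += acc
    return Z_part
```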
IV. CONCLUSION
We present a reformulation of the SOS-RI-MP2 algorithm which not only reduces the computational effort to an asymptotically cubic scaling with system size, but also allows the rate-determining contraction step to be evaluated efficiently on GPUs via a modified J-engine algorithm.
ACKNOWLEDGMENTS
C.O. acknowledges financial support by the “Deutsche Forschungsgemeinschaft” (DFG) for the project Oc35/4-1. Further funding was provided by the DFG via the initiatives SFB 749 “Dynamik und Intermediate molekularer Transformationen” (project C7) and the DFG cluster of excellence EXC 114 “Center for Integrated Protein Science Munich” (CIPSM).