We present a low-prefactor, cubically scaling scaled-opposite-spin second-order Møller-Plesset perturbation theory (SOS-MP2) method which is highly suitable for massively parallel architectures like graphics processing units (GPUs). The scaling is reduced from $\mathcal{O}(N^5)$ to $\mathcal{O}(N^3)$ by a reformulation of the MP2 expression in the atomic orbital basis via Laplace transformation and the resolution-of-the-identity (RI) approximation of the integrals, in combination with efficient sparse algebra for the three-center integral transformation. In contrast to previous works that employ GPUs for post-Hartree-Fock calculations, we do not simply employ GPU-based linear algebra libraries to accelerate the conventional algorithm. Instead, our reformulation allows us to replace the rate-determining contraction step with a modified J-engine algorithm that has been proven to be highly efficient on GPUs. Thus, our SOS-MP2 scheme enables us to treat large molecular systems in an accurate and efficient manner on a single GPU server.

Second-order Møller-Plesset perturbation theory (MP2)1 is an important method in the field of ab initio quantum chemistry; however, its $\mathcal{O}(N^5)$ scaling with molecular size N hampers its applicability to large systems. Therefore, many efficient low- or linear-scaling methods have been developed over the last decades, e.g., see Refs. 2–8. Despite the success of these methods, reducing the prefactor remains a central issue and is the focus of our present work.

Our method is based on the Laplace scaled-opposite-spin MP2 approach within the resolution-of-the-identity (RI) approximation (SOS-RI-MP2) of Jung et al.,9 where the essential idea is a reversed order of summation compared to a conventional RI-MP2 calculation, i.e., the molecular orbital (MO) indices are contracted first before the summations over auxiliary functions are carried out. In this work, we use an atomic orbital-based reformulation which leads to an asymptotically cubic scaling, one power of the system size lower than that of the original MO-based SOS-RI-MP2 approach, and which allows for efficient evaluation on graphics processing units (GPUs). In contrast to the asymptotically cubic-scaling local SOS-RI-MP2 variant of Jung et al.,10 our method does not rely on a local metric or orbital localization, but uses local atomic orbitals (AOs).

In recent years, several methods focusing on the use of GPUs for post-Hartree-Fock (HF) calculations have been published11–15 that employ GPUs only for linear algebra operations. In this work, however, our reformulation of SOS-RI-MP2 allows us to replace the computationally demanding contraction step with a modified J-engine algorithm, which has already been proven to be highly efficient on GPUs within self-consistent field (SCF) calculations.16,17 This integral-direct approach also enables the complete avoidance of disk I/O.

The conventional RI-MP2 expression for the opposite-spin MP2 term reads

(1)

where the matrix J contains the two-center two-electron integrals in the auxiliary basis (R, S). Following the Laplace AO-MP2 approach,2–4 this expression can be transformed into the AO basis

(2)

The sum over α corresponds to the quadrature of the Laplace transform of the energy denominator, for which usually 5–6 points have been found to provide good accuracy.4,18,19 The pseudo-densities $\underline{\mathbf{P}}$ and $\overline{\mathbf{P}}$ are defined as

(3)
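To make the construction of the pseudo-densities concrete, the following minimal NumPy sketch builds them for a single Laplace point in the usual Laplace-AO-MP2 convention (exponential weighting of the occupied and virtual MO coefficients); the function and variable names are illustrative choices of ours, the quadrature weight is kept separate here, and the actual implementation builds these matrices in sparse BCSR form.

```python
import numpy as np

def pseudo_densities(C_occ, C_virt, eps_occ, eps_virt, t_alpha):
    """Pseudo-densities for one Laplace point t_alpha (illustrative sketch).

    C_occ  : (nao, nocc) occupied MO coefficients
    C_virt : (nao, nvirt) virtual MO coefficients
    eps_*  : orbital energies; eps_occ < 0 < eps_virt is assumed so that
             both exponentials decay with t_alpha
    """
    # occupied pseudo-density:  P_under[mu, nu] = sum_i C[mu, i] exp(+eps_i t) C[nu, i]
    P_occ = (C_occ * np.exp(eps_occ * t_alpha)) @ C_occ.T
    # virtual pseudo-density:   P_over[mu, nu]  = sum_a C[mu, a] exp(-eps_a t) C[nu, a]
    P_virt = (C_virt * np.exp(-eps_virt * t_alpha)) @ C_virt.T
    return P_occ, P_virt
```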

For an efficient evaluation, we define intermediates

(4)

which allow us to calculate the OS-MP2 energy via

(5)

and

(6)

The crucial step is the formation of the intermediate matrix Z, which can be calculated at cubically scaling cost as shown below. It should be noted that the matrix Z differs from the matrices X and Y used in the methods of Jung et al. (see Eqs. (12) and (19) in Ref. 10), where the inverse or inverse square root, respectively, of the matrix J is included.
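A minimal dense NumPy sketch of how the OS-MP2 energy can be assembled from the Z intermediates according to steps (7) and (8) of Table I is given below; the sign convention, the placement of the quadrature weights, and the opposite-spin scaling factor c_OS are assumptions of this sketch, not a reproduction of the actual implementation.

```python
import numpy as np

def sos_mp2_energy(Z_list, weights, J, c_os=1.3):
    """Assemble the SOS-MP2 energy from the Z intermediates (illustrative sketch).

    Z_list  : one (naux, naux) matrix Z^(alpha) per Laplace point,
              Z_RS = sum_{mu nu} (R|mu_ nu^)(mu nu|S)
    weights : Laplace quadrature weights w_alpha
    J       : (naux, naux) Coulomb metric (R|S)
    c_os    : opposite-spin scaling factor (1.3 is the common SOS choice)
    """
    J_inv = np.linalg.inv(J)
    e_os = 0.0
    for w, Z in zip(weights, Z_list):
        Z_tilde = Z @ J_inv                                   # step (7): Z J^{-1}
        e_os -= w * np.einsum("RS,SR->", Z_tilde, Z_tilde)    # step (8): sum_RS Zt_RS Zt_SR
    return c_os * e_os
```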

The formation of the Z matrix requires transformed three-center integrals $(R|\underline{\mu}\overline{\nu})$, which we calculate in an asymptotically quadratic step from the untransformed integrals (R|μν) using sparse matrix multiplications in the highly efficient block-compressed sparse row (BCSR) format20 (see Ref. 21 for details of our BCSR implementation). For an efficient transformation, it is furthermore advantageous to perform a Cholesky decomposition of the occupied pseudo-density,

(7)

which provides a coefficient matrix $\underline{\mathbf{L}}$ for local occupied pseudo-MOs19 that can be used as an intermediate basis. The transformation starts with a sparse square matrix of integrals for a given auxiliary index, $\mathbf{M}^{(R)}$ with $M^{(R)}_{\mu\nu} = (R|\mu\nu)$, where the sparsity reflects the linear-scaling number of significant basis function products. The transformation to the transformed integral block

(8)

with

$\underline{\overline{M}}^{(R)}_{\mu\nu} = (R|\underline{\mu}\overline{\nu})$ is then performed in three consecutive multiplications,

(9)

It should be noted that the number of occupied Cholesky pseudo-MOs is equal to or, in some cases, slightly smaller than the number of occupied MOs (see Ref. 19 for an in-depth discussion), and the dimension of the matrices is therefore largely reduced by the first transformation with the Cholesky coefficients. In each of the three multiplications, two matrix dimensions correspond to the number of atomic orbitals and one corresponds to the number of occupied pseudo-MOs. Further savings in the final transformation step are possible if one exploits the restriction of the subsequent contraction, Eq. (4), to significant basis function products: the final multiplication in Eq. (9) can be restricted to matrix blocks that contain ($\underline{\mu}$, $\overline{\nu}$) pairs matching significant basis function pairs. All matrices have a sparse structure, and the matrix multiplications therefore scale asymptotically linearly with system size for a given auxiliary index, which leads to an asymptotically quadratic scaling of the total transformation step.
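As an illustration of the transformation of Eq. (9), the following sketch performs the three consecutive multiplications for a single auxiliary index, with SciPy CSR matrices standing in for the paper's BCSR format; the multiplication order is inferred from the text (two AO-sized dimensions and one pseudo-MO dimension in each step), and the names are our own.

```python
from scipy.sparse import csr_matrix

def transform_three_center_block(M_R, L_occ, P_virt):
    """Three consecutive sparse multiplications for one auxiliary index R (sketch).

    M_R    : (nao, nao) integrals (R|mu nu) for a fixed auxiliary function R
    L_occ  : (nao, nocc) Cholesky factor of the occupied pseudo-density, Eq. (7)
    P_virt : (nao, nao) virtual pseudo-density
    """
    M  = csr_matrix(M_R)
    L  = csr_matrix(L_occ)
    Pv = csr_matrix(P_virt)

    T1 = L.T @ M     # (nocc, nao): half-transformation into the Cholesky pseudo-MO basis
    T2 = T1 @ Pv     # (nocc, nao): transformation with the virtual pseudo-density
    return L @ T2    # (nao, nao):  back-transformation, yielding (R|mu_ nu^)
```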

An outline of the steps in our method is given in Table I, and timings for linear alkanes and DNA systems are given in Tables II and III. Steps (1)–(4), (7), and (8) have negligible cost (<10% in all calculations, 2% for the largest DNA system), and the effective scaling of these steps is at most cubic with system size. The cubically scaling steps (3), (4), and (7) are performed using highly optimized linear algebra routines. The significant steps are the transformation of the three-center integrals, step (5), and the contraction with the untransformed integrals in step (6), which is performed on the GPU; details of our algorithm are presented in Sec. III. The scaling of the transformation step (5) is asymptotically quadratic; this behavior is observed as the general trend for large linear alkanes, but it is superimposed with considerable fluctuations due to the blocking in the BCSR format with the chosen block size of around 100 × 100. At the present stage of development, we do not adjust the block structure for the rectangular Cholesky matrices, which leads to some variable overhead in the multiplications that can be significant for smaller systems. For the DNA systems, the transformation matrices are less sparse, the quadratic scaling regime is not reached, and an effective cubic scaling is observed.

Table I.

Outline of the algorithm for SOS-RI-AO-MP2. Steps (4)–(7) are repeated for every Laplace point.

Step                                                                                   Scaling
(1) Calculation of (R|μν)                                                              $\mathcal{O}(N^2)$
(2) Calculation of $J_{RS} = (R|S)$                                                    $\mathcal{O}(N^2)$
(3) Calculation of $\mathbf{J}^{-1}$                                                   $\mathcal{O}(N^3)$
(4) Calculation of pseudo-densities                                                    $\mathcal{O}(N^3)$
(5) Transformation of (R|μν) to $(R|\underline{\mu}\overline{\nu})$                    $\mathcal{O}(N^2)$
(6) Contraction $\sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})(\mu\nu|S)$ (on GPU)   $\mathcal{O}(N^3)$
(7) Multiplication $\mathbf{Z}\,\mathbf{J}^{-1}$                                       $\mathcal{O}(N^3)$
(8) Contraction $\sum_{RS} \tilde{Z}_{RS}\tilde{Z}_{SR}$                               $\mathcal{O}(N^2)$
Table II.

Wall times and scaling behavior with respect to the number of basis functions for SOS-RI-AO-MP2 calculations on linear alkanes in a def2-SVP and a def2-TZVP basis on a computing node with two Intel Xeon E5-2620 processors (12 CPU cores) and four Nvidia GeForce GTX Titan GPUs. Timings and effective scaling exponents x [$\mathcal{O}(N^x)$, given in parentheses] are listed for the whole calculation as well as for selected steps of the algorithm defined in Table I.

def2-SVP
System      # Basis fns   Total [s] (x)      Transformation (5) [s] (x)   Contraction (6), GPU [s] (x)   Steps (1)–(4), (7), (8) [s] (x)
C20H42        490             98  (…)             51  (…)                      39  (…)                         …   (…)
C40H82        970            575  (2.59)         374  (2.92)                  171  (2.16)                      30  (1.94)
C80H162      1930           2248  (1.98)        1051  (1.50)                 1048  (2.64)                     149  (2.33)
C160H322     3850         14 134  (2.66)        6206  (2.57)                 7127  (2.78)                     801  (2.44)
C320H642     7690         95 765  (2.77)      33 881  (2.45)               56 358  (2.99)                    5526  (2.79)

def2-TZVP
System      # Basis fns   Total [s] (x)      Transformation (5) [s] (x)   Contraction (6), GPU [s] (x)   Steps (1)–(4), (7), (8) [s] (x)
C20H42        872            406  (…)            261  (…)                     120  (…)                         25  (…)
C40H82       1732           1831  (2.19)         932  (1.85)                  784  (2.74)                     115  (2.22)
C80H162      3452         11 800  (2.70)        5790  (2.65)                 5445  (2.81)                     565  (2.31)
Table III.

Wall times and scaling behavior with respect to the number of basis functions for SOS-RI-AO-MP2 calculations on DNA systems in a def2-SVP basis on a computing node with two Intel Xeon E5-2620 processors (12 CPU cores) and four Nvidia GeForce GTX Titan GPUs. Timings and effective scaling exponents x [$\mathcal{O}(N^x)$, given in parentheses] are listed for the whole calculation as well as for selected steps of the algorithm defined in Table I.

System   # Basis fns   Total [s] (x)       Transformation (5) [s] (x)   Contraction (6), GPU [s] (x)   Steps (1)–(4), (7), (8) [s] (x)   Reference(a) [s] (x)
DNA1        625            238  (…)             143  (…)                     82  (…)                         13  (…)                         438  (…)
DNA2       1332           2096  (2.88)         1279  (2.90)                 721  (2.87)                      96  (2.64)                     8290  (3.88)
DNA4       2746         19 601  (3.09)       13 289  (3.24)               5750  (2.87)                      562  (2.44)                  146 000  (3.96)
DNA8       5574        185 659  (3.18)      126 060  (3.18)             55 882  (3.21)                     3717  (2.67)                        …  (…)
(a) Reference calculation with conventional SOS-RI-MP2 (1 core @ Intel Xeon E5-2620; only a serial version was available).

It should be noted that the scaling of the computational cost given in Table I is based on the increased sparsity in large molecular systems with a fixed choice of basis set. More generally, one can also express the scaling behavior in terms of the number of atom centers ($N_c$) and the number of AOs per center ($N_{ao/c}$). The number of auxiliary AOs per center can usually be expressed as $N_{ao/c}$ multiplied by a constant $c_{aux}$, i.e., as $c_{aux} \times N_{ao/c}$. Thus, we obtain for the two rate-determining steps (5) and (6) a scaling behavior of $c_{occ}\, c_{aux} \times \mathcal{O}(N_c^2 N_{ao/c}^3)$ and $c_{aux}^2 \times \mathcal{O}(N_c^3 N_{ao/c}^4)$, respectively, where either $N_c$ or $N_{ao/c}$ is held constant (scaling with basis-set size or system size, respectively). The prefactor $c_{occ}$ in the first expression arises from the use of the Cholesky factor $\underline{\mathbf{L}}$ in the sparse multiplications of step (5). These expressions show the scaling for a fixed system size with increasing basis-set size, i.e., $N_{ao/c}^3$ for step (5) and $N_{ao/c}^4$ for step (6). Nevertheless, the asymptotic scaling with respect to system size (i.e., with $N_{ao/c}$ constant) is quadratic for step (5) and cubic for step (6), independent of the choice of basis set.
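A simple cost model makes these expressions explicit; the constants below are hypothetical and serve only to illustrate how the cost of steps (5) and (6) responds to increasing system size versus increasing basis-set size.

```python
def step5_cost(n_c, n_aoc, c_occ=1.0, c_aux=3.0):
    """Sparse transformation, step (5): c_occ * c_aux * O(Nc^2 * Nao/c^3)."""
    return c_occ * c_aux * n_c**2 * n_aoc**3

def step6_cost(n_c, n_aoc, c_aux=3.0):
    """GPU contraction, step (6): c_aux^2 * O(Nc^3 * Nao/c^4)."""
    return c_aux**2 * n_c**3 * n_aoc**4

# Doubling the system size at a fixed basis raises the step (6) cost by 2^3 = 8x,
# whereas doubling the basis-set size for a fixed system raises it by 2^4 = 16x.
print(step6_cost(2, 10) / step6_cost(1, 10))   # 8.0
print(step6_cost(1, 20) / step6_cost(1, 10))   # 16.0
```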

In order to provide some measure of comparison for the efficiency of our algorithm, we performed SOS-RI-MP2 calculations with a development version of Q-Chem22 for the first three DNA fragments (1 core, Intel Xeon E5-2620), showing a speed-up of approximately 7.5 for DNA4. Here, it has to be stressed that a fair and balanced comparison of two different algorithms aimed at different computing architectures is very difficult. For larger systems, however, the speed-up of the presented algorithm will be far larger due to the less favorable $\mathcal{O}(N^4)$ scaling of the conventional SOS-RI-MP2 algorithm.

In our present implementation, the untransformed three-center integrals are saved to disk once and read in the transformation step for every Laplace point. An implementation without any disk I/O can easily be realized by performing steps (1), (5), and (6) in batches of R: for each batch of this auxiliary index, all three-center integrals can be kept in memory, the transformation can be performed in memory, and in step (6) the untransformed integrals are calculated on the fly and directly contracted, so that all operations can be performed without any hard disk access. It also has to be stressed that the current implementation is strictly sequential with respect to the individual steps given in Table I, i.e., the GPU kernels block further execution of the code in order to provide meaningful timings for the individual steps of the algorithm. However, the overall computation time can be significantly reduced by executing the GPU kernels in a non-blocking fashion, so that the contraction step (6) and the transformation step (5) can be executed simultaneously.
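The loop structure of such an integral-direct, R-batched scheme can be sketched as follows; the snippet uses random arrays in place of real integrals, employs the direct (non-Cholesky) form of the transformation, and is meant only to illustrate the batching described above for a single Laplace point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ao, n_aux, n_batch = 10, 20, 5

# Random stand-ins (real code computes these quantities for each Laplace point).
J = np.eye(n_aux)                                              # metric (R|S)
P_occ = rng.random((n_ao, n_ao));  P_occ = P_occ @ P_occ.T     # occupied pseudo-density
P_virt = rng.random((n_ao, n_ao)); P_virt = P_virt @ P_virt.T  # virtual pseudo-density
ints_3c = rng.random((n_aux, n_ao, n_ao))                      # (R|mu nu); computed on the fly in practice

Z = np.zeros((n_aux, n_aux))
for start in range(0, n_aux, n_batch):
    sl = slice(start, start + n_batch)
    M = ints_3c[sl]                                            # step (1): batch of (R|mu nu)
    M_bar = np.einsum("ml,Rls,sn->Rmn", P_occ, M, P_virt)      # step (5): transformation (direct form)
    Z[sl] = np.einsum("Rmn,Smn->RS", M_bar, ints_3c)           # step (6): contraction with (mu nu|S)
Z_tilde = Z @ np.linalg.inv(J)                                 # step (7)
e_alpha = -np.einsum("RS,SR->", Z_tilde, Z_tilde)              # step (8): one Laplace-point contribution
```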

For an efficient evaluation of the contraction step (6), a modification of the J-engine-based23,24 GPU algorithm for the evaluation of the Coulomb matrix in SCF calculations17 is employed. The algorithm is based on the 1 thread/1 primitive integral (1T1PI) approach and the shell-pair ordering proposed by Ufimtsev and Martínez.16,25 To be consistent with this work with respect to the choice of bra and ket distributions, and to make the analogy to the Coulomb matrix construction most clear, we rewrite Eq. (4) (for real functions) as

(10)

with $\underline{\overline{M}}^{(R)}_{\mu\nu} = (R|\underline{\mu}\overline{\nu})$ (cf. Eq. (8)). In contrast to the regular Coulomb matrix construction, we execute a Coulomb-type contraction for every auxiliary index R to calculate one row of Z, using the matrix of transformed three-center integrals $\underline{\overline{\mathbf{M}}}^{(R)}$ in place of the density matrix.

In the regular algorithm for the Coulomb matrix in SCF calculations, the ket shell-pairs are contracted with a single density matrix only. Here, however, Naux density-like matrices have to be contracted, so that a naive application of the conventional algorithm would incur a correspondingly large prefactor; some adjustments of the algorithm were therefore required.

Crucial to the performance of the algorithm is the use of shared memory to hold the intermediate values J[pq]; however, this memory is strictly limited on GPUs. Thus, we decided to retain the two-dimensional thread-block setup but to extend the grid by a further dimension that represents the offset to the corresponding indices of the current R-batch. Thereby, we strongly reduce the communication between host and GPU devices. It should be noted that in this case the bra data do not represent shell-pairs but only the shells of the auxiliary basis.
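For clarity, the following CPU reference expresses what the modified kernel computes for one R-batch: a Coulomb-matrix-type contraction per auxiliary index with the transformed integral matrix in place of the density matrix, Eq. (10). This is only a dense NumPy stand-in for the GPU kernel; the 1T1PI thread layout, shared-memory accumulation, and the extra grid dimension over R are not reproduced here.

```python
import numpy as np

def z_rows_for_batch(M_bar_batch, ints_3c):
    """CPU reference for the per-R Coulomb-type contraction of Eq. (10) (sketch).

    M_bar_batch : (nbatch, nao, nao) transformed integrals (R|mu_ nu^) of the R-batch
    ints_3c     : (naux, nao, nao) untransformed integrals (S|mu nu)
    Returns the corresponding rows of Z, Z[R, S] = sum_{mu nu} (R|mu_ nu^)(mu nu|S).
    """
    n_batch = M_bar_batch.shape[0]
    n_aux = ints_3c.shape[0]
    Z_rows = np.zeros((n_batch, n_aux))
    for r in range(n_batch):
        # analogue of J_S = sum_{mu nu} D_{mu nu}(mu nu|S), with D replaced by M_bar^(R)
        Z_rows[r] = np.einsum("mn,Smn->S", M_bar_batch[r], ints_3c)
    return Z_rows
```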

The complete algorithm is also depicted in Fig. 1. Note that the elements of $(R|\underline{\mu}\overline{\nu})$ are already transformed to the Cartesian basis and the non-axial normalization coefficients have been incorporated. Furthermore, the elements are sorted according to the l-quantum-number combination of $\underline{\mu}\overline{\nu}$, so that the batches can be read from disk in a coalesced fashion.

FIG. 1.

Scheme of the algorithm for the contraction step, Eq. (10). Unless otherwise indicated, the steps are executed on the CPU.


We present a reformulation of the SOS-RI-MP2 algorithm which not only reduces the computational effort to $\mathcal{O}(N^3)$, but also allows for an efficient evaluation of the rate-determining contraction step on GPUs by employing a modified J-engine algorithm. As what is, to our knowledge, the first reformulation of a post-HF method that is highly suitable for GPUs beyond the mere use of GPU-accelerated linear algebra libraries, our ansatz may also be applicable to other post-HF methods in order to enable an efficient use of massively parallel processors.

C.O. acknowledges financial support by the “Deutsche Forschungsgemeinschaft” (DFG) for the project Oc35/4-1. Further funding was provided by the DFG via the initiatives SFB 749 “Dynamik und Intermediate molekularer Transformationen” (project C7) and the DFG cluster of excellence EXC 114 “Center for Integrated Protein Science Munich” (CIPSM).

1. C. Møller and M. S. Plesset, Phys. Rev. 46, 618 (1934).
2. J. Almlöf, Chem. Phys. Lett. 181, 319 (1991).
3. M. Häser and J. Almlöf, J. Chem. Phys. 96, 489 (1992).
4. M. Häser, Theor. Chim. Acta 87, 147 (1993).
5. P. Y. Ayala and G. E. Scuseria, J. Chem. Phys. 110, 3660 (1999).
6. S. Saebø and P. Pulay, J. Chem. Phys. 115, 3975 (2001).
7. H.-J. Werner, F. R. Manby, and P. J. Knowles, J. Chem. Phys. 118, 8149 (2003).
8. B. Doser, D. S. Lambrecht, J. Kussmann, and C. Ochsenfeld, J. Chem. Phys. 130, 064107 (2009).
9. Y. Jung, R. C. Lochan, A. D. Dutoi, and M. Head-Gordon, J. Chem. Phys. 121, 9793 (2004).
10. Y. Jung, Y. Shao, and M. Head-Gordon, J. Comput. Chem. 28, 1953 (2007).
11. R. Olivares-Amaya, M. A. Watson, R. G. Edgar, L. Vogt, Y. Shao, and A. Aspuru-Guzik, J. Chem. Theory Comput. 6, 135 (2010).
12. E. A. DePrince III and J. R. Hammond, J. Chem. Theory Comput. 7, 1287 (2011).
13. W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski, J. Chem. Theory Comput. 7, 1316 (2011).
14. K. Bhaskaran-Nair, W. Ma, S. Krishnamoorthy, O. Villa, H. J. J. van Dam, E. Aprà, and K. Kowalski, J. Chem. Theory Comput. 9, 1949 (2013).
15. M. Del Ben, J. Hutter, and J. VandeVondele, J. Chem. Theory Comput. 9, 2654 (2013).
16. I. S. Ufimtsev and T. J. Martinez, J. Chem. Theory Comput. 5, 1004 (2009).
17. J. Kussmann and C. Ochsenfeld, J. Chem. Phys. 138, 134114 (2013).
18. S. A. Maurer, D. S. Lambrecht, J. Kussmann, and C. Ochsenfeld, J. Chem. Phys. 138, 014101 (2013).
19. S. Maurer, L. Clin, and C. Ochsenfeld, J. Chem. Phys. 140, 224112 (2014).
20. F. G. Gustavson, ACM Trans. Math. Softw. 4, 250 (1978).
21. J. Kussmann and C. Ochsenfeld, J. Chem. Phys. 127, 054103 (2007).
22. Development version of the Q-Chem program package, http://www.q-chem.com.
23. C. A. White and M. Head-Gordon, J. Chem. Phys. 104, 2620 (1996).
24. G. R. Ahmadi and J. Almlöf, Chem. Phys. Lett. 246, 364 (1995).
25. I. S. Ufimtsev and T. J. Martínez, J. Chem. Theory Comput. 4, 222 (2008).