We present a tensor hypercontracted (THC) scaled opposite spin second order Møller-Plesset perturbation theory (SOS-MP2) method. By using THC, we reduce the formal scaling of SOS-MP2 with respect to molecular size from quartic to cubic. We achieve further efficiency by exploiting sparsity in the atomic orbitals and using graphical processing units (GPUs) to accelerate integral construction and matrix multiplication. The practical scaling of GPU-accelerated atomic orbital-based THC-SOS-MP2 calculations is found to be *N*^{2.6} for reference data sets of water clusters and alanine polypeptides containing up to 1600 basis functions. The errors in correlation energy with respect to density-fitting-SOS-MP2 are less than 0.5 kcal/mol for all systems tested (up to 162 atoms).

## I. INTRODUCTION

Second order Møller-Plesset perturbation theory^{1} (MP2) is the simplest wavefunction-based method that can describe dynamic electron correlation effects. MP2 has been used extensively in studies of hydrogen bonding networks^{2,3} and noncovalent interactions in biological systems.^{4} However, the computational cost of canonical MP2 is dominated by the need to transform the two electron Coulomb repulsion integrals (ERIs) from the atomic orbital (AO) basis to the molecular orbital (MO) basis. This operation scales as the fifth power of molecular size (more precisely, the number of basis functions) and severely restricts the size of molecules where MP2 may be applied. As a result, there have been many efforts aiming to reduce the computational cost of MP2 calculations.

Two general approaches to reducing this computational complexity are possible. First, one can reduce the prefactor, leading to a method that is less computationally costly but with the same (inconvenient) scaling behavior. Second, one can reduce the scaling exponent. Such scaling reductions often come at the cost of an increased prefactor, so there is no gain until the molecule reaches a certain size.

The density fitting (DF) approximation,^{5–7} also known as resolution-of-the-identity (RI), aims to reduce the computational prefactor and memory requirements of MP2. DF/RI methods introduce a set of auxiliary basis functions to describe the charge density, thereby approximating the fourth order ERI tensor as a product of lower order tensors. By optimizing the auxiliary basis functions, DF-MP2 can achieve good agreement with canonical MP2.^{8–10} In addition, DF-MP2 is particularly well suited for implementation on massively parallel computers,^{11,12} enabling its application to large molecules with up to 4000 basis functions.^{13} Despite the great success of DF-MP2 at reducing the computational cost of MP2, one deficiency is that calculating the exchange-like correlation term still scales as the fifth power of the system size. This exchange term therefore becomes the bottleneck for large molecules.

An alternative way to reduce the computational prefactor is to utilize highly parallel hardware such as graphical processing units (GPUs). GPUs can launch thousands of threads simultaneously and have become a powerful tool in *ab initio* quantum chemistry.^{14–17} GPUs have been applied to MP2^{18} and also many other types of electronic structure methods such as density functional theory,^{17,19,20} coupled cluster,^{21} and plane wave basis set techniques.^{22}

Most approaches to reduce the scaling behavior of correlation methods rely on spatial locality. This is most straightforward when the basis functions themselves exhibit the desired locality, as in the often-used atom-centered Gaussian basis sets. Within the context of MP2, one example is atomic orbital Laplace-transformed MP2,^{23,24} which scales quadratically for molecules with relatively large HOMO–LUMO gaps. Some methods attempt to further exploit the locality of electron correlation by introducing localized molecular orbitals. One successful example is the local electron correlation treatment originally proposed by Pulay^{25} and first implemented by Saebo and Pulay for Møller–Plesset perturbation theory up to fourth order, where projected atomic orbitals (PAOs) were used as a localized basis to expand the correlated space.^{26} By introducing multipole approximations and efficient prescreening algorithms, the scaling of both MP2^{27} and CCSD^{28} can be reduced to linear. Neese *et al.* proposed the use of pair natural orbitals (PNOs) to overcome the problem that large PAO domains may be needed to converge the correlation energy.^{29} A PNO-based local-MP2 method has recently been reported by Werner *et al.*, which scales linearly with molecular size and achieves high parallel efficiency.^{30} These local correlation concepts have been formalized into the sparse map framework, which provides a systematic way to treat electron correlation with sparse tensor techniques.^{31}

In the case of MP2, a second approach to reduce the scaling behavior consists of neglecting the exchange-like term. This does not affect the scaling of canonical MP2 (because the integral transformation remains the computational bottleneck) but in combination with the DF/RI approximation yields a quartic scaling method. Of course, the missing exchange contribution to the correlation energy must then be approximated in a computationally inexpensive way. One of the most important methods falling into this category is the scaled opposite spin MP2 (SOS-MP2) method,^{32} which approximates the correlation energy by scaling the Coulomb-like opposite-spin correlation energy with an empirical parameter. In combination with density fitting, this leads to quartic scaling with system size for the DF-SOS-MP2 energy (because the quintic scaling exchange term has been neglected). By using localized orbitals and exploiting the sparsity of local integral matrices and expansion coefficients, the scaling can be further reduced to cubic.^{33} Very recently, GPUs have been used to accelerate SOS-MP2 calculations based on a J-engine reformulation.^{34}

In addition to the traditional tensor-based approach, recently, there have also been several low-scaling methods proposed based on stochastic sampling that allow MP2 calculations to be performed in grid-based representations. Neuhauser *et al.* proposed a stochastic scheme that replaces the full summation with random averages.^{35} This is achieved by replacing the energy denominator with integration over real-time and introducing random wave functions. An alternative formulation in the energy-domain was later proposed by Ge *et al.*^{36} Another example is the Monte Carlo MP2 method proposed by Willow *et al.*,^{37} which evaluates the correlation energy through Monte Carlo integration by performing a stochastic sampling in the electron coordinate space. For each Monte Carlo step in this algorithm, the computational cost formally scales quadratically with respect to system size. These stochastic approaches are easily parallelized over many processors^{38} but have distinct disadvantages if energy gradients are required (for example, in *ab initio* molecular dynamics).

Tensor hypercontraction (THC)^{39,40} is a recently proposed method, which, like DF, approximates the ERI tensor by factorization into lower order tensors.^{41} Unlike DF, THC leads to a complete unpinning of all four indices in the ERI. This provides more flexibility compared to other methods (canonical and/or density-fitting MP2) as fewer indices may need to be carried during tensor operations. In tensor hyper-contraction, the ERI tensor is approximated as

where *p*, *q*, *r*, s are orbital indices and *P*, *Q* refer to a set of auxiliary variables. These auxiliary variables can have an abstract meaning if general tensor decompositions are employed,^{39} but they can also be given a concrete representation as gridpoints if desired, which is related to the grid-based representation adopted in the aforementioned stochastic algorithms. This emphasizes the relationship of Eq. (1) to a renormalized numerical quadrature^{42} and is the viewpoint taken in least-squares THC (LS-THC).^{40} Previous work^{39,43–45} has applied Eq. (1) in the MO basis. In principle, there is no restriction to the MO basis, but the accuracy that can be achieved for a fixed number of auxiliary indices could depend on the representation. In this work, we instead use the AO basis. As we will show, this can be more efficient and does not degrade the accuracy substantially. Furthermore, an AO formulation opens the door to efficient analytic derivative formulations.

In least-squares THC (LS-THC),^{40} one starts with a predetermined grid and the elements of the *X* tensors in Eq. (1) are then constrained to be values of orbitals at gridpoints

The quadrature grids used for LS-THC are constructed in a manner similar to the grids used in the density functional theory. Molecular grids are constructed from the union of spherical grids centered on each atom, which are formed as the direct product of angular (i.e., Lebedev-Laikov grids^{46}) and radial grids (i.e., the Handy radial scheme^{47}).^{45} When the MO basis is used, the resulting *X* tensors are dense since molecular orbitals are usually delocalized. In contrast, when THC is applied in the AO basis, the *X* tensors become sparse due to the locality of atom-centered Gaussian atomic orbitals. This can be exploited by using sparse tensor techniques.

As shown previously,^{39,40,45} the flexibility of THC allows cubic scaling computation of the Coulomb-like term in MP2 and quartic scaling computation of the exchange-like term.^{39} This implies cubic scaling of THC-SOS-MP2, where only the opposite spin correlation term (i.e., the Coulomb-like term) is computed. However, the construction of the *Z* tensor needed for THC scales either quartically or quintically (see below) and this bottleneck must be addressed in order to realize a cubic scaling THC-SOS-MP2 algorithm. In this work, we describe an efficient implementation of THC-SOS-MP2 by accelerating integral construction and matrix multiplication operations with GPUs and by exploiting the locality of atomic orbitals during any matrix multiplication operations involving the *X* tensors. Our algorithm does not depend on localization of the molecular orbitals but instead exploits locality in the atomic orbital basis set.

The structure of this paper is as follows. First, we discuss how *X* and *Z* (the fundamental ingredients in THC) are constructed on GPUs in the AO basis. Second, the working equations for THC-SOS-MP2 are presented, emphasizing the computational scaling when the locality of the AO basis set is exploited. Next, we test the accuracy for AO-based THC-SOS-MP2 by comparing with DF-SOS-MP2. Finally, we show the timings and scaling for the method on a variety of test sets.

The indices appearing in this work are summarized as follows:

*μ*,*ν*,*λ*,*σ*: Atomic Orbital (AO) basis indices, which are atom-centered contracted Gaussian functions. These indices range from 1…*N*_{μ}, where*N*_{μ}is proportional to system size*N*.*A*,*B*,*C*: auxiliary basis indices for density fitting (DF) approximation, which are atom-centered Gaussian functions. These indices range from 1…*N*, where_{A}*N*∝_{A}*N*. For satisfactory accuracy,*N*is usually three to four times larger than_{A}*N*_{μ}.*P*,*Q*,*R*,*S*: auxiliary basis indices for THC approximation, which we take to represent gridpoints in physical space in this work. These indices range from 1…*N*, where_{P}*N*∝_{P}*N*. For satisfactory accuracy,*N*is usually eight to ten times larger than_{P}*N*_{μ}.*i*,*j*,*k*: occupied MO basis indices ranging from 1…*N*._{occ}*a*,*b*,*c*: virtual MO basis indices ranging from 1…*N*. Usually,_{virt}*N*is significantly larger than_{virt}*N*._{occ}*κ*: Laplace quadrature point indices. The range of these indices is independent of*N*, and 7-10 quadrature points are usually sufficient.

## II. METHODS

### A. GPU-based construction of *X* and *Z* tensors

In the most general formulation of THC, the *X* and *Z* tensors both need to be determined.^{39} In LS-THC, the *X* tensors are prescribed (as collocation matrices on a predetermined molecular grid) and the *Z* tensor is determined by least-squares fitting,^{40} minimizing the difference between the THC-approximated and analytical integrals. The analytical solution for the *Z* tensor is

where the joint collocation matrix is defined as

and the THC (“physical space”) metric matrix is defined as

As written in Eq. (3), the scaling of *Z* tensor construction with respect to molecular size is quintic. This scaling can be reduced to quartic by using the density fitting approximation for the electron repulsion integrals as follows:

Unfortunately, this would be the bottleneck if we proceeded to compute the *Z* tensor in this form, since the scaling of the THC-SOS-MP2 energy calculation is only cubic (*vide infra*). Thus, we show below how Eq. (6) can be evaluated in cubic time for a localized basis set.

Figure 1 illustrates how the *X* tensors are constructed on the GPUs. In order to exploit locality in atomic orbitals, we first divide the space containing the entire molecular geometry into cubic boxes with a specified edge length. The choice of the box size is discussed below. Primitive Gaussian basis functions and gridpoints are assigned to boxes according to their positions, and a sorting operation ensures that primitive Gaussians and gridpoints belonging to the same box are stored together. For every box, we define a box center $R\u0304Box$ and a box radius *ρ ^{Box}*. The box center is the average position of all gridpoints within the box. The box radius is the maximum distance from any of its gridpoints to its center. We also introduce a user-defined threshold,

*X*, such that elements in the

_{thre}*X*tensor with absolute value less than

*X*can be safely neglected (

_{thre}*X*= 1 × 10

_{thre}^{−12}atomic units is used for all the results shown in this paper). The threshold

*X*is used to compute two cutoff distances for each angular momentum: the gridpoint cutoff distance

_{thre}*r*, which determines whether a gridpoint is located far enough to be neglected with respect to a primitive Gaussian function; and the box cutoff distance

_{thre}*R*, which determines whether an entire box of gridpoints can be neglected. The gridpoint cutoff distance

_{thre}*r*is determined by solving

_{thre} where *L* is the angular momentum of the primitive Gaussian function; *w _{max}* is the maximum absolute value of gridpoint weight over all gridpoints;

*d*and

_{max}*α*are the maximum absolute value of contraction normalization factor and minimum exponent, respectively, over all primitive Gaussian functions. Therefore, the value of

_{min}*r*reflects the properties of the most diffuse primitive Gaussian function in the system. For water clusters with cc-pVDZ basis set,

_{thre}*r*is ≈10 for

_{thre}*X*= 1 × 10

_{thre}^{−12}(values in atomic units). The box cutoff distance

*R*is computed as $Rthre=rthre+\rho maxBox$, where $\rho maxBox$ is the maximum radius among all boxes. The three thresholds

_{thre}*R*,

_{thre}*r*, and

_{thre}*X*are passed as parameters to the GPU kernels and are applied to all GPU threads. When computing on the GPUs, a 1-thread-1-primitive strategy

_{thre}^{13}is adopted where each thread only computes the elements of the

*X*tensor for a single primitive Gaussian function. As shown in Figure 1, we exploit locality at three levels. First, each thread neglects boxes whose center is farther than

*R*from the center of the thread’s primitive Gaussian. Second, for boxes located within the cutoff distance, each thread then neglects gridpoints located farther than

_{thre}*r*from the thread’s primitive Gaussian. Finally, for the remaining gridpoints, the element value is computed but is only written to global memory if its absolute value exceeds the threshold

_{thre}*X*. In order to balance the workload between threads, we require that all threads within a GPU block are assigned primitive Gaussian functions belonging to the same box. After the values of primitive Gaussians on gridpoints are transferred back to the central processing unit (CPU), a post-processing procedure is invoked to construct the final

_{thre}*X*tensors as a sparse matrix. Each box constructs its own

*X*tensor, with the first index (major index) consistent with its own gridpoints, and the second index keeping track of all atomic orbitals with nontrivial values on any of its own gridpoints. As the number of contributing boxes is roughly constant for each primitive Gaussian, the nominal scaling of

*X*tensor construction is linear. The accuracy of the above strategy is solely controlled by

*X*. Therefore, the choice of box size has no effect on the accuracy but may affect the performance. A large box size increases the box cutoff distance

_{thre}*R*, and therefore, is inefficient for exploiting locality in

_{thre}*X*tensor construction. In contrast, although a small box size is favorable for screening the gridpoints at earlier stage, the resulting

*X*sub-tensor for each box may not be large enough to saturate the graphics card in future tensor operations to benefit from its parallel capability. To balance between these two factors, in our implementation, we set the box length equal to 5.0 a.u. in the

*X*tensor construction. After the post-processing procedure, boxes containing small

*X*sub-tensors with fewer than 500 gridpoints are merged with adjacent boxes.

To exploit locality in the joint collocation matrix *R* defined in Eq. (4) for *Z* tensor construction, we adopt the same box partitioning used in the *X* tensor construction. Although one could reuse the *X* tensor to compute *R*, we found better performance and lower memory consumption by recomputing the elements of *R* in each box. Figure 2 defines the concept of *neighborhoods* that we use to exploit locality in *R* as we construct the *Z* tensor. Figure 2(a) illustrates the neighborhood for a given primitive pair *φ*(**r**; **P**), where the center of the corresponding charge distribution **P** is defined by the Gaussian product rule. As the primitive pair is also a Gaussian function, we can compute *R* in the primitive pairs on GPUs using a similar strategy as that for *X* tensor construction. This involves first presorting the primitive pairs to consolidate those that are in the same box (this is done as a preprocessing step on the CPU). Primitive pairs *pq* with Schwarz bound,^{48} i.e., (*pq*|*pq*)^{1/2}, lower than a user-defined threshold (1.0 × 10^{−13} a.u. is used for all test results shown in this paper) are neglected. Each GPU thread then computes *R* values for one primitive pair, considering only boxes of gridpoints within the neighborhood of this primitive pair. When a contracted basis set is used, an atomic orbital pair *ϕ*_{μ}(**r**) *ϕ*_{ν}(**r**) is represented as the contraction of several primitive pairs $\varphi \mu r\varphi \nu r=\u2211mdm\phi mr;Pm$ each with its own unique center (where *d _{m}* are the contraction coefficients). The neighborhood of this contracted atomic orbital pair is then defined as the union of the neighborhoods for its component primitive pairs, as shown in Figure 2(b). Each atomic orbital pair is assigned to a box according to the average coordinates of its primitive pair centers

where **A** and **B** represent the centers of two atomic orbitals, *M* and *N* represent the number of primitive Gaussian functions in each contracted atomic orbital, and *ζ*_{i}/*ζ*_{j} represent the exponents of each primitive Gaussian. As illustrated in Figure 2(c), the neighborhood of a box of atomic orbital pairs is the union of the neighborhoods of each of the contracted atomic orbital pairs in the box. For a large enough molecule, the size of all the neighborhoods discussed here will not depend strongly on molecular size. This implies that computational effort for constructing the joint collocation matrix will scale linearly with the number of atomic orbitals.

Figure 3 shows the entire workflow used to accelerate construction of the *Z* tensor with the density fitting approximation. As the number of boxes comprising the neighborhood for a given box of atomic orbital pairs remains roughly constant, the calculation of the THC metric matrix defined in Eq. (5), which involves contraction over atomic orbital pairs, can be computed with linearly scaling operations. By using the same strategy, the contraction of the joint collocation matrix with the three-center two-electron integrals

can be computed with quadratic scaling operations. Therefore, the overall nominal scaling for the integral calculations is quadratic. When the integral calculations are completed, a series of cubic-scaling linear algebra operations (bottom two boxes in Figure 3) are invoked to construct the final *Z* tensor. The inversions of the DF Coulomb metric matrix (A|C) and the THC metric matrix *S ^{phys}* are computed on the CPU. As the THC metric matrix

*S*is usually ill-conditioned, a pseudo-inversion is performed by first diagonalizing the matrix, and then neglecting eigenvalues smaller than a predefined threshold (1.0 × 10

^{phys}^{−12}a.u. in this work). We then perform a series of cubic scaling matrix multiplications on the GPU to compute the Z tensor as

From the scaling analysis, the linear algebra operations are expected to dominate the computations as the systems become larger.

### B. THC-SOS-MP2

Once *X* and *Z* have been constructed, they can be used to calculate the correlation energy. The canonical expression for the SOS-MP2 correlation energy based on a restricted, closed-shell HF reference is given as

where $\Delta ijab$ is the orbital energy denominator

and *c _{os}* is the empirical factor, which approximates the total MP2 correlation energy by scaling the opposite spin correlation energy (a value of

*c*= 1.3 was previously suggested).

_{os}^{32}The Laplace transformation is applied to factorize the orbital energy denominator, which can then be evaluated by numerical quadrature

^{49,50}

We start by writing the Coulomb-like correlation energy in terms of the MO-THC approximation and the Laplace transformed energy denominators

In order to exploit locality in the AO basis, we first construct the following two transformation matrices:^{23,51}

where the scaling of the required operations has been shown next to the equations. Given these transformation matrices, the Coulomb-like term can be written in the AO basis as

By exploiting symmetry, the expression can be simplified as

where

By applying the transformation matrices to the *X* matrix, we then define

The nominal scaling of the operations has made use of the fact that the number of significant gridpoints for each atomic orbital is nearly constant. Now, the *G* matrix defined in Eq. (19) and the SOS-MP2 energy in Eq. (17) can be evaluated in *O*(*N*^{3}) operations

Thus, the calculation of the SOS-MP2 energy (given *X* and *Z* tensors in the AO basis) formally scales cubically with molecular size.

## III. RESULTS AND DISCUSSION

The method described above was implemented in the TeraChem ^{14–16} quantum chemistry package (which currently supports atom-centered *spd* Gaussian basis sets). All calculations were performed with the cc-pVDZ basis^{52} without the frozen core approximation. We will use the term AO-THC-DF-SOS-MP2 to denote this work, which indicates that THC factorization is applied to AO-based integrals approximated with density fitting (Eq. (6)), instead of the exact integrals (Eq. (3)). Similarly, the term MO-THC-DF-SOS-MP2 will be used to denote our previous implementations^{45} where THC factorization is applied in MO-basis. We used grids previously optimized for THC-MP2 in the molecular orbital basis.^{45} The THC grids contain around 60 gridpoints per hydrogen atom, and around 120 gridpoints per each carbon, nitrogen, or oxygen atom. The cc-pVDZ-RI auxiliary basis set is used in the density fitting approximation.^{53}

The accuracy of the AO-THC-DF-SOS-MP2 correlation energy was verified by comparing to the DF-SOS-MP2 energy implemented in Q-Chem.^{54} Errors for AO-THC-DF-SOS-MP2 with respect to DF-SOS-MP2 are presented in Figures 4 and 5 for alanine polypeptides and water clusters, respectively. The errors introduced by using the THC approximation are less than 0.5 kcal/mol for alanine polypeptides up to 16 amino acids (162 atoms, 1610 basis functions) and less than 0.21 kcal/mol for water clusters containing up to 48 water molecules (144 atoms, 1200 basis functions). These errors correspond to less than 0.0075% of the total correlation energy and are well within chemical accuracy. In Figure S1,^{58} we compare the error from the THC approximation with that arising from the DF approximation (for a series of alkenes), showing that the THC error is more than an order of magnitude less than the DF error (which has already been found to be sufficiently accurate in many applications). In Figure S2,^{58} we show the THC error, which arises for basis sets containing diffuse functions (for a series of water clusters). As there are currently no optimized grids for basis sets with diffuse functions, we used a grid optimized for the cc-pVTZ basis set.^{45} Even so, the THC error is well below 1 kcal/mol. Optimized grids for diffuse basis sets are under development, and these should decrease the error substantially. Finally, it is important to show that the THC errors are systematic, i.e., that they at least partially cancel when taking energy differences. We have shown this for canonical MP2 previously. In Figure S3,^{58} we compare the relative DF-SOS-MP2 energies of 21 (H_{2}O)_{16} water cluster structures corresponding to local minima. The RMS error of these relative energies for AO-THC-DF-SOS-MP2 compared to DF-SOS-MP2 is 0.015 kcal/mol well within chemically acceptable errors.

Using the water clusters as a test set, Figure 6 compares the computational scaling with respect to the number of basis functions for AO-THC-DF-SOS-MP2, MO-THC-DF-SOS-MP2, DF-SOS-MP2, and Hartree-Fock. In these tests, AO-THC-DF-SOS-MP2 and HF are computed using TeraChem ^{55} on the GPU, MO-THC-DF-SOS-MP2 is computed on the CPU with a development version of Psi4,^{56} and DF-SOS-MP2 is computed on the CPU with Q-Chem v4.2.^{54} When comparing absolute times, one should bear in mind that these correspond to different architectures and programs. The observed scaling should not be sensitive to the variation in architectures but instead is a reflection of the underlying algorithms. The formal scaling of MO-based THC-DF-SOS-MP2 is *O*(*N*^{4}), which is the same as DF-SOS-MP2. In contrast, the formal scaling of AO-based THC-DF-SOS-MP2 is *O*(*N*^{3}). Figure 6 shows that DF-SOS-MP2 exhibits exactly the expected quartic scaling, while MO-THC-DF-SOS-MP2 is only cubic in this size regime. The AO-THC-SOS-MP2 method scales even more favorably with an observed scaling in this size regime of *N*^{2.6}. This scaling is only slightly worse than the preceding Hartree-Fock calculations (as performed by TeraChem on the GPU). For (H_{2}O)_{48}, which is the largest water cluster tested, AO-THC-DF-SOS-MP2 requires about 5 times as much computing time as the prerequisite Hartree-Fock calculation. We have also performed the same tests on alanine polypeptides and obtained similar scaling behavior (up to 1610 basis functions) as shown in Figure S4.^{58} For the largest tested alanine polypeptide (ALA)_{16}, AO-THC-DF-SOS-MP2 requires about 3 times longer than the preceding Hartree-Fock calculation.

In order to better understand the performance-limiting step, we examined the timing and scaling for each component of the AO-THC-DF-SOS-MP2 calculations, as shown in Figure 7 for water clusters. Similar information for alanine polypeptides is provided in Figure S5.^{58} The correlation energy calculation (“SOS Energy,” Eqs. (14)-(25)) scales sub-cubically and is not the computational bottleneck for all systems tested. This indicates that Eqs. (15), (16), and (25) (which formally scale cubically) have low prefactors compared to the remaining equations for the SOS energy (which formally scale quadratically). The component that has the highest scaling is the construction of the *Z* tensor in Eq. (6), which becomes the computational bottleneck for large systems. Although this involves a number of matrix multiplications, the dominant step is the inversion of the DF and THC metric matrices (inner and outer terms on the right-hand side in Eq. (10)). As mentioned above, these pseudo-inversions are currently performed on the CPU by diagonalizing the full matrix. They are therefore much more expensive than matrix multiplications, which have the same cubic scaling but are greatly accelerated by the GPUs. In order to further extend the size of molecules for which our method is feasible, it will be necessary to make this matrix inversion more efficient. This work is currently in progress.

## IV. CONCLUSIONS

By taking advantage of the THC factorization, we demonstrated a reduction in the formal scaling of SOS-MP2 from quartic to cubic with respect to the size of the system. In practice, the observed scaling is an even more favorable *N*^{2.6} in the size regime up to 160 atoms. As the method is highly efficient, it can potentially be applied to study properties of large-scale hydrogen bonding networks and to understand noncovalent interactions in biological molecules.

The implementation of GPU-accelerated *X* and *Z* tensor construction provides an efficient way to compute the THC approximated integrals and can be used in other types of THC-approximated correlation methods. Even when the *X* and *Z* tensor construction is not the computational bottleneck, the GPU implementation will still be helpful in reducing the total wall time. For example, it can be applied to previous algorithms for THC coupled cluster (THC-CC2 and THC-CCSD)^{43,57} and THC equation-of-motion coupled cluster (THC-EOM-CC2)^{43} calculations.

We have demonstrated that THC applied in the AO basis is not only efficient due to locality inherent in the AO basis but also gives accurate correlation energies. In addition to the advantages in terms of low computational cost as we have discussed, AO-based THC is more appealing than the MO-based THC for developing analytic gradients of THC approximated methods. When computing the gradients of the *Z* tensor (Eq. (3) or Eq. (6)) in the MO basis with respect to the atomic degrees of freedom, one should include the contributions from the gradients of molecular orbital coefficients, which requires the solution of coupled-perturbed Hartree-Fock equations. The expression of the *Z* tensor in the AO basis overcomes this problem and opens up the possibility of developing analytical gradients for THC and extending the applications of THC to *ab initio* molecular dynamics. We are currently working on these enhancements.

In the last part of the Discussion section, we have observed that inverting the metric matrices becomes the bottleneck as system sizes become large. In a forthcoming paper,^{59} we will explore methods that can reduce the cost of these matrix inversions. Such advances would make THC-SOS-MP2 feasible for larger biological systems such as proteins.

## Acknowledgments

This work was supported by the National Science Foundation (Grant No. CHE-1565249) and used the XStream computational resource supported by the NSF MRI Program (Grant No. ACI-1428930). We thank Edward Hohenstein and Lee-Ping Wang for helpful suggestions and insightful discussions. T.J.M. is grateful to the Department of Defense (Office of the Assistant Secretary of Defense for Research and Engineering) for a National Security Science and Engineering Faculty Fellowship (NSSEFF). C.S. is grateful for a Stanford Graduate Fellowship.

## REFERENCES

_{2}O)(

_{16}) and (H

_{2}O)(

_{17}): A new global minimum for (H

_{2}O)(

_{16})

Throughout this article, we use the word tensor to indicate any multi-dimensional array and do not mean to imply any specific transformation properties.