Developed over the past decade, TeraChem is an electronic structure and *ab initio* molecular dynamics software package designed from the ground up to leverage graphics processing units (GPUs) to perform large-scale ground and excited state quantum chemistry calculations in the gas and the condensed phase. TeraChem’s speed stems from the reformulation of conventional electronic structure theories in terms of a set of individually optimized high-performance electronic structure operations (e.g., Coulomb and exchange matrix builds, one- and two-particle density matrix builds) and rank-reduction techniques (e.g., tensor hypercontraction). Recent efforts have encapsulated these core operations and provided language-agnostic interfaces. This greatly increases the accessibility and flexibility of TeraChem as a platform to develop new electronic structure methods on GPUs and provides clear optimization targets for emerging parallel computing architectures.

## I. INTRODUCTION

Over the years, a wide variety of different electronic structure packages have become available, each with their own specific implementations and target applications. Practicing good design principles, such as code encapsulation and well-defined interfaces, enables quick prototyping for new methods and allows developers to focus on algorithmic advancements without disrupting downstream workflows. These principles remain important in today’s scientific computing environment, where high-level languages such as Python are used as “glue” to enable interoperability between modules of one package or between several electronic structure codes. For example, Psi4^{1} provides C++ and Python interfaces to their libraries and driver-level code and has even recently provided a set of reference electronic structure implementations leveraging a combination of Psi4 and standard Python libraries such as NumPy.^{2} Meanwhile, lower-level offerings, such as LibInt,^{3} LibXC,^{4} and XCFun,^{5} provide standard interfaces to Gaussian integrals and exchange–correlation (XC) functionals for density functional theory (DFT) and have seen some adoption throughout the community. In addition to the benefits already mentioned, well-chosen and physically motivated interfaces can greatly influence algorithmic development for new methods.

Encapsulation and clear interfaces are also critical when considering the use of hardware accelerators such as graphics processing units (GPUs). Originally designed for rendering pipelines, the GPU architecture is well-suited to take advantage of single-instruction multiple-data (SIMD) parallelism. Typical design principles for achieving high throughput on GPUs involve pipelining expensive central processing unit (CPU)–GPU memory transfers and minimizing idle threads in warps (groups of threads that execute instructions simultaneously), although recent advances in GPU technology have decreased the performance penalty for warp divergence and provided mechanisms to increase CPU–GPU memory bandwidth. GPUs typically have stronger single precision performance, particularly consumer-grade (e.g., NVIDIA GeForce GTX) cards that do not have the double precision hardware support available as in scientific-grade (e.g., NVIDIA Tesla) cards. It is hard to redesign general electronic structure algorithms for many methods with these constraints in mind, so a better approach is to build several highly tuned operations that can be used in many methods throughout the codebase.

The TeraChem software package was developed in 2008, following the release of NVIDIA’s Compute Unified Device Architecture^{6} (CUDA) toolchain for general-purpose GPU computing in 2007. The initial goal was to use GPUs to accelerate the evaluation of the electron repulsion integrals (ERIs),^{7–10} which form the computational bottleneck in many electronic structure theories and have been the target of many efforts in the community.^{11–19} This was most efficiently implemented in terms of Coulomb and exchange matrix builds, where contractions of the density matrix with the ERIs were computed directly. As the codebase grew, the most expedient route for implementing new methods was to reuse these encapsulated GPU-accelerated operations. The development of TeraChem’s configuration-interaction (CI) library^{20} directly followed this pattern, where one- and two-particle density matrix construction and other rate-limiting matrix–vector operations were accelerated on GPUs and then used as algorithmic building blocks for novel method development. This approach is successful as the energy of any non-DFT Hamiltonian system can be written as a summation of one- and two-electron operators contracted with the corresponding density matrices; hence, one should focus on obtaining density matrices and their operator expressions within the appropriate basis in a computationally feasible manner. For self-consistent field (SCF) methods, density matrices are straightforward to obtain through the diagonalization of the Fock matrix, but the operator expressions (i.e., contraction against the ERIs) are challenging. In CI approaches, getting the density matrices (or underlying wavefunctions) is already difficult, and therefore, significant effort has been expended to efficiently solve for the wavefunction and density matrices.

Together, these advances provide TeraChem with electronic structure methods that use GPUs effectively and allow for efficient on-the-fly energy and gradient evaluations, which form the basis of many quantum chemistry workflows. TeraChem has been used to perform *ab initio* molecular dynamics (AIMD) and optimizations on entire proteins^{21,22} and to provide real-time interactive AIMD^{23} on desktop workstations. The *ab initio* nanoreactor framework^{24,25} combines accelerated AIMD with geometry and minimum energy path optimizations to perform automated reaction network discovery. An interface to the FMS90 package facilitates *ab initio* multiple spawning^{26} (AIMS) simulations, which have been used to study photochemical processes in the gas phase and protein environments.^{27–33} Nonadiabatic dynamics for multichromophoric light-harvesting assemblies are enabled through the *ab initio* exciton model.^{34–36} A new socket-based interface for TeraChem based on Google’s Protocol Buffers^{37} has been developed to facilitate GPU acceleration of these higher-level quantum chemistry workflows and has now been tested with a large variety of applications.

Here, we present the current modular structure of the TeraChem software package, as shown in Fig. 1. First, we describe interfaces and key GPU implementation details of the five main libraries: IntBox, which provides direct atomic orbital (AO) integral evaluation routines; SQMBox, effectively a semiempirical Mulliken-approximated equivalent for IntBox; THCBox, which uses tensor hypercontraction (THC) to leverage rank-sparsity in the ERIs for alternate contraction schemes; CIBox, which provides efficient routines for determining CI wavefunctions; and CCBox, which provides coupled-cluster methods. Throughout the discussion of the various libraries, we also highlight the rise of several encapsulated GPU-accelerated operations that have become core algorithmic building blocks for most electronic structure methods in TeraChem. Next, we provide a short summary of our new TeraChem Protocol Buffer (TCPB) server interface. Finally, we conclude with a discussion of rapid electronic structure method development in the context of the aforementioned core electronic structure operations; in particular, we emphasize how encapsulation and well-defined interfaces to these operations provide a path forward for leveraging emerging parallel architectures without the need to redesign each electronic structure method implementation.

## II. INTBOX

IntBox is a library for direct one- and two-electron integral evaluation for contracted atom-centered Gaussian basis sets, currently applicable for *spd* angular momenta. The IntBox interface, shown schematically in Fig. 2, is exposed as pure C for portability and simplicity. Users provide a Gaussian-type orbital (GTO) basis set, nuclear charge and position, and a generalized density matrix as inputs. IntBox then provides the following quantities, described in greater detail below, as outputs: one-electron (i.e., overlap, kinetic energy, and nuclear attraction) integrals, two-electron quantities such as Coulomb and exchange matrices, effective core potentials, spin–orbit couplings, and all corresponding gradient contributions. Overlap and kinetic energy integrals are performed solely on the CPU, while the remaining quantities are accelerated through the use of GPUs.

Each contracted GTO can be expressed as a linear combination of atom-centered primitive Gaussians,

where *μ*, *ν*, *λ*, and *σ* index contracted atomic orbitals (AOs). Most electronic structure methods rely on several one-electron integrals, namely, the overlap, kinetic energy, and nuclear-attraction integrals, given as

where *A* indexes atoms, while **r** and **R** are electronic and nuclear coordinates, respectively. The two-electron ERIs for the contracted GTOs are given as

where *μ*_{i} denotes the *i*th primitive corresponding to the *μ*th contracted basis function, and the primitive integrals (*χ*_{i}*χ*_{j}|*χ*_{k}*χ*_{l}) in Eq. (5) can be computed analytically.^{38,39} The ERIs rarely appear in isolation; rather, many terms include the ERIs contracted with an arbitrary density matrix *D*_{λσ}, as in the Coulomb and exchange matrices,

Since ERI evaluation is often the computational bottleneck in electronic structure codes, significant effort has been expended to use GPUs to accelerate these **J** (i.e., Coulomb) and **K** (i.e., exchange) builds from Eqs. (7) and (8) in IntBox.^{8–10} The two-electron primitive integrals are presorted into batches on the CPU by angular momentum and density-weighted Schwarz bound,^{40}

Each GPU kernel then uses the density-weighted Schwarz bound as a screening criterion, which is extremely effective when working in the AO basis where spatial sparsity can be exploited. Additionally, the presorting results in coalesced memory access and prevents thread divergence, which are all important factors in gaining high GPU performance. The early version of TeraChem screening for exchange builds neglected any integrals where the density-weighted Schwarz bound dropped below the integral threshold multiplied with a “guard” parameter.^{9} In later work, this was replaced by monitoring the product of the Schwarz bound with the largest element of the corresponding density matrix block,

Although less aggressive than the original “guard” screening procedure, this strategy still results in good scaling for **K** builds, as shown in Fig. 3. Although these operations formally scale as $ONAO4$, the observed scaling is often quadratic due to the exploitation of the aforementioned sparsity. The **J** builds have a smaller prefactor compared to **K** builds, but the **K** builds scale better because these interactions fall off more steeply than the Coulomb interaction.

IntBox takes a tiered mixed precision strategy, where the largest integrals are computed in double precision, other significant integrals are computed in single precision, and the remainder are neglected. In the future, this strategy could also potentially be extended to half precision to take advantage of new hardware accelerators designed primarily for machine learning, such as tensor cores in GPUs. For iterative procedures such as the self-consistent field (SCF) procedure, dynamic precision schemes take this one step further by gradually tightening the single/double precision threshold over the course of the SCF.^{41} Mixed and dynamic precision schemes work particularly well when combined with incremental Fock matrix builds,^{42} as integral screening becomes more effective as many elements of the difference density approach zero. Through the use of automated code generation techniques, IntBox performance can be regarded as near-optimal for all available angular momenta (i.e., up to *d* functions)^{43} and effective core potentials^{44,45} on a variety of different NVIDIA GPU architectures.

The basic quantities computed by IntBox are straightforwardly used to construct the core Hamiltonian and Fock matrices in self-consistent field (SCF) procedures,

where the superscripts in Eq. (12) indicate which density matrix is contracted against the ERIs; in the case of Hartree–Fock (HF) and Kohn–Sham (KS) density functional theory (DFT), the ground state density *P*_{λσ} is used. However, one can also leverage these quantities in a variety of other contexts; for example, the molecular orbital (MO) basis ERIs are often needed for post-SCF methods and can be calculated by using **J** builds to half-transform the ERIs for each pair of MOs. Consider the following two term groupings for the AO to MO transformation:

where *p*, *q*, *r*, and *s* index molecular orbitals and *C*_{μp} are the molecular orbital coefficients. The traditional algorithm in Eq. (13) performs four sequential quarter-transformations and scales as O($NAO4NMO$); meanwhile, Eq. (14) groups the first half-transformation into a Coulomb matrix and formally scales as O($NAO4NMO2$) since one **J** build is required for each MO pair. However, the effective quadratic scaling of the **J** builds makes the algorithm in Eq. (14) favorable in practice. This is especially true when considering that this transformation is generally used for transforming small subsets of the MO space, such as for complete active space (CAS) methods. Since active space sizes are typically much smaller than the full orbital space, the scaling behavior of the **J** build dominates and the entire integral transformation step is observed to have subquadratic scaling with respect to system size.^{46}

For methods such as Møller–Plesset perturbation theory or coupled-cluster singles and doubles (CCSD) that require the full MO space, factorization of the ERIs into products of third-order tensors—as in density fitting^{40,47–50} (DF) or Cholesky decomposition^{51,52} (CD) approximations—may be preferable,

Here, the **L** tensors represent an incomplete CD of the ERIs. Since we have already shown that multiple **J** builds can be used to access the full ERI tensor, it is clearly possible to perform the CD of the ERIs using only **J** and **K** builds. The question that remains is whether or not such a **J**/**K**-based algorithm would be efficient enough to be useful. In Algorithm 1, we present our purely **J**/**K**-based approach to the CD of the ERIs. In order to define the pivots, we must have access to the diagonal elements of the ERI tensor [i.e., (*μν*|*μν*)-type integrals]. These can be accessed by either $NAO2$**J** builds or *N*_{AO}**K** builds; a **K** build using a density matrix with a single entry of unity on the diagonal will result in a **K** matrix with *N*_{AO} of the diagonal ERIs on its diagonal. While there is some unnecessary overhead in this approach (i.e., the full K matrix is constructed when only the diagonal is needed), we find that it takes a small fraction of the total time required to perform the CD. With the diagonal elements in hand, we can begin the formation of the **L** tensors. Here, **J** builds are used to obtain the slices of the ERI tensor corresponding to the largest errors on the diagonal. The rest of the algorithm proceeds as usual. We use the resulting CD of the ERIs in the context of CCSD computations in our new CCBox module; we will discuss the performance of the **J**/**K**-based CD algorithm later in that context.

1: | Initialize: E = 0, P = 0, L = 0 | ||||

2: | for m Basis functions do | ||||

3: | Set: P_{mm} = 1 | ||||

4: | Form: $K\mu \nu =\u2211\lambda \sigma (\mu \lambda |\nu \sigma )P\lambda \sigma $ | ||||

5: | for n Basis functions do | ||||

6: | Set: E_{mn,mn} = K_{nn} | ||||

7: | end for | ||||

8: | Set: P_{mm} = 0 | ||||

9: | end for | ||||

10: | Initialize: N_{CD} = 0 | ||||

11: | while max(E) < thresh do | ||||

12: | Set: r,s = index of max(E) | ||||

13: | Set: P_{rs} = 1 | ||||

14: | Form: $J\mu \nu =\u2211\lambda \sigma (\mu \nu |\lambda \sigma )P\lambda \sigma $ | ||||

15: | for m Basis functions do | ||||

16: | for n Basis functions do | ||||

17: | Set: $LmnNCD=Jmn$ | ||||

18: | for $A<NCD$ do | ||||

19: | Set: $LmnNCD=LmnNCD\u2212LmnALrsA$ | ||||

20: | end for | ||||

21: | end for | ||||

22: | end for | ||||

23: | Set: $\alpha =LrsNCD$ | ||||

24: | for m Basis functions do | ||||

25: | for n Basis functions do | ||||

26: | if$m,n=r,s$ then | ||||

27: | Set: $LmnNCD=\alpha $ | ||||

28: | else | ||||

29: | Set: $LmnNCD=LmnNCD/\alpha $ | ||||

30: | end if | ||||

31: | Update: $Emn,mn=Emn,mn\u2212LmnNCDLmnNCD$ | ||||

32: | end for | ||||

33: | end for | ||||

34: | Set: $Prs=0$ | ||||

35: | Set: $NCD=NCD+1$ | ||||

36: | end while |

1: | Initialize: E = 0, P = 0, L = 0 | ||||

2: | for m Basis functions do | ||||

3: | Set: P_{mm} = 1 | ||||

4: | Form: $K\mu \nu =\u2211\lambda \sigma (\mu \lambda |\nu \sigma )P\lambda \sigma $ | ||||

5: | for n Basis functions do | ||||

6: | Set: E_{mn,mn} = K_{nn} | ||||

7: | end for | ||||

8: | Set: P_{mm} = 0 | ||||

9: | end for | ||||

10: | Initialize: N_{CD} = 0 | ||||

11: | while max(E) < thresh do | ||||

12: | Set: r,s = index of max(E) | ||||

13: | Set: P_{rs} = 1 | ||||

14: | Form: $J\mu \nu =\u2211\lambda \sigma (\mu \nu |\lambda \sigma )P\lambda \sigma $ | ||||

15: | for m Basis functions do | ||||

16: | for n Basis functions do | ||||

17: | Set: $LmnNCD=Jmn$ | ||||

18: | for $A<NCD$ do | ||||

19: | Set: $LmnNCD=LmnNCD\u2212LmnALrsA$ | ||||

20: | end for | ||||

21: | end for | ||||

22: | end for | ||||

23: | Set: $\alpha =LrsNCD$ | ||||

24: | for m Basis functions do | ||||

25: | for n Basis functions do | ||||

26: | if$m,n=r,s$ then | ||||

27: | Set: $LmnNCD=\alpha $ | ||||

28: | else | ||||

29: | Set: $LmnNCD=LmnNCD/\alpha $ | ||||

30: | end if | ||||

31: | Update: $Emn,mn=Emn,mn\u2212LmnNCDLmnNCD$ | ||||

32: | end for | ||||

33: | end for | ||||

34: | Set: $Prs=0$ | ||||

35: | Set: $NCD=NCD+1$ | ||||

36: | end while |

Linear response excited state methods such as configuration-interaction singles (CIS) and Tamm–Dancoff approximation time-dependent density functional theory (TDA-TDDFT) can be formulated in terms of Fock-like matrix builds with a nonsymmetric transition density matrix,

where *i*, *j*, *k*, and *l* index the occupied orbitals, *a*, *b*, *c*, and *d* index the virtual orbitals, and *X*_{ia} are the transition amplitudes given in the CIS/TDA-TDDFT response equation **AX** = **Xω**.^{53,54} As another illustrative example, we consider the use of **J** and **K** matrix builds in the AO-direct formalism of the coupled-perturbed state-averaged complete active space SCF (CP-SA-CASSCF) equations. In our implementation, the coupled-perturbed multiconfiguration self-consistent field equations are solved iteratively, and Fock-like matrix builds with the following generalized density matrices are sufficient to recover the response due to the orbital-CI block of the Hessian matrix:

where $\kappa \u0303\Theta $ are estimates of the orbital Lagrangian multipliers.^{55} These examples suggest that one can view **J** and **K** builds as general ERI contraction engines, where methods are reformulated in terms of Coulomb-like and exchange-like terms and the necessary entries are packed into density matrices for the GPU-accelerated contraction against the ERIs. Therefore, we consider the GPU-accelerated **J** and **K** matrix builds provided by IntBox as the first of several core building blocks for the electronic structure.

## III. SQMBOX

The emergence of the electronic structure code TeraChem in 2008 is directly tied to the development of the GPU-accelerated IntBox library mentioned above. With this tool in hand, the entire electronic structure development in TeraChem has been pursued with the maxim of leveraging the basic quantities provided by IntBox. Semiempirical methods, in particular, semiempirical density functional tight-binding methods,^{56,57} can be interpreted as a minimal basis set Hartree–Fock/Kohn–Sham density functional theory type treatment with a modified Hamiltonian. In most semiempirical methods, that boils down to applying the Mulliken approximation^{58} to the one-electron and possibly two-electron integrals and subsequent semiempirical approximation of the remaining one- and two-center terms,

The recently proposed GFN- and GFN2-xTB methods^{59,60} furthermore share with *ab initio* methods the use of a nonorthogonal (though atomwise orthogonal) GTO basis set, thus requiring a setup that is formally completely equivalent to typical *ab initio* electronic structure workflows and only differs in terms of the explicit Hamiltonian matrix elements. We have recently added a semiempirical integral library called SQMBox to TeraChem. This library provides semiempirical equivalents to the basic quantities provided by the IntBox library, i.e., quantities that have been approximated as shown in Eqs. (18) and (19) (see Fig. 4). SQMBox evaluates the Mulliken-approximated integrals from the provided overlap integrals; therefore, the library is agnostic to the underlying basis function type and can, in principle, be combined with other AO types (e.g., Slater orbitals). Hamiltonians for the GFN-xTB and GFN2-xTB methods are currently available in SQMBox. Additionally, the library has full support for the Mulliken-approximated Fock exchange, which is absent in the original tight-binding methods due to the significant increase in the computational cost with conventional formulations.^{56} However, the computational cost for the Mulliken-approximated exchange becomes negligible with our GPU implementation.^{61}

The power of our modular electronic structure environment could be demonstrated by easily making novel method combinations available. We successfully combined the GFN-xTB Hamiltonians with the spin-restricted ensemble-referenced Kohn–Sham (REKS) density functional theory method.^{62} Due to the entire electronic structure framework being built around the basic quantities from IntBox^{63} and the one-to-one mapping of *ab initio* (i.e., IntBox) and semiempirical (i.e., SQMBox) Hamiltonian terms, the novel state-averaged (SA) state-interaction (SI) REKS-xTB method combination is available with the same full functionality as the *ab initio* implementation. Hence, the fairly cumbersome rederivation of gradients and nonadiabatic coupling elements^{64} for a separate semiempirical workflow becomes unnecessary. This directly enabled us to conduct an initial benchmarking study, which clearly indicated that exchange is crucial to make xTB Hamiltonians applicable for photochemical problems. Reparameterizing an exchange-including xTB Hamiltonian for this purpose and combining SQMBox with other electronic structure workflows in TeraChem are ongoing developments.

## IV. THCBOX

The THCBox library uses tensor hypercontraction (THC) to provide an alternate formalism for contracting density matrices against the ERIs by leveraging rank-reduction techniques rather than the element sparsity used by IntBox. Tensor hypercontraction factorizes the ERIs as

where *P*, *Q*, *R*, and *S* are THC auxiliary function indices; in practice, the auxiliary functions are chosen as gridpoints in real space. THC has successfully reduced the formal scaling of many correlation methods, such as second-order Møller–Plesset perturbation theory (MP2),^{65–67} scaled opposite spin MP2 (SOS-MP2),^{68–70} second-order complete active space perturbation theory (CASPT2),^{71} and coupled-cluster theories.^{72–75} Therefore, the development of THCBox is aimed at providing these low-scaling THC-based correlation energy and gradient methods through a set of well-defined APIs.

The design of THCBox is divided into three layers: the bottom layer that constructs the THC tensors, the middle layer that provides tensor operations, and the upper layer that provides the library APIs for correlation energies and gradients, as shown in Fig. 5. THC-MP2 and THC-SOS-MP2 require the number of occupied/virtual orbitals and the molecular orbital energies and coefficients from a preceding SCF calculation as inputs and return the corresponding correlation energy and nuclear gradients. The THC-CASPT2 method takes the molecular energies and coefficients from a preceding complete active space self-consistent field (CASSCF) calculation, as well as the occupied-active and active-virtual off-diagonal blocks of the Fock matrix, the CASSCF configuration-interaction amplitudes, and a level-shifting value; at this time, only the correlation energy is implemented for the CASPT2 method. Both the bottom layer (i.e., THC construction) and the middle layer (i.e., tensor operations) are accelerated on the GPUs and described below.

### A. THC Tensor Construction Layer

The THC tensor construction layer is responsible for forming the X and Z tensors. The X tensor is computed as density of atomic orbitals on gridpoints,

Due to the locality of atomic orbitals, the X tensor is inherently sparse. If sparsity is enabled, then the X tensor will be stored as a sparse matrix, where we keep track of the list of atomic orbitals that contribute to a certain group of points; otherwise, the X tensor is stored as one matrix with dimension *N*_{AO} by *N*_{Pt}.

Different variants of THC mainly differ in the definition of the Z tensor. In Least-Squares-AO-THC (LS-AO-THC),^{68} the Z tensor is stored as one matrix with dimension *N*_{Pt} by *N*_{Pt} and is defined using a least-squares fitting procedure that minimizes the difference between the approximated integrals and targeting integrals,

By inserting the density-fitting approximation, the analytical solution to the least-squares problem takes the following form:

where

The LS-Dual-Grid THC^{70} method is a variation of LS-AO-THC where the three-electron integrals in Eq. (25) are evaluated numerically using a second set of dense grids indexed with $P\u0303$ as

Analytical gradients are available for both LS-AO-THC and LS-Dual-Grid THC.^{70}

The efficient construction of the THC tensors relies on exploiting both spatial sparsity and GPU acceleration. This includes using the Schwarz bound to screen AO pairs, as done in IntBox. In addition, *ϕ*_{μ}(**r**_{P})*ϕ*_{ν}(**r**_{P}) quickly vanishes as the gridpoint *P* moves away from the center of the primitive pair. In order to exploit this sparsity on the GPUs, we first divide the physical space into cubic boxes (each side equal to 5 a.u. in our implementation) and then populate primitive pairs and grid points into the boxes according to their center positions. For the purpose of coalesced memory access as well as avoiding thread divergence, the data are rearranged such that the grid and orbital information of the same box is stored together. The scaling of computing the X tensor is thus O(*N*_{AO}). For computing the Z tensor, by properly exploiting spatial sparsity, the contraction $\u2211\mu \nu X\mu RX\nu R\u22c5\mu \nu |A$ that appears in Eq. (25) scales only as O($NAO2$), while Eq. (24) scales as O(*N*_{AO}). Therefore, the scaling of computing the Z tensor is O($NAO3$), which is dominated by the linear algebra (i.e., matrix multiplication and matrix inversion) in Eq. (23). We also provide Local-LS-AO-THC that overcomes the bottleneck of cubic scaling matrix inversion by using a localized least-squares procedure, a natural extension of the aforementioned gridpoint sparsity screening approach; however, this is currently only implemented for SOS-MP2 energies.^{69}

### B. Tensor Operation Layer

The middle layer of THCBox provides efficient implementations of tensor operations that involve the THC tensors; for all correlation methods except SOS-MP2, this layer is the most computationally expensive component in THCBox. These tensor operations are abstracted from the contraction patterns that appear repetitively in correlation energy and gradient evaluations and thus are unique to THCBox. One such example is the following contraction:

which is required in computing intermediates for both MP2 and CASPT2 energies. A detailed list of tensor operations that have been implemented can be found in our previous work.^{71}

Our current strategy for achieving high performance is through a combination of using the sparsity of the X tensor, as well as taking advantage of the mature BLAS library by flattening high order tensors to matrices or vectors. Currently, all BLAS level-1 and level-2 calls are performed on the CPU using MKL, while level-3 operations (i.e., matrix multiplication and diagonalization) are accelerated on the GPU using the cuBLAS or MAGMA libraries. In addition, we also implemented specialized functions for the following two operations:

which take advantage of the inherent sparsity structure of the X tensor. There are two drawbacks in the current implementation. The first is that some operations cannot be efficiently evaluated using the existing BLAS library. One such example is the hypercontraction (summing over a repeated index with more than two occurrences),

which is one of the most expensive steps in evaluating MP2 and CASPT2 exchange-like energies. Second, the memory transfers between CPU and GPUs are too frequent in our current implementation to account for situations when the tensors cannot all fit into GPU memory, which often happens for large molecules. As the tensor operation layer contains the steepest scaling steps in the calculations, further optimizations of the layer that can overcome these two drawbacks would lead directly to better performance of the correlation methods.

## V. CIBOX

The CIBox library provides access to GPU-accelerated software for solving configuration-interaction (CI) related problems. In addition to a direct determinantal CI program^{20} with enhanced diagonalization performance,^{76} tools are provided which enable spin purification methods^{77} that improve eigenvalue problem subspace stability. Direct CI avoids the construction and diagonalization of the electronic Hamiltonian matrix, instead using an iterative procedure to determine a few of the low-lying eigenvalues (and their eigenvectors).^{78} The computational bottleneck in direct CI is formation of **σ**, the matrix–vector product of the Hamiltonian matrix and a trial vector,

Wavefunction information is often presented in terms of one-electron properties, requiring evaluation of the one-particle density matrix (OPDM). Analytical energy gradients (using a Lagrange based formulation) rely on the efficient formation of the generalized OPDMs and two-particle density matrices (TPDMs),

Here, *I* and *J* index configurations, **c** is the CI vector, $\xcapq$ is an excitation operator from orbital *p* to orbital *q*, and superscripts *A* and *B* denote that two separate CI vectors can be used in the generalized builds. The schematic in Fig. 6 shows that CIBox takes one- and two-electron integrals in the MO basis, generated by IntBox using the **J** matrix algorithm described in Eq. (14), and configuration space parameters (i.e., number of α/β electrons and active orbitals) as inputs. In return, CIBox provides energies, CI vectors, and string occupancy patterns as solutions to the CI eigenvalue problem for a given spin multiplicity, as well as allowing access to the underlying **σ** and generalized OPDM and TPDM builds from Eqs. (31)–(33), respectively.

As with IntBox, access to the underlying GPU-accelerated quantities enables the rapid development of new electronic structure methods. When developing the rank-reduced full CI (RR-FCI) method,^{79} we initially constructed a pilot program using **σ** vectors from direct CI. This allowed us to focus our efforts on rapidly implementing the framework with knowledge that the **σ** vectors were correct. Once our RR-FCI program was completed, we were then able to derive and implement factorized **σ** vectors, knowing that the framework was reliable. This separation of concerns permits both efficient and robust prototyping. A second example is implementation of the time-dependent CASCI method in TeraChem.^{80} In this case, the propagation of the electronic wavefunction corresponds to **σ** vector formation provided by CIBox for trial vectors in real and imaginary space. As a final example, we consider the aforementioned response of the orbital-CI Hessian matrix in the AO-direct CP-SA-CASSCF method; after building the Lagrangian multiplier density matrices from Eq. (17), an effective Hamiltonian can be built as

where *t*, *u*, *v*, and *w* index active orbitals, $\gamma tuIJ$ and $\Gamma tuvwIJ$ use the generalized OPDM and TPDM build from CIBox, respectively, and the two-electron integrals come from IntBox following the workflow of Eq. (14). Finally, a **σ**-build using this effective Hamiltonian can then be constructed as part of the full Hessian matrix-vector product needed for the coupled-perturbed multiconfiguration self-consistent field equations.^{55}

The **σ** and generalized OPDM/TPDM builds have proven to be crucial quantities for CI-based correlated electronic structure methods and therefore enjoy the same status as the **J** and **K** builds from IntBox as key building blocks for the electronic structure. In Fig. 7, we show CASCI benchmarks on myoglobin using a hybrid quantum mechanical/molecular mechanics (QM/MM) scheme with a (16, 16) active space (i.e., 16 electrons in 16 orbitals). We calculate three states (the ground state and two excited states) as a representative calculation for the Q_{x} and Q_{y} bands of the iron–porphyrin complex and compare the **J**/**K** enabled SCF, the two-electron integral transform from Eq. (14), and CASCI utilizing **σ**, OPDM, and TPDM builds. Unsurprisingly, the cost of the CASCI procedure remains roughly constant with respect to the QM size and dominates for small regions. The two-electron integral transformation code becomes more expensive at large (i.e., over 1000 atoms and 5000 basis functions) QM regions, but both the transform and CASCI are insignificant when compared with the cost of the Hartree–Fock procedure for the reference orbitals. We additionally show that encapsulating these building blocks also provides a mechanism to make improvements to the underlying code without alteration of downstream software. This ability was leveraged recently when extending sigma, OPDM, and TPDM formation to multiple GPUs^{81} and again when developing mixed precision algorithms for sigma vector formation.^{82}

## VI. CCBOX

The CCBox module is one of the newest features in TeraChem. This library performs GPU-accelerated CCSD^{83–85} computations (as well as any method that can be written as a subset of the CCSD diagrams). This library is initialized with the definition of a single reference determinant (usually from an RHF computation); practically speaking, this definition comes from the molecular orbital coefficients and the assignment of those orbitals to the frozen core, occupied, and virtual orbital subspaces. The CCBox library also requires a CD of the ERIs to be provided. To construct the Fock operator, CCBox calls the **J**/**K** builds provided by IntBox. The CCSD program uses the spin-adapted formulation of Koch and co-workers for closed-shell singlet wavefunctions.^{86} The GPU algorithm follows along similar lines to that of DePrince and co-workers.^{87–89} We attempt to minimize data transfer to and from the GPU by performing array permutations in place of the GPU using custom CUDA kernels. All tensor contractions are performed as matrix multiplications using the cuBLAS library. Using 8 Tesla V100 GPUs, CCBox is able to perform CCSD computations with roughly 1300 basis functions and 100 atoms in less than one day on a single node. Additional performance timings and detailed comparisons to CPU-based CCSD implementations are available in an upcoming paper.^{90} In Fig. 8, we show the timings of one CCSD iteration and our aforementioned **J**/**K**-based CD algorithm in a TZVP basis set^{91} for a series of water clusters containing up to 30 molecules (930 basis functions). For the largest system, each CCSD iteration takes approximately 22 min and the CD takes 14.5 min. From the perspective of CCBox, the **J**/**K**-based CD algorithm is efficient enough to be useful, since usual CCSD computations take between 15 and 20 iterations to converge. In the context of planned extensions to equation-of-motion CCSD (EOM-CCSD), the time spent performing the CD will become an even smaller fraction of the total computation time.

## VII. TERACHEM PROTOCOL BUFFER (TCPB) SERVER

Dense GPU nodes have seen an upswing in popularity among high performance computing (HPC) clusters recently. For example, Oak Ridge’s new Summit supercomputer features nodes with two 22-core IBM POWER9^{TM} CPUs and 6 Tesla V100 GPUs.^{92} However, given the competitive nature of applying for time on these few select HPC resources, cloud providers such as Amazon Web Services^{93} (AWS), Google Cloud^{94} (GC), and NVIDIA GPU Cloud^{95} (NGC) can be an attractive source of GPU computing resources. Although several file-based and Message Passing Interface (MPI) interfaces exist for TeraChem, they rely on tightly coupled computing resources (i.e., a shared file system or a specific network topology) and are not well-suited for modern distributed heterogeneous compute systems and cloud computing.

We have developed a socket-based interface based on Google’s Protocol Buffers^{37} for extensible, language-agnostic serialization, as presented in Fig. 9. Since there is no hard-coded serialization with protocol buffers, it is easy to handle general electronic structure calculations and simple to add additional capabilities to the interface without breaking backwards compatibility. The serialized protocol buffer is then sent in a standard type-length-value (TLV) TCP packet to a Python or C++ client. There are four types of protocol buffer messages: Status, Mol, JobInput, and JobOutput. Status messages include fields to provide information about the server availability, job status (e.g., accepted, rejected, in progress, or completed), and basic job information (e.g., job id and working directory). Mol messages include all the information that is normally stored in XYZ files, such as atom types, molecular coordinates, units, charge, and spin multiplicity. The JobInput message includes fields for specifying ground state methods, calculation runtype, and flags to compute optional properties. All additional input options, including those for post-SCF methods, are passed through as key-value pairs. In our experience, keeping the JobInput message “thin” is advantageous as it enables reuse of existing parsing logic, encouraging consistent job specifications between the standard input files and the TCPB server. For post-SCF methods, CIS/TDDFT and a variety of CAS methods (e.g., FOMO-CASCI, CASSCF, and CISNO-CASCI) are available through the TCPB server mode. Finally, the JobOutput message has fields for all requestable properties, which include the following at the time of writing: energies, gradients, nonadiabatic coupling, CI vector overlap, atomic charge and spin, dipoles, MO energies/occupations, bond order matrices, relaxed and unrelaxed dipoles and transition dipoles for CIS methods, and transition dipoles for CAS methods. Additionally, paths to the output file and working directory are also included in the JobOuput message. Access to the working directory has proven useful for debugging, providing verbose error messages (as the last 10 lines of the output file), and access to binary file dumps, which is most commonly used to seed orbitals for subsequent calculations. Further information on the protocol buffers and communication strategy used by the TCPB interface is provided in the supplementary material.

The TeraChem Protocol Buffer (TCPB) server mode provides greater flexibility for loosely coupled computing resources and a clean interface for electronic structure calculations as a remote service. As we have shown in our recent work, this set of interfaces is more commensurate with the cloud computing environment.^{96} This interface is in line with other efforts within the community to increase the interoperability of electronic structure codes, such as the calculators in the Atomic Simulation Environment (ASE),^{97} the Molecular Sciences Software Institute’s (MolSSI) QCEngine project,^{98} or the MolSSI Driver Interface (MDI).^{99} The TCPB server interface has been utilized for several quantum chemistry workloads, including *ab initio* molecular dynamics (AIMD), geometry minimization, and minimum energy path optimizations for automated reaction discovery,^{24,25} and nonadiabatic dynamics with fragment-based models such as the *ab initio* exciton model.^{34–36} Further development and inclusion of TeraChem in other downstream applications, such as real-time interactive AIMD,^{23} AIMS,^{26} and conical intersection optimization,^{100} is currently ongoing.

## VIII. CONCLUSIONS

In the 1980s, numerical linear algebra was revolutionized by recasting algorithms in terms of matrices and vectors to take advantage of the highly optimized BLAS and LAPACK libraries. In a similar vein, we consider the identification, encapsulation, and subsequent reuse of several core operations as an important step in the development of electronic structure software. In addition to providing code modularity and encapsulation, the existence of the right set of highly optimized routines influences the development of new algorithms. We believe that density matrices and efficient evaluation of their operator representations serve as a physically motivated encapsulation scheme, which is particularly powerful when combined with hardware accelerators such as GPUs, where tensor-like operations can be computed efficiently.

From the past decade of development in the TeraChem software package, two main classes of operations have arisen that change the way we look at electronic structure software development today. The first class of operations are for the evaluation of Hamiltonian matrix elements: IntBox provides **J** and **K** builds for evaluating contractions over the ERI tensor, THCBox uses rank-reduction techniques to offer alternate contraction schemes with reduced scaling, and SQMBox gives the Mulliken-approximated equivalents to IntBox. The second class of operations construct the wavefunction: the density matrix in SCF methods or transition density matrices for the excited state methods in TeraChem, the OPDM and TPDM from CIBox for configuration-interaction methods, and the coupled-cluster amplitudes in CCBox. In the case of CIBox, the **σ**-build also provides an efficient way to contract the Hamiltonian and wavefunction. Together, we showed how reusing these electronic structure building blocks (i.e., **J** and **K** builds from IntBox and OPDM/TPDM/**σ** builds from CIBox) can inform new algorithmic breakthroughs, as exemplified by the efficient implementation of the AO-direct SA-CP-CASSCF equations and the novel combination of the GFN-xTB semiempirical methods with REKS. Our recent work in implementing implicit solvent models has highlighted two new one-electron operations: (i) the solvent cavity surface–solute density interaction and (ii) the solvent contribution to the Fock matrix.^{101} These quantities are also useful for constructing electrostatic potentials (ESPs), restricted ESP charge fitting, and embedding schemes, so further development to encapsulate and test the broader applicability of these operations is ongoing.^{102}

Another exciting avenue of current research is the development of these core electronic structure operations on emerging parallel architectures. During the discussion of mixed precision schemes in IntBox, it was already mentioned that mixed precision could be extended to make use of hardware support for half precision, which is available in the tensor cores of newer GPUs and is intended for machine learning. The latest generation of consumer-grade NVIDIA GPUs, the GeForce RTX line, also features RT cores designed for real-time ray tracing, which have not yet been used for the electronic structure. Tensor processing units (TPUs) may also provide similar computational motifs to GPUs and may provide alternate strategies for higher-order tensor operations. The success of IntBox and CIBox in accelerating SCF and CI-based methods also provides specific targets for exploring what gains can be made from using application-specific integrated circuits (ASICs) for the electronic structure. Field-programmable gate arrays (FPGAs) could serve as a stepping-stone toward electronic structure ASICs. While some work on using FPGAs in quantum chemistry has already been done,^{103–105} creating an FPGA equivalent of IntBox or CIBox would have immediate performance benefits. The clear separation and encapsulation of these operations enable developers to rapidly prototype new methods while ensuring their implementations will efficiently leverage the next generation of hardware accelerators.

## SUPPLEMENTARY MATERIAL

See supplementary material for a detailed description of the Protocol Buffers interface with usage example and “.proto” definition file.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## ACKNOWLEDGMENTS

This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through the Advanced Computing (SciDAC) program. S.S. acknowledges support from an NSF Graduate Research Fellowship. C.B. thanks the German National Academy of Sciences Leopoldina for support through the Leopoldina Fellowship Program (Project No. LPDS 2018-09).