We present here a massively parallel implementation of the recently developed CPS(D-3) excitation energy model that is based on cluster perturbation theory. The new algorithm extends the one developed in Baudin et al. [J. Chem. Phys., 150, 134110 (2019)] to leverage multiple nodes and utilize graphical processing units for the acceleration of heavy tensor contractions. Furthermore, we show that the extended algorithm scales efficiently with increasing amounts of computational resources and that the developed code enables CPS(D-3) excitation energy calculations on large molecular systems with a low time-to-solution. More specifically, calculations on systems with over 100 atoms and 1000 basis functions are possible in a few hours of wall clock time. This establishes CPS(D-3) excitation energies as a computationally efficient alternative to those obtained from the coupled-cluster singles and doubles model.

## I. INTRODUCTION

Electronic excitation energies are relevant in several research fields, such as photochemistry, where chemical reactions are initiated by electromagnetic radiation, usually visible or ultraviolet light. The energy of electromagnetic radiation initiates a multitude of photochemical processes relevant to atmospheric chemistry, the vision in the eye, and the techniques behind photography and detection, along with the photosynthesis in the green leaves.^{1,2} In the latter case, the chlorophyll in the leaves absorbs parts of the light from the Sun and thereby provides energy to form glucose and oxygen from carbon dioxide and water. Other interesting photochemical reactions include industrial products produced by photochemical processes, such as the pesticide lindane (hexachlorocyclohexane) and the breakdown of environmentally harmful organic compounds by photo-oxidation with hydrogen peroxide or by photocatalytic oxidation using titanium oxide as catalyst.^{1,2} Excitation energies are also crucial when investigating chemical reactions induced by the monochromatic radiation of a laser where it is possible to form reactants with high precision in specific excited quantum states with accurate excitation energies.^{1,2} The laser-induced chemical reactions involve bimolecular reactions and unimolecular reactions (photochemistry with lasers), and in order to understand and control these reactions, it is crucial to perform accurate calculations of excitation energies.

With the advent of exascale supercomputers and increasing availability of computational resources, *ab initio* quantum chemistry enters a phase where a key challenge in pushing the limits of current methodologies is to develop massively parallel codes that can efficiently utilize heterogeneous clusters with several hundreds of nodes that contain both multiple central processing units (CPUs) and multiple graphical processing units (GPUs). This work consists of two main facets. First, the theoretical models need to be parallel in nature such that a parallel implementation is possible. The second challenge is to develop codes that exploit the potential parallelization of a given method and to ensure that it also makes efficient use of accelerators, such as GPUs, to increase the speed of, e.g., heavy tensor contractions in quantum chemistry by several orders of magnitude.^{3–5}

For methodologies such as Hartree–Fock (HF)^{6–14} and Density Functional Theory (DFT),^{15–24} both of the above-mentioned challenges have been extensively addressed.^{25,26} Both were implemented in highly parallel codes and scale very well with large amounts of computational resources as work can be distributed over different nodes and/or GPUs using distributed Fock builds. An example of this is a recent work that has led to a GPU-accelerated two-electron repulsion integral (ERIs) code that enables HF calculations on systems as large as a green fluorescent protein with iterations times of a few hours using multiple nodes with multiple GPUs.^{27}

Meanwhile, correlated wave function theories, such as Coupled Cluster (CC) methods, produce much more reliable results than those obtained from HF and DFT and could enjoy a much wider scope in applications. Unfortunately, parallelization of conventional CC theory presents a substantial challenge. Not only is the formal scaling of commonly used CC models, such as CC2,^{28} CCSD, CC3,^{29} and CCSD(T),^{30} much higher than mean-field theories, but they are also particularly difficult to scale to a large amount of computational resources as the solution to the non-linear amplitude equations is inherently difficult to parallelize using distributed memory schemes. Yet, over the last few decades, several different approaches have been explored in order to overcome the steep scaling of CC calculations. These include, but are not limited to, the following: (I) Methods that exploit rank sparsity in the ERIs, such as the resolution of the identity (RI) approximation (also known as density fitting)^{31–35} and Cholesky decomposition^{36–38} as well as further factorization using the pseudospectral approximation,^{39–43} the chain-of-spheres approximation,^{44,45} or tensor hypercontraction.^{46–49} Another alternative are tensor compression techniques that project, e.g., the double or triple amplitudes into a reduced space to reduce the size of the calculations.^{50–54} These approximations can efficiently lower the formal scaling of CC methods as well as the memory requirements at the cost of some accuracy. However, they have not yet solved the difficulties in parallelizing conventional CC theories. (II) Methods that utilize element-wise locality in CC calculations where, e.g., the molecular orbitals are localized in space such that the CC calculation can be divided into smaller CC calculations on fragments of the system. These fragments are only correlated with other fragments with which they actually interact based on some cutoff thresholds, e.g., distance. These methods can be divided into two families: (1) based on reduction in excitation orbital spaces^{55–69} and (2) correlation within local fragments^{70–86} that have seen great success for energy calculations and in achieving massively parallel implementations. However, the same success has not been achieved for molecular properties, such as excitation energies, that are inherently delocalized in nature.

Another strategy for developing approximation to CC models that have both a lower formal scaling and are more prone to parallelization is to utilize perturbation theory. In this context, the most prevalent examples of this are Møller–Plesset perturbation theory (MPPT)^{87} and Coupled Cluster Perturbation Theory (CCPT).^{88–90} Yet, none of these have been successfully extended to comprehend time-dependent molecular properties. Recently, we developed Cluster Perturbation (CP) theory,^{91–98} which has overcome the drawbacks of MPPT and CCPT. CP series have been developed for both ground state energies (first-order molecular properties) and excitation energies while we have also shown the existence of CP series for response properties. The essence of CP theory is that the target excitation space is divided into a parent and an auxiliary excitation space. The parent space spans the zeroth-order solution to the perturbation series, while the contributions from the auxiliary space are added perturbatively. An example is the CPS(D) model where the zeroth-order solution is the CCS wave function and its associated properties, and we target the CCSD wave function and its associated properties by deriving the perturbative corrections arising from the doubles’ excitation space and its interaction with the singles’ excitation space. Generally, single reference-based approaches perform excellently for molecular systems that are dominated by a single Slater determinant and investigated at the equilibrium molecular structure. For cases where there is more than one dominating Slater determinant it is crucial to utilise multi-reference methods.^{28–30,99–102}

In Ref. 93, an initial implementation of excitation energies using the CPS(D-3) model was outlined and it is seen that the CPS(D-3) model gives excitation energies of CCSD quality. In the current paper, we extend this implementation of CPS(D-3) excitation energies to exploit massively parallel heterogeneous computer hardware with accelerators, such as GPUs. This paper is structured in the following way. In Sec. II, we present the theoretical aspects of the CPS(D-3) method for excitation energies, and in Sec. III, we outline the developed CPS(D-3) excitation energy algorithm. Following that, we present a scaling analysis and showcase the ability of the algorithm to exploit a large amount of computational resources in Sec. IV. Finally, Sec. V contains some concluding remarks.

## II. CLUSTER PERTURBATION THEORY FOR CCSD EXCITATION ENERGIES

*ω*

_{x}are obtained as eigenvalues of the non-Hermitian CCSD Jacobian. We have illustrated that by using right- and left-eigenvalue equations

*R*_{x}and

*L*_{x}are the corresponding right and left eigenvectors for the excited state

*x*with the eigenvalue or excitation energy

*ω*

_{x}.

*R*_{x}and

*L*_{x}are usually chosen to be orthonormal,

*H*

_{0}is the electronic Hamiltonian of the molecular system and

*T*is the cluster operator. The cluster operator in CCSD theory is given as

*i*denoting excitation level and

*μ*

_{i}being a specific excitation at that level. The cluster amplitudes are determined such that they satisfy the CCSD amplitude equation

*N*is the size of the system. This scaling represents the formal scaling, which, together with the memory requirements, limit the systems that can be treated using CCSD.

### A. The CPS(D-n) model through third order

*f*is the Fock operator and Φ is the fluctuation potential. The fluctuation potential is the perturbation operator in the CPS(D) series. The zeroth-order excitation energy is the CCS excitation energy

^{103}and it can be obtained with a formal scaling of $O(N5)$. Meanwhile, the bottleneck of the CPS(D-3) calculation is the third-order correction to the excitation energy that formally scales $O(N6)$ with system size. Our focus in Sec. III will therefore be limited to the extension of the algorithm developed in CP III for the calculation of

*ω*

^{(3)}to utilize several nodes and GPU acceleration of tensor contractions to enable CPS(D-3) excitation energy calculations for large molecular systems with small time to solution.

## III. IMPLEMENTATION

^{104,105}for two-electron repulsion integrals, three center integral fitting coefficients are constructed from an auxiliary basis set using the Coulomb metric

*ω*

^{(3)}in batches over a virtual index where incremental contributions to

*ω*

^{(3)}from each batch can be determined independently and on-the-fly. Furthermore, the strategy effectively eliminates the storage of fourth-order tensors in main memory as these are decomposed using the RI approximation such that the three index fitting coefficients are the largest quantity to be stored. Even though this approach requires recalculation of, e.g., first-order doubles amplitudes $t2(1)$ for each batch, it enables the batch size to be scaled to fit the available memory as the largest quantity, being the fully virtual three index fitting coefficients obtained from the RI approximation, can be written to disk and read into memory in batches over a virtual index. These features enable a massively parallel algorithm.

1: Calculate transformation matrices, $X\u0304$, $Y\u0304$, $X\u0306$, $Y\u0306$ . |
---|

2: Calculate three center integral fitting coefficients, $BaiP$, $BijP$, $Bi\u0306aP$, $Bia\u0306P$ |

3: Calculate one index transformed Fock matrices $F\u0304ia$, $F\u0304ij\u2032$, $F\u0304ab\u2032$ |

4: Calculate fully virtual three index integral fitting coefficients $BabP$ and $Ba\u0304bP$ |

and distribute in global memory using tiled tensors. |

5: for A (batch over virtual index a) do |

6: Calculate second-order doubles intermediate $XAibjt$ (see Algorithm 2) |

7: Calculate $(iA|\u0306jb)=\u2211P(Bi\u0306APBjbP+BiA\u0306PBjbP+BiAPBj\u0306bP+BiAPBjb\u0306P)$ |

8: Get excitation energy contribution from the current batch: |

$\omega (3)=\omega (3)+\u2211a\u2208A\u2211ibj(2Xaibjt\u2212Xajbit)\epsilon i\u2212\epsilon a+\epsilon j\u2212\epsilon b(P\u0302ijabRaiCCSF\u0304jb\u2212(ia|\u0306jb))$ |

9: End batch loop over A |

10: Deallocate $Bi\u0306aP$ and $Bia\u0306P$ from memory |

11: Calculate three center integral fitting coefficients $Ba\u0304i$, $Bai\u0304$, and $Bi\u0304j$ |

12: for A (batch over virtual index a) do |

13: Calculate second-order doubles intermediate $XAibjR$ (see Algorithm 2) |

14: Calculate $(Ai|\u0304bj)=\u2211P(BA\u0304iPBbjP+BAi\u0304PBbjP+BAiPBb\u0304jP+BAiPBbj\u0304P)$ |

15: Get excitation energy contribution from the current batch: |

$\omega (3)=\omega (3)+\u2211a\u2208A\u2211ibj((2XaibjR\u2212XajbiR)\epsilon i\u2212\epsilon a+\epsilon j\u2212\epsilon b+2RaiCCStbj(2)\u2212RajCCStbi(2))(Ai|\u0304bj)$ |

End batch loop over A |

1: Calculate transformation matrices, $X\u0304$, $Y\u0304$, $X\u0306$, $Y\u0306$ . |
---|

2: Calculate three center integral fitting coefficients, $BaiP$, $BijP$, $Bi\u0306aP$, $Bia\u0306P$ |

3: Calculate one index transformed Fock matrices $F\u0304ia$, $F\u0304ij\u2032$, $F\u0304ab\u2032$ |

4: Calculate fully virtual three index integral fitting coefficients $BabP$ and $Ba\u0304bP$ |

and distribute in global memory using tiled tensors. |

5: for A (batch over virtual index a) do |

6: Calculate second-order doubles intermediate $XAibjt$ (see Algorithm 2) |

7: Calculate $(iA|\u0306jb)=\u2211P(Bi\u0306APBjbP+BiA\u0306PBjbP+BiAPBj\u0306bP+BiAPBjb\u0306P)$ |

8: Get excitation energy contribution from the current batch: |

$\omega (3)=\omega (3)+\u2211a\u2208A\u2211ibj(2Xaibjt\u2212Xajbit)\epsilon i\u2212\epsilon a+\epsilon j\u2212\epsilon b(P\u0302ijabRaiCCSF\u0304jb\u2212(ia|\u0306jb))$ |

9: End batch loop over A |

10: Deallocate $Bi\u0306aP$ and $Bia\u0306P$ from memory |

11: Calculate three center integral fitting coefficients $Ba\u0304i$, $Bai\u0304$, and $Bi\u0304j$ |

12: for A (batch over virtual index a) do |

13: Calculate second-order doubles intermediate $XAibjR$ (see Algorithm 2) |

14: Calculate $(Ai|\u0304bj)=\u2211P(BA\u0304iPBbjP+BAi\u0304PBbjP+BAiPBb\u0304jP+BAiPBbj\u0304P)$ |

15: Get excitation energy contribution from the current batch: |

$\omega (3)=\omega (3)+\u2211a\u2208A\u2211ibj((2XaibjR\u2212XajbiR)\epsilon i\u2212\epsilon a+\epsilon j\u2212\epsilon b+2RaiCCStbj(2)\u2212RajCCStbi(2))(Ai|\u0304bj)$ |

End batch loop over A |

1: Calculate $tAidk(1)=(Ai|dk)/(\epsilon i\u2212\epsilon A+\epsilon k\u2212\epsilon d)$ . |
---|

2: Get $BcAP$ tiled distributed tensor |

3: For D (batch over virtual index) do |

4: Extract $tAiDk(1)$ from $tAidk(1)$ |

5: Get $BbDP$ tiled distributed tensor |

6: $(bD|kj)=\u2211PBbDPBkjP$ |

7: $XAibjt=XAibjt\u2212\u2211DktAiDk(1)(bD|kj)$ |

8: $XAibjt=XAibjt\u2212\u2211DktAkDj(1)(bD|ki)$ |

9: $(cA|bD)=\u2211PBcAPBbDP$ |

10: Calculate $tDjci(1)$ |

11: $XAibjt=XAibjt+12\u2211DctDjci(1)(cA|bD)$ |

12: End batching over D |

13: For L (batch over occupied index) do |

14: Extract $tAkbL(1)$ from $tAkbl(1)$ |

15: $(ik|jL)=\u2211PBikPBjLP$ |

16: $XAibjt=XAibjt+12\u2211kLtAkbL(1)(ik|jL)$ |

17: End batching over L |

18: $YAiP=\u2211ck(2tAick(1)\u2212tAkci(1))BckP$ |

19: $XAibjt=XAibjt+\u2211PYAiPBbjP$ |

1: Calculate $tAidk(1)=(Ai|dk)/(\epsilon i\u2212\epsilon A+\epsilon k\u2212\epsilon d)$ . |
---|

2: Get $BcAP$ tiled distributed tensor |

3: For D (batch over virtual index) do |

4: Extract $tAiDk(1)$ from $tAidk(1)$ |

5: Get $BbDP$ tiled distributed tensor |

6: $(bD|kj)=\u2211PBbDPBkjP$ |

7: $XAibjt=XAibjt\u2212\u2211DktAiDk(1)(bD|kj)$ |

8: $XAibjt=XAibjt\u2212\u2211DktAkDj(1)(bD|ki)$ |

9: $(cA|bD)=\u2211PBcAPBbDP$ |

10: Calculate $tDjci(1)$ |

11: $XAibjt=XAibjt+12\u2211DctDjci(1)(cA|bD)$ |

12: End batching over D |

13: For L (batch over occupied index) do |

14: Extract $tAkbL(1)$ from $tAkbl(1)$ |

15: $(ik|jL)=\u2211PBikPBjLP$ |

16: $XAibjt=XAibjt+12\u2211kLtAkbL(1)(ik|jL)$ |

17: End batching over L |

18: $YAiP=\u2211ck(2tAick(1)\u2212tAkci(1))BckP$ |

19: $XAibjt=XAibjt+\u2211PYAiPBbjP$ |

Using a master-worker model, Algorithm 1 is parallelized for distributed memory architectures using the message passing interface (MPI). The master rank initializes the *ω*^{(3)} calculation by determining the quantities in lines 1–4 and broadcasts these to all other ranks. In line 5, the master rank distributes the calculation of the first contributions to *ω*^{(3)} in batches over the virtual index *a* to the other ranks that execute the calculation of the contribution to *ω*^{(3)} from a specific batch and return the incremental contribution to *ω*^{(3)} to the master rank that sums these to get the total *ω*^{(3)} correction. To avoid reading the fully virtual three index integral fitting coefficients from disk as done in Ref. 93, these are distributed in the collective memory of all nodes in line 4 using tiled distributed tensors through the Scalable Tensor Library (ScaTeLib).^{106} Once all worker ranks finish the calculation of the first contribution to *ω*^{(3)}, all ranks discard the no longer needed quantities and the master node determines a new set of three center integral fitting coefficients in line 11 as these are needed for the second contribution to *ω*^{(3)}. In line 12, the master rank distributes the calculation of the second contribution to *ω*^{(3)} in batches over the virtual index *a* in complete analogy to the first batch loop, yielding the full *ω*^{(3)} correction.

Algorithm 1 thus enables the calculation of CPS(D-3) excitation energies on system sizes that are beyond the reach of a conventional CCSD calculation, as the model can be effectively parallelized via the on-the-fly calculation of contributions from distributed batches. To that end, the time to solution for a calculation can be decreased by supplying more computational resources until the point where each rank effectively only does a calculation for a single virtual index, i.e., when the number of ranks equals the number of virtual orbitals. At this point, further batching could be introduced inside each virtual batch, but it would reduce compute density; thus, this was not an option that we have pursued in our current implementation.

To further accelerate the CPS(D-3) excitation energy calculation, all expensive tensor contractions in each batch are offloaded to GPUs using the Tensor Algebra Library for Shared Memory Computers (TALSH)^{107} locally on each rank as well as Open Multiprocessing (OMP) to utilize shared memory parallelism locally for each rank. Using the same philosophy, we also parallelized the calculation of $\mu 1[\Phi ,T2(1)]HF$, which is required in the calculation of *ω*^{(2)} and $t1(2)$ as the $O(N5)$ scaling of this term proved to be a bottleneck when moving to larger systems.

Consequently, we have outlined and developed a GPU-accelerated hybrid MPI/OMP parallel implementation of CPS(D-3) excitation energies that should scale very efficiently with increasing system size and amounts of computational resources allowing calculations on very large molecular systems, which we will display in Sec. IV.

## IV. NUMERICAL RESULTS

In the following, we present numerical test calculations to display the performance and scalability of the developed implementation. All calculations utilized the RI-approximation for two-electron integrals employing the RI-basis set corresponding to the utilized (aug-)cc-pVXZ basis set,^{108–110} e.g., calculations using aug-cc-pVTZ will use the aug-cc-pVTZ-RI. Furthermore, all calculations correlate all electrons. Timings presented in this work are obtained from calculations performed using one or multiple IBM Power System AC922 nodes (two IBM POWER9 processors, six NVIDIA Tesla V100 accelerators, and 512 GB of DDR4 memory) on Summit at the Oak Ridge Leadership Computing Facility. Timings only relate to the calculation of *ω*^{(3)} and do not include the time spent in the preceding Hartree–Fock, CCS, and CIS(D) calculations, as these do not represent a bottleneck in the calculations. All calculations were run using a local development version of the LSDalton program.^{111}

### A. Scaling analysis

To test the scalability of the developed CPS(D-3) algorithm, we perform CPS(D-3)/aug-cc-pVTZ excitation energy calculations on a series of fatty acids CH_{3}(CH_{2})_{n}COOH and CPS(D-3)/aug-cc-pVDZ excitation energy calculations on alanine oligomer chains ranging from 8 atoms to 53 atoms and from 276 basis functions, 380 auxiliary basis functions to 1196 basis functions, and 1560 auxiliary basis functions using varying node counts.

The success of the developed algorithm is critically dependent on leveraging the computational resources efficiently to achieve near-ideal speedup when multiple nodes are used. In Fig. 1, we, therefore, display the strong scaling of the CPS(D-3) algorithm in the form of wall times and speedup for different components of the CPS(D-3) algorithm as well as the total CPS(D-3) in the calculation of the first excitation energy of caproic acid [CH_{3}(CH_{2})_{4}COOH, 828 basis functions, 1088 auxiliary basis functions] using nodes between 1 and 32.

_{3}(CH

_{2})

_{n}COOH with

*n*going from 1 to 6 in aug-cc-pVTZ basis using

*n*nodes and on alanine oligomer chains with between 1 and 5 alanine units in aug-cc-pVDZ basis using a node count corresponding to the number of units. In Figs. 2 and 3, we present the wall time and computational cost of the outlined calculations on fatty acids and alanine oligomer chains, where the computational cost is defined as

From Figs. 2 and 3, it is evident that the empirically observed computational scaling of *N*^{3.5} and *N*^{4} for the calculation of the third-order excitation energy correction is significantly lower than the expected *N*^{6}. This most likely stems from two factors. First, the calculations involve non-negligible parts that formally scales only as *N*^{4} and *N*^{5}, such as the initialization of RI-integrals and the construction of the RHS for Eq. (29), which shows *N*^{3.8} and *N*^{4.2} scaling in the two cases, the iterative solution of Eq. (29), which shows *N*^{3} and *N*^{3.6} in the two cases, and many of the smaller tensor contractions. Consequently, it is only parts of third-order excitation correction that scale as *N*^{6}, which means that the observed overall scaling appears lower. Second, algorithms, such as the one developed here, are often more efficient than the formal scaling prescribes as they never reach the asymptotic limit and their efficiency increases with system size, e.g., due to the exploitation of sparsity in libraries, such as cuBLAS.

In both Figs. 2 and 3, we thus find that our algorithm efficiently leverages computational resources with increasing system sizes and that the scalability of the developed implementation enables routine calculations of CPS(D-3) excitation energies of system sizes over 1000 basis functions. As mentioned previously, the current implementation only employs batching over a single virtual index that limits the scalability with respect to computational resources. Providing a larger number of MPI ranks than the number of virtual orbitals in the system will leave some ranks redundant, yet, this problem can be alleviated by introducing further batching.

### B. Large scale calculations: CPS(D-3) excitation energies of azaoxahelicenes

To display the potential usage of the developed algorithm in large-scale excitation energy calculations, we perform CPS(D-3) excitation energy calculations on an azaoxahelicene that has been investigated both experimentally and computationally in previous studies^{114,115} due to its interesting optical properties. Helicenes have previously been researched heavily as chiral antennas in optoelectronic devices in molecular electronics and as sources of chiral emission due to their strong circularly polarized emission.^{116–119} It is thus crucial to be able to determine accurate excitation energies in order to scrutinize the potential applicability of helicenes and derivatives thereof in said applications from *in silico* investigations to computationally model and predict the optical properties of helicenes. We do note that triples and even quadruples excitations are in many cases necessary to obtain results that are in complete alignment with experiments, yet, currently such methods are in general out of scope for systems of this size and the CPS(D-3) model is a good option in this case as it neither suffers from the empirical approximations, such as those in, e.g., DFT or the size limitations of CCSD implementations.

We perform CPS(D-3) excitation energy calculations of the 8 lowest singlet excited states on B3LYP-D3/6-31+G(d) structure of the [9]-azaoxahelicene (**MeO[9]OMe**) displayed in Fig. 4 using the cc-pVDZ basis set with all electrons correlated. We note that this basis set is probably not sufficient for producing results that are directly comparable to experiments, yet, our current goal is simply to showcase the performance of the developed CPS(D-3) implementation in a HPC setting and that calculations, such as the ones performed here, can be carried out routinely. In the CPS(D-3) calculations, we utilize 128 nodes with 6 MPI ranks on each node, as this is where the batching over virtual indices has one batch per rank. This means that the calculations utilize 5292 cores, 756 GPUs, and 387 TB of memory in total.

The resulting CCS, CIS(D), and CPS(D-3) excitation energies are given in Table I along with relevant information from the calculations. It is evident that the accumulated corrections to the CCS excitation energies are quite substantial as they range from around 0.3–0.9 eV and that all corrections lead to a lower total excitation energy. Furthermore, we observe that the second-order correction in CIS(D) is much larger than the third-order correction that is added going from CIS(D) to CPS(D-3), which is perfectly in line with previous CP studies on excitation energies in Refs. 92 and 93. It is also seen that CCS does not order the excitation energies in the same way as CIS(D) and CPS(D-3) with, e.g., the second CCS root becoming the lowest excitation root in CIS(D) and CPS(D-3). These tendencies illustrate that CCS is not adequate for determining accurate excitation energies, as the effects of doubles excitations are crucial when determining the excited state properties.

Root . | CCS (eV) . | CIS(D) (eV) . | CPS(D-3) (eV) . |
---|---|---|---|

1 | 4.064 682 | 3.758 556 | 3.797 664 7 |

2 | 4.378 592 | 3.637 146 | 3.731 542 0 |

3 | 4.804 915 | 4.190 844 | 4.247 765 5 |

4 | 4.913 226 | 4.084 771 | 4.144 026 5 |

5 | 5.105 541 | 4.302 162 | 4.327 723 0 |

6 | 5.178 502 | 4.360 824 | 4.337 567 8 |

7 | 5.498 913 | 4.738 371 | 4.829 778 2 |

8 | 5.871 168 | 5.312 514 | 5.383 732 3 |

Number of contracted basis functions | 915 | ||

Number of primitive basis functions | 1601 | ||

Number of auxiliary basis functions | 3402 | ||

Average CPS(D-3) wall time per root | 147.0 min | ||

Number of nodes | 128 |

Root . | CCS (eV) . | CIS(D) (eV) . | CPS(D-3) (eV) . |
---|---|---|---|

1 | 4.064 682 | 3.758 556 | 3.797 664 7 |

2 | 4.378 592 | 3.637 146 | 3.731 542 0 |

3 | 4.804 915 | 4.190 844 | 4.247 765 5 |

4 | 4.913 226 | 4.084 771 | 4.144 026 5 |

5 | 5.105 541 | 4.302 162 | 4.327 723 0 |

6 | 5.178 502 | 4.360 824 | 4.337 567 8 |

7 | 5.498 913 | 4.738 371 | 4.829 778 2 |

8 | 5.871 168 | 5.312 514 | 5.383 732 3 |

Number of contracted basis functions | 915 | ||

Number of primitive basis functions | 1601 | ||

Number of auxiliary basis functions | 3402 | ||

Average CPS(D-3) wall time per root | 147.0 min | ||

Number of nodes | 128 |

With regard to timings, the average wall time required to determine the CPS(D-3) correction to a specific root is 147 min using 128 nodes. This illustrates that, with the advent of large-scale HPC facilities CPS(D-3), excitation energy calculations can be carried out routinely for larger molecules with a low time to solution. The low wall times demonstrate that the readily parallelizable CPS(D) framework and our implementation efficiently leverage large amount of computational resources and enable large scale CPS(D-3) excitation energy calculations. By introducing more batching, we could run the calculations using an even larger amount of computational resources and a more specialized GPU code would also provide higher efficiency and enable even larger calculations. However, our current implementation demonstrates the potential for developing massively parallel CPS(D) codes that enable routine calculations on large molecular systems that are not possible using conventional CCSD.

## V. CONCLUSIONS

Using the readily parallelizable framework of the CPS(D) methodology, we have developed a large-scale implementation of CPS(D-3) excitation energies. The introduction of hybrid MPI/OMP parallelization and GPU acceleration in the previously outlined CPS(D-3) excitation energy implementation of CP III enables the efficient exploitation of modern high performance computing infrastructure. By batching the calculation of the third-order excitation energy correction *ω*^{(3)} over a virtual molecular orbital index, the calculation is readily divided into a subset of incremental excitation energy correction evaluations. The fact that each incremental contribution to *ω*^{(3)} can be evaluated individually provides an ideal domain decomposition that makes the CPS(D-3) model excellently suited for parallelization over large-scale modern heterogeneous computing architectures.

Our hybrid MPI/OMP parallelized and GPU accelerated implementation readily scales to a very large number of compute nodes and achieves high parallel efficiency provided that the size of the system under study is large enough. As evidenced by our numerical illustrations, CPS(D-3) excitation energies can be determined for systems well beyond 1000 basis functions in only a few hours. Furthermore, the empirically observed scaling is only $O(N4)$ for the systems studied here as opposed to the formal $O(N6)$ scaling. Coupling these results with the comparison of CPS(D-3) excitation energies to CCSD and CC3 excitation energies in Ref. 93, it is thus evident that our current work makes excitation energies of CCSD quality readily available for system sizes of several hundred atoms using several thousand basis functions. This establishes CP theory as a viable alternative, e.g., to DFT methods, which is currently the go-to choice for quantum chemical calculations on large molecular systems.

In our current investigation, we have only considered excitation energies, yet, CP theory also comprehends the calculation of ground state energies, first-order properties, and higher-order properties through response functions. All of these can be formulated in a completely analogous way with batching over one or several molecular orbital indices such that incremental perturbation corrections can be evaluated individually on the fly. This allows massively parallel implementations, such as the one presented here. It is thus our aim to establish similar implementations of the CPS(D) series for these properties in the future to allow large scale CPS(D) calculations of various molecular properties. These efforts should make CPS(D) calculations routine for systems containing 100–200 atoms and thus make results of CCSD quality readily available.

On a different note, it would also be interesting to apply tensor decomposition and tensor compression to CP theory as has been done with great success for Møller–Plesset perturbation theory and standard CC theory in the past few years.^{46–49} Not only have these efforts shown that you can reduce the formal scaling of Møller–Plesset perturbation theory and CC methods, it is also possible to compress the doubles amplitudes such that only a small fraction of the doubles amplitudes are considered with only minor implications on the results.^{50–54} In future studies, we aim to apply these techniques in CP theory to achieve reduced scaling and reduced storage requirements, and, ultimately, enable calculations on much larger systems.

## ACKNOWLEDGMENTS

This research used resources from the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract No. DEAC05-00OR22725. A.E.H.B. and K.V.M. acknowledge the Danish Council for Independent Research (Grant No. DFF-0136-00081B) and the European Union’s Horizon 2020 Framework Program under Grant Agreement No. 951801 for financial support.

Notice: This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a non-exclusive, paid up, irrevocable, world-wide license to publish or reproduce the published form of the manuscript, or allow others to do so, for U.S. Government purposes. The DOE will provide public access to these results in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Andreas Erbs Hillers-Bendtsen**: Data curation (equal); Formal analysis (equal); Investigation (equal); Validation (equal). **Dmytro Bykov**: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Supervision (equal). **Ashleigh Barnes**: Investigation (equal). **Dmitry Liakh**: Investigation (equal). **Hector H. Corzo**: Investigation (equal); Validation (equal). **Jeppe Olsen**: Formal analysis (equal); Methodology (equal). **Poul Jørgensen**: Formal analysis (equal); Investigation (equal). **Kurt V. Mikkelsen**: Conceptualization (equal); Investigation (equal); Project administration (equal); Supervision (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX A: EXPLICIT EXPRESSIONS FOR THE AMPLITUDES, THE JACOBIAN, AND EXCITATION VECTORS

### APPENDIX B: WORKING EQUATIONS IN SINGLET BIORTHONORMAL BASIS

## REFERENCES

*Photochemistry—A Modern Theoretical Perspective*

*Photochemistry and Photophysics Concepts Research Applications*

*Molecular Electronic-Structure Theory*