A novel implementation of the coupled-cluster singles and doubles (CCSD) approach is presented that is specifically tailored for the treatment of large symmetric systems. It fully exploits Abelian point-group symmetry and the use of the Cholesky decomposition of the two-electron repulsion integrals. In accordance with modern CCSD algorithms, we propose two alternative strategies for the computation of the so-called particle–particle ladder term. The code is driven toward the optimal choice depending on the available hardware resources. As a large-scale application, we computed the frozen-core correlation energy of buckminsterfullerene (C_{60}) with a polarized valence triple-zeta basis set (240 correlated electrons in 1740 orbitals).

The accurate and reliable determination of the energy and properties of a molecular system is a pivotal aspect of quantum chemistry. In this context, coupled-cluster (CC) methods^{1–3} are an optimal choice due to their remarkable accuracy and size-extensivity. For many applications, CC with singles and doubles excitations (CCSD)^{4} offers a good compromise between accuracy and computational cost. Nevertheless, its routine use is restricted to small- to medium-sized systems: the computational cost scales with the sixth power of the number of basis functions, and the storage requirements are high because two-electron integrals involving three and four virtual orbitals in principle need to be stored.

To overcome such computational limitations, various strategies have been developed and are often combined to obtain the most efficient implementation. A first class of strategies aims at reducing the overall scaling of the calculation by exploiting the local nature of electron correlation.^{5–12} Other low-scaling solutions are given by methods that compress the double-excitation amplitudes^{13} as, for example, in the tensor hypercontraction (THC) scheme,^{14–16} or by real-space methods.^{17} Using such techniques, impressive large-scale calculations can be done.^{18} However, reduced-scaling methods come with various difficulties, which include the use of cutoffs to enforce the locality of correlation, complicated expressions for analytical derivatives,^{19–23} and overall accuracy control that requires some user experience. A different family of strategies does not reduce the formal scaling of the method, but allows for very efficient implementations that can be vectorized, parallelized, and formulated mostly in terms of highly optimized matrix–matrix multiplications. This is achieved by introducing a low-rank approximation of the two-electron repulsion integrals (ERI), such as density fitting (DF)^{24–28} and Cholesky decomposition (CD).^{29–37} Such techniques can be used to reduce the prefactor in the overall method complexity, which remains $O(N^6)$, and can drastically lower integral storage and bandwidth requirements while improving the parallelization potential.

In parallel with these developments, new CC implementations also focus on deploying efficient computational realizations that leverage the full potential of modern computer architectures and utilize parallel computing. Recently, various works that combine massively parallel implementation strategies, based on the joint use of the message-passing interface (MPI) and OpenMP, with or without DF/CD, have demonstrated impressive calculations on large systems without exploiting local correlation.^{38–49} To the best of our knowledge, the largest CC calculation, performed at the CCSD(T)^{50} level, was done with as many as 1624 orbitals (without the use of point-group symmetry) using 288 cores.^{49}

In this work, we present an implementation of CCSD that exploits the CD of the ERI matrix. The main element of novelty in our implementation is that it fully exploits Abelian point-group symmetry. The implementation is highly vectorized, as it mostly relies on efficient level 3 BLAS routines to perform the bulk of the floating-point operations, and is parallelized using OpenMP. It is therefore well suited to treat, on accessible computational resources, medium-sized molecular systems, and in particular large symmetric molecules. We choose CD instead of DF because, though DF affords a more compact representation of the ERIs, its accuracy depends on the fitted auxiliary basis set that is used for the calculation and cannot be determined *a priori* nor improved in a systematic way. On the contrary, the accuracy of the CD representation is rigorously controlled by the threshold used to truncate the decomposition, which is the only user-defined parameter that controls the calculation. Our implementation has been realized in a development version of the cfour suite of programs.^{51,52}
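The rigorous, threshold-controlled accuracy of the decomposition can be illustrated with a minimal pivoted Cholesky routine. The following Python/NumPy sketch is generic and illustrative (the function name and the random test matrix are our own, not part of the CFOUR implementation): the decomposition of a positive semidefinite "ERI-like" matrix is truncated as soon as the largest residual diagonal element drops below the threshold, which then bounds the error of the representation.

```python
import numpy as np

def cholesky_decompose(V, thresh=1e-4):
    """Incomplete pivoted Cholesky decomposition of a positive
    semidefinite matrix V, truncated when the largest remaining
    diagonal element drops below `thresh`. Returns L with V ~ L @ L.T."""
    d = np.diag(V).astype(float).copy()   # residual diagonal
    vecs = []
    while d.max() > thresh:
        p = int(np.argmax(d))             # pivot: largest residual diagonal
        col = V[:, p].astype(float).copy()
        for v in vecs:                    # subtract previously extracted vectors
            col -= v[p] * v
        v = col / np.sqrt(d[p])
        vecs.append(v)
        d -= v * v                        # update residual diagonal
    return np.array(vecs).T               # shape (n, rank)

# The decomposition error is controlled by the single threshold parameter:
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 30))
V = A @ A.T                               # a PSD matrix standing in for the ERIs
L = cholesky_decompose(V, thresh=1e-6)
assert np.max(np.abs(V - L @ L.T)) < 1e-5
```

The only user-defined parameter is `thresh`; tightening it systematically improves the representation, which is the property the text contrasts with DF.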

In the following, *μ*, *ν*, *σ*, … refer to atomic orbitals; *p*, *q*, *r*, … to generic molecular orbitals; *i*, *j*, *k*, … to occupied orbitals; *a*, *b*, *c*, … to virtual orbitals; and *P*, *Q*, … to the vectors of the Cholesky basis. The CCSD correlation energy for closed-shell systems is given by

$$E_{\mathrm{corr}} = 2\sum_{ia} f_{ai}\, t_i^a + \sum_{ijab} \langle ij|ab\rangle\, \tilde{\tau}_{ij}^{ab},$$

with the amplitudes determined by the projected equations $\langle\Phi_i^a|\bar{H}|0\rangle = 0$ and $\langle\Phi_{ij}^{ab}|\bar{H}|0\rangle = 0$. Here, $f_{ai}$ is the *ai* element of the Fock matrix; $\tilde{\tau}_{ij}^{ab} = 2\tau_{ij}^{ab} - \tau_{ji}^{ab}$ is the spin-adapted tau amplitude, the usual combination of double-excitation ($t_{ij}^{ab}$) and single-excitation ($t_i^a$) amplitudes, which in the closed-shell case assumes the form $\tau_{ij}^{ab} = t_{ij}^{ab} + t_i^a t_j^b$; ⟨*ij*|*ab*⟩ is a two-electron integral, written using the Dirac convention; |0⟩ is the reference Hartree–Fock (HF) determinant; $|\Phi_i^a\rangle$ and $|\Phi_{ij}^{ab}\rangle$ are singly and doubly excited determinants, respectively; and $\bar{H} = e^{-T} H e^{T}$ is the similarity-transformed Hamiltonian. The algebraic expressions can be found elsewhere.^{53} In particular, we have implemented the spin-restricted closed-shell equations, formulated using spin-adapted forms of the $F$ and $W$ intermediates first introduced in the spin-orbital basis by Stanton, Gauss, and co-workers.^{54,55}

Within the CD approximation, the ERIs are expressed as

$$(\mu\nu|\rho\sigma) \approx \sum_P L_{\mu\nu}^P L_{\rho\sigma}^P,$$

where $L_{\mu\nu}^P$ is the *μν* element of the *P*th Cholesky vector (CV). In order to account for Abelian point-group symmetry, the CD is done using symmetry-adapted linear combinations of atomic orbitals. The resulting CVs are classified by means of the irreducible representation to which the index *P* belongs (Γ_{P}). For the totally symmetric vectors, we store only the lower triangular block, while for the others we store only the sub-blocks for which Γ_{μ} ≤ Γ_{ρ} = Γ_{P} ⊗ Γ_{μ}. The CVs are transformed to the molecular orbital basis at the beginning of the CC calculation. Furthermore, in our implementation, we always assemble and keep in core the integrals with at most two virtual indices, such as ⟨*ab*|*ij*⟩ and ⟨*ij*|*kl*⟩.

The exploitation of point-group symmetry reduces the operation count, in the most favorable case, by a factor of *h*^{2}, *h* being the order of the point group. A generic contraction between two four-index arrays is performed using the direct-product decomposition (DPD) approach,^{54} which means that the contraction is subdivided into *h* distinct operations. The rate-determining step in a CCSD code is given by the so-called particle–particle ladder (PPL) term, which we evaluate using the symmetric/antisymmetric decomposition strategy,^{53,56} also successfully exploited in other recent CD/DF-CC implementations.^{57–59}
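The idea behind the DPD splitting can be illustrated for the simplest case of a totally symmetric two-index quantity, which is block diagonal over the irreducible representations. The NumPy sketch below is a toy model (the block sizes per irrep are hypothetical): the contraction splits into *h* independent, much smaller multiplications that reproduce the single large *C*_{1} multiplication, and the flop ratio approaches *h*^{2} for equally sized blocks.

```python
import numpy as np

def block_diag(blocks):
    """Assemble a block-diagonal matrix from a list of square blocks."""
    n = sum(b.shape[0] for b in blocks)
    M = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        M[i:i + k, i:i + k] = b
        i += k
    return M

h = 4
sizes = [3, 2, 2, 1]                      # hypothetical orbitals per irrep
rng = np.random.default_rng(1)
A_blocks = [rng.standard_normal((n, n)) for n in sizes]
B_blocks = [rng.standard_normal((n, n)) for n in sizes]

# h small multiplications instead of one large one...
C_blocks = [a @ b for a, b in zip(A_blocks, B_blocks)]

# ...reproduce the single large multiplication of the C1 treatment:
assert np.allclose(block_diag(C_blocks),
                   block_diag(A_blocks) @ block_diag(B_blocks))

# Theoretical reduction factor: flops without symmetry / flops with symmetry.
frs = sum(sizes) ** 3 / sum(n ** 3 for n in sizes)   # -> h**2 for equal blocks
assert frs > 1.0
```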

Here, *O* and *V* denote the number of occupied and virtual orbitals, respectively. In particular, $Z_{ij}^{ab}$ is computed as the sum of a totally symmetric (*S*) and an antisymmetric (*A*) contribution. The variant in which both Cholesky vectors are dressed with the single-excitation amplitudes [Eq. (10)] requires an additional term that takes *O*^{4}*V*^{2} operations and *O*^{2}*V*^{2} words of memory for its computation. The price of computing such a term is compensated by not having to take single-excitation amplitudes into account in the evaluation of the PPL term, as is done in Eq. (6). This approach can thus be seen as a simple tool to slightly reduce the operation count and memory demand of the PPL contraction subroutine when the T_{1}-transformed Hamiltonian is not used.^{60} We note that the quantity $L_{ae}^P - t_{ae}^P$ is not symmetric with respect to the interchange of the indices *a* and *e*.
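The cancellation that underlies the S/A strategy can be checked with a small NumPy sketch. The dimensions and the explicit 1/2 factors below are illustrative (in the production code each half is contracted only over *e* ≤ *f*, which is where the savings come from); the point is that the cross terms between symmetric and antisymmetric parts vanish in the sum over *e*, *f*.

```python
import numpy as np

# Z[i,j,a,b] = sum_{e,f} tau[i,j,e,f] * W[e,f,a,b], split into S/A parts.
rng = np.random.default_rng(2)
o, v = 3, 5
t2 = rng.standard_normal((o, o, v, v))
tau = t2 + t2.transpose(1, 0, 3, 2)      # enforce tau[i,j,e,f] = tau[j,i,f,e]
W = rng.standard_normal((v, v, v, v))

Z = np.einsum('ijef,efab->ijab', tau, W)

tau_p = tau + tau.transpose(1, 0, 2, 3)  # symmetric in (e,f)
tau_m = tau - tau.transpose(1, 0, 2, 3)  # antisymmetric in (e,f)
W_p = 0.5 * (W + W.transpose(1, 0, 2, 3))
W_m = 0.5 * (W - W.transpose(1, 0, 2, 3))

S = np.einsum('ijef,efab->ijab', tau_p, W_p)
A = np.einsum('ijef,efab->ijab', tau_m, W_m)

# Symmetric x antisymmetric cross terms cancel in the sum over e,f:
assert np.allclose(Z, 0.5 * (S + A))
```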

An efficient implementation of the PPL term should take memory constraints into account, and it should distribute the right amount of data to the CPU in order to reach its peak performance. These two considerations in general require the adoption of different strategies that are often in conflict with each other. As a cornerstone, we avoid assembling, and thus storing, *V*^{4} and *V*^{3}*O* arrays; for this reason, only batches of integrals are assembled and contracted on-the-fly. In particular, we implemented two different strategies for the PPL term. In the first version, which is outlined in Algorithm 1, we fix the index *a* and distribute the operations over shared-memory threads using the OpenMP paradigm; therefore, the memory required to store temporary arrays is *V*^{3}*N*_{threads}. Here, we used for $W$ the expression reported in Eq. (6). This algorithm has the advantage of computing the PPL contraction with few, large matrix–matrix multiplications that are efficiently performed using the level 3 BLAS DGEMM routine, but as the number of threads increases, it can become too demanding in terms of memory. We therefore implemented as a second option a different algorithm, shown in Algorithm 2, where we fix both the *a* and *b* indices and again distribute the operations over OpenMP threads. In this situation, the memory required to store the intermediates scales as *V*^{2}*N*_{threads}. This time, Eq. (10) was used.

**Algorithm 1.** PPL contraction, “loop a” version.

1: Form $\tau_{ij}^{ab\pm}, \forall$ irr
2: for irr2 = 1, n_irrep do ⊳ OpenMP parallelized
3: &nbsp;&nbsp;for a = 1, n_vir(irr2) do ⊳ OpenMP parallelized
4: &nbsp;&nbsp;&nbsp;&nbsp;$I_{ebf}^{a} = \sum_P (L_{ae}^P - t_{ae}^P) L_{bf}^P$
5: &nbsp;&nbsp;&nbsp;&nbsp;$J_{fe}^{ba} = I_{ebf}^{a} - \sum_m t_m^b \sum_P L_{mf}^P L_{ae}^P$
6: &nbsp;&nbsp;&nbsp;&nbsp;$W_{fe}^{ba\pm} = \frac{1}{2}(J_{fe}^{ba} \pm J_{ef}^{ba})$
7: &nbsp;&nbsp;&nbsp;&nbsp;Form $S_{ij}^{ab}$ and $A_{ij}^{ab}$
8: &nbsp;&nbsp;&nbsp;&nbsp;Distribute in $Z_{ij}^{ab}$
9: &nbsp;&nbsp;end for
10: end for


**Algorithm 2.** PPL contraction, “loop ab” version.

1: Form $\tau_{ij}^{ab\pm}, \forall$ irr
2: for irr2 = 1, n_irrep do ⊳ OpenMP parallelized
3: &nbsp;&nbsp;for a = 1, n_vir(irr2) do ⊳ OpenMP parallelized
4: &nbsp;&nbsp;&nbsp;&nbsp;for irr = 1, n_irrep do ⊳ OpenMP parallelized
5: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;irr1 = irr × irr2
6: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (irr1 < irr2) cycle
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b_end = n_vir(irr1)
8: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (irr == 1) b_end = a
9: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for b = 1, b_end do ⊳ OpenMP parallelized
10: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$I_{ef}^{ab} = \sum_P (L_{ae}^P - t_{ae}^P)(L_{bf}^P - t_{bf}^P)$
11: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$W_{ef}^{ab\pm} = \frac{1}{2}(I_{ef}^{ab} \pm I_{fe}^{ab})$
12: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Form $S_{ij}^{ab}$ and $A_{ij}^{ab}$
13: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Distribute in $Z_{ij}^{ab}$
14: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;end for
15: &nbsp;&nbsp;&nbsp;&nbsp;end for
16: &nbsp;&nbsp;end for
17: end for

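The per-pair work of the “loop ab” strategy can be sketched in serial NumPy, without point-group symmetry and with toy dimensions (all sizes below are illustrative): for each fixed pair (*a*, *b*) with *b* ≤ *a*, only a *V* × *V* intermediate is formed from the dressed Cholesky vectors, split into its symmetric/antisymmetric parts, contracted, and the result is distributed to the (*b*, *a*) block by transposition in the occupied indices.

```python
import numpy as np

rng = np.random.default_rng(3)
o, v, ncv = 2, 4, 12
L = rng.standard_normal((ncv, v, v))     # L[P, a, e], virtual-virtual block
t1 = rng.standard_normal((v, v))         # stand-in for the t1 dressing
t2 = rng.standard_normal((o, o, v, v))
tau = t2 + t2.transpose(1, 0, 3, 2)      # tau[i,j,e,f] = tau[j,i,f,e]

Ld = L - t1                              # dressed vectors (L - t), broadcast over P
tau_p = tau + tau.transpose(1, 0, 2, 3)  # symmetric combination
tau_m = tau - tau.transpose(1, 0, 2, 3)  # antisymmetric combination

Z = np.zeros((o, o, v, v))
for a in range(v):
    for b in range(a + 1):               # loop over pairs with b <= a
        I = np.einsum('pe,pf->ef', Ld[:, a, :], Ld[:, b, :])  # V x V only
        Wp = 0.5 * (I + I.T)
        Wm = 0.5 * (I - I.T)
        Zab = 0.5 * (np.einsum('ijef,ef->ij', tau_p, Wp)
                     + np.einsum('ijef,ef->ij', tau_m, Wm))
        Z[:, :, a, b] = Zab
        if a != b:
            Z[:, :, b, a] = Zab.T        # distribute to the (b, a) block

# Reference: the full contraction without the pair restriction.
Zref = np.einsum('ijef,pae,pbf->ijab', tau, Ld, Ld)
assert np.allclose(Z, Zref)
```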

To test the two PPL algorithms, we ran calculations on coronene (C_{24}H_{12}) enforcing *D*_{2h} symmetry using Dunning’s *cc*-pVTZ basis set^{61} and the frozen-core (fc) approximation, that is, 888 orbitals of which 24 are kept frozen. The geometry is given as the supplementary material in the Zenodo repository.^{62} We note in passing that for non-symmetric systems our code performs similarly to the recent implementation presented by Schnack-Petersen *et al.*;^{59} here, we focus on the performance of the implementation for symmetric molecules. All calculations were run on a single Intel Xeon Gold 6140M node equipped with 1.1 TB of RAM. The results are summarized in Table I. For both algorithms, we ran the calculation using 32 OpenMP threads. As expected, the peak memory required is considerably reduced for the “loop ab” algorithm; in particular, it coincides with the memory required by the PPL subroutine. However, a single iteration is approximately a factor of two slower than with the first algorithm. Both algorithms show analogous speedup behavior, as shown in Fig. 1, with the “loop a” algorithm performing slightly better for smaller numbers of threads but hitting a plateau sooner than the “loop ab” algorithm. This behavior is expected, as both algorithms execute in parallel a large number of DGEMM matrix–matrix multiplications, which makes the code memory-bound. This in turn causes cache misses and other cache-related issues, leaving the cores idle while they starve for data to process. The “loop ab” algorithm scales slightly better because the matrix–matrix multiplications performed within the parallel loop are smaller, which reduces the memory bottleneck. Nevertheless, a satisfactory speedup factor of 8–10 can be achieved. Different parallelization strategies, as well as the deployment of the code on different architectures, will be investigated in the future.

| Algorithm | Memory (GB) | Time per iteration (min) |
|---|---|---|
| Loop a | 272 | 3.5 |
| Loop ab | 47 | 6 |


The benefit obtained by exploiting point-group symmetry can be assessed by comparing the theoretical factor of reduction due to symmetry (FRS) with the achieved FRS. The theoretical FRS is computed as the ratio between the total number of floating-point operations required for a specific contraction term in the CC equations without and with symmetry. The achieved FRS is given by the ratio between the elapsed CPU times required for a specific contraction performed without and with enforcing Abelian symmetry. The results are summarized in Table II. As test systems and basis sets, we used coronene with (fc)*cc*-pVTZ, buckminsterfullerene (C_{60}) with (fc)*cc*-pVDZ, and a cobalt *β*-diketiminato oxo complex [(nacnac)Co^{III}O] with *cc*-pVDZ, whose geometry was taken from Ref. 63. The three molecular structures are shown in Fig. 2. We considered only the most expensive contractions, namely, the ones that exhibit a scaling of *O*^{3}*V*^{3} and *O*^{2}*V*^{4} (PPL). We note that the achieved FRS are systematically larger for the *O*^{3}*V*^{3} terms than for the PPL contraction. This can be explained by the fact that, even for the most efficient PPL algorithm (i.e., Algorithm 1), we need to perform two nested loops over the irreducible representations, which means that enforcing point-group symmetry results in a larger number of significantly smaller matrix–matrix multiplications, with a loss of efficiency with respect to the *C*_{1} treatment. On the other hand, the *O*^{3}*V*^{3} terms are implemented in a way that requires only one explicit loop over the irreducible representations, which affects the overall efficiency of their evaluation less. This effect is less apparent for calculations that are run sequentially (see the numbers in parentheses in Table II). Furthermore, we note that for the *C*_{2v} cobalt complex, where only four irreducible representations are present, the achieved FRS are closer to the theoretical ones.

| Molecule | Theoretical FRS, O^{3}V^{3} | Theoretical FRS, O^{2}V^{4} | Achieved FRS, O^{3}V^{3} | Achieved FRS, O^{2}V^{4} |
|---|---|---|---|---|
| C_{24}H_{12} | 59 | 59 | 35 (47) | 10 (23) |
| C_{60} | 64 | 63 | 44 (57) | 15 (28) |
| (nacnac)Co^{III}O | 16 | 15 | 12 (14) | 6 (9) |


To show the full potential of our new implementation, we ran a calculation on a challenging system, namely, buckminsterfullerene (C_{60}) with the *cc*-pVTZ basis set,^{61} which consists of 1800 basis functions (1620 virtual and 180 doubly occupied orbitals). The geometry can be downloaded from the Zenodo repository.^{62} The computational complexity for this system is twofold, since both the number of virtual orbitals and the number of electrons are high. We performed the calculation with the frozen-core approximation, thus correlating 120 occupied orbitals. We used a Cholesky decomposition threshold of 10^{−4} and obtained 10 169 CVs, which are nearly evenly distributed among the eight irreducible representations. The full set of CVs occupies only 29 GB of memory. This can be compared with the immense, and perhaps prohibitive, amount of memory, or rather disk space, that a traditional implementation would demand: about 3.4 TB for the full set of ⟨*ab*|*cd*⟩ integrals, as estimated with a dry run. We note that, in our implementation, to achieve maximum efficiency, we store the ⟨*ab*|*cd*⟩ integrals with *a* ≤ *b* and no other restriction. If one wanted to store only the unique elements of the ⟨*ab*|*cd*⟩ array, one would still need slightly less than 1 TB of memory, a formidable amount that comes at the price of having to expand lower-triangular matrices into square matrices before performing the relevant matrix–matrix multiplications. To accelerate the convergence of the non-linear equation solver, we exploited as usual the direct inversion in the iterative subspace (DIIS) strategy, and in order to avoid the in-core storage of the past expansion and error vectors, we write them to and read them from disk at each iteration. The memory required to store a single set of amplitudes $t_{ij}^{ab}$ in fact amounts to ∼30 GB, making their storage a potential bottleneck, especially when large DIIS expansion spaces are used.
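The DIIS extrapolation used to accelerate the amplitude iterations can be sketched with a minimal, generic fixed-point solver in Python/NumPy. The function below and its toy test problem are illustrative only and are not taken from the CFOUR implementation; each new iterate is the error-minimizing linear combination of a bounded history of previous iterates, with the coefficient constraint enforced through a bordered linear system.

```python
import numpy as np

def diis_solve(g, x0, max_vecs=6, tol=1e-10, max_iter=100):
    """Fixed-point solver for x = g(x), accelerated by DIIS extrapolation."""
    xs, es = [], []                           # stored iterates / error vectors
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = g(x)
        e = x_new - x                         # error vector of this step
        if np.linalg.norm(e) < tol:
            return x_new
        xs.append(x_new)
        es.append(e)
        if len(xs) > max_vecs:                # keep only a bounded history
            xs.pop(0)
            es.pop(0)
        n = len(xs)
        # Bordered B matrix: B[i,j] = <e_i|e_j>, last row/column enforce
        # sum(c) = 1 via a Lagrange multiplier.
        B = -np.ones((n + 1, n + 1))
        B[-1, -1] = 0.0
        for i in range(n):
            for j in range(n):
                B[i, j] = np.dot(es[i], es[j])
        rhs = np.zeros(n + 1)
        rhs[-1] = -1.0
        c = np.linalg.lstsq(B, rhs, rcond=None)[0][:n]
        x = sum(ci * xi for ci, xi in zip(c, xs))   # extrapolated iterate
    return x

# Toy test: the fixed point of cos(x), approached component-wise.
x = diis_solve(np.cos, np.zeros(3))
assert np.allclose(x, 0.7390851332151607)
```

Writing the `xs`/`es` history to disk instead of keeping it in memory, as described above for the ∼30 GB amplitude sets, changes only where the vectors live, not the extrapolation itself.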

We first performed the computation using the “loop ab” PPL algorithm, distributing the work over 32 OpenMP threads. With this setup, the total memory requested amounts to 478 GB, and the time for a single iteration is 6 h 50 min, proving the feasibility of the calculation. Furthermore, we note that with this algorithm the requested memory is almost independent of the number of threads used. We then repeated the calculation using the “loop a” PPL algorithm and 12 OpenMP threads. With this configuration, however, the amount of memory requested is 1080 GB. A single iteration took about 2 h, and the calculation converged in 1 day and 19 h to 10^{−7} in the maximum norm of the residual. The PPL contraction is the most time-consuming operation: about 65% of a single-iteration time is spent computing this term, while the rest is mostly spent computing the *O*^{3}*V*^{3} contributions. The converged CCSD energy is −2281.300 20 E_{h}, of which −8.915 74 E_{h} is the correlation contribution.

In summary, we present a CCSD code that exploits Abelian point-group symmetry and the Cholesky decomposition of the two-electron integrals. The most expensive contribution—namely the particle–particle ladder term—has been implemented in two different versions using the symmetric/antisymmetric strategy. The two implementations differ in the memory required to store temporary arrays and the code is driven toward the best option depending on the available hardware resources. The implementation is suited to treat medium and large symmetric systems as demonstrated by performing CCSD calculations on buckminsterfullerene (C_{60}) with a polarized valence triple-zeta basis set.

F.L. and T.N. acknowledge financial support from ICSC-Centro Nazionale di Ricerca in High Performance Computing, Big Data, and Quantum Computing, funded by the European Union—Next Generation EU—PNRR, Missione 4 Componente 2 Investimento 1.4. In Mainz, this work was supported by the Deutsche Forschungsgemeinschaft (DFG) via project B5 within the TRR 146 “Multiscale Simulation Methods for Soft Matter Systems.”

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Tommaso Nottoli**: Data curation (equal); Investigation (equal); Methodology (equal); Software (lead); Writing – original draft (equal); Writing – review & editing (equal). **Jürgen Gauss**: Conceptualization (equal); Investigation (equal); Methodology (equal); Supervision (equal); Writing – review & editing (equal). **Filippo Lipparini**: Conceptualization (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Resources (equal); Supervision (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are openly available in the Zenodo repository.^{62}

## REFERENCES
