A novel implementation of the coupled-cluster singles and doubles (CCSD) approach is presented that is specifically tailored for the treatment of large symmetric systems. It fully exploits Abelian point-group symmetry together with a Cholesky decomposition of the two-electron repulsion integrals. In line with modern CCSD algorithms, we propose two alternative strategies for the computation of the so-called particle–particle ladder term, and the code is driven toward the optimal choice depending on the available hardware resources. As a large-scale application, we computed the frozen-core correlation energy of buckminsterfullerene (C60) with a polarized valence triple-zeta basis set (240 correlated electrons in 1740 orbitals).
The accurate and reliable determination of the energy and properties of a molecular system is a pivotal task of quantum chemistry. In this context, coupled-cluster (CC) methods1–3 are an optimal choice due to their remarkable accuracy and size-extensivity. For many applications, CC with single and double excitations (CCSD)4 offers a good compromise between accuracy and computational cost. Nevertheless, its routine use is restricted to small- and medium-sized systems: the computational cost scales with the sixth power of the number of basis functions, and the storage requirements are high because two-electron integrals involving three and four virtual orbitals in principle need to be stored.
To overcome such computational limitations, various strategies have been developed, and they are often combined to obtain the most efficient implementation. A first class of strategies aims at reducing the overall scaling of the calculation by exploiting the local nature of electron correlation.5–12 Other low-scaling solutions are given by methods that compress the double-excitation amplitudes,13 as, for example, in the tensor hypercontraction (THC) scheme,14–16 or by real-space methods.17 Using such techniques, impressive large-scale calculations can be performed.18 However, reduced-scaling methods come with various difficulties, which include the use of cutoffs to enforce the locality of correlation, complicated expressions for analytical derivatives,19–23 and an overall accuracy control that requires some user experience. A different family of strategies does not reduce the formal scaling of the method but allows for very efficient implementations that can be vectorized, parallelized, and formulated mostly in terms of highly optimized matrix–matrix multiplications. This is achieved by introducing a low-rank approximation of the two-electron repulsion integrals (ERIs), such as density fitting (DF)24–28 and Cholesky decomposition (CD).29–37 Such techniques do not change the overall $O(N^6)$ complexity of the method, but they reduce its prefactor, drastically lower the integral storage and bandwidth requirements, and improve the parallelization potential.
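Schematically, and independently of the specific equations used in this work, both DF and CD replace the four-index ERIs by products of three-index quantities,
\[
 (pq|rs) \;\approx\; \sum_{P=1}^{N_{\mathrm{aux}}} B_{pq}^{P}\, B_{rs}^{P},
\]
where in DF the $B_{pq}^{P}$ are built from three-center integrals over a fitted auxiliary basis, whereas in CD they are the Cholesky vectors $L_{pq}^{P}$ obtained from an incomplete, pivoted Cholesky factorization of the positive semidefinite ERI matrix.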
In parallel with these developments, new CC implementations also focus on efficient computational realizations that leverage the full potential of modern computer architectures and exploit parallel computing. Recently, various works that combine massively parallel implementation strategies, based on the joint use of the message-passing interface (MPI) and OpenMP, with or without DF/CD, have demonstrated impressive calculations on large systems without exploiting local correlation.38–49 To the best of our knowledge, the largest such CC calculation, performed at the CCSD(T)50 level, involved as many as 1624 orbitals (without the use of point-group symmetry) and employed 288 cores.49
In this work, we present an implementation of CCSD that exploits the CD of the ERI matrix. The main element of novelty in our implementation is that it fully exploits Abelian point-group symmetry. The implementation is highly vectorized, as it relies mostly on efficient level-3 BLAS routines to perform the bulk of the floating-point operations, and it is parallelized using OpenMP. It is therefore well suited to treat, on accessible computational resources, medium-sized molecular systems and, in particular, large symmetric molecules. We choose CD instead of DF because, although DF affords a more compact representation of the ERIs, its accuracy depends on the fitted auxiliary basis set used in the calculation and can be neither determined a priori nor improved in a systematic way. In contrast, the accuracy of the CD representation is rigorously controlled by the threshold used to truncate the decomposition, which is the only user-defined parameter of the calculation. Our implementation has been realized in a development version of the cfour suite of programs.51,52
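This error control can be made explicit. A standard property of the incomplete, pivoted Cholesky factorization (recalled here for completeness; the notation is generic) is that the residual of the positive semidefinite ERI matrix $M_{pq,rs}=(pq|rs)$, $\Delta = M - LL^{T}$, remains positive semidefinite, so that
\[
 |\Delta_{pq,rs}| \;\le\; \sqrt{\Delta_{pq,pq}\,\Delta_{rs,rs}} \;\le\; \delta,
\]
where $\delta$ is the decomposition threshold applied to the diagonal; every individual ERI is thus reproduced to within $\delta$.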
An efficient implementation of the PPL term should take memory constraints into account, and it should feed the CPU with the right amount of data in order to reach its peak performance. These two considerations in general require the adoption of different strategies that are often in conflict with each other. As a cornerstone of our implementation, we avoid assembling, and thus storing, $V^4$ and $V^3O$ arrays; for this reason, only batches of integrals are assembled and contracted on the fly. In particular, we implemented two different strategies for the PPL term. In the first version, which is outlined in Algorithm 1, we fix the index a and distribute the operations over shared-memory threads using the OpenMP paradigm; the memory required to store the temporary arrays is therefore $V^3 N_{\mathrm{threads}}$. Here, we used the expression reported in Eq. (6). This algorithm has the advantage of computing the PPL contraction with few, large matrix–matrix multiplications that are efficiently performed using the level-3 BLAS DGEMM routine, but as the number of threads increases, it can become too demanding in terms of memory. We therefore implemented as a second option a different algorithm, shown in Algorithm 2, where we fix both the a and b indices and again distribute the operations over OpenMP threads. In this case, the memory required to store the intermediates scales as $V^2 N_{\mathrm{threads}}$, and the expression reported in Eq. (10) was used.
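For reference, in a schematic spin-orbital form (the actual working equations of this work, Eqs. (6) and (10), which exploit the symmetric/antisymmetric combination strategy, are not reproduced here), the PPL contribution to the doubles residual and its on-the-fly assembly from the Cholesky vectors read
\[
 \sigma_{ij}^{ab} \;\leftarrow\; \frac{1}{2}\sum_{cd} \langle ab \| cd \rangle\, \tau_{ij}^{cd},
 \qquad
 \tau_{ij}^{cd} = t_{ij}^{cd} + t_i^c t_j^d - t_i^d t_j^c,
 \qquad
 \langle ab \| cd \rangle = \sum_{P} \left( L_{ac}^{P} L_{bd}^{P} - L_{ad}^{P} L_{bc}^{P} \right),
\]
so that the $V^4$ block of integrals never needs to be stored: it is assembled in batches from the Cholesky vectors and contracted immediately with the amplitudes.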
ALGORITHM 1. PPL: loop over a.
1: Form …
2: for irr2 = 1, n_irrep do            ⊳ OpenMP parallelized
3:   for a = 1, n_vir(irr2) do         ⊳ OpenMP parallelized
4:     …
5:     …
6:     …
7:     Form … and …
8:     Distribute … in …
9:   end for
10: end for
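To illustrate the structure of the "loop a" strategy in a self-contained way, we give below a minimal Fortran/OpenMP sketch (our own illustration, not the production code): it omits point-group symmetry, spin adaptation, and the actual working expression of Eq. (6), and all array names and storage layouts are illustrative assumptions. For each fixed a, the integral batch is assembled from the Cholesky vectors and contracted with the amplitudes by two DGEMM calls, so that the scratch space per thread scales as $V^3$.

subroutine ppl_loop_a(no, nv, ncv, l_vv, tau, sigma)
  ! Sketch of the "loop a" PPL strategy (no point-group symmetry, no spin
  ! adaptation): for each fixed virtual index a, the integral batch
  ! W(c,d,b) = sum_P L(c,a,P)*L(d,b,P) = <ab|cd> is assembled on the fly from
  ! the Cholesky vectors and contracted at once with the amplitudes, so the
  ! full <ab|cd> array is never stored.  All names and layouts are illustrative.
  implicit none
  integer,          intent(in)    :: no, nv, ncv
  double precision, intent(in)    :: l_vv(nv, nv, ncv)    ! Cholesky vectors L_cd^P (symmetric in c, d)
  double precision, intent(in)    :: tau(nv*nv, no*no)    ! tau_ij^cd, stored as (cd, ij)
  double precision, intent(inout) :: sigma(nv, nv, no*no) ! PPL contribution to the residual, (a, b, ij)
  integer :: a
  double precision, allocatable :: la(:,:), w(:,:), sig_a(:,:)
  !$omp parallel do private(a, la, w, sig_a) schedule(dynamic)
  do a = 1, nv
     allocate(la(nv, ncv), w(nv*nv, nv), sig_a(nv, no*no))  ! V^3 (plus V*Ncv) scratch per thread
     la = l_vv(:, a, :)                       ! gather the Cholesky vectors for the fixed index a
     ! W(c,(d,b)) = sum_P La(c,P) * L(d,b,P): one large level-3 DGEMM
     call dgemm('N', 'T', nv, nv*nv, ncv, 1.0d0, la, nv, l_vv, nv*nv, 0.0d0, w, nv)
     ! sig_a(b,ij) = sum_cd W((c,d),b) * tau((c,d),ij)
     call dgemm('T', 'N', nv, no*no, nv*nv, 1.0d0, w, nv*nv, tau, nv*nv, 0.0d0, sig_a, nv)
     sigma(a, :, :) = sigma(a, :, :) + sig_a  ! distribute into the residual; each thread owns its own a
     deallocate(la, w, sig_a)
  end do
  !$omp end parallel do
end subroutine ppl_loop_a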
ALGORITHM 2. PPL: loop over a and b.
1: Form …
2: for irr2 = 1, n_irrep do            ⊳ OpenMP parallelized
3:   for a = 1, n_vir(irr2) do         ⊳ OpenMP parallelized
4:     for irr = 1, n_irrep do         ⊳ OpenMP parallelized
5:       irr1 = irr × irr2
6:       if (irr1 < irr2) cycle
7:       b_end = n_vir(irr1)
8:       if (irr == 1) b_end = a
9:       for b = 1, b_end do           ⊳ OpenMP parallelized
10:        …
11:        …
12:        Form … and …
13:        Distribute … in …
14:      end for
15:    end for
16:  end for
17: end for
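Analogously, a minimal sketch of the "loop ab" strategy (again our own illustration, with symmetry, spin adaptation, the a ≥ b restriction, and the expression of Eq. (10) omitted; all names are illustrative) fixes both virtual indices, so that the per-thread scratch space drops to $V^2$ at the price of many more, much smaller matrix products:

subroutine ppl_loop_ab(no, nv, ncv, l_vv, tau, sigma)
  ! Sketch of the "loop ab" PPL strategy: with both a and b fixed, the scratch
  ! per thread shrinks to V^2, at the price of many small matrix products.
  implicit none
  integer,          intent(in)    :: no, nv, ncv
  double precision, intent(in)    :: l_vv(nv, nv, ncv)    ! Cholesky vectors L_cd^P
  double precision, intent(in)    :: tau(nv*nv, no*no)    ! tau_ij^cd, stored as (cd, ij)
  double precision, intent(inout) :: sigma(nv, nv, no*no) ! PPL contribution to the residual, (a, b, ij)
  integer :: a, b
  double precision, allocatable :: w(:,:), sig_ab(:)
  !$omp parallel do collapse(2) private(a, b, w, sig_ab) schedule(dynamic)
  do a = 1, nv
     do b = 1, nv
        allocate(w(nv, nv), sig_ab(no*no))   ! V^2 (plus O^2) scratch per thread
        ! W(c,d) = sum_P L(c,a,P) * L(d,b,P) = <ab|cd> for the fixed pair (a,b)
        call dgemm('N', 'T', nv, nv, ncv, 1.0d0, l_vv(:, a, :), nv, l_vv(:, b, :), nv, 0.0d0, w, nv)
        ! sig_ab(ij) = sum_cd W(c,d) * tau((c,d), ij)
        call dgemv('T', nv*nv, no*no, 1.0d0, tau, nv*nv, w, 1, 0.0d0, sig_ab, 1)
        sigma(a, b, :) = sigma(a, b, :) + sig_ab
        deallocate(w, sig_ab)
     end do
  end do
  !$omp end parallel do
end subroutine ppl_loop_ab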
To test the two PPL algorithms, we ran calculations on coronene (C24H12) enforcing D2h symmetry, using Dunning's cc-pVTZ basis set61 and the frozen-core (fc) approximation, that is, 888 orbitals, of which 24 are kept frozen. The geometry is given as supplementary material in the Zenodo repository.62 We note in passing that for non-symmetric systems our code performs similarly to the recent implementation presented by Schnack-Petersen et al.;59 here, we focus on the performance of the implementation for symmetric molecules. All calculations were run on a single Intel Xeon Gold 6140M node equipped with 1.1 TB of RAM. The results are summarized in Table I. For both algorithms, we ran the calculation using 32 OpenMP threads. As expected, the peak memory required is considerably reduced for the "loop ab" algorithm, and in particular it coincides with the memory required by the PPL subroutine. However, the time for a single iteration is approximately a factor of two longer than with the first algorithm. We observe that both algorithms show an analogous speedup behavior, as shown in Fig. 1, with the "loop a" algorithm performing slightly better for smaller numbers of threads but reaching a plateau sooner than the "loop ab" algorithm. This behavior is expected, as both algorithms execute in parallel a large number of DGEMM matrix–matrix multiplications, which makes the code memory-bound. This in turn causes cache misses and other cache-related issues, leaving the cores idle while they wait for data to process. The "loop ab" algorithm scales slightly better because the matrix–matrix multiplications performed within the parallel loop are smaller, which reduces the memory bottleneck. Nevertheless, a somewhat satisfactory speedup factor of 8–10 can be achieved. Different parallelization strategies, as well as the deployment of the code on different architectures, will be investigated in the future.
Summary of the comparison between the two implementations of the PPL contraction. The calculations were performed on coronene with the cc-pVTZ basis set using up to 32 OpenMP threads. We report for each algorithm the peak memory consumption in GB, which coincides with the memory requested by the PPL subroutine, and the average time for a single CCSD iteration in minutes.
Algorithm | Memory (GB) | Time per iteration (min)
---|---|---
Loop a | 272 | 3.5
Loop ab | 47 | 6
Speedup plot for CD-based CCSD/cc-pVTZ calculations on coronene. For both algorithms, we plot the ratio between the iteration time of the serial calculation and that of the parallel one. Both axes are on a log2 scale.
The benefit obtained by exploiting point-group symmetry can be assessed by comparing the theoretical factor of reduction due to symmetry (FRS) with the achieved FRS. The theoretical FRS is computed as the ratio between the total number of floating-point operations required for a specific contraction term in the CC equations without symmetry and the number required when symmetry is exploited. The achieved FRS is given by the ratio between the elapsed CPU times required for a specific contraction performed without and with enforcing Abelian symmetry. The results are summarized in Table II. As test systems and basis sets, we used coronene with (fc)cc-pVTZ, buckminsterfullerene (C60) with (fc)cc-pVDZ, and a cobalt β-diketiminato oxo complex [(nacnac)CoIIIO] with cc-pVDZ, whose geometry was taken from Ref. 63. The three molecular structures are shown in Fig. 2. We considered only the most expensive contractions, namely, those that exhibit an $O^3V^3$ and $O^2V^4$ (PPL) scaling. We note that the achieved FRS is systematically larger for the $O^3V^3$ terms than for the PPL contraction. This can be explained by the fact that, even for the more efficient PPL algorithm (i.e., Algorithm 1), two nested loops over the irreducible representations are required, which means that enforcing point-group symmetry results in a larger number of significantly smaller matrix–matrix multiplications, with a loss of efficiency with respect to the C1 treatment. On the other hand, the $O^3V^3$ terms are implemented in a way that requires only one explicit loop over the irreducible representations, and their evaluation is thus affected less. This effect is less apparent for calculations that are run sequentially (see the numbers in parentheses in Table II). Furthermore, we note that for the C2v cobalt complex, where only four irreducible representations are present, the achieved FRS is closer to the theoretical one.
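For orientation, a rough estimate of the theoretical FRS (our own, assuming that the orbitals are evenly distributed among the $h$ irreducible representations of the Abelian group) follows from counting the symmetry-allowed blocks of, e.g., the $O^2V^4$ PPL contraction: only $h^4$ of the $h^6$ index-irrep combinations survive the symmetry constraints, each costing $(O/h)^2(V/h)^4$ operations, so that
\[
 \mathrm{FRS}_{\mathrm{theo}} \;\approx\; \frac{O^2 V^4}{h^4\,(O/h)^2 (V/h)^4} \;=\; h^2,
\]
that is, 64 for D2h and 16 for C2v, in line with the values of Table II; deviations reflect the uneven population of the irreducible representations.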
We report the theoretical and achieved factors of reduction due to symmetry (FRS) for D2h coronene, D2h buckminsterfullerene, and the C2v cobalt complex. The calculations were run using 32 OpenMP threads; in parentheses, we report the achieved FRS obtained with a serial calculation.
Molecule | Theoretical FRS, $O^3V^3$ | Theoretical FRS, $O^2V^4$ | Achieved FRS, $O^3V^3$ | Achieved FRS, $O^2V^4$
---|---|---|---|---
C24H12 | 59 | 59 | 35 (47) | 10 (23)
C60 | 64 | 63 | 44 (57) | 15 (28)
(nacnac)CoIIIO | 16 | 15 | 12 (14) | 6 (9)
Molecular systems used to test the CD-CCSD code. From the left: (nacnac)CoIIIO, buckminsterfullerene, and coronene. Carbon is represented in silver, oxygen in red, nitrogen in blue, cobalt in pink, and hydrogen in white.
To show the full potential of our new implementation, we ran a calculation on a challenging system, namely, buckminsterfullerene (C60) with the cc-pVTZ basis set,61 which comprises 1800 basis functions (1620 virtual and 180 doubly occupied orbitals). The geometry can be downloaded from the Zenodo repository.62 The computational challenge is twofold, since both the number of virtual orbitals and the number of correlated electrons are large. We performed the calculation with the frozen-core approximation, thus correlating 120 occupied orbitals. We used a Cholesky decomposition threshold of $10^{-4}$ and obtained 10 169 Cholesky vectors (CVs), which are nearly evenly distributed among the eight irreducible representations. The full set of CVs occupies only 29 GB of memory, which can be compared with the immense, and likely prohibitive, amount of memory (or rather disk space) that a traditional implementation would demand: about 3.4 TB for the full set of ⟨ab|cd⟩ integrals, as estimated with a dry run. We note here that, in our implementation, to achieve maximum efficiency, we store the ⟨ab|cd⟩ integrals with a ≤ b and no other restriction. If one wanted to store only the unique elements of the ⟨ab|cd⟩ array, one would still need slightly less than 1 TB, a formidable amount of memory that, moreover, comes at the price of having to expand lower-triangular matrices into square matrices before performing the relevant matrix–matrix multiplications. To accelerate the convergence of the non-linear equation solver, we used, as is customary, the direct inversion in the iterative subspace (DIIS) strategy, and in order to avoid the in-core storage of the past expansion and error vectors, we read and write them from and to disk at each iteration. The memory required to store a set of amplitudes amounts in fact to ∼30 GB, making them a potential bottleneck, especially when large DIIS expansion spaces are used.
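As a consistency check on the quoted figure, a back-of-the-envelope estimate (our own, assuming double precision and that only the totally symmetric integral blocks, roughly a fraction $1/h$ with $h=8$, need to be kept) gives
\[
 \frac{1}{h}\,\frac{V(V+1)}{2}\,V^{2}\times 8\ \text{bytes}
 \;=\; \frac{1}{8}\times\frac{1620\cdot 1621}{2}\times 1620^{2}\times 8\ \text{bytes}
 \;\approx\; 3.4\ \text{TB},
\]
in agreement with the dry-run estimate for the ⟨ab|cd⟩ integrals stored with the a ≤ b restriction.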
We first performed the computation using the "loop ab" PPL algorithm, distributing the work over 32 OpenMP threads. With this setup, the total memory requested amounts to 478 GB, and the time for a single iteration is 6 h 50 min, which proves the feasibility of the calculation. Furthermore, we note that with this algorithm the requested memory is almost independent of the number of threads used. We then repeated the calculation using the "loop a" PPL algorithm and 12 OpenMP threads. In this configuration, however, the amount of memory requested is 1080 GB. A single iteration took about 2 h, and the calculation converged in 1 day and 19 h to $10^{-7}$ in the maximum norm of the residual. The PPL contraction is the most time-consuming operation: about 65% of the time of a single iteration is spent computing this term, while the rest is mostly spent on the $O^3V^3$ contributions. The converged CCSD energy is −2281.300 20 Eh, of which −8.915 74 Eh is the correlation contribution.
In summary, we have presented a CCSD code that exploits Abelian point-group symmetry and the Cholesky decomposition of the two-electron integrals. The most expensive contribution, namely, the particle–particle ladder term, has been implemented in two different versions using the symmetric/antisymmetric strategy. The two implementations differ in the memory required to store temporary arrays, and the code is driven toward the best option depending on the available hardware resources. The implementation is suited to treat medium-sized and large symmetric systems, as demonstrated by CCSD calculations on buckminsterfullerene (C60) with a polarized valence triple-zeta basis set.
F.L. and T.N. acknowledge financial support from ICSC-Centro Nazionale di Ricerca in High Performance Computing, Big Data, and Quantum Computing, funded by the European Union—Next Generation EU—PNRR, Missione 4 Componente 2 Investimento 1.4. In Mainz, this work was supported by the Deutsche Forschungsgemeinschaft (DFG) via project B5 within the TRR 146 “Multiscale Simulation Methods for Soft Matter Systems.”
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Tommaso Nottoli: Data curation (equal); Investigation (equal); Methodology (equal); Software (lead); Writing – original draft (equal); Writing – review & editing (equal). Jürgen Gauss: Conceptualization (equal); Investigation (equal); Methodology (equal); Supervision (equal); Writing – review & editing (equal). Filippo Lipparini: Conceptualization (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Resources (equal); Supervision (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are openly available in the Zenodo repository.62