We present a detailed discussion of our novel diagrammatic coupled cluster Monte Carlo (diagCCMC) [Scott *et al.* J. Phys. Chem. Lett. **10**, 925 (2019)]. The diagCCMC algorithm performs an imaginary-time propagation of the similarity-transformed coupled cluster Schrödinger equation. Imaginary-time updates are computed by the stochastic sampling of the coupled cluster vector function: each term is evaluated as a randomly realized diagram in the connected expansion of the similarity-transformed Hamiltonian. We highlight similarities and differences between deterministic and stochastic linked coupled cluster theory when the latter is re-expressed as a sampling of the diagrammatic expansion and discuss details of our implementation that allow for a walker-less realization of the stochastic sampling. Finally, we demonstrate that in the presence of locality, our algorithm can obtain a fixed errorbar per electron while only requiring an asymptotic computational effort that scales *quartically* with system size, *independent* of the truncation level in coupled cluster theory. The algorithm only requires an asymptotic memory cost scaling linearly, as demonstrated previously. These scaling reductions require no *ad hoc* modifications to the approach.

## I. INTRODUCTION

The goal of quantum chemistry is to provide accurate and cost-effective methodologies for the solution of the molecular electronic Schrödinger equation. One needs to be able not only to reproduce experimentally measurable observables but also to understand the microscopic origin of these measurements and eventually predict and guide experiments.

Stochastic approaches to the solution of the Schrödinger equation provide an appealing alternative to deterministic strategies, and a number of Monte Carlo (MC) sampling methods have been continuously developed since the early days of quantum chemistry.^{1,2} At the cost of introducing statistical uncertainty in the results, quantum Monte Carlo (QMC) offers a low-scaling, parallelizable route to high-accuracy results. Despite the favorable scaling and scalability, QMC suffers from two well-known problems. The statistical errorbar can be decreased but at a very slow rate with increasing length of the simulation, that is, a larger number of random samples. Furthermore, for fermionic systems of interest in molecular electronic structure theory, the nodal structure of the wavefunction needs to be fixed *a priori* to avoid collapse onto the bosonic ground state,^{3,4} introducing an uncontrolled approximation that, thus, far cannot be efficiently relaxed to exactness. Despite this, their low polynomial scaling allows large-scale applications, particularly within condensed matter systems where they can provide accurate results while allowing extrapolation to remove finite-size errors.^{1,2,5}

Within the deterministic realm, the two above-mentioned problems do not appear. No statistical uncertainity riddles the results, and the methods are all formulated in the appropriate Fock space, guaranteeing that the solution, while approximate, is properly antisymmetrized. The coupled cluster (CC) wavefunction ansatz arguably provides the most effective framework for accurate simulations of single-reference molecular systems. The CC model provides an exponential parametrization of the molecular electronic wavefunction and enjoys a number of favorable properties. It provides a systematic route toward the exact, full configuration interaction (FCI) solution, while maintaining size-extensivity and size-consistency of results at any truncation level. Despite the exponential, nonlinear parametrization of the wavefunction, the computational cost of CC theory scales as a polynomial of system size, albeit with potentially high values for the exponents. Scaling and scalability are thus much less favorable than with stochastic approaches: a number of approximations have to be introduced,^{6–27} and many technical challenges need to be surmounted.^{28–36} In particular, while many high-performance implementations of coupled cluster with single and double (CCSD) substitutions and CCSD with perturbative triples correction [CCSD(T)] are nowadays available, the large gain in efficiency seen in local theories has yet to be reproduced for higher truncation levels in the CC hierarchy.

With these considerations in mind, efforts in the past decade have been directed at combining the best of both worlds into the formulation and implementation of Fock-space QMC methods. The full configuration interaction quantum Monte Carlo (FCIQMC) was the first such method to be presented, and the FCI secular problem is solved as the dynamics of a population of signed particles.^{37–40} This results in an exponentially scaling algorithm but with a dramatically reduced prefactor. Building on this, some of us have further developed a similar projector MC algorithm to solve the *unlinked* and *linked* CC equations.^{41–45} These CCMC algorithms, implemented in the HANDE-QMC software package,^{46} are fully general with respect to the excitation level, allowing one to perform arbitrary order CC simulations with a sparse representation of the wavefunction.

We recently showed that neither FCIQMC nor CCMC rigorously fulfills size-extensivity for noninteracting systems.^{47} Both algorithms perform imaginary-time propagation of unlinked many-body equations,^{48} which results in the unnecessary sampling of zero-on-average terms. This unnecessary work negatively impacts the memory and central processing unit (CPU) costs of the simulation and is particularly severe for CCMC, as it quickly precludes scaling to larger systems and/or higher orders of CC theory. To remedy this situation, we put forth a MC algorithm that performs the imaginary-time propagation governed by the *linked* CC equations. These are evaluated by the random sampling of the connected terms in the similarity-transformed Hamiltonian, conveniently represented as a diagrammatic expansion. The diagCCMC algorithm restores size-extensivity, and our preliminary tests have shown how localization can be readily exploited without further assumptions.^{49}

We should note that this is not the only avenue toward leveraging the benefits of MC sampling within the CC approach. Deustua *et al.* have shown how deterministic iterative CC solvers can be seeded with amplitudes from partially converged Fock-space QMC results. Combined with moment expansion corrections,^{50–52} this approach is a powerful technique, enabling access to higher levels of CC theory at a reduced cost. However, the high computational scaling of the QMC methods used to determine important higher-level amplitudes will eventually dominate the overall computational cost of this approach, and thus, our work provides a complementary solution.

In this work, we will first describe in detail the theoretical framework in which our diagCCMC algorithm rests. Section II summarizes CC theory with particular emphasis on its diagrammatic formulation. In Sec. III, we present a derivation of the imaginary-time update step and its usage in the CCMC and diagCCMC algorithms. We will then discuss the structure of the implemented algorithm and highlight differences and similarities to a deterministic implementation of CC theory.

## II. BACKGROUND THEORY

### A. Notation

We will use the tensor notation for second quantization.^{53–55} We denote the elementary, anticommuting fermion creation and annihilation operators as:

A *k*-electron excitation operator with respect to the physical vacuum (|vac⟩) is the product of *k* creation and *k* annihilation operators. In tensor notation,

such excitation operators are particle number-conserving. Explicitly, the one- and two-electron substitutions are

The Born–Oppenheimer molecular electronic Hamiltonian is then expressed as

where the integrals are given in an orthonormal basis of one-electron spin–orbitals,

For single-reference theories, it is more convenient to work in terms of the Fermi, rather than the physical, vacuum state. Our Fermi vacuum will be a single-determinant reference function |*D*_{0}⟩ for an *N*-electron system with 2*M* spin–orbitals.

Occupied one-particle states in |*D*_{0}⟩, *i*_{1}, *i*_{2}, …, *i*_{N}, will be referred to as *hole* states, whereas virtual one-particle states, *a*_{N+1}, *a*_{N+2}, …, will be referred to as *particle* states. A normal-ordered *k*-electron substitution operator will be denoted as

Using Wick’s theorem,^{55,56} these operators can be rewritten as a finite sum of subsets of permutations of elementary operators times contractions. The latter are elements of *k*-electron reduced density matrices (*k*-RDMs),

Thus, the normal-ordered one- and two-electron substitutions are^{57}

Imposing normal ordering on the molecular Hamiltonian, we obtain

where the energy of the reference determinant, the one-body Fock operator, and the two-body fluctuation potential appear,

We will use the symbol $\tau ^k$ for pure excitation operators or *excitors*. These are *k*-electron substitutions between hole and particle states in the reference determinant and, thus, particle number- and charge-conserving.

Since the *k*-RDMs in (7) are zero whenever any of the indices, upper or lower, refers to a particle state, the excitors are automatically normal-ordered,

Here, we have introduced the multi-index $k=a1a2\u2026aki1i2\u2026ik$ to compactly represent the *k*-electron substitution affected by the excitor, which is, up to a phase, a *k*-excited determinant,

A summary of our notation can be found in Table I.

Symbol . | Short description . |
---|---|

|D_{0}⟩ | The reference Slater determinant |

$\xe2p$ | Fermion creation operator |

$\xe2p$ | Fermion annihilation operator |

$\xe2r1r2\u2026rks1s2\u2026sk$ | k-electron excitation operator with respect to the physical vacuum |

$\xear1r2\u2026rks1s2\u2026sk$ | k-electron excitation operator, normal-ordered with respect to the reference |

j, k, … | Replacement multi-indices, e.g., $k=a1a2\u2026aki1i2\u2026ik$ |

|D_{k}⟩ | The k-th replacement excited determinant |

p, q, r, s, … | General spin–orbital indices |

i_{1}, i_{2}, …, i_{k} | Hole spin–orbitals in |D_{0}⟩ |

a_{1}, a_{2}, …, a_{k} | Particle spin–orbitals in |D_{0}⟩ |

$\tau ^k$ | Excitor for the k-th replacement |

t_{k} | Cluster amplitude for the k-th replacement |

$tk\tau ^k$ | Connected (non-composite) cluster |

$12!tktl\tau ^k\tau ^l$ | Disconnected (composite) cluster |

Symbol . | Short description . |
---|---|

|D_{0}⟩ | The reference Slater determinant |

$\xe2p$ | Fermion creation operator |

$\xe2p$ | Fermion annihilation operator |

$\xe2r1r2\u2026rks1s2\u2026sk$ | k-electron excitation operator with respect to the physical vacuum |

$\xear1r2\u2026rks1s2\u2026sk$ | k-electron excitation operator, normal-ordered with respect to the reference |

j, k, … | Replacement multi-indices, e.g., $k=a1a2\u2026aki1i2\u2026ik$ |

|D_{k}⟩ | The k-th replacement excited determinant |

p, q, r, s, … | General spin–orbital indices |

i_{1}, i_{2}, …, i_{k} | Hole spin–orbitals in |D_{0}⟩ |

a_{1}, a_{2}, …, a_{k} | Particle spin–orbitals in |D_{0}⟩ |

$\tau ^k$ | Excitor for the k-th replacement |

t_{k} | Cluster amplitude for the k-th replacement |

$tk\tau ^k$ | Connected (non-composite) cluster |

$12!tktl\tau ^k\tau ^l$ | Disconnected (composite) cluster |

### B. The coupled cluster ansatz

The coupled cluster wavefunction is parametrized as an exponential transformation of a reference single-determinant wavefunction |*D*_{0}⟩,

where the cluster operator $T^$ is given as a sum of second-quantized excitation operators,

with the *m*th order cluster operators expressed as sums of excitors weighted by the corresponding *cluster amplitudes*,

Note that in the tensor notation adopted, upper and lower indices of the cluster amplitudes appear reversed with respect to other conventions.

The CC correlation energy is the right eigenvalue of the Schrödinger equation for the normal-ordered Hamiltonian in (9),

This equation is solved by performing a similarity transformation of the Hamiltonian,

and then projecting onto the excitation manifold {|*D*_{j}⟩},

Equation (18b) defines the CC residual *ω*_{k}(** t**), which is zero at a solution of the nonlinear linked equations.

^{58}Whereas (16) and (18a) can be proved to be identical,

^{59}the

*linked*formulation in the latter is size-extensive order-by-order and term-by-term. For notational convenience, we have dropped the subscript N for the Hamiltonian. The similarity-transformed Hamiltonian $H\xaf$ can be expanded into a Baker–Campbell–Hausdorff (BCH) commutator series,

For the molecular Hamiltonian in Eq. (4), at most two-body operators are involved. Hence, regardless of the truncation level in the cluster operator $T^$, the expansion truncates at the fourfold nested commutator,

showing that only finitely many terms are included in Eq. (18b). Despite the fact that $H\xaf$ is no longer Hermitian, the linked formulation is still more advantageous than the unlinked formulation.^{59–61}

Since all excitors are normal-ordered and commuting, Wick’s theorem^{56,60,61} lets us reduce the Hamiltonian-excitor products to only those terms that are *connected* (in the diagrammatic sense). Excitors will only appear to the *right* of the Hamiltonian, and only terms where each excitor shares at least one index with the Hamiltonian will be nonzero,

The requirement of shared indices between the Hamiltonian and cluster coefficients enables the resulting equations to be solved via a series of tensor contractions: a process highly amenable to rapid evaluation on conventional computing architectures^{62} but non-trivial to parallelize.^{28,29,35}

### C. Diagrammatic representation

The algebraic derivation of the linked CC equations to a general truncation level from Eq. (18b) is lengthy and error prone. A diagrammatic representation can be effectively used to generate all unique terms in the equations.^{60,63,64} The normal-ordering and application of Wick’s theorem are key to these developments. The normal-ordered Hamiltonian features 13 *interaction vertices*, 4 coming from the Fock operator,

and 9 from the fluctuation potential,

Each of these vertices can be characterized by an integer representing their excitation level (0, ±1, ±2) and by a sign sequence encoding the pattern of open particle (+) and hole (−) lines *below* the interaction vertex (see Table II). Cluster operators can be classified similarly in terms of their excitation level (any integer ≥1) and their sign sequence.

. | Vertex . | Matrix element . | Excitation level . | Sign sequence . |
---|---|---|---|---|

1 | $fa1a2$ | 0 | + | |

2 | $fi1i2$ | 0 | − | |

3 | $fi1a1$ | −1 | +− | |

4 | $g\xafa1a2a3a4$ | 0 | ++ | |

5 | $g\xafi1i2i3i4$ | 0 | −− | |

6 | $g\xafa1i1a2i2$ | 0 | +− | |

7 | $g\xafa1a2a3i1$ | +1 | + | |

8 | $g\xafi1a1i2i3$ | +1 | ++− | |

9 | $g\xafa1i1a2a3$ | −1 | − | |

10 | $g\xafi1i2i3a1$ | −1 | +−− | |

11 | $g\xafi1i2a1a2$ | −2 | ++−− | |

12 | $fa1i1$ | +1 | 0 | |

13 | $g\xafa1a2i1i2$ | +2 | 0 |

. | Vertex . | Matrix element . | Excitation level . | Sign sequence . |
---|---|---|---|---|

1 | $fa1a2$ | 0 | + | |

2 | $fi1i2$ | 0 | − | |

3 | $fi1a1$ | −1 | +− | |

4 | $g\xafa1a2a3a4$ | 0 | ++ | |

5 | $g\xafi1i2i3i4$ | 0 | −− | |

6 | $g\xafa1i1a2i2$ | 0 | +− | |

7 | $g\xafa1a2a3i1$ | +1 | + | |

8 | $g\xafi1a1i2i3$ | +1 | ++− | |

9 | $g\xafa1i1a2a3$ | −1 | − | |

10 | $g\xafi1i2i3a1$ | −1 | +−− | |

11 | $g\xafi1i2a1a2$ | −2 | ++−− | |

12 | $fa1i1$ | +1 | 0 | |

13 | $g\xafa1a2i1i2$ | +2 | 0 |

For any given excitation level in the allowed manifold (up to double excitations for CCSD, triple excitations for CCSDT, and so forth), the diagrammatic generation of the corresponding CC equations proceeds via the following steps:

At the bottom, we draw a combination of at most four excitors.

At the top, we draw a Hamiltonian vertex. The valid vertices are limited by two requirements: (a) the final diagram be connected and (b) the overall excitation level of the projection manifold.

We pair the Hamiltonian vertex and excitor(s) sign sequences in all distinct ways to generate the sign sequences for all unique diagrams. The sign sequence encodes the diagram topology and ensuing contraction pattern.

We read the algebraic expression for the corresponding term in the CC equations off from the generated diagrams. The rules of interpretation associate target indices to the external (open) lines and dummy summation indices to the internal lines, Hamiltonian matrix elements to the interaction vertices, and products of amplitudes to the excitor vertices. Topological and permutational symmetries are taken into account by similar simple rules.

^{60,61,63}

The rules for generating and interpreting diagrams as algebraic expressions are independent of the CC truncation order and can be encoded into a computer program.^{61,65–70} However, a proper factorization of intermediates is essential to achieve acceptable time to solution and memory requirements.^{66}

## III. STOCHASTIC REALIZATIONS OF COUPLED CLUSTER THEORY

The solution of the CC equations can be achieved by means of stochastic algorithms. This stochastic realization is, however, not unique, and multiple algorithms have been put forward in the literature.^{41–43} All these different realizations are based on reformulating the time-dependent Schrödinger equation in imaginary time. The corresponding diffusion-like equation can be solved by the repeated application of an approximate propagator on a trial state. Employing a Fock-space representation circumvents the fermion sign problem without the need for fixing the nodes *a priori*.^{71}

### A. The imaginary-time propagation

After performing a Wick rotation *τ* ← i*t* to imaginary time, the time-dependent CC Schrödinger equation reads as^{72,73}

The *τ*-derivative on the left-hand side is (see the Appendix)

Excitation operators are assumed to be time-independent,

and since all excitors commute, the nested commutator expansion truncates at *l* = 0,

The imaginary-time Schrödinger equation (24) then becomes

and upon projection onto $\u2009Dk|exp\u2212T^(\tau )=\u2009D0|\tau ^k\u2020\u2061exp\u2212T^(\tau )$,

since by construction $\u27e8D0|\tau ^k\u2020\tau ^j|D0\u27e9=\delta kj$. Equation (29) is an imaginary-time ordinary differential equation (ODE) that we can solve by discretization.

The stochastic propagation of the linked CC equations is, thus, directly related to those utilized within FCIQMC,^{37} diffusion Monte Carlo (DMC),^{1,74} and the original unlinked coupled cluster Monte Carlo (CCMC) approach.^{41,42} This allows us to understand limits on the time step due to the spectral range of the Hamiltonian and more directly compare computational costs with prior stochastic coupled cluster theory.

### B. Discretized imaginary-time propagation and preconditioning

The imaginary-time ODE in Eq. (29) can be discretized in a number of ways. In principle, we would like to (a) use as large a time step as possible without losing the stability of the integrator and (b) perform the fewest possible number of evaluations of the CC vector function per time step. The usual approach in CCMC and FCIQMC is the explicit Euler method with a time step *h*,

where $tk[n+1]$ and $tk[n]$ are the cluster amplitudes at times *τ* + *h* and *τ*, respectively, and $\omega k[n]$ is the CC vector function at time *τ*.

Alternatively, one could use an implicit Euler scheme,

where the right-hand side now depends on the CC vector function evaluated at time *τ* + *h*. We can approximate this term using the Newton–Raphson step,^{59}

where the CC Jacobian has been introduced,

and obtain the Rosenbrock–Euler method,^{75,76}

Under the assumption of non-singular Jacobian, we can use a Woodbury-type identity to compute the inverse,^{77}

and retaining the first term only yields the *deterministic* Newton–Raphson step,^{59}

Given this point of view, it is possible to relate the imaginary-time propagation to a number of standard techniques in numerical analysis. Given a time-step *δτ*, the generalized step,

will be equivalent to a *relaxed* Newton–Raphson method.

The use of the full CC Jacobian for preconditioning would be extremely expensive, and a more pragmatic route is taken in practice. The simplest choice is to approximate the Jacobian with the identity matrix, i.e., no preconditioning is applied to the iterations. A more sophisticated approach is to only retain iteration-independent terms in Eq. (33),

where the “d” and “od” stand for diagonal and off-diagonal, respectively. We can then propose two cheap preconditioners. We can either use the diagonal part of the Fock operator,^{78}

or the diagonal part of the full Hamiltonian,

The former is universally implemented in deterministic CC codes, and its effectiveness can be justified through perturbative arguments.^{59} The use of the latter has not, to the best of our knowledge, been attempted before.

The derivation here presented makes explicit connection with preconditioning, already discussed by some of us in connection with FCIQMC^{79} and unlinked CCMC.^{80} We will discuss how preconditioning is implemented for diagCCMC in Sec. IV C.

Finally, let us point out that Jarlebring *et al.* showed how a specific instance of a nonlinear eigenvalue problem is equivalent to a Rosenbrock-type discretization of an associated imaginary-time ODE.^{76} An adaptive time step integrator can be, thus, formulated based on convergence estimates similar to those presented in Ref. 76.

## IV. DIAGRAMMATIC COUPLED CLUSTER MONTE CARLO

We wish to stochastically solve the linked CC equation (18b). Additionally, and at variance with the approach of Franklin *et al.*, we wish to overcome the need for a corrected update step and the sampling of extraneous unlinked terms.^{43} Whereas the latter have been observed to cancel out on average, they impose limitations to what system sizes are approachable before the memory cost becomes prohibitive.

In the diagCCMC algorithm,^{47} we use the *uncorrected* update step in Eq. (30). Two novel insights allow us to achieve this goal:

The CC wavefunction is stored in a compressed representation

*without*invoking particles or walkers. It is comparatively easier to enforce constant unit intermediate normalization within a walker-less algorithm.The CC vector function appearing in the update step is an integral expressible as a terminating series expansion. Terms in this expansion can be evaluated stochastically.

The use of diagrammatic techniques automatically guarantees that only connected terms in the similarity-transformed Hamiltonian are included. The sampling will, thus, happen in “diagram space” and relies on the even selection algorithm of Scott *et al.*^{44}

### A. Stochastic compression without walkers

Previous algorithms to stochastically solve the linked CC equations modified the propagation in (30) to approach the correct solution. The need for such modifications can be attributed to the use of a *variable* intermediate normalization,

where the additional normalization parameter *N*_{0} is constrained by the energy equation,

and the unknown CC energy must be substituted by the shift *S*. At the beginning of the simulation, *S* = *E*_{ref} and this causes the energy estimator to converge incorrectly prior to initialization of population control. However, upon closer inspection, the wavefunction ansatz in (41) is seen to be equivalent to the conventional CC ansatz with (a) overlap with the reference set to *N*_{0} and (b) all nonzero cluster amplitudes *t*_{k} represented by values larger than $1N0$. The floating intermediate normalization can then be interpreted as an algorithmic choice to determine the *granularity* of representation during the calculation and achieve compression of the CC wavefunction. This choice is arbitrary and can be related back to the conventional CC ansatz. Assume then that the intermediate normalization is now a constant value ⟨*D*_{0}|CCMC⟩ = *N*_{0}, set as an input parameter to the calculation. At sufficiently small granularities, the calculation will spontaneously stabilize at a system-dependent population of walkers without the need for population control. The stochastic realization of the modified explicit Euler integration,

would then

**compress**the cluster amplitudes to the selected granularity, by*stochastically*rounding those amplitudes for which $|t\u0303k|<1$ to $sgn(t\u0303k)\xd71$ or 0,**evaluate**the CC vector function by taking a large enough number of samples such that diagrams in $\omega \u0303k$ of magnitude 1 are, on average, selected once,**adjust**the time step*δτ*as to avoid particle blooms, which are a large spawning event, which would destabilize the calculation dynamics.

We can, however, take one further step and cast away the walker interpretation entirely. The thresholding implied in the previous algorithmic sketch can be rigorously formulated *without* recourse to walkers. We introduce three strictly positive calculation parameters: the *representation* granularity Δ, the *evaluation* granularity *γ*, and the maximum diagram contribution *ϵ*. The algorithm then will

**compress**the cluster amplitudes to the chosen representation granularity, by stochastic rounding amplitudes for which |*t*_{k}| < Δ to sgn(*t*_{k}) × Δ or 0,**evaluate**the CC vector function stochastically such that diagrams with magnitude*γ*are selected once on average,**adjust**the time step*δτ*such that the maximum diagram contribution, $\delta \tau wdiagrampdiagram$, is of magnitude*ϵ*.

The walker and walker-less representations are entirely equivalent. The representation granularity is the inverse of the intermediate normalization constant $\Delta =1N0$, the condition *γ* = Δ defines the even selection approach,^{44} and the ratio $\u03f5\Delta $ is the maximum allowed size for a spawning event. The resultant approach to the imposition of sparsity bears some resemblance to recent fast randomized iteration approaches.^{81,82}

Within this approach, the total walker population is the sum of rescaled cluster coefficient absolute magnitudes and the reference $1\Delta +\u2211i|ti|\Delta $. It is, thus, not needed to set the hard-to-predict total walker population as a calculation parameter: choosing to stochastically round all cluster coefficients below a certain value gives a more intuitively stable treatment between different calculations. The total walker population can vary dramatically with system size: evaluating and comparing computational cost and performance for systems of varying size can be a nontrivial challenge. Instead, we expect the walker-less picture to manifest the *transferability* property of cluster amplitudes:^{83} the magnitude of the amplitudes should be relatively unchanged with system size, especially when localized orbitals are used, allowing equivalent parameters for different calculations to be easily identified.

We have found *γ* = 10^{−3} to be the lowest evaluation granularity giving a calculation stable enough to extract statistics from. While smaller *γ* values achieve more stable calculations, with 10^{−4} providing a reasonable compromise between the computational cost and stability. We have also continued to use conventions from the particle representation for now by setting $\Delta \gamma =1$ and $\u03f5\gamma =3$.

### B. Selection of diagrams

The second essential insight enabling the diagCCMC algorithm is the stochastic evaluation of the CC vector function on the right-hand side of the uncorrected update step. At any given excitation level in the CC hierarchy, the BCH expansion of $H\xaf$ will truncate at the fourfold nested commutator: *ω*_{k} is expressible as a sum of a finite enumerable number of terms. We choose to represent these terms as diagrams and generate such an expansion on the fly rather than enumerating the allowed diagrams beforehand. In each main Monte Carlo cycle in the algorithm, we perform the evaluation of the integral by attempting to select *n*_{a} *fully specified* diagrams from its expansion. The action of the similarity-transformed Hamiltonian on the reference determinant can be written compactly as

where the amplitude $w$_{l} = $w$_{H} ∏_{m}*t*_{m} is a product of a one- or two-body integral from the Hamiltonian and a cluster of excitors. The multi-index **l** is fully specified, meaning that all hole and particle lines are *explicitly* labeled. The amplitude is determined by the contraction pattern randomly selected during diagram generation. Finally, since $\u27e8D0|\tau ^p\u2020\tau ^q|D0\u27e9=\delta pq$, the selected diagram can contribute to one and only one cluster amplitude: the one whose multi-index corresponds to the *external* lines in $w$_{l}. The rules for the deterministic enumeration of diagrams that were briefly detailed in Sec. II C are largely unmodified in our stochastic algorithm. Each step corresponds to an event occurring with an easily computed probability:

Sample the action of the wave operator: with probability

*p*_{sel}, choose a term from the BCH expansion (20), that is, select a cluster of size*N*≤ 4 and the excitation level of each constituent excitor. We use the even selection scheme of Scott*et al.*^{44}in a walker-less representation (see Sec. V A).Sample the action of the Hamiltonian: with probability

*p*_{hver}, choose one of the 13 interaction vertices in Table II. This step is not independent of the former, and we use importance sampling (see Sec. V A).Sample the admissible contraction patterns: with probability

*p*_{cont}, choose a specific Kucharski–Bartlett sign sequence,^{60,61,63}see Table III, for example.Sample the index set to label internal lines. Given the number of internal hole and particle lines in the selected contraction, the probability associated with this step

*p*_{int}is computed combinatorially.Sample the index set to label external lines. As for the previous step, the probability

*p*_{ext}is also computed combinatorially.

Excitors . | Interaction . | Contraction . | Diagram . |
---|---|---|---|

+ −| + −| + + + −−− | + + −− | + −| + |− | |

+ −|−| + | |||

+ |−−| + | |||

−| + + |− | |||

+ |−| + − |

Excitors . | Interaction . | Contraction . | Diagram . |
---|---|---|---|

+ −| + −| + + + −−− | + + −− | + −| + |− | |

+ −|−| + | |||

+ |−−| + | |||

−| + + |− | |||

+ |−| + − |

With this process, we are able to obtain a given diagram with probability *p*_{diagram} = *p*_{select}*p*_{hver}*p*_{cont}*p*_{int}*p*_{ext}, and in each Monte Carlo step, the diagram is sampled *p*_{diagram} × *n*_{a} times.

We need further minor modifications to the deterministic enumeration of diagrams to ensure that the CC vector function is evaluated correctly. Our algorithm singles out specific diagrams, where all lines, internal and external, are explicitly labeled. This procedure identifies a single cluster amplitude *t*_{k} to which the selected diagram will contribute without having to sum over internal lines. Permutational symmetries will, thus, have to be handled differently such that our probability distributions are properly normalized. Sums of the form $12\u2211i,j$ have to be replaced with $\u2211i>j+12\delta ij$ to ensure that there is only a single way to select diagrams related by the following:

the antipermutation of indices stemming from antisymmetrized interaction vertices,

the antipermutation of hole or particle indices stemming from excitor vertices, and

the commutation of excitors.

For the first two cases, terms with *i* = *j* would vanish when summing over equivalent indices. In the last case, the diagonal case *i* = *j* indicates additional symmetries of the resulting diagram. In our stochastic diagram enumeration, each pair of equivalent internal or external lines will not require a $12$ factor. Moreover, upon selection, a well-defined ordering of excitors is established, which removes the need for $12$ factors in diagrams where excitors of the same rank appear. These modifications to the deterministic evaluation rules ensure a unique selection of a contraction pattern.^{60,61,63} The action of permutation operators for inequivalent external lines is subsumed into the permutation of hole and particle indices needed to store the result of the sampling in antisymmetrized ordering, which provides the appropriate parity factor (−1)^{σ}. With these considerations, a contribution to *t*_{k} is computed as

and the sampling algorithm is designed such that *p*_{diagram} = |$w$_{diagram}|. Ultimately, our aim is to achieve importance sampling between contributions (see Sec. V C).

### C. Preconditioning

While the imaginary-time propagation discussed in Sec. III will generally be used within our work, we could also make use of arbitrary preconditioners. Apart from the identity, we implemented two additional options: the diagonal of the Fock operator (diagrams 1 and 2 in Table II) and the diagonal of the full Hamiltonian (diagrams 1, 2, 4, 5, and 6 in Table II). The connected portions of these vertices do not modify excitors when applied. These preconditioners are the iteration-independent approximations to the Jacobian discussed in Sec. III B and strike a balance between the computational complexity and improvement of convergence. The diagonal Fock preconditioner is ubiquitously implemented in deterministic CC approaches.^{59} Since all relevant quantities are precomputed, the values of either preconditioner can be evaluated with a cost independent of the system size, unlike the implementation of the similar approach within FCIQMC and CCMC.^{79,80}

The portion of the Hamiltonian used for preconditioning can then be applied via a straightforward rescaling of the original cluster amplitudes by a factor of 1 − *δτ*. The remainder of the Hamiltonian is applied explicitly, as in the imaginary-time propagation, before rescaling by the preconditioner.

We will not investigate the benefits of preconditioning here but wanted to observe that the diagrammatic formalism lends itself to a straightforward implementation of a range of preconditioners without introducing additional computational costs scaling with system size. This results from the use of the connected portions of all preconditioners, unlike previous stochastic approaches.^{79,80}

## V. $H\xaf$, IMPORTANCE SAMPLING, AND EVEN SELECTION

The even selection algorithm was proposed by Scott and Thom^{44} to improve the sampling of the action of the CC wave operator on the reference determinant. Even selection was specifically designed to alleviate calculation instabilities due to the occurrence of large particle blooms. Sampling proceeds via the selection of clusters containing a specific number of excitors of each rank, termed a *combination*, separately. In this section, we summarize the adaptation of even selection in a walker-less context. We then illustrate the need for the importance sampling of $(H\u2061expT)c$ and describe the strategy implemented in diagCCMC.

### A. Walker-less even selection

The original even selection algorithm defined the probability of selecting a particular set of excitors *e* from the combination *c* of size *s* as

a series of conditional probabilities. We re-express the selection probability as follows:

to simplify considerations to follow. We also assume the unit intermediate normalization.

We adopt the same notation used in Ref. 44 and denote the number of excitors of rank *j* within combination *c* as *η*_{cj}. *L*_{j} is the sum of absolute magnitudes of cluster amplitudes at rank *j*, that is, the *ℓ*_{1}-norm of *T*_{j}.

In keeping with the original approach, *η*_{cj} denotes the number of excitors of rank *j* contained within combination *c*, *L*_{j} is the sum of cluster coefficient absolute magnitudes at rank *j*, and *n*_{a} is the number of sampling attempts to be made for that iteration.

In the walker-less representation, the amplitude of a given cluster is the product of cluster coefficients $w$_{e} = ∏_{p}*t*_{p}. The evaluation granularity is, by definition, equal to the absolute magnitude of the MC weight,

Even selection for all clusters requires that evaluation and representation granularities be the same: *γ* = Δ. As we also require $|we|pselect(e)$ to be an excitor-independent constant, we obtain

with normalization constant

where *n*_{combo} is the total number of combinations. The number of random samples to take within a calculation is, thus, obtained from the evaluation granularity given as input,

### B. Motivation for importance sampling

Each pairing of excitor combinations with Hamiltonian vertices can result in a different number of admissible contractions and, thus, fully indexed diagrams. It is non-trivial to ensure that |*x*_{diagram}| in (45) is in any sense comparable between the different pairings. The selection of a Hamiltonian interaction vertex is not independent of the selection of excitor combination: *p*_{hver} will be a probability conditional on *p*_{select}. This enables the use of *truncated excitation generation*, that is, the *a priori* exclusion of Hamiltonian–excitor pairings that will not be able to contribute to any stored amplitude.^{84} Furthermore, one could easily exclude any class of diagrams that we wish to evaluate with a different algorithm.

Clusters from different combinations can contribute to a set of allowed diagrams, whose number can undergo large variations, especially with varying system size. In a CC calculation to any order, let us consider two limiting cases in the sampling to clarify this statement. Assume that we selected a cluster from the combination $T14$ with no repeated excitors. The only fully connected diagrams stemming from such a cluster are of the form

as such

only one Hamiltonian interaction vertex is admissible:

with probability*p*_{hver}= 1.0.selecting a contraction pattern boils down to the choice of which two excitors from the four to be connected to the interaction vertex via hole-type lines. There are $42=6$ possible ways of doing so, which gives $pcont=16$.

being single excitations, each excitor has one particle and one hole line. Once the contraction pattern is set, there is only one choice to make per hole line in the diagram and each will be made with probability

*p*_{int}= 1.0.as all external indices are fully determined by the selected cluster and contraction, there is only a single possible choice of external indices, so

*p*_{ext}= 1.0.

Each cluster from a $T14$ combination can contribute to six valid diagrams, independent of the system size and truncation level. If all diagrams are selected without weighting, $pdiagram=16pselect$.

Now assume, instead, that we are sampling the action of a bare Hamiltonian vertex, that is, a cluster of size 0. There are only two admissible choices in such a case,

There is no contraction to decide upon and, hence, no internal indices to decide upon: *p*_{cont} = 1.0 = *p*_{int}. However, the amount of such terms to sample varies with system size. For an *N*-electron system with *V* virtual orbitals, there are *O*(*N*) possible external hole and *O*(*V*) possible external particle indices, respectively. For a one-body interaction vertex, there are *O*(*NV*) admissible labelings of the diagram, while for the two-body case, there are *O*(*N*^{2}*V*^{2}) such labelings.

These examples show how sampling different classes of diagrams will require a varying number of attempts *n*_{a} in each MC step. The original even selection prescription will need to be modified to accommodate this, or else calculations will rapidly become untenably expensive.

### C. Importance sampling of clusters and Hamiltonian vertices

To compensate for the difference in diagram generation between different combinations, we will now modify our sampling to include combination-dependent constants *α*_{c},

such that

Each combination will be evaluated to a different granularity *γ*_{c}, defined as in (48),

where we used the value of *n*_{a} given in (51).

With this modification, we now require an approximately constant contribution to the integrals $\u2009D0|\tau ^k\u2020H\xaf|D0\u2009$. From Sec. IV B, we know the diagram amplitude,

must then be a constant Ξ. This constant is a product of interaction vertex-specific and diagram-specific factors,

where *ζ*_{ci} is the probability of choosing the *i*th Hamiltonian interaction vertex (Table II) when sampling excitor combination *c* and

The *ζ*_{ci} probabilities are normalized, ∑_{i}*ζ*_{ci} = 1.

Unfortunately, having *all* contributions be of constant magnitude is not a tenable aim. We can, instead, aim to have either the average or maximum contribution from the sampling of each combination with each admissible interaction vertex being a constant value. We, thus, require

where $Bci\u22c6$ can be either the maximum or the average value for the diagram-specific term. Considering the maximum contribution, we then obtain

and similarly, for the average contribution involving $Bciave$. Utilizing these expressions requires on-the-fly accumulation of the values $Bci\u22c6$ during a calculation. Ξ will be the value of the maximum (average) contribution for all diagrams using admissible excitor combinations and interaction vertices before scaling by the time step *δτ*. The latter can be set according to

where the parameter *ϵ* was introduced in Sec. IV A and *χ*_{c} is calculated using $Bcimax$. Using $Bciave$ would set the time step such that the average contribution magnitude was *ϵ*.

The accumulated values for the diagram-specific terms $Bci\u22c6$ guarantee that this procedure uses information from all valid diagrams ever generated, whether spawning attempts were successful or not. As such, it can converge to a stable importance sampling of the wavefunction with minimal user input. To avoid certain classes of diagrams being entirely neglected as a result of a single negligible diagram, the values *γ*_{ci} must be fixed until sufficient information has been collected. This is achieved by requiring 100 random diagrams of each type be selected during the calculation before starting the importance sampling procedure.

### D. Truncated excitation generation and computational scaling

The application of truncated excitation generation, as introduced in Sec. V B, follows naturally in this algorithm and provides considerable computational benefits.

Most notably, while the time step is still expected to decrease as $ON\u22124$ due to the sampling of the bare Hamiltonian, *H*_{N}, the cost of sampling larger clusters will rapidly fall. This is due to the number of possible diagrams for higher excitation level clusters having a scaling lower than $ON4$, and so *α*_{c} correspondingly falling with system size to compensate.

As an example, selected clusters of excitation level *l* + 2 only have $O1$ possible diagrams due to the restriction to connected diagrams contributing to the excitation level *l* or below. The first example discussed in Sec. V B corresponds to this case in CCSD. To compensate for this, *α*_{c} will fall as $ON\u22124$ for this combination as fewer samples are required. This gives the overall scaling of sampling these higher terms as $ON2l+4$ for this case. For clusters of lower excitation level, sampling the cluster expansion will have reduced computational expense, which will, however, be offset by a corresponding increase in the number of connected diagrams these clusters can contribute to. This differs from prior unlinked approaches, where all Hamiltonian vertices are sampled for every cluster, so every cluster effectively contributes to $ON4$ (possibly disconnected) diagrams. This results in an additional factor of $ON4$ in the computational scaling of unlinked approaches compared to diagCCMC.

As such, the overall cost of a diagCCMC calculation will scale as $ON2l+4$ for a fixed errorbar per electron in the absence of any simplifying properties of the cluster amplitudes. This gives an asymptotic scaling of $ON8$ for CCSD with a fixed errorbar per electron. The asymptotic scaling matches that of deterministic unfactorized CC theory, unlike prior stochastic CC approaches.^{44} If a fixed errorbar were instead required, this would lead to a scaling of $ON9$ for CCSD and $ON2l+5$ in general.

In previous work,^{47} we demonstrated that for systems of noninteracting replicas, the memory cost for diagCCMC is proportional to the number of replicas, regardless of the truncation level of the theory, and that this extends to interacting systems provided cluster amplitudes decay sufficiently rapidly with distance. Here, we extend this consideration to show that the computational effort will asymptotically scale as *at most* $ON4$ in the presence of locality, regardless of the truncation level, for a fixed errorbar per electron.

Our only requirement is that cluster amplitudes be *homogeneous*: the absolute magnitude of all $T^m$ is proportional to some measure of system size, *N*, as is expected to be the case when locality is present, for instance, in insulators over reasonable length scales. This means that we can sample the contribution of a cluster of *n* excitors to a fixed granularity using only $ONn$ random samples. The linked diagram theorem then restricts us to clusters containing at most 4 excitors and ensures that the number of “free,” external coupling indices on the Hamiltonian coupling vertex that must be sampled is at most 4 − *n*. Sampling each external index will require $ON$ additional samples of that diagram, so the maximal scaling to sample a cluster of size *n*(≤4) is $ONnON4\u2212n=ON4$. This scaling will be reflected in the number of attempts per unit of imaginary time, *n*_{a}*δτ*^{−1}, and is *independent* of the chosen truncation level in the CC hierarchy. In the case of noninteracting replicas, the computational effort per replica, $na\delta \tau \u22121nreplicas\u22121$, is expected to scale as $ON3$, again independent of the truncation level.

It is important to note here a benefit of the stochastic approach that the sparsity and structure within the cluster amplitudes are exploited automatically, but only if they are present. In the absence of such a structure, the result will still be equivalent to a conventional CC calculation. This is different from local deterministic approaches, which by necessity neglect nonlocal contributions according to some categorization. If locality is not present to the appropriate degree, such approaches will obtain a different answer from a conventional CC calculation on the same system. While this may seem technical, being able to exploit locality while still estimating the exact CC energy is a considerable benefit.

## VI. DATA STRUCTURES AND ALGORITHMS

Our diagCCMC algorithm is implemented in a standalone package. The package is written in the Python programming language, which allows fast prototyping and experimentation.

A *sparse stochastic array* is the basic data structure. This is used to store the compressed representation of cluster operators *T*_{m} of any rank. We use a Python *dictionary*, an associative key-value array implemented as a hash table.^{85} The excitation indices are the keys,

with cluster amplitudes stored as floating-point numbers. The key is arranged as a 2-tuple of *k*-tuples: each *k*-tuple representing the hole and particle indices, respectively. This sparse stochastic array is designed to (a) be exchange symmetry-aware, (b) perform stochastic rounding to a preset threshold, Δ, and (c) enable the importance sampling of its elements. The keys in the dictionary are sorted in *ascending* order, both in the hole and particle tuples, which ensures no storage redundancy. Upon insertion in the data structure, the supplied index is first sorted in ascending order, while the supplied value is multiplied by the corresponding parity phase factor, (−1)^{σ}. The new value is inserted after stochastic rounding,

and finally, the *ℓ*_{1} norm of the *T*_{m} cluster operator is updated accumulating the new value: $\u2225$*T*_{m}$\u2225$_{1} ← abs(value). Similarly, upon lookup, the supplied index is first sorted and then looked up into the dictionary. If present, the returned value accounts for the parity phase factor. Iteration and various vector-like operations can be implemented on top of this storage object. Compressed sparse matrix formats could replace the hash table. However, the algorithm is not GEMM-driven,^{62} and compressed sparse representation would be suboptimal for importance sampling. Insertion and retrieval are the essential operations in our algorithm, and they can be performed on a hash table with $O1$ complexity in the average case. Furthermore, data needed for importance sampling can be accumulated upon insertion into the hash table, eliminating the need for complete traversals of the data structure.

The importance sampling of the data in the sparse stochastic array requires building the corresponding sampling distribution, either using a cumulative magnitude array or the alias method^{86,87} with the Vose sampler.^{88} If *n* is the number of elements in the discrete set to sample, the alias method constructs the distribution in $On$, while sampling is achieved in $O1$.

The cluster operator *T* is a collection of sparse stochastic arrays, indexed on the rank of its constituent excitations. Various vector-like algebraic operations can be implemented for this data structure, e.g., the calculation of the *ℓ*_{1} and *ℓ*_{2} norms of *T*, in the form of reductions over the sparse stochastic arrays of the component operators.

Finally, we handle the importance sampling described in Sec. V C in a separate data structure: the sampling store. This data structure computes the probabilities for the selection of excitor combinations, *p*_{combo},^{44} and for the selection of a combination–Hamiltonian vertex pairing in diagram generation, *p*_{hver}, the latter being conditional on the former. The sampling store is also responsible for accumulating data needed to update the sampling distributions, which is further used to determine the time step for the next iteration.

We show a high-level overview of diagCCMC in Algorithm 1. A diagCCMC calculation requires as input the following:

Molecular integrals in an orthogonal molecular orbital (MO) basis. These are expected in FCIDUMP format.

^{89}A truncation level for the cluster operator.

The number of steps,

*N*_{QMC}, to perform.The stochastic granularity, Δ, which defaults to 10

^{−4}in our implementation.The time step,

*δτ*, which defaults to 0.01 in our implementation.The preconditioner, which defaults to the identity in our implementation.

1: procedure diagCCMC(H, t^{[0]}, N_{QMC}) |

2: Initialize sparse stochastic storage for T |

3: Bootstrap importance sampling of T |

4: for n < N_{QMC} do |

5: Update sampling distribution for T^{[n]} ⊳ see Sec. V A |

6: Stochastic propagation ⊳ see Algorithm 2 |

7: Compute the energy deterministically: |

$\Delta ECC=\u2211aitaifia+14\u2211ijabtabijg\xafijab+12\u2211ijabtaitbjg\xafijab$ |

8: Update T^{[n+1]} ← ω^{[n]} (“annihilation”) ⊳ see Sec. IV C |

9: end for |

10: end procedure |

1: procedure diagCCMC(H, t^{[0]}, N_{QMC}) |

2: Initialize sparse stochastic storage for T |

3: Bootstrap importance sampling of T |

4: for n < N_{QMC} do |

5: Update sampling distribution for T^{[n]} ⊳ see Sec. V A |

6: Stochastic propagation ⊳ see Algorithm 2 |

7: Compute the energy deterministically: |

$\Delta ECC=\u2211aitaifia+14\u2211ijabtabijg\xafijab+12\u2211ijabtaitbjg\xafijab$ |

8: Update T^{[n+1]} ← ω^{[n]} (“annihilation”) ⊳ see Sec. IV C |

9: end for |

10: end procedure |

The CC wavefunction is initialized using the MP1 amplitudes, easily computed from the provided MO basis integrals. Both the representation of the CC wavefunction at the current time step and its update (the residual) are represented as stochastic sparse arrays, but only the former will be used for sampling purposes. We initialize importance sampling with a short trial run to sample all possible pairings of excitor combinations and Hamiltonian vertices.

Each MC cycle starts by updating the sampling distribution for the cluster operator, and we leverage information about selection probabilities for each Hamiltonian vertex with each excitor combination, accumulated from diagram generation attempts in the previous cycle, to define an importance sampling weight for each combination of excitors.

The stochastic propagation step performs *n*_{a} attempts at sampling the CC residual, *ω*^{[n]}, constructing diagrams on the fly and is schematically described in Algorithm 2. Our current implementation features a process-based parallelization of this step. Given *p* helper processes, each available helper performs $\u2308nap\u2309$ attempts and stores their results in a queue.^{90} These are aggregated by the main process, which also takes care of cleaning up the queue before entering the next QMC step in the simulation. For each attempt, we first obtain a random cluster and accumulate its relevant contributions to estimators, e.g., the energy. Given the cluster, we sample its diagonal and off-diagonal actions, which we term “death” and “spawn” attempts, respectively, in analogy with existing Fock-space QMC terminology. The “death” step consists of the exact evaluation of all components of the Hamiltonian, which results in contributions to the same excitor, as was originally sampled, provided these have not been incorporated into the preconditioner, as discussed in Sec. IV C. This consists of contributions from vertices 1, 2, 4, 5, and 6 in Table II. We may sample exclusion-principle violating (EPV) diagrams^{64} within “death.” These would cancel out exactly in a deterministic evaluation and are, thus, not stored into the sparse representation of the cluster operator.

1: procedure Diagrams(H, T^{[n]}, n_{a}) |

2: for i < n_{a} do |

3: Obtain random cluster t_{i}…t_{l} from T^{[n]} |

4: Accumulate contribution of selected cluster to estimators |

5: Sample diagonal action of cluster (“death”) |

6: if “death” diagram is not EPV then |

7: Apply preconditioning ⊳ see Sec. IV C |

8: Store in residual object after stochastic rounding |

9: end if |

10: Sample off-diagonal action of cluster (“spawn”) ⊳ see Algorithm 3 |

11: if “spawn” successful then |

12: Evaluate diagram |

13: Apply preconditioning ⊳ see Sec. IV C |

14: Store in residual object after stochastic rounding |

15: end if |

16: Accumulate “spawn” statistics for importance sampling |

17: end for |

18: end procedure |

1: procedure Diagrams(H, T^{[n]}, n_{a}) |

2: for i < n_{a} do |

3: Obtain random cluster t_{i}…t_{l} from T^{[n]} |

4: Accumulate contribution of selected cluster to estimators |

5: Sample diagonal action of cluster (“death”) |

6: if “death” diagram is not EPV then |

7: Apply preconditioning ⊳ see Sec. IV C |

8: Store in residual object after stochastic rounding |

9: end if |

10: Sample off-diagonal action of cluster (“spawn”) ⊳ see Algorithm 3 |

11: if “spawn” successful then |

12: Evaluate diagram |

13: Apply preconditioning ⊳ see Sec. IV C |

14: Store in residual object after stochastic rounding |

15: end if |

16: Accumulate “spawn” statistics for importance sampling |

17: end for |

18: end procedure |

During the “spawn” attempts, on-the-fly diagram generation will occur, as described in Algorithm 3. Note that this algorithm is short-circuiting: an unsuccessful random selection at any step will return an empty diagram and result in the accumulation of a failed attempt.

1: procedure diagram generation(t_{i}…t_{l}, sampling distribution, |D_{0}⟩) |

2: Select Hamiltonian vertex |

3: Select contraction pattern |

4: Select internal indices |

5: Select external indices ⊳ spin-conservation constraints are enforced |

6: end procedure |

1: procedure diagram generation(t_{i}…t_{l}, sampling distribution, |D_{0}⟩) |

2: Select Hamiltonian vertex |

3: Select contraction pattern |

4: Select internal indices |

5: Select external indices ⊳ spin-conservation constraints are enforced |

6: end procedure |

Once a diagram has been selected, its evaluation is done *deterministically* by applying the algebraic interpretation rules with modifications described in Sec. IV B. The energy is also evaluated deterministically, but note that a stochastic estimator can also be built during “death” and “spawning” steps. Finally, the cluster operator is updated before moving on to the next MC cycle, taking into account the preconditioning of the residual (see Sec. IV C).

## VII. NUMERICAL EXAMPLES

To demonstrate the retention of the favorable properties of our approach when applied to higher excitation levels in systems of multireference character, we consider calculations including up to quadruple excitations upon H_{4}, in a square of side length 1.5 Å. At this geometry, two restricted Hartree–Fock (RHF) solutions are degenerate.^{91} Any single configuration provides only a poor representation of the system, while coupled cluster with up to quadruple substitutions (CCSDTQ) is equivalent to FCI. Each truncation level has a clearly identifiable difference in energy, which can be resolved despite stochastic error. We also consider noninteracting replicas of this system such that the wavefunction will become a product.

This system, while small, is by no means trivial for a projection-based approach. Its multireference nature and small gap between the ground and excited states necessitates projection through over 50 units of imaginary time to converge to the ground state. The imaginary-time propagation was not preconditioned. While preconditioning can afford taking larger time steps,^{79} we observed that it can lead to an unstable propagation in this particularly challenging case.

The resultant energies are shown in Table IV, demonstrating the size-extensivity of the energies, within stochastic errorbars, for multiple noninteracting replicas.

n_{replicas}
. | CCSD . | CCSDT . | CCSDTQ . |
---|---|---|---|

1 | −0.1678(2) | −0.1701(2) | −0.1624(2) |

2 | −0.3353(3) | −0.3398(3) | −0.3242(8) |

3 | −0.5022(8) | −0.5129(4) | a |

4 | −0.6688(7) | a | a |

n_{replicas}
. | CCSD . | CCSDT . | CCSDTQ . |
---|---|---|---|

1 | −0.1678(2) | −0.1701(2) | −0.1624(2) |

2 | −0.3353(3) | −0.3398(3) | −0.3242(8) |

3 | −0.5022(8) | −0.5129(4) | a |

4 | −0.6688(7) | a | a |

^{a}

Values not computed due to computational constraints.

The memory cost per replica, as measured by the *n*_{states} metric, is shown in Fig. 1. The $ON$ asymptotic scaling for noninteracting systems was already observed in Ref. 47 for noninteracting Be replica systems and is confirmed here also for the H_{4} systems. As discussed in Sec. V D, this is an intrinsic property of the diagCCMC algorithm, and our results confirm that it is preserved even in cases where the description of the electronic structure is challenging.

The *n*_{a}*δτ*^{−1} metric measures, instead, the computational requirements per replica and is shown in Fig. 2. From the discussion in Sec. V D, this is expected to scale cubically with *n*_{replicas}.

The results of a log-linear regression analysis of the observed *n*_{a}*δτ*^{−1} against *n*_{replicas} are reported in Table V. We also include the same analysis on similar data obtained for Be in Ref. 47. All observed scaling exponents are *below* the expected maximum scaling of $ON3$. This is not a surprising result: we are not in the asymptotic large-system limit, and the highest-scaling contributions will not necessarily dominate the computational cost.

System . | Truncation . | ln c
. | α
. |
---|---|---|---|

H_{4} | CCSD | 12.73 | 2.39 ± 0.06 |

CCSDT | 13.98 | 2.53 ± 0.06 | |

CCSDTQ | 14.70 | 2.54 ± 0.00 | |

Be | CCSD | 12.76 | 2.75 ± 0.04 |

CCSDT | 13.72 | 2.85 ± 0.03 | |

CCSDTQ | 14.00 | 2.88 ± 0.02 |

System . | Truncation . | ln c
. | α
. |
---|---|---|---|

H_{4} | CCSD | 12.73 | 2.39 ± 0.06 |

CCSDT | 13.98 | 2.53 ± 0.06 | |

CCSDTQ | 14.70 | 2.54 ± 0.00 | |

Be | CCSD | 12.76 | 2.75 ± 0.04 |

CCSDT | 13.72 | 2.85 ± 0.03 | |

CCSDTQ | 14.00 | 2.88 ± 0.02 |

Finally, we also present correlation energies for the symmetric double dissociation of water in a 6-31G basis (see Table VI). In this system, different correlation regimes are in effect along the potential energy surface, and it is, thus, one of the standard benchmarks for correlated methods.^{95} diagCCMC manages to reproduce values obtained with deterministic approaches at a range of truncation levels along the binding curve. As is the case for the H_{4} calculations presented earlier, the multireference nature of this problem at certain stretched geometries did not allow some of these more challenging calculations to complete. The diagCCMC algorithm is a projection method; despite its intrinsic computational benefits, it still struggles when applied to problems with an ill-defined single reference determinant and/or characterized by a small gap.

R_{O–H}/R_{e}
. | CCSD . | CCSDT . | CCSDTQ . |
---|---|---|---|

1.0 | −0.136 58(5) | −0.137 94(8) | −0.1380(2) |

1.5 | −0.194 3(1) | −0.199 7(4) | −0.2008(5) |

2.0 | −0.290 6(2) | −0.303 2(3) | a |

3.0 | −0.531 5(5)^{b} | −0.550 3(6) | a |

R_{O–H}/R_{e}
. | CCSD . | CCSDT . | CCSDTQ . |
---|---|---|---|

1.0 | −0.136 58(5) | −0.137 94(8) | −0.1380(2) |

1.5 | −0.194 3(1) | −0.199 7(4) | −0.2008(5) |

2.0 | −0.290 6(2) | −0.303 2(3) | a |

3.0 | −0.531 5(5)^{b} | −0.550 3(6) | a |

^{a}

Values not computed due to computational constraints.

^{b}

The calculation initially converges to the “canonical” CCSD solution, before decaying to a different solution with Δ*E*_{CCSD} = −0.5192(8) *E*h after 80 a.u. of imaginary time. This property of the imaginary-time propagation has been noted before.^{47}

## VIII. CONCLUSIONS

We have discussed in detail our new approach for a stochastic solution of the linked coupled cluster equations and demonstrated the resulting reduction in computational and memory costs with system size in the presence of locality. The diagrammatic coupled cluster Monte Carlo algorithm uses the rigorously order-by-order and term-by-term size-extensive linked formulation of coupled cluster theory and ensures the efficient sampling of it by on-the-fly construction of coupled cluster diagrams. The algorithm is made possible by two novel insights: (a) the stochastic compression of multidimensional vectors can be achieved *without* invoking walkers and populations and (b) the CC vector function is an integral, expressible as a finite sum of diagrams, that can be computed by Monte Carlo sampling. Both insights lead to an algorithm that clarifies how randomness and sampling can be effectively leveraged to solve the high-dimensional nonlinear CC problem with a lower memory footprint and a more favorable operation count. The use of the well-known diagrammatic theoretical framework clarifies few points of the CCMC methodology, such as the relation of imaginary-time evolution to iterative solvers and the use of preconditioning.^{79,80} The representation and evaluation granularity parameters characterize the diagrammatic approach on a spectrum between fully deterministic and fully stochastic; the same theoretical framework can accommodate different numerical approaches. This paves the way for further cross-adaptation of deterministic and Monte Carlo techniques. The approach we have presented uses a naïve enumeration of diagrams: the residual is evaluated in its unfactorized nonlinear form,^{60,61} rather than the more computationally advantageous factorized quasilinear form.^{66,96} As such, it exhibits a high operation count, theoretically higher than that of its deterministic counterpart for a given excitation level if all cluster amplitudes are homogeneous. Deterministically, one would rather implement a quasilinear factorization with an optimal space–time trade-off.^{66} We are currently investigating this approach. The use of the diagrammatic expansion also paves the way for a rigorous derivation of a semistochastic CC method, where important residual components are resolved on the fly to machine accuracy, with the remainder only resolved to a preset stochastic representation granularity.

## ACKNOWLEDGMENTS

C.J.C.S. is grateful to Dr. George Booth for his current role as Postdoctoral Research Associate under Grant Agreement No. 759063 of the European Union’s Horizon 2020 research and innovation program. R.D.R. acknowledges partial support by the Research Council of Norway through its Centres of Excellence scheme, Project No. 262695, and through its Mobility Grant scheme, project number 261873. T.D.C. was supported by Grant No.CHE-1900420 from the U.S. National Science Foundation. A.J.W.T. is grateful to the Royal Society for a University Research Fellowship under Grant Nos. UF110161 and UF160398. R.D.R. thanks Simen Kvaal (University of Oslo) for pointing out Ref. 76.

We used the goldstone LATEX package to draw the coupled cluster diagrams. The package is available on https://github.com/avcopan/styfiles

## DATA AVAILABILITY

All data and the code used for generation and analysis are freely available at doi.org/10.5281/zenodo.3997299.

### APPENDIX: DERIVATIVE OF THE EXPONENTIAL OF A PARAMETER-DEPENDENT OPERATOR

Consider an operator $\xd4$ dependent on a parameter *λ*, its derivative with respect to *λ* can be obtained as^{97}

since to first order in *δ* one has $\xd4(\lambda \u2032+\delta )=\xd4(\lambda \u2032)+\xd4\u0307(\lambda \u2032)\delta $. The differential $\u2009dexp\xd4(\lambda \u2032)=exp\xd4(\lambda \u2032)+\xd4\u0307(\lambda \u2032)\delta \u2212exp\xd4(\lambda \u2032)$ can be recast as a BCH series. Let us drop the *λ*′ dependence and rewrite the differential as

We can calculate the *z*-derivative as

where in the last step we dropped *O*(*δ*^{2}) terms. We then expand the last term in a BCH series,

We exchange summation and integration orders and perform the *z*-integration to obtain

such that the *λ*-derivative is

## REFERENCES

Let us note the existence of the diagrammatic MC (DiagMC) method in the quantum many-body literature. Both DiagMC and diagCCMC deal with integral equations by sampling in diagram space, but the diagrams that are sampled are markedly different: quantum statistics models, with denumerable *infinitely* many diagrams, and CC wavefunctions, with *finitely* many diagrams. In addition, the *divergent* series might arise in DiagMC requiring the stochastic realization to handle the resummation. These differences lead to quite distinct approaches to the sampling of terms. Despite the similar names, the two techniques have fairly little in common.

For a single-determinant reference function, one has $\gamma r1s1=\delta r1s1$ and $\gamma r1r2s1s2=\delta r1s1\delta r2s2\u2212\delta r1s2\delta r2s1$.

This is the zeroth-order Hamiltonian in the Møller–Plesset (MP) partitioning.