We explore the use of the stochastic resolution-of-the-identity (sRI) with the phaseless auxiliary-field quantum Monte Carlo (ph-AFQMC) method. sRI is combined with four existing local energy evaluation strategies in ph-AFQMC, namely, (1) the half-rotated electron repulsion integral tensor (HR), (2) Cholesky decomposition (CD), (3) tensor hypercontraction (THC), or (4) low-rank factorization (LR). We demonstrate that HR–sRI achieves no scaling reduction, CD–sRI scales as $O(N3)$, and THC–sRI and LR–sRI scale as $O(N2)$, albeit with a potentially large prefactor. Furthermore, the walker-specific extra memory requirement in CD is reduced from $O(N3)$ to $O(N2)$ with sRI, while sRI-based THC and LR algorithms lead to a reduction from $O(N2)$ extra memory to $O(N)$. Based on numerical results for one-dimensional hydrogen chains and water clusters, we demonstrated that, along with the use of a variance reduction technique, CD–sRI achieves cubic-scaling *without overhead*. In particular, we find that for the systems studied, the observed scaling of standard CD is $O(N3\u20134)$, while for CD–sRI, it is reduced to $O(N2\u20133)$. Once a memory bottleneck is reached, we expect THC–sRI and LR–sRI to be preferred methods due to their quadratic-scaling memory requirements and their quadratic-scaling of the local energy evaluation (with a potentially large prefactor). The theoretical framework developed here should facilitate large-scale ph-AFQMC applications that were previously difficult or impossible to carry out with standard computational resources.

## I. INTRODUCTION

The accurate *ab initio* simulation of the ground state properties of molecules and solids is essential for the understanding of major swaths of chemistry and physics. All known techniques for the reliable calculation of electronic structure are challenged by large systems containing strongly interacting electrons. Here, mean field approaches fail, and the exponential scaling of a brute force solution of the Schrödinger equation renders the description of all but the smallest systems impossible. A large array of powerful methods have been developed, which have been successfully employed in the study of large correlated systems.^{1–4} Among the many available electronic structure methodologies with which to attack such systems, quantum Monte Carlo (QMC) stands out as a unique tool for the calculation of ground state energies due to its attractive combination of scalability and accuracy.^{2}

While there are many flavors of QMC,^{2,5–7} our focus in this work centers on the auxiliary-field QMC (AFQMC) approach.^{8–10} AFQMC is a variant of projector QMC, which is formulated in a second-quantized determinant space. The unbiased version of AFQMC, often referred to as free-projection AFQMC, is exact, in principle, but has an exponential scaling with system size due to noise growth caused by the fermionic sign (or phase) problem.^{10} Zhang and co-workers developed the phaseless approximation to AFQMC (ph-AFQMC), which has become a practical and accurate tool for the *ab initio* simulation of molecules and materials. In ph-AFQMC, the phase of walker wavefunctions during imaginary-time propagation is constrained by the *a priori* chosen trial wavefunction, |Ψ_{T}⟩. While this approach is biased, it can yield answers that systematically approach the exact solution either by the release of the constraint^{11} or via the choice of progressively more sophisticated trial functions.^{12}

In standard AFQMC simulations, walker propagation scales as $O(OM2)$ where *O* is the number of electrons and *M* is the number of single-particle basis functions. Such a cubic-scaling propagation per sample makes ph-AFQMC attractive for general *ab initio* problems. While there have been many successful ph-AFQMC applications to *ab initio* problems,^{13–27} there are several remaining challenges that still need to be addressed before ph-AFQMC can become a universal tool for the study of large scale correlated electronic structure problems. The challenge that we address in this work is the steep cost of the local energy evaluation that is necessary for estimating the ph-AFQMC energy. The local energy of a walker with a wavefunction |*ψ*⟩ is given by

The cubic-scaling of the walker propagation mentioned above is asymptotically irrelevant in standard ph-AFQMC calculations because the local energy evaluation scales *quadratically* with system size without exploiting the sparsity or low-rank (LR) structure. This quartic complexity arises from the two-electron repulsion integral (ERI) tensor,

where {*ϕ*_{μ}} are the underlying single-particle basis functions. In ph-AFQMC, this integral is most commonly factorized via the Cholesky decomposition (CD) into

where **L** denotes a Cholesky vector. Even with CD, the evaluation of the local energy remains a quartic-scaling task in ph-AFQMC. For systems with translational symmetry, it is possible to work with plane waves and achieve cubic-scaling with^{15} or without^{21} fast Fourier transform. In this work, however, we focus on general basis functions that do not exploit such symmetry.

We mention two notable previous approaches that may be used to address this problem. The first is the tensor hypercontraction decomposition (THC)^{28–30} proposed by Martínez, Sherill, and co-workers to factorize the ERI tensor into

Malone *et al.* successfully applied the THC factorization of the ERI tensor to ph-AFQMC simulations and showed that the local energy evaluation can be brought down to $O(cTHC2OM2)$ where the value of *c*_{THC} was found to be ∼8 for accurate local energy evaluation.^{26} While the improved asymptotic scaling is satisfying, the resulting algorithm has a steep overhead such that the actual crossover between the conventional algorithm and the THC variant occurs for very large system sizes in practice.^{26,31}

An alternative approach has been proposed by Motta *et al.*^{32} In this formulation, the nested diagonalization of Cholesky vectors proposed by Peng and Kowalski^{33} is used to exploit the underlying low-rank structure of the ERI tensor. We refer to this factorization as the low-rank (LR) factorization. In the LR factorization, one writes

which arises from

This factorization achieves an asymptotically cubic-scaling local energy evaluation due to the fact that the number of terms in the summation over *α* in Eq. (6) is limited to $O(log\u2009N)$, where *N* is the system size. Again, due to a large overhead, this asymptotic scaling will generally not be achieved until the system reaches ≳1000 electrons.^{32}

The THC and LR factorization schemes enable a cubic-scaling local energy evaluation algorithm, but the overhead associated with both approaches is large enough that the actual efficiency crossover will generally not be observed for medium-sized molecules without obvious sparsity or low-rank structure. We note, however, that these algorithms offer quadratic scaling memory requirements, which results in a reduction of the usual cubic to quartic memory requirement of standard ph-AFQMC. In fact, such a reduction in the memory cost may be the biggest benefit gained from using the THC and LR factorizations. We will return to a discussion of the memory reduction afforded by these techniques before concluding.

Given the above facts, it is clear that there is a need for an *overhead-free* cubic-scaling algorithm, which can accelerate *ab initio* ph-AFQMC simulations for medium-sized systems. Based on the earlier work of Baer and Neuhauser,^{34} Neuhauser and co-workers developed the use of stochastic orbitals as a means of reducing scaling in a variety of electronic structure methodologies, including density functional theory (DFT),^{35–37} time-dependent DFT (TDDFT),^{38} second-order Møller–Plesset perturbation theory (MP2),^{39–41} second-order perturbative Green’s functions (GF2),^{42–44} the random phase approximation,^{45} GW,^{46} and the Bethe–Salpeter equation.^{46}

Drawing from this body of work, we explore the combined use of the stochastic resolution-of-the-identity (sRI) formulation of Takeshita *et al.*^{41} with ph-AFQMC. The central idea underlying sRI is a stochastic resolution of the Coulomb operator, which was first developed in Ref. 36. We will demonstrate that sRI can be naturally incorporated into ph-AFQMC and used to reduce the formal scaling of the local energy evaluation to cubic-scaling when combined with CD and to quadratic-scaling when combined with THC or LR factorization. Most notably, we will numerically demonstrate that the combined use of Cholesky decomposition and the stochastic resolution of the identity (CD–sRI) with a simple variance reduction (VR) technique allows for a cubic-scaling algorithm *without overhead*. Given this attractive feature, we expect the CD–sRI approach to become the standard method for the computation of such quantities for medium-sized systems and THC–sRI and LC–sRI formulations to be the preferred approach for very large ones.

This paper is organized as follows: (1) we give a brief review of the ph-AFQMC algorithm, (2) we analyze the formal scaling of the existing local energy evaluation strategies, (3) we examine scaling reduction in these strategies when combined with sRI, and (4) we present numerical examples (hydrogen chains and water clusters) to show speed-up and scaling reduction of the CD approach with sRI.

## II. THEORY

To set precise notation for our discussion of scaling, in the following, we use *O* to denote the number of occupied orbitals, *M* to denote the number of single-particle basis functions, *X* to denote the number of Cholesky vectors (or auxiliary basis functions), and *N* to denote the system size in general.

### A. Review of auxiliary-field quantum Monte Carlo

We briefly review the basic algorithmic details of AFQMC here and refer interested readers to Ref. 47 for a more extended review of modern AFQMC development. We start from an *ab initio* Hamiltonian *Ĥ* given by

where *Ĥ*_{1} and *Ĥ*_{2} are the one-body and two-body part of *Ĥ*, respectively. As with other projector QMC methods, AFQMC starts from

where *Ĥ* is the Hamiltonian, *τ* denotes imaginary time, |Ψ_{0}⟩ is the exact ground state, and |Φ_{0}⟩ can be any wavefunction with a non-zero overlap with the true ground state. A key feature of AFQMC is that one employs the Hubbard–Stratonovich transformation along with the Trotter approximation to simplify the many-body propagator in Eq. (8) and reduce the problem to that of a series of one-body problems in a fluctuating auxiliary field. With the Trotter approximation for a given time step Δ*τ*, the propagator is expressed as

The Hubbard–Stratonovich transformation allows for the rewriting of the two-body propagator as an integration over auxiliary fields **x**,

where the one-body operator $v^$ is obtained from

Equation (11) is usually achieved via the CD of the ERI tensor. Alternatively, density-fitting can be used. With this decomposition, the total propagator for given fields **x** and a given time step Δ*τ* is written as

By the virtue of the Thouless theorem,^{48,49} the application of this propagator to a single determinant remains a single determinant. This allows for an efficient walker propagation in AFQMC.

One can reduce statistical fluctuations greatly by employing a mean-field subtraction,^{47} which is closely related to the shifted contour technique developed by Rom and co-workers.^{50} The mean-field subtraction technique redefines the one-body and two-body using the expectation value of $v^P$ with a trial wavefunction,

We then define

which maintains *Ĥ* = $\u01241\u2032$ + $\u01242\u2032$. With this, the propagator in Eq. (12) uses $\u01241\u2032$ and $v^\u2212\u27e8v^\u27e9T$ instead of *Ĥ*_{1} and $v^$. This simple subtraction has been shown to be effective in reducing statistical fluctuations,^{47,50} and this is what is used throughout this work.

The imaginary-time equation-of-motion is dictated by the propagator in Eq. (12). With importance sampling via a trial wavefunction |Ψ_{T}⟩, we write the global AFQMC wavefunction at imaginary-time *τ* as

where $Nw$ is the number of walkers. Within the phaseless approximation,^{10} we update the *i*th walker weight and determinant via

where the force-bias, $x\xafi$, is defined as

The phaseless importance function is defined as

with the hybrid importance function given by

the overlap ratio of the *i*th walker *S*_{i} given by

and the phase *θ*_{i}(*τ*) given by

This completes our compact description of the ph-AFQMC algorithm.

### B. Existing strategies for the local energy evaluation

In practical AFQMC calculations, one usually targets the ground state energy estimated via

where *i* indexes the *i*th walker, $wi$ is the weight of the *i*th walker, and *E*_{L}[*ψ*_{i}] is the local energy of the *i*th walker, as defined in Eq. (1). Without any approximations, the two-body contribution to the local energy is, via Wick’s theorem, given by

where the Green’s function **G** is a walker-dependent quantity defined as

Here, **C**_{T} is the molecular orbital (MO) coefficient matrix for the trial wavefunction and Θ is defined as

with **C** being the MO coefficient matrix for the walker determinant |*ψ*⟩.

Without exploiting any structure in the ERI tensor, the local energy evaluation scales quadratically with system size. For instance, a standard way to evaluate Eq. (1) is to use the so-called “half-rotated” (HR) ERI tensor,^{14,26}

while storing (*ir*| *js*) in memory. This algorithm requires $O(O2M2)$ memory and scales as $O(O2M2)$. Each walker additionally requires $O(O2M)$ memory to save intermediates that appear when evaluating Eq. (28). Furthermore, the formation of (*ir*| *js*) at the beginning of the QMC run scales as $O(OM4)$, which is more expensive than the local energy evaluation, although it needs to be performed only once. Despite these relatively steep scaling behaviors, HR is perhaps the most widely used local energy algorithm when the memory cost is affordable. We refer this algorithm to as the “half-rotated” (HR) local energy evaluation.

Another standard evaluation procedure is to work with density-fitted integrals or Cholesky vectors directly. This may be achieved by using Eq. (3) within the local energy expression, yielding

which scales as $O(O2MX)$. In general, *X* is 4–5 times larger than *M*, and therefore, Eq. (29) is slower than Eq. (28). However, Eq. (29) requires only $O(OMX)$ memory storage, which makes it better suited for large-scale simulations. To save intermediates, each walker additionally requires $O(O2X)$ memory. We refer this algorithm to as the “Cholesky decomposition” (CD) local energy evaluation.

Finally, we note that some recent advances in integral factorization techniques have allowed for a cubic-scaling local energy evaluation with a quadratic memory storage.^{26} We first discuss the THC strategy^{28–30} invented by Sherill, Martínez and, co-workers, which was applied to AFQMC by Malone *et al.*^{26} Specifically, we use Eq. (4) within the local energy evaluation, yielding

All of the tensor contractions in Eq. (30) are simple matrix–matrix multiplications, leading to cubic scaling with system size. More precisely, the local energy calculation scales as $O(cTHC2OM2)$, where we define the THC rank to be *c*_{THC}*M*. The memory requirement is set by storing $\eta pP^$ and **M** that have only a quadratic number of non-zero values. Furthermore, each walker additionally requires $O(cTHC2M2)$ memory to save intermediates that naturally appear in Eq. (30). While these asymptotic properties are highly attractive, the THC algorithm has been shown to have a large prefactor compared to Eqs. (28) and (29) because *c*_{THC} can be quite large in practice.^{26} For example, *c*_{THC} was found to be ∼8 for the accurate calculation of the local energy of diamond within the double-zeta basis.

Similarly, for the LR factorization of Motta *et al.*,^{32} we use Eq. (6) with the AFQMC local energy expression, yielding

which scales as $O(nrOMX)$ where the rank *n*_{r} sets the upper limit on the summation over *α* and *β* and scales like log(*N*). Note that this approach also effectively achieves a cubic-scaling local energy evaluation algorithm. The memory requirement is $O(nrOX+nrMX)$, and each walker additionally requires $O(nrOX)$ to save intermediates. However, as with the use of THC, *n*_{r} is usually large, and therefore, for practical applications, useful cubic-scaling may not be observed,^{32} and only memory saving may be practically useful.

### C. Stochastic resolution-of-the-identity

We follow the explication of the stochastic resolution-of-the-identity (sRI) technique proposed by Takeshita *et al.*,^{41} which has been shown to lower the scaling of MP2 (RI-MP2)^{41} and the second order Green’s function method (RI-GF2).^{43,44} sRI is based on the simple mathematical observation that one may represent the Kronecker delta function with stochastic functions as

where *N*_{ξ} is the number of stochastic samples and *θ*^{ξ} is an sRI basis function whose entry is randomly chosen to be ±1. In practice, we will limit the number of samples to a finite number, which does not scale with system size. This restriction has been shown to be sufficient to achieve a fixed statistical error per particle in MP2 and GF2. The key feature of this approach is that it does not assume any structure (either sparsity or low-rank) in the underlying ERI tensor while still reducing the cost. Due to this fact, the overhead of this approach is almost negligible, especially when looking at size-intensive quantities for which ph-AFQMC applications are also well-suited.

We emphasize that the scaling of a QMC algorithm should not be discussed without the assessment of the underlying statistical error. For instance, one may argue that propagation within ph-AFQMC scales cubically, but the number of samples associated with a fixed statistical error scales linearly with system size. Then, this makes the ph-AFQMC propagation scale quadratically with system size for a fixed statistical error. The usual cubic scaling of the ph-AFQMC propagation quoted in the literature^{21} is the scaling for a fixed statistical error *per particle*. Similarly, the lower scaling of sRI-MP2 and sRI-GF2 approaches cited above is only justified when considering observables for a fixed statistical error *per particle*. This makes the combination of these two methods very natural. We also note that it is still possible to obtain the total energies for a fixed statistical error in both the ph-AFQMC and sRI methods, albeit with an increased cost.

#### 1. The half-rotated sRI (HR–sRI) algorithm

We first re-write Eq. (28) with two Kronecker deltas,

Next, one may employ two sets of stochastic orbitals to represent these Kronecker deltas and write the local energy as

where

and

The formation of Eq. (35) is the bottleneck, scaling as $O(N\xi O2M2)$. The additional memory usage of each walker scales as $O(N\xi O2M)$. Thus, there is no reason to employ this algorithm as the asymptotic scaling is not improved by sRI and no memory saving is obtained.

#### 2. The Cholesky decomposition sRI (CD–sRI) algorithm

We next insert Eq. (32) into Eq. (29) and show that the scaling of the local energy evaluation is reduced from quartic to cubic. The combination of the sRI expression to the CD-based local energy algorithm is most natural in the auxiliary basis function space. Specifically,

We further define an intermediate tensor **R**,

The formation of **R** scales as $O(N\xi OMX)$, which is cubic-scaling for a size-intensive quantity *N*_{ξ}. The Coulomb term in Eq. (29) is thus expressed as

which scales as $O(OMX)$. The extra memory cost for storing walker-specific intermediates scales as $O(X)$. Using **R**,

where

The summation over *P* is the term now replaced by the summation over *ξ*, which lowers the scaling from $O(OMX)$ to $O(OMN\xi )$. The walker-specific extra memory cost is reduced from $O(X)$ to $O(N\xi )$. The exchange term in Eq. (29) is

and scales as $O(O2MX)$ with $O(O2X)$ walker-specific extra memory cost. This term now becomes

where

which scales as $O(O2MN\xi )$. Its walker-specific memory cost scales as $O(N\xi O2)$. This completes the demonstration of a scaling reduction of Eq. (29) to cubic scaling. In summary, the asymptotic scaling of the CD–sRI algorithm is $O(N\xi OMX+N\xi O2M)$, which includes the cost of the formation of **R** as well. The additional memory cost due to storing **R** and other intermediates scales as $O(N\xi OM+N\xi O2)$. This is also an improvement over the conventional CD algorithm.

#### 3. Tensor hypercontraction sRI (THC–sRI) algorithm

We next apply sRI to Eq. (30) and show that the overall scaling can be reduced to *quadratic*. With the use of two sets of sRI insertions, we write

with

where we define

With a proper set of contractions, one can show that this expression scales as $O(N\xi cTHCM2+N\xi OM+N\xi cTHC2M2)$. The additional walker memory usage due to storing intermediates comes at a cost of $O(N\xi cTHCM+N\xi 2cTHCM)$. THC–sRI improves both scaling and memory usage over the conventional THC algorithm.

#### 4. Low-rank factorization sRI (LR–sRI) algorithm

The same conclusion can be deduced for the LR–sRI algorithm by employing two sets of sRI expressions. With sRI, one may write Eq. (31) as

with

Here, we have defined

With an appropriate series of contractions, it can be shown that this local energy evaluation scales as $O(N\xi nrMX+N\xi OM+N\xi nrOX)$. The extra memory required for storing walker-specific intermediates is $O(N\xi nrX)$. As with THC–sRI, we observe an improvement using LR–sRI in both scaling and memory usage over the LR algorithm.

### D. Sampling of the stochastic resolution-of-the-identity and the global energy

The implementation of the sRI local energy estimator can be achieved with only minor modifications to the existing AFQMC programs. There are two stochastic samplings to complete the evaluation of the sRI global energy: one is the standard AFQMC walker local energy sampling for each walker in Eq. (24) and the other is the sRI stochastic orbital sampling in Eq. (32). A simple algorithm for this is as follows:

Minimizing *N*_{ξ} is critical in order to achieve the aforementioned lower scaling with as little overhead as possible. Fortunately, within the QMC setup, it is natural to set *N*_{ξ} = 1 per sample because the representation power of sampled sRI orbitals increases as stochastic samples are accumulated throughout the imaginary-time propagation. In other words, at each time step, each walker samples one sRI orbital, and the global estimator is obtained by stochastically averaging over many such samples. In the limit of infinite statistical sampling, this algorithm converges to the exact ph-AFQMC energy (i.e., the one without sRI).

The use of a variance reduction (VR) technique^{51,52} for sRI-ph-AFQMC has been found to be very effective. For any of the algorithms outlined in this work, we write

where *ψ*_{T} is the trial wavefunction. As suggested by this equation, we compute the “correlation” contribution to the local energy via sRI and the mean-field energy is reused. By expressing the energy in this form, the statistical fluctuations are found to be greatly reduced. This technique is often referred to as the “control variate” approach in Monte Carlo.^{51,52} We note that the efficacy of the control variate approach in this work will ultimately depend on the quality of the trial wavefunction as well. As long as the trial wavefunction is of good quality, the variance reduction will be effective.

### E. Summary of the sRI algorithms

We have discussed a total of four local energy evaluation strategies and how the sRI can reduce the scaling of three of them. We summarize this in Table I. With the sRI, we have formulated one cubic-scaling (CD–sRI) and two quadratic-scaling (THC–sRI and LR–sRI) algorithms.

. | HR . | CD . | THC . | LR . |
---|---|---|---|---|

Conventional | $O(O2M2)$ | $O(O2MX)$ vb | $O(cTHC2OM2)$ | $O(nrOMX)$ |

sRI | $O(N\xi O2M2)$ | $ON\xi OMX+N\xi O2M$ | $ON\xi cTHCM2+N\xi OM+N\xi cTHC2M2$ | $ON\xi nrMX+N\xi OM+N\xi nrOX$ |

Leading speedup | None | O/N_{ξ} | O/N_{ξ} | O/N_{ξ} |

. | HR . | CD . | THC . | LR . |
---|---|---|---|---|

Conventional | $O(O2M2)$ | $O(O2MX)$ vb | $O(cTHC2OM2)$ | $O(nrOMX)$ |

sRI | $O(N\xi O2M2)$ | $ON\xi OMX+N\xi O2M$ | $ON\xi cTHCM2+N\xi OM+N\xi cTHC2M2$ | $ON\xi nrMX+N\xi OM+N\xi nrOX$ |

Leading speedup | None | O/N_{ξ} | O/N_{ξ} | O/N_{ξ} |

For each algorithm, we have also discussed the additional memory requirement of each walker for storing intermediates. This is summarized in Table II. With sRI, we have one algorithm (CD–sRI) that requires additional quadratic-scaling memory and two algorithms that require additional linear-scaling (THC–sRI and LR–sRI) memory. These are all improvements over their conventional counterparts.

. | HR . | CD . | THC . | LR . |
---|---|---|---|---|

Conventional | $O(O2M)$ | $O(O2X)$ | $O(cTHC2M2)$ | $O(nrOX+nrMX)$ |

sRI | $O(N\xi O2M)$ | $O(N\xi OM+N\xi O2)$ | $ON\xi cTHCM+N\xi 2cTHCM$ | $O(N\xi nrX)$ |

Leading saving | None | OX/(N_{ξ}M) | $cTHCM/N\xi 2$ | M/N_{ξ} |

. | HR . | CD . | THC . | LR . |
---|---|---|---|---|

Conventional | $O(O2M)$ | $O(O2X)$ | $O(cTHC2M2)$ | $O(nrOX+nrMX)$ |

sRI | $O(N\xi O2M)$ | $O(N\xi OM+N\xi O2)$ | $ON\xi cTHCM+N\xi 2cTHCM$ | $O(N\xi nrX)$ |

Leading saving | None | OX/(N_{ξ}M) | $cTHCM/N\xi 2$ | M/N_{ξ} |

Among these algorithms, we choose CD–sRI to demonstrate the behavior of sRI-ph-AFQMC methods due to its simple implementation, reasonable leading speedup of *O*/*N*_{ξ}, and leading walker-specific memory saving of $O(OX/(N\xi M))$. The implementation of THC–sRI and LR–sRI is straightforward if THC and LR are already implemented. However, in PAUXY,^{53} these are currently unavailable, which led us to choosing CD–sRI for the purpose of demonstration. Moreover, we apply the sRI only to the exchange contribution (i.e., *E*_{K}) because the Coulomb contribution (i.e., *E*_{J}) can be evaluated at cubic cost within the standard algorithm. More specifically, we write (with the VR technique)

where *E*_{K,CD}[*ψ*_{T}] is evaluated only once in the beginning and used throughout the simulation. Since *E*_{K,CD–sRI} is evaluated twice (once for each of *ψ* and *ψ*_{T}), this adds an additional prefactor of 2. This prefactor increase is negligible compared to the variance reduction that we gain, as we shall see. The resulting algorithm is overall cubic-scaling *per sample*.

## III. COMPUTATIONAL DETAILS

All calculations were performed with an open-source python-based AFQMC program called PAUXY.^{53} The pair-branching algorithm was employed for the population control.^{54} The time step of 0.005 a.u. was used in all computations. We set *N*_{ξ} = 1 throughout all sRI calculations. Increasing *N*_{ξ} provided almost no improvement in statistical efficiency for systems considered in this work. This enables a speedup of the order of *O* (the number of electrons) compared to the conventional CD algorithm. We use the cc-pVDZ^{55} basis set in all examples presented below. We used restricted Hartree–Fock as a trial wavefunction throughout because systems considered in this work are dominated mainly by dynamical correlation. The truncation threshold for the Cholesky decomposition was set to 10^{−5}. A total of 160 walkers were used in every calculation presented below. QMCPACK^{56,57} was used to cross-check the total energies reported in this work along with the underlying population bias associated with the pair-branching algorithm. The reported total energies have negligible population bias and time step error.

## IV. RESULTS AND DISCUSSION

For numerical examples, we picked a series of one-dimensional (1D) hydrogen chains (H-chains) and water clusters, which clearly illustrates the following:

The VR technique discussed in Sec. II D is very effective. The VR was found to be so effective with restricted Hartree–Fock trial wavefunctions for systems considered here, which allowed for a very small number of stochastic orbitals to sample (i.e.,

*N*_{ξ}= 1).The overhead of using CD–sRI with VR for computing ph-AFQMC total energies as opposed to using CD is negligible, unlike for the THC and LR approaches.

CD–sRI is a cubic-scaling local energy algorithm if implemented correctly (i.e., with a proper tensor contraction ordering).

(1) and (2) are rather difficult to show analytically, and therefore, we provide numerical examples to support our claims. While (3) was formally shown in Sec. II C 2, we provide timing benchmarks to bolster support for this analysis. For 1D H-chains, the inter-hydrogen distance was fixed to be 1.6 bohr, and the water cluster geometries were taken from Ref. 31.

We first focus on the effect of the VR technique discussed in Eq. (52). In Fig. 1, we present the energy per H atom as a function of imaginary time. In both H_{10} and H_{40}, we observe the same trend. Without VR, CD–sRI fluctuates significantly around the CD results, while CD–sRI with VR behaves statistically very similarly to CD. These results indicate that the overhead of CD–sRI with VR is almost negligible, which becomes more evident when looking at statistically averaged total energies.

In Table III, we compare CD and CD–sRI with VR for the total energies of H-chains. The difference in total energies between two algorithms is statistically insignificant since they are all within error bars. This is not surprising since CD–sRI is not expected to introduce any biases into the estimator and should recover the CD result with enough samples. Furthermore, the magnitude of statistical error is similar when comparing the two algorithms for a similar number of statistical samples.

. | CD . | CD–sRI with VR . | ||
---|---|---|---|---|

N
. | ⟨E⟩
. | N_{sample}
. | ⟨E⟩
. | N_{sample}
. |

10 | −5.571(1) | 4 000 | −5.5696(9) | 4 000 |

20 | −11.0990(6) | 21 506 | −11.0985(6) | 22 191 |

40 | −22.1612(7) | 48 127 | −22.1610(9) | 27 942 |

80 | −44.2895(7) | 138 429 | −44.2890(8) | 91 234 |

. | CD . | CD–sRI with VR . | ||
---|---|---|---|---|

N
. | ⟨E⟩
. | N_{sample}
. | ⟨E⟩
. | N_{sample}
. |

10 | −5.571(1) | 4 000 | −5.5696(9) | 4 000 |

20 | −11.0990(6) | 21 506 | −11.0985(6) | 22 191 |

40 | −22.1612(7) | 48 127 | −22.1610(9) | 27 942 |

80 | −44.2895(7) | 138 429 | −44.2890(8) | 91 234 |

We have also carried out similar numerical experiments for water clusters, (H_{2}O)_{N}, with *N* = 2, 4, 8. A representative imaginary-time propagation trajectory is given in Fig. 2. The conclusions drawn from the H-chain results hold for these cases as well. With VR, the statistical fluctuations of CD–sRI are close enough to CD. This is true for all values of *N* = 2, 4, 8, and we expect this to hold for larger clusters. Finally, we present the absolute energies of (H_{2}O)_{N} computed via CD and CD–sRI in Table IV. The total energies from two different algorithms lie within their respective error bars. Similarly to Table III, two methods exhibit comparable magnitude of error bars, given a similar number of statistical samples. This strongly suggests that CD–sRI with VR performs as well as CD *without overhead*.

. | CD . | CD–sRI with VR . | ||
---|---|---|---|---|

N
. | ⟨E⟩
. | N_{sample}
. | ⟨E⟩
. | N_{sample}
. |

2 | −152.4987(9) | 16 000 | −152.4976(9) | 16 000 |

4 | −305.033(1) | 16 000 | −305.033(1) | 16 000 |

8 | −609.989(1) | 49 855 | −609.9905(9) | 69 596 |

. | CD . | CD–sRI with VR . | ||
---|---|---|---|---|

N
. | ⟨E⟩
. | N_{sample}
. | ⟨E⟩
. | N_{sample}
. |

2 | −152.4987(9) | 16 000 | −152.4976(9) | 16 000 |

4 | −305.033(1) | 16 000 | −305.033(1) | 16 000 |

8 | −609.989(1) | 49 855 | −609.9905(9) | 69 596 |

We finish our discussion with timing benchmarks of walker memory saving of H-chains and water clusters. For a consistent increase in the number of Cholesky vectors, we employ the density-fitting approximation instead. The auxiliary basis set used here is that of Weigend *et al.*^{58} The timing benchmark here was obtained from an Intel(R) Xeon(R) Platinum 8268 central processing unit (CPU) 2.90 GHz processor, and only a single thread was used. Furthermore, a single local energy evaluation of a single walker is considered for the purpose of demonstration. The timing results are reported in Fig. 3. Panel (a) shows the results for 1D H-chains up to *N* = 300. The observed scaling for CD is $O(N3.13)$, whereas it is $O(N2.12)$ for CD–sRI (estimated with *R*^{2} greater than 0.99). For the obtained data points, there is no crossover between CD and CD–sRI. CD–sRI is *always* faster than CD, and there is no overhead for using CD–sRI compared to CD for a single sample. Similarly, panel (b) shows the timing benchmark for water clusters up to *N* = 64. The observed scaling for CD is $O(N3.63)$, whereas it is $O(N2.94)$ for CD–sRI (estimated with *R*^{2} greater than 0.999). The CD algorithm does not exhibit asymptotic scaling for these system sizes, which could be due to the efficient use of Basic Linear Algebra Subroutines (BLAS). CD–sRI appears to be quite close to its theoretically predicted asymptotic scaling. The memory results are obtained based on Table II and are shown in Fig. 4. Even for the smallest systems examined, CD–sRI requires an order of magnitude less memory for each walker. Due to the scaling differences between CD and CD–sRI, the difference in the memory usage becomes only larger as the system size increases. This highlights the utility of CD–sRI in reducing the additional memory usage for each walker.

In summary, based on simple benchmarks on 1D H-chains and water clusters, we have successfully demonstrated the utility of the CD–sRI algorithm. With the variance reduction technique introduced here, the CD–sRI algorithm achieves a cubic-scaling local energy evaluation *without overhead*.

## V. CONCLUSIONS

In this work, we have shown how the stochastic resolution-of-the-identity (sRI) technique developed by Takeshita *et al.* can be seamlessly integrated with the local energy evaluation of phaseless auxiliary-field quantum Monte Carlo (ph-AFQMC). We considered four different existing local energy evaluation strategies: a half-rotated (HR) electron repulsion integral tensor approach, a Cholesky decomposition-based approach (CD), tensor hypercontraction (THC), and low-rank (LR) factorization approaches. We have carefully analyzed their formal scalings and discussed possible scaling reduction as well as walker-specific memory reduction when combined with sRI.

We found that HR–sRI achieves neither scaling reduction nor walker-specific memory reduction. It scales quadratically with system size like the original HR approach with the same memory usage. On the other hand, the CD approach, which formally scales quadratically with system size, can be reduced to cubic-scaling when combined with sRI (CD–sRI). Furthermore, the walker-specific memory usage is reduced to quadratic from cubic. Similarly, the cubic-scaling approaches, THC and LR, can be reduced to quadratic-scaling with sRI. The additional walker-specific memory requirement is also reduced from quadratic to linear. Without sRI, previously available algorithms using THC and LR achieve cubic scaling, but the overhead associated with them limits the applicability of these algorithms. Therefore, we have examined a new cubic-scaling algorithm, CD–sRI, with a particular focus on characterizing the overhead associated with the algorithm.

We applied the CD–sRI algorithm to one-dimensional hydrogen chains (1D H-chains) and water clusters. With a variance reduction (VR) technique developed for CD–sRI, CD–sRI with VR exhibits no overhead compared to CD up to H_{80} and up to (H_{2}O)_{8}. By no overhead, we mean that CD–sRI and CD exhibit a similarly large error bar for a similar number of statistical samples. Therefore, CD–sRI adds no additional cost for computing the total ph-AFQMC energy for a fixed statistical error. Furthermore, we have performed a timing benchmark for H-chains up to H_{300} and water clusters up to (H_{2}O)_{64}. We observed reduced scaling from CD–sRI compared to CD. Specifically, the observed scaling of CD was $O(N3.13)$ for H-chains and $O(N3.63)$ for water clusters, while we observed $O(N2.12)$ for H-chains and $O(N2.94)$ for water clusters in the case of CD–sRI. Based on these observations, we argue that the CD–sRI algorithm should be the standard cubic-scaling local energy algorithm with an expected generic speed-up compared to the conventional CD algorithm.

The remaining challenges for large-scale ph-AFQMC simulations are the memory bottleneck and the cubic-scaling walker propagation that occurs every time step. Both THC and LR approaches have successfully addressed the memory bottleneck^{26,32} as they require only quadratic-scaling storage for each walker. More theoretical development is needed for accelerating the walker propagation. With THC–sRI or LR–sRI, the local energy evaluation scales quadratically with system size. In these cases, the bottleneck for ph-AFQMC is the walker propagation. It has been shown that a significant speed-up of the propagation is possible by utilizing graphical processing units (GPUs).^{59,60} Furthermore, the significant reduction in the requirements for walker-specific extra memory will be particularly useful with GPUs since in this case, the available memory is highly limited. With sRI, the propagation is the bottleneck in AFQMC for all system sizes. Therefore, it will be interesting to explore ideas such as sRI to design new algorithms for walker propagation, which scale quadratically (or less) with system size. We reserve this and the exploration of additional projects centered on THC–sRI or LR–sRI with GPUs for large-scale ph-AFQMC applications for future work.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## ACKNOWLEDGMENTS

We are indebted to Fionn Malone for stimulating discussions and pointing out the potential usefulness of control variate in reducing the variance. We also thank Fionn Malone for help with QMCPACK,^{56,57} which was used to verify the lack of population bias of reported numbers in this work. We thank Evgeny Epifanovsky for useful discussions on efficient implementations of the local energy evaluation. We thank Miguel Morales and Shiwei Zhang for helpful comments. D.R.R. acknowledges support from Grant No. NSF-CHE 1954791.

## REFERENCES

_{2}

_{36}fullerene and an iron porphyrin model complex

*et al.*, “