We present an algorithm for calculating the local energy of a multi-Slater wave function in orbital space quantum Monte Carlo (QMC). Recent developments in selected configuration interaction methods have led to increased interest in using multi-Slater trial wave functions in various QMC methods. For an ab initio Hamiltonian, our algorithm has a cost scaling of O(n5 + nc), as opposed to the O(n4nc) scaling of existing orbital space algorithms, where n is the system size and nc is the number of configurations in the wave function. We present our method using variational Monte Carlo calculations with the Jastrow multi-Slater wave function, although the formalism should be applicable for auxiliary field QMC. We apply it to polyacetylene and demonstrate the possibility of using a much larger number of configurations than possible using existing methods.
I. INTRODUCTION
Particle-hole excitations from a mean-field reference state have long been used to encode correlation in wave functions. In single-reference wave function methods, one usually starts from a single Hartree–Fock (HF) configuration, and the electron correlation is built on top of this zeroth-order state by adding particle-hole excitations, often variationally or perturbatively.1 When electron interactions are strong, high rank particle-hole excitations from an HF reference are required for a satisfactory description of the electronic structure. It is computationally infeasible to include arbitrarily high-rank excitations for even moderately sized systems due to the rapid increase in their number with system size. Sometimes, the strong interactions are confined to a small set of orbitals and electrons termed the active space.2 For small enough active spaces, it is possible to include all possible excitations or, equivalently, all configurations in the Hilbert space in a variational treatment, and to solve the problem exactly in the active space. This is usually only feasible for active spaces smaller than about 20 electrons in 20 orbitals.3 Various approximate methods have been devised to stretch this restrictive bound on the active space size.4–8 Recently, selected configuration interaction (CI) methods, first used about 50 years ago,9,10 have attracted renewed attention because of their ability to target the most important configurations contributing to the wave function at a relatively small cost.6,11–14 Fast implementations of such techniques are now available that allow calculations with millions of configurations on even small work-stations.
Due to the remarkable accuracy and easy computability of selected CI wave functions, they have been used in quantum Monte Carlo (QMC) approaches as trial wave functions. In real space QMC, they can be used as part of the Jastrow multi-Slater wave functions, where the Jastrow factors serve to capture the dynamic correlation efficiently, thereby greatly truncating the length of the selected CI expansion required for a given accuracy.15–17 Multi-Slater wave functions have also been used in the second-quantized setting of auxiliary field QMC (AFQMC),7,18,19 where the error incurred by the phaseless approximation20 can be controlled by using more accurate trial wave functions.21,22 For calculating properties of wave functions in QMC methods, one needs to evaluate the so-called local quantities for a walker in the MC run. For example, for calculating the energy, one averages the local energy given by
where is the Hamiltonian, |ϕ⟩ = ∑IcI|ϕI⟩ is the multi-Slater wave function with |ϕI⟩ being configurations in the CI expansion obtained by particle-hole excitations from a reference configuration, nc is the number of configurations, and |n⟩ is the walker given by electronic positions in real space QMC and by orbital occupations in orbital space QMC. At first glance, this expression seems to suggest that the cost of calculating the local energy for the multi-Slater wave function is nc times the cost of calculating it for a single configuration. However, this cost can be significantly reduced since CI configurations are not independent but are obtained by a small set of excitations from the reference. Various algorithms have been proposed to achieve this speed-up in real space QMC,13,23–25 most efficient being the one recently proposed by Filippi and co-workers.24,25 Their algorithm has a cost scaling of O(n3 + nc), where n is the system size, which has allowed millions of configurations to be included in trial wave functions in real space. While there has been some prior work on using multi-Slater trial wave functions in AFQMC,26,27 no such drastic speed-up has been reported for calculating energy in orbital space algorithms, thus restricting the number of configurations that can be used. In this article, we present an algorithm for calculating the local energy of a multi-Slater wave function that achieves speed-ups analogous to the real space algorithm for orbital space QMC. We will use the variational Monte Carlo (VMC) treatment of Jastrow multi-Slater wave functions to formulate our technique.
In the following, we start by briefly describing the wave function and defining the required notation. Then, we show how local energy calculations can be sped up by storing some intermediates. Finally, the method is used for polyacetylene ground-state calculations to demonstrate its efficiency.
II. THEORY
A. Overview
The Jastrow multi-Slater wave function ansatz is given by
where Jij are real numbers, is the number operator for spin–orbital i, and |ϕI⟩ are electronic configurations forming the multi-configurational state |ϕ⟩. We will suppress spin indices for brevity throughout. Let n and nc be the number of electrons and number of configurations in the CI expansion, respectively. For scaling considerations, we will assume the number of electrons to be proportional to the number of orbitals, serving as a proxy for the system size.
In general, the single-particle orbital sets used for defining the Jastrow and the CI expansion are different. The second-quantized operators for the spatially local orbitals used in the Jastrow factor28 will be denoted as , while those for the canonical orbitals used in the CI expansion will be denoted as . Indices i, j, a, b will be reserved for local orbitals, and p, q, t, u will be reserved for the canonical ones. We will assume both sets of orbitals to be orthogonal, with the unitary transformation relating them given by
where are real numbers forming the coefficient matrix. We define the following slices of the coefficient matrix:
where superscripts show the row indices and subscripts show the column indices used for slicing. Henceforth, all indices will be assumed to be relative to make the notation compact. For example, iσ will label the iσth occupied local orbital, tν will label the tνth empty canonical orbital, and so on. All configurations in the CI expansion are related to a single configuration |ϕ0⟩, termed the reference configuration, via a set of canonical orbital excitations,
The degree of excitation for CI configurations is assumed to be of O(k). In practice, k is usually less than 10. A summary of the notation is shown in Fig. 1.
In this article, we are interested in calculating the energy of this wave function, although similar considerations apply to some other physical properties of interest as well. We calculate the energy using variational Monte Carlo sampling,
where the walkers |n⟩ are configurations in the Hilbert space, and the random walk is performed according to the probability distribution ⟨ψ|ψ⟩. The quantity being averaged, the local energy, is given by
where the configurations |m⟩ are generated by the action of the Hamiltonian on the walker |n⟩. The local energy calculation is the rate limiting step in a VMC calculation, and thus, efficient algorithms for calculating it are essential for feasible VMC simulations. We use the local orbitals used in the Jastrow to define the walkers. This is a sensible choice because the Jastrow is diagonal in this basis and the Hamiltonian, because of its local interactions, has only a quadratically scaling number of two-electron integrals, as opposed to quartic in a general basis. Thus, the application of the Jastrow and the Hamiltonian onto a configuration can be performed efficiently in a local basis.29 This leaves us with the task of calculating the overlap of a walker in a local basis with the multi-Slater wave function defined in a canonical basis. We outline the algorithm for doing this efficiently for local energy calculations in Secs. II B –II D.
B. Transition matrix elements
For calculating overlap and local energy, we need to evaluate the matrix elements of strings of excitation operators between configurations expressed in local and canonical bases. This section summarizes the identities that will be used to calculate these transition matrix elements. We will use the generalized Wick’s theorem to obtain them.
First, note the following pair contractions:
These can be derived in many different ways.7,30–33 A brief derivation of the first identity can be found in Appendix A. Note that given one of the four identities, the rest follow by simply transforming between the and operators. Using these pair contractions and the generalized Wick’s theorem, we get the following transition matrix element:
where the RHS is the determinant of a block matrix with blocks given by the indicated slices. The RA−1 elements arise from contractions between operators, the A−1C elements arise from contractions between the operators, and the A−1 and RA−1C − B elements arise from cross contractions between and operators. The determinant structure results from the fermionic parity factors. This equation will be used often in Secs. II C and II D. For convenience, we define the matrix Δ as
C. Overlap evaluation
We will not include the Jastrow factor in the following since it can be handled the same way as for the Jastrow Slater wave function. The overlap of a walker with the multi-Slater wave function is given by
The first term in this sum, the overlap of the walker with the reference configuration, is
where |0⟩ is the vacuum state. This follows from the usual (vacuum) Wick’s theorem. Similarly, the overlap of the walker with any other configuration in the CI expansion is also given by a determinant of size n × n. The cost of calculating these determinants scales as O(n3). Thus, the cost of evaluating the total overlap by calculating individual determinant overlaps from scratch has a prohibitive cost scaling of O(n3nc). We can improve upon this by storing a few matrices at the start of the calculation.
From Eq. (9), we have
where |ϕI⟩ is obtained from |ϕ0⟩ by k excitations [Eq. (5)]. Thus, by calculating the matrix A−1C and det(A) at cost O(n3) once, the overlap with any configuration in the CI expansion can be calculated at cost O(k3). We note that the matrix A−1C can be updated using the Sherman–Morrison formula at cost O(n2) when the walker makes a move and does not need to be calculated from scratch every time. Since k is usually much smaller than n, the total cost scaling of this approach, given by O(n2 + nck3), proves to be superior in almost all cases.
D. Local energy evaluation
The local energy of the multi-Slater wave function is given by
where |m⟩ are generated from the action of the Hamiltonian on the walker |n⟩. As mentioned before, an ab initio Hamiltonian generates O(n4) excitations, which is reduced to O(n2) when a local basis is used. Because of this large number of excitations, the calculation of local energy constitutes the most demanding part of a VMC calculation. We present two algorithms to evaluate the local energy.
1. Algorithm I
Following the discussion about calculating overlaps, ⟨m|ϕ⟩ is given by a sum of nc determinants. We can again avoid evaluating these determinants individually from scratch by storing a few matrices. Suppose |m⟩ is obtained from |n⟩ by l excitations as
The overlap of |m⟩ with the excited configuration |ϕI⟩ defined in Eq. (5) is given precisely by Eq. (9) as
Thus, by calculating RA−1, A−1C, and Δ once at cost O(n3), each overlap of the form ⟨m|ϕI⟩ can be evaluated at cost O(k3) since an ab initio Hamiltonian generates only up to double excitations, i.e., l ≤ 2. Note that all of these matrices can be efficiently updated at cost O(n2) in a Monte Carlo sampling run.
Having calculated these three matrices, we can directly perform the sum in Eq. (14) to get the local energy,
where the overlap ratios can be obtained using Eq. (16). This algorithm has a cost scaling of O(n4nck3), if screening of the Hamiltonian elements due to locality of orbitals is not considered, and O(n2nck3), if it is. This simple algorithm is easy to implement and allows for incorporating Hamiltonian screening. However, the cost scaling is rather steep, and one is restricted to fairly short CI expansions. Note that this algorithm is similar to the real space algorithm proposed by Clark et al.23 It improves upon their algorithm by obviating the need to calculate extra intermediate matrices.
2. Algorithm II
In the overlap calculation, it was possible to separate the nc factor from the system size n in the cost scaling. Section II D 2 achieves this for local energy evaluation by storing some intermediates. First, we partition the local energy based on the rank of the Hamiltonian excitation,
where is a configuration obtained from |n⟩ by a single excitation {i → a} and, similarly, is a configuration obtained from |n⟩ by a double excitation i → a, j → b. The Hamiltonian matrix elements are defined as and . In the following, we will suppress the explicit dependence of certain quantities on the walker whenever it is clear from the context.
Consider the calculation of ES, arising from single excitations. We can further split it into contributions from individual configurations,
can be calculated directly as in Sec. II D 1,
For I ≠ 0, suppose |ϕI⟩ is obtained from |ϕ0⟩ by excitations {p1 → t1, …, pk → tk} [Eq. (5)]. Using Eq. (16), can be expressed as
Note that the determinants appearing in this sum all have the same k × k block in the right-bottom corner and they only differ in the first row and column. To benefit from this pattern, we Laplace expand the determinants along the first column as
where the ordered set of indices p1, …, pkpμ denotes the set p1, …, pk excluding pμ. In the second line, we have used the linearity of determinants to move the Hamiltonian elements and the sum over Hamiltonian excitations inside the determinants. Evidently, the first term is proportional to . To efficiently evaluate the second term, we define the following intermediate:
where the sums have been arranged to show the lowest cost scaling order of contractions given by O(n3). By building this intermediate during the evaluation of , the local energy contribution of configuration I can be calculated at cost O(k4) as
The total cost scaling of calculating the part of the local energy due to single excitations using Sec. II D 2 is O(n3 + nck4). Thus, we have successfully separated the nc factor from the system size in the cost scaling. We note that this part of the algorithm, dealing with single excitations, is similar to the real space algorithm of Filippi et al.,24 albeit formulated differently.
Now, we turn to the calculation of the double excitation part of the local energy, ED. Again, we split it into contributions due to individual configurations,
Considering |ϕI⟩ defined in Eq. (5), its contribution is given by
Note that the determinants appearing in this sum all have the same k × k block in the right-bottom corner and they only differ in the first two rows and columns. To exploit this fact, we Laplace expand the determinants on the RHS along the first two columns,
The first term can again be directly related to the local energy contribution of the reference given as
To efficiently evaluate the second and third terms, we build the intermediates
Both of these can be built at cost O(n4) during the evaluation of . For example, one can see how D1 can be calculated at this cost since it involves contractions of the type
obtained by expanding the determinant. Using D1 and D2, the second and third terms can be calculated as before due to the linearity of determinants at cost O(k4). To calculate the final term, we use the intermediate
This intermediate can be built at cost O(n5) since it involves sums like
obtained by expanding the two determinants. Using D3, the final term can be calculated at cost O(k6). Therefore, the total cost scaling of the local energy calculation using Sec. II D 2 is O(n5 + nck6).
Unlike Sec. II D 1, there is no obvious way to use the screening of Hamiltonian elements to reduce the cost scaling of building the intermediates in Sec. II D 2. However, we note that the equation for building the intermediate D3 resembles a Hamiltonian integral transformation. Techniques like density fitting34 and Cholesky decomposition35 are employed to reduce the cost of such transformations. Tensor hypercontraction36 can be used to reduce the cost scaling to O(n4). Another possible way of improving the cost is by updating the intermediates as the walker moves during a Monte Carlo run, instead of building them from scratch every time. Incorporating such techniques into our algorithm will be a task for future research.
With some modifications, these algorithms can be used for calculating local energies of multi-Slater trial wave functions in AFQMC as well. This is outlined briefly in Appendix B. Finally, we note that the Jastrow factor overlap ratios, which we have ignored in the discussion above, can be absorbed into the intermediates exactly like the Hamiltonian matrix elements. Other details of the VMC procedure can be found in Appendix C.
III. RESULTS
In this section, we apply the above formalism to the truncated trans-polyacetylene chain C28H30. A model geometry, with uniform bond lengths given by l(C=C) = 1.34 Å, l(C—C) = 1.45 Å, and l(C—H) = 1.08 Å and 120° bond angles, was used. We used PySCF37 to generate molecular integrals for the 6-31g basis set and the heat bath CI (HCI)6,38 program Dice to perform an heat bath configuration interaction (HCI) self consistent field39 calculation with the π active space consisting of 28 electrons in 28 orbitals. Intrinsic bond orbitals40 obtained by localizing the π orbitals were used in the Jastrow factor. They roughly resemble the pz atomic orbital on each carbon atom.
We first look at the computational performance of the two algorithms. These calculations were performed on an Intel Xeon Gold 6150 2.7 GHz central processing unit (CPU). Figure 2 shows the computational cost scaling of the two algorithms as a function of the number of configurations in the wave function, showing the cost of calculating 100 local energy samples during a continuous time Monte Carlo run. The configurations were generated using an stochastic HCI (SHCI) calculation with ϵ1 = 5 × 10−5, and only those up to quadruply excited from the reference were retained. These configurations constitute the majority of all leading configurations in the SHCI expansion: Of the 5 × 105 leading configurations, only 954 were higher than quadruply excited. Section II D 1 shows a linear scaling with the number of configurations as expected. We store the Hamiltonian in a heat bath format, which allows efficient screening of the two-electron integrals, where integrals below a screening threshold ϵ are ignored. In Fig. 2, we show the cost scaling of Sec. II D 1 for two ϵ values to demonstrate the effect of screening. Because there is essentially only one orbital on each carbon, screening is very efficient in this case and leads to a significant reduction in the cost of local energy evaluation. However, even with aggressive screening, the calculations become very expensive as the number of configurations is increased due to the O(n2nck3) scaling.
On the other hand, Sec. II D 2 has a much less severe cost scaling. Up to 1 × 104 configurations, the cost is almost independent of the number of configurations. It is dominated by the O(n5) scaling calculation of the intermediate in Eq. (32). Because we have implemented this as a dense tensor contraction, screening does not affect the cost. Beyond 1 × 104 configurations, the linear scaling O(nck6) starts to become dominant. As more quadruply excited configurations are added to the expansion, the k6 factor leads to a change in the slope of the scaling curve. Despite this, the largest calculation with about 4.9 × 106 configurations required 205 s for calculating 100 samples with Sec. II D 2, whereas a linear extrapolation of the Sec. II D 1 times suggests that it would take more than 38.5 h for the same calculation even with aggressive screening. This analysis demonstrates the favorable scaling of Sec. II D 2, making local energy calculations with long HCI expansions feasible.
In Fig. 3, we show the effect of the Jastrow factor on the convergence of the multi-Slater ground state energy. The reference energy was obtained by performing a density matrix renormalization group calculation, which is very accurate for linear systems. SHCI calculations were performed with progressively smaller ϵ1 values leading to progressively longer expansions. Even after including the perturbation theory correction, SHCI has significant difficulty in converging to the exact energy because of the large number of strongly correlated degrees of freedom in this active space. The ground state is dense in the canonical representation and has significant contributions from highly excited configurations, leading to the slow convergence of SHCI.
For the Jastrow multi-Slater wave function, we used leading configurations from an SHCI calculation performed with ϵ1 = 5 × 10−5. Note that these configurations are different from those included in the SHCI wave functions used to examine the convergence of SHCI energies. We made this particular choice to make sure that wave functions with more configurations are variationally superior than those with fewer. This choice also allowed us to use the parameters optimized for the shorter expansions to be used as initial guesses for longer ones, which was crucial for the optimization. We used stochastic gradient descent with momentum41 to optimize the wave function energies. Jastrow parameters and CI coefficients were optimized together. We used 84 processes with 100 continuous time Monte Carlo (CTMC) samples and 10 burn-in samples each during the optimization and did a final energy calculation with 3000 samples per process to get the statistical error below 0.5 mH. The Jastrow multi-Slater energies can be seen to converge much more rapidly than bare SHCI. The energy error for the wave function consisting of about 1.0 × 104 configurations is 3.5(4) mH, whereas the error in the HCI variational energy with 1.1 × 107 configurations is 48.8 mH. Similar observations have been made for the Hubbard model using transcorrelated Hamiltonians.42,43
IV. CONCLUSION
We have presented an algorithm for calculating local energies of multi-Slater wave functions in orbital space Monte Carlo. We used a Jastrow multi-Slater wave function in VMC to demonstrate its efficiency. The algorithm itself is general and should allow the use of longer selected CI expansions as trial wave functions in other orbital space QMC methods like AFQMC and Green’s function Monte Carlo44,45 as well. It also presents a different formulation of the real space algorithm proposed by Filippi et al. for local energies. We also showed the efficacy of Jastrow factors in vastly improving the convergence of the energy of CI expansions in polyacetylene.
We plan to explore various possibilities for improving the algorithm and using it in other Monte Carlo methods. As mentioned before, methods like density fitting and tensor hypercontraction can be used to reduce the cost of building the intermediates, which will likely dominate the total cost for large systems. We would also like to explore other ways of exploiting the structure in the Hamiltonian. Our pilot implementation uses dense tensor contractions even for the sparse Hamiltonian, which can be improved considerably. As for the Jastrow multi-Slater wave functions, we would like to study the criteria for choosing configurations in the presence of a Jastrow, as well as the possibility of using symmetry projection to improve the convergence of energy with the number of configurations.46,47 This will be crucial for feasibly optimizing these wave functions because stochastic optimization with a large number of parameters is a difficult task. Our numerical experiments with polyacetylene suggest that an optimization strategy where CI coefficients are progressively optimized in blocks may be more effective than optimizing all of them together. Based on experiences in the community with other nonlinearly parameterized wave functions, it may be worth exploring such optimization techniques.
ACKNOWLEDGMENTS
We thank Joonho Lee for useful discussions about AFQMC. The funding for this project was provided by the National Science Foundation through Grant No. CHE-1800584. S.S. was also partly supported through the Sloan research fellowship.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request. The code used to generate the results can be found at https://github.com/sanshar/VMC/tree/master.
APPENDIX A: CALCULATION OF PAIR CONTRACTIONS
We start by deriving the first pair contraction identity in Eq. (8). Using the usual (vacuum) Wick’s theorem, we have
where A′ is obtained from A by replacing its ith row with the ath row of R, i.e., . Laplace expanding det(A′) along the ith row, we get
where CA′ and CA are the cofactor matrices of A′ and A, respectively. Since A′ and A only differ in their ith row, their cofactor matrices have the same ith rows, i.e., . The cofactor matrix is related to the inverse by
Thus, we have
leading to the desired result,
As noted in the main text, the rest of the identities in Eq. (8) follow from this one by expressing the operators in terms of operators.
APPENDIX B: LOCAL ENERGY IN AFQMC
In this section, we illustrate how Sec. II D 2, explained in the main text within the framework of VMC, can be used in AFQMC with some minor changes. We follow the importance sampled open ended random walk AFQMC procedure described by Motta and Zhang.7 While walkers in VMC are represented by different electron configurations in the same one particle orbital set, in AFQMC, walkers are nonorthogonal configurations with different orbitals.
The mixed estimator for ground state energy used in AFQMC is calculated as a weighted average of the local energies of the walkers,
where |ψT⟩ is the trial wave function, |ψ0⟩ is the propagated state, |n⟩ is a walker configuration, is the weight, and EL[n] is the local energy of this walker with respect to the trial wave function,
Suppose the trial wave function used for calculating the mixed estimator is a multi-Slater wave function represented using a canonical orbital set. Green’s function is defined as
where denote the canonical basis operators. G is obtained by a simple extension to all orbital indices μ and ν of the second pair contraction identity in Eq. (8). The ab initio Hamiltonian can be written in the canonical basis as
We only focus on the two-body part of the Hamiltonian and analyze the quantity,
which is used to calculate the full local energy as in Eq. (18). Using the definition of |ϕI⟩ in Eq. (5) and the generalized Wick’s theorem, we get
APPENDIX C: DETAILS OF VMC CALCULATIONS
1. Gradients
In order to optimize the wave function energy, we use gradient based methods. The ith component of the energy gradient is sampled according to the equation
where E is the energy of the wave function |ψ⟩ and |ψi⟩ is the derivative of the wave function with respect to the ith parameter. Since EL[n] and E are both available from the energy sampling, we only need to calculate the wave function derivative overlaps to obtain an estimate of the energy gradient.
The Jastrow multi-Slater wave function has three types of parameters that can be optimized: Jastrow elements, CI coefficients, and orbital coefficients. We note that our parameterization is fairly redundant, and it is possible that removing these redundancies can lead to improvements in optimization. We have chosen not to do so because of the simplicity of the redundant parametrization. For Jastrow parameters, the derivative overlap ratio is given by
where ni and nj are the occupation numbers of orbitals i and j in the walker. All Jastrow derivative overlaps can be calculated at cost O(n2). For the CI coefficients, the derivate overlap ratio is given by
All CI coefficient derivative overlaps can be calculated at cost O(nck3), given the matrices stored during energy sampling. A naive calculation of the orbital coefficient derivatives has a prohibitive cost of O(n2nck3), but this can be improved to O(n3 + nck3) using the method outlined by Assaraf et al.25 We have not used orbital optimization in this article.
2. Sampling
We use continuous time Monte Carlo (CTMC)48,49 to perform the energy and gradient sampling. We refer the reader to our previous work29 for the details of this sampling technique. To perform efficient CTMC sampling of a wave function |ψ⟩, one needs to calculate the overlap ratios for all the excitations |m⟩ generated from the walker |n⟩ through the Hamiltonian. In Sec. II D 1, all these overlap ratios are calculated directly and can be stored during the local energy calculation, allowing the CTMC sampling of the full wave function at essentially no extra cost. However, in Sec. II D 2, these overlap ratios are not calculated explicitly, instead they get absorbed in the intermediates. Thus, it is not possible to perform efficient CTMC sampling of the full wave function in Sec. II D 2. Instead, we sample the wave function given by
Since this sampling wave function only contains the reference configuration, its overlap ratios are available in Sec. II D 2. We estimate the energy of the full wave function by taking the ratio of the following quantities obtained by CTMC sampling |ψ0⟩:
This introduces a bias in the estimate of the energy, which could be severe if |ψ0⟩ has vanishing contributions from parts of the Hilbert space where |ψ⟩ has significant amplitudes.50 We guard against instabilities due to rare events when the walker overlap with the sampling wave function is small by throwing away outlier samples using a threshold for the magnitude of . In our experiments, this has not lead to significant issues because |ψ0⟩ is often a good approximation to |ψ⟩. We note that the occurrences of such rare events can be reduced by including more configurations in the sampling wave function, but we have not done so in this study. This will also improve the sampling efficiency at a slightly enhanced cost introduced by the direct calculation of overlap ratios. We are currently investigating possible sampling issues and will address them more thoroughly in a future article. The gradient can also be sampled similarly.