Variational quantum Monte Carlo (QMC) is an *ab initio* method for solving the electronic Schrödinger equation that is exact in principle, but limited by the flexibility of the available *Ansätze* in practice. The recently introduced deep QMC approach, specifically two deep-neural-network *Ansätze* PauliNet and FermiNet, allows variational QMC to reach the accuracy of diffusion QMC, but little is understood about the convergence behavior of such *Ansätze*. Here, we analyze how deep variational QMC approaches the fixed-node limit with increasing network size. First, we demonstrate that a deep neural network can overcome the limitations of a small basis set and reach the mean-field (MF) complete-basis-set limit. Moving to electron correlation, we then perform an extensive hyperparameter scan of a deep Jastrow factor for LiH and H_{4} and find that variational energies at the fixed-node limit can be obtained with a sufficiently large network. Finally, we benchmark MF and many-body *Ansätze* on H_{2}O, increasing the fraction of recovered fixed-node correlation energy of single-determinant Slater–Jastrow-type *Ansätze* by half an order of magnitude compared to previous variational QMC results, and demonstrate that a single-determinant Slater–Jastrow-backflow version of the *Ansatz* overcomes the fixed-node limitations. This analysis helps understand the superb accuracy of deep variational *Ansätze* in comparison to the traditional trial wavefunctions at the respective level of theory and will guide future improvements of the neural-network architectures in deep QMC.

## I. INTRODUCTION

The fundamental problem in quantum chemistry is to solve the electronic Schrödinger equation as accurately as possible at a manageable cost. Variational quantum Monte Carlo (variational QMC or VMC in short) is an *ab initio* method based on the stochastic evaluation of quantum expectation values; it scales favorably with system size and provides explicit access to the wavefunction.^{1} Although exact in principle, VMC depends strongly on the quality of the trial wavefunction, which determines both the efficiency and the accuracy of the computation and typically constitutes the limiting factor of VMC calculations.

Recently, deep QMC has been introduced: a new class of *Ansätze* that complement traditional trial wavefunctions with the expressiveness of deep neural networks (DNNs). This *ab initio* approach is orthogonal to the supervised learning of electronic structure, which requires external datasets.^{2,3} The use of neural-network trial wavefunctions was pioneered for spin lattice systems^{4} and later generalized to molecules in second quantization.^{5} The first application to molecules in real space was a proof-of-principle effort that did not reach accuracy close to that of traditional VMC.^{6} The DNN architectures PauliNet and FermiNet advanced the real-space deep QMC approach,^{7,8} raising the accuracy to state-of-the-art levels and beyond. Demonstrating very high accuracy with far fewer determinants than their traditional counterparts, these deep-neural-network trial wavefunctions provide an alternative to increasing the number of Slater determinants, thus potentially improving the unfavorable scaling with the number of electrons that complicates accurate calculations for large systems. The deep QMC method can also be applied to many-particle quantum systems other than electrons.^{9}

Currently, there is little understanding of why these DNN wavefunctions work well and how their individual components contribute to the approximation of the ground-state wavefunction and energy. Examining their expressive power and measuring their accuracy in comparison to traditional approaches is essential to establish neural-network trial wavefunctions as a standard technique in VMC and to guide further development.

Here, we identify a hierarchy of model *Ansätze* based on the traditional VMC methodology (Fig. 1) that enables us to distinguish the effects of improving single-particle orbitals and adding correlation in the symmetric part of the wavefunction *Ansatz*. This is of particular interest in the context of discriminating these improvements from reducing the energy by solving the intricate problem of missing many-body effects in the nodal surface.

The trial wavefunctions in QMC are typically constructed by combining a symmetric Jastrow factor with an antisymmetric part that implements the Pauli exclusion principle for fermions by specifying the nodal surface of the *Ansatz*—the hypersurface in the space of electron coordinates, **r** = (**r**_{1}, …, **r**_{N}), on which the wavefunction changes sign. Expressing the antisymmetric part as a linear combination of Slater determinants gives rise to the Slater–Jastrow-backflow-type *Ansatz* that comprises most VMC *Ansätze*, including the deep variants PauliNet and FermiNet,

The ability of neural networks to represent antisymmetric (wave) functions has also been explored theoretically.^{10,11}
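Schematically, a Slater–Jastrow-backflow *Ansatz* of this family can be written as follows (generic notation for illustration; the precise form of Eq. (1), which is not reproduced here, may differ in detail):

```latex
\Psi_\theta(\mathbf{r})
  = e^{J_\theta(\mathbf{r})}
    \sum_p c_p \det\!\left[\tilde{\varphi}^{\,p}_{\mu}(\mathbf{r}_i;\mathbf{r})\right],
```

where $J_\theta$ is the symmetric Jastrow factor, $c_p$ are determinant expansion coefficients, and the generalized orbitals $\tilde{\varphi}^{\,p}_{\mu}$ may depend on all electron coordinates through a backflow transformation.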

Traditionally, Slater determinants are antisymmetrized product states constructed from single-particle molecular orbitals, which are expressed in a one-electron basis set consisting of basis functions *ϕ*_{k},

Employing such basis sets transforms the problem of searching over infinitely many functions into a problem of searching over coefficients in a system of equations, which can be solved by means of linear algebra applying, for instance, the Hartree–Fock (HF), the multi-configurational self-consistent field (MCSCF), or the full configuration interaction (FCI) method. The projection comes at the cost of introducing the finite-basis-set error (BSE), which completely vanishes only in the limit of infinitely many basis functions—the complete-basis-set (CBS) limit [Fig. 1(a)]. Finite-basis-set errors are inherent to the second-quantized representation, which, nevertheless, provides an alternative platform to introduce deep learning to quantum chemistry.^{5}
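The finite-basis expansion described above can be made concrete with a minimal sketch: a molecular orbital as a linear combination of s-type Gaussian basis functions on an H_{2}-like geometry. The exponents and coefficients below are hypothetical illustrations, not a real 6-31G parameterization.

```python
import numpy as np

def gaussian_s(r, center, alpha):
    """Normalized s-type Gaussian basis function chi_k(r)."""
    norm = (2.0 * alpha / np.pi) ** 0.75
    d = r - center
    return norm * np.exp(-alpha * np.dot(d, d))

def molecular_orbital(r, centers, alphas, coeffs):
    """phi_mu(r) = sum_k c_k * chi_k(r): the finite-basis expansion."""
    return sum(c * gaussian_s(r, mu, a)
               for c, mu, a in zip(coeffs, centers, alphas))

# Two basis functions on two protons (H2-like toy geometry, bohr).
centers = [np.array([0.0, 0.0, -0.7]), np.array([0.0, 0.0, 0.7])]
alphas = [1.0, 1.0]
coeffs = [0.5, 0.5]  # bonding combination (illustrative weights)

value_at_origin = molecular_orbital(np.zeros(3), centers, alphas, coeffs)
```

The search over the coefficients `coeffs` (rather than over arbitrary functions) is what the HF, MCSCF, and FCI methods perform within such a basis.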

The real-space formulation of VMC allows us to introduce explicit electron correlation efficiently by modeling many-body interactions with a Jastrow factor [Fig. 1(b)]. The Jastrow factor is a symmetric function of the electron coordinates that traditionally involves an expansion in one-, two-, and three-body terms.^{12} Although strongly improving the *Ansatz*, traditional Jastrow factors do not have sufficient expressiveness to reach high accuracy, and an initial VMC calculation is typically followed by a computationally demanding fixed-node diffusion QMC (FN-DMC) simulation [Fig. 1(c)], which eventually projects out the exact solution for the given nodal surface—the fixed-node limit.^{13} DMC is based on the imaginary-time Schrödinger equation and offers yet another entry point for the use of neural networks to represent quantum states.^{14,15}
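A minimal example of a traditional two-body Jastrow term is the Padé form u(r) = br/(1 + cr), a generic textbook parameterization (not the deep Jastrow factor discussed later). With b = 1/2 in atomic units, du/dr → 1/2 as r → 0, which is the electron–electron cusp condition for antiparallel spins:

```python
# Pade-form two-body Jastrow term, u(r) = b*r / (1 + c*r).
# b = 1/2 (atomic units) enforces the antiparallel-spin
# electron-electron cusp; c controls the long-range saturation.

def pade_u(r, b=0.5, c=1.0):
    return b * r / (1.0 + c * r)

# Numerical check of the cusp slope at electron-electron coalescence.
eps = 1e-6
cusp_slope = (pade_u(eps) - pade_u(0.0)) / eps
```

The saturation of u(r) at large r (it is bounded by b/c) is what limits the expressiveness of such simple forms compared to a DNN parameterization.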

The nodal surface of the trial wavefunctions can be improved by increasing the number of determinants or by applying the backflow technique, transforming single-particle orbitals to many-body orbitals under consideration of the symmetry constraints. These are key concepts to efficiently reach very high accuracy with VMC and integral features of deep QMC. Using multiple determinants, applying the backflow technique, and modifying the symmetric component of the *Ansatz* at the same time, however, makes it difficult to identify the contributions of each individual part. Benchmarking deep QMC *Ansätze* in conceptually simpler contexts confirms their correct functionality and helps achieve a better understanding.

In this paper, we take a closer look at how neural networks compensate for errors arising from finite basis sets and demonstrate convergence to the fixed-node limit within the VMC framework by systematically increasing the expressiveness of a deep Jastrow factor. For the sake of disentangling the individual contributions to the overall accuracy, we conduct our analysis mainly with Slater–Jastrow-type trial wavefunctions with an antisymmetric part consisting of a single determinant, that is, with *Ansätze* possessing a mean-field nodal surface. We compare neural-network variants with traditional functional forms, as well as with DMC results. In particular, we investigate the PauliNet, a recently proposed neural-network trial wavefunction.^{7} PauliNet combines ideas from conventional trial wavefunctions, such as a symmetric Jastrow factor, a generalized backflow transformation, multi-determinant expansions, quantum-chemistry baselines, and an explicit implementation of physical constraints of ground-state wavefunctions. Since PauliNet is a powerful instance of the general *Ansatz* in (1), we can obtain traditional types of QMC *Ansätze* at different levels of the theory by deactivating certain trainable parts of PauliNet. The hierarchy of *Ansätze* sketched in Fig. 1 maps restricted single-determinant versions of PauliNet and their eventual expressiveness in the context of the traditional single-determinant VMC approach. The incentive of implementing restricted variants of PauliNet is to test the behavior of the *Ansatz* in settings that are well solved by existing methods and investigate the expressiveness of the individual components of PauliNet on well-defined subproblems. These restricted variants, however, are not intended to be used in order to achieve best accuracy, which is attained when taking advantage of the full flexibility of the PauliNet *Ansatz*, as demonstrated previously.^{7}

The rest of the paper is organized as follows. In Sec. II, we review the general PauliNet *Ansatz* and show how different levels of the model hierarchy (Fig. 1) can be obtained. In Sec. III, we use these instances of PauliNet to investigate several subproblems of the fixed-node limit within the deep QMC approach. First, we demonstrate that DNNs can be employed to correct the single-particle orbitals of a HF calculation in a small basis and obtain energies close to the CBS limit. Next, we benchmark the deep Jastrow factor. We start by applying it to two node-less test systems, H_{2} and He, where results within five significant digits of the exact energy are achieved. Then, we conduct an extensive hyperparameter search for two systems with four electrons, LiH and the H_{4} rectangle, revealing that the expressiveness of the *Ansatz* can be systematically increased to converge to the fixed-node limit imposed by the employed antisymmetric *Ansatz*. We further explore the convergence aspect by sampling the dipole moment for the LiH *Ansätze* and evaluating energy differences for two configurations of the hydrogen rectangle. Thereafter, we show the size consistency of the method, examining the optimization of the deep Jastrow factor for systems of non-interacting molecules (H_{2}–H_{2} and LiH–H_{2}). Finally, we test various single-determinant variants of PauliNet in an analysis of the water molecule and compare them to traditional trial wavefunctions. Section IV discusses the results.

## II. THEORY AND METHODS

### A. PauliNet

The central object of our investigation is PauliNet, a neural-network trial wavefunction of the form in (1). PauliNet extends the traditional multi-determinant Slater–Jastrow-backflow-type trial wavefunctions,^{16} retaining physically motivated structural features while replacing *ad hoc* parameterizations with highly expressive DNNs,

The *Ansatz* consists of a linear combination of Slater determinants of molecular single-particle orbitals *φ*_{μ} corrected by a generalized backflow transformation **f**_{θ} and of a Jastrow factor *J*_{θ}. The DNN components are indicated by the *θ* subscript, denoting the trainable parameters of the involved neural networks. The expansion coefficients *c*_{p} and the single-particle orbitals are initialized from a preceding standard quantum-chemistry calculation (HF or MCSCF). The analytically known electron–nucleus and electron–electron cusp conditions^{17} are enforced within the orbitals *φ*_{μ} and as a fixed part *γ* of the Jastrow factor, respectively. The correct cusps are maintained by designing the remaining trial wavefunction architecture to be cusp-less.

Both backflow transformation and Jastrow factor can introduce many-body correlation and are constructed in such a way that they preserve the antisymmetry of the trial wavefunction. The Jastrow factor consists of a symmetric function, that is, it retains the antisymmetry upon being invariant under the exchange of same-spin electrons. This, however, has the consequence that it scales the wavefunction without altering the nodes of the *Ansatz*. The backflow transformation on the other hand alters the nodal surface by acting on the orbitals directly. Traditionally, the backflow correction introduces many-body correlation by assigning quasi-particle coordinates that get streamed through the original orbitals. PauliNet generalizes this concept, based on the observation that equivariance with respect to the exchange of electrons is a sufficient criterion to retain the antisymmetry of the Slater determinant, and considers the backflow correction as a many-body transformation of the orbitals themselves. In fact, it has been shown in principle that a single Slater determinant with generalized orbitals is capable of representing any antisymmetric function, if the many-body orbitals are sufficiently expressive.^{11} Both the Jastrow factor *J*_{θ} and backflow transformation **f**_{θ} are obtained from a joint latent-space representation encoded by a graph-convolutional neural network. The network acts on the rotation- and translation-invariant representation of the system given by the fully connected graph of distances between all electrons and nuclei. The latent-space many-body representation is designed to be equivariant under the exchange of same-spin electrons, which is used to construct the permutation-equivariant backflow transformation and the permutation-invariant Jastrow factor. Details on the graph-convolutional neural-network architecture can be found in the Appendix. 
Combining an expansion in Slater determinants with the Jastrow factor and backflow transformation introduces multiple ways to model many-body effects, helping us to encode correlation efficiently in the *Ansatz* by, for example, representing dynamic correlation explicitly while capturing static correlation with multiple determinants.

The PauliNet *Ansatz* is optimized according to the standard VMC scheme^{1} of minimizing its energy expectation value. This is based on the variational principle of quantum mechanics that guarantees the energy expectation value of any trial wavefunction to be lower-bounded by the ground-state energy, as long as the fermionic antisymmetry constraint is implemented,

In VMC, this expectation value is approximated by Monte Carlo integration,

In practice, this gives rise to an alternating scheme of sampling electronic configurations according to the probability density associated with the trial wavefunction with a standard Langevin sampling approach and optimizing the parameters of this wavefunction by following their (stochastic) gradient with respect to estimates of the expectation value over small batches. For further details of the training methodology, see Ref. 7. Numerical calculations were carried out with the DeepQMC Python package,^{18} with training hyperparameters as reported in Table VI.
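The alternating sample-then-update scheme can be illustrated with a deliberately tiny toy problem: a 1D harmonic oscillator, H = −½ d²/dx² + ½x², with trial wavefunction ψ_a(x) = exp(−ax²). This is a pedagogical stand-in, not the DeepQMC implementation, and for brevity it uses Metropolis sampling where the paper uses a Langevin approach; the gradient estimator 2 Cov(E_L, ∂_a ln ψ) is the standard VMC energy gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_energy(x, a):
    # E_L = (H psi)/psi = a + x^2 * (1/2 - 2 a^2) for psi = exp(-a x^2)
    return a + x**2 * (0.5 - 2.0 * a**2)

def sample(a, n_walkers=2000, n_steps=200, step=1.0):
    """Metropolis sampling of the density |psi_a|^2 ~ exp(-2 a x^2)."""
    x = rng.normal(size=n_walkers)
    for _ in range(n_steps):
        prop = x + step * rng.normal(size=n_walkers)
        accept = rng.random(n_walkers) < np.exp(-2.0 * a * (prop**2 - x**2))
        x = np.where(accept, prop, x)
    return x

def energy_and_grad(a):
    """Energy estimate and its gradient, dE/da = 2 Cov(E_L, d ln psi/da)."""
    x = sample(a)
    el = local_energy(x, a)
    o = -x**2  # d ln psi / da
    grad = 2.0 * (np.mean(el * o) - np.mean(el) * np.mean(o))
    return np.mean(el), grad

a = 0.3                 # deliberately poor starting parameter
for _ in range(40):     # simple stochastic gradient descent
    e, grad = energy_and_grad(a)
    a -= 0.2 * grad
# Exact minimum: a = 1/2, ground-state energy E = 1/2.
```

Note the self-stabilizing behavior near the optimum: at a = 1/2 the local energy is constant, so the variance of both the energy and the gradient estimates vanishes.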

Next, we show how to obtain the *Ansätze* of Fig. 1 from the general PauliNet architecture and introduce the respective optimization problems to be solved.

### B. Deep orbital correction

The simplest way to approach the quantum many-body problem is by considering a mean-field theory. The HF method gives the optimal mean-field solution within the space of the employed basis set. A mean-field variant of the PauliNet architecture can be used to account for finite-basis-set errors in the HF baseline, by introducing a real-space correction to the single-particle orbitals,

The functions $f_\theta^{\otimes}$ and $f_\theta^{\oplus}$ are implemented by DNNs that generate a multiplicative and an additive correction to the HF orbitals *φ*_{μ}, respectively. Combining a multiplicative and an additive correction serves the practical purpose of facilitating the learning process, as the multiplicative correction has a strong effect where the value of the orbital is large, while the additive correction can alter the nodes of the molecular orbital. (In principle, an additive correction only would be a sufficient parameterization.) This approach is a special case of the generalized backflow transformation in (4), in which the backflow correction depends on the position of the *i*th electron only. The single-particle representation $\mathbf{x}_i^{(L)}(\mathbf{r}_i)$ can be obtained by a slight modification of the graph-convolutional architecture, as described in the Appendix. If Gaussian-type orbitals are used, it is common to correct the missing nuclear cusp at the coalescence points within the orbitals. We employ the cusp correction of Ma *et al.*^{19} and construct the DNNs to be cusp-less. Though the DNN could in principle approximate the orbitals from scratch, providing the HF baseline that ensures the correct asymptotics and offers a good initial guess reduces the training cost and makes the training process more robust. In the mean-field theory, the HF energy at the CBS limit constitutes a benchmark for the best possible solution to the optimization problem.
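The multiplicative-plus-additive structure of the orbital correction can be sketched as φ̃(r) = φ(r)·(1 + f_mult(r)) + f_add(r). In the sketch below, the baseline orbital and the smooth random-feature "networks" are hypothetical placeholders for the HF orbitals and the cusp-less DNN outputs, chosen only to make the formula concrete and runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
W = rng.normal(size=(DIM, 3)) * 0.1
B = rng.normal(size=DIM)
V_MULT = rng.normal(size=DIM) * 0.01  # small initial corrections
V_ADD = rng.normal(size=DIM) * 0.01

def hf_orbital(r):
    """Toy 1s-like baseline orbital (hypothetical HF solution)."""
    return np.exp(-np.linalg.norm(r))

def feature_net(r, v):
    """Smooth random-feature map standing in for a cusp-less DNN."""
    return np.tanh(W @ r + B) @ v

def corrected_orbital(r):
    # phi_tilde(r) = phi(r) * (1 + f_mult(r)) + f_add(r)
    return hf_orbital(r) * (1.0 + feature_net(r, V_MULT)) + feature_net(r, V_ADD)
```

With small initial weights the corrected orbital starts close to the baseline, mirroring the training setup in which the HF orbitals provide a good initial guess.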

### C. Deep Jastrow factor

The Slater–Jastrow-type *Ansatz* goes beyond the mean-field theory by introducing explicit electronic correlation. The symmetric Jastrow factor, however, cannot alter the nodal surface, and the single-determinant Slater–Jastrow-type *Ansatz* is, therefore, a many-body *Ansatz* possessing a mean-field nodal surface,

The deep Jastrow factor *J*_{θ} is obtained from the latent-space many-body representation encoded by the graph-convolutional neural network described in the Appendix,

To enforce the symmetry of the Jastrow factor, the permutation-equivariant many-body embeddings $\mathbf{x}_i^{(L)}$ are summed over the electrons to give permutation-invariant features. These features serve as the input to a fully connected neural network *η*_{θ}, which returns the final Jastrow factor. The process of obtaining the latent-space representation involves multiple smaller components, such as trainable arrays and fully connected neural networks, whose full specification gives rise to a collection of hyperparameters that influence the expressiveness of the *Ansatz*. A list of the components and the respective hyperparameters can be found in Table V.
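The sum-then-transform construction above guarantees permutation invariance by design, which a minimal sketch can verify numerically. The row-wise embedding here is a hypothetical stand-in for the graph-convolutional architecture; only the pooling structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_W = rng.normal(size=(3, 8))   # stand-in embedding weights
ETA_W1 = rng.normal(size=(8, 8))    # eta network, layer 1
ETA_W2 = rng.normal(size=(8,))      # eta network, output layer

def deep_jastrow(r_elec):
    x = np.tanh(r_elec @ EMBED_W)   # (n_elec, 8), applied row-wise
    pooled = x.sum(axis=0)          # sum over electrons: invariant
    hidden = np.tanh(pooled @ ETA_W1)
    return hidden @ ETA_W2          # scalar Jastrow value J

r = rng.normal(size=(4, 3))              # four electron positions
J = deep_jastrow(r)
J_perm = deep_jastrow(r[[2, 0, 3, 1]])   # same electrons, permuted order
```

Because the embedding is applied electron-wise and the pooling is a sum, exchanging any two electrons leaves J unchanged, so the Jastrow factor cannot alter the nodal surface.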

Benchmarking Jastrow factors comes with the difficulty of distinguishing errors arising from the nodal surface from those present due to a lack of expressiveness in the Jastrow factor. The optimal energy of a Slater–Jastrow-type trial wavefunction, however, can be obtained with the FN-DMC algorithm, which gives the exact ground state of the Schrödinger equation under the fixed-node constraint of the antisymmetric part of the *Ansatz*.

### D. Mean-field Jastrow factor

We, furthermore, implement a mean-field Jastrow factor, which constitutes another point in the space of *Ansatz* classes (Fig. 1),

The mean-field Jastrow factor can optimize the one-electron density of the *Ansatz* without modifying the nodal surface or introducing correlation, making its variations a strict subset of the orbital correction. This equips us with an intermediate step in approaching the finite-basis-set limit that can be used to relate the finite-basis-set error to the fixed-node error of the HF baseline. If the many-body Jastrow factor is used, the mean-field version is not needed, as it is implicitly included in the many-body version.

## III. RESULTS

### A. Large basis sets are not necessary in DNN *Ansätze*

We start from a HF baseline obtained in the small 6-31G basis set. Instead of introducing more basis functions, the PauliNet *Ansatz* follows the alternative approach of correcting the orbitals directly in real space. We trained the mean-field variant of the PauliNet *Ansatz* (Sec. II B) on H_{2}, He, Be, LiH, and the hydrogen square H_{4}. For all five test systems, we obtained energies close to the extrapolated CBS limit and recovered at least 97% of the finite-basis-set error (Fig. 2). This shows that the use of a very small basis set for the baseline of PauliNet does not introduce any fundamental limitation to accuracy, because the neural network is able to correct it. We note that such an approach to the CBS limit is practical only within the context of the full PauliNet, not as a standalone technique to replace large basis sets in quantum chemistry.

### B. Exact solutions for two-electron systems

Next, we turn to modeling electron correlation with the deep Jastrow factor (Sec. II C). We start by evaluating the deep Jastrow factor for H_{2} and He, two-electron closed-shell systems for which the ground state is node-less (the antisymmetry comes from the spin part of the wavefunction only), such that the Jastrow factor is, in principle, sufficient to reproduce exact results. This yields a pure test for the expressiveness of the deep Jastrow factor. The recovered many-body correlation is measured by the correlation energy,

For both systems, we obtain energies matching five significant digits of the exact references (Table I). We evaluate the *Ansatz* along the dissociation curve of H_{2} (Fig. 3). Deep QMC outperforms FCI even with the large cc-pV5Z basis set, reducing the error in correlation energy by one to two orders of magnitude at compressed geometries and still being more accurate at stretched geometries, where the system exhibits static correlation, and the restricted HF baseline gives qualitatively wrong results (ionic contributions resulting in negative interaction energy). The results demonstrate the difficulty of modeling dynamic correlation in Slater-determinant space when applying purely second-quantized approaches and showcase the advantages of explicitly encoding many-body correlations.
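The recovered correlation energy η used above is the fraction of the correlation energy captured by the trial wavefunction, η = (E − E_HF)/(E_exact − E_HF) × 100. The HF energy below is an illustrative value in the spirit of H_{2} at d = 1.4 bohr, not a number quoted from the paper's tables.

```python
def correlation_energy_fraction(e_vmc, e_hf, e_exact):
    """eta in percent: (E - E_HF) / (E_exact - E_HF) * 100."""
    return 100.0 * (e_vmc - e_hf) / (e_exact - e_hf)

# Illustrative energies in hartree (e_hf is an assumed HF-limit value).
eta = correlation_energy_fraction(e_vmc=-1.17446,
                                  e_hf=-1.13363,
                                  e_exact=-1.1744748)
```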

| System | Deep Jastrow factor | Exact energy | η (%) |
|---|---|---|---|
| H_{2} (d = 1.4) | −1.174 46(1) | −1.174 474 8^{20} | 99.97(3) |
| He | −2.903 72(1) | −2.903 724 7^{22} | 99.98(2) |

### C. Systematically approaching the fixed-node limit

The complexity of modeling correlation increases steeply with the number of particles. We evaluate the performance of the deep Jastrow factor for LiH and the hydrogen rectangle H_{4}. While these four-electron systems exhibit more intricate interactions, they are computationally lightweight such that the hyperparameter space of the deep *Ansätze* can be explored exhaustively. With multiple same-spin electrons, the spatial wavefunction is no longer node-less, and the single-determinant Slater–Jastrow *Ansatz* possesses a fixed-node error. Instead of comparing to exact energies, we, therefore, measure the performance of the Jastrow factor with respect to the fixed-node limit estimated from FN-DMC calculations and report the fixed-node correlation energy,

As the fixed-node correlation energy is defined for *Ansätze* with an identical nodal surface, the nodes of the FN-DMC benchmark have to be reconstructed. For the mean-field nodal surface, this implies starting from a HF computation with the same basis set.

For the H_{4} rectangle, we performed a scan of all the hyperparameters of the deep Jastrow factor, including those of the graph-convolutional neural-network architecture (Table V). The scan was a grid search that involved the training of 864 models, comprising models with all combinations of the hyperparameters in the vicinity of their default values. In order to reduce the dimensionality of the experiment, some hyperparameters were merged and varied together. Further details of the scan can be found in the caption of Fig. 8, which depicts the energies of all the model instances. The experiment aimed at obtaining a first impression of the hyperparameter space and revealed that the fixed-node limit can be approached by increasing the total number of trainable parameters. The experiment shows that the energy behaves smoothly with respect to changes in the hyperparameters, and there are no strong mutual dependencies between hyperparameters. Several important hyperparameters for systematically scaling the architecture can be identified, such as the depth of the neural network *η*_{θ} from (10), the number of interactions *L*, and the dimension of the convolutional kernel, referring to the dimension of the latent space where the interactions within the graph-convolutional neural network take place (Appendix).
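A grid search of this kind simply enumerates every combination of hyperparameter values. The hyperparameter names and value grids below are illustrative choices in the spirit of Table V, not the exact settings used in the scan, and `train_deep_jastrow` is a hypothetical training call.

```python
import itertools

grid = {
    "n_interactions": [1, 2, 3, 4],    # number of interactions L
    "kernel_dim": [32, 64, 128, 256],  # dimension of the conv. kernel
    "eta_depth": [1, 2, 3],            # depth of the eta network
}

# Every combination of hyperparameter values defines one model instance.
configs = [dict(zip(grid, values))
           for values in itertools.product(*grid.values())]

# Training-loop sketch (train_deep_jastrow is hypothetical):
# energies = {tuple(cfg.values()): train_deep_jastrow(cfg) for cfg in configs}
```

Merging hyperparameters, as done in the paper to reduce dimensionality, amounts to varying several grid axes in lockstep rather than taking their full Cartesian product.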

The results were used to perform a thorough investigation of the convergence behavior on LiH, varying a subset of hyperparameters and fixing the remaining hyperparameters at suitable values. We show a systematic convergence to the fixed-node limit with an increasing dimension of the convolutional kernel (DNN width) as well as the number of interactions *L* (Fig. 4). This is an indication that the deep Jastrow factor can be extended toward completeness in a computationally feasible way. The remaining fluctuations of the fixed-node correlation energy are caused by the stochasticity of the training and the sampling error of evaluating the energy of the wavefunction.

By evaluating the dipole moment of the LiH wavefunctions, we go beyond the energy and investigate the convergence of a property that the PauliNet *Ansatz* is not explicitly optimized for. We found that upon converging to the fixed-node limit with increasingly large models, the dipole moment approaches the coupled cluster reference (Fig. 4). Even though the energies of the LiH wavefunctions converge consistently, the convergence of the dipole moment is subject to fluctuations, which are particularly strong for the small models and decrease as the fixed-node limit is approached. This can be explained by degenerate energy minima of the *Ansatz* with respect to the parameters. Multiple solutions to the optimization problem can be present if the exact solution is outside the variational subspace. This ambiguity, however, decreases with increasing expressiveness of the trial wavefunction.

While the accuracy with respect to the total energy is an appropriate measure for expressiveness of the trial wavefunction *Ansatz*, in practice, relative energies are most often of interest. The capability of the full PauliNet *Ansatz* in computing relative energies has been previously demonstrated for the cyclobutadiene automerization,^{7} and the results with the deep Jastrow factor for the node-less H_{2} (Fig. 3) provide interaction energies at the level of FCI. Here, we want to study how the relative energy converges with an increasing expressiveness of the deep Jastrow factor and demonstrate a cancellation of errors at different geometries. This is a feature that makes relative energy calculations usually more accurate than total energy calculations and is very desirable for any quantum-chemistry method. We optimized increasingly expressive versions of the deep Jastrow factor for two geometries of the hydrogen rectangle (Table VII) and determined their relative energy (Fig. 5). In order to reduce the level of stochasticity in the training, we performed five independent optimization runs and used the ones with the lowest energy to calculate the relative energies. Both the total and relative energies converge to the DMC reference with increasing number of trainable parameters of the *Ansatz*. Furthermore, the relative energy fluctuates within 2 kcal/mol of the DMC reference for all models with more than two interactions and is well within 1 kcal/mol for the models with the largest DNN width. This demonstrates that the deep Jastrow factor can achieve similar accuracy for both geometries and exhibits a cancellation of errors. The stochasticity of the optimization, however, complicates the comparison of individual runs, which will be subject to further investigation. 
Looking at the energies of the different optimization runs, however, we found a decrease in the stochasticity of the final energy with increasing model size, which is convenient for practical purposes, where typically large models would be used. The difficulty of optimizing small models is well known in the context of training neural networks and tends to be alleviated by increasing the number of trainable parameters.^{27}

One of the essential properties of any proper electronic-structure method is size consistency. Traditional Jastrow factors are factorizable in the electronic and nuclear coordinates of two infinitely distant subsystems, which leads to exact size consistency for identical copies of a given system and to approximate size consistency for an assembly of different systems (because optimized parameters are then shared by different systems). In PauliNet, the embeddings **x**_{i} for two electrons at two distant subsystems are independent of each other by construction. Although the subsequent nonlinear transformation *η*_{θ} applied to the sum of the embeddings breaks exact factorizability, it could be restored by applying the transformation before summing the embeddings, which in numerical experiments does not affect performance. Regardless, in numerical experiments with two systems of non-interacting molecules (H_{2}–H_{2} and LiH–H_{2}), we show that even the variant of our *Ansatz* that is not exactly factorizable is size-consistent in practice (Table II). For the system composed of two distant hydrogen molecules, both the combined and individual calculations give nearly exact results, 99.99(1)% and 100.00(1)% of the correlation energy, respectively. In the second test with LiH and H_{2}, 99.65(2)% and 99.68(2)% of the correlation energy are achieved, respectively, which corresponds to a difference of less than 10% of the overall error of PauliNet with respect to the exact energy. The results, furthermore, show that optimizing the *Ansatz* for the combined system works similarly well as optimizing separate instances for the respective subsystems (Fig. 6).
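A size-consistency test of this kind reduces to checking that the energy of the combined system matches the sum of the subsystem energies within the combined stochastic error. The energies and error bars below are illustrative LiH–H_{2}-like values, not the numbers from Table II.

```python
import math

def is_size_consistent(e_ab, err_ab, e_a, err_a, e_b, err_b, n_sigma=3.0):
    """True if E(A+B) agrees with E(A) + E(B) within n_sigma error bars."""
    combined_err = math.sqrt(err_ab**2 + err_a**2 + err_b**2)
    return abs(e_ab - (e_a + e_b)) <= n_sigma * combined_err

# Hypothetical energies (hartree) with stochastic error bars.
ok = is_size_consistent(e_ab=-9.2438, err_ab=0.0002,
                        e_a=-8.0695, err_a=0.0001,
                        e_b=-1.1745, err_b=0.0001)
```

Adding the error bars in quadrature is appropriate because the three calculations are statistically independent.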

### D. Application of different levels of theory to H_{2}O

The results for the small test systems showed that DNNs can be used to converge to the CBS limit within the mean-field theory and that by adding correlation with a deep Jastrow factor, the fixed-node limit can be approached. To investigate how these *Ansätze* behave for larger systems, we evaluated the respective instances of PauliNet on the water molecule (Figs. 7 and 9). These experiments aim at demonstrating that the same *Ansätze* can be applied to a variety of systems without any modifications and test how much their respective accuracy decreases if the size of the graph-convolutional neural network is kept fixed. For the experiment, we chose four interactions and a kernel dimension of 256, which is equal to the large models from the H_{4} and LiH experiments. Due to the convolutional nature of the neural network, the number of trainable parameters is mostly independent of the number of electrons; hence, it is similar to the previous experiments.

We again start with the mean-field theory and consider the finite-basis-set error. We corrected the HF orbitals in the small 6-311G basis set with the deep orbital correction (Sec. II B), which recovered 90% of the finite-basis-set error. We then estimated how much of the finite-basis-set error amounts to the fixed-node error by applying the mean-field Jastrow factor (Sec. II D), which recovers only about half of the finite-basis-set error. This suggests that upon approaching the finite-basis-set limit, the nodal surface is altered significantly.

Next, we investigate single-determinant *Ansätze* with the full Jastrow factor (Sec. II C). We benchmark the deep Jastrow factor with a HF determinant in the Roos augmented double-zeta basis (Roos-aug-DZ-ANO),^{33} a basis set that is frequently used for calculations on H_{2}O and gives HF energies at the CBS limit. We compare to VMC and DMC results from the literature, achieving 97.2(1)% of the fixed-node correlation energy and surpassing the accuracy of previous VMC calculations with single-determinant Slater–Jastrow trial wavefunctions by half an order of magnitude (Table III).

**TABLE III.** H_{2}O energies (hartree) of single-determinant Slater–Jastrow-type (SD-SJ) wavefunctions and the fraction η_{FN} of recovered fixed-node correlation energy.

| Reference | HF | VMC (SD-SJ) | DMC (SD-SJ) | η_{FN} (%) | Basis set |
|---|---|---|---|---|---|
| PauliNet | −76.009 | −76.3923(7) | … | 91.2(2)^{a} | 6-311G |
| PauliNet | −76.0612 | −76.4096(7) | … | 96.0(2)^{a} | 6-311G + DNN |
| PauliNet | −76.0672 | −76.4139(5) | … | 97.2(1)^{a} | Roos-aug-DZ-ANO |
| Clark et al.^{29} | … | −76.3938(4) | −76.4236(2) | 91.6(1) | Roos-aug-TZ-ANO |
| Gurtubay and Needs^{30} | −76.0672 | −76.3773(2) | −76.42376(5) | 87.01(6) | Roos-aug-DZ-ANO |
| Gurtubay et al.^{31} | −76.0587 | −76.327(1) | −76.42102(4) | 73.5(3) | 6-311++G(2d,2p) |


^{a}The fixed-node correlation energy is computed with respect to the reference FN-DMC energy of Gurtubay and Needs.^{30}
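The η_{FN} values in Table III follow from the standard definition of the fraction of recovered fixed-node correlation energy, η_{FN} = (E_{VMC} − E_{HF})/(E_{DMC} − E_{HF}). A minimal sketch (function name is ours; reference energies, in hartree, from Table III):

```python
def eta_fn(e_vmc, e_hf=-76.0672, e_dmc=-76.42376):
    """Fraction of fixed-node correlation energy recovered by a VMC energy,
    relative to the CBS-limit HF energy and the FN-DMC reference of
    Gurtubay and Needs (energies in hartree)."""
    return (e_vmc - e_hf) / (e_dmc - e_hf)

print(f"{100 * eta_fn(-76.4139):.1f}%")  # PauliNet, Roos-aug-DZ-ANO: 97.2%
print(f"{100 * eta_fn(-76.4096):.1f}%")  # PauliNet, 6-311G + DNN: 96.0%
```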

In order to study how finite-basis-set errors manifest in both the HF and the many-body *Ansätze*, we computed the energies of the deep Jastrow factor with a HF determinant in a 6-311G basis. The results suggest that finite-basis-set errors of the HF calculation transfer directly to the many-body regime. In particular, the differences between the energies of the mean-field *Ansätze* match the differences between the respective Slater–Jastrow trial wavefunctions; that is, finite-basis-set errors in the energy are not altered by the many-body correlation.

We furthermore demonstrate that both methods can be combined by optimizing a trial wavefunction composed of a deep Jastrow factor and a Slater determinant of orbitals from an imprecise HF baseline that are modified by the orbital correction. The parameters of the Jastrow factor and of the orbital correction were optimized simultaneously, with the HF baseline computed in the small 6-311G basis set. With this setup, we achieved energies close to the fixed-node limit of the optimal mean-field nodal surface: starting from a minimal baseline, we recovered 96.0(2)% of the fixed-node correlation energy with respect to the Roos-aug-DZ-ANO basis.

Finally, we show that the full PauliNet *Ansatz* can go beyond the fixed-node approximation and train an instance with the same graph-convolutional architecture as in the previous experiments, but using the full backflow transformation. With this *Ansatz*, we obtained VMC energies of −76.4252(3) and −76.4281(3) hartree for the 6-311G and the Roos-aug-DZ-ANO basis sets, respectively, amounting to 96.67(8)% and 97.38(8)% of the total correlation energy. This energy is significantly below the single-determinant DMC results (Fig. 7), demonstrating energetically favorable changes in the nodal surface due to the backflow transformation. A comparison to traditional VMC results shows that the single-determinant version of PauliNet strongly improves on single-determinant Slater–Jastrow-backflow (SD-SJB) trial wavefunctions and that multi-determinant Slater–Jastrow (MD-SJ) wavefunctions need thousands of determinants to reach a similar accuracy (Table IV). It should be noted that, in principle, the accuracy of PauliNet can be improved further by increasing the size of the graph-convolutional neural-network architecture or by introducing multiple determinants. The comparison should, therefore, not be understood as final, but serves to give an impression of the capabilities of the PauliNet backflow. More exemplary calculations with the full PauliNet *Ansatz*, including multi-determinant *Ansätze*, have been carried out previously,^{7} and a more thorough investigation of the improvements in the nodal surface, as well as a benchmark of the computational complexity, will be conducted in future work.

**TABLE IV.** VMC energies (hartree) of H_{2}O for backflow and multi-determinant trial wavefunctions.

| Reference | Ansatz | # determinants | VMC |
|---|---|---|---|
| PauliNet | SD-SJB | 1 | −76.4281(3) |
| Gurtubay and Needs^{30} | SD-SJB | 1 | −76.4034(2) |
| Clark et al.^{29} | MD-SJ | 2316 | −76.4259(6) |
| Clark et al.^{29} | MD-SJ | 7425 | −76.4289(8) |

## IV. DISCUSSION

We have demonstrated that the choice of the architecture does not introduce fundamental limitations on the flexibility of the investigated components of PauliNet and that the accuracy can be improved systematically by increasing the number of trainable parameters in a suitable way. Near-exact energies for the corresponding level of theory can be obtained with both the deep orbital correction and the deep Jastrow factor. This highlights the generality and expressiveness of deep QMC: a single *Ansatz* without any problem-specific modifications can be applied to a variety of systems and extended systematically to improve the accuracy without introducing new components to the trial-wavefunction architecture. Though the results with the deep orbital correction and the deep Jastrow factor emphasize the potential of the deep QMC approach, the major benefit of deep QMC over FN-DMC calculations remains that it can go beyond the fixed-node approximation by faithfully representing the nodal surface upon introducing many-body correlation at the level of the orbitals. We have outlined this with an exemplary calculation on the water molecule, using a single-determinant instance of the full PauliNet *Ansatz*. The presented analysis paves the way for future investigations of how the full PauliNet *Ansatz* improves the nodes and overcomes the fixed-node limitations.

## ACKNOWLEDGMENTS

We acknowledge funding and support from the European Research Council (Grant No. ERC CoG 772230), the Berlin Mathematics Research Center MATH+ (Project Nos. AA2-8, EF1-2, and AA2-22), and the German Ministry for Education and Research (Berlin Institute for the Foundations of Learning and Data BIFOLD). J.H. would like to thank K.-R. Müller for support and acknowledges funding from TU Berlin.

## DATA AVAILABILITY

The data that support the findings of this study are openly available at http://doi.org/10.6084/m9.figshare.13077158.v3.

### APPENDIX: GRAPH-CONVOLUTIONAL NEURAL-NETWORK ARCHITECTURE

At the core of the PauliNet architecture, a graph-convolutional neural network generates a permutation-equivariant latent-space many-body representation of a given electron configuration. In the following, we give a short introduction to the network architecture, discussing both the general concept and its particular application in the context of PauliNet.

Graph neural networks are constructed to represent functions on graph domains^{34} and have become increasingly popular for modeling chemical systems, as they can be designed to comply with the symmetries of molecules.^{35} The graph-convolutional neural network of PauliNet is a modification of SchNet,^{36} an architecture developed to predict molecular properties from atom positions after being trained in a supervised setting, that is, by repeated exposure to known pairs of input and output data. In SchNet, a trainable embedding is assigned to each atom, which serves as an abstract representation of the atomic properties in a high-dimensional feature space and is successively updated to encode information about the atomic environment. The updates are implemented as convolutions over the graph of interatomic distances, which makes the architecture invariant to translation and rotation and equivariant with respect to the exchange of identical atoms. The graph convolutions, furthermore, implement parameter sharing across the edges, such that the number of network parameters does not depend on the number of interacting entities and is hence constant with system size. The final features are then used to predict the molecular properties.

In quantum chemistry, we model electrons and nuclei. At this level, a molecule can be represented as a complete graph, where nodes correspond to electrons and nuclei, and the distance between each pair of particles is assigned to the edge between their respective nodes. The graph-convolutional neural network at the core of PauliNet acts on this graph representation of the system. Similar to the SchNet implementation, a representation in an abstract feature space is assigned to each node by introducing electronic embeddings $\mathbf{X}_{\theta,s_i}$ and nuclear embeddings $\mathbf{Y}_{\theta,I}$, respectively. The embeddings are trainable arrays that are initialized randomly. As same-spin electrons are indistinguishable, they share representations and are initialized with a copy of the same electronic embedding,

$$\mathbf{x}_i^{(0)} = \mathbf{X}_{\theta,s_i}.\tag{A1}$$

The electronic embeddings constitute the latent-space representation from which the Jastrow factor and backflow transformation are obtained later in the evaluation of the trial wavefunction. In order to encode positional information of each electron with respect to the nuclei, as well as electronic many-body correlation, into the latent-space representation, the electronic embeddings are updated in an interaction process. Information is transmitted along the edges of the graph by exchanging messages that take the distances to the nuclei and the other electrons into account,

$$\mathbf{z}_i^{(n,\pm)}=\sum_{j,\,s_j=\pm}\mathbf{w}_\theta^{(n,\pm)}\big(\mathbf{e}(|\mathbf{r}_i-\mathbf{r}_j|)\big)\odot\mathbf{h}_\theta^{(n)}\big(\mathbf{x}_j^{(n)}\big),\qquad \mathbf{z}_i^{(n,\mathrm{n})}=\sum_{I}\mathbf{w}_\theta^{(n,\mathrm{n})}\big(\mathbf{e}(|\mathbf{r}_i-\mathbf{R}_I|)\big)\odot\mathbf{Y}_{\theta,I}.\tag{A2}$$

Here, the functions **w**_{θ} and **h**_{θ} are implemented by fully connected neural networks, **e** is an expansion of the distances in a basis of Gaussian functions, and ⊙ denotes element-wise multiplication. For each pair of interacting particles, the filter-generating function **w**_{θ} generates a mask that is applied to their respective embeddings; thereby, it moderates the interactions based on the distance between the particles. Because the messages of identical particles are summed, their overall contribution is invariant under the exchange of these particles. The transformation **h**_{θ} introduces additional flexibility to the architecture by separating the latent-space representation from the interaction space. The superscripts of the neural networks indicate that different functions are applied at each subsequent interaction and that the filter-generating functions for interactions with spin-up electrons, spin-down electrons, and nuclei differ. The distance expansion **e** is truncated with an envelope that ensures it is cuspless, that is, that all Gaussian features have a vanishing derivative at zero distance, and imposes a long-range cutoff. The final step in the interaction process is to update the electronic embeddings,

$$\mathbf{x}_i^{(n+1)}=\mathbf{x}_i^{(n)}+\sum_{\pm}\mathbf{g}_\theta^{(n,\pm)}\big(\mathbf{z}_i^{(n,\pm)}\big)+\mathbf{g}_\theta^{(n,\mathrm{n})}\big(\mathbf{z}_i^{(n,\mathrm{n})}\big).\tag{A3}$$

In this update, the messages are transformed from the interaction space to the embedding space and added to the original embedding. The transformation **g**_{θ} is again implemented by fully connected neural networks. The interaction process is repeated *L* times to successively encode increasingly complex many-body information. The continuous-filter convolutions over the molecular graph and the initialization of electrons with identical embeddings make the architecture equivariant with respect to the exchange of same-spin electrons,

$$\mathbf{x}_{\pi(i)}^{(L)}\big(\mathbf{r}_{\pi(1)},\ldots,\mathbf{r}_{\pi(N)}\big)=\mathbf{x}_{i}^{(L)}\big(\mathbf{r}_{1},\ldots,\mathbf{r}_{N}\big)\tag{A4}$$

for any permutation $\pi$ of same-spin electrons.

Overall, this yields a latent-space representation that can efficiently encode electronic many-body effects while intrinsically fulfilling the desired permutation equivariance. The hyperparameters of all components are collected in Table V.
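The interaction step described above can be illustrated with a small self-contained sketch. The single linear maps standing in for the trainable networks, the simplified distance expansion, and all names are ours; the sketch also checks that exchanging two electrons that start from identical embeddings permutes the output embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding and kernel dimension (256 in the large models)

# Stand-ins for the trainable networks w, h, g: single linear maps here,
# shared across all particle pairs (parameter sharing of the convolution).
W_w = rng.normal(size=(16, DIM))   # distance features -> kernel space
W_h = rng.normal(size=(DIM, DIM))  # embedding space -> kernel space
W_g = rng.normal(size=(DIM, DIM))  # kernel space -> embedding space

def dist_features(r):
    """Gaussian expansion e of all pairwise distances (simplified)."""
    d = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    centers = np.linspace(0.0, 4.0, 16)
    return np.exp(-((d[..., None] - centers) ** 2))  # shape (N, N, 16)

def interact(x, r):
    """One convolution: x_i <- x_i + g(sum_j w(e_ij) * h(x_j))."""
    w = dist_features(r) @ W_w        # (N, N, DIM) distance-based filters
    msgs = w * (x @ W_h)[None, :, :]  # message from electron j to i
    msgs[np.eye(len(x), dtype=bool)] = 0.0  # no self-interaction
    z = msgs.sum(axis=1)              # sum-pooling over neighbors
    return x + z @ W_g

# Identical electrons share the initial embedding ...
x0 = np.tile(rng.normal(size=DIM), (4, 1))
r = rng.normal(size=(4, 3))

# ... so exchanging two electrons permutes the output embeddings.
perm = [1, 0, 2, 3]
assert np.allclose(interact(x0, r)[perm], interact(x0, r[perm]))
```

Summing the masked messages over neighbors is what makes each update invariant under the exchange of the other identical particles, which, together with the shared initial embeddings, gives the equivariance of Eq. (A4).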

**TABLE V.** Components of the graph-convolutional neural network and their hyperparameters.

| Component | Type | Hyperparameter |
|---|---|---|
| $\mathbf{X}_{\theta,s_i}$—electronic embedding | Trainable array | Embedding dimension |
| $\mathbf{Y}_{\theta,I}$—nuclear embeddings | Trainable array | Kernel dimension |
| **e**—distance expansion | Fixed function | # distance features |
| **w**_{θ}—filter-generating function | DNN | (# distance features → kernel dimension), depth |
| **h**_{θ}—transformation embedding to kernel space | DNN | (Embedding dimension → kernel dimension), depth |
| **g**_{θ}—transformation kernel to embedding space | DNN | (Kernel dimension → embedding dimension), depth |
| η_{θ}—Jastrow network | DNN | (Embedding dimension → 1), depth |
| Full architecture | … | # interactions L |


**TABLE VI.** Hyperparameters used in the calculations.

| Hyperparameter | Value |
|---|---|
| One-electron basis | 6-31G |
| Dimension of e (# distance features) | 16 |
| Dimension of x_{i} (embedding dimension) | 128 |
| Number of interaction layers L | 4 |
| Number of layers in η_{θ} | 3 |
| Number of layers in w_{θ} | 1 |
| Number of layers in h_{θ} | 2 |
| Number of layers in g_{θ} | 2 |
| Batch size | 2000 |
| Number of walkers | 2000 |
| Number of training steps | H_{4}: 5000; H_{2}: 10 000; He: 10 000; Be: 10 000; LiH: 10 000; H_{2}O: see Fig. 9 |
| Optimizer | AdamW |
| Learning rate scheduler | CyclicLR |
| Minimum/maximum learning rate | 0.0001/0.01 |
| Clipping window q | 5 |
| Epoch size | 100 |
| Number of decorrelation sampling steps | 4 |
| Target acceptance | 57% |


**TABLE VII.** Molecular geometries used in the calculations.

| Molecule | Atom | Position (Å) |
|---|---|---|
| LiH | Li | (0.000, 0.000, 0.000) |
| | H | (1.595, 0.000, 0.000) |
| H_{4} square | H | (−0.635, −0.635, 0.000) |
| | H | (−0.635, 0.635, 0.000) |
| | H | (0.635, −0.635, 0.000) |
| | H | (0.635, 0.635, 0.000) |
| H_{4} deformed | H | (−0.900, −0.635, 0.000) |
| | H | (−0.900, 0.635, 0.000) |
| | H | (0.900, −0.635, 0.000) |
| | H | (0.900, 0.635, 0.000) |
| H_{2}O | O | (0.00000, 0.00000, 0.00000) |
| | H | (0.75695, 0.58588, 0.00000) |
| | H | (−0.75695, 0.58588, 0.00000) |


A single-particle variant of the graph-convolutional neural-network architecture can be obtained by considering only interactions along edges between electrons and nuclei; that is, the overall architecture of the network remains identical, but the electronic updates $\mathbf{z}_i^{(n,\pm)}$ in (A3) are removed. Since the convolution over the electronic distances is the only interaction between electrons, the final embeddings do not contain any many-body correlation.