Computing accurate yet efficient approximations to the solutions of the electronic Schrödinger equation has been a paramount challenge of computational chemistry for decades. Quantum Monte Carlo methods are a promising avenue of development as their core algorithm exhibits a number of favorable properties: it is highly parallel and scales favorably with the considered system size, with an accuracy that is limited only by the choice of the wave function *Ansatz*. The recently introduced machine-learned parametrizations of quantum Monte Carlo *Ansätze* rely on the efficiency of neural networks as universal function approximators to achieve state-of-the-art accuracy on a variety of molecular systems. With interest in the field growing rapidly, there is a clear need for easy-to-use, modular, and extensible software libraries facilitating the development and adoption of this new class of methods. In this contribution, the DeepQMC program package is introduced, in an attempt to provide a common framework for future investigations by unifying many of the currently available deep-learning quantum Monte Carlo architectures. Furthermore, the manuscript provides a brief introduction to the methodology of variational quantum Monte Carlo in real space, highlights some technical challenges of optimizing neural network wave functions, and presents example black-box applications of the program package. We thereby intend to make this novel field accessible to a broader class of practitioners from both the quantum chemistry and the machine learning communities.

## I. INTRODUCTION

Recently, the application of machine learning to a wide range of problems from the natural sciences has proven to be highly successful. Computational chemistry is a field of particular activity: machine-learned force fields model complicated quantum mechanical effects at the resolution of atoms, while machine-learned functionals elevate density functional theory to unprecedented accuracy.^{1–3} These approaches utilize supervised training to learn from accurate quantum mechanical reference calculations and make predictions for unseen configurations. While this results in fast yet accurate approximations to the quantum many-body problem, it inherently depends on high-quality training data, which represents a major bottleneck of these methods.

An orthogonal way to incorporate machine learning into computational chemistry is its application to improve *ab initio* calculations. Notably, over the course of the past years, a new family of deep-learning quantum Monte Carlo (deep QMC) methods has developed, incorporating advancements from the field of machine learning.^{4} Common to the deep QMC methods is the utilization of neural networks to parameterize highly expressive *Ansätze*, efficiently approximating the solutions of the time-independent electronic Schrödinger equation, thereby providing a complete description of the system’s electronic properties. Originating from spin lattices,^{5} deep-learning *Ansätze* were soon applied to molecules in real-space.^{6} With the development of PauliNet^{7} and FermiNet,^{8} the accuracy of neural-network wave functions became the state of the art within variational Monte Carlo. Subsequent studies have further increased the accuracy of these *Ansätze*,^{9–12} extended them to the simulation of excited states^{13} as well as periodic systems,^{14,15} combined them with pseudopotentials,^{16} used them in the calculation of interatomic forces,^{17} utilized them in diffusion Monte Carlo simulations,^{18,19} and extended them to share parameters across multiple molecular geometries^{20–22} or distinct molecules.^{23,24}

Although the method of optimizing deep-learning wave function *Ansätze* using variational quantum Monte Carlo was developed only a few years ago, it already competes with some of the most accurate traditional quantum chemistry methods on molecules with up to ∼100 electrons. Exhibiting competitive scaling with the number of electrons, it has the potential to be extended to larger systems in the near future. Achieving this will no doubt require further method development as well as efficient implementations of the core algorithms, creating the need for open source libraries that facilitate experimentation and contribution from the community.

Accompanying the above-summarized research, various software libraries for variational optimization of deep-learning wave functions have been released.^{25–28} While NetKet^{25} provides a general implementation of variational optimization of machine learning wave functions mainly for lattice systems with recent extensions to continuous space, the research for molecular machine learning wave functions was carried out across various repositories and is lacking a unified framework. The presented DeepQMC program package aims to provide a unified implementation of the developments in the field of deep-learning molecular wave functions. It intends to be easy to use out of the box while allowing full control and flexibility over the *Ansatz* and variational training for advanced users. The library is designed to be modular, facilitating the rapid development and testing of individual components, and easing the implementation of new features. It makes use of the composable function transformations and just-in-time compilation provided by the JAX library^{29} to express performant GPU accelerated algorithms using concise Python^{30} syntax. Neural network models are encapsulated in higher-level objects, using the haiku deep-learning library.^{31} The project is open-source and distributed online under the MIT license.^{26}

## II. THEORY

### A. The electronic structure problem

Consider a molecule composed of $N$ electrons and $M$ nuclei. Within the Born–Oppenheimer approximation, its electronic Hamiltonian reads (in atomic units)

$$\hat{H} = -\frac{1}{2}\sum_i \nabla_i^2 + \sum_{i<j}\frac{1}{|\mathbf{r}_i-\mathbf{r}_j|} - \sum_{i,I}\frac{Z_I}{|\mathbf{r}_i-\mathbf{R}_I|} + \sum_{I<J}\frac{Z_I Z_J}{|\mathbf{R}_I-\mathbf{R}_J|}, \tag{1}$$

where $\mathbf{r}_i$ denotes the position of the $i$th electron, while $Z_I$ and $\mathbf{R}_I$ are the charge and position of the $I$th nucleus, respectively. To determine the electronic structure of these molecular systems, one must solve the associated time-independent Schrödinger equation,

$$\hat{H}\psi(\mathbf{x}_1,\ldots,\mathbf{x}_N) = E\,\psi(\mathbf{x}_1,\ldots,\mathbf{x}_N),$$

where the $\mathbf{x}_i = (\mathbf{r}_i, \sigma_i)$ comprise the positions of the electrons and their spins. A solution is an eigenfunction of the Hamiltonian, the electronic wave function $\psi$, and its corresponding energy eigenvalue $E$. With the exact wave function at hand, any observable electronic property of the system can, in principle, be computed, as the wave function gives a complete description of the system's electronic state. Since electrons have half-integer spin, their wave functions must be antisymmetric with respect to electron exchange,

$$\psi(\ldots,\mathbf{x}_i,\ldots,\mathbf{x}_j,\ldots) = -\psi(\ldots,\mathbf{x}_j,\ldots,\mathbf{x}_i,\ldots).$$

### B. Variational optimization

Variational quantum Monte Carlo (VMC) optimizes the parameters $\boldsymbol\theta$ of a trial wave function (*Ansatz*) $\psi_\theta$ to minimize the expectation value of the Hamiltonian $\hat{H}$,

$$E[\psi_\theta] = \frac{\langle\psi_\theta|\hat{H}|\psi_\theta\rangle}{\langle\psi_\theta|\psi_\theta\rangle} = \mathbb{E}_{\mathbf{r}\sim|\psi_\theta|^2}\big[E_\text{loc}[\psi_\theta](\mathbf{r})\big] \geq E_0, \tag{6}$$

where $E_\text{loc}[\psi_\theta](\mathbf{r}) = \hat{H}\psi_\theta(\mathbf{r})/\psi_\theta(\mathbf{r})$ is the local energy, and the variational principle guarantees that the expectation value bounds the exact ground-state energy $E_0$ from above for any *Ansatz* $\psi_\theta$. In real space, it is common practice to work with spin-assigned wave functions $\psi(\mathbf{r}_1,\ldots,\mathbf{r}_N)$, where fixed spins are assigned to the electrons and spin-up and spin-down electrons are treated as distinguishable.^{32} The convention is to sort the electrons by spin and consider the first $N^\uparrow$ electrons to have spin-up and the remaining $N^\downarrow = N - N^\uparrow$ electrons to possess spin-down.

In practice, a VMC simulation then consists of choosing an *Ansatz* (see Sec. III) and optimizing it in an alternating scheme of sampling and parameter updates. The expectation value in Eq. (6) is approximated by sampling the probability density given by the square of the wave function (see Sec. V), followed by a parameter update using the gradient of the expectation value (see Sec. IV).
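This alternating scheme of sampling and parameter updates can be illustrated with a minimal toy example. The sketch below applies it to the 1D harmonic oscillator with the trial wave function $\psi_a(x) = \mathrm{e}^{-ax^2/2}$; the system, step sizes, and iteration counts are illustrative choices, not DeepQMC defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy VMC for H = -1/2 d^2/dx^2 + 1/2 x^2 with psi_a(x) = exp(-a x^2 / 2).
# The local energy is analytic, E_loc(x) = a/2 + (1 - a^2) x^2 / 2,
# and the exact ground state corresponds to a = 1 with E = 1/2.

def sample(a, n_walkers=2000, n_steps=50, step=1.0):
    """Random-walk Metropolis sampling of |psi_a|^2 ~ exp(-a x^2)."""
    x = rng.normal(size=n_walkers)
    for _ in range(n_steps):
        prop = x + step * rng.normal(size=n_walkers)
        accept = rng.random(n_walkers) < np.exp(-a * (prop**2 - x**2))
        x = np.where(accept, prop, x)
    return x

a = 0.4
for _ in range(200):
    x = sample(a)                                      # sampling step
    e_loc = a / 2 + (1 - a**2) * x**2 / 2
    # gradient of the energy expectation value with respect to a,
    # using d(log psi_a)/da = -x^2 / 2 (cf. Sec. IV)
    grad = 2 * np.mean((e_loc - e_loc.mean()) * (-(x**2) / 2))
    a -= 0.1 * grad                                    # parameter update
```

After a few hundred iterations, the width parameter approaches the exact value $a = 1$ and the sampled energy approaches $E = 1/2$.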

### C. Neural network wave functions

The choice of the *Ansatz* is crucial for the efficiency of a VMC simulation. Recently, neural network parametrizations of real-space molecular wave functions have been introduced with PauliNet^{7} and FermiNet.^{8} They both rely on generalized Slater determinants, which augment the single-particle orbitals of conventional Slater determinants with many-body correlation [Eqs. (7) and (8)], such that each orbital depends on the sets of all spin-up electrons $\{\mathbf{r}^\uparrow\}$ and all spin-down electrons $\{\mathbf{r}^\downarrow\}$, respectively. The *Ansatz* may be a linear combination of multiple generalized Slater determinants, which are distinguished by the $p$ index. The form of $\varphi_k^p$ in Eq. (8) is closely related to the backflow transformation,^{33} which introduces quasi-particles to obtain many-body orbital functions. The key observation motivating this augmentation is that the antisymmetry of Slater determinants constructed from many-body orbitals is preserved as long as the orbital functions are equivariant under the exchange of electrons $i$ and $j$ [Eq. (9)].
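This connection between orbital equivariance and antisymmetry of the determinant can be checked numerically. The sketch below uses toy "many-body" orbitals (not any actual *Ansatz*) that couple each electron to a permutation-invariant summary of the whole electron set:

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(size=(4, 3))  # positions of four same-spin electrons

def orbitals(r):
    # Toy many-body orbitals: each matrix entry couples one electron to a
    # permutation-invariant summary of all electrons, so the rows of the
    # orbital matrix are equivariant under electron exchange.
    summary = np.linalg.norm(r, axis=1).sum()
    return np.stack(
        [r[:, 0], r[:, 1] * summary, r[:, 2], (r**2).sum(axis=1)], axis=1
    )

# Exchanging electrons 0 and 1 permutes two rows of the orbital matrix,
# so the determinant (the wave function) changes sign.
d = np.linalg.det(orbitals(r))
d_swapped = np.linalg.det(orbitals(r[[1, 0, 2, 3]]))
```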

Most of the currently used deep-learning molecular wave functions^{7,8,11} share the functional form of Eqs. (7) and (8) and differ only in the parametrization of the many-body orbitals $\varphi_k^p$ and single-particle envelopes $\phi_k^p$. DeepQMC aims to provide a general framework for variational optimization of deep-learning molecular wave functions, facilitating the investigation of the design space spanned by the PauliNet, FermiNet, and DeepErwin neural network *Ansätze*.

### D. Pseudopotentials

Despite the favorable asymptotic scaling of VMC with the number of electrons, systems containing heavy nuclei remain challenging for a variety of reasons. The high energy of electrons near these nuclei complicates simulations by spoiling the optimization and reducing the effectiveness of Markov chain Monte Carlo (MCMC) sampling. Furthermore, the kinetic energy of these electrons reaches the relativistic regime, requiring the treatment of relativistic effects that are not accounted for in the standard non-relativistic molecular Hamiltonian of Eq. (1). On the other hand, while the core regions of highly charged nuclei contribute dominantly to the total energy, they are typically unchanged during chemical processes and thus have little effect on the chemically relevant relative energies. Therefore, most quantum chemistry methods targeted at computing relative energies alleviate the above-outlined difficulties by treating the outer (valence) electrons separately from the inner (core) electrons.

DeepQMC implements the pseudopotential approach for deep-learning wave functions introduced by Li *et al.*^{16} In this method, the core electrons are excluded from the explicit calculation and replaced by additional terms in the Hamiltonian, to simulate their influence on the remaining $N_\text{v}$ valence electrons. In the modified Hamiltonian [Eqs. (10) and (11)], $V^I$ and $W_l^I$ are sets of scalar functions describing the local and non-local pseudopotential contributions, while $\hat{P}_l^{iI} = \sum_{m=-l}^{l} |lm\rangle_{iI}\langle lm|_{iI}$ is a projection operator of the $i$th electron on spherical harmonics centered on the $I$th nucleus. To evaluate the contribution of the non-local part of the pseudopotential [second term of Eq. (11)], one considers integrals over the sphere around the nucleus, where $Y_{lm}$ is a spherical harmonic and $(r_{iI}, \mathbf{\Omega}_{iI})$ denotes the position vector of the $i$th electron $\mathbf{r}_i$, expressed in spherical coordinates centered on nucleus $I$. Following the first implementation of pseudopotentials for deep-learning molecular wave functions by Li *et al.*,^{16} this integral is approximated using a 12-point icosahedral quadrature.
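The nodes of such a 12-point rule are the vertices of a regular icosahedron; averaging over them integrates spherical harmonics exactly up to degree 5. The following sketch (illustrative, not the DeepQMC implementation) verifies this for a degree-2 harmonic, whose spherical average vanishes:

```python
import numpy as np

# 12 vertices of a regular icosahedron: cyclic permutations of (0, ±1, ±phi)
phi = (1 + np.sqrt(5)) / 2
V = np.array(
    [(0, s1, s2 * phi) for s1 in (1, -1) for s2 in (1, -1)]
    + [(s1, s2 * phi, 0) for s1 in (1, -1) for s2 in (1, -1)]
    + [(s1 * phi, 0, s2) for s1 in (1, -1) for s2 in (1, -1)],
    dtype=float,
)
V /= np.linalg.norm(V, axis=1, keepdims=True)  # project onto the unit sphere

# The spherical average of the l = 2 harmonic 3z^2 - 1 is zero;
# the 12-point icosahedral rule reproduces this exactly.
avg = np.mean(3 * V[:, 2] ** 2 - 1)
```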

The scalar functions $V^I$ and $W_l^I$ are typically pre-computed by expanding them in a Gaussian basis and fitting the expansion parameters directly to a database of reference energies. The DeepQMC program package currently includes the widely used BFD^{34} and the most recent ccECP^{35} pseudopotentials, with an application of the latter presented in Sec. VIII B.

## III. WAVE FUNCTION DESIGN SPACE

DeepQMC implements a variety of options to obtain the equivariant many-body orbitals $\varphi_k^p$ and the accompanying envelopes $\phi_k^p$, covering PauliNet, FermiNet, DeepErwin, and their derivatives. In the following, the main architectural concepts of these wave function *Ansätze* will be described in more detail. For ease of use, DeepQMC provides predefined configuration files to obtain the above-mentioned *Ansätze* while allowing their interpolation through a manual choice of hyperparameters.

### A. Graph neural networks

Central to the neural network wave function *Ansätze* is the computation of equivariant many-body embedding vectors for the electrons, which are used downstream to obtain the entries of the generalized Slater determinant. Many strategies of obtaining these embeddings can be unified in the framework of graph neural networks (GNNs). GNNs are well suited to model functions on graphs that exhibit certain permutational symmetries and can be adapted to describe electrons of molecules in real-space.

An electronic configuration of a molecule can be encoded as a graph, where the nodes are electrons and nuclei and the connecting edges carry pairwise features, e.g., difference vectors. GNNs are functions of these graphs, yielding high-dimensional latent space embeddings of the nodes. The electronic node embeddings are initialized with single-electron features and iteratively updated to incorporate many-body information of the electronic environment. Using graph convolutions, the updates are invariant under the exchange of electrons in the environment and the conditions of Eq. (9) are fulfilled.
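A minimal sketch of such a permutation-equivariant graph convolution over electron positions is given below; the shapes, nonlinearities, and weight names are purely illustrative and do not correspond to the DeepQMC architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
W_msg = rng.normal(size=(4, dim)) / 2     # edge features -> message filter
W_upd = rng.normal(size=(dim, dim)) / dim  # embedding update weights

def update(h, r):
    """One graph convolution: h (n, dim) embeddings, r (n, 3) positions."""
    diff = r[:, None, :] - r[None, :, :]
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)
    edges = np.concatenate([diff, dist], axis=-1)  # (n, n, 4) pair features
    w = np.tanh(edges @ W_msg)                     # vanishes on the diagonal
    msg = (w * h[None, :, :]).sum(axis=1)          # sum over sending electrons
    return np.tanh(h + msg @ W_upd)

n = 5
h, r = rng.normal(size=(n, dim)), rng.normal(size=(n, 3))
perm = rng.permutation(n)
# permuting the electrons permutes the updated embeddings the same way
equivariant = np.allclose(update(h, r)[perm], update(h[perm], r[perm]))
```

Because the messages are summed over the environment, the update is invariant under permutations of the surrounding electrons and equivariant under permutations of the full electron set, as required by Eq. (9).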

The most relevant aspects of the GNN architecture implemented in DeepQMC are sketched on the right pane of Fig. 1 and are discussed in detail in the following. Electron positions (spins) are denoted with **r** (*σ*), while **R** and *Z* indicate nuclear positions and charges. Node and edge quantities are denoted with the superscripts (n) and (e), respectively. Furthermore, *l* indexes the GNN interaction layers, **θ** denotes functions parameterized by MLPs, and *t* runs over the different edge types (those between electrons of identical or opposite spins, or between electrons and nuclei), node types (electron or nuclei nodes), or message types. Finally, vertical brackets indicate the different options implemented in DeepQMC for the computation of various quantities.

#### 1. The graph representation

A graph is a natural way to denote the electronic configuration of a molecule in real-space. The nodes of the graph represent particles (electrons and nuclei), carrying information such as spin or nuclear charge. The edges support the difference vectors between the particles, resulting in a representation invariant under global translation. Note that using internal coordinates that are invariant under global rotation may not be sufficient to represent all wave functions (simple counterexamples are atomic wave functions with P symmetry) and can only be employed with a careful treatment of the spatial symmetries.

To implement a variety of wave function *Ansätze*, DeepQMC provides configuration options to define the specifics of the graph construction outlined above. Most importantly, the nodes corresponding to nuclei and their respective nuclei–electron edges can optionally be excluded from the graph. In this case, electron–nuclei information can still be introduced to the GNN, by initializing the electron embeddings using a concatenation of the difference vectors between the positions of the given electron and all nuclei [see the second case of Eq. (13)].

#### 2. Node features

#### 3. Edge features

The edges of the graph carry the difference vectors between the particle positions ($\mathbf{r}_{ij}$), and the edge embeddings are consequently initialized as functions of these difference vectors.

#### 4. Message generation

The messages are constructed from the node and edge features. The superscript $t^{(\mathrm{n})}$ on the node features specifies the subset of sending nodes, and the superscript $t^{(\mathrm{e})}$ on the edge features depends on the types of the sending and the receiving nodes, respectively. The choice of how to distinguish electron–electron messages based on their spin (relative spin of sending and receiving electrons, spin of sending electron, or no distinction between messages from spin-up and spin-down electrons) is another hyperparameter of the GNN. Optionally, the sum over the edges can be normalized by dividing it by the number of edges. Note that messages depending only on node (edge) information can be obtained by setting the function $w_\theta^{l,t^{(\mathrm{e})}}$ ($h_\theta^{l,t^{(\mathrm{n})}}$) to return the identity. The superscript $t^{(\mathrm{m})}$ runs over all the constructed messages, which may include different choices of $w_\theta^{l,t^{(\mathrm{e})}}$ and $h_\theta^{l,t^{(\mathrm{n})}}$.

#### 5. Electron embedding update

The electron embeddings are updated using the aggregated messages. The update for message type $t^{(\mathrm{m})}$ can include a residual connection $f_i^{(\mathrm{n}),l}$ such that the trainable self-interaction of FermiNet and DeepErwin can be reproduced.

In the above-outlined general GNN framework, a wide variety of *Ansätze* can be obtained. Furthermore, the implementation of DeepQMC and its GNN framework focus on facilitating rapid extensions with new *Ansatz* variants either by exploration within the existing hyperparameter space or by extending it with new features.

### B. Orbital construction

The many-body orbitals $\varphi_k^p$ are obtained from the final electron embeddings, where $\kappa_\theta$ is an MLP applied electronwise, projecting the embedding dimension to the required number of orbitals. For the $\phi_k^p$ envelopes, DeepQMC implements linear combinations of exponentials centered on the nuclei,

$$\phi_k^p(\mathbf{r}_i) = \sum_{\beta_I} c^p_{k\beta_I}\,\exp\!\big(-\big|\boldsymbol{\Sigma}^p_{k\beta_I}(\mathbf{r}_i - \mathbf{R}_I)\big|^\alpha\big),$$

where $\beta_I$ indexes the basis functions centered on atom $I$. The hyperparameter $\alpha \in \{1, 2\}$ represents the choice of Slater-type orbitals with $\alpha = 1$ and Gaussian-type orbitals (GTOs) with $\alpha = 2$. DeepQMC provides an option to restrict the envelopes to be isotropic ($\boldsymbol{\Sigma}^p_{k\beta_I} \coloneqq \sigma^p_{k\beta_I}\cdot\mathbf{I}$). The GTOs can be initialized from the molecular orbital coefficients of reference solutions with standard quantum chemistry basis sets obtained in PySCF.^{36}
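As a sketch, an isotropic envelope of this kind might be written as follows; the function name, argument layout, and width parametrization are illustrative assumptions, not the DeepQMC API:

```python
import numpy as np

def envelope(r, R, coeff, sigma, alpha=1):
    """Isotropic exponential envelope centered on the nuclei.

    alpha=1 yields Slater-type decay, alpha=2 Gaussian-type decay.
    r: (3,) electron position; R: (M, 3) nuclear positions;
    coeff, sigma: (M,) linear coefficients and inverse widths.
    """
    d = np.linalg.norm(r - R, axis=-1)
    return np.sum(coeff * np.exp(-((sigma * d) ** alpha)))

R = np.array([[0.0, 0.0, 0.0], [1.4, 0.0, 0.0]])  # an H2-like geometry
c, s = np.ones(2), np.ones(2)
near = envelope(np.array([0.1, 0.0, 0.0]), R, c, s)
far = envelope(np.array([10.0, 0.0, 0.0]), R, c, s)
```

The envelope decays away from the nuclei and thereby enforces the correct asymptotic vanishing of the orbitals; the Gaussian variant decays faster than the Slater variant at large distances.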

### C. Determinant construction

Beyond spin-factorized determinants, DeepQMC also implements the full determinant introduced by Lin *et al.*,^{10} which constructs a single determinant using both spin-up and spin-down electrons. Since spin-assigned wave functions need not change sign under the exchange of electrons with opposite spins, the *Ansatz* is still not antisymmetric with respect to these permutations. Instead, a full determinant can practically be understood as an expansion in multiple spin-factorized determinants, e.g., by relying on the generalized Laplace expansion of determinants to expand $\det[\mathbf{A}^{\upharpoonleft\downharpoonright,p}(\mathbf{r})]$ according to the rows corresponding to spin-up electrons,

$$\det[\mathbf{A}^{\upharpoonleft\downharpoonright,p}(\mathbf{r})] = \sum_{S} \varepsilon_S \det[\mathbf{A}^{\uparrow,S}(\mathbf{r})]\,\det[\mathbf{A}^{\downarrow,\bar{S}}(\mathbf{r})], \tag{23}$$

where $S$ runs over all subsets of the orbitals that contain as many elements as the number of spin-up electrons, $\bar{S}$ stands for the complement subset of $S$, $\mathbf{A}^{\uparrow,S}(\mathbf{r})$ denotes the submatrix of $\mathbf{A}^{\upharpoonleft\downharpoonright}(\mathbf{r})$ formed from the orbitals in $S$ and the spin-up electrons, and $\varepsilon_S$ is the sign of the permutation defined by the subset $S$. For the block-diagonal matrices of Eq. (21), the determinants for all subsets of spin-up orbitals containing off-diagonal elements evaluate to zero, and the sum in Eq. (23) reduces to a single product of a spin-up and a spin-down determinant. Note that since the many-body orbitals defined in Eq. (19) are not equivariant with respect to exchanges of electrons with opposite spins, the terms on the right-hand side of Eq. (23) with different $S$ will, in general, be unrelated. In practice, it is conceivable that due to the concrete form of the parametrization of the many-body orbitals, there remains some structure in the set of factorized determinants, which makes the full determinant more effective than using an equivalent number of spin-factorized determinants formed from independent orbitals.
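The generalized Laplace expansion of Eq. (23) can be verified numerically for a small matrix; the helper below is purely illustrative:

```python
import numpy as np
from itertools import combinations

def laplace_expansion(A, k):
    """det(A) via generalized Laplace expansion along the first k rows."""
    n = len(A)
    total = 0.0
    for S in combinations(range(n), k):
        comp = [j for j in range(n) if j not in S]
        # parity of the column subset (0-indexed rows contribute k(k-1)/2)
        sign = (-1.0) ** (sum(S) + k * (k - 1) // 2)
        total += sign * np.linalg.det(A[:k, list(S)]) * np.linalg.det(A[k:, comp])
    return total

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))  # 2 "spin-up" rows and 2 "spin-down" rows
full = np.linalg.det(A)

# block-diagonal case: only S = {0, 1} contributes, recovering the
# familiar product of a spin-up and a spin-down determinant
B = A.copy()
B[:2, 2:] = 0.0
B[2:, :2] = 0.0
```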

### D. Jastrow factor and cusp correction

Further many-body correlation can be introduced through a symmetric Jastrow factor multiplying the *Ansatz*. DeepQMC implements a learnable Jastrow factor $\mathrm{e}^{J}$, where $J$ is computed from the permutation-invariant sum of the many-body electron embeddings, with $\eta_\theta$ again being implemented by an MLP. Furthermore, DeepQMC provides a fixed Jastrow factor that implements the known asymptotic behavior of the wave function^{37} when two electrons approach each other, where $c_{ij}$ is $\tfrac{1}{4}$ if the electrons $i$ and $j$ are of the same spin and $\tfrac{1}{2}$ if the electrons possess opposite spins, and the hyperparameter $\alpha$ scales the width of the correction term. If cuspless Gaussian envelopes are used, a similar cusp correction can be employed for the nuclei, following Hermann *et al.*^{7}

### E. Log-representation of the wave function

Since the terms of the wave function span many orders of magnitude, DeepQMC represents the wave function in the log domain for numerical stability. To compute the logarithm of the sum of the determinants $\phi^p$, we apply the log-sum-exp trick, shifting all terms by the largest exponent before exponentiation and summation.
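A sketch of this trick, extended with the sign bookkeeping needed when the summed determinants can be negative (an illustrative helper, not the DeepQMC API):

```python
import numpy as np

def signed_logsumexp(log_abs, signs):
    """Stable evaluation of log|sum_p s_p exp(a_p)| and the overall sign,
    for terms stored as (sign, log-magnitude) pairs."""
    a_max = np.max(log_abs)
    # all shifted exponentials are <= 1, so nothing overflows
    total = np.sum(signs * np.exp(log_abs - a_max))
    return a_max + np.log(np.abs(total)), np.sign(total)

log_abs = np.array([0.0, 1.0, 2.0])
signs = np.array([1.0, -1.0, 1.0])
log_val, sign = signed_logsumexp(log_abs, signs)
```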

## IV. TRAINING

In this section, some technical aspects of the variational optimization of deep-learning trial wave functions are discussed. While these *Ansätze* are trained within the standard VMC framework, the characteristics of their optimization differ markedly from those of other VMC *Ansätze*, mainly due to the greatly increased parameter count introduced by neural networks. On the other hand, it is also distinct from most other deep-learning settings owing to the unusual complexity of the loss function and the self-supervised setting, where the training data are generated in parallel to the optimization.

### A. Loss function and gradient trick

The variational principle instructs us to optimize the *Ansatz* to minimize the expectation value of the local energies. From a machine learning perspective, this translates to considering the loss function

$$\mathcal{L}(\boldsymbol\theta) = \mathbb{E}_{\mathbf{r}\sim|\psi_\theta|^2}\big[E_\text{loc}[\psi_\theta](\mathbf{r})\big]. \tag{29}$$

A naive computation of the gradient of this loss would differentiate the local energies $E_\text{loc}[\psi_\theta]$ with respect to the *Ansatz* parameters $\boldsymbol\theta$. However, evaluating the local energy already involves second derivatives of the wave function with respect to the electron coordinates due to the Laplacian in Eq. (1). Consequently, this naive gradient computation would necessitate taking mixed third derivatives of the *Ansatz*. Instead, DeepQMC uses the Monte Carlo gradient estimator

$$\nabla_{\boldsymbol\theta}\mathcal{L} = 2\,\mathbb{E}_{\mathbf{r}\sim|\psi_\theta|^2}\Big[\big(E_\text{loc}[\psi_\theta](\mathbf{r}) - \mathbb{E}\big[E_\text{loc}[\psi_\theta]\big]\big)\,\nabla_{\boldsymbol\theta}\ln|\psi_\theta(\mathbf{r})|\Big], \tag{30}$$

as proposed in Ref. 38. It replaces the derivatives of the local energy with a simple gradient of the wave function; therefore, it is expected to be more efficient and numerically stable to evaluate than the direct gradient computation.

### B. Local energy evaluation

Computing the local energy $E_\text{loc}[\psi_\theta](\mathbf{r}) = \hat{H}\psi_\theta(\mathbf{r})/\psi_\theta(\mathbf{r})$ of the *Ansatz* is by far the most computationally demanding part of the training (and evaluation), as it requires evaluating the Laplacian of the wave function with respect to all electron coordinates.
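For intuition, the local energy of an exact eigenfunction is constant and equal to the eigenvalue. A finite-difference sketch for the hydrogen atom, where $\psi(\mathbf{r}) = \mathrm{e}^{-|\mathbf{r}|}$ and $E = -\tfrac12$ hartree, illustrates this (the finite-difference Laplacian stands in for the automatic differentiation used in practice):

```python
import numpy as np

def psi(r):
    """Exact hydrogen 1s wave function, psi = exp(-|r|)."""
    return np.exp(-np.linalg.norm(r))

def local_energy(r, h=1e-3):
    """E_loc = (H psi)/psi with a central finite-difference Laplacian."""
    lap = 0.0
    for axis in range(3):
        e = np.zeros(3)
        e[axis] = h
        lap += psi(r + e) - 2 * psi(r) + psi(r - e)
    lap /= h**2
    return -0.5 * lap / psi(r) - 1.0 / np.linalg.norm(r)

e_loc = local_energy(np.array([0.5, 0.3, -0.2]))
```

Away from the nucleus, the kinetic and potential contributions cancel to yield $E_\text{loc} = -\tfrac12$ everywhere, which is precisely the zero-variance property exploited by VMC.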

### C. Pretraining

Choosing initial values for *Ansatz* parameters is a non-trivial question common to many computational chemistry methods. One need only think of the sensitivity of the self-consistent iterations to the initial guess in Hartree–Fock (HF) and related methods.^{39–41} The case of deep-learning VMC *Ansätze* is no different—a random initialization of the neural network parameters according to some of the widely adopted schemes of the machine learning community can lead to the optimization diverging or converging to local minima. This problem becomes increasingly severe with growing system size, presumably due to the higher-dimensional, more complex wave functions of larger molecules and their intricate nodal structure.

DeepQMC interfaces with PySCF,^{36} enabling the initialization of wave functions from the coefficients of a preceding HF or multi-configurational self-consistent field (MCSCF) calculation. While this allows the direct initialization of the neural network wave function *Ansatz* as introduced by Hermann *et al.*,^{7} subsequent work suggested that explicitly incorporating an approximate reference wave function in the model can deteriorate performance.^{11} Instead, a short, supervised pretraining with respect to a reference solution before the self-supervised variational optimization is recommended. In this step, the many-body orbitals of the *Ansatz* are trained to match the reference by minimizing a pretraining loss. Evaluating this loss does not involve computing the local energy of the *Ansatz*, which means that pretraining requires significantly less computational resources than variational training. Initialization with pretrained orbitals, as introduced by Pfau *et al.*,^{8} improves the convergence properties of the variational training and, if well balanced with the subsequent variational optimization, can even slightly boost the final accuracy, as Gerard *et al.*^{11} recently demonstrated.

### D. Gradient clipping

Despite the utilization of sophisticated gradient estimators and pretraining, the convergence of the variational optimization is still often hindered by outliers in the training batches of local energies. The existence of these outliers is not surprising, considering that the electrostatic energy is singular when two particles coincide, while the kinetic energy is singular at the nodes of the wave function—energy contributions that the shape of the wave function precisely levels out in later stages of the training. While the outliers are valid contributions to the energy expectation value, their presence can inject a lot of noise into the gradient estimates. To reduce their contribution to the parameter update, the loss and its gradient [given in Eqs. (29) and (30)] are evaluated using clipped local energies $E\u0302loc\mu ,\sigma $, where *μ* is the center and *σ* is the half-width of the clipping window.

Regarding concrete choices for *μ* and *σ*, some empirical findings have been reported in the related literature. Investigating transition metal atoms using pseudopotentials, Li and co-workers reported^{16} that choosing *σ* = 10 × std(*E*_{loc}) significantly outperforms all other options they have considered. More recently, von Glehn *et al.*^{12} have found that centering the clipping window at the median of local energies, and using the mean absolute deviation from the median to determine *σ*, improves the training of multiple deep-learning *Ansätze*. Considering the practical importance of the clipping mechanism, DeepQMC implements the algorithm of von Glehn *et al.*,^{12} along with an analogous logarithmically scaling “soft” clipping scheme introduced by Hermann *et al.*,^{7} and also offers full flexibility to the user in specifying custom clipping functions.
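A sketch of the median-centered clipping described above (the width multiplier is an illustrative default, not a DeepQMC setting):

```python
import numpy as np

def clip_local_energies(e_loc, width=5.0):
    """Clip to [center - sigma, center + sigma], centering on the median
    and setting sigma from the mean absolute deviation from the median."""
    center = np.median(e_loc)
    sigma = width * np.mean(np.abs(e_loc - center))
    return np.clip(e_loc, center - sigma, center + sigma)

rng = np.random.default_rng(0)
e_loc = np.append(rng.normal(-1.0, 0.1, size=1000), 50.0)  # one outlier
clipped = clip_local_energies(e_loc)
```

Because the median and the mean absolute deviation are themselves robust to outliers, the clipping window tracks the bulk of the distribution and tames the stray local energies without disturbing the typical samples.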

Finally, it should be highlighted that the local energies are only to be clipped for computing the gradient of the loss during optimization. Since clipping can introduce a bias to the estimate of the energy expectation value, variational energy estimates can only be obtained from unclipped local energies.

### E. Optimizer

Utilizing natural gradient descent optimization^{42} or Kronecker-factored approximations thereof^{43} has proven to be a crucial ingredient in the success of variational optimization of deep-learning wave functions on molecular systems.^{8,11,12,44,45} Consequently, DeepQMC makes use of the Kronecker-Factored Approximate Curvature (KFAC) optimizer implementation of Botev and Martens.^{46} To showcase the importance of the choice of optimizer, the performance of KFAC is compared to that of the commonly employed first-order optimizer AdamW^{47} in variational training runs on six test systems. The obtained training energy curves are plotted in Fig. 2. To account for the 10%–25% longer per-iteration run time of KFAC compared to AdamW, the wall-clock time of the training (instead of the usual iteration count) is shown on the horizontal axes. The results show that the slightly increased per-iteration cost is offset by the significantly improved per-iteration convergence speed of the KFAC optimizer. Furthermore, it is found that the relative overhead of KFAC over AdamW shrinks for systems with larger numbers of electrons. In practice, this means that the last few percent of the correlation energy can be recovered much more efficiently with KFAC, resulting in improved final energies for a given computational budget.

The effectiveness of natural gradient descent in this setting can be rationalized through its connection to the stochastic reconfiguration method,^{8,48} known from the traditional variational quantum Monte Carlo optimization.^{49,50}

These higher-order methods utilize the Fisher information of the unnormalized density associated with the wave function as a preconditioner for the gradients. KFAC extends the application range of natural gradient descent by approximating the Fisher information matrix with Kronecker-factored blocks, facilitating its computation for neural network wave functions with large numbers of model parameters.

Instead of following the steepest descent in parameter space, an optimization step with the preconditioned gradient is in the direction of steepest descent in distribution space, with distance defined by the Kullback–Leibler (KL) divergence.^{51} Considering that in VMC, the predicted quantity $\psi_\theta(\mathbf{r})$ directly defines the distribution $p(\mathbf{r}|\boldsymbol\theta) \propto |\psi_\theta(\mathbf{r})|^2$, one concludes that a natural gradient step is in the direction of maximal loss decrease for a given KL divergence between $\psi_\theta$ and $\psi_{\theta+\mathrm{d}\theta}$. This is an advantageous property, as relying on the KL divergence results in updates that are independent of the way $\psi_\theta$ is parameterized, as opposed to the steepest descent, where the Euclidean metric introduces a strong dependence.

## V. SAMPLING

An important characteristic of VMC is that the data (electron positions) used to fit the model are generated in tandem with the optimization, by sampling the probability distribution of the electronic degrees of freedom defined by the square of the wave function. This sampling task comes with its own challenges, due to its tight coupling with the training. For the variational principle to remain valid, the samples used to evaluate (6) must be equilibrated according to the distribution **r** ∼ |*ψ*_{θ}(**r**)|^{2}. Furthermore, since *ψ*_{θ} is updated in every training iteration, the sampling must account for the corresponding changes in the distribution of the electron positions. To carry out this demanding sampling task in a computationally efficient manner, the DeepQMC program package offers two optimized Markov chain Monte Carlo (MCMC) algorithms. Along with the random walk MCMC algorithm,^{52,53} referred to as Metropolis sampler, the Metropolis-Adjusted Langevin Algorithm^{54} (MALA), referred to as Langevin sampler, is also implemented, which proposes walker updates using overdamped Langevin dynamics. The implemented MALA includes the correction proposed by Hermann *et al.*,^{7} which scales the electron step sizes around the nuclei to avoid “overshooting” the latter. In addition, changes of the wave function during training can be accounted for by re-equilibration after each gradient step or using a batch reweighting scheme. In Secs. V A and V B, these MCMC samplers along with the above-described sampling difficulties are characterized in more detail.
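A bare-bones random-walk Metropolis sampler for a batch of walkers is sketched below; DeepQMC's actual samplers add step-size tuning, the nucleus-aware step scaling mentioned above, and operate on full electron configurations rather than the 1D toy target used here:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_prob, x, n_steps, step=0.8):
    """Random-walk Metropolis on a batch of 1D walkers x of shape (n,),
    targeting the density exp(log_prob(x))."""
    lp = log_prob(x)
    samples = []
    for _ in range(n_steps):
        prop = x + step * rng.normal(size=x.shape)
        lp_prop = log_prob(prop)
        # accept with probability min(1, p(prop) / p(x))
        accept = np.log(rng.random(size=x.shape)) < lp_prop - lp
        x = np.where(accept, prop, x)
        lp = np.where(accept, lp_prop, lp)
        samples.append(x)
    return np.stack(samples)

# target |psi(x)|^2 ~ exp(-x^2), i.e., a normal with variance 1/2
chain = metropolis(lambda x: -x**2, rng.normal(size=500), 400)
```

After discarding an equilibration period, the empirical moments of the chain match those of the target distribution.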

### A. Energy convergence

To evaluate the energy expectation value of Eq. (6), electron configurations are sampled from the square of the trained *Ansatz*. In order to draw $n = n_\text{b} \times n_\text{s}$ electron samples $\{\mathbf{r}\}_{ij}$, distributed according to $|\psi_\theta(\mathbf{r})|^2$, a batch of $n_\text{b}$ walkers is propagated for $n_\text{s}$ MCMC steps. Based on these electron samples, the energy expectation value is estimated as the mean of the local energies.

As for Monte Carlo estimates in general,^{32,55} the sampling error of such estimates decays proportionally to $n^{-1/2}$. To approximate the sampling error, we utilize the nonoverlapping batch means estimator, as reviewed by Flegal *et al.*^{55} We first obtain independent estimates of the energy by averaging the local energies over the walker trajectories (batches); the sampling error is then estimated from the scatter of these batch means.
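A sketch of the nonoverlapping batch-means estimator (an illustrative helper; batch count and sample sizes are arbitrary):

```python
import numpy as np

def batch_means_error(e_loc, n_batches=32):
    """Estimate the Monte Carlo error of mean(e_loc) from the scatter of
    nonoverlapping batch means."""
    n = len(e_loc) // n_batches * n_batches
    means = e_loc[:n].reshape(n_batches, -1).mean(axis=1)
    return means.std(ddof=1) / np.sqrt(n_batches)

rng = np.random.default_rng(0)
e_loc = rng.normal(size=32768)  # uncorrelated toy "local energies"
err = batch_means_error(e_loc)
```

For uncorrelated samples, the estimate reproduces the familiar $\sigma/\sqrt{n}$ error; for correlated MCMC chains, sufficiently long batches absorb the autocorrelation and the estimator remains valid.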

The convergence of the energy estimate and its error bar throughout the evaluation of an *Ansatz* trained on the CH_{4} molecule is plotted in Fig. 3. In the top pane, the final value of the exponential moving average of the training energies and its estimated error are also shown with a horizontal line and shaded area. It can be seen from this plot that the energy estimate of the evaluation converges gradually toward the final training energy as expected, while its sampling error converges toward zero. Note that due to the parameter updates during the optimization, the energy estimate obtained from the training iterations is unreliable, and a thorough evaluation of the energy expectation value requires sampling the *Ansatz* with fixed parameters.

In the bottom pane of Fig. 3, the convergence of the estimated sampling error is compared between the Metropolis sampler and the Langevin sampler. Importantly, the expected *n*^{−1/2} convergence behavior is observed for both methods. Comparing the two algorithms, it can be seen that the error of MALA converges slightly faster than that of random walk MCMC, indicating a lower degree of correlation between the subsequent positions of the walkers of the Langevin sampler.

### B. Decorrelated sampling

To quantify the correlation of subsequent samples, we consider the normalized autocorrelation function $\rho_{E_\text{loc}}(t)$ of the local energy at lag $t$, computed using the mean $\mu_{E_\text{loc}}$ of the local energies over the entire trajectory. The autocorrelation time of the local energy is then computed as $\tau' = 2\int_0^\infty \rho_{E_\text{loc}}(t)\,\mathrm{d}t$. Finally, $\tau$ is obtained by taking the mean of the $\tau'$s over all propagated MCMC chains, providing a simple measure of local energy autocorrelation.
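A discrete-sum sketch of this estimate is given below; the cutoff and the sum convention $\tau = 1 + 2\sum_t \rho(t)$ are assumptions standing in for the integral definition above:

```python
import numpy as np

def autocorr_time(x, max_lag=200):
    """Integrated autocorrelation time tau = 1 + 2 * sum_t rho(t)."""
    x = x - x.mean()
    var = np.mean(x * x)
    rho = np.array([np.mean(x[:-t] * x[t:]) / var for t in range(1, max_lag)])
    return 1.0 + 2.0 * rho.sum()

# AR(1) chain with correlation 0.6 has tau = (1 + 0.6) / (1 - 0.6) = 4
rng = np.random.default_rng(0)
eps = rng.normal(size=200_000)
x = np.empty_like(eps)
x[0] = eps[0]
for t in range(1, len(x)):
    x[t] = 0.6 * x[t - 1] + eps[t]
tau = autocorr_time(x)
```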

The autocorrelation times for five atoms of increasing size and the cyclobutadiene molecule are plotted in the upper pane of Fig. 4, for both the Metropolis and the Langevin sampler. For the Metropolis sampler, a general trend of longer autocorrelation times for larger systems can be observed. One of the main causes of this trend is the increasing nuclear charge, which induces increasingly sharp peaks in the distribution of the electrons near the nuclei. These pronounced peaks necessitate shorter update proposal radii, ultimately resulting in a higher correlation between subsequent samples. Furthermore, the autocorrelation time is expected to grow with the increasing complexity of the wave functions and their nodal surfaces. The Langevin sampler, on the other hand, seems less affected by this trend, delivering largely constant autocorrelation times for all systems. It is reasonable to assume that by explicitly making use of information about the gradient of the wave function, the MALA update proposal retains better decorrelation efficiency than random walk MCMC as the wave functions become more complicated. Finally, the showcased autocorrelation times are consistent with the default number of decorrelating steps performed between parameter updates in current neural wave function program packages, which is typically chosen between 10 and 30.

The experiments depicted in Fig. 4 also demonstrate a slightly smaller correlation between subsequent samples of the Langevin sampler in comparison with those of the Metropolis sampler, for all but the smallest of systems. In the bottom pane of Fig. 4, the wall clock run times of performing *τ* sampling steps are shown for each system, to account for the slightly increased computational cost of the MALA update proposal. Considering wall clock run times, the Metropolis sampler is more efficient on atoms up to carbon, while the Langevin sampler performs slightly to considerably better on the larger atoms and cyclobutadiene. While we find MALA to be more efficient than random walk Metropolis, we observe that for larger systems with heavier nuclei it can result in less stable optimization. To preserve the black-box nature of the method, random walk MCMC is therefore applied in all subsequent experiments.
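To make the MALA update proposal discussed above concrete, the following minimal sketch implements the algorithm for an arbitrary log-density. It is a didactic, single-walker version written against a generic target, not the sampler shipped with DeepQMC; in the VMC setting the target density would be $|\psi_\theta|^2$.

```python
import numpy as np

def mala_chain(logp, grad_logp, x0, n_steps, tau, rng):
    """Metropolis-adjusted Langevin algorithm (MALA).

    Proposes x' = x + tau * grad log p(x) + sqrt(2 tau) * xi and corrects
    for the asymmetric proposal density in the Metropolis acceptance ratio.
    """
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_steps, x.size))

    def log_q(xp, xc):
        # log density (up to a constant) of proposing xp from xc
        mean = xc + tau * grad_logp(xc)
        return -np.sum((xp - mean) ** 2) / (4.0 * tau)

    for i in range(n_steps):
        prop = x + tau * grad_logp(x) + np.sqrt(2.0 * tau) * rng.standard_normal(x.size)
        log_alpha = logp(prop) - logp(x) + log_q(x, prop) - log_q(prop, x)
        if np.log(rng.uniform()) < log_alpha:
            x = prop
        samples[i] = x
    return samples
```

Because the drift pushes walkers toward regions of high probability, MALA typically tolerates larger steps than a symmetric random walk proposal at the same acceptance rate, which is consistent with the shorter autocorrelation times reported above.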

## VI. SCALING

Understanding the scaling of a method’s computational cost with the considered system size is of utmost importance in the field of quantum chemistry, where a pervasive caveat of the most accurate approaches is their unfavorable scaling behavior. Given its high accuracy, the asymptotic scaling of VMC (typically quoted as *N*^{4})^{32} is considered favorable. Although this general scaling is, indeed, much better than, for example, the *N*^{7} scaling of the gold-standard CCSD(T) method, and on a par with the scaling of hybrid density functionals (such as DM21^{3}), deep QMC calculations incur a larger prefactor, resulting in much higher practical costs on systems of intermediate size. While reducing this prefactor is an important long term goal of method developers in the field, investigating the method’s scaling is also of interest, to estimate which system sizes may become feasible with further improvements and to serve as a baseline for future developments. In this section, the scaling of the computational cost of the variational training of deep-learning *Ansätze* is investigated using the DeepQMC program package. Further scaling aspects of the pseudopotential implementation and design choices regarding the major computational bottlenecks of the algorithm are discussed in Appendix A.

The theoretical scaling of VMC (*N*^{4}) is obtained when combining the *N*^{3} cost of the determinant evaluation with an additional factor of *N* from the Laplacian required in the computation of the kinetic energy. In practice, however, for simulations with the currently feasible system sizes, the determinant evaluation makes up only a fraction of the computational cost, which is instead dominated by the evaluation of the neural networks of the *Ansatz*. To investigate the practical scaling of a variational training step in DeepQMC, single iteration run times are compared across atoms with increasing atomic numbers, as well as across chains containing an increasing number of hydrogen atoms (Fig. 5). Although the systems contain different numbers of particles, due to the parametrization with GNNs, the total parameter count of the wave function *Ansatz* changes only marginally between systems. On the other hand, owing to the varying numbers of nuclei, isoelectronic species can have slightly different computational requirements. The system classes of atoms and hydrogen chains were chosen, as they represent the lower and upper bounds, respectively, on the number of nuclei a neutral system with a fixed number of electrons can contain. Consequently, the scaling of the run time with the number of electrons is also expected to be bounded by these system classes. With the tight empirical bounds of *N*^{2.36−2.79} depicted in Fig. 5, the observed scaling of DeepQMC is still far below the theoretical estimate of *N*^{4}, highlighting the potential for extension to larger systems.
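Empirical scaling exponents such as the *N*^{2.36−2.79} bounds quoted above are typically extracted by a linear fit in log–log space, since run time ∝ *aN*^{b} becomes a straight line with slope *b* after taking logarithms. A minimal sketch of this standard procedure:

```python
import numpy as np

def fit_power_law(system_sizes, run_times):
    """Fit run_time ~ a * N^b via linear regression in log-log space.

    Returns (a, b), where b is the empirical scaling exponent.
    """
    b, log_a = np.polyfit(np.log(system_sizes), np.log(run_times), 1)
    return np.exp(log_a), b
```

Applied to measured single-iteration run times over a range of electron counts, the slope of the fit directly yields the exponent reported in Fig. 5.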

## VII. *ANSATZ* VALIDATION

Relying on the general framework introduced above, the DeepQMC software suite enables the use of many of the previously published deep-learning *Ansätze* by providing configuration files to reproduce PauliNet,^{7} FermiNet,^{8} and DeepErwin.^{11} To validate our implementation of these *Ansätze*, the hyperparameters of sampling, optimization, and GNN architecture are compared in depth to those of the respective reference implementations. In addition, it is verified that when using the same parameters, the DeepQMC implementations predict the same wave function value and local energy as their reference counterparts for a given configuration of electrons and nuclei. Note that we have refrained from exactly matching the cusp-corrected GTOs of PauliNet, because subsequent work has demonstrated that explicitly including a reference solution limits the accuracy of the *Ansatz*. However, by combining Gaussian envelopes initialized from the coefficients of a reference calculation with a nuclear cusp correction in the Jastrow factor (26), it is possible to obtain a variant of PauliNet within DeepQMC that matches the characteristics of the original *Ansatz*.

In Fig. 6, the empirical performance of the various *Ansätze* is checked against results published in the literature for a small set of molecules. It can be seen that our DeepQMC implementation of PauliNet, FermiNet, and DeepErwin matches the reference energies well. The remaining discrepancies of FermiNet result from slightly different experimental setups, such as an increased number of reference optimization steps (200 000 compared to 50 000 used here) and batch size (4096 compared to 2048 used here), or an older TensorFlow-based implementation being used in the case of N_{2}. The impact of these changes on the deviations of the model accuracy highlights the importance and difficulty of comparing *Ansätze* implemented in different codebases under the same experimental conditions.

As a further contribution, we introduce and analyze the performance of a new default *Ansatz* for the DeepQMC program package, which we refer to as PauliNet2. This exemplary *Ansatz* was optimized to offer a good trade-off between accuracy and the number of trainable model parameters. Despite achieving a similar accuracy as FermiNet and DeepErwin for the small systems under investigation (see Figs. 6 and 7), the PauliNet2 *Ansatz* has only about a third of the model parameters of FermiNet and a quarter of those of DeepErwin (i.e., for the CO molecule, 239 829, 766 944, and 998 816 parameters, respectively). The *Ansatz* combines the SchNet-like graph convolutions of PauliNet (17) with the iterative update of the edge embeddings of FermiNet (18). Edge features are constructed from difference vectors between the electrons, and isotropic exponentials are used as envelopes. Furthermore, the *Ansatz* comprises a trainable Jastrow factor (24) and the fixed electronic cusp correction (25). While these hyperparameters are found to be suitable for the presented experiments, it is conceivable that an extended hyperparameter search targeting specific applications could further improve its performance. The detailed settings of the discussed *Ansätze* can be found in the respective configuration files shipped with the DeepQMC package.

## VIII. APPLICATION EXAMPLES

In this section, the ease of applying the DeepQMC program package as a black-box method to obtain electronic energies is demonstrated on benchmark datasets. Two widely different example problems are chosen in order to showcase the general applicability of the presented method. The experiments are performed using the DeepQMC command line interface, which exposes all configuration options of the software suite while also allowing for effortless submission of simple calculations. Short usage examples of the DeepQMC command line interface are provided in Appendix B.

### A. Small molecule reactions

The electronic contributions to the reaction energies of 12 reactions involving small inorganic molecules and hydrocarbons are investigated. These reactions were used by Nemec^{57} to benchmark the accuracy of Slater–Jastrow (SJ) type trial wave functions, constructed following Drummond *et al.*^{58} using electron–electron, electron–nucleus, and electron–electron–nucleus terms in the Jastrow factor. The 14 participating molecules are built from H, C, O, N, and F atoms, containing at least 2 and at most 22 electrons. To facilitate the comparison with the DMC results of Nemec,^{57} the same molecular geometries are considered, obtained from the work of Feller.^{59} Reference energies are taken from the review of O’Neill.^{60} All electron, complete basis set extrapolated CCSD(T) energies are computed in house, using the PySCF program package.^{36}

First, single-point electronic energies obtained for the participating molecules are compared in Fig. 7. On the vertical axis, the error of the recovered total energy is plotted, for VMC and DMC calculations utilizing SJ type trial wave functions and for VMC with deep-learning *Ansätze*. Looking at Fig. 7, one can observe that the total energy errors of SJ-VMC *Ansätze* are consistently above 47 mEh (with a mean of 114 mEh), while the associated DMC errors are in the range of 8–50 mEh (26 mEh on average). In comparison, deep-learning *Ansätze* exhibit at most 11 mEh of total energy error, with a mean deviation of 2.6 mEh. While the main goal of quantum chemistry methods is to accurately model energy differences, rather than recover exact total energies, it is encouraging to see that DeepQMC and deep-learning *Ansätze*, in general, are very competitive in this area.

The accuracy of the energy differences obtained with SJ-DMC, CCSD(T), and deep-learning QMC methods is compared in Fig. 8. Note that the energy differences obtained with SJ-VMC are not shown in this figure, as they are an order of magnitude less accurate than the depicted approaches. Comparing the SJ-DMC results with those obtained from DeepQMC, one concludes that combining the VMC method with expressive deep-learning *Ansätze* greatly increases its accuracy, surpassing SJ-DMC on eleven out of twelve reactions. The accuracy advantage of DeepQMC’s PauliNet2, FermiNet, and DeepErwin *Ansätze* is similarly clear when comparing their respective reaction energy mean absolute deviations (MADs) of 2.4, 2.3, and 1.5 mEh to the 7.6 mEh of SJ-DMC. As a final comparison, Fig. 8 also shows the reaction energy differences obtained from complete basis set extrapolated, all-electron CCSD(T) calculations. Not surprisingly, CCSD(T) performs outstandingly on these small, single reference systems in equilibrium geometry, achieving a MAD of 1.3 mEh, and chemical accuracy (less than 1 kcal/mol or 1.6 mEh error) on ten reactions. In comparison, the MAD values of PauliNet2, FermiNet, and DeepErwin for this exemplary study with DeepQMC are found to approach that of CCSD(T), and chemical accuracy is achieved on seven, seven, and eight out of twelve reactions, respectively.
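The bookkeeping behind these comparisons — reaction energies as differences of summed total energies, the mean absolute deviation, and the 1 kcal/mol chemical accuracy threshold — can be made concrete in a few short helpers. The numerical values in the test below are placeholders for illustration, not results from the study.

```python
import numpy as np

MEH_PER_KCAL_MOL = 1.5936  # 1 kcal/mol expressed in milli-Hartree

def reaction_energy(reactant_energies, product_energies):
    """Electronic reaction energy: summed product energies minus summed
    reactant energies (all inputs in Hartree)."""
    return np.sum(product_energies) - np.sum(reactant_energies)

def mean_absolute_deviation(predicted, reference):
    """MAD between predicted and reference reaction energies (same units)."""
    return np.mean(np.abs(np.asarray(predicted) - np.asarray(reference)))

def chemically_accurate(predicted, reference):
    """Elementwise check whether the error is below 1 kcal/mol (~1.6 mEh);
    inputs in mEh."""
    return np.abs(np.asarray(predicted) - np.asarray(reference)) < MEH_PER_KCAL_MOL
```

Since total energies of the participating molecules are on the order of tens of Hartree while chemical accuracy is ~1.6 mEh, reaction energies rely on systematic error cancellation between the left- and right-hand sides of each reaction.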

### B. Transition metal oxides

The effects of utilizing pseudopotentials in the variational optimization of deep-learning molecular wave functions are evaluated on a series of four first-row transition metal oxides. The bond lengths of the ScO, TiO, VO, and CrO molecules are taken from the experimental results of Annaberdiyev *et al.*^{61} The latest ccECP pseudopotentials^{61} are applied to the transition metal atoms only, replacing neon-like cores of ten electrons. Although replacing argon cores (18 electrons) with pseudopotentials would result in even larger computational savings, this is avoided as the third shell electrons are known to play a non-negligible role in the bond formation of transition metal atoms.^{62} Apart from the introduction of pseudopotentials, the *Ansatz* employed on small molecule reactions (Sec. VIII A) is utilized here without further modifications.

The dissociation energies are obtained as differences of the computed total energies, where the oxygen atom energy $E^\text{O} = -75.0631(1)$ Ha is the result of an all-electron calculation with the same hyperparameters. Figure 9 shows the comparison of the obtained dissociation energies to experimental values^{63} and to those of other accurate computational methods, such as CCSD(T),^{61} FermiNet,^{16} auxiliary field quantum Monte Carlo (AFQMC), semi-stochastic heat bath configuration interaction (SHCI), and density matrix renormalization group (DMRG).^{64} Apart from the TiO case, the accuracy of DeepQMC with pseudopotentials is comparable to that of other theoretical methods, such as CCSD(T) or AFQMC. The fact that the dissociation energy estimates of DeepQMC are systematically lower than the experimental results indicates that the single atoms are described more accurately than the oxide molecules. This can be counteracted by increasing the expressiveness of the *Ansatz* and investing more computation. Note that the results obtained with FermiNet^{16} utilized a larger *Ansatz* trained for about ten times more training iterations than done in this study.

## IX. SUMMARY AND CONCLUSIONS

We have presented the DeepQMC program package—a general variational quantum Monte Carlo framework for optimizing deep-learning molecular wave functions. The implementation focuses on modularity, facilitating rapid development of new *Ansätze*, and provides maximal freedom in choosing the specifics of the variational training setup. The *Ansatz* shipped with DeepQMC attempts to unify most of the currently existing deep-learning molecular wave functions while remaining easy to extend as new models emerge. To reduce the computational complexity associated with heavy nuclei, some popular precomputed pseudopotentials are also implemented.

Using the framework provided by DeepQMC, the most important practical aspects of variational optimization of deep-learning molecular wave functions are discussed. The importance of a proper gradient estimator along with robust gradient clipping is highlighted. For consistent *Ansatz* initialization, supervised pretraining to HF wave functions is suggested. The advantages of using the second-order KFAC optimizer are demonstrated, along with a rationalization of its effectiveness. The theoretical convergence of the Markov Chain Monte Carlo sampling error is verified, and MALA is shown to be more effective in obtaining decorrelated samples than the widely utilized random walk MCMC algorithm. The empirical scaling of the method’s computational cost is found to be more favorable than that of the most popular post-HF approaches, while its large prefactor is identified as an obstacle on the path to wider adoption.

The black-box application of the program package is demonstrated in two significantly different settings. The electronic reaction energies of 12 small molecule reactions are computed with a mean absolute deviation of 1.5 mEh and compared to the 1.3 mEh achieved by CCSD(T) and 7.6 mEh achieved by DMC with SJ *Ansätze*. Using the same *Ansatz* hyperparameters, dissociation energies are computed for a series of transition metal monoxides, utilizing the latest ccECP^{61} pseudopotential. Improved training characteristics compared to all-electron calculations highlight the benefit of employing pseudopotentials. The accuracies of the predicted dissociation energies are on a par with or exceed those of some other recently popularized methods, such as auxiliary field quantum Monte Carlo or density matrix renormalization group.

To conclude, the presented method shows great promise to become an easy-to-use, general, black-box method accurately describing the molecular electronic structure. Especially encouraging is the favorable scaling of computational requirements with increasing system size. It is easy to envision that after further development reducing the large prefactor of the computational costs, the DeepQMC package will prove valuable to the wider quantum chemistry community.

## ACKNOWLEDGMENTS

We would like to thank Jannes Nys for helpful discussions regarding the nuclear cusp correction and Leon Gerard for providing reference data for the *Ansatz* comparison. The work of Z.S. and P.B.S. was funded and supported by the Berlin Mathematics Research Center MATH+ (Project No. AA2-22). The work of M.M. received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 772230). The computing time granted by the HLRN and provided on the supercomputer Lise at NHR@ZIB is gratefully acknowledged.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

Z.S. and P.B.S. contributed equally to this work.

**Z. Schätzle**: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Methodology (equal); Software (equal); Visualization (equal); Writing – original draft (equal). **P. B. Szabó**: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Methodology (equal); Software (equal); Visualization (equal); Writing – original draft (equal). **M. Mezera**: Data curation (equal); Formal analysis (equal); Software (equal); Visualization (equal); Writing – original draft (equal). **J. Hermann**: Software (equal); Supervision (equal). **F. Noé**: Conceptualization (equal); Funding acquisition (equal); Project administration (equal); Resources (equal); Supervision (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX A: ADDITIONAL SCALING EXPERIMENTS

#### 1. Pseudopotentials

Figure 10 shows the scaling of the run time of the non-local pseudopotential evaluation [the second term in Eq. (12)] on the third-row and fourth-row atoms. This term overwhelmingly dominates the computational overhead of using pseudopotentials. Of the five nested summations of Eq. (12), only the sum over the valence electrons scales with the number of electrons, suggesting approximately linear scaling with system size. Figure 10 shows that the obtained empirical scaling of *N*^{1.19} is in good agreement with this expectation. The sudden jump in run time from 20 to 21 electrons is caused by the reduction in valence electrons, as the utilized ccECP pseudopotentials use a larger core for 4p elements than for 3d ones.

#### 2. Memory requirement

For all investigated applications, the memory requirement bottleneck is presented by the computation of the Laplacian of the wave function, $\sum_i^{3N} \frac{\partial^2}{\partial r_i^2} \psi_\theta(\mathbf{r})$. In this step, one obtains second derivatives of the wave function with respect to the electron coordinates. We obtain the derivatives of the wave function by applying automatic differentiation in the backward–forward mode. While the gradient $\nabla_{\mathbf{r}} \psi_\theta$ is obtained in one backward pass for all coordinates, the diagonal of the Hessian, $\partial^2 \psi_\theta / \partial r_i^2$, requires an additional 3*N* forward mode differentiations to compute, one for each electron coordinate. Due to the flexible function transformations of JAX, both the serial and parallel executions of the 3*N* forward mode differentiation passes can be implemented in a few lines of code, with the two implementations differing in how they trade computational efficiency for memory requirement.
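The serial variant can be sketched with JAX's composable transformations. The snippet below is a schematic reimplementation of the forward-over-backward pattern, not the DeepQMC source: it uses `jax.lax.scan` so that only one forward-mode pass is materialized at a time, which is what keeps the memory footprint independent of the number of coordinates.

```python
import jax
import jax.numpy as jnp

def laplacian(f):
    """Serial forward-over-backward Laplacian of a scalar function
    f: R^n -> R, accumulating one Hessian diagonal entry per scan step."""
    def lap(x):
        n = x.shape[0]
        grad_f = jax.grad(f)  # one backward pass yields the full gradient
        basis = jnp.eye(n)

        def body(carry, basis_vec):
            # jvp of grad_f along a basis vector gives one Hessian column;
            # its dot product with the same vector is a diagonal entry
            _, hvp = jax.jvp(grad_f, (x,), (basis_vec,))
            return carry + jnp.dot(hvp, basis_vec), None

        out, _ = jax.lax.scan(body, 0.0, basis)
        return out
    return lap
```

A parallel variant would replace the scan with `jax.vmap` over the basis vectors, holding all forward-mode passes in memory simultaneously — the trade-off analyzed in Fig. 11.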

To decide between the serial and parallel approaches to the Laplacian computation, benchmark calculations on a series of atoms with increasing nuclear charges are performed. The obtained relative memory requirements of the parallel and serial computations are presented on the left vertical axis of Fig. 11. The observed linear scaling of the relative memory requirement between parallel and serial evaluations can be understood by considering that the parallel implementation holds data for all 3*N* backward passes in memory, while the serial approach stores data for a single pass at a time. However, the prefactor of the scaling curve is significantly less than three, which indicates that JAX performs some optimizations on the parallel code that reduce the naive 3*N* memory requirement multiplier. Considering the run times of the two versions (lower panel of Fig. 11), it is found that the relative timing of the serial implementation over the parallel implementation does not scale with the system size. In fact, the ratio of run times between the serial and parallel implementations appears to converge around 1.5. Taking the above observations into account, the serial implementation of the Laplacian evaluation is chosen, due to its favorably scaling memory requirement, which outweighs the slight, non-scaling run time edge of the parallel implementation.

### APPENDIX B: USAGE OF DEEPQMC

Here, we provide a few minimal examples of the usage of the DeepQMC command line interface. The interface is based on Hydra, which provides a modular way to configure and execute complex jobs. DeepQMC implements a wide variety of configuration options for the wave function *Ansatz* as well as the hyperparameters of training and evaluation. For ease of use, the package includes predefined configuration files, which can be augmented using the command line or extended with custom configuration files. For a thorough tutorial and API documentation, the reader is referred to the DeepQMC documentation. For examples of typical DeepQMC commands, see Fig. 12.

## REFERENCES

- *International Conference on Learning Representations 2021* (ICLR, 2021); arXiv:2110.05064 [physics].
- *Eleventh International Conference on Learning Representations* (ICLR, 2022); arXiv:2205.14962 [physics].
- *Python 3 Reference Manual*.
- *Advances in Neural Information Processing Systems*.
- *International Conference on Learning Representations 2018* (ICLR, 2018); arXiv:1711.05101 [cs, math].
- *Functional Integration: Basics and Applications*.