Computing accurate yet efficient approximations to the solutions of the electronic Schrödinger equation has been a paramount challenge of computational chemistry for decades. Quantum Monte Carlo methods are a promising avenue of development as their core algorithm exhibits a number of favorable properties: it is highly parallel and scales favorably with the considered system size, with an accuracy that is limited only by the choice of the wave function Ansatz. The recently introduced machine-learned parametrizations of quantum Monte Carlo Ansätze rely on the efficiency of neural networks as universal function approximators to achieve state-of-the-art accuracy on a variety of molecular systems. With interest in the field growing rapidly, there is a clear need for easy-to-use, modular, and extendable software libraries facilitating the development and adoption of this new class of methods. In this contribution, the DeepQMC program package is introduced, in an attempt to provide a common framework for future investigations by unifying many of the currently available deep-learning quantum Monte Carlo architectures. Furthermore, the manuscript provides a brief introduction to the methodology of variational quantum Monte Carlo in real space, highlights some technical challenges of optimizing neural network wave functions, and presents example black-box applications of the program package. We thereby intend to make this novel field accessible to a broader class of practitioners from both the quantum chemistry and the machine learning communities.
I. INTRODUCTION
Recently, the application of machine learning to a wide range of problems from the natural sciences has proven to be highly successful. Computational chemistry is a field of particular activity: machine learning force fields model complicated quantum mechanical effects at the resolution of atoms, while machine-learned functionals elevate density functional theory to unprecedented accuracy.1–3 These approaches utilize supervised training to learn from accurate quantum mechanical reference calculations and make predictions for unseen configurations. While this results in fast yet accurate approximations to the quantum many-body problem, it inherently depends on high quality training data, which represents a major bottleneck of these methods.
An orthogonal way to incorporate machine learning into computational chemistry is its application to improve ab initio calculations. Notably, over the course of the past years, a new family of deep-learning quantum Monte Carlo (deep QMC) methods has developed, incorporating advancements from the field of machine learning.4 Common to the deep QMC methods is the utilization of neural networks to parameterize highly expressive Ansätze, efficiently approximating the solutions of the time-independent electronic Schrödinger equation, thereby providing a complete description of the system’s electronic properties. Originating from spin lattices,5 deep-learning Ansätze were soon applied to molecules in real space.6 With the development of PauliNet7 and FermiNet,8 the accuracy of neural-network wave functions became the state of the art within variational Monte Carlo. Subsequent studies have further increased the accuracy of these Ansätze,9–12 extended them to the simulation of excited states13 as well as periodic systems,14,15 combined them with pseudopotentials,16 used them in the calculation of interatomic forces,17 utilized them in diffusion Monte Carlo simulations,18,19 and extended them to share parameters across multiple molecular geometries20–22 or distinct molecules.23,24
Although the method of optimizing deep-learning wave function Ansätze using variational quantum Monte Carlo was developed only a few years ago, it already competes with some of the most accurate traditional quantum chemistry methods on molecules with up to ∼100 electrons. Exhibiting competitive scaling with the number of electrons, it has the potential to be extended to larger systems in the near future. Achieving this will no doubt require further method development as well as efficient implementations of the core algorithms, creating the need for open source libraries that facilitate experimentation and contribution from the community.
Accompanying the above-summarized research, various software libraries for variational optimization of deep-learning wave functions have been released.25–28 While NetKet25 provides a general implementation of variational optimization of machine learning wave functions mainly for lattice systems with recent extensions to continuous space, the research for molecular machine learning wave functions was carried out across various repositories and is lacking a unified framework. The presented DeepQMC program package aims to provide a unified implementation of the developments in the field of deep-learning molecular wave functions. It intends to be easy to use out of the box while allowing full control and flexibility over the Ansatz and variational training for advanced users. The library is designed to be modular, facilitating the rapid development and testing of individual components, and easing the implementation of new features. It makes use of the composable function transformations and just-in-time compilation provided by the JAX library29 to express performant GPU accelerated algorithms using concise Python30 syntax. Neural network models are encapsulated in higher-level objects, using the haiku deep-learning library.31 The project is open-source and distributed online under the MIT license.26
II. THEORY
A. The electronic structure problem
B. Variational optimization
In practice, a VMC simulation then consists of choosing an Ansatz (see Sec. III) and optimizing it in an alternating scheme of sampling and parameter updates. The expectation value in Eq. (6) is approximated by sampling the probability density given by the square of the wave function (see Sec. V), followed by a parameter update using the gradient of the expectation value (see Sec. IV).
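To make the structure of this loop concrete, a single training iteration can be sketched in JAX-style Python as follows. The sketch is schematic: sample, energy_value_and_grad, and optimizer are hypothetical stand-ins for the MCMC sampler of Sec. V, the gradient estimator of Sec. IV, and an optimizer such as KFAC or AdamW; only optax.apply_updates is an actual library call.

```python
import optax  # gradient-processing library used here for the parameter update

def vmc_step(key, params, opt_state, walkers):
    # (1) Sampling: advance the Markov chains so that the walker positions
    #     follow the current density |psi_theta(r)|^2 (see Sec. V).
    walkers = sample(key, params, walkers)
    # (2) Estimate the energy expectation value of Eq. (6) and its gradient
    #     with respect to the Ansatz parameters (see Sec. IV).
    energy, grads = energy_value_and_grad(params, walkers)
    # (3) Update the parameters with the optimizer of choice.
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, walkers, energy
```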
C. Neural network wave functions
Most of the currently used deep-learning molecular wave functions7,8,11 share the functional form of Eqs. (7) and (8) and differ only in the parametrization of the many-body orbitals and single-particle envelopes. DeepQMC aims to provide a general framework for variational optimization of deep-learning molecular wave functions, facilitating the investigation of the design space spanned by the PauliNet, FermiNet, and DeepErwin neural network Ansätze.
D. Pseudopotentials
Despite the favorable asymptotic scaling of VMC with the number of electrons, systems containing heavy nuclei remain challenging for a variety of reasons. The high energy of electrons near these nuclei complicates simulations by spoiling the optimization and reducing the effectiveness of Markov chain Monte Carlo (MCMC) sampling. Furthermore, the kinetic energy of these electrons reaches the relativistic regime, requiring the treatment of relativistic effects that are not accounted for in the standard non-relativistic molecular Hamiltonian of Eq. (1). On the other hand, while the core regions of heavily charged nuclei contribute dominantly to the total energy, they are typically unchanged during chemical processes and thus have little effect on the chemically relevant relative energies. Therefore, most quantum chemistry methods targeted at computing relative energies reduce the above-outlined difficulties by treating the outer (valence) electrons separately from the inner (core) electrons.
The scalar functions of the local and non-local parts of the pseudopotential are typically pre-computed by expanding them in a Gaussian basis and fitting the expansion parameters directly to a database of reference energies. The DeepQMC program package currently includes the widely used BFD34 and the most recent ccECP35 pseudopotentials, with an application of the latter presented in Sec. VIII B.
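For illustration, the semi-local channels of such pseudopotentials are commonly tabulated in a Gaussian form along the lines of

$$ V_\ell(r) \;=\; \sum_k \beta_{\ell k}\, r^{\,n_{\ell k}-2}\, e^{-\alpha_{\ell k} r^2}, $$

where the coefficients β, exponents α, and integer powers n are the fitted parameters distributed with the pseudopotential sets; the exact conventions vary between libraries and should be taken from the respective references.34,35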
III. WAVE FUNCTION DESIGN SPACE
DeepQMC implements a variety of options to obtain the equivariant many-body orbitals and the accompanying envelopes, covering PauliNet, FermiNet, DeepErwin, and their derivatives. In the following, the main architectural concepts of these wave function Ansätze will be described in more detail. For ease of use, DeepQMC provides predefined configuration files to obtain the above-mentioned Ansätze while allowing their interpolation through a manual choice of hyperparameters.
A. Graph neural networks
Central to the neural network wave function Ansätze is the computation of equivariant many-body embedding vectors for the electrons, which are used downstream to obtain the entries of the generalized Slater determinant. Many strategies for obtaining these embeddings can be unified in the framework of graph neural networks (GNNs). GNNs are well suited to model functions on graphs that exhibit certain permutational symmetries and can be adapted to describe the electrons of molecules in real space.
An electronic configuration of a molecule can be encoded as a graph, where the nodes are electrons and nuclei and the connecting edges carry pairwise features, e.g., difference vectors. GNNs are functions of these graphs, yielding high-dimensional latent space embeddings of the nodes. The electronic node embeddings are initialized with single-electron features and iteratively updated to incorporate many-body information of the electronic environment. Using graph convolutions, the updates are invariant under the exchange of electrons in the environment and the conditions of Eq. (9) are fulfilled.
The most relevant aspects of the GNN architecture implemented in DeepQMC are sketched in the right pane of Fig. 1 and are discussed in detail in the following. Electron positions (spins) are denoted with r (σ), while R and Z indicate nuclear positions and charges. Node and edge quantities are denoted with the superscripts (n) and (e), respectively. Furthermore, l indexes the GNN interaction layers, θ denotes functions parameterized by MLPs, and t runs over the different edge types (those between electrons of identical or opposite spins, or between electrons and nuclei), node types (electron or nucleus nodes), or message types. Finally, vertical brackets indicate the different options implemented in DeepQMC for the computation of various quantities.
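As an illustration of the general pattern, the following simplified haiku sketch (the module name, layer widths, and residual update rule are illustrative choices, not the literal DeepQMC architecture) shows a permutation-invariant embedding update, in which messages from all surrounding electrons are generated from sender embeddings and edge features and aggregated by summation:

```python
import jax.numpy as jnp
import haiku as hk

class ElectronGNNLayer(hk.Module):
    """One electron embedding update (a simplified sketch).

    h: electron embeddings, shape (n_elec, dim)
    e: pairwise edge features, shape (n_elec, n_elec, edge_dim)
    """

    def __call__(self, h, e):
        dim = h.shape[-1]
        # sender embeddings h_j broadcast to every receiver i
        sender = jnp.broadcast_to(h[None], (*e.shape[:2], dim))
        # messages generated from sender embeddings and edge features
        messages = hk.nets.MLP([64, dim])(jnp.concatenate([sender, e], axis=-1))
        # summing over the sender axis makes the aggregation invariant
        # under permutations of the surrounding electrons
        aggregated = messages.sum(axis=1)
        # residual update of the receiver embeddings
        return h + hk.nets.MLP([64, dim])(jnp.concatenate([h, aggregated], axis=-1))
```

Stacked over several interaction layers inside hk.transform, such updates yield the many-body electron embeddings that enter the generalized Slater determinant.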
1. The graph representation
A graph is a natural way to denote the electronic configuration of a molecule in real space. The nodes of the graph represent particles (electrons and nuclei), carrying information such as spin or nuclear charge. The edges carry the difference vectors between the particles, resulting in a representation invariant under global translation. Note that using internal coordinates that are invariant under global rotation may not be sufficient to represent all wave functions (simple counterexamples are atomic wave functions with P symmetry) and can only be employed with a careful treatment of the spatial symmetries.
To implement a variety of wave function Ansätze, DeepQMC provides configuration options to define the specifics of the graph construction outlined above. Most importantly, the nodes corresponding to nuclei and their respective nuclei–electron edges can optionally be excluded from the graph. In this case, electron–nuclei information can still be introduced to the GNN, by initializing the electron embeddings using a concatenation of the difference vectors between the positions of the given electron and all nuclei [see the second case of Eq. (13)].
2. Node features
3. Edge features
4. Message generation
5. Electron embedding update
In the above-outlined general GNN framework, a wide variety of Ansätze can be obtained. Furthermore, the implementation of DeepQMC and its GNN framework focus on facilitating rapid extensions with new Ansatz variants either by exploration within the existing hyperparameter space or by extending it with new features.
B. Orbital construction
C. Determinant construction
D. Jastrow factor and cusp correction
E. Log-representation of the wave function
IV. TRAINING
In this section, some technical aspects of the variational optimization of deep-learning trial wave functions are discussed. While these Ansätze are trained within the standard VMC framework, the characteristics of their optimization differ markedly from other VMC Ansätze, mainly due to the greatly increased parameter count introduced by neural networks. On the other hand, it is also distinct from most other deep-learning settings owing to the unusual complexity of the loss function and the self-supervised setting, where the training data are generated in parallel to the optimization.
A. Loss function and gradient trick
B. Local energy evaluation
C. Pretraining
Choosing initial values for Ansatz parameters is a non-trivial question common to many computational chemistry methods. One need only think of the sensitivity of the self-consistent iterations to the initial guess in Hartree–Fock (HF) and related methods.39–41 The case of deep-learning VMC Ansätze is no different—a random initialization of the neural network parameters according to some of the widely adopted schemes of the machine learning community can lead to the optimization diverging or converging to local minima. This problem becomes increasingly severe with growing system size, presumably due to the higher-dimensional, more complex wave functions of larger molecules and their intricate nodal structure.
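A common remedy, also employed in DeepQMC, is to pretrain the Ansatz in a supervised fashion so that its orbitals match those of a HF reference before the variational optimization begins. A minimal sketch of such a least-squares pretraining objective is given below; orbitals and hf_orbitals are hypothetical stand-ins for functions returning the many-body orbital matrices of the Ansatz and the HF baseline, respectively.

```python
import jax
import jax.numpy as jnp

def pretraining_loss(params, r):
    # Ansatz orbitals for a batch of electron configurations r;
    # illustrative shape: (batch, n_det, n_elec, n_elec)
    phi = orbitals(params, r)
    # HF reference orbitals, broadcast over the determinants
    phi_hf = hf_orbitals(r)
    # least-squares matching of the two orbital matrices
    return jnp.mean((phi - phi_hf) ** 2)

# the loss is minimized with standard first-order methods, e.g.,
# grads = jax.grad(pretraining_loss)(params, r)
```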
D. Gradient clipping
Despite the utilization of sophisticated gradient estimators and pretraining, the convergence of the variational optimization is still often hindered by outliers in the training batches of local energies. The existence of these outliers is not surprising, considering that the electrostatic energy is singular when two particles coincide, while the kinetic energy is singular at the nodes of the wave function—energy contributions that are precisely leveled out by the shape of the wave function only in the later stages of the training. While the outliers are valid contributions to the energy expectation value, their presence can inject a lot of noise into the gradient estimates. To reduce their contribution to the parameter update, the loss and its gradient [given in Eqs. (29) and (30)] are evaluated using clipped local energies Ẽloc = clip(Eloc; μ − σ, μ + σ), where μ is the center and σ is the half-width of the clipping window.
Regarding concrete choices for μ and σ, some empirical findings have been reported in the related literature. Investigating transition metal atoms using pseudopotentials, Li and co-workers reported16 that choosing σ = 10 × std(Eloc) significantly outperforms all other options they have considered. More recently, von Glehn et al.12 have found that centering the clipping window at the median of local energies, and using the mean absolute deviation from the median to determine σ, improves the training of multiple deep-learning Ansätze. Considering the practical importance of the clipping mechanism, DeepQMC implements the algorithm of von Glehn et al.,12 along with an analogous logarithmically scaling “soft” clipping scheme introduced by Hermann et al.,7 and also offers full flexibility to the user in specifying custom clipping functions.
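A minimal sketch of the median-centered variant (the function name and the default window width are illustrative, not the DeepQMC defaults) could read:

```python
import jax.numpy as jnp

def clip_local_energies(e_loc, width=5.0):
    # center the clipping window at the batch median ...
    mu = jnp.median(e_loc)
    # ... with a half-width proportional to the mean absolute
    # deviation from the median, following von Glehn et al.
    sigma = width * jnp.mean(jnp.abs(e_loc - mu))
    return jnp.clip(e_loc, mu - sigma, mu + sigma)
```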
Finally, it should be highlighted that the local energies are only to be clipped for computing the gradient of the loss during optimization. Since clipping can introduce a bias to the estimate of the energy expectation value, variational energy estimates can only be obtained from unclipped local energies.
E. Optimizer
Utilizing natural gradient descent optimization42 or Kronecker-factored approximations thereof43 has proven to be a crucial ingredient in the success of variational optimization of deep-learning wave functions on molecular systems.8,11,12,44,45 Consequently, DeepQMC makes use of the Kronecker-Factored Approximate Curvature (KFAC) optimizer implementation of Botev and Martens.46 To showcase the importance of the choice of the optimizer, the performance of KFAC is compared to that of the commonly employed first-order optimizer AdamW47 in variational training runs on six test systems. The obtained training energy curves are plotted in Fig. 2. To account for the 10%–25% longer per-iteration run time of KFAC compared to AdamW, the wall clock time of the training (instead of the usual iteration count) is shown on the horizontal axes. The results show that the slightly increased per-iteration cost is offset by the significantly improved per-iteration convergence speed of the KFAC optimizer. Furthermore, it is found that the increase in the relative cost of KFAC over AdamW optimization is smaller for systems with larger numbers of electrons. In practice, this means that the last few percent of correlation energy can be recovered much more efficiently with KFAC, resulting in improved final energies for a given computational budget.
The effectiveness of natural gradient descent in this setting can be rationalized through its connection to the stochastic reconfiguration method,8,48 known from the traditional variational quantum Monte Carlo optimization.49,50
These higher-order methods utilize the Fisher information of the unnormalized density associated with the wave function as a preconditioner to the gradients. KFAC extends the application range of natural gradient descent by approximating the Fisher information with a block-diagonal, Kronecker-factored form, facilitating its computation for neural network wave functions with large numbers of model parameters.
Instead of following the steepest descent in parameter space, an optimization step with the preconditioned gradient is in the direction of steepest descent in distribution space, with distance defined by the Kullback–Leibler (KL) divergence.51 Considering that in VMC, the predicted quantity ψθ(r) directly defines the distribution p(r|θ) ∝ |ψθ(r)|², one concludes that a natural gradient step is in the direction of maximal loss decrease for a given KL divergence between ψθ and ψθ+dθ. This is an advantageous property, as relying on the KL divergence results in updates that are independent of the way ψθ is parameterized, as opposed to plain steepest descent, where the Euclidean metric introduces a strong dependence on the parameterization.
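In a common formulation, the preconditioner and the resulting update read

$$ F(\theta) = \mathbb{E}_{r \sim p(r|\theta)}\!\left[ \nabla_\theta \log p(r|\theta)\, \nabla_\theta \log p(r|\theta)^{\top} \right], \qquad \theta \leftarrow \theta - \eta\, F(\theta)^{-1} \nabla_\theta \mathcal{L}(\theta), $$

where η is the learning rate and $\mathcal{L}$ is the loss; KFAC replaces F by a block-diagonal approximation whose blocks are Kronecker products, which is cheap to store and invert.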
V. SAMPLING
An important characteristic of VMC is that the data (electron positions) used to fit the model are generated in tandem with the optimization, by sampling the probability distribution of the electronic degrees of freedom defined by the square of the wave function. This sampling task comes with its own challenges, due to its tight coupling with the training. For the variational principle to remain valid, the samples used to evaluate (6) must be equilibrated according to the distribution r ∼ |ψθ(r)|². Furthermore, since ψθ is updated in every training iteration, the sampling must account for the corresponding changes in the distribution of the electron positions. To carry out this demanding sampling task in a computationally efficient manner, the DeepQMC program package offers two optimized Markov chain Monte Carlo (MCMC) algorithms. Along with the random walk MCMC algorithm,52,53 referred to as Metropolis sampler, the Metropolis-Adjusted Langevin Algorithm54 (MALA), referred to as Langevin sampler, is also implemented, which proposes walker updates using overdamped Langevin dynamics. The implemented MALA includes the correction proposed by Hermann et al.,7 which scales the electron step sizes around the nuclei to avoid “overshooting” the latter. In addition, changes of the wave function during training can be accounted for by re-equilibration after each gradient step or using a batch reweighting scheme. In Secs. V A and V B, these MCMC samplers along with the above-described sampling difficulties are characterized in more detail.
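For illustration, a single random walk Metropolis update of a batch of walkers can be written in a few lines of JAX; the function below is a generic sketch (its name, the proposal width tau, and the batch layout are illustrative, not the library's API).

```python
import jax
import jax.numpy as jnp

def metropolis_step(key, log_psi, r, tau=0.1):
    # r: walker positions, shape (n_walkers, 3 * n_elec)
    # log_psi(r): log|psi_theta(r)| evaluated for each walker
    key_prop, key_acc = jax.random.split(key)
    # isotropic Gaussian proposal around the current positions
    r_new = r + jnp.sqrt(tau) * jax.random.normal(key_prop, r.shape)
    # accept with probability min(1, |psi(r')|^2 / |psi(r)|^2), in log space
    log_ratio = 2 * (log_psi(r_new) - log_psi(r))
    accept = jnp.log(jax.random.uniform(key_acc, (r.shape[0],))) < log_ratio
    return jnp.where(accept[:, None], r_new, r)
```

The Langevin sampler differs only in the proposal, which is shifted along the gradient of log|ψθ|, and in the corresponding asymmetric-proposal correction of the acceptance probability.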
A. Energy convergence
The convergence of the energy estimate and its error bar throughout the evaluation of an Ansatz trained on the CH4 molecule are plotted in Fig. 3. In the top pane, the final value of the exponentially weighted moving average of the training energies and its estimated error are also shown with a horizontal line and shaded area. It can be seen from this plot that the energy estimate of the evaluation converges gradually toward the final training energy as expected, while its sampling error converges toward zero. Note that due to the parameter updates during the optimization, the energy estimate from the training is an unreliable estimate and a thorough evaluation of the energy expectation value requires sampling the Ansatz with fixed parameters.
In the bottom pane of Fig. 3, the convergence of the estimated sampling error is compared between the Metropolis sampler and the Langevin sampler. Importantly, the expected n^(−1/2) convergence behavior is observed for both methods. Comparing the two algorithms, it can be seen that the error of MALA converges slightly faster than that of random walk MCMC, indicating a lower degree of correlation between the subsequent positions of the walkers of the Langevin sampler.
B. Decorrelated sampling
The autocorrelation times for five atoms of increasing size and the cyclobutadiene molecule are plotted in the upper pane of Fig. 4, for both the Metropolis and the Langevin sampler. The general trend of longer autocorrelation times for larger systems can be observed for the Metropolis sampler. One of the main causes of this trend is the increasing nuclear charge, which induces higher and higher peaks in the distribution of the electrons near the nuclei. These pronounced peaks necessitate shorter update proposal radii, ultimately resulting in a higher correlation between subsequent samples. Furthermore, the autocorrelation time is expected to grow with the increasing complexity of the wave functions and their nodal surfaces. On the other hand, the Langevin sampler seems less affected by this trend, delivering largely constant autocorrelation times for all systems. It is reasonable to assume that by explicitly making use of information about the gradient of the wave function, the MALA update proposal retains better decorrelation efficiency than random walk MCMC, when considering more and more complicated wave functions. Finally, the showcased autocorrelation times are in reasonably good agreement with the fact that the default number of decorrelating steps performed between parameter updates is chosen between 10 and 30 in the currently used neural wave function program packages.
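For reference, the integrated autocorrelation time can be estimated from a scalar observable (e.g., the local energy along a Markov chain) with the standard windowing heuristic; the following generic sketch (a textbook estimator, not necessarily the exact analysis code used for Fig. 4) illustrates the procedure.

```python
import numpy as np

def integrated_autocorrelation_time(x, c=5.0):
    # normalized autocorrelation function of the series
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[x.size - 1:]
    acf = acf / acf[0]
    # accumulate 1 + 2 * sum of autocorrelations, stopping at the first
    # lag that exceeds c times the running estimate (windowing heuristic)
    tau = 1.0
    for lag in range(1, x.size):
        tau += 2.0 * acf[lag]
        if lag >= c * tau:
            break
    return tau
```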
The experiments depicted in Fig. 4 also demonstrate a slightly smaller correlation between subsequent samples of the Langevin sampler in comparison with those of the Metropolis sampler, for all but the smallest of systems. In the bottom pane of Fig. 4, the wall clock run times of performing τ sampling steps are shown for each system, to account for the slightly increased computational cost of the MALA update proposal. Considering wall clock run times, the Metropolis sampler is more efficient on atoms up to carbon, while the Langevin sampler performs slightly to considerably better on the larger atoms and cyclobutadiene. While we find MALA to be more efficient than random walk Metropolis, we observe that for larger systems with heavier nuclei, it can result in less stable optimization. To preserve the black-box nature of the method, random walk MCMC is therefore applied in all subsequent experiments.
VI. SCALING
Understanding the scaling of a method’s computational cost with the considered system size is of utmost importance in the field of quantum chemistry, where a pervasive caveat of the most accurate approaches is their unfavorable scaling behavior. Given its high accuracy, the asymptotic scaling of VMC (typically listed as N^4)32 is considered favorable. Although this general scaling is, indeed, much better than, for example, the N^7 scaling of the gold-standard CCSD(T) method, and on a par with the scaling of hybrid density functionals (such as DM213), deep QMC calculations incur a larger prefactor, resulting in much higher practical costs on systems of intermediate size. While reducing this prefactor is an important long term goal of method developers in the field, investigating the method’s scaling is also of interest, to estimate the prospect of system sizes feasible with further improvement and to serve as a baseline for future developments. In this section, the scaling of the computational cost of the variational training of deep-learning Ansätze is investigated using the DeepQMC program package. Further scaling aspects of the pseudopotential implementation and design choices regarding the major computational bottlenecks of the algorithm are discussed in Appendix A.
The theoretical scaling of VMC (N^4) is obtained when combining the N^3 cost of the determinant evaluation with an additional factor of N from the Laplacian required in the computation of the kinetic energy. In practice, however, for simulations with the currently feasible system sizes, the determinant evaluation makes up only a fraction of the computational cost, which is instead dominated by the evaluation of the neural networks of the Ansatz. To investigate the practical scaling of a variational training step in DeepQMC, single iteration run times are compared across atoms with increasing atomic numbers, as well as across chains containing an increasing number of hydrogen atoms (Fig. 5). Although the systems contain different numbers of particles, due to the parametrization with GNNs, the total parameter count of the wave function Ansatz changes only marginally between systems. On the other hand, owing to the varying numbers of nuclei, isoelectronic species can have slightly different computational requirements. The system classes of atoms and hydrogen chains were chosen, as they represent the lower and upper bounds, respectively, on the number of nuclei a neutral system with a fixed number of electrons can contain. Consequently, the scaling of the run time with the number of electrons is also expected to be bounded by these system classes. With the tight empirical bounds of N^2.36–N^2.79 depicted in Fig. 5, the observed scaling of DeepQMC is still far below the theoretical estimate of N^4, highlighting the potential for extension to larger systems.
VII. ANSATZ VALIDATION
Relying on the general framework introduced above, the DeepQMC software suite enables the use of many of the previously published deep-learning Ansätze by providing configuration files to reproduce PauliNet,7 FermiNet,8 and DeepErwin.11 To validate our implementation of these Ansätze, the hyperparameters of sampling, optimization, and GNN architecture are compared in depth to those of the respective reference implementations. In addition, it is verified that when using the same parameters, the DeepQMC implementations predict the same wave function value and local energy as their reference counterparts for a given configuration of electrons and nuclei. Note that we have refrained from exactly matching the cusp-corrected GTOs of PauliNet, because subsequent work has demonstrated that explicitly including a reference solution limits the accuracy of the Ansatz. However, by combining Gaussian envelopes initialized from the coefficients of a reference calculation with a nuclear cusp correction in the Jastrow factor (26), it is possible to obtain a variant of PauliNet within DeepQMC that matches the characteristics of the original Ansatz.
In Fig. 6, the empirical performance of the various Ansätze is checked against results published in the literature for a small set of molecules. It can be seen that our DeepQMC implementation of PauliNet, FermiNet, and DeepErwin matches the reference energies well. The remaining discrepancies of FermiNet result from slightly different experimental setups, such as an increased number of reference optimization steps (200 000 compared to 50 000 used here) and batch size (4096 compared to 2048 used here), or an older TensorFlow-based implementation being used in the case of N2. The impact of these differences on the model accuracy highlights the importance and difficulty of comparing Ansätze implemented in different codebases under the same experimental conditions.
As a further contribution, we introduce and analyze the performance of a new default Ansatz for the DeepQMC program package, which we refer to as PauliNet2. This exemplary Ansatz was optimized to have a good trade-off between accuracy and trainable model parameters. Despite achieving a similar accuracy as FermiNet and DeepErwin for the small systems under investigation (see Figs. 6 and 7), the PauliNet2 Ansatz has only about a third of the model parameters of FermiNet and a quarter of DeepErwin (i.e., for the CO molecule, 239 829, 766 944, and 998 816 parameters, respectively). The Ansatz combines the SchNet-like graph convolutions of PauliNet (17) with the iterative update of the edge embeddings of FermiNet (18). Edge features are constructed from difference vectors between the electrons, and isotropic exponentials are used as envelopes. Furthermore, the Ansatz comprises a trainable Jastrow factor (24) and the fixed electronic cusp correction (25). While these hyperparameters are found to be suitable for the presented experiments, it is conceivable that an extended hyperparameter search targeting specific applications could further improve its performance. The detailed settings of the discussed Ansätze can be found in the respective configuration files shipped with the DeepQMC package.
VIII. APPLICATION EXAMPLES
In this section, the ease of applying the DeepQMC program package as a black-box method to obtain electronic energies is demonstrated on benchmark datasets. Two widely different example problems are chosen in order to showcase the general applicability of the presented method. The experiments are performed using the DeepQMC command line interface, which exposes all configuration options of the software suite while also allowing for effortless submission of simple calculations. Short usage examples of the DeepQMC command line interface are provided in Appendix B.
A. Small molecule reactions
The electronic contributions to the reaction energies of 12 reactions involving small inorganic molecules and hydrocarbons are investigated. These reactions were used by Nemec57 to benchmark the accuracy of Slater–Jastrow (SJ) type trial wave functions, constructed following Drummond et al.58 using electron–electron, electron–nucleus, and electron–electron–nucleus terms in the Jastrow factor. The 14 participating molecules are built from H, C, O, N, and F atoms, containing at least 2 and at most 22 electrons. To facilitate the comparison with the DMC results of Nemec,57 the same molecular geometries are considered, obtained from the work of Feller.59 Reference energies are taken from the review of O’Neill.60 All-electron CCSD(T) energies extrapolated to the complete basis set limit are computed in-house, using the PySCF program package.36
First, single-point electronic energies obtained for the participating molecules are compared in Fig. 7. On the vertical axis, the error of the recovered total energy is plotted, for VMC and DMC calculations utilizing SJ type trial wave functions and for VMC with deep-learning Ansätze. Looking at Fig. 7, one can observe that the total energy errors of SJ-VMC Ansätze are consistently above 47 mEh (with a mean of 114 mEh), while the associated DMC errors are in the range of 8–50 mEh (26 mEh on average). In comparison, deep-learning Ansätze exhibit a maximum total energy error of only 11 mEh, with a mean deviation of 2.6 mEh. While the main goal of quantum chemistry methods is to accurately model energy differences, rather than recover exact total energies, it is encouraging to see that DeepQMC and deep-learning Ansätze, in general, are very competitive in this area.
The accuracy of the energy differences obtained with SJ-DMC, CCSD(T), and deep-learning QMC methods is compared in Fig. 8. Note that the energy differences obtained with SJ-VMC are not shown in this figure, as they are an order of magnitude less accurate than the depicted approaches. Comparing the SJ-DMC results with those obtained from DeepQMC, one concludes that combining the VMC method with expressive deep-learning Ansätze greatly increases its accuracy, surpassing SJ-DMC on eleven out of twelve reactions. The accuracy advantage of DeepQMC’s PauliNet2, FermiNet, and DeepErwin Ansätze is similarly clear when comparing their respective reaction energy mean absolute deviations (MADs) of 2.4, 2.3, and 1.5 mEh to the 7.6 mEh of SJ-DMC. As a final comparison, Fig. 8 also shows the reaction energy differences obtained from a complete basis set extrapolated, all-electron CCSD(T). Not surprisingly, CCSD(T) performs outstandingly on these small, single reference systems in equilibrium geometry, achieving a MAD of 1.3 mEh, and chemical accuracy (less than 1 kcal/mol or 1.6 mEh error) on ten reactions. In comparison, the MAD value of PauliNet2, FermiNet, and DeepErwin for this exemplary study with DeepQMC is found to approach that of CCSD(T), and chemical accuracy is achieved on seven, seven, and eight out of twelve reactions, respectively.
B. Transition metal oxides
The effects of utilizing pseudopotentials in the variational optimization of deep-learning molecular wave functions are evaluated on a series of four first-row transition metal oxides. The bond lengths of the ScO, TiO, VO, and CrO molecules are taken from the experimental results of Annaberdiyev et al.61 The latest ccECP pseudopotentials61 are applied to the transition metal atoms only, replacing neon-like cores of ten electrons. Although replacing argon cores (18 electrons) with pseudopotentials would result in even larger computational savings, this is avoided as the third shell electrons are known to play a non-negligible role in the bond formation of transition metal atoms.62 Apart from the introduction of pseudopotentials, the Ansatz employed on small molecule reactions (Sec. VIII A) is utilized here without further modifications.
IX. SUMMARY AND CONCLUSIONS
We have presented the DeepQMC program package—a general variational quantum Monte Carlo framework for optimizing deep-learning molecular wave functions. The implementation focuses on modularity, facilitating rapid development of new Ansätze, and provides maximal freedom in choosing the specifics of the variational training setup. The Ansatz shipped with DeepQMC attempts to unify most of the currently existing deep-learning molecular wave functions while remaining easy to extend as new models emerge. To reduce the computational complexity associated with heavy nuclei, some popular precomputed pseudopotentials are also implemented.
Using the framework provided by DeepQMC, the most important practical aspects of variational optimization of deep-learning molecular wave functions are discussed. The importance of a proper gradient estimator along with robust gradient clipping is highlighted. For consistent Ansatz initialization, supervised pretraining to HF wave functions is suggested. The advantages of using the second-order KFAC optimizer are demonstrated, along with a rationalization of its effectiveness. The theoretical convergence of the Markov chain Monte Carlo sampling error is verified, and MALA is shown to be more effective in obtaining decorrelated samples than the widely utilized random walk MCMC algorithm. The empirical scaling of the method’s computational cost is found to be more favorable than that of the most popular post-HF approaches, while its large prefactor is identified as an obstacle on the path to wider adoption.
The black-box application of the program package is demonstrated in two significantly different settings. The electronic reaction energies of 12 small molecule reactions are computed with a mean absolute deviation of 1.5 mEh and compared to the 1.3 mEh achieved by CCSD(T) and 7.6 mEh achieved by DMC with SJ Ansätze. Using the same Ansatz hyperparameters, dissociation energies are computed for a series of transition metal monoxides, utilizing the latest ccECP61 pseudopotential. Improved training characteristics compared to all-electron calculations highlight the benefit of employing pseudopotentials. The accuracies of the predicted dissociation energies are on a par with or exceed those of some other recently popularized methods, such as auxiliary field quantum Monte Carlo or density matrix renormalization group.
To conclude, the presented method shows great promise to become an easy-to-use, general, black-box method accurately describing the molecular electronic structure. Especially encouraging is the favorable scaling of computational requirements with increasing system size. It is easy to envision that after further development reducing the large prefactor of the computational costs, the DeepQMC package will prove valuable to the wider quantum chemistry community.
ACKNOWLEDGMENTS
We would like to thank Jannes Nys for helpful discussions regarding the nuclear cusp correction and Leon Gerard for providing reference data for the Ansatz comparison. The work of Z.S. and P.B.S. was funded and supported by the Berlin Mathematics Research Center MATH+ (Project No. AA2-22). The work of M.M. received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 772230). The computing time granted by the HLRN and provided on the supercomputer Lise at NHR@ZIB is gratefully acknowledged.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Z.S. and P.B.S. contributed equally to this work.
Z. Schätzle: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Methodology (equal); Software (equal); Visualization (equal); Writing – original draft (equal). P. B. Szabó: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Methodology (equal); Software (equal); Visualization (equal); Writing – original draft (equal). M. Mezera: Data curation (equal); Formal analysis (equal); Software (equal); Visualization (equal); Writing – original draft (equal). J. Hermann: Software (equal); Supervision (equal). F. Noé: Conceptualization (equal); Funding acquisition (equal); Project administration (equal); Resources (equal); Supervision (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
APPENDIX A: ADDITIONAL SCALING EXPERIMENTS
1. Pseudopotentials
Figure 10 shows the scaling of the run time of the non-local pseudopotential evaluation [the second term in Eq. (12)] on the third-row and fourth-row atoms. This term overwhelmingly dominates the total computational overhead of using pseudopotentials. From the five nested summations of Eq. (12), only the sum over the valence electrons scales with the number of electrons, hinting at approximately linear scaling with system size. Figure 10 shows that the obtained empirical scaling of N^1.19 is in good agreement with this expectation. The sudden jump in run time from 20 to 21 electrons is caused by the reduction in the number of valence electrons, as the utilized ccECP pseudopotentials use a larger core for 4p elements than for 3d ones.
2. Memory requirement
For all investigated applications, the memory requirement bottleneck is presented by the computation of the Laplacian of the wave function, ∇²rψθ. In this step, one obtains second derivatives of the wave function with respect to the electron coordinates. We obtain the derivatives of the wave function by applying automatic differentiation in the backward–forward mode. While the gradient ∇rψθ is obtained in one backward pass for all coordinates, the diagonal of the Hessian requires an additional 3N forward-mode differentiations to compute, one for each electron coordinate. Due to the flexible function transformations of JAX, both the serial and parallel executions of the 3N forward-mode differentiation passes can be implemented in a few lines of code, with the two implementations differing in how they trade computational efficiency for memory requirement.
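A condensed version of this pattern is sketched below; f is assumed to be a scalar function (e.g., log|ψθ|) of the flattened electron coordinates of length 3N, and the serial jax.lax.map can be swapped for jax.vmap to obtain the parallel variant.

```python
import jax
import jax.numpy as jnp

def laplacian(f):
    def lap(x):  # x: flattened electron coordinates, shape (3N,)
        n = x.shape[0]
        grad_f = jax.grad(f)  # one backward pass yields the full gradient
        eye = jnp.eye(n)

        def hess_diag(i):
            # a forward-mode JVP of the gradient gives one Hessian column;
            # its i-th entry is the diagonal element needed for the trace
            _, col = jax.jvp(grad_f, (x,), (eye[i],))
            return col[i]

        # serial loop over the 3N passes keeps the memory requirement flat;
        # jax.vmap(hess_diag)(jnp.arange(n)) would run them in parallel
        diag = jax.lax.map(hess_diag, jnp.arange(n))
        return grad_f(x), diag.sum()
    return lap
```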
To decide between the serial and parallel approaches to the Laplacian computation, benchmark calculations on a series of atoms with increasing nuclear charges are performed. The obtained relative memory requirements of the parallel and serial computations are presented on the left vertical axis of Fig. 11. The observed linear scaling of the relative memory requirement between parallel and serial evaluations can be understood by considering that the parallel implementation holds data for all 3N forward passes in memory, while the serial approach stores data for a single pass at a time. However, the prefactor of the scaling curve is significantly less than three, which indicates that JAX performs some optimizations on the parallel code that reduce the naive 3N memory requirement multiplier. Considering the run times of the two versions (lower panel of Fig. 11), it is found that the relative timing of the serial implementation over the parallel implementation does not scale with the system size. In fact, the ratio of run times between the serial and parallel implementations appears to converge around 1.5. Taking the above observations into account, the serial implementation of the Laplacian evaluation is chosen, due to its favorably scaling memory requirement, which outweighs the slight, non-scaling run time advantage of the parallel implementation.
APPENDIX B: USAGE OF DEEPQMC
Here, we provide a few minimal examples of the usage of the DeepQMC command line interface. The interface is based on hydra, which provides a modular way to configure and execute complex jobs. DeepQMC implements a wide variety of configuration options for the wave function Ansatz as well as the hyperparameters of training and evaluation. For ease of use, the package includes predefined configuration files, which can be augmented using the command line or extended with custom configuration files. For a thorough tutorial and API documentation, the reader is referred to the DeepQMC documentation. For examples of typical DeepQMC commands, see Fig. 12.