Existing adaptive bias techniques, which seek to estimate free energies and physical properties from molecular simulations, are limited by their reliance on fixed kernels or basis sets which hinder their ability to efficiently conform to varied free energy landscapes. Further, user-specified parameters are in general non-intuitive yet significantly affect the convergence rate and accuracy of the free energy estimate. Here we propose a novel method, wherein artificial neural networks (ANNs) are used to develop an adaptive biasing potential which learns free energy landscapes. We demonstrate that this method is capable of rapidly adapting to complex free energy landscapes and is not prone to boundary or oscillation problems. The method is made robust to hyperparameters and overfitting through Bayesian regularization which penalizes network weights and auto-regulates the number of effective parameters in the network. ANN sampling represents a promising innovative approach which can resolve complex free energy landscapes in less time than conventional approaches while requiring minimal user input.

## I. INTRODUCTION

Use of adaptive biasing methods to estimate free energies of relevant chemical processes from molecular simulation has gained popularity in recent years. Free energy calculations have become central to the computational study of biomolecular systems,^{1} material properties,^{2} and rare events.^{3} The metadynamics algorithm,^{4} which uses a sum of Gaussian kernels to obtain a free energy landscape (FEL) on-the-fly, has unleashed a groundswell of new activity studying on-the-fly approximations of FELs. Many extensions of this method have been proposed,^{3,5–7} which enhance the speed and accuracy of metadynamics simulations, though the essential limitation—representing the FEL with Gaussians—has remained. Notably, the optimal choice of parameters such as the multidimensional Gaussian widths, height, and deposition rate is unknown for each new system and poor choices can severely affect both the convergence rate and the accuracy of the estimated FEL. Partial remedies involving locally optimized bias kernels exist,^{5,8} though these cannot correct for systematic limitations inherent to the approximation of a surface by Gaussians, including boundary errors.^{9}

Inspired by adaptive bias umbrella sampling,^{10–13} algorithms have begun to emerge which address these limitations, including variationally enhanced sampling (VES),^{14} basis function sampling (BFS),^{15} and Green’s function sampling (GFS).^{16} Each utilizes orthogonal basis expansions to represent the underlying FEL, with VES deriving a variational functional which, when stochastically minimized, yields the optimal expansion coefficients. The related BFS approximates FELs using a probability density obtained through an accumulated biased histogram and determines the coefficients by direct projection of the data onto a chosen orthogonal basis set. GFS, a dynamic version of BFS, invokes a metadynamics-like algorithm approximating a delta function, resulting in a trajectory which reveals optimal coefficients on-the-fly. The basis expansions employed by these methods, which can represent the often nuanced and highly nonlinear FELs more robustly than Gaussians, also benefit by eliminating the need for kernel-specific properties such as width and height. A user must still specify the number of terms to include in the expansion and “deposition rate” parameters which control the aggressiveness of bias updating. Still, there remain a number of non-ideal practical and theoretical issues which must be addressed. Importantly, the use of *any* functional form to represent the FEL inherits both the advantages and disadvantages of that set of functions. Basis sets, for example, may lead to Runge or Gibbs phenomena on sharply varying surfaces or near boundaries. These characteristic oscillations introduce artificial free energy barriers which may prevent the system from sampling important regions of phase space, and result in a non-converging bias. These do not self-correct, and increasing the order of the expansion is generally not helpful.^{17}

Alternative algorithms, while less prevalent in the literature, nonetheless exhibit many benefits. Adaptive biasing force (ABF) methods,^{18} developed in parallel to metadynamics-derived methods, seek to estimate the underlying mean force (rather than the free energy) at discrete points along the coordinates of interest. The mean force is then integrated at the end of a simulation to obtain the corresponding FEL. The primary advantage of ABF over biasing potential approaches is the local nature of the mean force at a given point. While the free energy is a global property related to the probability distribution along a set of coordinates, the mean force at a given point is independent of other regions. As a consequence, ABF methods have proven to be formidable alternatives to biasing potentials and are recognized to rapidly generate accurate FEL estimates.^{19} Though powerful in practice, these have seen limited application due to the complexity of the original algorithm^{18} and restrictions on the type of coordinates which may be sampled.^{19,20}

Recognizing the distinct strengths and weaknesses of existing algorithms, we present an adaptive biasing potential algorithm also, utilizing artificial neural networks (ANNs) to construct the FEL. ANNs are a form of *supervised learning*, consisting of a collection of activation functions, or neurons, which combine to produce a function that well aproximates a given set of data.^{21} The use of ANNs in the form of “deep learning” has permeated virtually every quantitative discipline ranging from medical imaging^{22} to market forecasting.^{23} ANNs and related machine-learning tools have also been applied in various ways in molecular simulation, to fit *ab initio* potential energy surfaces,^{24,25} identify reaction coordinates,^{26} analyze biomolecular conformations,^{27,28} obtain structure-property relationships,^{29} and determine quantum mechanical forces in molecular dynamics (MD) simulations.^{30} Recently, advances in machine learning applied to molecular simulations have aided in the reconstruction of FELs. As a competitive alternative to the weighted histogram analysis method (WHAM),^{31} Gaussian process regression (GPR) was used to reconstruct free energies from umbrella sampling simulations.^{32} Adaptive Gaussian-mixture umbrella sampling was also proposed as a way to identify free energy basins in complex systems. Similar to metadynamics, this approach relies on Gaussian representations of the free energy surface, with a focus on resolving low lying regions in higher dimensional free energy space. Very recently,^{33} ANNs were combined with a nearest-neighbor density estimator for high dimensional free energy estimation. They were also used to post-process free energy estimates obtained from enhanced sampling calculations, providing a compact representation of the data for interpolation, manipulation, and storage.^{34} To the best of our knowledge, the only use of ANNs to estimate free energies on-the-fly was the very recent work by Galvelis and Sugita.^{33} Although our proposed method also utilizes ANNs, it differs critically in their application. The Bayesian NN-based free-energy method reported here rapidly adapts to complex FELs without spurious boundary and ringing artifacts, retains all accumulated statistics throughout a simulation, and is robust to the few user-specified hyperparameters and flexible enough to predict FELs where key features may be unknown before simulation.

## II. METHODS

### A. Artificial neural networks

Inspired by biological neurons, artificial neural networks (ANNs) are a collection of interconnected nodes which act as universal approximators; that is, ANNs can approximate any continuous function to an arbitrary level of accuracy with a finite number of neurons.^{35} ANNs are defined by a network topology which specifies the number of neurons and their connectivity. Learning rules determine how the weights associated with each node are updated. Layers within a neural network consist of sets of nodes which receive multiple inputs from the previous layer and pass outputs to the next layer. Here we utilize a fully connected network, where every output within a layer is an input for every neuron in the next layer.

Mathematically, the activation *F*_{i} of a fully connected layer *k* can be phrased as

where $wji$ and *b*_{i} terms define the layer weights and biases, respectively. Indices *i* and *j* represent the number of neurons in the current and previous layers, respectively. The activation function *S*(*x*) is applied element-wise and is taken to be *S*(*x*) = tanh(*x*), with the exception of the output layer which utilizes the linear activation *S*(*x*) = *x*. Backpropagation^{36} is used to update the network weights, and to obtain the output gradient with respect to the inputs for force calculations within molecular dynamics simulations.

In this work, the network seeks the free energy of a set of collective variables (CVs). We define a CV(*ξ*) via the mapping, $\xi :R3N\u2192R$ from the 3*N* positional coordinates of a molecular system onto a continuous subset of the real numbers. This can be any property capturing essential behavior within the system, such as intermolecular distances, energies, or deviation from native protein structures. The set of collective variable coordinates and corresponding approximate free energies $\xi 1,F\u03031,\xi 2,F\u03032,\u2026,\xi N,F\u0303N$ represent the training data, and obtaining them is a key component of this method described in Sec. II C.

Training a neural network involves finding the optimal set of weights and biases that best represents existing data while maximizing generalizability to make accurate predictions about new observations. In particular, one seeks to find the least complex ANN architecture that can capture the true functional dependence of the data. However, since it is nearly impossible to know the proper architecture for an arbitrary problem beforehand, several strategies exist to ensure that a sufficiently complex neural network does not overfit training data. One common approach known as *early stopping* involves partitioning data into training and validation sets. Network optimization is performed using the training data alone, terminating when the validation error begins to increase. Another approach, referred to as regularization or weight decay, penalizes the *L*_{2} norm of the network weights which is analogous to reducing the number of neurons in a network.^{21} In practice, both of these strategies are often used simultaneously.

Although early stopping and regularization are great approaches for post-processing and offline applications, they are not well suited for on-the-fly sampling. Ideally one would like to utilize all of the statistics collected over the course of a simulation in constructing the free energy estimate without needing to subsample the data for validation. More importantly, the optimal choice of network architecture, regularization strength, and other hyperparameters will vary from system to system and typically require the use of cross-validation to be identified. Hyperparameter optimization will not only make a free energy method prohibitively expensive, but it will also require significant user input, negating many of the advantages offered over existing enhanced sampling methods. Other choices such as the training loss function and optimization algorithm can affect the quality of fit and speed of convergence. In what follows, we show how an appropriate choice of loss function and optimization algorithm combined with a Bayesian framework yields a self-regularizing ANN which uses all observations for training, is robust to hyperparameter choice, maintains generalizability, and is thus ideal for use in an adaptive sampling method.

### B. Bayesian regularization

We begin by defining the loss function with respect to the network weights **w** for a set of observations ${\xi i,F\u0303i}$ which are the CV coordinates and free energy estimates,

The index *i* in this equation runs over all *N* bins (in the first term) and all *K* weights and biases within the network (second term). The loss function can be decomposed into two terms: a sum squared error *E*_{D} between the network predictions $F^i$ and targets $F\u0303i$, and a penalty or regularization term *E*_{W} on the network weights and biases $wi$. The ratio *α*/*β* controls the complexity of the network, where an increase results in a smoother network response.

Following the practical Bayesian framework by MacKay,^{36,37} we assume that the free energy estimates, which are the target outputs, are generated by

where *F*(·) is the true (as of yet unknown) underlying free energy surface and *ε*_{i} is the random, independent, and zero mean sampling noise. The training objective is to approximate *F*(·) while ignoring the noise. The Bayesian framework starts by assuming that the network weights are random variables, and we seek the set of weights that maximize the conditional probability of the weights given the data *D*. It is useful, and perhaps more intuitive, to think about the distribution of weights as a distribution of possible functions represented by a given neural network architecture, *A*. Invoking Bayes’ rule yields

Although our notation resembles that of the Boltzmann distribution, its use here is purely statistical. Since we assumed independent Gaussian noise on the data, the probability density, which is our likelihood, becomes

where $\beta =1/(2\sigma \epsilon 2)$, $\sigma \epsilon 2$ is the variance of each element of *ε*_{i}, and $ZD(\beta )=(2\pi \sigma \epsilon 2)N/2=(\pi /\beta )N/2$. The prior density *P*(**w**|*α*, *A*) of the network weights is assumed to be a zero mean Gaussian,

with $\alpha =1/(2\sigma w2)$ and *Z*_{W} = (*π*/*α*)^{K/2}. We can rewrite the posterior density [Eq. (4)] as

The evidence *P*(*D*|*α*, *β*, *A*) is not dependent on the weights, and we can combine the normalization constants to obtain

It is now clear that maximizing the posterior density with the appropriate priors is equivalent to minimizing our loss. *α* and *β* also take on a statistical meaning. *β* is inversely proportional to the variance of the sampling noise. If the variance is large, *β* will be small and force the network weights to be small resulting in a smoother function, attenuating the noise.

What remains is to estimate *α* and *β* from the data,

The only remaining unknown is *Z*_{F}(*α*, *β*) which can be approximated via a Taylor expansion of the loss function about the maximum probability weights **w**^{MP},

where **H** = *β*∇^{2}*E*_{D} + *α*∇^{2}*E*_{W} is the Hessian. Substituting the expansion back in Eq. (8) gives

Equating this expression to the standard form of a Gaussian density yields $Zf(\alpha ,\beta )\u2248(2\pi )K/2det(HMP\u22121)1/2\u2061exp(\u2212E(wMP))$. Finally, setting the derivative of the logarithm of Eq. (10) to zero and solving for optimal *α*, *β* give

with *γ* = *K* − 2*α*^{MP}*tr*(**H**^{−1}), which also represents the effective number of parameters in the network used to reduce the error function. Starting with *γ* = *K*, the hyperparameters *α*, *β*, and *γ* are updated iteratively within a Levenberg-Marquardt optimization routine,^{38} where the Hessian is approximated from the error Jacobian **J** as **H** ≈ **J**^{T}**J**. Training proceeds until the error gradient diminishes, or the trust region radius exceeds 10^{10}.

### C. Sampling method

Figure 1 shows a schematic of the ANN sampling method. For a given CV *ξ*, the free energy along the CV is defined as

Here *k*_{B} is Boltzmann’s constant, *T* is the temperature, and *P*(*ξ*) is the probability distribution of the system along *ξ*, which the ANN method seeks to learn. Simulations proceed in stages or sweeps, indexed by *i*, where a bias, *ϕ*_{i}(*ξ*), is iteratively refined until it matches −*F*(*ξ*). Typically, a simulation begins unbiased, though pre-trained networks may be loaded if available. The value of *ξ* is recorded at each step into a sparse or dense histogram *H*_{i}(*ξ*), depending on the CV dimensionality. Upon completion of a fixed number of steps, *N*_{s}, this histogram is reweighted^{39} according to the bias applied during that sweep,

and subsequently added to a global partition function estimate,

Suitably scaled, this provides an estimator $P\u0303i(\xi )$ with associated free energy

approximating the true FEL *F*(*ξ*). As illustrated, the $F\u0303i(\xi )$ serve as a training set for the ANN, using the discrete CV coordinates as input data. As a result, there are $N=\u220fi=1NCVMi$ observations, where *M*_{i} is the number of bins along each CV *i* and *N*_{CV} is the total number of CVs. Henceforth, we will use a compact notation to denote the ANN architecture, with {5, 2} indicating a neural network with two hidden layers containing 5 and 2 neurons, respectively. The output layer contains a single neuron defining a function $F^i(\xi )$, comprising a best representation of the estimated free energy from the known data.

The bias for the initial sweep, *ϕ*_{0}(*ξ*), is defined at the outset. Subsequent sweeps utilize the ANN output via the equation $\varphi i+1(\xi )=\u2212F^(\xi )$, where $F^$ denotes the optimized output of the neural network. By construction, this biasing function is continuous, differentiable, and free of ringing artifacts. Furthermore, one particular strength of this method is the capacity for a well-trained ANN to interpolate over the learned surface; each successive *ϕ*_{i}(*ξ*) yields a good estimate of the free energy even in poorly sampled regions, necessitating fewer samples in each histogram bin to obtain a high fidelity approximation of the FEL, as will be demonstrated below. The entire process is then iterated until the output layer $F^(\xi )$ achieves a measure of convergence to the true FEL *F*(*ξ*).

## III. RESULTS

### A. One-dimensional FEL

We first illustrate the robustness of the ANN method by studying a synthetic one-dimensional Monte Carlo FEL composed of 50 Gaussians. The rugged landscape is meant to represent that of a glassy system or protein folding funnel; this exact surface was utilized as a benchmark for GFS.^{16} Figure 2(a) shows the evolution of the ANN bias as a function of MC sweep *t*, where a sweep consists of 10^{5} moves. Utilizing a {40} network, the analytical result is matched impressively within 40 sweeps (4 × 10^{6} total MC moves). This highlights how even a parsimonious ANN is capable of capturing rugged topographies.

Further investigation into the effect of network size on the accuracy of the bias yields interesting results. Figure 2(b) shows the difference in *ϕ*(*ξ*) and *F*(*ξ*) for network sizes from {40} to {100}. While the smallest network does not exceed 5*k*_{B}*T* in error, significant improvement is evident upon increasing the network size to {50}. Beyond this, the error does not decrease significantly upon further refinement, indicating that Bayesian regularization converges to an effective number of parameters, demonstrating good generalization and lack of overfitting. Also plotted is a projection of the FEL onto a 70° Legendre polynomial, which represents an idealized result of basis expansion methods.^{14–16} There are clear minor oscillations across the entire interval and more pronounced ones in regions of sharp derivatives. Edge effects are also present with a substantial amount of error introduced at the right boundary. Remarkably, the ANN is capable of yielding a more accurate result using many fewer functions while presenting no appreciable ringing.

Many free energy landscapes for meaningful systems are considerably smoother than this one-dimensional example. However, since adaptive sampling methods are iterative, initial estimates of even smooth surfaces are often noisy and sharply varying. This presents a challenge for basis expansion methods as one must use a large enough basis to drive the system out of metastable states yet one small enough that early sampling noise is effectively smoothed by low frequency truncation. Practically, the FEL can be recovered from a reweighted histogram which affords some flexibility in basis size, but the issue remains that enough terms must be retained to adapt to any features of significant size. The resulting emergence of oscillation artifacts greater than a few *k*_{B}*T* can unfortunately introduce significant sampling issues.

### B. Two-dimensional FEL: Alanine dipeptide in water

As a more realistic example, we study the isomerization of alanine dipeptide (ADP) in water. This system is a good testbed for free energy methods, as it has a rich FEL with significant variation and has been thoroughly studied. It is important to note that the dynamics of ADP in water are *significantly* slower than those in a vacuum and are reflected in the convergence times of the free energy surfaces. The two torsional angles *ϕ* and *ψ* of the ADP molecule are the CVs, which approximate,^{26,40} but are not precisely, the slow modes of the isomerization process. ADP simulations are carried out in the NPT ensemble with a time step of 2 fs and hydrogen bond constraints using the LINear Constraint Solver (LINCS) algorithm. Long-range electrostatics are calculated with particle mesh Ewald on a grid spaced at 0.16 nm. A stochastic velocity rescaling thermostat is used with a time constant of 0.6 ps to maintain 298.15 K, and a Parrinello-Rahman barostat with a time constant of 10 ps maintains a pressure of 1 bar. The system consists of a single ADP molecule and 878 water molecules described by the Amber99sb and TIP3P forcefields, respectively. A modified version of SSAGES v0.7.5 compiled with Gromacs 2016.3 was used for all simulations.

A periodic grid containing 60 × 60 points is used with the ANN having topology {10, 6} and a chosen sweep interval of 10 ps. Figure 3 shows snapshots of the estimated FEL at various time intervals. At early times, one can see how the ANN is able to interpolate poorly sampled and entirely unsampled areas within the vicinity of the initial exploration. The ANN bias is quickly refined, as data are collected in these regions. Statistical noise is smoothed via the Bayesian regularization which forces a smooth network response at early times by increasing weight decay. After only 2.5 ns, the bias closely resembles the final FEL, with only minor distortion of features present at regions of low probability. It is clear that even with limited and noisy data the ANN is able to construct an idealized smooth interpolation of the FEL. In fact, the unbiased histogram used to train the ANN remains noisy for a significant amount of time after the network has converged. This is in stark contrast to other methods which rely on the unbiased histogram to recover an accurate FEL where the bias potential is unable to capture the fine details of the surface.

We investigate the sensitivity of ANN sampling to hyperparameters by studying six additional combinations of network architectures and sweep intervals. The reference FEL is the final ANN state after 100 ns of simulation from the previous example. The new systems were run on a 30 × 30 histogram, unlike the reference, to examine the behavior of ANN sampling with less available data and subject to larger discretization. Figure 4 shows the root mean square error (RMSE) of the FELs over time for the various systems. Except for a single setup, all FELs converged to within a *k*_{B}*T* at about 2.5 ns and 1 kJ/mol at 5 ns. The remaining configuration eventually converges at approximately 12 ns. This relatively slow convergence represents the limit of network size and sweep length; a {8, 4} network is *just* able to represent the FEL but requires more frequent updates and training. For the larger {10, 6} and {12, 8} networks, the sweep length has minimal impact on performance. For the end user, any reasonable choice of sweep length and a network of sufficient size should provide near optimal performance for new systems, eliminating the need for pilot simulations and significant prior experience.

### C. Three-dimensional FEL: Rouse modes of a Gaussian chain

We compute the three-dimensional dense free energy landscape of the three longest wavelength Rouse modes^{41} for a 21 bead Gaussian chain. This unconventional CV is useful in understanding the conformational dynamics of polymer chains and represents a dense multi-variable FES which requires extensive visitation to resolve. The chosen system allows us to test the adaptability of ANN sampling to higher dimensions and compare the estimated FEL to analytical results for each Rouse mode *X*_{i} by integrating over the others. The system was modeled using non-interacting beads connected via harmonic springs with bond coefficients of unity in reduced units. A Langevin thermostat was used to maintain a reduced temperature of 2/3 integrated using the velocity-Verlet integrator with a time step of 0.005. A modified version of SSAGES v0.7.5 compiled with LAMMPS 30 July 2016 was used for the simulation.

A {12, 10} network with an *N* = 25^{3} grid is used to converge the FEL. Figure 5 shows a volume rendering of the final FEL and the three integrated Rouse modes compared to theoretical prediction. The near exact agreement provides an objective measure of the accuracy of ANN sampling. Additionally, the number of hidden layers required to represent the FEL does not necessarily increase with the number of dimensions. Introducing a second hidden layer when going from a single to two or more dimensions is necessary to efficiently represent multidimensional surfaces, but a further increase is not strictly required. This is because the level set of neuron is a hyperplane and only upon composition (via the addition of another hidden layer) is multidimensional nonlinearity naturally representable.

### D. Four-dimensional FEL: Alanine tripeptide in a vacuum

In our final example, we compute the free energy landscape of alanine tripeptide (Ace-Ala-Ala-Nme) in a vacuum using all four backbone dihedrals as collective variables. This example was previously studied by Stecher *et al.*,^{32} using the CHARMM22 forcefield and Gaussian process regression (GPR). Here we use Amber99sb, hydrogen bond constraints, and a 2 fs time step at 298.15 K. Electrostatic interactions were handled using particle mesh Ewald, and temperature was maintained using the stochastic velocity rescale thermostat. As with the previous examples, a modified version of SSAGES v0.7.5 was used, compiled with Gromacs 2016.3.

We use a {24, 20} network with an *N* = 20^{4} grid and a sweep length of 20 ps. As with the previous example, there is no need to introduce a third hidden layer. Figure 6 shows the evolution of the 4D free energy estimate projected onto two Ramachandran plots. ANN sampling rapidly builds up bias in the prominent free energy basins of the tripeptide, uncovering essential free energy basins and their structure within 5 ns of simulation time. This allows the CVs to begin diffusing across these basins within a few nanoseconds, as shown in Fig. 7. The remaining time is spent resolving regions of low probability, with a relative free energy as high as 90 kJ/mol. Convergence is achieved at approximately 20 ns with a reference FEL at a much later 60 ns, showing no appreciable change.

In Ref. 32, the 4D FEL of alanine tripeptide is reconstructed using GPR from an increasing number of short 500 ps simulation windows. Qualitative agreement with a reference simulation is achieved after a total aggregate simulation time of 1.3 *μ*s. While a direct comparison is not possible, we find that the 20 ns required to resolve the 4D FEL with ANN sampling to be remarkable. For high dimensional systems, this corresponds to a substantial improvement in efficiency, where traditional methods can be prohibitively expensive. The results presented here are from a single simulation, although ANN sampling is also trivially parallelizable. Our software implementation already contains support multiple walkers. Similar to GPR, this would allow multiple independent simulations to contribute to the same FEL, taking advantage of modern parallel architectures.

There are some additional important considerations for high dimensional CVs. As can be seen from our examples, we decrease the number of grid points per dimension as the system dimensionality increases. The simple reason for this is the curse of dimensionality: for a fixed point density, the training data size increases exponentially with increasing dimension. Furthermore, Bayesian regularization requires the calculation and storage of the Jacobian which scales as $O(N2)$ and requires the inversion of the approximate Hessian. In an effort to minimize training time, we investigate the effect of limiting the maximum number of training iterations on the performance of ANN sampling.

The results presented above were obtained using a maximum number of 10 iterations per sweep. While this seems to be unreasonably small, it performs very well as shown. The reason for this is that in ANN sampling, the network weights are carried over each training cycle. This can be seen as a form of pre-training, where the network only requires a minimal amount of fine-tuning to adapt to new data. We also carried out the simulations using 20, 50, and 100 maximum iterations and found little difference in performance and due to the fewer iterations incurred much less overhead. These limitations are a result of targeting a uniform probability distribution in the collective variables. Targeting a well-tempered distribution or implementing a maximum fill-level on the free energy can be complimented with a sparse grid since only regions of CV space accessible within a desired cutoff will be sampled. This sort of approach is often desirable and will scale better to even higher dimensions. We see this as an area of future research.

## IV. DISCUSSION AND CONCLUSIONS

We have presented a novel advanced sampling algorithm utilizing artificial neural networks to learn free energy landscapes. The ability of ANNs to act as universal approximators enables the method to resolve topologically complex FELs without presenting numerical artifacts typical of other methods which rely on basis sets. Bayesian regularization allows the entire set of training data to be used, prevents overfitting, and eliminates the need for any learning parameters to be specified. This and other robust features result in a method which is exceptionally swift and accurate at obtaining FELs.

As ANN sampling superficially resembles the recently proposed NN2B method,^{33} some remarks on their similarities and differences should be made. Though it also uses neural networks to estimate free energies, it differs significantly from the out method beyond that. NN2B requires the specification of a large number of parameters, requiring significant knowledge of CV correlation times and sampling density, and it is not clear how they influence convergence. Our use of a discretized global partition function estimate retains information on all CV states sampled over the course of a simulation, while NN2B requires a judicious choice of the degree of subsampling from prior sweeps. The use of standard neural networks and their density estimation procedure necessitates significantly larger architectures and sampling intervals to represent a given free energy landscape. It also means that some of the valuable data must be used for validation to avoid overfitting.

The consequences of this are reflected in their study of alanine dipeptide in a vacuum. Training times for each sweep are reported to range from 10 to 30 min, depending on the size of the network. Furthermore, it requires 50 ns to converge the two-dimensional FEL for ADP in a vacuum. This is considerably longer than the typical 2 ns using standard metadynamics. In comparison, our method requires seconds to train the 2D FEL and a minute for higher dimensions in the early stages; as the approximate FES becomes smoother, the training time decreases considerably due to regularization. Both the lengthy training times and slow convergence render NN2B impractical for most applications. We do however see merit in using nearest neighbor density estimators in place of sparse histograms and anticipate that this approach can be integrated within our proposed method.

The considerable advantages we have demonstrated biasing with ANN networks position this as an ideal method for the study of computationally expensive systems where sampling is limited. In particular, both first-principles MD and large biomolecule simulations are prime candidates for ANN sampling, where all but the most trivial of examples become intractable due to the typical time scales required to obtain free energy estimates. The significant speed improvements in sampling afforded by this method also open the door to the use of accelerated free-energy calculations for high throughput computational screening of material properties (see, e.g., Ref. 42). The ANN framework also makes it possible to initialize simulations with pre-trained networks, a technique commonly used in deep learning applications. Starting with a FEL from a classical simulation can considerably improve the convergence of a first-principles estimate, while a theoretical or coarse-grained model can inform the initial bias in a biomolecular simulation. ANN sampling is a promising new approach to advanced sampling which can help resolve complex free energy landscapes in less time than conventional approaches while overcoming many previous shortcomings.

## ACKNOWLEDGMENTS

H.S. acknowledges support from the National Science Foundation Graduate Fellowship Program (NSF-GRFP). This project was supported by MICCoM, as part of the Computational Materials Sciences Program funded by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division. H.S. and J.K.W. acknowledge computational resources at the Notre Dame Center for Research Computing (CRC). H.S. and J.K.W. acknowledge Michael Webb (U. Chicago) for providing the analytical Rouse mode solutions.

### APPENDIX: SOFTWARE AND DATA AVAILABILITY

All of the run files, data, and analysis scripts required to reproduce the results in this paper are freely accessible online at https://github.com/hsidky/ann_sampling. ANN sampling methods are provided for download as part of the advance sampling package SSAGES (Software Suite For Advanced Generalized-Ensemble Simulations) at https://github.com/MICCoM/SSAGES-public as of version 0.8.