Existing adaptive bias techniques, which seek to estimate free energies and physical properties from molecular simulations, are limited by their reliance on fixed kernels or basis sets, which hinders their ability to efficiently conform to varied free energy landscapes. Further, the user-specified parameters these methods require are generally non-intuitive yet significantly affect the convergence rate and accuracy of the free energy estimate. Here we propose a novel method wherein artificial neural networks (ANNs) are used to develop an adaptive biasing potential which learns free energy landscapes. We demonstrate that this method is capable of rapidly adapting to complex free energy landscapes and is not prone to boundary or oscillation problems. The method is made robust to hyperparameters and overfitting through Bayesian regularization, which penalizes network weights and auto-regulates the number of effective parameters in the network. ANN sampling represents a promising, innovative approach which can resolve complex free energy landscapes in less time than conventional approaches while requiring minimal user input.

Use of adaptive biasing methods to estimate free energies of relevant chemical processes from molecular simulation has gained popularity in recent years. Free energy calculations have become central to the computational study of biomolecular systems,1 material properties,2 and rare events.3 The metadynamics algorithm,4 which uses a sum of Gaussian kernels to obtain a free energy landscape (FEL) on-the-fly, has unleashed a groundswell of new activity in the on-the-fly approximation of FELs. Many extensions of this method have been proposed,3,5–7 which enhance the speed and accuracy of metadynamics simulations, though the essential limitation—representing the FEL with Gaussians—has remained. Notably, the optimal choice of parameters such as the multidimensional Gaussian widths, height, and deposition rate is unknown for each new system, and poor choices can severely affect both the convergence rate and the accuracy of the estimated FEL. Partial remedies involving locally optimized bias kernels exist,5,8 though these cannot correct for systematic limitations inherent to the approximation of a surface by Gaussians, including boundary errors.9

Inspired by adaptive bias umbrella sampling,10–13 algorithms have begun to emerge which address these limitations, including variationally enhanced sampling (VES),14 basis function sampling (BFS),15 and Green’s function sampling (GFS).16 Each utilizes orthogonal basis expansions to represent the underlying FEL, with VES deriving a variational functional which, when stochastically minimized, yields the optimal expansion coefficients. The related BFS approximates FELs using a probability density obtained through an accumulated biased histogram and determines the coefficients by direct projection of the data onto a chosen orthogonal basis set. GFS, a dynamic version of BFS, invokes a metadynamics-like algorithm approximating a delta function, resulting in a trajectory which reveals the optimal coefficients on-the-fly. The basis expansions employed by these methods can represent the often nuanced and highly nonlinear FELs more robustly than Gaussians, and also eliminate the need for kernel-specific parameters such as width and height, though a user must still specify the number of terms to include in the expansion and “deposition rate” parameters which control the aggressiveness of bias updating. Nevertheless, a number of non-ideal practical and theoretical issues remain to be addressed. Importantly, the use of any functional form to represent the FEL inherits both the advantages and disadvantages of that set of functions. Basis sets, for example, may lead to Runge or Gibbs phenomena on sharply varying surfaces or near boundaries. These characteristic oscillations introduce artificial free energy barriers which may prevent the system from sampling important regions of phase space and result in a non-converging bias. Such artifacts do not self-correct, and increasing the order of the expansion is generally not helpful.17

Alternative algorithms, while less prevalent in the literature, nonetheless exhibit many benefits. Adaptive biasing force (ABF) methods,18 developed in parallel to metadynamics-derived methods, seek to estimate the underlying mean force (rather than the free energy) at discrete points along the coordinates of interest. The mean force is then integrated at the end of a simulation to obtain the corresponding FEL. The primary advantage of ABF over biasing potential approaches is the local nature of the mean force at a given point. While the free energy is a global property related to the probability distribution along a set of coordinates, the mean force at a given point is independent of other regions. As a consequence, ABF methods have proven to be formidable alternatives to biasing potentials and are recognized to rapidly generate accurate FEL estimates.19 Though powerful in practice, these methods have seen limited application due to the complexity of the original algorithm18 and restrictions on the type of coordinates which may be sampled.19,20

Recognizing the distinct strengths and weaknesses of existing algorithms, we present an alternative adaptive biasing potential algorithm, utilizing artificial neural networks (ANNs) to construct the FEL. ANNs are a form of supervised learning, consisting of a collection of activation functions, or neurons, which combine to produce a function that well approximates a given set of data.21 The use of ANNs in the form of “deep learning” has permeated virtually every quantitative discipline, ranging from medical imaging22 to market forecasting.23 ANNs and related machine-learning tools have also been applied in various ways in molecular simulation: to fit ab initio potential energy surfaces,24,25 identify reaction coordinates,26 analyze biomolecular conformations,27,28 obtain structure-property relationships,29 and determine quantum mechanical forces in molecular dynamics (MD) simulations.30 Recently, advances in machine learning applied to molecular simulations have aided in the reconstruction of FELs. As a competitive alternative to the weighted histogram analysis method (WHAM),31 Gaussian process regression (GPR) was used to reconstruct free energies from umbrella sampling simulations.32 Adaptive Gaussian-mixture umbrella sampling was also proposed as a way to identify free energy basins in complex systems. Similar to metadynamics, this approach relies on Gaussian representations of the free energy surface, with a focus on resolving low-lying regions in higher dimensional free energy space. Very recently,33 ANNs were combined with a nearest-neighbor density estimator for high dimensional free energy estimation. They have also been used to post-process free energy estimates obtained from enhanced sampling calculations, providing a compact representation of the data for interpolation, manipulation, and storage.34 To the best of our knowledge, the only use of ANNs to estimate free energies on-the-fly is the recent work by Galvelis and Sugita.33 Although our proposed method also utilizes ANNs, it differs critically in how they are applied. The Bayesian NN-based free-energy method reported here rapidly adapts to complex FELs without spurious boundary and ringing artifacts, retains all accumulated statistics throughout a simulation, is robust to its few user-specified hyperparameters, and is flexible enough to predict FELs whose key features may be unknown before simulation.

Inspired by biological neurons, artificial neural networks (ANNs) are a collection of interconnected nodes which act as universal approximators; that is, ANNs can approximate any continuous function to an arbitrary level of accuracy with a finite number of neurons.35 ANNs are defined by a network topology which specifies the number of neurons and their connectivity. Learning rules determine how the weights associated with each node are updated. Layers within a neural network consist of sets of nodes which receive multiple inputs from the previous layer and pass outputs to the next layer. Here we utilize a fully connected network, where every output within a layer is an input for every neuron in the next layer.

Mathematically, the activation Fi of a fully connected layer k can be phrased as

$$F_i^{(k)} = S\Bigg(\sum_j w_{ji} F_j^{(k-1)} + b_i\Bigg), \tag{1}$$

where the wji and bi terms define the layer weights and biases, respectively. Indices i and j run over the neurons in the current and previous layers, respectively. The activation function S(x) is applied element-wise and is taken to be S(x) = tanh(x), with the exception of the output layer, which utilizes the linear activation S(x) = x. Backpropagation36 is used to update the network weights and to obtain the output gradient with respect to the inputs for force calculations within molecular dynamics simulations.
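To make the layer arithmetic and force evaluation concrete, the following is a minimal NumPy sketch of Eq. (1) and the input-gradient backpropagation; the list-of-matrices layout and function names are illustrative choices, not the SSAGES implementation.

```python
import numpy as np

def forward(x, W, b):
    """Forward pass through a fully connected tanh network (Eq. (1)).

    W, b are lists of weight matrices and bias vectors, one per layer;
    the final layer uses a linear activation, as in the text.
    """
    a, activations = x, [x]
    for Wk, bk in zip(W[:-1], b[:-1]):
        a = np.tanh(Wk @ a + bk)           # hidden layers: S(x) = tanh(x)
        activations.append(a)
    return W[-1] @ a + b[-1], activations  # output layer: S(x) = x

def input_gradient(W, activations):
    """Backpropagate d(output)/d(input), used as the bias force in MD."""
    g = W[-1]                              # derivative of the linear output layer
    for Wk, a in zip(reversed(W[:-1]), reversed(activations[1:])):
        g = (g * (1.0 - a ** 2)) @ Wk      # chain rule: tanh'(z) = 1 - tanh(z)^2
    return g
```

For a {5, 2} network acting on a single CV, W would contain matrices of shapes (5, 1), (2, 5), and (1, 2).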

In this work, the network seeks the free energy of a set of collective variables (CVs). We define a CV ξ via a mapping $\xi: \mathbb{R}^{3N} \to \mathbb{R}$ from the 3N positional coordinates of a molecular system onto a continuous subset of the real numbers. This can be any property capturing essential behavior within the system, such as intermolecular distances, energies, or deviation from native protein structures. The set of collective variable coordinates and corresponding approximate free energies $\{(\xi_1, \tilde{F}_1), (\xi_2, \tilde{F}_2), \ldots, (\xi_N, \tilde{F}_N)\}$ represents the training data, and obtaining it is a key component of this method, described in Sec. II C.
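As a hypothetical illustration of such a mapping, consider a pairwise-distance CV; any of the properties named above can similarly be phrased as a function of the 3N coordinates.

```python
import numpy as np

def distance_cv(positions, i, j):
    """Illustrative CV mapping xi: R^(3N) -> R: the distance between
    atoms i and j of an N-atom configuration (positions: (N, 3) array)."""
    return float(np.linalg.norm(positions[i] - positions[j]))
```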

Training a neural network involves finding the optimal set of weights and biases that best represents existing data while maximizing generalizability to make accurate predictions about new observations. In particular, one seeks the least complex ANN architecture that can capture the true functional dependence of the data. However, since it is nearly impossible to know the proper architecture for an arbitrary problem beforehand, several strategies exist to ensure that a sufficiently complex neural network does not overfit training data. One common approach, known as early stopping, involves partitioning data into training and validation sets. Network optimization is performed using the training data alone, terminating when the validation error begins to increase. Another approach, referred to as regularization or weight decay, penalizes the L2 norm of the network weights, which is analogous to reducing the number of neurons in a network.21 In practice, both of these strategies are often used simultaneously.

Although early stopping and regularization are effective approaches for post-processing and offline applications, they are not well suited for on-the-fly sampling. Ideally, one would like to utilize all of the statistics collected over the course of a simulation in constructing the free energy estimate, without needing to subsample the data for validation. More importantly, the optimal choice of network architecture, regularization strength, and other hyperparameters will vary from system to system and typically requires cross-validation to identify. Hyperparameter optimization would not only make a free energy method prohibitively expensive, but would also require significant user input, negating many of the advantages offered over existing enhanced sampling methods. Other choices, such as the training loss function and optimization algorithm, can also affect the quality of fit and speed of convergence. In what follows, we show how an appropriate choice of loss function and optimization algorithm, combined with a Bayesian framework, yields a self-regularizing ANN which uses all observations for training, is robust to hyperparameter choice, maintains generalizability, and is thus ideal for use in an adaptive sampling method.

We begin by defining the loss function with respect to the network weights w for a set of observations $\{(\xi_i, \tilde{F}_i)\}$, the CV coordinates and corresponding free energy estimates,

$$E(\mathbf{w}) = \beta E_D + \alpha E_W = \beta \sum_{i=1}^{N} \big(\hat{F}_i - \tilde{F}_i\big)^2 + \alpha \sum_{i=1}^{K} w_i^2. \tag{2}$$

The index i runs over all N bins in the first term and over all K weights and biases within the network in the second term. The loss function thus decomposes into two terms: a sum squared error ED between the network predictions F̂i and targets F̃i, and a penalty or regularization term EW on the network weights and biases wi. The ratio α/β controls the complexity of the network, where an increase results in a smoother network response.
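As a minimal sketch, Eq. (2) can be evaluated directly; here f_hat, f_tilde, and w are placeholders for the network predictions, the targets, and the flattened parameter vector.

```python
import numpy as np

def regularized_loss(f_hat, f_tilde, w, alpha, beta):
    """Eq. (2): E(w) = beta * E_D + alpha * E_W."""
    E_D = np.sum((f_hat - f_tilde) ** 2)  # sum squared error over the N bins
    E_W = np.sum(w ** 2)                  # penalty on the K weights and biases
    return beta * E_D + alpha * E_W
```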

Following the practical Bayesian framework by MacKay,36,37 we assume that the free energy estimates, which are the target outputs, are generated by

$$\tilde{F}_i = F(\xi_i) + \varepsilon_i, \tag{3}$$

where F(·) is the true (as yet unknown) underlying free energy surface and εi is random, independent, zero-mean sampling noise. The training objective is to approximate F(·) while ignoring the noise. The Bayesian framework starts by assuming that the network weights are random variables; we then seek the set of weights that maximizes the conditional probability of the weights given the data D. It is useful, and perhaps more intuitive, to think about the distribution of weights as a distribution of possible functions represented by a given neural network architecture, A. Invoking Bayes’ rule yields

$$P(\mathbf{w}\,|\,D, \alpha, \beta, A) = \frac{P(D\,|\,\mathbf{w}, \beta, A)\, P(\mathbf{w}\,|\,\alpha, A)}{P(D\,|\,\alpha, \beta, A)}. \tag{4}$$

Although our notation resembles that of the Boltzmann distribution, its use here is purely statistical. Since we assumed independent Gaussian noise on the data, the probability density, which is our likelihood, becomes

$$P(D\,|\,\mathbf{w}, \beta, A) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D), \tag{5}$$

where $\beta = 1/(2\sigma_\varepsilon^2)$, $\sigma_\varepsilon^2$ is the variance of each element of εi, and $Z_D(\beta) = (2\pi\sigma_\varepsilon^2)^{N/2} = (\pi/\beta)^{N/2}$. The prior density P(w|α, A) of the network weights is assumed to be a zero-mean Gaussian,

$$P(\mathbf{w}\,|\,\alpha, A) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W), \tag{6}$$

with $\alpha = 1/(2\sigma_w^2)$ and $Z_W(\alpha) = (\pi/\alpha)^{K/2}$. We can rewrite the posterior density [Eq. (4)] as

$$P(\mathbf{w}\,|\,D, \alpha, \beta, A) = \frac{1}{Z_D(\beta)\, Z_W(\alpha)} \frac{\exp\big({-(\beta E_D + \alpha E_W)}\big)}{P(D\,|\,\alpha, \beta, A)}. \tag{7}$$

The evidence P(D|α, β, A) is not dependent on the weights, and we can combine the normalization constants to obtain

$$P(\mathbf{w}\,|\,D, \alpha, \beta, A) = \frac{1}{Z_F(\alpha, \beta)} \exp\big({-E(\mathbf{w})}\big). \tag{8}$$

It is now clear that maximizing the posterior density with the appropriate priors is equivalent to minimizing our loss. The hyperparameters α and β also take on a statistical meaning: β is inversely proportional to the variance of the sampling noise. If that variance is large, β will be small, so the weight penalty dominates the loss and forces the network weights to be small, resulting in a smoother function which attenuates the noise.

What remains is to estimate α and β from the data,

$$P(\alpha, \beta\,|\,D, A) = \frac{P(D\,|\,\alpha, \beta, A)\, P(\alpha, \beta\,|\,A)}{P(D\,|\,A)}. \tag{9}$$

Assuming a uniform prior density on α, β and using Eqs. (7) and (8), it follows that

$$P(D\,|\,\alpha, \beta, A) = \frac{Z_F(\alpha, \beta)}{Z_D(\beta)\, Z_W(\alpha)}. \tag{10}$$

The only remaining unknown is ZF(α, β), which can be approximated via a Taylor expansion of the loss function about the maximum probability weights wMP,

$$E(\mathbf{w}) \approx E(\mathbf{w}_{MP}) + \tfrac{1}{2} (\mathbf{w} - \mathbf{w}_{MP})^T H (\mathbf{w} - \mathbf{w}_{MP}), \tag{11}$$

where $H = \beta \nabla^2 E_D + \alpha \nabla^2 E_W$ is the Hessian. Substituting the expansion back into Eq. (8) gives

$$P(\mathbf{w}\,|\,D, \alpha, \beta, A) \approx \frac{1}{Z_F(\alpha, \beta)} \exp\Big({-E(\mathbf{w}_{MP}) - \tfrac{1}{2} (\mathbf{w} - \mathbf{w}_{MP})^T H (\mathbf{w} - \mathbf{w}_{MP})}\Big). \tag{12}$$

Equating this expression to the standard form of a Gaussian density yields $Z_F(\alpha, \beta) \approx (2\pi)^{K/2} \det(H_{MP}^{-1})^{1/2} \exp\big({-E(\mathbf{w}_{MP})}\big)$. Finally, setting the derivative of the logarithm of Eq. (10) to zero and solving for the optimal α and β gives

$$\alpha_{MP} = \frac{\gamma}{2 E_W(\mathbf{w}_{MP})}, \qquad \beta_{MP} = \frac{N - \gamma}{2 E_D(\mathbf{w}_{MP})}, \tag{13}$$

with $\gamma = K - 2\alpha_{MP}\,\mathrm{tr}(H^{-1})$, which represents the effective number of parameters in the network used to reduce the error function. Starting with γ = K, the hyperparameters α, β, and γ are updated iteratively within a Levenberg-Marquardt optimization routine,38 where the Hessian is approximated from the error Jacobian J as $H \approx J^T J$. Training proceeds until the error gradient vanishes or the trust-region radius exceeds 10¹⁰.
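As a simplified sketch of this re-estimation step, the following assumes the residual vector and the N × K error Jacobian J are available from the optimizer, and adopts the Gauss-Newton convention $H = 2\beta J^T J + 2\alpha I$ common in MacKay-style Bayesian regularization (factor conventions vary between references); the names are illustrative, not a prescription of any particular implementation.

```python
import numpy as np

def update_hyperparameters(w, residuals, jacobian, alpha, beta):
    """One Bayesian re-estimation of alpha, beta, gamma (Eq. (13)).

    residuals: length-N vector of F_hat - F_tilde at the current weights.
    jacobian:  N x K error Jacobian J at the current weights.
    """
    N, K = jacobian.shape
    E_D = np.sum(residuals ** 2)
    E_W = np.sum(w ** 2)
    # Gauss-Newton approximation to the Hessian of E(w) = beta*E_D + alpha*E_W
    H = 2.0 * beta * (jacobian.T @ jacobian) + 2.0 * alpha * np.eye(K)
    gamma = K - 2.0 * alpha * np.trace(np.linalg.inv(H))  # effective parameters
    alpha = gamma / (2.0 * E_W)
    beta = (N - gamma) / (2.0 * E_D)
    return alpha, beta, gamma
```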

Figure 1 shows a schematic of the ANN sampling method. For a given CV ξ, the free energy along the CV is defined as

$$F(\xi) = -k_B T \ln P(\xi). \tag{14}$$

Here kB is Boltzmann’s constant, T is the temperature, and P(ξ) is the probability distribution of the system along ξ, which the ANN method seeks to learn. Simulations proceed in stages or sweeps, indexed by i, where a bias, ϕi(ξ), is iteratively refined until it matches −F(ξ). Typically, a simulation begins unbiased, though pre-trained networks may be loaded if available. The value of ξ is recorded at each step into a sparse or dense histogram Hi(ξ), depending on the CV dimensionality. Upon completion of a fixed number of steps, Ns, this histogram is reweighted39 according to the bias applied during that sweep,

$$\tilde{H}_i(\xi) = H_i(\xi)\, e^{\phi_i(\xi)/k_B T}, \tag{15}$$

and subsequently added to a global partition function estimate,

$$\tilde{Z}_i(\xi) = \tilde{Z}_{i-1}(\xi) + \tilde{H}_i(\xi). \tag{16}$$

Suitably scaled, this provides an estimator P̃i(ξ) with associated free energy

$$\tilde{F}_i(\xi) = -k_B T \ln \tilde{P}_i(\xi), \tag{17}$$

approximating the true FEL F(ξ). As illustrated, the F̃i(ξ) serve as a training set for the ANN, using the discrete CV coordinates as input data. As a result, there are $N = \prod_{i=1}^{N_{CV}} M_i$ observations, where Mi is the number of bins along each CV i and NCV is the total number of CVs. Henceforth, we will use a compact notation to denote the ANN architecture, with {5, 2} indicating a neural network with two hidden layers containing 5 and 2 neurons, respectively. The output layer contains a single neuron defining a function F̂i(ξ), comprising the best representation of the estimated free energy from the known data.
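The sweep-level bookkeeping of Eqs. (15)-(17) amounts to a few array operations. The sketch below assumes a dense histogram and kJ/mol units; both are illustrative choices.

```python
import numpy as np

kB = 0.0083144621  # Boltzmann constant in kJ/(mol K); an assumed unit choice

def accumulate_sweep(Z, hist, phi, T):
    """Update the global partition estimate and FEL after one sweep.

    Z:    running unnormalized partition function estimate (Eq. (16))
    hist: histogram H_i of CV visits collected during the sweep
    phi:  bias phi_i applied on the same grid during the sweep
    """
    Z = Z + hist * np.exp(phi / (kB * T))        # Eqs. (15) and (16)
    P = Z / Z.sum()                              # normalized probability estimate
    F = -kB * T * np.log(np.maximum(P, 1e-300))  # Eq. (17); floor guards log(0)
    return Z, F - F.min()                        # shift so min(F) = 0
```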

FIG. 1. Schematic of the ANN adaptive biasing method. Hits along the collective variable(s) are recorded onto a histogram during the simulation. The discrete grid coordinates and corresponding reweighted normalized probabilities are used as the input and output training data, respectively. An ANN architecture is chosen prior to runtime and is trained iteratively using a Bayesian regularization procedure. The objective function chosen represents the sum squared error between the free energy estimate F̃ and ANN output F̂. Hyperparameters α, β represent the relative penalties to the data and network weights, respectively, which are automatically adjusted during the optimization routine.

The bias for the initial sweep, ϕ0(ξ), is defined at the outset. Subsequent sweeps utilize the ANN output via the update ϕi+1(ξ) = −F̂i(ξ), where F̂i denotes the optimized output of the neural network. By construction, this biasing function is continuous, differentiable, and free of ringing artifacts. Furthermore, one particular strength of this method is the capacity for a well-trained ANN to interpolate over the learned surface; each successive ϕi(ξ) yields a good estimate of the free energy even in poorly sampled regions, so that fewer samples are required in each histogram bin to obtain a high-fidelity approximation of the FEL, as will be demonstrated below. The entire process is then iterated until the output F̂(ξ) achieves a measure of convergence to the true FEL F(ξ).
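Combining the pieces, the overall iteration reduces to a compact loop. In the schematic below, run_md, train_ann, and net.predict are hypothetical placeholders for the MD engine, the Bayesian-regularized training described above, and network evaluation on the grid; accumulate_sweep is the helper sketched earlier.

```python
import numpy as np

def ann_sampling(net, grid, T, n_sweeps, n_steps):
    """Schematic ANN-sampling driver; helper routines are hypothetical."""
    Z = np.zeros(len(grid))
    phi = np.zeros(len(grid))                 # phi_0: start unbiased
    for _ in range(n_sweeps):
        hist = run_md(phi, n_steps)           # sample CVs under the current bias
        Z, F = accumulate_sweep(Z, hist, phi, T)
        net = train_ann(net, grid, F)         # Bayesian-regularized fit to F_tilde
        phi = -net.predict(grid)              # phi_{i+1} = -F_hat (bias update)
    return net, phi
```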

We first illustrate the robustness of the ANN method by studying a synthetic one-dimensional Monte Carlo (MC) FEL composed of 50 Gaussians. The rugged landscape is meant to represent that of a glassy system or protein folding funnel; this exact surface was utilized as a benchmark for GFS.16 Figure 2(a) shows the evolution of the ANN bias as a function of MC sweep t, where a sweep consists of 10⁵ moves. Utilizing a {40} network, the analytical result is matched impressively within 40 sweeps (4 × 10⁶ total MC moves). This highlights how even a parsimonious ANN is capable of capturing rugged topographies.

FIG. 2. (a) Evolution of the negative bias over time using a {40} network on a benchmark rugged free energy landscape. The resolved bias converges to −F(ξ) smoothly over time with no apparent edge effects or oscillations in regions of steep derivatives. An analysis of the bias error (b) for different network sizes shows very good representation. There is a significant drop in error between 40 and 50 neurons, after which the error signature stabilizes as a consequence of regularization. Shown for comparison is a projection of the true surface onto an order-70 Legendre polynomial, which represents an ideal result for methods employing polynomial basis sets. Values have been shifted along the y-axis for clarity.

Further investigation into the effect of network size on the accuracy of the bias yields interesting results. Figure 2(b) shows the difference between −ϕ(ξ) and F(ξ) for network sizes from {40} to {100}. While even the smallest network does not exceed 5 kBT in error, significant improvement is evident upon increasing the network size to {50}. Beyond this, the error does not decrease significantly upon further refinement, indicating that Bayesian regularization converges to an effective number of parameters, demonstrating good generalization and a lack of overfitting. Also plotted is a projection of the FEL onto an order-70 Legendre polynomial, which represents an idealized result for basis expansion methods.14–16 Minor oscillations are clearly visible across the entire interval, with more pronounced ones in regions of sharp derivatives. Edge effects are also present, with a substantial amount of error introduced at the right boundary. Remarkably, the ANN is capable of yielding a more accurate result using far fewer functions while presenting no appreciable ringing.

Many free energy landscapes for meaningful systems are considerably smoother than this one-dimensional example. However, since adaptive sampling methods are iterative, initial estimates of even smooth surfaces are often noisy and sharply varying. This presents a challenge for basis expansion methods as one must use a large enough basis to drive the system out of metastable states yet one small enough that early sampling noise is effectively smoothed by low frequency truncation. Practically, the FEL can be recovered from a reweighted histogram which affords some flexibility in basis size, but the issue remains that enough terms must be retained to adapt to any features of significant size. The resulting emergence of oscillation artifacts greater than a few kBT can unfortunately introduce significant sampling issues.

As a more realistic example, we study the isomerization of alanine dipeptide (ADP) in water. This system is a good testbed for free energy methods, as it has a rich FEL with significant variation and has been thoroughly studied. It is important to note that the dynamics of ADP in water are significantly slower than those in a vacuum, which is reflected in the convergence times of the free energy surfaces. The two torsional angles ϕ and ψ of the ADP molecule are the CVs, which approximate, but do not precisely capture, the slow modes of the isomerization process.26,40 ADP simulations are carried out in the NPT ensemble with a time step of 2 fs and hydrogen bond constraints using the LINear Constraint Solver (LINCS) algorithm. Long-range electrostatics are calculated with particle mesh Ewald on a grid spaced at 0.16 nm. A stochastic velocity rescaling thermostat is used with a time constant of 0.6 ps to maintain 298.15 K, and a Parrinello-Rahman barostat with a time constant of 10 ps maintains a pressure of 1 bar. The system consists of a single ADP molecule and 878 water molecules described by the Amber99sb and TIP3P forcefields, respectively. A modified version of SSAGES v0.7.5 compiled with Gromacs 2016.3 was used for all simulations.

A periodic grid containing 60 × 60 points is used, with the ANN having topology {10, 6} and a chosen sweep interval of 10 ps. Figure 3 shows snapshots of the estimated FEL at various time intervals. At early times, one can see how the ANN is able to interpolate poorly sampled and entirely unsampled areas within the vicinity of the initial exploration. The ANN bias is quickly refined as data are collected in these regions. Statistical noise is smoothed via the Bayesian regularization, which forces a smooth network response at early times by increasing weight decay. After only 2.5 ns, the bias closely resembles the final FEL, with only minor distortion of features present at regions of low probability. It is clear that even with limited and noisy data the ANN is able to construct an idealized smooth interpolation of the FEL. In fact, the unbiased histogram used to train the ANN remains noisy for a significant amount of time after the network has converged. This is in stark contrast to other methods which rely on the unbiased histogram to recover an accurate FEL in cases where the bias potential is unable to capture the fine details of the surface.

FIG. 3. Snapshots of the estimated free energy landscape as a function of the Ramachandran angles for alanine dipeptide (ADP) in water at 298.15 K, using ANN sampling on a 60 × 60 grid. ANN sampling rapidly resolves the free energy landscape. Much of the speed of ANN sampling derives from its ability to smooth out statistical noise using the Bayesian framework and interpolate well in under-sampled regions at early times. Distortions are rapidly refined as the bias drives sampling to those regions and a stable approximation is achieved at approximately 5 ns, where it is only subject to statistical fluctuations.

We investigate the sensitivity of ANN sampling to hyperparameters by studying six additional combinations of network architectures and sweep intervals. The reference FEL is the final ANN state after 100 ns of simulation from the previous example. The new systems were run on a 30 × 30 histogram, unlike the reference, to examine the behavior of ANN sampling with less available data and subject to larger discretization error. Figure 4 shows the root mean square error (RMSE) of the FELs over time for the various systems. Except for a single setup, all FELs converged to within 1 kBT at about 2.5 ns and to within 1 kJ/mol at 5 ns. The remaining configuration eventually converges at approximately 12 ns. This relatively slow convergence represents the limit of network size and sweep length; an {8, 4} network is just able to represent the FEL but requires more frequent updates and training. For the larger {10, 6} and {12, 8} networks, the sweep length has minimal impact on performance. For the end user, any reasonable choice of sweep length and a network of sufficient size should provide near-optimal performance for new systems, eliminating the need for pilot simulations and significant prior experience.

FIG. 4. Free energy landscape root mean squared error (RMSE) for alanine dipeptide using various ANN architectures and sweep lengths as a function of time. The RMSE is computed relative to a reference FEL obtained from a 100 ns simulation. With the exception of the small {8, 4} network with a large sweep, all other systems quickly converge to within 1 kBT at 2.5 ns and within 1 kJ/mol at 5 ns. This demonstrates that beyond a minimum network size and sweep, the performance of ANN sampling is insensitive to a broad change of user-specified parameters.

We compute the dense three-dimensional free energy landscape of the three longest-wavelength Rouse modes41 for a 21-bead Gaussian chain. This unconventional CV is useful in understanding the conformational dynamics of polymer chains and represents a dense multi-variable FEL which requires extensive visitation to resolve. The chosen system allows us to test the adaptability of ANN sampling to higher dimensions and to compare the estimated FEL to analytical results for each Rouse mode Xi by integrating over the others. The system was modeled using non-interacting beads connected via harmonic springs with bond coefficients of unity in reduced units. A Langevin thermostat was used to maintain a reduced temperature of 2/3, with the equations of motion integrated using the velocity-Verlet algorithm with a time step of 0.005. A modified version of SSAGES v0.7.5 compiled with LAMMPS (30 July 2016) was used for the simulation.

A {12, 10} network with an N = 25³ grid is used to converge the FEL. Figure 5 shows a volume rendering of the final FEL and the three integrated Rouse modes compared to theoretical predictions. The near exact agreement provides an objective measure of the accuracy of ANN sampling. Additionally, the number of hidden layers required to represent the FEL does not necessarily increase with the number of dimensions. Introducing a second hidden layer when going from one to two or more dimensions is necessary to efficiently represent multidimensional surfaces, but a further increase is not strictly required. This is because the level set of a single neuron is a hyperplane, and only upon composition (via the addition of another hidden layer) is multidimensional nonlinearity naturally representable.

FIG. 5. Dense three-dimensional Rouse mode (Xi) free energy landscape for a 21 bead Gaussian chain. The 3D volume free energy (a) can be integrated along two chosen axes to yield the one-dimensional Rouse mode free energies. Comparison with analytical theory for (b) X1, (c) X2, and (d) X3 shows near exact agreement. This example demonstrates that ANN sampling should be useful in sampling correlated CVs in dense high dimensional spaces.

In our final example, we compute the free energy landscape of alanine tripeptide (Ace-Ala-Ala-Nme) in a vacuum using all four backbone dihedrals as collective variables. This example was previously studied by Stecher et al.,32 using the CHARMM22 forcefield and Gaussian process regression (GPR). Here we use Amber99sb, hydrogen bond constraints, and a 2 fs time step at 298.15 K. Electrostatic interactions were handled using particle mesh Ewald, and temperature was maintained using the stochastic velocity rescaling thermostat. As with the previous examples, a modified version of SSAGES v0.7.5 was used, compiled with Gromacs 2016.3.

We use a {24, 20} network with an N = 20⁴ grid and a sweep length of 20 ps. As with the previous example, there is no need to introduce a third hidden layer. Figure 6 shows the evolution of the 4D free energy estimate projected onto two Ramachandran plots. ANN sampling rapidly builds up bias in the prominent free energy basins of the tripeptide, uncovering the essential basins and their structure within 5 ns of simulation time. This allows the CVs to begin diffusing across these basins within a few nanoseconds, as shown in Fig. 7. The remaining time is spent resolving regions of low probability, with relative free energies as high as 90 kJ/mol. Convergence is achieved at approximately 20 ns; a reference FEL at a much later 60 ns shows no appreciable change.

FIG. 6. Evolution of 4D free energy estimate for alanine tripeptide in a vacuum shown as two-dimensional Ramachandran plot projections. The top and bottom rows are the first and second ψ-ϕ pairs, respectively. By 5 ns, the bias has built up in the low-lying metastable configurations, allowing for efficient diffusion of all CVs (cf. Fig. 7). The remaining time is spent resolving regions of extremely low probability which converges at approximately 20 ns.
FIG. 7. Time evolution (first 10 ns) of the four backbone dihedrals of alanine tripeptide which are used as collective variables with ANN sampling. ϕ1 and ϕ2 represent the slow modes of the peptide which are initially confined to metastable configurations. As bias accumulates, each individual coordinate exhibits free diffusion across the metastable states; this occurs at approximately 3 ns.

In Ref. 32, the 4D FEL of alanine tripeptide is reconstructed using GPR from an increasing number of short 500 ps simulation windows. Qualitative agreement with a reference simulation is achieved after a total aggregate simulation time of 1.3 μs. While a direct comparison is not possible, we find the 20 ns required to resolve the 4D FEL with ANN sampling to be remarkable. For high dimensional systems, this corresponds to a substantial improvement in efficiency, where traditional methods can be prohibitively expensive. The results presented here are from a single simulation, although ANN sampling is also trivially parallelizable. Our software implementation already contains support for multiple walkers. Similar to GPR, this would allow multiple independent simulations to contribute to the same FEL, taking advantage of modern parallel architectures.

There are some additional important considerations for high dimensional CVs. As can be seen from our examples, we decrease the number of grid points per dimension as the system dimensionality increases. The simple reason for this is the curse of dimensionality: for a fixed point density, the training data size increases exponentially with dimension. Furthermore, Bayesian regularization requires the calculation and storage of the Jacobian, which scales as O(N²), and requires the inversion of the approximate Hessian. In an effort to minimize training time, we investigate the effect of limiting the maximum number of training iterations on the performance of ANN sampling.

The results presented above were obtained using a maximum of 10 training iterations per sweep. While this may seem unreasonably small, it performs very well, as shown. The reason is that in ANN sampling, the network weights are carried over between training cycles. This can be seen as a form of pre-training, where the network requires only a minimal amount of fine-tuning to adapt to new data. We also carried out the simulations using maximum iteration counts of 20, 50, and 100 and found little difference in performance, while the lower limit incurred much less training overhead. These scaling limitations are a result of targeting a uniform probability distribution in the collective variables. Targeting a well-tempered distribution or implementing a maximum fill-level on the free energy can be complemented with a sparse grid, since only regions of CV space accessible within a desired cutoff will be sampled. This sort of approach is often desirable and will scale better to even higher dimensions. We see this as an area of future research.

We have presented a novel advanced sampling algorithm utilizing artificial neural networks to learn free energy landscapes. The ability of ANNs to act as universal approximators enables the method to resolve topologically complex FELs without presenting numerical artifacts typical of other methods which rely on basis sets. Bayesian regularization allows the entire set of training data to be used, prevents overfitting, and eliminates the need for any learning parameters to be specified. This and other robust features result in a method which is exceptionally swift and accurate at obtaining FELs.

As ANN sampling superficially resembles the recently proposed NN2B method,33 some remarks on their similarities and differences should be made. Though NN2B also uses neural networks to estimate free energies, it differs significantly from our method beyond that. NN2B requires the specification of a large number of parameters, demanding significant knowledge of CV correlation times and sampling density, and it is not clear how these parameters influence convergence. Our use of a discretized global partition function estimate retains information on all CV states sampled over the course of a simulation, while NN2B requires a judicious choice of the degree of subsampling from prior sweeps. Its use of standard neural networks and a density estimation procedure necessitates significantly larger architectures and sampling intervals to represent a given free energy landscape. It also means that some of the valuable data must be used for validation to avoid overfitting.

The consequences of this are reflected in their study of alanine dipeptide in a vacuum. Training times for each sweep are reported to range from 10 to 30 min, depending on the size of the network. Furthermore, NN2B requires 50 ns to converge the two-dimensional FEL for ADP in a vacuum. This is considerably longer than the typical 2 ns using standard metadynamics. In comparison, our method requires seconds to train the 2D FEL and a minute for higher dimensions in the early stages; as the approximate FEL becomes smoother, the training time decreases considerably due to regularization. Both the lengthy training times and slow convergence render NN2B impractical for most applications. We do, however, see merit in using nearest-neighbor density estimators in place of sparse histograms and anticipate that this approach can be integrated within our proposed method.

The considerable advantages we have demonstrated in biasing with ANNs position this as an ideal method for the study of computationally expensive systems where sampling is limited. In particular, both first-principles MD and large biomolecule simulations are prime candidates for ANN sampling, where all but the most trivial of examples become intractable due to the typical time scales required to obtain free energy estimates. The significant speed improvements in sampling afforded by this method also open the door to the use of accelerated free-energy calculations for high-throughput computational screening of material properties (see, e.g., Ref. 42). The ANN framework also makes it possible to initialize simulations with pre-trained networks, a technique commonly used in deep learning applications. Starting with a FEL from a classical simulation can considerably improve the convergence of a first-principles estimate, while a theoretical or coarse-grained model can inform the initial bias in a biomolecular simulation. ANN sampling is a promising new approach to advanced sampling which can help resolve complex free energy landscapes in less time than conventional approaches while overcoming many previous shortcomings.

H.S. acknowledges support from the National Science Foundation Graduate Fellowship Program (NSF-GRFP). This project was supported by MICCoM, as part of the Computational Materials Sciences Program funded by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division. H.S. and J.K.W. acknowledge computational resources at the Notre Dame Center for Research Computing (CRC). H.S. and J.K.W. acknowledge Michael Webb (U. Chicago) for providing the analytical Rouse mode solutions.

All of the run files, data, and analysis scripts required to reproduce the results in this paper are freely accessible online at https://github.com/hsidky/ann_sampling. ANN sampling methods are provided for download as part of the advanced sampling package SSAGES (Software Suite for Advanced Generalized-Ensemble Simulations) at https://github.com/MICCoM/SSAGES-public as of version 0.8.

1. A. Barducci, M. Bonomi, M. K. Prakash, and M. Parrinello, “Free-energy landscape of protein oligomerization from atomistic simulations,” Proc. Natl. Acad. Sci. U. S. A. 110(49), E4708–E4713 (2013).
2. A. A. Joshi, J. K. Whitmer, O. Guzmán, N. L. Abbott, and J. J. de Pablo, “Measuring liquid crystal elastic constants with free energy perturbations,” Soft Matter 10(6), 882–893 (2014).
3. O. Valsson, P. Tiwary, and M. Parrinello, “Enhancing important fluctuations: Rare events and metadynamics from a conceptual viewpoint,” Annu. Rev. Phys. Chem. 67(1), 159–184 (2016).
4. A. Laio and M. Parrinello, “Escaping free-energy minima,” Proc. Natl. Acad. Sci. U. S. A. 99(20), 12562–12566 (2002).
5. A. Barducci, G. Bussi, and M. Parrinello, “Well-tempered metadynamics: A smoothly converging and tunable free-energy method,” Phys. Rev. Lett. 100(2), 020603 (2008).
6. S. Singh, C.-c. Chiu, and J. J. de Pablo, “Flux tempered metadynamics,” J. Stat. Phys. 145(4), 932–945 (2011).
7. J. F. Dama, G. Rotskoff, M. Parrinello, and G. A. Voth, “Transition-tempered metadynamics: Robust, convergent metadynamics via on-the-fly transition barrier estimation,” J. Chem. Theory Comput. 10(9), 3626–3633 (2014).
8. D. Branduardi, G. Bussi, and M. Parrinello, “Metadynamics with adaptive Gaussians,” J. Chem. Theory Comput. 8(7), 2247–2254 (2012).
9. M. McGovern and J. de Pablo, “A boundary correction algorithm for metadynamics in multiple dimensions,” J. Chem. Phys. 139(8), 084102 (2013).
10. M. Mezei, “Adaptive umbrella sampling: Self-consistent determination of the non-Boltzmann bias,” J. Comput. Phys. 68(1), 237–248 (1987).
11. R. W. W. Hooft, B. P. van Eijck, and J. Kroon, “An adaptive umbrella sampling procedure in conformational analysis using molecular dynamics and its application to glycol,” J. Chem. Phys. 97(9), 6690–6694 (1992).
12. C. Bartels and M. Karplus, “Multidimensional adaptive umbrella sampling: Applications to main chain and side chain peptide conformations,” J. Comput. Chem. 18(12), 1450–1462 (1997).
13. C. Bartels and M. Karplus, “Probability distributions for complex systems: Adaptive umbrella sampling of the potential energy,” J. Phys. Chem. B 102(5), 865–880 (1998).
14. O. Valsson and M. Parrinello, “Variational approach to enhanced sampling and free energy calculations,” Phys. Rev. Lett. 113(9), 090601 (2014).
15. J. K. Whitmer, C.-c. Chiu, A. A. Joshi, and J. J. de Pablo, “Basis function sampling: A new paradigm for material property computation,” Phys. Rev. Lett. 113(19), 190602 (2014).
16. J. K. Whitmer, A. M. Fluitt, L. Antony, J. Qin, M. McGovern, and J. J. de Pablo, “Sculpting bespoke mountains: Determining free energies with basis expansions,” J. Chem. Phys. 143(4), 044101 (2015).
17. D. Kincaid and W. Cheney, Numerical Analysis: Mathematics of Scientific Computing (American Mathematical Society, 2009).
18. E. Darve and A. Pohorille, “Calculating free energies using average force,” J. Chem. Phys. 115(20), 9169–9183 (2001).
19. J. Comer, J. C. Gumbart, J. Hénin, T. Lelievre, A. Pohorille, and C. Chipot, “The adaptive biasing force method: Everything you always wanted to know but were afraid to ask,” J. Phys. Chem. B 119(3), 1129–1151 (2015).
20. H. Fu, X. Shao, C. Chipot, and W. Cai, “Extended adaptive biasing force algorithm. An on-the-fly implementation for accurate free-energy calculations,” J. Chem. Theory Comput. 12(8), 3506–3513 (2016).
21. H. B. Demuth, M. H. Beale, O. De Jess, and M. T. Hagan, Neural Network Design, 2nd ed. (Martin Hagan, USA, 2014).
22. M. Gletsos, S. G. Mougiakakou, G. K. Matsopoulos, K. S. Nikita, A. S. Nikita, and D. Kelekis, “A computer-aided diagnostic system to characterize CT focal liver lesions: Design and optimization of a neural network classifier,” IEEE Trans. Inf. Technol. Biomed. 7(3), 153–162 (2003).
23. H.-T. Pao, “Forecasting electricity market pricing using artificial neural networks,” Energy Convers. Manage. 48(3), 907–912 (2007).
24. J. Behler and M. Parrinello, “Generalized neural-network representation of high-dimensional potential-energy surfaces,” Phys. Rev. Lett. 98(14), 146401 (2007).
25. R. Z. Khaliullin, H. Eshet, T. D. Kühne, J. Behler, and M. Parrinello, “Graphite-diamond phase coexistence study employing a neural-network mapping of the ab initio potential energy surface,” Phys. Rev. B 81, 100103 (2010).
26. A. Ma and A. R. Dinner, “Automatic method for identifying reaction coordinates in complex systems,” J. Phys. Chem. B 109(14), 6769–6779 (2005).
27. J. Lätzer, M. P. Eastwood, and P. G. Wolynes, “Simulation studies of the fidelity of biomolecular structure ensemble recreation,” J. Chem. Phys. 125(21), 214905 (2006).
28. B. Zhang and P. G. Wolynes, “Topology, structures, and energy landscapes of human chromosomes,” Proc. Natl. Acad. Sci. U. S. A. 112(19), 6062–6067 (2015).
29. S.-S. So and M. Karplus, “Evolutionary optimization in quantitative structure–activity relationship: An application of genetic neural networks,” J. Med. Chem. 39(7), 1521–1530 (1996).
30. Z. Li, J. R. Kermode, and A. De Vita, “Molecular dynamics with on-the-fly machine learning of quantum-mechanical forces,” Phys. Rev. Lett. 114, 096405 (2015).
31. S. Kumar, J. M. Rosenberg, D. Bouzida, R. H. Swendsen, and P. A. Kollman, “The weighted histogram analysis method for free-energy calculations on biomolecules. I. The method,” J. Comput. Chem. 13(8), 1011–1021 (1992).
32. T. Stecher, N. Bernstein, and G. Csányi, “Free energy surface reconstruction from umbrella samples using Gaussian process regression,” J. Chem. Theory Comput. 10(9), 4079–4097 (2014).
33. R. Galvelis and Y. Sugita, “Neural network and nearest neighbor algorithms for enhancing sampling of molecular dynamics,” J. Chem. Theory Comput. 13(6), 2489–2500 (2017).
34. E. Schneider, L. Dai, R. Q. Topper, C. Drechsel-Grau, and M. E. Tuckerman, “Stochastic neural network approach for learning high-dimensional free energy surfaces,” Phys. Rev. Lett. 119(15), 150601 (2017).
35. K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks 2(5), 359–366 (1989).
36. D. J. C. MacKay, “A practical Bayesian framework for backpropagation networks,” Neural Comput. 4(3), 448–472 (1992).
37. D. J. C. MacKay, “Bayesian interpolation,” Neural Comput. 4(3), 415–447 (1992).
38. J. Nocedal and S. J. Wright, Numerical Optimization (Springer, New York, NY, 2006).
39. G. M. Torrie and J. P. Valleau, “Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling,” J. Comput. Phys. 23(2), 187–199 (1977).
40. J. McCarty and M. Parrinello, “A variational conformational dynamics approach to the selection of collective variables in metadynamics,” J. Chem. Phys. 147(20), 204109 (2017).
41. M. Rubinstein and R. H. Colby, Polymer Physics (Oxford University Press, New York, 2003).
42. C. E. Wilmer, M. Leaf, C. Y. Lee, O. K. Farha, B. G. Hauser, J. T. Hupp, and R. Q. Snurr, “Large-scale screening of hypothetical metal–organic frameworks,” Nat. Chem. 4(2), 83–89 (2012).