A machine learning assisted method is presented for molecular simulation of systems with rugged free energy landscapes. The method is general and can be combined with other advanced sampling techniques. In the particular implementation proposed here, it is illustrated in the context of an adaptive biasing force approach where, rather than relying on discrete force estimates, one can resort to a self-regularizing artificial neural network to generate continuous, estimated generalized forces. By doing so, the proposed approach addresses several shortcomings common to adaptive biasing force and other algorithms. Specifically, the neural network enables (1) smooth estimates of generalized forces in sparsely sampled regions, (2) force estimates in previously unexplored regions, and (3) continuous force estimates with which to bias the simulation, as opposed to biases generated at specific points of a discrete grid. The usefulness of the method is illustrated with three different examples, chosen to highlight the wide range of applicability of the underlying concepts. In all three cases, the new method is found to enhance considerably the underlying traditional adaptive biasing force approach. The method is also found to provide improvements over previous implementations of neural network assisted algorithms.

## I. INTRODUCTION

Complex fluids, materials, or macromolecules are generally characterized by rugged free energy landscapes. Simulating such systems therefore requires that enhanced sampling methods be used to overcome the barriers that separate local free energy minima and explore configuration space efficiently. Enhanced sampling methods can also provide a direct measure of the free energy landscape, thereby leading to a deeper understanding of systems of interest. A good example of an algorithm that is commonly used to improve sampling is provided by umbrella sampling,^{1} where it is possible to force a system to visit a distinct state by applying a harmonic restraint. Multiple restrained simulations may then be stitched together (e.g., with the weighted histogram analysis method or WHAM^{2}) to generate an estimate of the underlying free energy landscape. In systems characterized by a broad or high-dimensional collective variable space, however, a naïve implementation of umbrella sampling can quickly become intractable. More recent methods, such as metadynamics^{3} and its variants,^{4–7} represent the free energy landscape as a sum of Gaussians, which are generated on the fly as a simulation visits different regions of collective variable space. In such approaches, it is important that judicious choices be made for the Gaussian heights, widths, and deposition rates, lest the convergence of a calculation be hampered.^{7} Other techniques that circumvent the use of Gaussians include Basis Function Sampling (BFS),^{8} Green’s Function Sampling (GFS),^{9} and Variationally Enhanced Sampling (VES),^{10} all of which express free energy landscapes in terms of a set of orthogonal basis functions. Unfortunately, such methods are susceptible to errors near sharp features and near boundaries of the free energy landscape.^{11} In all of the algorithms mentioned above, an estimate of the free energy is constructed in terms of the frequency with which distinct states are visited. Importantly, Sidky and Whitmer have shown that by relying on neural networks to describe that free energy surface (Artificial Neural Network or ANN sampling) one can improve considerably the performance of histogram-guided simulations.^{12}

Adaptive Biasing Force (ABF) methods adopt a fundamentally different approach in that they seek to estimate the free energy landscape from its derivatives, which are given in the form of generalized forces.^{13} The local nature of these mean forces can considerably improve the performance of ABF relative to that of algorithms that rely on histograms of visits to different states. Various improvements of ABF have been proposed, including an extended ABF method^{14,15} and a metadynamics-ABF hybrid technique that is augmented with Gaussian process regression (GPR).^{16} The use of GPR for free energy reconstruction in Ref. 16 is representative of a trend to rely on machine learning techniques to enhance sampling in molecular simulations; other recent examples include the utilization of artificial neural networks (ANNs) in the NN2B method^{17} and the representation of high-dimensional free energy surfaces.^{18} In this work, we introduce the concept of relying on a neural network to perform force-biasing. The general idea is that biasing forces in some multi-dimensional collective variable space can be represented by such a network. As such, we refer to this method as “Force-biasing Using Neural Networks” (FUNN) enhanced sampling. While the method is general and could in principle be used to enhance a variety of advanced sampling algorithms, for concreteness, in this work we demonstrate its implementation by using it in conjunction with the traditional adaptive biasing force technique that, in and of itself, we have found to be highly effective for exploring rough free energy landscapes. The combination of FUNN and ABF, or FUNN-ABF, is found to improve the performance of ABF by one or two orders of magnitude in computer time, depending on the example. By applying the neural network on the generalized forces, as opposed to the potential of mean force, it is found that one can improve upon the performance of ANN sampling.^{18}

## II. METHOD DESCRIPTION

The details of the ABF method have been described in the literature;^{15,19,20} here we merely provide a brief summary on which to base our subsequent discussion. One starts with an expression for the mean force given in Ref. 19,

where *A* is the free energy, *ξ* is a collective variable, and **p** is the vector of atomic momenta. **w** is an arbitrary vector field that satisfies

Here ∇*ξ* is the gradient of the collective variable *ξ* with respect to the 3N atomic coordinates. We choose $wi=\u2202xi\u2202\xi $. This framework can be extended to multiple collective variables, and Eq. (2) recast as **J**_{ξ}**W** = **I**, where entry $[J\xi ]ij=\u2202\xi i\u2202xj$ and where *J*_{ξ} has 3*N* columns corresponding to the atomic coordinates and *N*_{ξ} rows corresponding to the number of collective variables. In order to calculate **W** while satisfying **J**_{ξ}**W** = **I**, we choose **W** to be the right pseudoinverse of **J**_{ξ} or

Typically, **W** is selected in such a way as to avoid performing tedious second derivatives of collective variables with respect to Cartesian coordinates, which arise in the original ABF formulation. While second derivatives of simple collective variables may be straightforward, second order derivatives of complex collective variables can in some cases become difficult to evaluate.^{20} Note that this choice of **W** differs from that prescribed in Ref. 19, which is equivalent to the expression in (3) weighted by mass. Equation (3) remains valid in cases with ghost atoms or virtual sites, which, for example, arise frequently in water models. By contrast, the mass-weighted pseudoinverse variant of **W** is not directly applicable to such models. In our experience, our choice of **W** works well in a variety of contexts and has been implemented in the SSAGES enhanced sampling package (along with the mass-weighted alternative).^{21}

In conventional ABF, the mean force $dAd\xi $ is estimated as the running average of $ddt(W\u22a4p)$ on a grid **F**(*k*), where the time derivative is calculated using a finite difference scheme; *k* denotes the closest bin in grid **F** corresponding to collective variable *ξ*. Any previously applied external bias is subtracted before updating the mean force estimates on the grid. The next applied external bias is calculated from the estimated mean force as $\u2212F(k)\u2207\xi n(k)$, where *n*(*k*) denotes the number of times that bin *k* has been visited.

As mentioned above, ABF cannot bias the system on the basis of states or configurations that have not been observed yet. Furthermore, a system can easily diffuse away from a newly encountered free energy barrier before successfully climbing that barrier. Also note that ABF relies on a discrete grid to store local estimates of the mean force. Such a grid is used for both biasing the system and estimating the underlying free energy landscape. A choice must therefore be made between precision and speed: a low resolution ABF grid may converge more rapidly, but the mean force in a particular bin is then smeared across a broader region of collective variable space, leading to errors. Estimates of the mean force are necessarily poor in the early stages of ABF due to the low number of samples in any particular bin; in some cases, fluctuations of the estimated mean forces may lead to instabilities and incorrect biasing. In practice, force estimates are artificially suppressed when a particular bin has been visited fewer than *N*_{0} times; for example, Darve *et al.* recommended that the mean force estimate in the *k*th bin be multiplied by $min(1,n(k)N0)$, where *N*_{0} is typically chosen to be 100-200 samples,^{19} which we adopt in our conventional ABF implementation.^{21}

To circumvent these issues, we propose to generate estimates of the mean force that are stored on a grid, as in ABF, but the grid is instead used to train an artificial neural network that produces a continuous estimate of the mean force across collective variable space. An artificial neural network consists of a collection of connected nodes with associated weights, which may be arranged in layers. Neurons in each layer receive inputs from connected neurons in the previous layer and send outputs to connected neurons in the following layer. Weights are updated by training the neural network; a neural network with at least one hidden layer can be trained to approximate any continuous function,^{22} which in this case is the estimated mean force. This estimate extends to regions that may not have been explored yet, thereby introducing a “scouting” biasing force that can accelerate sampling. Following in the tradition of mountain metaphors, a traveler in search of a mountain to climb does not turn around upon reaching the piedmont to diffusively wander in the flat plain that preceded the hills. She notices the slope changing beneath her feet and anticipates that some mountains might lie somewhere ahead, even if she does not yet see them with her own eyes. In this approach, biasing forces are calculated from the continuous estimate of the mean force obtained via the neural network, allowing for the biasing force to be adaptive beyond the resolution of the underlying grid. The use of a regularized neural network prevents the instabilities associated with noisy mean force estimates in sparsely sampled bins that are encountered during the early stages of ABF, without artificially tempering force estimates over the first *N*_{0} samples. Additionally, by using a neural network, we avoid edge-effects and inaccuracies that arise from sampling on sharp features in the free energy landscape, which are encountered when fitting basis functions or splines to data as in BFS or using Gaussians in metadynamics.^{12} Furthermore, we make no assumptions about the underlying functional form of the mean force across CV space. Recently, Sidky and Whitmer^{12} introduced the idea of using Bayesian regularized neural networks to adaptively bias molecular simulations and learn free energy landscapes. Referred to as articial neural network (ANN) sampling, it constructs a biasing potential from a histogram of visited states to distinct values of the collective variables. The biasing potential is thus based on the frequency of visits to such states. FUNN uses the same Bayesian framework, but instead of relying on gradually updated estimates of the free energy, it relies on on-the-fly estimates of the generalized force (i.e., derivatives of the free energy) to bias the simulation.

In practice, an initial sweep must be performed before training the neural network, where $ddt(W\u22a4p)$ in the *k*th bin is summed into **F** while tracking the number of visits to the bin *n*(*k*). While this may be performed without an external bias, it is advantageous to carry out the initial sweep with conventional ABF to avoid training the neural network with mean force estimates from a narrow strip of collective variable space; we liken the use of ABF in the initial sweep to give the estimated mean forces a “push” before the bias supplied through the neural network takes over. The mean force estimate in the *k*th bin is simply $F(k)n(k)$. The entire grid of estimated mean forces and their *k* locations in collective variable space serve as our training set. Although there are countless possible neural network topologies, we choose to use a fully connected network.

The activation $\varphi im$ of any particular layer *m* is given by

where $wji$ and *b* are the layer weight and bias, respectively, and *i* and *j* are indices over the number of neurons in the current and previous layers. For the activation function *f*, we choose the common sigmoidal activation function for all layers except the output layer, for which we use a linear activation. Backpropagation is used to train the neural network, the output of which is an estimate of the mean force which we denote **Q**. Regularization improves generalizability and prevents overfitting of the neural network; if we denote the *i*th mean force estimate from the grid as *P*_{i} and the *i*th mean force estimate from the neural network output as *Q*_{i}, our loss function is

where the first term is the squared error between the network predictions of the mean force and the target values stored in the histogram summed over all data points, and the second term penalizes network weights. Specifically, we use Bayesian regularization, which was used in the work by Sidky and Whitmer^{12} and is crucial for the viability of this sampling method. Importantly, it is advantageous in requiring minimal input parameters from the end user and eliminating the need for using a subset of the sampling data as a validation set. The latter point is particularly helpful in difficult regions of collective variable space where utilizing all collected estimated mean forces is preferable, rather than using a subset for validation. In this framework, we assume Gaussian priors on the network weights and Gaussian errors on the estimates of the generalized force. To minimize loss, one must find weights that maximize the conditional probability of the weights, given the observed generalized force data; a generalizable model is obtained through this self-regularization of network weights. The loss function is minimized by solving for optimal values of *α* and *β*, which may be expressed in terms of a value *γ*, representing the effective number of well-determined network parameters.^{23,24} A Levenberg-Marquardt optimization routine is used to iteratively update *α*, *β*, and *γ*. This proceeds until either *γ* stabilizes, the Marquardt trust region radius exceeds 10^{10}, or a certain number of maximum training iterations have been reached.^{25–27} We refer readers to Refs. 12, 23, and 24 for further details on Bayesian regularization. While the exact training time depends on network topology and grid size, we find that it does not contribute significantly to the total runtime of most research systems (for example, each round of training for a 60 × 60 grid on a two-layer network topology of 16-12, such as the one used in the solvated alanine dipeptide example below, takes under a minute).

For each subsequent sweep, the system is biased using −**Q**(*ξ**)∇*ξ**, where *ξ** is the instantaneous value of the collective variable, again collecting entries in **F**(*k*) and *n*(*k*). Herein lies the first advantage of FUNN: the trained network returns an estimate of the mean force at any point in collective variable space **Q**(*ξ**), regardless of whether *ξ** has already been visited by the simulation or whether *ξ** lies somewhere between grid points. Here **Q** supplies a hypothetical mean force estimate, even for regions that have yet to be explored, and the applied bias is calculated from the estimated mean force at *ξ**, not the nearest *k*th grid point. Even if **Q** supplies a poor mean force estimate in a previously unexplored region, this estimated mean force performs the critical job of driving preliminary biasing in this region. Future estimates in this region are then corrected, as further samples are accrued in the histogram corresponding to this region and the neural network is retrained.

The applied external biasing forces are easily accounted for before logging the current estimate of the true generalized force by adding the current mean force from the neural network to the instantaneous force,

At the end of each sweep, **Q** is updated by training the network on the current generalized mean force estimates *from the grid*, using $F(k)n(k)$. This cycle of collecting mean force estimates on the grid and re-training the neural network to generate mean force estimates across collective variable space is iterated until convergence of the mean force landscape, which can then be integrated to recover the underlying free energy landscape.

## III. EXAMPLES

To illustrate the application and performance of FUNN sampling, three examples of increasing complexity are considered: (1) a particle on a 2D surface consisting of a sum of 50 randomly deposited Gaussian functions, emulating a rough free energy landscape; (2) the isomerization of a simple alanine dipeptide in explicit water; and (3) the sampling of a polymer chain along its first three Rouse modes.

### A. Langevin particle in 2D potential

The first example consists of a single Langevin particle in a 2-dimensional square box with a side length of 4 LJ units. All simulations were performed using SSAGES^{21} coupled to LAMMPS (11 August 2017).^{28} A potential field of 50 randomly generated Gaussians was used to generate the surface features, shown in panel (c) of Fig. 1.

Figure 1 shows the free energy landscape from both ABF and FUNN simulations; one can appreciate that the convergence of FUNN is several times faster than that of ABF. Free energies obtained from FUNN are shown in the left column [(a) and (b)] and compared to the results from ABF at the equivalent simulation time in the right column [(d) and (e)]. Free energy landscapes are calculated at 500 time-units (top row) and 1000 time-units (middle row). After only 500 time-units, FUNN has visited almost all of phase space, whereas ABF has not left its starting basin. After 1000 time-units, FUNN has essentially converged, while ABF has barely left its starting minimum, leaving most of the phase space unexplored. The relative error of the simulated free energy landscape predicted by FUNN decays rapidly in the initial states of the simulation, whereas for traditional ABF it decays more slowly.

### B. Alanine dipeptide in water

The second example, an alanine dipeptide molecule in explicit water, was simulated using SSAGES, but in this case it was coupled to GROMACS 5.1.3,^{29} using the AMBER99SB force field.^{30} The box size was 3 nm × 3 nm × 3 nm with 880 TIP3P water molecules, and a time step of 2 fs was used. Temperature and pressure were controlled using the GROMACS’ stochastic velocity rescaling thermostat^{31} at 298.15 K and the Parrinello-Rahman barostat^{32} at 1 bar.

We compare ABF and FUNN in two collective variables, using the backbone dihedral angles *ϕ* and *ψ*. The results are shown in Fig. 2, along with a schematic of the system and the relevant dihedral angles. As before, the exploration of collective variable space is faster using FUNN than using a traditional ABF method. As little as one ns into the simulation, FUNN has virtually visited all available states, and a rough but reasonable estimate of the free energy across collective variable space is already available. By contrast, most high-energy regions remain unexplored in ABF. After 5 ns, FUNN provides a converged estimate of the free energy surface, whereas ABF has not converged [as evidenced by the incorrect height of the (*ϕ*, *ψ*) = (0, 0) peak and the features around (*ϕ*, *ψ*) = (1, −1)]. An additional 115 ns of simulation time is necessary for ABF to reach the reference state in Fig. 2(i). The relative error decays rapidly within the first 2 ns of simulation time for FUNN, whereas more than 100 ns is necessary for conventional ABF to reach comparable accuracy. The RMSE of FUNN levels off after 4 ns; we attribute this to the fact that we have chosen our reference state to be ABF at 120 ns, which may not have been fully converged. By choosing to use this reference state, we avoid artificially favoring FUNN over ABF. Note that there is of course the possibility that FUNN may have fully converged after just 4 ns.

### C. Rouse modes of coarse-grained polymer chain

As a final example, we sampled the first three Rouse modes of a coarse-grained ideal Gaussian polymer chain.^{33} The polymer is represented as 21 particles of unit mass 1, connected through harmonic springs with a spring constant of 2 and an equilibrium distance of 0 [see Fig. 3(d) and Ref. 21]. Langevin dynamics were used in this case, with a temperature of 0.66, a damping constant of 0.5, and a time step of 0.005. The results from sampling along these three collective variables were then compared to the analytic result in Fig. 3.

One sees that there is excellent agreement between the exact result for the free energy and that obtained from FUNN across all three Rouse modes, serving to demonstrate that FUNN recovers the pertinent free energy profiles efficiently and accurately.

### D. Comparison with ANN sampling

As mentioned earlier, ANN produces on-the-fly estimates of the free energy to bias a simulation, whereas FUNN produces on-the-fly estimates of derivatives of the free energy, generalized forces, to bias the simulation. In both of the following examples (Fig. 4), utilizing a 60 × 60 grid and a 16–12 neural network architecture, FUNN sampling and ANN sampling show a remarkable speedup compared to their non-neural-network counterparts.^{12} FUNN, however, exhibits a much faster convergence; for ADP, for example, FUNN has converged after approximately 2.5 ns, whereas ANN requires approximately ten times longer. Note that in ANN, the scale of the imposed bias depends on the number of counts accumulated in each bin. Therefore, using a finer grid leads to application of a more modest bias at each sweep than that applied by FUNN. Also note that a 30 × 30 grid was used for the same ADP example in Ref. 12.

Because FUNN does not depend on the grid resolution to set the bias scale, it offers greater flexibility in achieving optimal results. With these considerations in mind, ANN sampling might offer better scaling in higher dimensions since it estimates a scalar biasing potential (as opposed to a vectorial force bias) regardless of dimension. FUNN needs to estimate an *N*-dimensional generalized force, which would become more expensive if one assumes that memory requirements grow as $O$(*N*^{2}) and Hessian inversion, which is required in the Bayesian framework, grows as $O$(*N*^{3}). These considerations should help provide guidance to users in choosing either adaptive biasing approach, ANN or FUNN, depending on the problem under consideration.

## IV. CONCLUSIONS

In summary, we have introduced a new method for performing molecular simulations that relies on learning the generalized mean forces acting on a system with an artificial neural network (FUNN). In the implementation proposed here, FUNN-ABF, we build upon the framework of conventional ABF while addressing some of its shortcomings. Specifically, the use of a neural network enables mean force approximations and adaptive biasing even in previously un-sampled regions of collective variable space. Furthermore, the choice of Bayesian regularization allows for minimal input from the user, eliminates the need to select a subset of the generated data as a validation set, and prevents poor mean force estimates in sparsely sampled histogram bins. A continuous estimate for the mean force is obtained from the neural network output, allowing for force estimates and adaptive biasing with a finer resolution than that used for the underlying grid. These improvements are reflected in FUNN’s superior convergence rate, as demonstrated in three examples that are representative of glassy materials, molecular conformational transitions in aqueous solvents, and polymeric materials, sampling along two or three dimensions. Convergence in higher dimensions is significantly aided by the sparse training option, where the artificial neural network is trained while ignoring empty histogram bins, thereby alleviating issues associated with histogram-based approaches in high dimensions. Finally, the network parameters can easily be used to query the network at arbitrary resolution after sampling is complete, allowing one to integrate the generalized forces using a denser grid with ease.

## ACKNOWLEDGMENTS

A.Z.G. and E.S. contributed equally to this work. H.S. acknowledges support from the National Science Foundation Graduate Fellowship Program (NSF-GRFP). This work was completed on computational resources provided by the University of Chicago Research Computing Center. This work is supported by the Department of Energy, Basic Energy Sciences, Division of Materials Science and Engineering; the development of advanced sampling algorithms and the SSAGES software are supported by MICCoM (Midwest Center for Computational Materials).

### APPENDIX: SOFTWARE

FUNN has been implemented within the SSAGES framework,^{21} and it will be distributed in an upcoming SSAGES public release. FUNN will be publicly available for reference along with the above examples at https://github.com/MICCoM/SSAGES-public.