Gradient-domain machine learning (GDML) is an accurate and efficient approach to learn a molecular potential and associated force field based on the kernel ridge regression algorithm. Here, we demonstrate its application to learn an effective coarse-grained (CG) model from all-atom simulation data in a sample efficient manner. The CG force field is learned by following the thermodynamic consistency principle, here by minimizing the error between the predicted CG force and the all-atom mean force in the CG coordinates. Solving this problem by GDML directly is impossible because coarse-graining requires averaging over many training data points, resulting in impractical memory requirements for storing the kernel matrices. In this work, we propose a data-efficient and memory-saving alternative. Using ensemble learning and stratified sampling, we propose a 2-layer training scheme that enables GDML to learn an effective CG model. We illustrate our method on a simple biomolecular system, alanine dipeptide, by reconstructing the free energy landscape of a CG variant of this molecule. Our novel GDML training scheme yields a smaller free energy error than neural networks when the training set is small, and a comparably high accuracy when the training set is sufficiently large.
I. INTRODUCTION
Molecular dynamics (MD) simulations have become an important tool to characterize the microscopic behavior of chemical systems. Recent advances in hardware and software allow significant extensions of the simulation timescales to study biologically relevant processes.1–3 For example, we can now characterize the configurational changes and folding and binding behavior of small to intermediate-sized proteins through MD on the timescale of milliseconds to seconds.4–9 However, the computational complexity of evaluating the potential energy prevents this approach from scaling up to significantly larger systems and/or longer timescales. Therefore, multiple ways have been proposed to speed up atomistic simulations, such as advanced sampling methods (e.g., umbrella sampling10–12 and parallel tempering13,14) or adaptive sampling.15–17 An alternative approach is to reduce the dimensionality of the system by coarse-graining (CG).18–23 The fact that macromolecules usually exhibit robust collective behavior suggests that not every single degree of freedom is per se essential in determining the important macromolecular processes over long timescales. Furthermore, a CG representation of the system simplifies the model and allows for a more straightforward physico-chemical interpretation of large-scale conformational changes, such as protein folding or protein–protein binding.18
Once the mapping from the atomistic to the CG representation is defined, a fundamental challenge is the definition of an effective potential in reduced coordinates such that the essential physical properties of the system under consideration are retained. The choice of the relevant properties crucially dictates the definition of the CG model.
Following a top-down approach, the CG procedure is driven by the objective to reproduce macroscopic properties, such as structural information or experimentally measured observables.19,21,24–27 In a bottom-up approach, on the other hand, an effective potential is designed to reproduce a selection of properties of an atomistic model, for instance, the probability distribution in a suitable space and the corresponding metastable states.20,22,28–31
In the past several years, machine learning (ML) techniques have been increasingly applied in molecular simulation.9,32–34 Bottom-up CG methods have also started to leverage the advances in ML, to define classical atomistic potentials or force fields from quantum chemical calculations,35–51 to learn kinetic models,52–56 or to design effective CG potentials from atomistic simulations.57–59 In this context, we have recently shown that a deep neural network (NN) can be used in combination with the well-established “force matching” approach60 to define a coarse-grained implicit-water potential that is able to reproduce the correct folding/unfolding process of a small protein from atomistic simulations in explicit water.59 In the force matching approach, the effective energy function of the CG model is optimized variationally by finding the CG force field that minimizes the difference with the instantaneous atomistic forces projected onto the CG coordinates. As there are multiple atomistic configurations consistent with a CG configuration, this estimator is very noisy, and the approach requires a large amount of training data. It is thus restricted to parametric models such as NNs, as the computational complexity of non-parametric models is directly linked to the training set size. Here, we propose a method to overcome this limitation via bootstrap aggregation in combination with a non-parametric, kernel-based regressor.
In particular, we use the Gradient-Domain Machine Learning (GDML) approach.50,61 In its application to quantum data, GDML is able to use a small number (usually less than a few thousand) of example points to build an accurate force field for a specific molecule. Because of the degeneracy of the CG mapping, much more training data are required to reconstruct a coarse-grained force field, and the quadratic scaling of the memory requirements with the dataset size prevents a direct application of GDML to the definition of CG models.
To solve this problem, we pursue a hierarchical ensemble learning approach in which the full training set is first divided into smaller batches that are trained independently. A second GDML layer is then applied to the mean prediction of this ensemble, providing the second model with a consistent set of inputs and outputs. We show that GDML with ensemble learning can be efficiently used for the coarse-graining of molecular systems.
The structure of the paper is as follows: In Sec. II, we briefly review the principle of force matching that we use for coarse graining, as well as mathematical underpinnings of the kernel ridge regression used in the GDML method. Then, we describe the idea of ensemble learning and explain how it solves the problem associated with the large number of training points required by force matching. In the “Results” section, we demonstrate that a GDML approach trained with ensemble learning performs well on the coarse-graining of a small molecular system, alanine dipeptide simulated in water solvent, as it produces the same free energy surface as obtained in the all-atom simulations. As was already demonstrated in the case of a NN approach,59 the key to success of a GDML-based coarse graining is that it is able to naturally capture nonlinearities and multi-body effects arising from the renormalization of degrees of freedom at the base of coarse-graining.
II. THEORY AND METHODS
A. Coarse-graining with thermodynamic consistency
Although the definition of a coarse-graining mapping scheme is per se an interesting problem,23,62–64 here, we start by assuming that a mapping is given. The all-atom system we want to coarse-grain consists of N atoms, and its configurations are represented by a 3N-dimensional vector r ∈ ℝ^{3N}. The lower dimensional CG representation of the system is given by the mapping

x = ξ(r) ∈ ℝ^{3n},

where n < N is the number of CG beads. The CG mapping function ξ is assumed to be linear, i.e., there exists a coarse-graining matrix Ξ ∈ ℝ^{3n×3N} that maps the all-atom space to the CG space: x = Ξr.
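As an illustration, the following minimal sketch (Python/NumPy; the atom indices and variable names are ours and purely illustrative, not taken from the original setup) builds a slicing coarse-graining matrix Ξ that selects a subset of atoms and applies it to an all-atom configuration.

```python
import numpy as np

def slicing_cg_matrix(n_atoms, cg_atom_indices):
    """Build a linear CG mapping matrix Xi of shape (3n, 3N) that simply
    selects the Cartesian coordinates of the chosen atoms."""
    n_beads = len(cg_atom_indices)
    Xi = np.zeros((3 * n_beads, 3 * n_atoms))
    for bead, atom in enumerate(cg_atom_indices):
        Xi[3 * bead:3 * bead + 3, 3 * atom:3 * atom + 3] = np.eye(3)
    return Xi

# Hypothetical example: map a 22-atom configuration onto 6 CG beads.
r = np.random.randn(3 * 22)                       # all-atom configuration, shape (3N,)
Xi = slicing_cg_matrix(22, [1, 2, 3, 5, 6, 8])    # illustrative atom indices
x = Xi @ r                                        # CG configuration, shape (3n,)
```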
The definition of a CG model requires an effective potential U(x; θ) in the CG space, where θ are the optimization parameters. The potential U(x; θ) can then be used to generate an MD trajectory with a dynamical model. Parameterizations are available in varying degrees of sophistication, ranging from classical force fields with fixed functional forms to ML approaches with strong physical basis.
One popular bottom-up method for building a CG model is to require thermodynamic consistency, that is, to design a CG potential such that its equilibrium distribution matches the one of the all-atom model. In practice, this means that an optimal CG potential satisfies the condition

U(x; θ) = −kBT ln pCG(x) + const,   (3)

where kB is the Boltzmann constant, T is the temperature, and the probability density distribution in the CG space, pCG(x), is given by the equilibrium distribution of the all-atom model mapped to the CG coordinates,

pCG(x) ∝ ∫ μ(r) δ(x − ξ(r)) dr,

where μ(r) ∝ exp[−V(r)/(kBT)] is the Boltzmann weight associated with the atomistic energy V(r).
Different methods have been proposed to construct a CG potential U(x; θ) that satisfies Eq. (3), notably the relative entropy method31 and the force-matching method.20,60 In this work, we demonstrate how to learn a molecular CG potential using the idea of force matching together with the GDML kernel method.
B. Force matching
It can be shown that the potential U(x; θ) satisfies thermodynamic consistency if the associated CG forces −∇U(x; θ) minimize the mean square error (MSE),20,60

χ2(θ) = ⟨‖ξ(F(r)) + ∇U(ξ(r); θ)‖2⟩r,   (4)

where ξ(F(r)) denotes the instantaneous all-atom forces projected onto the CG space and ⟨·⟩r is the weighted average over the equilibrium distribution of the atomistic model, i.e., r ∼ μ(r).
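For concreteness, a minimal sketch of how this loss can be estimated from data, assuming the projected all-atom forces and the model's CG force predictions are already available as arrays (NumPy; function and variable names are ours):

```python
import numpy as np

def force_matching_mse(projected_forces, predicted_cg_forces):
    """Empirical estimate of chi^2: mean squared deviation between the
    instantaneous projected all-atom forces and the CG model forces,
    averaged over configurations drawn from the atomistic equilibrium."""
    diff = projected_forces - predicted_cg_forces   # shape (n_samples, 3n)
    return np.mean(np.sum(diff ** 2, axis=1))
```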
The effective CG force field −∇U(x; θ) that minimizes χ2(θ) corresponds to the mean force,60

−∇U(x; θ) = ⟨ξ(F(r))⟩r∣x,

where r∣x indicates the equilibrium distribution of r constrained to the CG coordinates x, i.e., the ensemble of all atomistic configurations that map to the same CG configuration. For this reason, an optimized CG potential U(x; θ) is also called the potential of mean force (PMF).
By following statistical estimator theory,65 it can also be shown59 that the error χ2(θ) [Eq. (4)] can be decomposed into two terms,

χ2(θ) = χ2_PMF(θ) + χ2_noise,

where the PMF error term χ2_PMF(θ) measures the deviation of the model forces from the exact mean forces and the noise term χ2_noise collects the fluctuations of the instantaneous projected atomistic forces around the mean forces. While the PMF error term depends on the definition of the CG potential and can, in principle, be reduced to zero, the noise term does not depend on the CG potential: it is solely associated with the decrease in the number of degrees of freedom in the CG mapping and is, in general, larger than zero. The force matching estimator of Eq. (4) is thus intrinsically very noisy.
C. GDML
In previous work,59 we have introduced CGnet to minimize the error in Eq. (4) using a neural network to parametrize the CG forces. We have demonstrated that the CGnet approach successfully recovers optimal CG potentials. A large training dataset enables CGnet to resolve the ambiguity in the coarse-grained force labels by converging to the respective mean forces. Here, we explore the Gradient-domain Machine Learning approach (GDML)50,51 as an alternative.
GDML has been used to obtain an accurate reconstruction of flexible molecular force fields from small reference datasets of high-level ab initio calculations.50,51,61 Unlike traditional classical force fields, this approach imposes no hypothesized interaction pattern for the nuclei and is thus unhindered in modeling any complex physical phenomena. Instead, GDML imposes energy conservation as inductive bias, a fundamental property of closed classical and quantum mechanical systems that does not limit generalization. This makes highly data efficient reconstruction possible without sacrificing generality.
The key idea is to use a Gaussian process (GP) to model the force field f as a transformation of an unknown potential energy surface U such that

f ∼ GP[−∇μU(x), ∇x kU(x, x′) ∇x′⊤].

Here, μU and kU are the mean and covariance functions of the corresponding energy predictor, respectively.
To help disambiguate physically equivalent inputs, the Cartesian geometries x are represented by a descriptor D with inverse pairwise distance entries,

Dij = ‖xi − xj‖−1, i > j,

a representation that introduces roto-translational invariance.
that introduces roto-translational invariance. Accordingly, the posterior mean of the GDML model takes the form
where JD(x) is the Jacobian of the descriptor (see the supplementary material for details). Due to linearity, the corresponding expression for the energy predictor can be simply obtained via (analytic) integration. GDML uses a Matérn kernel kU(x, x′) with restricted differentiability (ν = 5/2) to construct the force field kernel function,

kU(x, x′) = (1 + (√5 d)/σ + (5 d2)/(3σ2)) exp(−(√5 d)/σ),   (11)

where d = ‖x − x′‖ is the Euclidean distance between the two inputs and σ is a hyperparameter.
We use this kernel because empirical evidence indicates that kernels with limited smoothness yield better predictors, even if the prediction target is infinitely differentiable. It is generally assumed that overly smooth priors are detrimental to data efficiency, as the associated hypothesis space is harder to constrain with a finite number of (potentially noisy) training examples.66 The differentiability of functions is directly linked to the rate of decay of their spectral density at high frequencies, which has been shown to play a critical role in spatial interpolation.67
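For reference, a minimal sketch of a Matérn kernel with ν = 5/2, the restricted-differentiability form discussed above (NumPy; the GDML force-field kernel is built from derivatives of this function, which we do not reproduce here):

```python
import numpy as np

def matern52(x, x_prime, sigma):
    """Matern kernel with nu = 5/2, evaluated on the Euclidean distance
    between two input (descriptor) vectors; sigma is the length-scale
    hyperparameter."""
    d = np.linalg.norm(x - x_prime)
    u = np.sqrt(5.0) * d / sigma
    return (1.0 + u + u ** 2 / 3.0) * np.exp(-u)
```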
D. Ensemble learning
Ensemble learning is a general and widely used machine learning technique to increase the predictive performance of a trained model by combining multiple sub-models.68–76 In this work, we use the idea at the basis of a particular ensemble learning method, called bootstrap aggregation (or bagging) in the machine learning literature,71 summarized in the following paragraphs and in Algorithm 1. This method enables us to train a GDML approach over millions of data points, a task that would otherwise be impossible.
ALGORITHM 1. 2-layer training scheme.
2Layer-GDML(D, N, n, n′)
1. | Sample N data batches from the original bulk dataset D: {D1, D2, …, DN}; each data batch Di contains n randomly sampled points, and each data point d = (r, f) includes a molecular configuration part r and a force part f
2. | Sample one additional data batch D′ from the original bulk dataset D, containing n′ points dj = (rj, fj)|j=1,2, …, n′; each point dj also includes a molecular configuration part rj and a force part fj, and j indicates the point index in data batch D′
3. | For i = 1, …, N: # Loop over N batches:
(a) Train GDML model Pi using data batch Di
(b) Predict the forces for all n′ configurations rj|j=1,2, …, n′ in data batch D′ using model Pi; denote these predictions as fj(i) = Pi(rj)
4. | Construct the mean force set Dm, which also includes n′ points; the configuration part rj of each point is the same as in D′, but the force part is the average over the N GDML models, f̄j = (1/N) Σi fj(i)
5. | Train the 2nd-layer model P using the constructed mean force set Dm
In general, we generate a finite set of alternative GDML reconstructions from randomly drawn subsets of the full MD trajectory and average them to generate an estimate for the “expected” force prediction at each point. The variability in the individual training sets promotes flexibility in the structure across all models in the ensemble and enables us to capture the variability in the dataset. We are then able to compute the expected value for each input by simply taking the mean of the ensemble.
Suppose that we have a large dataset D: (x, y) for training that contains N samples of pairs of points (x, y). We would like to train a predictive model f such that y = f(x), using the data D. Instead of training a single model f using all N data points from D, we first randomly sample n batches, {D1, D2, …, Dn}, where each batch Di contains N′ points. Usually, N is too large to efficiently train a single model, but it is possible to train sub-models {f1, f2, …, fn} on the different batches if N′ ≪ N. After training on all the batches, the final predictive model f is obtained as the average of all the sub-models,

f(x) = (1/n) Σi=1…n fi(x).
This enables us to generate consistent labels for a held-out subset of the trajectory, which then serves as the basis for another GDML reconstruction.
We demonstrate how bootstrap aggregation works on a simple example, where we learn an effective curve to fit a one-dimensional dataset (a minimal code sketch of this toy example is given below). As shown in Fig. 1(a), 600 raw points are uniformly sampled from x ∈ [0, 6], and the y value of each point is assigned according to yi = sin(xi) + 0.2ξ, where ξ ∼ N(0, 1) is random noise. These 600 points serve as the noisy training set. Instead of learning the curve using all 600 points at once, we bootstrap-sample 100 batches from the full dataset, where each batch contains only 20 points. We fit a sixth-order polynomial to the 20 points in each batch, and the 100 fitted curves are shown in blue in Fig. 1(a). While each of these 100 blue curves oscillates around the mean and overfits the data, the mean of the 100 predictors (red curve) is smooth and agrees with the ground truth y = sin(x) (green curve) quite well.

We use the idea of ensemble learning to apply GDML to CG problems as a 2-layer procedure. Instead of training one single GDML model using all the data, which usually exceeds the memory limit of GDML, we train N models using N data batches, where each batch contains only n points. In this work, N = n = 1000. Since 1000 points is far below the GDML limit, each GDML model Pi is easy to train. After obtaining all N GDML models {P1, P2, …, PN}, we use them to predict the force fi corresponding to the ith model for any given CG configuration x as fi = Pi(x). The mean force (CG force) for a configuration x is then the average of the forces over all the models, f̄(x) = (1/N) Σi=1…N Pi(x). This average force prediction could be used directly in a CG molecular simulation, but the resulting model would be inefficient, since for each single configuration x, the forces fi would need to be evaluated for all N models to obtain the average CG force f̄.
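A minimal sketch of the one-dimensional toy example above (NumPy; the number of batches, batch size, and polynomial degree follow the values given in the text, while the random seed and evaluation grid are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy training set: y = sin(x) + 0.2 * N(0, 1), 600 points on [0, 6].
x = rng.uniform(0.0, 6.0, size=600)
y = np.sin(x) + 0.2 * rng.standard_normal(600)

# Bootstrap 100 batches of 20 points each; fit a sixth-order polynomial to each.
x_grid = np.linspace(0.0, 6.0, 200)
predictions = []
for _ in range(100):
    idx = rng.choice(600, size=20, replace=True)
    coeffs = np.polyfit(x[idx], y[idx], deg=6)       # one (overfitting) sub-model
    predictions.append(np.polyval(coeffs, x_grid))

# The ensemble mean is smooth and close to the ground truth sin(x).
ensemble_mean = np.mean(predictions, axis=0)
```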
Schematic diagram illustrating the principle of ensemble learning. (a) One dimensional toy system. (b) 2-layer training scheme for learning CG force field using a GDML model.
This low evaluation efficiency motivates us to propose a 2-layer procedure to speed up the evaluation of the mean force prediction. We generate a new batch of data D′, which contains n′ points (n′ = 3000 in this work). For all CG configurations in D′, we use the N predictors to evaluate their forces and compute the mean forces. This produces a new dataset Dm in which the n′ configurations are associated with the corresponding mean forces. Constructing Dm is fast because the mean forces are computed only for n′ points, usually a few thousand.
Once the mean force set Dm is obtained, we can train a single final model P using the entire Dm. Since Dm contains the mean forces, the final model P also predicts the mean forces for the CG configurations. P is easy to train due to the small size of Dm (n′ is far below the GDML limit), and the force evaluation for the final model P is much more efficient than evaluating all N models {P1, P2, …, PN}. The general procedure of the 2-layer scheme is illustrated in Algorithm 1.
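A compact sketch of the 2-layer procedure of Algorithm 1, written around a generic, hypothetical train_gdml(configs, forces) trainer that returns an object with a predict method; any kernel force-field regressor could play this role, and only the control flow is meant to be illustrative:

```python
import numpy as np

def two_layer_gdml(configs, forces, train_gdml, n_batches=1000,
                   batch_size=1000, n_second_layer=3000, rng=None):
    """First layer: train n_batches sub-models on small random batches.
    Second layer: train one final model on the ensemble-averaged (mean)
    forces predicted for a held-out set of configurations."""
    rng = rng or np.random.default_rng()
    n_total = len(configs)

    # First layer: independent sub-models, each trained on a small batch.
    sub_models = []
    for _ in range(n_batches):
        idx = rng.choice(n_total, size=batch_size, replace=False)
        sub_models.append(train_gdml(configs[idx], forces[idx]))

    # Held-out configurations for the second layer.
    idx2 = rng.choice(n_total, size=n_second_layer, replace=False)
    configs2 = configs[idx2]

    # Ensemble-averaged forces serve as consistent labels (mean force set D_m).
    mean_forces = np.mean([m.predict(configs2) for m in sub_models], axis=0)

    # Second layer: a single model trained on the mean force set.
    return train_gdml(configs2, mean_forces)
```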
The hyperparameters that control the performance of the final model are the two kernel widths σ1 and σ2 for the first and second layers [see Eq. (11)]. Another hyperparameter is the regularization coefficient of the ridge term, which is set to the standard value (λ = 1 × 10−15) used in the original GDML paper.50 We conduct a 2D cross-validation search to determine σ1 and σ2. The algorithm for the cross-validation of the ensemble-learning GDML is shown in Algorithm 2, and a sketch of the corresponding fold construction is given after the algorithm. The parameters are selected as N = n = 1000, n′ = 3000, and K = 5, where K is the number of folds for the cross-validation, and the total number of points in D is 1 000 000.
ALGORITHM 2. 2-layer training scheme with cross-validation.
2Layer-GDML-CV(D, N, n, n′, K)
1. | Sample N data batches from the original bulk dataset D: {D1, D2, …, DN}; each data batch Di contains n randomly sampled points, and each point d = (r, f) includes the molecular configuration part r and the force part f
2. | Sample one additional data batch D′ from the original bulk dataset D, containing n′ points dj = (rj, fj)|j=1,2, …, n′; each point dj also includes the molecular configuration part rj and the force part fj, and j indicates the point index in data batch D′
3. | For i = 1, …, N: # Loop over N batches:
(a) Train GDML model Pi using data batch Di
(b) Predict the forces for all n′ configurations rj|j=1,2, …, n′ in data batch D′ using model Pi; denote these predictions as fj(i) = Pi(rj)
4. | Divide the N data batches into K subsets of batches, SD = {SD1, SD2, …, SDK}, where each subset SDl contains N/K batches
5. | For the jth point in D′, divide its N predicted forces into subsets SFj = {SFj,1, SFj,2, …, SFj,K}, where each subset contains the N/K force predictions consistent with the division in step 4, and j = 1, 2, …, n′
6. | For l = 1, …, K: # Loop over K cross-validation folds:
(a) For the jth point in D′, compute the mean force using all forces from SFj∖SFj,l (i.e., excluding SFj,l), where j = 1, 2, …, n′; after obtaining the mean forces for all n′ configurations in D′, construct the lth mean force set Dm,l
(b) Train the lth second-layer model P2l using Dm,l
(c) Compute the validation error of model P2l using all data points from the excluded set SDl, and denote the error as El
7. | Return the cross-validation score E = (1/K) Σl=1…K El
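The leave-one-fold-out averaging in steps 4–6(a) of Algorithm 2 amounts to simple index bookkeeping; a minimal sketch (NumPy; the array preds and the function name are ours, introduced only for illustration):

```python
import numpy as np

def leave_one_fold_out_mean_forces(preds, n_folds=5):
    """preds: array of shape (N, n_prime, dim) holding the forces predicted
    by each of the N first-layer models for the n_prime held-out
    configurations. Returns, for each fold l, the mean forces computed from
    all sub-models except those assigned to fold l [step 6(a)]."""
    n_models = preds.shape[0]
    folds = np.array_split(np.arange(n_models), n_folds)   # step 4: K groups of models
    mean_force_sets = []
    for l in range(n_folds):
        keep = np.concatenate([folds[k] for k in range(n_folds) if k != l])
        mean_force_sets.append(preds[keep].mean(axis=0))    # average excluding fold l
    return mean_force_sets
```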
E. Stratified sampling
Another crucial factor that impacts the overall performance of our machine learning model is the distribution of the training data. As our training data are obtained from extensive MD simulations, they are distributed according to the Boltzmann distribution in the molecule configuration space. If a small batch of data is randomly sampled from the whole dataset, the large majority of the data will reside in low free energy regions, while data in high free energy regions, such as transition barriers, are underrepresented. Figure 2(a) shows that, in the case of alanine dipeptide, most of the data in a small batch of randomly selected points are located in the free energy minima on the left side of the (ϕ, ψ) dihedral angle space. If batches from this biased distribution are used in the ensemble learning, the errors for predicting the PMF in high free energy regions would be very large because the models will not be trained efficiently in these sparse data regions.
Stratified sampling of the training set for alanine-dipeptide in the dihedral angles (ψ, ϕ) space. (a) Regular (Boltzmann distributed) sampling of 1000 points for the first-layer. (b) Regular sampling of 3000 points for the second-layer. (c) Uniformly stratified sampling of 1000 points in the (ψ, ϕ) space for the first layer. (d) Uniformly stratified sampling of 3000 points in the (ψ, ϕ) space for the second layer.
In order to solve this issue, we sample the data for the batches uniformly in the (ϕ, ψ) dihedral angle space of alanine dipeptide, as shown in Figs. 2(c) and 2(d). In this way, all relevant regions of the configurational space, including the transition states, are equally represented in the training set. The advantage of this stratified sampling is illustrated in more detail in Sec. III.
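A minimal sketch of such stratified sampling, assuming the dihedral angles of every frame have already been computed (NumPy; the bin count and the equal-allocation rule are choices of ours):

```python
import numpy as np

def stratified_sample(phi, psi, n_samples, n_bins=20, rng=None):
    """Draw frame indices approximately uniformly over the occupied bins of a
    regular grid in (phi, psi), instead of uniformly along the trajectory."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    bin_phi = np.clip(np.digitize(phi, edges) - 1, 0, n_bins - 1)
    bin_psi = np.clip(np.digitize(psi, edges) - 1, 0, n_bins - 1)
    bin_id = bin_phi * n_bins + bin_psi

    occupied = np.unique(bin_id)
    per_bin = int(np.ceil(n_samples / len(occupied)))
    chosen = []
    for b in occupied:
        members = np.flatnonzero(bin_id == b)
        take = min(per_bin, len(members))
        chosen.append(rng.choice(members, size=take, replace=False))
    chosen = np.concatenate(chosen)
    rng.shuffle(chosen)
    return chosen[:n_samples]
```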
F. Simulating the CG-GDML model
After training the 2-layer GDML model, we use over-damped Langevin dynamics to generate a trajectory and sample the CG potential U(x; θ),

xt+τ = xt − (τD/kBT) ∇U(xt; θ) + √(2τD) ξt,   (13)

where xt (xt+τ) is the CG configuration at time t (t + τ), τ is the time step, D is the diffusion constant, and ξt is a vector of independent Gaussian random variables with zero mean and identity covariance matrix (Wiener process). To sample the trained potential more efficiently, we generate 100 independent trajectories in parallel, with initial configurations randomly sampled from the original dataset.
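A minimal sketch of this integrator, assuming a cg_force(x) callable that returns the CG forces −∇U(x; θ); the Euler–Maruyama prefactor convention below is our assumption:

```python
import numpy as np

def overdamped_langevin(x0, cg_force, n_steps, dt, diffusion, kT, rng=None):
    """Generate a CG trajectory with the over-damped Langevin update
    x_{t+dt} = x_t + dt * D / kT * f(x_t) + sqrt(2 * D * dt) * xi,
    where f = -grad U and xi is standard Gaussian noise."""
    rng = rng or np.random.default_rng()
    traj = np.empty((n_steps + 1, x0.size))
    traj[0] = x0
    for t in range(n_steps):
        noise = rng.standard_normal(x0.size)
        traj[t + 1] = (traj[t]
                       + dt * diffusion / kT * cg_force(traj[t])
                       + np.sqrt(2.0 * diffusion * dt) * noise)
    return traj
```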
G. Including physical constraints
When over-damped Langevin dynamics [Eq. (13)] is used to generate a trajectory with a CG potential trained on a finite dataset, an undesired situation may occur: since the dynamics is stochastic, the simulated CG trajectory may diffuse away from the domain of the data used in the training and generate unphysical configurations. For example, stretching a bond far away from its equilibrium distance is associated with a very high energy cost and is never observed in a simulation with a force field at finite temperature. In a simulation with a machine-learned CG potential, there is no mechanism preventing such unphysical bond stretching. Similar to what we proposed in our recent work,59 this problem can be solved by including a prior potential energy Uprior(x) incorporating physical prior knowledge of the system,

U(x; θ) = Uprior(x) + Udiff(x; θ),

where Uprior(x) has harmonic terms modeling bond and angle stretching, with parameters extracted from the training data by Boltzmann inversion, and Udiff(x; θ) is the difference between the total CG potential and Uprior(x). The forces obey a similar relation,

−∇U(x; θ) = −∇Uprior(x) − ∇Udiff(x; θ),

so the loss function of the model becomes

χ2(θ) = ⟨‖ξ(F(r)) + ∇Uprior(ξ(r)) + ∇Udiff(ξ(r); θ)‖2⟩r.
In contrast to the neural network model,59 the prior potential is not added directly to the trained model: the prior forces are first evaluated and subtracted from the all-atom forces, and GDML is trained on this force difference. Once the model is trained, the total energy (and forces) is obtained by adding the prior energy (and forces) back to the output of the trained model.
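A minimal sketch of how the harmonic prior parameters can be extracted from the training data by Boltzmann inversion, under a Gaussian assumption for each bond or angle feature (NumPy; function names are ours):

```python
import numpy as np

def harmonic_prior_from_data(feature_samples, kT):
    """Boltzmann inversion under a Gaussian assumption: for each bond/angle
    feature b, the harmonic constant is kT / Var(b) and the equilibrium value
    is the sample mean, so that U_prior = sum_i 0.5 * k_i * (b_i - b0_i)**2."""
    b0 = feature_samples.mean(axis=0)        # equilibrium values, shape (n_features,)
    k = kT / feature_samples.var(axis=0)     # harmonic force constants
    return k, b0

def prior_energy(features, k, b0):
    """Evaluate the harmonic prior energy for the features of one configuration."""
    return 0.5 * np.sum(k * (features - b0) ** 2)
```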
III. RESULTS
We illustrate the results of the approach discussed above on a simple molecular system, namely, the coarse-graining of the alanine dipeptide molecule from the atomistic model in explicit water into a 6-bead CG model. The all-atom model of alanine dipeptide consists of 22 atoms and 651 water molecules, for a total of a few thousand degrees of freedom. As illustrated in Fig. 3, for the CG representation, we select the five central backbone atoms of the molecule plus a sixth atom that breaks the symmetry and differentiates right- and left-handed configurations. The overall pipeline for the coarse-graining and the training procedure discussed below is also summarized in Fig. 3.
Pipeline for learning the CG force field with the GDML model. (a) All-atom simulation of alanine dipeptide in water. (b) We compute the two dihedral angles ϕ and ψ and project the simulation data onto the (ϕ, ψ) space. (c) All-atom free energy surface in (ϕ, ψ) space. (d) The coarse-grained model contains only six heavy atoms from the original molecule. (e) We can sample enough points for training a CG model, but the resulting dataset is usually big. (f) Training the GDML model with one big dataset requires large memory, which hinders the application of GDML to coarse-grain a molecule. (g) Instead of sampling one big training set, we sample many smaller training sets. (h) We train a GDML model with each small training set. (i) We use Langevin dynamics to simulate a CG MD trajectory with each trained GDML model. (j) Similar to (b) and (c), we compute the free energy surface in (ϕ, ψ) space for each trajectory and find that these single models poorly recover the correct free energy surface. (k) We obtain an extra model, which is the average of all models trained in step (h). The averaging procedure indicated by the red dashed box and arrow corresponds to the red box and arrow in Fig. 1(b). (l) We simulate the averaged model using Langevin dynamics. (m) The averaged CG model correctly reconstructs the free energy surface of the molecule. The final result is highlighted in a light gray box.
We compute the free energy of alanine dipeptide as a function of the two dihedral angles ϕ and ψ, where ϕ is defined by atoms 1, 2, 3, 5 and ψ by atoms 2, 3, 5, 6 (see Fig. 3). As shown in Fig. 4(a), there are six metastable states in the free energy landscape of the all-atom model of alanine dipeptide. Figure 4(b) shows that the final 2-layer GDML CG model correctly reproduces the free energy landscape of alanine dipeptide: the free energy obtained from the trajectories [generated by numerical integration of Eq. (13)] of the CG model also exhibits six minima, with depths close to those of the corresponding minima in the all-atom model. Representative configurations from the six metastable states are shown in Fig. 4(a) for the all-atom (CPK representation) and the CG model (thick bond representation). Moreover, as shown in the supplementary material, Fig. S1, the bond and angle distributions from the CG simulation are also consistent with the all-atom simulation.
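A minimal sketch of how such a free energy surface can be estimated from a trajectory of dihedral angles by Boltzmann inversion of a two-dimensional histogram (NumPy; the bin count and the handling of empty bins are choices of ours):

```python
import numpy as np

def free_energy_surface(phi, psi, kT, n_bins=60):
    """F(phi, psi) = -kT * ln p(phi, psi), estimated from a 2D histogram and
    shifted so that the global minimum of the surface is at zero."""
    hist, phi_edges, psi_edges = np.histogram2d(
        phi, psi, bins=n_bins,
        range=[[-np.pi, np.pi], [-np.pi, np.pi]], density=True)
    with np.errstate(divide="ignore"):
        F = -kT * np.log(hist)               # empty bins give +inf
    F -= F[np.isfinite(F)].min()             # shift so that min(F) = 0
    return F, phi_edges, psi_edges
```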
Free energy surface in (ψ, ϕ) space for the trained GDML models. (a) Reference (all-atom) free energy landscape and representative configurations of the molecule in the six minima sampled from the all-atom trajectory [space-filling model (CPK) representation] and from a CG simulation with the 2-layer GDML model (thick bonds). (b) Free energy landscape from the 2-layer GDML model. (c) Free energy from a 2-layer GDML model with no stratified sampling. (d) Free energy from a traditional single-layer GDML model trained with 2000 points. (e) Free energy from a traditional single-layer GDML model trained with 5000 points.
The GDML model shown in Fig. 4(b) is optimized based on the minimum cross-validation error over a two-dimensional grid, spanned by the parameters σ1 and σ2, which are the kernel widths for the first and the second layer models. We find that the values (σ1, σ2) = (100, 10) give the smallest cross-validation error. Details on the cross-validation search can be found in the supplementary material, Fig. S4.
Figure 4(c) reports the free energy landscape corresponding to a CG model obtained with a 2-layer GDML but where the selection of the data for the sub-models is performed according to the Boltzmann distribution (that is, uniform sampling along the MD trajectory) instead of the stratified sampling scheme discussed above (uniform sampling in the (ϕ, ψ) space). While the free energy around the region of the deepest free energy minima in the (ϕ, ψ) space is quite accurate, the sparsely populated metastable state [indicated as state 3 in Fig. 4(a)] is completely missing in Fig. 4(c) because of the scarcity of training points in this region.
As a comparison, Fig. 4(d) shows the results when a single-layer GDML model is trained on only 2000 points. Although this model identifies the general location of the metastable states, the free energy landscape is significantly distorted with respect to the all-atom one. This poor reconstruction performance is due to the limited size of the training set, which is not extensive enough to enable a stable estimate of the expected forces for the reduced representation of the input. We also trained a single-layer GDML model on 5000 points. As shown in Fig. 4(e), the free energy of this model is slightly improved with respect to Fig. 4(d) because of the increased number of training points. However, the overall quality is still low compared to the atomistic model. We expect the reconstructed free energy to improve further if a model were trained on much more data, but this is hindered by the memory requirements: training a model with 5000 points already requires about 160 GB of memory, which is close to the upper limit of our computational resources.
To quantify the performance of the different approaches, we compute the mean square error (MSE) of the free energy of the different CG models relative to the atomistic model [Fig. 4(a), Table I; see Ref. 59 for details]. As expected, the 2-layer GDML model has the smallest free energy MSE, about 0.363 (kcal/mol)², when it is trained with all 1000 batches. The single-layer GDML gives the largest free energy MSE [2.947 (kcal/mol)² if trained with 2000 points and 1.641 (kcal/mol)² if trained with 5000 points]. If no stratified sampling is used, the free energy MSE is 0.861 (kcal/mol)², and most of this error is due to the discrepancies in the ϕ > 0 region of the free energy surface.
Free energy mean square error (MSE) comparison for CG models trained with different numbers of points, given in thousands (k). The error is computed as the mean square error of the free energy of alanine dipeptide in (ψ, ϕ) space relative to the atomistic free energy.59 The MSE values are in units of (kcal/mol)². Boldface values indicate the best performing model for each training set size.
Model | 100 k | 1000 k | 2 k | 5 k
---|---|---|---|---
CGnet | 1.982 ± 0.181 | 0.475 ± 0.103 | … | …
2-Layer GDML | 0.781 ± 0.154 | 0.363 ± 0.112 | … | …
Boltz. Samp. 2-Layer GDML | … | 0.861 ± 0.167 | … | …
1-Layer GDML | … | … | 2.947 ± 0.264 | 1.641 ± 0.243
As a baseline, we also compute the free energy error obtained by a CG model designed by means of a neural network, CGnet.59 We have previously applied CGnet to alanine dipeptide, but that model was based on a five-atom CG scheme and included two dihedral angles as input features to break the symmetry. To make the CGnet model consistent with the CG scheme used in this work, we modified it to contain six atoms (as in the GDML model) and no dihedral angle features (only distances are used as input). This CGnet model is trained with the same number of points as the GDML model (i.e., 1 000 000 points from 1000 batches). The resulting CGnet free energy MSE is 0.475 (kcal/mol)², a value slightly larger than that of the 2-layer GDML model. This result shows that the accuracy of a kernel approach can indeed be comparable to, or even better than, that of a neural network approach on the same system.
We have also investigated the effect of the batch number (or the training set size). We computed the cross-validation error with different training set sizes, from 10 to 1000 batches for the GDML model, or equivalently from 10 000 to 1 000 000 points for CGnet. Figure 5(a) shows that as the batch number increases from 10 to 1000, the cross-validation error for the GDML model drops quickly and reaches convergence for batch numbers >600. The cross-validation error for CGnet is significantly larger than that for the GDML model when the number of batches (or, equivalently, the training set size) is small. When the batch number is larger than 200, the cross-validation error for CGnet becomes smaller than that for GDML. Similarly, if we compare the free energy MSEs, as shown in Fig. 5(b), the free energy constructed by GDML with a small training set is significantly better than the corresponding free energy constructed by CGnet. On the other hand, with a large training set, the MSEs are comparable to each other. Typical free energy profiles are shown in Figs. 6(a)–6(d), and their corresponding MSE values are shown in Table I. These results show that, with enough data, the 2-layer GDML model and CGnet perform similarly well. However, the 2-layer GDML model is more data efficient and has a better ability to extrapolate the force prediction to unsampled configurations, thus outperforming CGnet for small training sets.
Cross-validation error (a) and free energy mean square error (MSE) (b) as a function of the number of batches. For CGnet, the training set size is equal to 1000 × the number of batches. The units of the cross-validation error are (kcal/mol/Å)², while the units of the free energy MSE are (kcal/mol)².
Free energy as a function of the alanine dipeptide dihedral angles for a 2-layer GDML CG model with number of batches NBatch = 100 (a) and NBatch = 1000 (c) and for CGnet with NBatch = 100 (b) and NBatch = 1000 (d).
IV. CONCLUSIONS
In this work, we combine the idea of ensemble learning with GDML to apply it to the coarse-graining problem. GDML is a kernel method to learn molecular force fields from data, and it allows us to model nonlinearities and multi-body effects without the need to provide a functional form for the potential. The GDML approach was originally proposed to learn molecular forces from quantum simulation data. When quantum calculations are used, the error on the force matching loss could, in principle, be zero, and a few thousand points are enough to construct an accurate, smooth, and conservative force field. However, when applied to coarse-graining, the force matching loss contains a nonzero term due to the dimensionality reduction, and the learning problem becomes very noisy. For this reason, many more data points from atomistic simulations are needed to learn a CG potential of mean force. The large amount of input data would normally hinder the application of GDML to the CG problem. In order to circumvent this problem, we use ensemble learning. The basic idea consists of breaking down the learning problem into small batches, which can be more easily solved, and combining the resulting models into a final solution. Following this approach, we do not train one single GDML model using all the data but propose a 2-layer training scheme: in the first layer, we generate N data batches, each containing a number of points far below the GDML limit. N models are trained on the different batches and are combined into a final model by taking the average. We show that the prediction of the 2-layer CG model accurately reproduces the thermodynamics of the atomistic model.
Consistent with previous work,59 we show that, when applying machine learning methods to design force fields for molecular systems, the addition of physical constraints enforces the proper asymptotic behavior of the model. In the design of CG potentials, physical constraints can be introduced by means of a prior potential energy term that prevents the appearance of spurious effects in non-physical regions of the configurational landscape.
A good GDML model should be able to construct a smooth and globally connected conserved force field. However, when the 2-layer approach is used, some of the molecular configurations with high free energy are poorly sampled in the training set, introducing large errors in the resulting model. In order to solve this problem, we sample the data uniformly in the low dimensional space defined by two collective coordinates rather than uniformly from the simulation time series. In the example of alanine dipeptide discussed here, the dihedral angles ϕ, ψ are chosen as collective coordinates.
In our previous work, we proposed CGnet,59 a neural network approach to design CG models. The overall free energy reconstruction obtained with the GDML model is comparable in accuracy to what was obtained with CGnet when the training set size is sufficiently large. However, the GDML model is significantly more accurate when the training set size is small, indicating that a kernel approach is data-efficient and could, in principle, provide more accurate CG models, especially for small training sets.
However, there are still several challenges in order to apply GDML for the coarse-graining of macromolecular systems. In larger systems, a more general definition is needed for the collective coordinates defining the low dimensional space for the uniform sampling of the training batches. These collective coordinates could, in principle, be extracted from the trajectory data,77,78 for instance, by means of time-lagged Independent Component Analysis (tICA),79–83 kernel PCA,84–86 or diffusion maps.87
The decomposition of the large input dataset into an ensemble of small batches has been used here to solve memory issues when training a GDML model. However, the computation is still expensive and we expect it to become even more expensive as the size of the molecular system increases. As the number of data batches and batch size grow, the Nyström approximation of the kernel or other numerical approaches may be a promising solution to increase the computational efficiency.
As with the neural network model, the GDML model trained by force matching can capture the thermodynamics of the system, but there is no guarantee that the dynamics is also preserved. Alternative approaches need to be defined to address this problem.88
Finally, the GDML model presented here is trained on a specific molecule, and it is not directly transferable to different systems. Ultimately, a transferable CG model would be needed for the general application to large systems that cannot be simulated by atomistic simulations. The trade-off between accuracy and transferability in CG models is an open research question that we will investigate in future work.
SUPPLEMENTARY MATERIAL
See the supplementary material for more details about the hyperparameter search, a discussion on the prior energy, and more information on the descriptors used in the GDML.
AUTHOR’S CONTRIBUTIONS
J.W. and S.C. contributed equally to this manuscript.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
ACKNOWLEDGMENTS
We thank Eugen Hruska and Feliks Nüske for comments on the manuscript. This work was supported by the National Science Foundation (Grant Nos. CHE-1738990, CHE-1900374, and PHY-1427654), the Welch Foundation (Grant No. C-1570), the MATH + excellence cluster (AA1-6, EF1-2), the Deutsche Forschungsgemeinschaft (Grant No. SFB 1114/A04), the European Commission (Grant No. ERC CoG 772230 “ScaleCell”), and the Einstein Foundation Berlin (Einstein Visiting Fellowship to CC). Simulations were performed on the computer clusters of the Center for Research Computing at Rice University, supported, in part, by the Big-Data Private-Cloud Research Cyberinfrastructure MRI-award (NSF Grant No. CNS-1338099), and on the clusters of the Department of Mathematics and Computer Science at Freie Universität, Berlin. K.-R.M. acknowledges partial financial support by the German Ministry for Education and Research (BMBF) under Grant Nos. 01IS14013A-E, 01IS18025A, 01IS18037A, 01GQ1115, and 01GQ0850; Deutsche Forschungsgemeinschaft (DFG) under Grant Math+, EXC 2046/1 (Project No. 390685689); and the Technology Promotion (IITP) grant funded by the Korea government (Grant Nos. 2017-0-00451 and 2017-0-01779). S.C. acknowledges the BASLEARN - TU Berlin/BASF Joint Laboratory, co-financed by TU Berlin and BASF SE.