Gradient-domain machine learning (GDML) is an accurate and efficient approach to learn a molecular potential and associated force field based on the kernel ridge regression algorithm. Here, we demonstrate its application to learn an effective coarse-grained (CG) model from all-atom simulation data in a sample efficient manner. The CG force field is learned by following the thermodynamic consistency principle, here by minimizing the error between the predicted CG force and the all-atom mean force in the CG coordinates. Solving this problem by GDML directly is impossible because coarse-graining requires averaging over many training data points, resulting in impractical memory requirements for storing the kernel matrices. In this work, we propose a data-efficient and memory-saving alternative. Using ensemble learning and stratified sampling, we propose a 2-layer training scheme that enables GDML to learn an effective CG model. We illustrate our method on a simple biomolecular system, alanine dipeptide, by reconstructing the free energy landscape of a CG variant of this molecule. Our novel GDML training scheme yields a smaller free energy error than neural networks when the training set is small, and a comparably high accuracy when the training set is sufficiently large.

## I. INTRODUCTION

Molecular dynamics (MD) simulations have become an important tool to characterize the microscopic behavior of chemical systems. Recent advances in hardware and software allow significant extensions of the simulation timescales to study biologically relevant processes.^{1–3} For example, we can now characterize the configurational changes and folding and binding behavior of small to intermediate-sized proteins through MD on the timescale of milliseconds to seconds.^{4–9} However, the computational complexity of evaluating the potential energy prohibits this approach to scale up to significantly larger systems and/or longer timescales. Therefore, multiple ways have been proposed to speed up atomistic simulations, such as advanced sampling methods (e.g., umbrella sampling^{10–12} and parallel tempering^{13,14}) or adaptive sampling.^{15–17} An alternative approach is to reduce the dimensionality of the system by coarse-graining (CG).^{18–23} The fact that macromolecules usually exhibit robust collective behavior suggests that not every single degree of freedom is *per se* essential in determining the important macromolecular processes over long timescales. Furthermore, a CG representation of the system simplifies the model and allows for a more straightforward physico-chemical interpretation of large-scale conformational changes, such as protein folding or protein–protein binding.^{18}

Once the mapping from the atomistic to the CG representation is defined, a fundamental challenge is the definition of an effective potential in reduced coordinates such that the essential physical properties of the system under consideration are retained. The choice of the relevant properties crucially dictates the definition of the CG model.

Following a top-down approach, the CG procedure is driven by the objective to reproduce macroscopic properties, such as structural information or experimentally measured observables.^{19,21,24–27} In a bottom-up approach, on the other hand, an effective potential is designed to reproduce a selection of properties of an atomistic model, for instance, the probability distribution in a suitable space and the corresponding metastable states.^{20,22,28–31}

In the past several years, machine learning (ML) techniques have been increasingly applied in molecular simulation.^{9,32–34} Bottom-up CG methods have also started to leverage the advances in ML, to define classical atomistic potentials or force fields from quantum chemical calculations,^{35–51} to learn kinetic models,^{52–56} or to design effective CG potential from atomistic simulations.^{57–59} In this context, we have recently shown that a deep neural network (NN) can be used in combination with the well established “force matching” approach^{60} to define a coarse-grained implicit water potential that is able to reproduce the correct folding/unfolding process of a small protein from atomistic simulations in explicit water.^{59} In the force matching approach, the effective energy function of the CG model is optimized variationally by finding the CG force field that minimizes the difference with the instantaneous atomistic forces projected on the CG coordinates. As there are multiple atomistic configurations consistent with a CG configuration, this estimator is very noisy and this approach requires a large amount of training data. It is thus restricted to parametric models such as NNs, as the computational complexity of non-parametric models is directly linked to the training set size. Here, we propose a method to overcome this limitation in the dataset via bootstrap aggregation in combination with a non-parametric, kernel-based regressor.

In particular, we use the Gradient-Domain Machine Learning (GDML) approach.^{50,61} In the application to quantum data, GDML is able to use a small number (usually less than a few thousand) of example points to build an accurate force field for a specific molecule. Because of the degeneracy of the mapping, the training data required to reconstruct a coarse-grained force field are much larger, and the fact that memory requirements scale quadratically with dataset size prevents a direct application of GDML to the definition of CG models.

To solve this problem, we pursue a hierarchical ensemble learning approach in which the full training set is first divided into smaller batches that are trained independently. A second GDML layer is then applied to the mean prediction of this ensemble, providing the second model with a consistent set of inputs and outputs. We show that GDML with ensemble learning can be efficiently used for the coarse-graining of molecular systems.

The structure of the paper is as follows: In Sec. II, we briefly review the principle of force matching that we use for coarse graining, as well as mathematical underpinnings of the kernel ridge regression used in the GDML method. Then, we describe the idea of ensemble learning and explain how it solves the problem associated with the large number of training points required by force matching. In the “Results” section, we demonstrate that a GDML approach trained with ensemble learning performs well on the coarse-graining of a small molecular system, alanine dipeptide simulated in water solvent, as it produces the same free energy surface as obtained in the all-atom simulations. As was already demonstrated in the case of a NN approach,^{59} the key to success of a GDML-based coarse graining is that it is able to naturally capture nonlinearities and multi-body effects arising from the renormalization of degrees of freedom at the base of coarse-graining.

## II. THEORY AND METHODS

### A. Coarse-graining with thermodynamic consistency

Although the definition of a coarse-graining mapping scheme is *per se* an interesting problem,^{23,62–64} here, we start by assuming that a mapping is given. The all-atom system we want to coarse-grain consists of *N* atoms, and its configurations are represented by a 3*N* dimensional vector $r\u2208R3N$. The lower dimensional CG representation of the system is given by the mapping

where *n* < *N* is the number of CG beads. The CG mapping function *ξ* is assumed to be linear, i.e., there exists a coarse-graining matrix $\Xi \u2208R3n\xd73N$ that maps the all atom space to the CG space: **x** = Ξ**r**.

The definition of a CG model requires an effective potential *U*(**x**; ** θ**) in the CG space, where

*θ*are the optimization parameters. The potential

*U*(

**x**;

**) can then be used to generate an MD trajectory with a dynamical model. Parameterizations are available in varying degrees of sophistication, ranging from classical force fields with fixed functional forms to ML approaches with strong physical basis.**

*θ*One popular bottom-up method for building a CG model is to require thermodynamic consistency, that is, to design a CG potential such that its equilibrium distribution matches the one of the all-atom model. In practice, this means that an optimum CG potential satisfies the condition

where *k*_{B} is the Boltzmann constant, *T* is the temperature, and the probability density distribution in the CG space is given by the equilibrium distribution of the all-atom model mapped to the CG coordinates,

where $\mu r=exp\u2212Vr/kBT$ is the Boltzmann weight associated with the atomistic energy *V*(**r**).

Different methods have been proposed to construct a CG potential *U*(*x*, *θ*) that satisfy Eq. (3), notably the relative entropy method,^{31} and the force-matching method.^{20,60} In this work, we will demonstrate how we could learn the molecular CG potential using the idea of force-matching and the GDML kernel method.

### B. Force matching

It can be shown that the potential *U*(**x**; ** θ**) satisfies thermodynamic consistency if the associated CG forces −∇

*U*(

**x**;

**) minimize the mean square error (MSE),**

*θ*^{20,60}

where *ξ*(**F**(**r**)) denotes the instantaneous all-atom forces projected onto the CG space and ⟨·⟩_{r} is the weighted average over the equilibrium distribution of the atomistic model, i.e., **r** ∼ *μ*(**r**).

The effective CG force field −∇*U*(**x**; ** θ**) that minimizes

*χ*

^{2}(

**) corresponds to the mean force,**

*θ*^{60}

where **r**∣**x** indicates the equilibrium distribution of **r** constrained to the CG coordinates **x**, i.e., the ensemble of all atomistic configurations that map to the same CG configuration. For this reason, an optimized CG potential *U*(** x**,

**) is also called the potential of mean force (PMF).**

*θ*By following statistical estimator theory,^{65} it can also be shown^{59} that the error *χ*^{2}(** θ**) [Eq. (4)] can be decomposed into two terms,

where

While the PMF error term depends on the definition of the CG potential and can be, in principle, reduced to zero, the noise term does not depend on the CG potential and it is solely associated with the decrease in the number of degrees of freedom in the CG mapping, and it is, in general, larger than zero. The force matching estimator of Eq. (4) is thus intrinsically very noisy.

### C. GDML

In previous work,^{59} we have introduced CGnet to minimize the error in Eq. (4) using a neural network to parametrize the CG forces. We have demonstrated that the CGnet approach successfully recovers optimal CG potentials. A large training dataset enables CGnet to resolve the ambiguity in the coarse-grained force labels by converging to the respective mean forces. Here, we explore the Gradient-domain Machine Learning approach (GDML)^{50,51} as an alternative.

GDML has been used to obtain an accurate reconstruction of flexible molecular force fields from small reference datasets of high-level *ab initio* calculations.^{50,51,61} Unlike traditional classical force fields, this approach imposes no hypothesized interaction pattern for the nuclei and is thus unhindered in modeling any complex physical phenomena. Instead, GDML imposes energy conservation as inductive bias, a fundamental property of closed classical and quantum mechanical systems that does not limit generalization. This makes highly data efficient reconstruction possible without sacrificing generality.

The key idea is to use a Gaussian process ($GP$) to model the force field **f** as a transformation of an unknown potential energy surface *U* such that

Here, *μ*_{U} and *k*_{U} are the mean and covariance functions of the corresponding energy predictor, respectively.

To help disambiguate physically equivalent inputs, the Cartesian geometries **x** are represented by a descriptor **D** with entries

that introduces roto-translational invariance. Accordingly, the posterior mean of the GDML model takes the form

where **J**_{D(x)} is the Jacobian of the descriptor (see the supplementary material for details). Due to linearity, the corresponding expression for the energy predictor can be simply obtained via (analytic) integration. GDML uses a Matérn kernel *k*_{U}(**x**, **x**′) with restricted differentiability to construct the force field kernel function,

where *d* = ∥**x** − **x**′∥ is the Euclidean distance between the two inputs and *σ* is an hyperparameter.

We use this kernel because empirical evidence indicates that kernels with limited smoothness yield better predictors, even if the prediction target is infinitely differentiable. It is generally assumed that overly smooth priors are detrimental to data efficiency, as the associated hypothesis space is harder to constrain with a finite number of (potentially noisy) training examples.^{66} The differentiability of functions is directly linked to the rate of decay of their spectral density at high frequencies, which has been shown to play a critical role in spatial interpolation.^{67}

### D. Ensemble learning

Ensemble learning is a general and widely used machine learning trick to increase the predictive performance of a trained model by combining multiple sub-models.^{68–76} In this work, we use the idea at the basis of a particular ensemble learning method, called bootstrap-aggregation in the machine learning literature,^{71} summarized in the following paragraphs and Algorithm 1. This method enables us to train a GDML approach over millions of data points, a task that would be otherwise impossible.

2Layer-GDML(D, N, n, n′) | |

1. | Sample N data batches from the original bulk dataset D: {D_{1}, D_{2}, …, D_{N}}, each data batch D_{i} contains n randomly sampled points, each data point d = (r, f) includes a molecular configuration part r and a force part f |

2. | Sample one additional data batch $D\u0303m$ from the original bulk dataset D that contains n′ points, d_{i} = (r_{i}, f_{i})|_{i=1,2, …, n′}, each point d_{i} also includes a molecular configuration part r_{i} and a force part f_{i}, and j indicates the point index in data batch $D\u0303m$ |

3. | For i = 1, …, N: # Loop over N batches: |

(a) Train GDML model P_{i} using data batch D_{i} | |

(b) Predict forces for all n′ configurations r_{j}|_{j=1,2, …, n′} in data batch $D\u0303m$ using model P_{i}, which is denoted as $fji|j=1,2,\u2026,n\u2032$ | |

4. | Construct the mean force set D_{m}, which also includes n′ points, the configuration part r_{j} for each point is the same as $D\u0303m$, but the force part f_{j} is the averaged force computed using the N GDML models: $fj=1N\u2211i=1Nfji$ |

5. | Train the 2nd-layer model P using the constructed mean force set D_{m} |

2Layer-GDML(D, N, n, n′) | |

1. | Sample N data batches from the original bulk dataset D: {D_{1}, D_{2}, …, D_{N}}, each data batch D_{i} contains n randomly sampled points, each data point d = (r, f) includes a molecular configuration part r and a force part f |

2. | Sample one additional data batch $D\u0303m$ from the original bulk dataset D that contains n′ points, d_{i} = (r_{i}, f_{i})|_{i=1,2, …, n′}, each point d_{i} also includes a molecular configuration part r_{i} and a force part f_{i}, and j indicates the point index in data batch $D\u0303m$ |

3. | For i = 1, …, N: # Loop over N batches: |

(a) Train GDML model P_{i} using data batch D_{i} | |

(b) Predict forces for all n′ configurations r_{j}|_{j=1,2, …, n′} in data batch $D\u0303m$ using model P_{i}, which is denoted as $fji|j=1,2,\u2026,n\u2032$ | |

4. | Construct the mean force set D_{m}, which also includes n′ points, the configuration part r_{j} for each point is the same as $D\u0303m$, but the force part f_{j} is the averaged force computed using the N GDML models: $fj=1N\u2211i=1Nfji$ |

5. | Train the 2nd-layer model P using the constructed mean force set D_{m} |

In general, we generate a finite set of alternative GDML reconstructions from randomly drawn subsets of the full MD trajectory and average them to generate an estimate for the “expected” force prediction at each point. The variability in the individual training sets promotes flexibility in the structure across all models in the ensemble and enables us to capture the variability in the dataset. We are then able to compute the expected value for each input by simply taking the mean of the ensemble.

Suppose that we have a large dataset *D*: (*x*, *y*) for training that contains *N* samples of pairs of points *x*, *y*. We would like to train a predictive model *f* such that *y* = *f*(*x*), using the data *D*. Instead of training a single model *f* using the whole *N* data points from *D*, we first randomly sample *n* batches: {*D*_{1}, *D*_{2}, …, *D*_{n}}, where each batch *D*_{i} contains *N*′ points. Usually, *N* is too large to efficiently train a single model, but it is possible to train sub-models on the different batches {*f*_{1}, *f*_{2}, …, *f*_{n}} if *N*′ ≪ *N*. After training all the batches, the final predictive model *f* is obtained as the average of all the sub-models,

This enables us to generate consistent labels for a held-out subset of the trajectory, which then serves as the basis for another GDML reconstruction.

We demonstrate how bootstrapping aggregation is used on a simple example, where we learn an effective curve to fit a one dimensional dataset. As shown in Fig. 1(a), 600 raw points are uniformly sampled from *x* ∈ [0, 6], and the *y* value of each point is assigned according to *y*_{i} = sin(*x*_{i}) + 0.2*ξ*, where *ξ* ∼ *N*(0, 1) is a random noise. These 600 points serve as the noisy training set. Instead of learning the curve using all 600 points at once, we bootstrap sample 100 batches from the full dataset, where each batch contains only 20 points. We use a six-order polynomial function to fit 20 points in each batch, and the 100 fitted curves are shown in blue in Fig. 1(a). While each of these 100 blue curves oscillates around the mean and overfit the data, the mean of the 100 predictors (red curve) is smooth and agrees with the ground truth *y* = sin(*x*) (green curve) quite well. We use the idea of ensemble learning to apply GDML to CG problems as a 2-layer procedure. Instead of training one single GDML model using all data, which usually exceed the upper memory limit of GDML, we train *N* models $Pi|i=1N$ using *N* data batches, where each batch contains only *n* points. In this work, *N* = *n* = 1000. Since 1000 points is far below the GDML limit, each GDML model *P*_{i} is easy to train. After obtaining all *N* GDML models $Pi|i=1N$, we use them to predict the forces *f*_{i} corresponding to the *i*th model for any given CG configuration **x** as *f*_{i} = *P*_{i}(**x**). The mean force (CG force) for a configuration **x** is then the average of the forces for all the models: $f=1N\u2211i=1Nfi$. This average force prediction could be directly used in the CG molecular simulation, but the resulting model would be of low efficiency, since for each single configuration **x**, the forces *f*_{i} need to be evaluated for all *N* models $Pi|i=1N$ to obtain an average CG force *f*.

This low evaluation efficiency motivates us to propose a 2-layer procedure to speed up the evaluation of the mean force prediction. We generate a new batch of data $D\u0303m$, which contains *n*′ points (*n*′ = 3000 in this work). For all CG configurations in $D\u0303m$, we use *N* predictors to evaluate their forces and compute the mean forces. This produces a new dataset *D*_{m} where the *n*′ configurations are associated with the corresponding mean forces. Constructing *D*_{m} can be fast because the mean forces are computed only for *n*′ points, usually a few thousands.

Once the mean force set *D*_{m} is obtained, we can train a single final model *P* using the entire *D*_{m}. Since *D*_{m} contains the mean forces, the final model *P* also predicts the mean forces for the CG configurations. *P* is easy to train due to the small size of *D*_{m} (*n*′ is far below the GDML limit) and the force evaluation for the final model *P* is much more efficient than by evaluating *N* models $Pi|i=1N$. The general procedure of the 2-layer scheme is illustrated in Algorithm 1.

The hyperparameters that control the performance of the final model are the two kernel sizes *σ*_{1}, *σ*_{2} for each layer [see Eq. (11)]. Another hyperparameter is the regularization coefficient of the ridge term and is set to the standard value (*λ* = 1 × 10^{−15}) as in the original GDML paper.^{50} We conduct a 2D cross-validation search to determine *σ*_{1} and *σ*_{2}. The algorithm for the cross-validation of the ensemble learning GDML is shown in Algorithm 2. The parameters *N*, *n*, *n*′, *K* are selected as *N* = *n* = 1000, *n*′ = 3000, *K* = 5, *K* is the number of folds for the cross-validation, and the total number of points in *D* is 1 000 000.

2Layer-GDML-CV(D, N, n, n′, K) | |

1. | Sample N data batches from the original bulk dataset D: {D_{1}, D_{2}, …, D_{N}}, each data batch D_{i} contains n randomly sampled points, each points d = (r, f) includes the molecular configuration part r and the force part f |

2. | Sample one additional data batch $D\u0303m$ from the original bulk dataset D that contains n′ points, d_{i} = (r_{i}, f_{i})|_{i=1,2, …, n′}, each points d_{i} also includes the molecular configuration part r_{i} and the force part f_{i}, and j indicates the point index in data batch $D\u0303m$ |

3. | For i = 1, …, N: # Loop over N batches: |

(a) Train GDML model P_{i} using data batch D_{i} | |

(b) Predict forces for all n′ configurations r_{j}|_{j=1,2, …, n′} in data batch $D\u0303m$ using model P_{i}, which is denoted as $fji|j=1,2,\u2026,n\u2032$ | |

4. | Divide N data batches into K subsets of batches: SD = {SD_{1}, SD_{2}, …, SD_{K}}, where each subset SD_{i} contains N/K batches. |

5. | For the jth point in $D\u0303m$, divide N of its predicted forces $fji|i=1,\u2026,N$ into subsets SF_{j} = {SF_{j,1}, SF_{j,2}, …, SF_{j,K}}, where each subset contains N/K force tags that are consistent with the division in step 4, and j = 1, 2, …, n′ |

6. | For l = 1, …, K: Loop over K cross-validation folds: |

(a) For the jth point in $D\u0303m$, compute the mean forces using all forces from SF_{j}∖SF_{j,l} (excluding SF_{j,l}), where j = 1, 2, …, n′, after obtaining the mean forces for all n′ configurations in $D\u0303m$, construct the lth mean force set D_{m,l} | |

(b) Train the lth second layer model P2_{l} using D_{m,l} | |

(c) Compute the validation error of model P2_{l} using all data points from the excluded set SD_{l}, and denote the error as E_{l} | |

7. | Return cross-validation score $1K\u2211k=1KEk$ |

2Layer-GDML-CV(D, N, n, n′, K) | |

1. | Sample N data batches from the original bulk dataset D: {D_{1}, D_{2}, …, D_{N}}, each data batch D_{i} contains n randomly sampled points, each points d = (r, f) includes the molecular configuration part r and the force part f |

2. | Sample one additional data batch $D\u0303m$ from the original bulk dataset D that contains n′ points, d_{i} = (r_{i}, f_{i})|_{i=1,2, …, n′}, each points d_{i} also includes the molecular configuration part r_{i} and the force part f_{i}, and j indicates the point index in data batch $D\u0303m$ |

3. | For i = 1, …, N: # Loop over N batches: |

(a) Train GDML model P_{i} using data batch D_{i} | |

(b) Predict forces for all n′ configurations r_{j}|_{j=1,2, …, n′} in data batch $D\u0303m$ using model P_{i}, which is denoted as $fji|j=1,2,\u2026,n\u2032$ | |

4. | Divide N data batches into K subsets of batches: SD = {SD_{1}, SD_{2}, …, SD_{K}}, where each subset SD_{i} contains N/K batches. |

5. | For the jth point in $D\u0303m$, divide N of its predicted forces $fji|i=1,\u2026,N$ into subsets SF_{j} = {SF_{j,1}, SF_{j,2}, …, SF_{j,K}}, where each subset contains N/K force tags that are consistent with the division in step 4, and j = 1, 2, …, n′ |

6. | For l = 1, …, K: Loop over K cross-validation folds: |

(a) For the jth point in $D\u0303m$, compute the mean forces using all forces from SF_{j}∖SF_{j,l} (excluding SF_{j,l}), where j = 1, 2, …, n′, after obtaining the mean forces for all n′ configurations in $D\u0303m$, construct the lth mean force set D_{m,l} | |

(b) Train the lth second layer model P2_{l} using D_{m,l} | |

(c) Compute the validation error of model P2_{l} using all data points from the excluded set SD_{l}, and denote the error as E_{l} | |

7. | Return cross-validation score $1K\u2211k=1KEk$ |

### E. Stratified sampling

Another crucial factor that impacts the overall performance of our machine learning model is the distribution of the training data. As our training data are obtained from extensive MD simulations, they are distributed according to the Boltzmann distribution in the molecule configuration space. If a small batch of data is randomly sampled from the whole dataset, the large majority of the data will reside in low free energy regions, while data in high free energy regions, such as transition barriers, are underrepresented. Figure 2(a) shows that, in the case of alanine dipeptide, most of the data in a small batch of randomly selected points are located in the free energy minima on the left side of the (*ϕ*, *ψ*) dihedral angle space. If batches from this biased distribution are used in the ensemble learning, the errors for predicting the PMF in high free energy regions would be very large because the models will not be trained efficiently in these sparse data regions.

In order to solve this issue, we sample the data for the batches uniformly in the (*ϕ*, *ψ*) dihedral angles space of alanine dipeptide, as shown in Fig. 2(b). In this way, all relevant regions in the configurational space are equally represented in the training set, including transition states. The advantage of this strategic sampling is illustrated in more details in Sec. III.

### F. Simulating the CG-GDML model

After training the 2-layer GDML model, we use an over-damped Langevin dynamics to generate a trajectory and sample the CG potential *U*(**x**; *θ*),

where **x**_{t} (**x**_{t + τ}) is the CG configuration at time *t* (*t* + *τ*), *τ* is the time step, *D* is the diffusion constant, and *ξ* is a vector of independent Gaussian random variables with zero-mean and identity covariance matrix (Wiener process). To sample the trained potential more efficiently, we generate 100 independent trajectories in parallel, with initial configurations randomly sampled from the original dataset.

### G. Including physical constraints

When an over-damped Langevin dynamics [Eq. (13)] is used to generate a trajectory with a CG potential trained on a finite dataset, one undesired situation may happen: since the dynamics is stochastic, there is a chance that the simulated CG trajectory may diffuse away from the domain of the data used in the training, generating unphysical configurations. For example, the stretching of a bond too far away from the equilibrium distance is associated with a very high energy cost and is never observed in the simulation with a force field at finite temperature. In simulation with a machine-learned CG potential, there is no mechanism for preventing such an unphysical bond-stretching. Similar to what we proposed in our recent work,^{59} this problem can be solved by including a prior potential energy *U*_{prior}(**x**) incorporating physical prior knowledge on the system,

where *U*_{prior}(**x**) has harmonic terms modeling bond and angle stretching, with parameters extracted from the training data by Boltzmann inversion. *U*_{diff}(**x**; ** θ**) is the difference between the total CG potential and

*U*

_{prior}(

**x**). The forces obey a similar relation,

so the loss function of the model becomes

Different from what was done in a neural network model,^{59} the prior potential is not added directly to the trained model: the prior forces are first evaluated and subtracted from the all-atom forces and the GDML is trained over this force difference. Once the model is trained, the total energy (and forces) is obtained by adding back the prior energy (and forces) to the one obtained from the trained model.

## III. RESULTS

We illustrate the results of the approach discussed above on a simple molecular system, namely, the coarse-graining of the alanine-dipeptide molecule from the atomistic model in explicit water into a 6-bead CG model. The all-atom model of alanine dipeptide consists of 22 atoms and 651 water molecules for a total of a few thousand degrees of freedom. As illustrated in Fig. 3, for the CG representation, we select the five central backbone atoms of the molecule, with additionally a sixth atom to break the symmetry and differentiate right- or left-handed representations. The overall pipeline for the coarse graining and the training procedure that is discussed below are also summarized in Fig. 3.

We compute the free energy of the alanine dipeptide as a function of the two dihedral angles *ϕ*, *ψ*, where *ϕ* is defined by atoms 1, 2, 3, 5, and *ψ* by atoms 2, 3, 5, 6 (see Fig. 3). As shown in Fig. 4(a), there are six metastable states in the free energy landscape of the all-atom model of alanine dipeptide. Figure 4(b) shows that the final 2-layer GDML CG model correctly reproduces the free energy landscape of alanine dipeptide: the free energy obtained from the trajectories [generated by numerical integration of Eq. (13)] of the CG model also exhibits six minima, with depths close to the ones of the corresponding minima in the all-atom model. Representative configurations from the six metastable states are shown in Fig. 4(a) for the all-atom (CPK representation) and the CG model (thick bond representation). Moreover, as shown in the supplementary material, Fig. S1, the bond and angle distribution from the CG simulation are also consistent with the all atom simulation.

The GDML model shown in Fig. 4(b) is optimized based on the minimum cross-validation error over a two-dimensional grid, spanned by the parameters *σ*_{1} and *σ*_{2}, which are the kernel widths for the first and the second layer models. We find that the values (*σ*_{1}, *σ*_{2}) = (100, 10) give the smallest cross-validation error. Details on the cross-validation search can be found in the supplementary material, Fig. S4.

Figure 4(c) reports the free energy landscape corresponding to a CG model obtained with a 2-layer GDML but where the selection of the data for the sub-model is performed according to the Boltzmann distribution (that is, uniform sampling along the MD trajectory) instead of the stratified sampling scheme discussed above (uniform sampling in the *ϕ*, *ψ* space). While the free energy around the region of the deepest free energy minima in the *ϕ*, *ψ* space is quite accurate, the lowly populated metastable state [indicated as state 3 in Fig. 4(a)] is completely missing in Fig. 4(c), because of the scarcity of training points in this region.

As a comparison, Fig. 4(d) shows the results when a single-layer GDML model is trained on only 2000 points. Although this model identifies the general location of the metastable states, the free energy landscape is significantly distorted with respect to the all-atom one. This poor reconstruction performance is due to the limited size of the training set, which is not extensive enough to enable a stable estimate of the expected forces for the reduced representation of the input. We also trained a single-layer GDML model on 5000 points. As shown in Fig. 4(e), the free energy of this model presents a slight improvement with respect to Fig. 4(d) because of the increased number of training points. However, the overall quality is still low compared to the atomistic model. We expect the reconstructed free energy to improve further if we trained a model using much more data, but this is hindered by the memory requirement: it requires about 160 GB memory to train a model with 5000 points, which is almost at the upper limit of our computational ability.

To quantify the performance of the different approaches, we compute the mean square error (MSE) of the free energy difference of the different CG models compared to the atomistic model [Fig. 4(a), Table I, see Ref. 59 for details]. As expected, the 2-layer GDML model has the smallest free energy MSE, which is about $0.363\xb10.112\u2009(kBT)2$, when it is trained with all 1000 batches. The single layer GDML gives the largest free energy difference [$2.947\u2009(kBT)2$ if trained with 2000 points and $1.641\u2009(kBT)2$ if trained with 5000 points]. If no stratified sampling is used, the free energy difference is $0.861\u2009(kBT)2$, and most of this value is due to the discrepancies in the free energy *ϕ* > 0 region.

Model . | 100 k . | 1000 k . | 2 k . | 5 k . |
---|---|---|---|---|

CGnet | 1.982 ± 0.181 | 0.475 ± 0.103 | … | … |

2-Layer GDML | 0.781 ± 0.154 | 0.363 ± 0.112 | … | … |

Boltz. Samp. 2-Layer GDML | … | 0.861 ± 0.167 | … | … |

1-Layer GDML | … | … | 2.947 ± 0.264 | 1.641 ± 0.243 |

Model . | 100 k . | 1000 k . | 2 k . | 5 k . |
---|---|---|---|---|

CGnet | 1.982 ± 0.181 | 0.475 ± 0.103 | … | … |

2-Layer GDML | 0.781 ± 0.154 | 0.363 ± 0.112 | … | … |

Boltz. Samp. 2-Layer GDML | … | 0.861 ± 0.167 | … | … |

1-Layer GDML | … | … | 2.947 ± 0.264 | 1.641 ± 0.243 |

As a baseline, we also compute the free energy difference obtained by a CG model designed by means of a neural network, CGnet.^{59} Previously, we have applied CGnet to alanine-dipeptide, but it was a model based on a five atom CG scheme, and we included two dihedral angles as input features to break the symmetry. To make the CGnet model consistent with the CG scheme used in this work, we modified it to contain six atoms (as in the GDML model), and no dihedral angles features were included (only distances are used as input). This CGnet model is trained with the same number of points as the GDML model (i.e., 1 000 000 points from 1000 batches). The resulting CGnet free energy MSE is $0.475\xb10.103\u2009(kBT)2$, a value slightly larger than the 2-layer GDML model. This result shows that the accuracy of a kernel approach can indeed be comparable to or even better than a neural network approach on the same system.

We have also investigated the effect of the batch number (or the training set size). We computed the cross-validation error with different training set sizes, from 10 to 1000 batches for the GDML model, or equivalently from 10 000 to 1 000 000 points for CGnet. Figure 5(a) shows that as the batch number increases from 10 to 1000, the cross-validation error for the GDML model drops quickly and reaches convergence with a batch number >600. The cross-validation error for CGnet is significantly larger than that for the GDML model when the number of batches (or, equivalently, the training set size) is small. When the batch number is larger than 200, the cross-validation error for CGnet becomes smaller than that for GDML. Similarly, if we compare the free energy MSEs, as shown in Fig. 5(b), the free energy constructed by GDML with a small training set is significantly better than the corresponding free energy constructed by CGnet. On the other hand, with a large training set, the MSEs are comparable to each other. Typical free energy profiles are shown in Figs. 6(a)–6(d), and their corresponding MSE values are shown Table I. These results show that with enough data, the 2-layer GDML model and CGnet perform similarly well. However, the 2-layer GDML model is more data efficient and has a better ability to extrapolate the force prediction to unsampled configurations, thus outperforming CGnet for small training sets.

## IV. CONCLUSIONS

In this work, we combine the idea of ensemble learning with GDML to apply it to the coarse graining problem. GDML is a kernel method to learn molecular force fields from data and allows us to model nonlinearity and multi-body effects without the need of providing a functional form for the potential. The GMDL approach was originally proposed to learn molecular forces from quantum simulation data. When quantum calculations are used, the error on the force matching loss could, in principle, be zero, and a few thousand points are enough to construct and build an accurate, smooth, and conserved force field. However, when applied to coarse-graining, the force matching loss contains a nonzero term due to the dimensionality reduction and the learning problem becomes very noisy. For this reason, a lot more data points are needed from atomistic simulations to learn a CG potential of mean force. The large amount of input data would presently hinder the application of GDML to the CG problem. In order to circumvent this problem, we use ensemble learning. The basic idea consists in breaking down the learning problem into small batches, which can be more easily solved, and combining the resulting different models into a final solution. Following this approach, we do not train one single GDML model using all the data but propose a 2-layer training scheme: in the first layer, we generate *N* data batches, each containing a number of points far below the GDML limit. *N* models are trained on the different batches and are combined into a final model by taking the average. We show that the prediction of the CG 2-layer model accurately reproduces the thermodynamics of the atomistic model.

Consistent with previous work,^{59} we show that, when applying machine learning methods to design force fields for molecular systems, the addition of physical constraints enforces proper asymptotic of the model. In the design of CG potentials, physical constraints can be introduced by means of a prior potential energy term that prevent the appearance of spurious effects in non-physical regions of the configurational landscape.

A good GDML model should be able to construct a smooth and globally connected conserved force field. However, when the 2-layer approach is used, some of the molecular configurations with high free energy are poorly sampled in the training set, introducing large errors in the resulting model. In order to solve this problem, we sample the data uniformly in the low dimensional space defined by two collective coordinates rather than uniformly from the simulation time series. In the example of alanine dipeptide discussed here, the dihedral angles *ϕ*, *ψ* are chosen as collective coordinates.

In our previous work, we proposed CGnet,^{59} a neural network approach to design CG models. The overall free energy reconstruction obtained with the GDML model is comparably accurate as what was obtained with CGnet when the training set size is sufficiently large. However, the GDML model is significantly more accurate when the training set size is small, indicating that a kernel approach is data-efficient and could, in principle, provide more accurate CG models especially with small training sets.

However, there are still several challenges in order to apply GDML for the coarse-graining of macromolecular systems. In larger systems, a more general definition is needed for the collective coordinates defining the low dimensional space for the uniform sampling of the training batches. These collective coordinates could, in principle, be extracted from the trajectory data,^{77,78} for instance, by means of time-lagged Independent Component Analysis (tICA),^{79–83} kernel PCA,^{84–86} or diffusion maps.^{87}

The decomposition of the large input dataset into an ensemble of small batches has been used here to solve memory issues when training a GDML model. However, the computation is still expensive and we expect it to become even more expensive as the size of the molecular system increases. As the number of data batches and batch size grow, the Nyström approximation of the kernel or other numerical approaches may be a promising solution to increase the computational efficiency.

As for the neural network model, the GDML model trained by force matching can capture the thermodynamics of the system, but there is no guarantee that the dynamics is also preserved. Alternative approaches need to be defined to solve this problem.^{88}

Finally, the GDML model presented here is trained on a specific molecule, and it is not directly transferable to different systems. Ultimately, a transferable CG model would be needed for the general application to large systems that cannot be simulated by atomistic simulations. The trade-off between accuracy and transferability in CG models is an open research question that we will investigate in future work.

## SUPPLEMENTARY MATERIAL

See the supplementary material for more details about the hyperparameter search, a discussion on the prior energy, and more information on the descriptors used in the GDML.

## AUTHOR’S CONTRIBUTIONS

J.W. and S.C. contributed equally to this manuscript.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## ACKNOWLEDGMENTS

We thank Eugen Hruska and Feliks Nüske for comments on the manuscript. This work was supported by the National Science Foundation (Grant Nos. CHE-1738990, CHE-1900374, and PHY-1427654), the Welch Foundation (Grant No. C-1570), the MATH + excellence cluster (AA1-6, EF1-2), the Deutsche Forschungsgemeinschaft (Grant No. SFB 1114/A04), the European Commission (Grant No. ERC CoG 772230 “ScaleCell”), and the Einstein Foundation Berlin (Einstein Visiting Fellowship to CC). Simulations were performed on the computer clusters of the Center for Research Computing at Rice University, supported, in part, by the Big-Data Private-Cloud Research Cyberinfrastructure MRI-award (NSF Grant No. CNS-1338099), and on the clusters of the Department of Mathematics and Computer Science at Freie Universität, Berlin. K.-R.M. acknowledges partial financial support by the German Ministry for Education and Research (BMBF) under Grant Nos. 01IS14013A-E, 01IS18025A, 01IS18037A, 01GQ1115, and 01GQ0850; Deutsche Forschungsgemeinschaft (DFG) under Grant Math+, EXC 2046/1 (Project No. 390685689); and the Technology Promotion (IITP) grant funded by the Korea government (Grant Nos. 2017-0-00451 and 2017-0-01779). S.C. acknowledges the BASLEARN - TU Berlin/BASF Joint Laboratory, co-financed by TU Berlin and BASF SE.