Selection of appropriate collective variables (CVs) for enhancing the sampling of molecular simulations remains an unsolved problem in computational modeling. Picking initial CVs is particularly challenging in higher dimensions: which atomic coordinates, or transforms thereof, from a list of thousands should one pick for enhanced sampling runs? How does a modeler even begin to choose starting coordinates for investigation? This remains true even in the case of simple two-state systems and only increases in difficulty for multi-state systems. In this work, we solve the "initial" CV problem using a data-driven approach inspired by the field of supervised machine learning (SML). In particular, we show how the decision functions of SML algorithms can be used as initial CVs (SMLcv) for accelerated sampling. Using solvated alanine dipeptide and the Chignolin mini-protein as our test cases, we illustrate how the distance to the support vector machine's decision hyperplane, the output probability estimates from logistic regression, the outputs from shallow or deep neural network classifiers, and other classifier outputs may be used to reversibly sample slow structural transitions. We also discuss the utility of other SML algorithms that might be useful for identifying CVs for accelerating molecular simulations.

Efficient configuration space sampling of proteins and other complex physical systems remains an open challenge in computational biophysics. Despite advances in MD code bases,1 hardware, and algorithms,2 routine access to microsecond-to-millisecond time scale events remains impossible for all but a few groups.3–7

As an alternative to unbiased molecular simulations, enhanced sampling methods such as metadynamics8–11 or umbrella sampling12,13 offer promise. However, these require identification of a set of slow collective variables (CVs) to sample along. Recently, we14–17 proposed using the time-structure based independent component analysis (tICA) method14,15,18–20 from the Markov modeling literature or the variational auto-encoder method16,17 from the machine learning (ML) literature for identifying optimal linear and non-linear low-dimensional CVs for accelerated sampling. However, these methods are intended for a wild-type simulation that has already been sufficiently sampled. While we showed14 that full phase space convergence was not necessary for approximating such CVs, the models, and thus the CVs, became more robust as more data were input.

Alternatively, other groups21,22 have proposed using spectral gap or re-weighting based approaches to identify appropriate CVs in an iterative fashion. However, these methods require good initial CVs, since poor CVs might inadvertently introduce orthogonal modes, thereby slowing convergence. In this work, we wish to answer the question: How do we select initial collective variables? There is obviously no correct a priori answer to this problem. Arguably, an effective starting CV should be continuously defined and differentiable, low-dimensional, and capable of separating out target states of interest while minimizing the number of orthogonal degrees of freedom.23 Previous approaches for protein systems have included expertly chosen distances or dihedrals,24 path based CVs,25,26 and more generic CVs such as alpha-helix or beta-sheet character. Similarly, the use of ML based methods for finding CVs is not new. For example, several groups have proposed linear or non-linear dimensionality reduction based methods8,10,27 to find low-dimensional CVs. In contrast to such unsupervised learning, supervised machine learning (SML) can use the state labels, enabling efficient metric learning for distinguishing between the start and end states.

Therefore, in this paper, we propose approaching the CV selection problem using an automated data driven scheme. For molecular systems where the end states are known, we re-cast the problem as a supervised machine learning (SML) problem. SML algorithms learn a mapping between a high dimensional input and an output; they are routinely used in ML to, say, distinguish pictures of cats from dogs. Analogously, we first show how various classification algorithms28 from the ML literature can learn a decision boundary (Fig. 1) separating protein states and how this enables us to define CVs (SMLcv) for enhanced sampling. We first demonstrate this using a support vector machine (SVM), showing how a configuration's distance to the decision boundary (SVMcv) can be used as a CV. Second, we demonstrate the same principle using a logistic regression (LR) classifier, whose predicted state probabilities (LRcv) can be used as a CV. Third, we show how neural network (NN) based classifiers produce un-normalized state probabilities (NNcv) that can be used as CVs. Fourth, we show how to include data from multiple states by combining multiclass classification with multidimensional enhanced sampling. We end the paper by sampling the folding landscape of the Chignolin mini-protein using the SVMcv.

FIG. 1.

Pictorial representation of the proposed method and two common classification algorithms. (a) Step-wise process for generating a collective variable (CV) using supervised machine learning (SML) on known end states. (b) In support vector machines (SVMs), the algorithm finds a dividing hyperplane boundary between the two states of interest. The signed distance of any point to this plane becomes the collective variable. (c) By contrast, logistic regression (LR) classifiers model the output probability as a sigmoid transform of the weighted input features. In this instance, the predicted probability becomes the collective variable. (d) In the case of the neural network, the output of the network is the un-normalized probability of being in either state. In this instance, the decision function is a highly non-linear function that maps the input frame X to the un-normalized output probability. Light gray circles indicate hidden layers/nodes.

All of our methods assume that the modeler has access to short trajectories from the known start and end states. After running short dynamics (on the order of ns to μs for protein systems) in the start (state A) and end (state B) states [Figs. 1(b) and 1(c)], we train a classifier to distinguish between the two states. To do this, we first project the trajectories onto a complete list of possible order parameters/features. For proteins, this list might include all backbone and side chain dihedrals, alpha carbon pseudo-dihedrals, contact distances, etc. An attractive advantage of using SML algorithms to build these CVs is that most of these algorithms can be explicitly regularized (via L1 regularization, for instance) to reduce model complexity. For example, if a certain dihedral or distance is incapable of distinguishing between the start and end states, then the L1 penalty will allow the SML algorithm to discard this feature; more concretely, the model's coefficient (see below) for this feature is set to 0. In the examples below, we used L1 regularization for the alanine dipeptide examples and L2 regularization for the Chignolin example. In contrast to L1, L2 regularization shrinks all of the model coefficients toward zero without setting any of them exactly to zero. A second benefit comes from realizing that the deposited metadynamics bias will simultaneously accelerate all included degrees of freedom [Figs. 2(b), 4(b), and 5(b)], which can accelerate convergence and allow for more aggressive biasing schemes. Again, more concretely, if getting from protein state A to state B requires both a dihedral and an interatomic distance to change, then using the SMLcv allows us to add external biasing forces to both of these features through a single collective variable. With traditional methods, the modeler would either run multi-dimensional enhanced sampling or judiciously pick one feature and hope for the best; multi-dimensional sampling cannot scale to more than 3 dimensions, while picking and hoping scales even worse. Although there are a wide variety of possible SML algorithms, we focus on three of the most common classification methods, namely, SVM, LR, and NN classifiers.
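To make this workflow concrete, the following is a minimal sketch of the featurize-label-fit pipeline, assuming two short labeled trajectories are on disk; the file names and the dihedral-only featurization are illustrative rather than our exact setup.

```python
# Minimal SMLcv workflow sketch: load labeled end-state trajectories, featurize,
# and fit an L1-regularized classifier. File names are hypothetical.
import mdtraj as md
import numpy as np
from sklearn.linear_model import LogisticRegression

traj_a = md.load("state_A.xtc", top="system.pdb")  # short run in the start state
traj_b = md.load("state_B.xtc", top="system.pdb")  # short run in the end state

def featurize(traj):
    """Project frames onto sin/cos transforms of the backbone dihedrals."""
    _, phi = md.compute_phi(traj)
    _, psi = md.compute_psi(traj)
    angles = np.concatenate([phi, psi], axis=1)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

X = np.concatenate([featurize(traj_a), featurize(traj_b)])
y = np.concatenate([np.zeros(len(traj_a)), np.ones(len(traj_b))])

# The L1 penalty zeroes out coefficients of features that do not help
# distinguish the two states.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
print("non-zero coefficients:", np.count_nonzero(clf.coef_))
```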

FIG. 2.

Results from SVMcv based sampling. (a) The two 2 ns training trajectories projected onto the Ramachandran plot. Note that no transition was observed from the β to the αL basin. The training frames are colored according to the frame's distance to the SVM's decision boundary in 4-dimensional feature space. The contours show the decision boundary for each region. (b) Decomposing the SVM's coefficient vector into the original training features shows that the model assigns higher weights to the ϕ features than the ψ features. The L1 regularization drops the sine transform of the ψ backbone dihedral to induce sparsity. (c) Running well-tempered metadynamics simulations along the SVMcv efficiently samples the β to αL transition repeatedly. (d) Reweighting allows us to recover the full free energy surface along the Ramachandran plot.

Support Vector Machines. SVMs are linear classifiers that find the separating hyperplane that maximizes the margin, i.e., the distance to the closest points of either class [Fig. 1(b)]. Mathematically, they predict the output label as follows:

$$y = \mathbb{1}\left[w^{T}X + b > 0\right],$$

where y is the output label, X is the high dimensional input vector, $\mathbb{1}$ is the indicator function, w is the learnt vector of coefficients, and b is the scalar bias. The SVM optimization procedure is outside the scope of this paper, but it is worth noting that these models can be easily fitted via API calls to modern ML software such as scikit-learn.29 The accompanying online IPython notebooks30 provide examples of how these models can be constructed.
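For instance, a linear SVM can be fitted on the featurized, labeled frames X, y from the earlier sketch in a few lines; the regularization settings are illustrative.

```python
# Sketch: fit a linear SVM on the featurized frames; reuses X, y from above.
from sklearn.svm import LinearSVC

svm = LinearSVC(penalty="l1", dual=False, C=1.0).fit(X, y)
w = svm.coef_.ravel()                  # learnt hyperplane normal vector
b = svm.intercept_[0]                  # learnt scalar bias
y_pred = (X @ w + b > 0).astype(int)   # the indicator-function output above
```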

The direct indicator function based output from an SVM is not differentiable and thus cannot be used as a collective variable for enhanced sampling. The derivative of this output with respect to the input coordinates, which is directly related to the extra forces applied to the atomic positions, is zero everywhere away from the decision boundary, since moving a small amount in any direction does not change the predicted output state. At the decision boundary itself, the output jumps discontinuously, so the derivative rises without bound; this is likely to cause the simulation to crash. However, we can use the signed closest distance to the SVM's hyperplane as our collective variable instead. Intuitively speaking [Fig. 1(b)], a larger distance in either direction means that the current simulation frame is deeper in one basin or the other. Thus our collective variable becomes

$$\mathrm{SVM}_{\mathrm{cv}} = \mathrm{distance}\left(\mathrm{point}, \mathrm{plane}\right) = \frac{w^{T}X + b}{\lVert w \rVert_{2}},$$

where we divide by $\lVert w \rVert_{2}$ because the weight vector is, in general, not normalized. It is also possible to simply use the numerator, also called the decision value or the "softness" field in glass dynamics,31 as the collective variable.
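Continuing the sketch above, the SVMcv for every frame is then a one-line transform of the fitted model:

```python
# Sketch: the signed distance to the SVM hyperplane as the collective variable.
import numpy as np

svm_cv = (X @ w + b) / np.linalg.norm(w)
# scikit-learn's decision_function returns the un-normalized numerator:
assert np.allclose(svm.decision_function(X), X @ w + b)
```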

Logistic Regression. The LR classifier [Fig. 1(c)] models the probability that a configuration belongs to a given state as a sigmoid transform of the weighted input features,

$$P(y = 1 \mid x) = \sigma\left(w^{T}X + b\right),$$

where σ is a sigmoid function,

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

While SVMs maximize the margin between the two classes, LR algorithms maximize the conditional likelihood of the observed class labels. Unlike with SVMs, we can now directly use the differentiable probability output from the model as our collective variable for metadynamics or other enhanced sampling algorithms,

$$\mathrm{LR}_{\mathrm{CV}} = P(y = 1 \mid x) = \sigma\left(w^{T}X + b\right).$$

While we use the probability of being in state 1 as our collective variable here, we can also use the odds of being in either state as a collective variable. In this instance, the CV becomes

$$\mathrm{LR}_{\mathrm{CV}} = \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}.$$
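Given the fitted scikit-learn model (the clf object from the earlier sketch), both LR-based CVs are one-liners:

```python
# Sketch: probability and odds outputs of the fitted LR model as CVs.
p = clf.predict_proba(X)[:, 1]   # P(y = 1 | x), the LRcv
odds = p / (1.0 - p)             # the alternative odds-based CV
```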

It is entirely possible that the linear models presented above are inadequate for separating the training simulations given the feature space. This can be diagnosed via hysteresis during the enhanced sampling simulations. In this case, we recommend two possible extensions. (1) Use a kernel function32,33 to implicitly encode the input feature vector in a high dimensional space. However, it is worth noting that kernel functions often require picking landmark19,32,34 protein configurations for efficient prediction and for preventing over-fitting. (2) Alternatively, non-linearity can be achieved via shallow or deep neural networks.15 Neural networks (NNs) are universal function approximators. Typical NNs consist of a series of convolutional and/or fully connected affine transformation layers [Fig. 1(d)] interspersed with a non-linear activation function, such as the sigmoid, ReLU, or Swish.35 Previous studies have already highlighted the expressive power of neural networks for dimensionality reduction16,36 and sampling.17,37 Here we argue that, given some labeled state/trajectory data, the un-normalized output from these networks can be used as a set of differentiable collective variables for accelerating molecular simulations,

$$\mathrm{NN}_{\mathrm{CV}} = \mathrm{Unnormalized\ Neural\ Network\ Output} = G(X),$$

where G(X) is the non-linear transformation learnt by the NN to accurately separate the training examples. However, we caution users against constructing arbitrarily deep networks since they might be difficult to train or make the force calculations computationally expensive, thereby negating the speed ups from accelerated sampling.
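As an illustration, a small PyTorch classifier with Swish activations, whose raw logits can serve as the NNcv, might look as follows; the layer sizes here are illustrative and not the exact architecture used later in this work.

```python
# Sketch: a small feed-forward classifier whose un-normalized outputs (logits)
# serve as CVs. Swish(x) = x * sigmoid(x).
import torch
import torch.nn as nn

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)

net = nn.Sequential(
    nn.Linear(4, 16), Swish(),
    nn.Linear(16, 16), Swish(),
    nn.Linear(16, 2),   # two un-normalized state probabilities (logits)
)

logits = net(torch.as_tensor(X, dtype=torch.float32))
nn_cv = logits[:, 1]    # G(X): the un-normalized output for state 1
```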

Up to now, we have only considered the problem of generating a collective variable for a two-state system. However, most large biophysical3,4 and non-biological systems have multiple metastable states. In several instances, these states are known beforehand from crystallography or NMR data, but the relative free energies or kinetics of inter-conversion between those states are not. To incorporate these states into our model, we recommend building a multiclass classification28 model. Given N states, the output from our SML algorithm is now an N-dimensional probability vector, and we can use the multiple outputs as N individual coordinates in a bias-exchange simulation38 or accelerate them simultaneously using multi-dimensional Gaussians (if the number of states is <3). While certain algorithms, such as neural networks, naturally allow for prediction of multiple outputs, other SML algorithms adopt a "one-vs-rest" (OVR) strategy. In OVR, we build a series of models, one per output state, such that each model learns a hyper-boundary or decision function that separates its target state from all the other states. Thus we get N sub-models for the N states, and given a new data point, we compute its distance to each of those hyperplanes and assign it to the closest state. Instead of using them for prediction, however, we can use the same set of decision functions as individual collective variables in an enhanced sampling simulation. For example, in the case of a 3-state SVM, the 3 collective variables would be

$$\begin{aligned}
CV_{1} &= \text{State 1 vs rest} = \mathrm{distance}\left(\mathrm{point},\ \text{State 1 hyperplane}\right) = \frac{w_{1}^{T}X + b_{1}}{\lVert w_{1} \rVert_{2}},\\
CV_{2} &= \text{State 2 vs rest} = \mathrm{distance}\left(\mathrm{point},\ \text{State 2 hyperplane}\right) = \frac{w_{2}^{T}X + b_{2}}{\lVert w_{2} \rVert_{2}},\\
CV_{3} &= \text{State 3 vs rest} = \mathrm{distance}\left(\mathrm{point},\ \text{State 3 hyperplane}\right) = \frac{w_{3}^{T}X + b_{3}}{\lVert w_{3} \rVert_{2}},
\end{aligned}$$

where w1, w2, and w3 and b1, b2, and b3 are the respective weight vectors and biasing scalars for the hyperplanes that separate their respective states from the rest of the ensemble. Similar expressions can be derived for other SML methods such as logistic regression classifiers. On the other hand, since NNs can directly output an un-normalized probability vector of length equal to the number of states, there is no need to use OVR based strategies.
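A minimal sketch of this multiclass construction, assuming featurized frames X3 with integer labels y3 (0, 1, 2) for the three states; scikit-learn's LinearSVC uses the one-vs-rest scheme by default, so coef_ holds one hyperplane per state.

```python
# Sketch: three one-vs-rest SVM decision functions used as three CVs.
import numpy as np
from sklearn.svm import LinearSVC

svm3 = LinearSVC(penalty="l1", dual=False, C=1.0).fit(X3, y3)
norms = np.linalg.norm(svm3.coef_, axis=1)           # one norm per hyperplane
cvs = (X3 @ svm3.coef_.T + svm3.intercept_) / norms  # shape (n_frames, 3)
```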

It is worth recognizing that using these SML based CVs would seem to require existing enhanced sampling software to have native machine learning libraries built in so that the collective variable definitions proposed above can be evaluated. However, at their core, these machine learning models are nothing more than a series of vector operations whose compute graph can easily be broken down and implemented in a stepwise fashion in most modern enhanced sampling engines. To highlight this, our open source implementation linked at the end of the manuscript provides examples of how the SVM, multiclass SVM, LR, or even NN models can be written as a series of custom scripts such that enhanced sampling packages such as Plumed and OpenMM1,39 can interpret and utilize the mathematical transforms that connect the current simulation frame's atomic coordinates to a set of scalars, i.e., the collective variables.
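For instance, a fitted linear model can be exported as a Plumed input in which each training feature is rebuilt as a MATHEVAL transform and the decision function as a COMBINE. This is a sketch: the torsion atom indices assume the standard alanine dipeptide topology, and the feature order (t1-t4) must match the order used during training.

```python
# Sketch: export the fitted linear SVM (w from the earlier sketch) as a PLUMED
# input string. Atom indices are placeholders for a standard alanine dipeptide.
import numpy as np

coeffs = ",".join(f"{c:.6f}" for c in w / np.linalg.norm(w))
plumed_input = f"""
phi: TORSION ATOMS=5,7,9,15
psi: TORSION ATOMS=7,9,15,17
t1: MATHEVAL ARG=phi FUNC=sin(x) PERIODIC=NO
t2: MATHEVAL ARG=psi FUNC=sin(x) PERIODIC=NO
t3: MATHEVAL ARG=phi FUNC=cos(x) PERIODIC=NO
t4: MATHEVAL ARG=psi FUNC=cos(x) PERIODIC=NO
svmcv: COMBINE ARG=t1,t2,t3,t4 COEFFICIENTS={coeffs} PERIODIC=NO
"""
# The scalar bias b only shifts the CV by a constant, so it can be dropped here.
```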

To prevent overfitting, we also recommend k-fold (k = 3-10) cross-validation to identify optimal model hyper-parameters. In cross-validation, the model is trained on a subset of the data (the training data), and its accuracy is scored on a held out set (validation data). A range of models with varying hyper-parameters and model complexity are built, and parameters for the best scoring model are saved. The best model can then be retrained on the full training and validation sets before reporting its performance on a held out test set. Since the “test” for our model is its ability to accelerate sampling, we chose to simply do 3-fold cross-validation without a held out test set. The supplementary material contains validation set accuracies for a range of different models and systems. In general, across several reasonable parameter values (Figs. 3–5 of the supplementary material), our models’ accuracies never drop below 0.92 (92%) for either example. For alanine (Figs. 3 and 4 of the supplementary material), all models reported accuracies of 100% because the alpha and beta regions are easily linearly and non-linearly separable. As specified above, after cross-validation, we re-trained the models on the full train and validation dataset.
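A minimal sketch of this hyper-parameter search with scikit-learn (the parameter grid is illustrative):

```python
# Sketch: 3-fold cross-validation over the regularization strength C.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

search = GridSearchCV(
    LinearSVC(penalty="l1", dual=False),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=3,   # 3-fold cross-validation, no held-out test set
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
# With refit=True (the default), the best model is re-trained on all of X, y.
```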

We showcase our methods for three different classifiers on solvated alanine dipeptide. The training simulations were previously generated14 and consisted of two 2 ns trajectories starting from the β and αL basins of the Ramachandran plot (Fig. 2). Similar to previous work, we used the sin-cosine transform of the backbone ϕ and ψ dihedrals19,40 as input to our models; thus each training simulation frame was represented by 4 numbers. The control simulations (Figs. 1 and 2 of the supplementary material) were accelerated only along the backbone ϕ or ψ dihedral. All models were trained using scikit-learn29 or PyTorch.41 For the SVM and LR models, we performed 3-fold cross-validation to determine the best hyper-parameters and regularization strengths (Figs. 3 and 4 of the supplementary material) but found that our dataset was simple enough that a large range of hyper-parameter settings gave very similar results. Ultimately, we picked reasonable model parameters (Figs. 3 and 4 of the supplementary material) and then re-trained the model on the entire available 4 ns of data. For the SVM and LR models, we used L1 regularization with a regularization strength (C-parameter) of 1.0. The supplementary material contains the full model parameters, and the online Github repository contains the fitted models. The model was trained once and kept fixed throughout the simulation. The pre-fitted models were written as custom scripts to perform metadynamics using Plumed39 and OpenMM.1 The metadynamics simulations were well tempered42 with the parameters shown in Table 1 of the supplementary material. The simulations were started from the β basin, run for a total length of 45 ns apiece, and saved every 10 ps. All other simulation parameters were kept the same as before.14 The trajectories were analyzed and plotted using standard Python libraries,19,29,30,43 while reweighting was done using the time-independent estimator of Tiwary.11

The results from our SVMcv simulations are shown in Fig. 2. Based upon the 4 ns of training data [Fig. 2(a)], the SVM model is able to find a hyperplane separating the β and αL regions [Fig. 2(a)]. The distance to this hyperplane [Fig. 2(a), color bar] becomes our CV for accelerating sampling via metadynamics [Fig. 2(c)]. Over a 45 ns run, we sample the slowest transition, β to αL, tens of times, allowing for robust estimation of the free energy surface via reweighting [Fig. 2(d)]. Additionally, since the simulations were converged, we are able to efficiently sample the faster orthogonal ψ degree of freedom as well. Similar to the SVM, we also used logistic regression (LR) classifiers to define CVs (LRcv) for accelerated simulations; Note 1 and Fig. 6 of the supplementary material contain details on training and using these models. We recommend that modelers start with these simpler linear models before moving to the more complex neural network based approaches shown in the following section.

We next looked into using neural networks to find non-linear boundaries between the start and end states. To show this, we trained a simple 5-layered neural network and used the resulting model as a collective variable for enhanced sampling via Plumed and OpenMM.1,39 The layers alternated [Fig. 3(b)] between fully connected affine transformations and non-linear activation functions. For the non-linear activation layers, we used the recently developed Swish35 function. We chose this combination somewhat arbitrarily, though it is possible to systematically cross-validate network architecture choices by maximizing the model's performance on the validation set.

FIG. 3.

Results from neural network collective variable (NNcv) based sampling. (a) The training and test trajectories projected onto the Ramachandran plot. Note that no transition was observed from the β to the αL basin. The frames are colored according to the un-normalized output of the NN for the frame being in the αL state. The contours show the decision boundary for each region. (b) Architecture of the neural network used in this study. We did not optimize the architecture but did perform a 50-50 training and test split. The two outputs correspond to the un-normalized probability of being in the β or αL basin. Either of the outputs (or both) can be used as a collective variable; in this instance, we used the un-normalized output for the αL state. (c) Running well-tempered metadynamics simulations along the NNcv efficiently samples the β to αL transition repeatedly. (d) Reweighting allows us to recover the full free energy surface along the Ramachandran plot.

We trained the model [Fig. 3(b)] on half of the 4 ns alanine dipeptide dataset using PyTorch. The output layer of this model gives two values, which are the un-normalized probabilities of the input frame being in either state. Either (or both) of these output nodes can be used as a collective variable since, once normalized, they sum to 1. To train the model, we minimized the cross-entropy between the model's output and the training labels. This was done using the Adam optimizer44 with an initial learning rate of 0.1. We trained the model for 1 epoch using a batch size of 32. The optimization was performed using PyTorch on the CPU platform, and training took less than a minute. After 1 epoch, our model's training error was below 0.1, and the model reported 100% accuracy on a held out 2 ns test set; we therefore stopped training at that point and used this model going forward. Once the model had been trained [Fig. 3(a)], we wrote custom string expressions to convert the PyTorch neural network model into a format that Plumed can process. Again, the metadynamics simulations were well tempered42 with the parameters shown in Table 1 of the supplementary material. The online Github repo provides examples of how such models can be trained and transferred to Plumed.
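A sketch of that training loop, reusing the net from the earlier snippet and assuming X_train (a float tensor) and y_train (integer class labels) have already been prepared:

```python
# Sketch: one epoch of cross-entropy training with Adam (lr = 0.1, batch = 32).
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(net.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()   # takes raw logits; normalizes internally

for start in range(0, len(X_train), 32):
    xb = X_train[start:start + 32]
    yb = y_train[start:start + 32]   # integer labels (torch.long)
    optimizer.zero_grad()
    loss = loss_fn(net(xb), yb)
    loss.backward()
    optimizer.step()
```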

The results are shown in Figs. 3(c) and 3(d); running metadynamics simulations along the output probability, NNcv, allows us to observe >15 transitions [Fig. 3(c)] along alanine's slower ϕ coordinate in just 45 ns of sampling. Similar to the SVM model, the robust transition statistics combined with the reweighting method of Tiwary11 allow us to recover the free-energy landscape [Fig. 3(d)] across other coordinates of interest.

We next extended our method to the multiple state scenario. This is important because most large systems, such as kinases, G-protein coupled receptors (GPCRs), or ion channels, exhibit multi-state behavior, and the states are often available as crystallographic starting points.4 Simulations around these local points can be used to define a set of decision functions that separate each state from the rest. This set of decision functions can then be used in a multi-dimensional enhanced sampling run.

To test the method, we generated three 1 ns trajectories in the β, αL, and αR regions of alanine dipeptide [Fig. 4(a)]. Based on the cross-validation results for the previous 2-state SVM model, we built an L1 regularized 3-state SVM with the regularization strength set to 1.0 and trained it using a squared-hinge loss (see the supplementary material). The model parameterizes 3 hyperplanes, and the distance to each of those hyperplanes was used in a 3-dimensional metadynamics simulation. It is trivial to separate these into three one-dimensional metadynamics simulations connected via Hamiltonian replica exchange.38 All metadynamics parameters were the same as before except that we used last-bias reweighting, in place of Tiwary's time-independent estimator, due to numerical stability issues. We limited our simulations to 12 ns because they showed excellent convergence [Fig. 4(c)] well before that.

FIG. 4.

Results from multiclass SVMcv based sampling. (a) The three 1 ns training trajectories projected onto the Ramachandran plot. Note that no transition was observed in either dimension. The contours show the decision boundary for each region. (b) Decomposing the SVM's coefficient vectors into the original training features shows that the model assigns higher weights to the ϕ features than the ψ features for the αL-vs-rest hyperplane but assigns appreciable weights to the ψ dihedrals for the other two. Several coefficients are dropped due to regularization. (c) Running well-tempered metadynamics simulations along the SVMcv efficiently samples the β to αL transition repeatedly. (d) Reweighting allows us to recover the full free energy surface along the Ramachandran plot.

The resulting multiclass SVM model [Fig. 4(a)] learnt a set of dividing boundaries or functions that separate each of the β, αL, and αR regions from the other two. The coefficients for the respective hyperplanes are shown in Fig. 4(b). For example, as intuitively expected, the αL plane [Fig. 4(b), orange] is separated from the rest of the ensemble only via the ϕ dihedral features; in this instance, regularization forces the other two coefficients to 0. By contrast, the αR plane has non-zero weights on both ϕ and ψ features. The simulations quickly reach the diffusive regime [Fig. 4(c)], showing tens of transitions in just 10 ns and highlighting the effectiveness of the multiclass SML decision functions as a set of collective variables. Last-bias reweighting gives us a similar free-energy surface as before [Fig. 4(d)].

To compare our simulations to a baseline, we ran two control 12+ ns simulations (Fig. 1 of the supplementary material) in which we accelerated the dynamics of the ψ or ϕ dihedral alone. ψ is a bad starting CV since the slowest transitions in alanine are dominated by the ϕ dihedral; thus, in our first control simulation, we observed only three transitions along ϕ. By contrast, both variants of the SMLcv were able to observe at least 5 transitions in the same amount of simulation time. Additionally, all simulations, including the controls, were able to drive the slow uphill β to αL transition much more quickly (<5 ns) than the unbiased 43 ns mean first passage time estimate previously reported.45 However, it is worth noting that some of our SML based CVs are not as efficient as simply accelerating the ϕ dihedral directly (Fig. 2 of the supplementary material). This likely indicates that we do not have the most efficient possible CV, but it is entirely possible to turn our initial SML based CV into a better one by combining it with the SGOOP formalism.21 Intriguingly, knowledge about the ϕ dihedral could also potentially be incorporated as a Bayesian prior into the SMLcv formalism.

We next sought to fold the 10 residue Chignolin mini-protein46 using the Amber forcefield,47 the TIP3P48 water model, and vanilla NVT simulations. To this end, we obtained the all-atom Chignolin trajectories from D. E. Shaw Research.46,49 We then sliced 1000 frames each from 2 μs segments of the folded and unfolded states [Fig. 5(a)] (4 μs total), ensuring that no actual folding event was seen in those trajectories. To featurize these 2000 frames, we used a combination of the sin-cosine transforms of the backbone ϕ and ψ dihedrals, α carbon contact distances for residues at least 3 apart in sequence, and the cosine transform of the pseudo-dihedrals formed by four consecutive α carbon atoms. Thus each frame in our simulation was represented using 78 features. These features were normalized to have zero mean and unit variance by applying a standard feature scaling transformation.7,46 This scaling is necessary because most SML algorithms have issues with heterogeneously scaled features, for example, dihedrals ranging from −π to π while distances range up to tens of angstroms. Decision trees and random forests do not require this feature scaling because, unlike SVMs, LR, or NNs, they do not learn a "distance" metric and are invariant to monotonic feature transformations.
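The scaling step is a standard scikit-learn transform; the sketch below assumes that the 2000 × 78 feature matrix X_chig has already been computed:

```python
# Sketch: zero-mean, unit-variance scaling of the Chignolin features.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_chig)
X_scaled = scaler.transform(X_chig)
# The same fitted means and variances must be applied inside the enhanced
# sampling engine so that simulation frames match the training transform.
```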

FIG. 5.

Results from SVMcv based sampling on the Chignolin ten-residue mini-protein. (a) The two 2 μs training trajectories projected onto the learnt decision function. Note that no transition was observed from the folded to the unfolded basin. The protein images on the left show several randomly sampled frames. (b) The SVMcv decision function is a complex linear combination of dihedrals and contacts. Note that we used normalized contacts and dihedrals as inputs to the ML model. (c) One of several folding trajectories obtained from multiple walker well-tempered metadynamics using the SVMcv. The top panel traces the collective variable, while the bottom traces the root mean squared deviation of the alpha carbons to the NMR structure 2RVD.51 (d) Integrating the hills and reweighting11 allow us to construct the free energy profile. The insets show randomly selected unfolded and folded structures in red with the NMR structure in white. This image was generated using IPython notebook,30 MSMBuilder,19 MSMExplorer,43 and VMD.52 

We trained an SVM [Fig. 5(b)] on these 2000 frames (1000 folded and 1000 unfolded), using cross-validated parameters (Fig. 5 of the supplementary material), via scikit-learn.29 For this model, we used L2 regularization with the regularization strength set to 1.0; however, changing the regularization by a factor of 10 in either direction had minimal effect on the final decision boundary (Fig. 5 of the supplementary material). Then, similar to the alanine example, we used the distance to the SVM's hyperplane as our collective variable. In this instance, we started 25 walkers using multiple-walker well-tempered metadynamics.9,42 Each of our walkers was run for 100 ns (∼18 h of parallelized sampling on commodity K40 GPUs), so the total amount of sampling was 2.5 μs. We ran vanilla NVT metadynamics simulations using a Langevin integrator with a 2 fs time step, a friction coefficient of 1 ps⁻¹, and the estimated melting temperature of 340 K.46 The metadynamics simulations were well tempered42 with an initial height of 1 kJ/mol, a bias factor of 8, and a deposition rate of 2 ps. The sigma parameter of 0.2 was chosen based on the training simulation data in the folded state. The simulation frames were saved every 50 ps. See Table 2 of the supplementary material for the rest of the metadynamics parameters.
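For reference, these well-tempered settings correspond to a Plumed metadynamics block along the following lines (a sketch; the svmcv label refers to the exported decision function, and the file layout is illustrative):

```python
# Sketch: PLUMED well-tempered metadynamics block matching the quoted settings
# (height 1 kJ/mol, bias factor 8, 2 ps deposition at a 2 fs step, sigma 0.2).
metad_block = """
METAD ...
  LABEL=metad ARG=svmcv
  HEIGHT=1.0 SIGMA=0.2
  BIASFACTOR=8 TEMP=340
  PACE=1000 FILE=HILLS
... METAD
"""
```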

Over the 100 ns simulations, at least 5 out of the 25 walkers naturally folded to the correct Chignolin topology, with the insets of Figs. 5(c) and 5(d) showing representative traces and structures. The supplementary material movie shows an example of such a folding event. Clearly, our SVMcv is able to efficiently drive the simulations between the folded and unfolded ensembles, since we obtain multiple folding events in a total simulation time comparable to that of the training simulations in a single basin. We also used the time-independent free energy estimator11 to re-weight the simulations [Fig. 5(d)], showing, similar to previous studies,46,50 that the folded state is the dominant free energy state. Unfortunately, the flattened barrier [Fig. 5(d)] likely indicates that the transition-state region was over-filled with bias during the course of our simulations, making kinetic estimates difficult with these simulations.

We believe that the power of our SMLcv method lies in automating the CV selection problem, providing a systematic way to include non-linearity and regularization, and minimizing the work needed to find starting points for enhanced sampling runs. Our results further open up significant opportunities both for benchmarking existing SML algorithms and for developing novel algorithms or loss functions inspired by statistical mechanics. For example, other SML techniques, such as the naïve Bayes classifier, might present interesting starting points for building classification schemes better suited to accelerated sampling. However, not all SML algorithms are amenable to being used as CVs. For example, while decision trees and random forest classifiers53,54 are useful for understanding molecular systems,54 their indicator function based decision functions are not differentiable, making them unsuitable as CVs.

Our proposed method (SMLcv) is intended to produce a first estimate of CVs from very limited data collected by locally sampling each basin and, as such, will likely not generate an ideal CV. In all likelihood, the model might accidentally include orthogonal degrees of freedom in either state, thereby slowing convergence. However, this problem exists for any of the existing methods in the literature.

While we have focused on the pre-training and generation of CVs, we also believe that supervised machine learning based collective variables (SMLcv) might be excellent starting coordinates for further optimization via SGOOP21 or VAC.22 This would turn the entire process into an online learning setup in which each additional round of simulation improves the dividing hyper-boundary. Additionally, these coordinates are likely to be transferable15,17 across related systems, which would make them useful for investigating drug binding/unbinding kinetics, mutational effects, modest forcefield effects, etc. However, when and where such transfer learning might fail remains an unsolved problem. Finally, we hypothesize that automated order parameter identification might make it easier to converge chain-of-states based methods, such as the nudged elastic band or string based methods,25,55–57 since the separating hyperplane can be used to more efficiently guide the iterative minimum free energy path finding algorithms, similar to the kernel-SVM approach of Pozun et al.58 We hope that these results provide stimulating starting points for automatic CV optimization protocols, allowing modelers to focus more on the results of free-energy simulations rather than on their initial design.

See the supplementary material for several additional figures: comparisons of the SMLcv results against hand-picked collective variables for alanine dipeptide, results of 3-fold cross-validation for the given models, and results from using a logistic regression classifier's decision function as a collective variable for accelerated sampling. Due to size restrictions, the supporting videos have been uploaded online: https://www.youtube.com/watch?v=jgor4xbAai8 and https://youtu.be/D1cXDKrHMCA.

The authors would like to thank various members of the Pande lab for useful discussions and feedback on the manuscript. M.M.S. would like to acknowledge support from the National Science Foundation Grant No. NSF-MCB-0954714. This work used the XStream computational resource, supported by the National Science Foundation Major Research Instrumentation program (No. ACI-1429830). The authors would like to thank D. E. Shaw and D. E. Shaw Research (DESRES) for graciously providing the Chignolin folding trajectories. V.S.P. is a consultant and SAB member of Schrodinger, LLC and Globavir, sits on the Board of Directors of Apeel, Inc., Freenome, Inc., Omada Health, Patient Ping, and Rigetti Computing, and is a General Partner at Andreessen Horowitz.

All the models and code needed to reproduce the main results of this paper are available at https://github.com/msultan/SML_CV/.

1. P. Eastman, J. Swails, J. D. Chodera, R. T. McGibbon, Y. Zhao, K. A. Beauchamp, L. P. Wang, A. C. Simmonett, M. P. Harrigan, C. D. Stern et al., "OpenMM 7: Rapid development of high performance algorithms for molecular dynamics," PLoS Comput. Biol. 13(7), e1005659 (2017).
2. G. R. Bowman, V. S. Pande, and F. Noé, An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation (Springer, 2014), Vol. 797.
3. M. M. Sultan, G. Kiss, and V. Pande, "Towards simple kinetic models of functional dynamics for a kinase subfamily," Nat. Chem. (published online).
4. M. M. Sultan, R. A. Denny, R. Unwalla, F. Lovering, and V. S. Pande, "Millisecond dynamics of BTK reveal kinome-wide conformational plasticity within the apo kinase domain," Sci. Rep. 7(1), 15604 (2017).
5. R. O. Dror, R. M. Dirks, J. P. Grossman, H. Xu, and D. E. Shaw, "Biomolecular simulation: A computational microscope for molecular biology," Annu. Rev. Biophys. 41, 429–452 (2012).
6. D.-A. Silva, D. R. Weiss, F. Pardo Avila, L.-T. Da, M. Levitt, D. Wang, and X. Huang, "Millisecond dynamics of RNA polymerase II translocation at atomic resolution," Proc. Natl. Acad. Sci. U. S. A. 111(21), 7665–7670 (2014).
7. D. Shaw, P. Maragakis, and K. Lindorff-Larsen, "Atomic-level characterization of the structural dynamics of proteins," Science 330, 341–347 (2010).
8. A. Laio and F. L. Gervasio, "Metadynamics: A method to simulate rare events and reconstruct the free energy in biophysics, chemistry and material science," Rep. Prog. Phys. 71(12), 126601 (2008).
9. P. Raiteri, A. Laio, F. L. Gervasio, C. Micheletti, and M. Parrinello, "Efficient reconstruction of complex free energy landscapes by multiple walkers metadynamics," J. Phys. Chem. B 110(8), 3533–3539 (2006).
10. C. Abrams and G. Bussi, "Enhanced sampling in molecular dynamics using metadynamics, replica-exchange, and temperature-acceleration," Entropy 16(1), 163–199 (2013).
11. P. Tiwary and M. Parrinello, "A time-independent free energy estimator for metadynamics," J. Phys. Chem. B 119(3), 736–742 (2015).
12. J. Kästner, "Umbrella sampling," Wiley Interdiscip. Rev.: Comput. Mol. Sci. 1(6), 932–942 (2011).
13. S. Jo, D. Suh, Z. He, C. Chipot, and B. Roux, "Leveraging the information from Markov state models to improve the convergence of umbrella sampling simulations," J. Phys. Chem. B 120(33), 8733–8742 (2016).
14. M. M. Sultan and V. S. Pande, "tICA-metadynamics: Accelerating metadynamics by using kinetically selected collective variables," J. Chem. Theory Comput. 13(6), 2440–2447 (2017).
15. M. M. Sultan and V. S. Pande, "Transfer learning from Markov models leads to efficient sampling of related systems," J. Phys. Chem. B 122, 5291 (2017).
16. C. X. Hernández, H. K. Wayment-Steele, M. M. Sultan, B. E. Husic, and V. S. Pande, "Variational encoding of complex dynamics," Phys. Rev. E 97, 062412 (2018).
17. M. M. Sultan, H. K. Wayment-Steele, and V. S. Pande, "Transferable neural networks for enhanced sampling of protein dynamics," J. Chem. Theory Comput. 14, 1887–1894 (2018).
18. V. S. Pande, K. Beauchamp, and G. R. Bowman, "Everything you wanted to know about Markov state models but were afraid to ask," Methods 52(1), 99–105 (2010).
19. M. P. Harrigan, M. M. Sultan, C. X. Hernández, B. E. Husic, P. Eastman, C. R. Schwantes, K. A. Beauchamp, R. T. McGibbon, and V. S. Pande, "MSMBuilder: Statistical models for biomolecular dynamics," Biophys. J. 112(1), 10–15 (2016).
20. G. Pérez-Hernández, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé, "Identification of slow molecular order parameters for Markov model construction," J. Chem. Phys. 139(1), 015102 (2013).
21. P. Tiwary and B. J. Berne, "Spectral gap optimization of order parameters for sampling complex molecular systems," Proc. Natl. Acad. Sci. U. S. A. 113(11), 2839–2844 (2016).
22. J. McCarty and M. Parrinello, "A variational conformational dynamics approach to the selection of collective variables in metadynamics," J. Chem. Phys. 147(20), 204109 (2017).
23. G. Fiorin, M. L. Klein, and J. Hénin, "Using collective variables to drive molecular dynamics simulations," Mol. Phys. 111(22-23), 3345–3362 (2013).
24. S. Lovera, L. Sutto, R. Boubeva, L. Scapozza, N. Dölker, and F. L. Gervasio, "The different flexibility of c-Src and c-Abl kinases regulates the accessibility of a druggable inactive conformation," J. Am. Chem. Soc. 134(5), 2496–2499 (2012).
25. A. C. Pan, D. Sezer, and B. Roux, "Finding transition pathways using the string method with swarms of trajectories," J. Phys. Chem. B 112(11), 3432–3440 (2008).
26. Y. Meng, Y. L. Lin, and B. Roux, "Computational study of the 'DFG-flip' conformational transition in c-Abl and c-Src tyrosine kinases," J. Phys. Chem. B 119(4), 1443–1456 (2015).
27. M. Ceriotti, G. A. Tribello, and M. Parrinello, "Demonstrating the transferability and the descriptive power of sketch-map," J. Chem. Theory Comput. 9(3), 1521–1532 (2013).
28. P.-N. Tan, Classification: Basic Concepts, Decision Trees and Model Evaluation. Introduction to Data Mining (Addison-Wesley Longman Publishing, 2006), Chap. 4, pp. 145–205.
29. F. Pedregosa and G. Varoquaux, "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res. 12, 2825–2830 (2011).
30. F. Pérez and B. E. Granger, "IPython: A system for interactive scientific computing," Comput. Sci. Eng. 9(3), 21–29 (2007).
31. S. S. Schoenholz, E. D. Cubuk, D. M. Sussman, E. Kaxiras, and A. J. Liu, "A structural approach to relaxation in glassy liquids," Nat. Phys. 12(5), 469–471 (2016).
32. T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," Ann. Stat. 36, 1171–1220 (2008).
33. K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Trans. Neural Networks 12(2), 181–201 (2001).
34. M. P. Harrigan and V. S. Pande, "Landmark kernel tICA for conformational dynamics," preprint bioRxiv:123752 (2017).
35. P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for activation functions," preprint arXiv:1710.05941 (2017).
36. A. Ma and A. R. Dinner, "Automatic method for identifying reaction coordinates in complex systems," J. Phys. Chem. B 109(14), 6769–6779 (2005).
37. W. Chen and A. L. Ferguson, "Collective variable discovery and enhanced sampling using autoencoders: Innovations in network architecture and error function design," J. Chem. Phys. 149, 072312 (2018).
38. S. Piana and A. Laio, "A bias-exchange approach to protein folding," J. Phys. Chem. B 111(17), 4553–4559 (2007).
39. G. A. Tribello, M. Bonomi, D. Branduardi, C. Camilloni, and G. Bussi, "PLUMED 2: New feathers for an old bird," Comput. Phys. Commun. 185(2), 604–613 (2014).
40. R. T. McGibbon, K. A. Beauchamp, M. P. Harrigan, C. Klein, J. M. Swails, C. X. Hernández, C. R. Schwantes, L. P. Wang, T. J. Lane, and V. S. Pande, "MDTraj: A modern open library for the analysis of molecular dynamics trajectories," Biophys. J. 109(8), 1528–1532 (2015).
41. A. Paszke, G. Chanan, Z. Lin, S. Gross, E. Yang, L. Antiga, and Z. DeVito, "Automatic differentiation in PyTorch," Adv. Neural Inf. Process. Syst. 30, 1–4 (2017).
42. A. Barducci, G. Bussi, and M. Parrinello, "Well-tempered metadynamics: A smoothly converging and tunable free-energy method," Phys. Rev. Lett. 100(2), 020603 (2008).
43. C. X. Hernández, M. P. Harrigan, M. M. Sultan, and V. S. Pande, "MSMExplorer: Data visualizations for biomolecular dynamics," J. Open Source Software 2(12), 188 (2017).
44. D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," preprint arXiv:1412.6980 (2014).
45. B. Trendelkamp-Schroer and F. Noé, "Efficient estimation of rare-event kinetics," Phys. Rev. X 6(1), 011009 (2016).
46. K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw, "How fast-folding proteins fold," Science 334(6055), 517–520 (2011).
47. K. Lindorff-Larsen, S. Piana, K. Palmo, P. Maragakis, J. L. Klepeis, R. O. Dror, and D. E. Shaw, "Improved side-chain torsion potentials for the Amber ff99SB protein force field," Proteins: Struct., Funct., Bioinf. 78(8), 1950–1958 (2010).
48. W. L. Jorgensen, J. Chandrasekhar, J. D. Madura, R. W. Impey, and M. L. Klein, "Comparison of simple potential functions for simulating liquid water," J. Chem. Phys. 79(2), 926 (1983).
49. D. E. Shaw, J. C. Chao, M. P. Eastwood, J. Gagliardo, J. P. Grossman, C. R. Ho, D. J. Ierardi, I. Kolossváry, J. L. Klepeis, T. Layman et al., "Anton, a special-purpose machine for molecular dynamics simulation," in Proceedings of the 34th Annual International Symposium on Computer Architecture - ISCA '07 (ACM Press, New York, USA, 2007), Vol. 35, p. 1.
50. K. A. McKiernan, B. E. Husic, and V. S. Pande, "Modeling the mechanism of CLN025 beta-hairpin formation," J. Chem. Phys. 147(10), 104107 (2017).
51. S. Honda, T. Akiba, Y. S. Kato, Y. Sawada, M. Sekijima, M. Ishimura, A. Ooishi, H. Watanabe, T. Odahara, and K. Harata, "Crystal structure of a ten-amino acid protein," J. Am. Chem. Soc. 130(46), 15327–15331 (2008).
52. W. Humphrey, A. Dalke, and K. Schulten, "VMD: Visual molecular dynamics," J. Mol. Graphics 14(1), 33–38 (1996).
53. L. Breiman, "Random forests," Mach. Learn. 45, 5–32 (2001).
54. M. M. Sultan, G. Kiss, D. Shukla, and V. S. Pande, "Automatic selection of order parameters in the analysis of large scale molecular dynamics simulations," J. Chem. Theory Comput. 10(12), 5217–5223 (2014).
55. W. E, W. Ren, and E. Vanden-Eijnden, "Simplified and improved string method for computing the minimum energy paths in barrier-crossing events," J. Chem. Phys. 126(16), 164103 (2007).
56. W. E, W. Ren, and E. Vanden-Eijnden, "String method for the study of rare events," Phys. Rev. B 66(5), 052301 (2002).
57. M. A. Rohrdanz, W. Zheng, and C. Clementi, "Discovering mountain passes via torchlight: Methods for the definition of reaction coordinates and pathways in complex macromolecular reactions," Annu. Rev. Phys. Chem. 64, 295–316 (2013).
58. Z. D. Pozun, K. Hansen, D. Sheppard, M. Rupp, K. R. Müller, and G. Henkelman, "Optimizing transition states via kernel-based machine learning," J. Chem. Phys. 136(17), 174101 (2012).
