Selection of appropriate collective variables (CVs) for enhancing the sampling of molecular simulations remains an unsolved problem in computational modeling. Picking initial CVs is particularly challenging in higher dimensions. Which atomic coordinates, or transforms thereof, from a list of thousands should one pick for enhanced sampling runs? How does a modeler even begin to pick starting coordinates for investigation? This remains true even in the case of simple two-state systems and only increases in difficulty for multi-state systems. In this work, we solve the “initial” CV problem using a data-driven approach inspired by the field of supervised machine learning (SML). In particular, we show how the decision functions of SML algorithms can be used as initial CVs (*SML*_{cv}) for accelerated sampling. Using solvated alanine dipeptide and the Chignolin mini-protein as our test cases, we illustrate how the distance to the support vector machine’s decision hyperplane, the output probability estimates from logistic regression, the outputs from shallow or deep neural network classifiers, and other classifiers may be used to reversibly sample slow structural transitions. We discuss the utility of other SML algorithms that might be useful for identifying CVs for accelerating molecular simulations.

## INTRODUCTION

Efficient configuration space sampling of proteins and other complex physical systems remains an open challenge in computational biophysics. Despite advances in MD code bases,^{1} hardware, and algorithms,^{2} routine access to microsecond-to-millisecond time scale events remains out of reach for all but a few groups.^{3–7}

As an alternative to unbiased molecular simulations, enhanced sampling methods such as metadynamics^{8–11} or umbrella sampling^{12,13} offer promise. However, these require identification of a set of slow collective variables (CVs) to sample along. Recently we^{14–17} proposed using methods from the Markov modeling literature, namely, the time-structure based independent component analysis (tICA) method^{14,15,18–20} or the variational auto-encoder method from the machine learning (ML) literature,^{16,17} for identifying optimal linear and non-linear low-dimensional CVs for accelerated sampling. However, these proposed methods are intended for a wild-type simulation that has been sufficiently sampled. While we showed^{14} that full phase space convergence was not necessary for approximating such CVs, the models, and thus the CVs, became more robust as more data were input.

Alternatively, other groups^{21,22} have proposed using spectral gap or re-weighting based approaches to identify appropriate CVs in an iterative fashion. However, the latter methods require good initial CVs, since poor CVs might inadvertently introduce orthogonal modes, thereby slowing convergence. In this work, we wish to answer the question: How do we select initial collective variables? There is obviously no *correct a priori* answer to this problem. Arguably, an effective starting CV should be continuously defined and differentiable, low-dimensional, and capable of separating out target states of interest while minimizing the number of orthogonal degrees of freedom.^{23} Previous approaches for protein systems have included expertly chosen distances or dihedrals,^{24} path based CVs,^{25,26} and more generic CVs such as alpha-helical or beta-sheet character. Similarly, the use of ML based methods for finding CVs is not new. For example, several groups have proposed methods to use linear or non-linear dimensionality reduction based methods^{8,10,27} to find low-dimensional CVs. In contrast to unsupervised learning, supervised machine learning (SML) can use the state labels, enabling efficient metric learning for distinguishing between the start and end states.

Therefore, in this paper, we propose approaching the CV selection problem using an automated data-driven scheme. For molecular systems *where the end states are known*, we re-cast the problem as a supervised machine learning (*SML*_{cv}) problem. SML algorithms are capable of learning a mapping between a high dimensional input and output. These algorithms are routinely used in ML to, say, distinguish pictures of cats from dogs. Similarly, we show how various classification algorithms^{28} from the ML literature can learn a decision boundary (Fig. 1) separating *protein* states and how this enables us to define CVs for enhanced sampling. We begin by demonstrating this with a support vector machine (*SVM*), using a configuration’s distance to the decision boundary (*SVM*_{cv}) as a CV. Second, we demonstrate the same principle using a logistic regression (*LR*) classifier, whose predicted state probabilities (*LR*_{cv}) can be used as a CV. Third, we show how neural network (NN) based classifiers produce un-normalized state probabilities (*NN*_{cv}) that can be used as CVs. Fourth, we show how to include data from multiple states by combining multiclass classification with multidimensional enhanced sampling. We end the paper by sampling the folding landscape of the Chignolin mini-protein using *SVM*_{cv}.

## METHODS

All of our methods assume that the modeler has access to short trajectories from the known start and end states. After running short dynamics (on the order of ns to microseconds for protein systems) in the start (State A) and end (State B) states [Figs. 1(b) and 1(c)], we train a classifier to distinguish between the start and end states. To do this, we first project the trajectories onto a complete list of possible order parameters/features. For proteins, this list might include *all* backbone and side chain dihedrals, alpha carbon pseudo dihedrals, contact distances, etc. An attractive advantage of using SML algorithms to build these CVs is that most of these algorithms can be explicitly regularized (via L1 regularization, for instance) to reduce model complexity. For example, if a certain dihedral or distance is incapable of distinguishing between the start and end states, then the L1 regularizer’s penalty function will allow the SML algorithm to discard this feature. More concretely, this would set the model’s coefficient (see below) for this feature to 0. In the examples below, we used an L1 regularization scheme for the alanine dipeptide examples and an L2 regularization scheme for the Chignolin example. In contrast to L1, L2 regularization shrinks all of the model coefficients toward zero without setting any of them exactly to zero. A second benefit comes from realizing that the deposited metadynamics bias will simultaneously accelerate *all* included degrees of freedom [Figs. 2(b), 4(b), and 5(b)], which can accelerate convergence and allow for more aggressive biasing schemes. Again, more concretely, if getting from protein state A to state B requires both a dihedral and an interatomic distance to change, then using the *SML*_{cv} would allow us to add external biasing forces to both of these features using a single collective variable. In traditional methods, the modeler would either run multi-dimensional enhanced sampling or judiciously pick one feature and hope for the best.
Multi-dimensional metadynamics does not scale well beyond 3 dimensions, while picking and hoping scales even worse. Although there are a wide variety of possible SML algorithms, we focus on three of the most common classification methods, namely, SVM, LR, and NN classifiers.

*Support Vector Machines.* SVMs are linear classifiers that find the separating hyperplane that maximizes the distance between the closest points of the two classes [Fig. 1(b)]. Mathematically, they predict the output label as follows:

$$ y(X) = \mathbb{1}\left[w^{T}X + b > 0\right], $$

where y is the output label, X is the high dimensional input vector, $\mathbb{1}$ is the indicator function, w is the learnt vector of coefficients, and b is the scalar bias. The SVM’s optimization procedure is outside the scope of this paper, but it is worth noting that these models can be easily fitted via API calls to modern ML software such as scikit-learn.^{29} The accompanying online IPython notebooks^{30} provide examples of how these models can be constructed.
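As a concrete sketch of such an API call, the snippet below fits scikit-learn’s `LinearSVC`; the data are synthetic stand-ins for featurized simulation frames, not actual trajectory data.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-ins for featurized frames from two state ensembles:
# 100 frames each from a hypothetical state A (label 0) and state B (label 1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.3, (100, 4)),
               rng.normal(+1.0, 0.3, (100, 4))])
y = np.repeat([0, 1], 100)

# L1-regularized linear SVM: a sparse w discards uninformative features.
clf = LinearSVC(penalty="l1", dual=False, C=1.0)
clf.fit(X, y)

# y(X) = 1[w^T X + b > 0]: predict applies the indicator function.
print(clf.predict(X[:3]))         # state labels for the first three frames
print(clf.coef_, clf.intercept_)  # the learnt w and b
```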

The direct indicator-function based output from SVMs is not differentiable and thus cannot be used as a collective variable for enhanced sampling. A non-differentiable function of the input coordinates has a discontinuous derivative at the decision boundary. This derivative, which is directly related to the extra forces applied to the atomic positions, would be zero everywhere away from the boundary, since moving in any direction would not change the predicted output state. At the decision boundary itself, however, the derivative diverges due to the sudden change in state, which is likely to cause the simulation to crash. However, we can instead use the signed closest *distance* to the SVM’s hyperplane as our collective variable. Intuitively speaking [Fig. 1(b)], a larger distance in either direction means that our current simulation frame sits deeper in one basin or the other. Thus our collective variable becomes

$$ SVM_{cv}(X) = \frac{w^{T}X + b}{\lVert w \rVert_{2}}, $$

where we divide by $\lVert w \rVert_{2}$ because the weight vector is generally not of unit norm. It is also possible to simply use the numerator, also called the decision value or the softness field in glass dynamics,^{31} as the collective variable.
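In scikit-learn terms, the signed distance can be obtained by normalizing `decision_function` (which returns the numerator $w^{T}X + b$) by the coefficient norm; the data here are again synthetic stand-ins.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic two-state data standing in for featurized frames.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.3, (100, 4)),
               rng.normal(+1.0, 0.3, (100, 4))])
y = np.repeat([0, 1], 100)
clf = LinearSVC(C=1.0, dual=False).fit(X, y)

# decision_function returns w^T x + b; dividing by ||w||_2 yields the
# signed Euclidean distance to the hyperplane, i.e., SVM_cv.
svm_cv = clf.decision_function(X) / np.linalg.norm(clf.coef_)

# Frames deep in state A are strongly negative; state B, strongly positive.
print(svm_cv.min(), svm_cv.max())
```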

*Logistic Regression.* The LR classifier [Fig. 1(c)] models the posterior probability of the state label given a configuration by maximizing the likelihood of the training data,

$$ P(y = 1 \mid X) = \sigma\left(w^{T}X + b\right), $$

where σ is the sigmoid function,

$$ \sigma(z) = \frac{1}{1 + e^{-z}}. $$
While SVMs try to maximize the margin between the two classes, LR algorithms maximize the posterior probability of the classes. However, unlike SVMs, we can now directly use the differentiable probability output from the model as our collective variable for metadynamics or other enhanced sampling algorithms,

$$ LR_{cv}(X) = P(y = 1 \mid X) = \sigma\left(w^{T}X + b\right). $$
While we are using the probability of being in state 1 as our collective variable, we can also use the conditional likelihood (odds ratio) of being in either state as a collective variable. In this instance, the CV becomes

$$ LR_{cv}(X) = \frac{P(y = 1 \mid X)}{P(y = 0 \mid X)} = e^{w^{T}X + b}. $$
### Incorporating non-linearity via kernels or neural networks

It is entirely possible that the linear models presented above are inadequate for separating the training simulations in the given feature space. This can be diagnosed via hysteresis during the enhanced sampling simulations. In this case, we recommend two possible extensions. (1) Using a kernel function^{32,33} to implicitly map the input feature vector into a high dimensional space. However, it is worth noting that kernel functions often require picking landmark^{19,32,34} protein configurations for efficient prediction and preventing over-fitting. (2) Alternatively, non-linearity can be achieved via shallow or deep neural networks.^{15} Neural networks (NNs) are universal function approximators. Typical NNs consist of a series of convolutional and/or fully connected affine transformation layers [Fig. 1(d)] interspersed with a non-linear activation function, such as the sigmoid, ReLU, or Swish.^{35} Previous studies have already highlighted the expressive power of neural networks for dimensionality reduction^{16,36} and sampling.^{17,37} Here we argue that given some labeled state/trajectory data, the un-normalized output from these networks can be used as a set of differentiable collective variables for accelerating molecular simulations,

$$ NN_{cv}(X) = G(X), $$

where *G*(*X*) is the non-linear transformation learnt by the NN to accurately separate the training examples. However, we caution users against constructing arbitrarily deep networks since they might be difficult to train or make the force calculations computationally expensive, thereby negating the speed-ups from accelerated sampling.
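A minimal PyTorch sketch of such a classifier follows; the layer widths are illustrative choices, not prescriptions from this work.

```python
import torch
import torch.nn as nn

# A shallow classifier G(X): n featurized coordinates in, two
# un-normalized state scores out. Either output node is a smooth,
# differentiable function of the inputs and can serve as NN_cv.
n_features = 4
net = nn.Sequential(
    nn.Linear(n_features, 16),
    nn.SiLU(),            # Swish activation: x * sigmoid(x)
    nn.Linear(16, 16),
    nn.SiLU(),
    nn.Linear(16, 2),     # two un-normalized state probabilities
)

frames = torch.randn(5, n_features, requires_grad=True)  # 5 fake frames
scores = net(frames)
# Gradients of the CV w.r.t. input coordinates give the biasing forces.
scores[:, 1].sum().backward()
print(scores.shape, frames.grad.shape)
```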

## EXTENSION TO MULTIPLE STATES VIA BIAS-EXCHANGE

Up to now, we have only considered the problem of generating a collective variable for a two-state system. However, most large biophysical^{3,4} and non-biological systems have multiple metastable states. In several instances, these states are known beforehand from crystallography or NMR data, but the relative free energies or kinetics of inter-conversion between those states are not. To incorporate these states into our model, we recommend generating a multiclass classification^{28} model. Given N states, the output from our SML algorithm would now be an N-dimensional probability vector, and we can use the multiple outputs as N individual coordinates in a bias-exchange simulation^{38} or accelerate them simultaneously using multi-dimensional Gaussians (if the number of states is <3). While certain algorithms, such as neural networks, naturally allow for prediction of multiple outputs, other SML algorithms adopt a “one-vs-rest” (OVR) strategy. In OVR, we build a series of models, one for each output state, such that each model learns a hyper-boundary or decision function that separates its target state from all the other states. Thus we get N sub-models for the N states, and given a new data point, we compute its distance to each of those hyperplanes and assign it to its closest state. However, *in lieu* of prediction, we can now use the same *set* of *decision functions as individual collective variables* in an enhanced sampling simulation. For example, in the case of a 3-state SVM, the 3 collective variables would be

$$ SVM_{cv}^{(i)}(X) = \frac{w_{i}^{T}X + b_{i}}{\lVert w_{i} \rVert_{2}}, \qquad i \in \{1, 2, 3\}, $$

where w_{1}, w_{2}, and w_{3} and b_{1}, b_{2}, and b_{3} are the respective weight vectors and biasing scalars for the hyperplanes that separate their respective states from the rest of the ensemble. Similar expressions can be derived for other SML methods such as logistic regression classifiers. On the other hand, since NNs can directly output an un-normalized probability vector of length equal to the number of states, there is no need to use OVR based strategies.
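In scikit-learn, `LinearSVC` fits in one-vs-rest mode for more than two classes, so all three hyperplanes come out of a single fit; a sketch with synthetic three-state data:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-ins for frames from three metastable states.
rng = np.random.default_rng(3)
centers = np.array([[-2.0, 0.0], [2.0, 2.0], [2.0, -2.0]])
X = np.vstack([rng.normal(c, 0.3, (100, 2)) for c in centers])
y = np.repeat([0, 1, 2], 100)

# LinearSVC trains one-vs-rest hyperplanes: one (w_i, b_i) per state.
clf = LinearSVC(C=1.0, dual=False).fit(X, y)

# N signed distances -> N collective variables for bias-exchange
# or multi-dimensional metadynamics.
norms = np.linalg.norm(clf.coef_, axis=1)
cvs = clf.decision_function(X) / norms   # shape: (n_frames, 3)
print(cvs.shape)
```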

### Notes on implementation and model cross-validation

It is worth recognizing that using these SML based CVs requires existing enhanced sampling software to have native machine learning libraries built in so that the collective variable definitions proposed above can be utilized. However, at their core, these machine learning models are nothing more than a series of vector operations whose compute graph can be easily broken down and implemented in a stepwise fashion in most modern enhanced sampling engines. To highlight this, our open source implementation at the end of the manuscript provides examples of how the SVM, multiclass-SVM, LR, or even NNs can be written as a series of custom scripts such that enhanced sampling packages such as Plumed and OpenMM^{1,39} can interpret and utilize the mathematical transforms that connect the current simulation frame’s atomic coordinates to a set of scalars—also known as the collective variables.
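To illustrate how small that compute graph is, the sketch below flattens a fitted linear model into a plain arithmetic string. The feature names (`f0`, `f1`, …) and the coefficients are hypothetical, and the exact custom-function syntax of any given engine should be checked against its documentation.

```python
import numpy as np

# Hypothetical fitted coefficients and intercept of a linear SVM.
w = np.array([0.75, -1.20, 0.05])
b = 0.30
norm = np.linalg.norm(w)

# Normalized hyperplane distance as one arithmetic expression over
# feature variables f0, f1, ...; engines such as Plumed accept custom
# expressions of roughly this form (check the engine's exact syntax).
terms = [f"{wi / norm:+.6f}*f{i}" for i, wi in enumerate(w)]
expression = "".join(terms) + f"{b / norm:+.6f}"
print(expression)
```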

To prevent overfitting, we also recommend k-fold (k = 3-10) cross-validation to identify optimal model hyper-parameters. In cross-validation, the model is trained on a subset of the data (the training data), and its accuracy is scored on a held out set (validation data). A range of models with varying hyper-parameters and model complexity are built, and parameters for the best scoring model are saved. The best model can then be retrained on the full training and validation sets before reporting its performance on a held out test set. Since the “test” for our model is its ability to accelerate sampling, we chose to simply do 3-fold cross-validation without a held out test set. The supplementary material contains validation set accuracies for a range of different models and systems. In general, across several reasonable parameter values (Figs. 3–5 of the supplementary material), our models’ accuracies never drop below 0.92 (92%) for either example. For alanine (Figs. 3 and 4 of the supplementary material), all models reported accuracies of 100% because the alpha and beta regions are easily linearly and non-linearly separable. As specified above, after cross-validation, we re-trained the models on the full train and validation dataset.
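The k-fold search described above is a one-liner with scikit-learn’s `GridSearchCV`; the data are synthetic and the C grid is illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic two-state training data.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.0, 0.5, (60, 4)),
               rng.normal(+1.0, 0.5, (60, 4))])
y = np.repeat([0, 1], 60)

# 3-fold cross-validation over the regularization strength; with
# refit=True (the default), the best model is re-trained on the
# full train+validation data afterwards.
search = GridSearchCV(LinearSVC(dual=False),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                      cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```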

## RESULTS

### Application to alanine dipeptide using SVMs

We showcase our methods for three different linear classifiers on solvated alanine dipeptide. The training simulations were previously generated^{14} and consisted of two 2 ns trajectories starting from the $\beta$ and $\alpha_L$ basins on the Ramachandran plot (Fig. 2). Similar to previous work, we used the sin-cosine transform of the backbone $\varphi$ and $\psi$ dihedrals^{19,40} as input to our models. Thus each training simulation frame was represented as 4 numbers. The control simulations (Figs. 1 and 2 of the supplementary material) were accelerated only along the backbone $\varphi$ and $\psi$ dihedrals. All models were trained using scikit-learn^{29} or PyTorch.^{41} For the SVM and LR models, we performed 3-fold cross-validation to determine the best hyper-parameters and regularization strengths (Figs. 3 and 4 of the supplementary material) but found that our dataset was simple enough that a large range of hyper-parameter settings gave very similar results. Ultimately, we picked reasonable model parameters (Figs. 3 and 4 of the supplementary material) and then re-trained the model on the entire available 4 ns of data. For the SVM and the LR models, we used an L1 regularization with a regularization strength (C-parameter) of 1.0. The supplementary material contains the full model parameters, and the online GitHub repository contains the fitted models. Each model was trained once and kept fixed throughout the simulation. The pre-fitted models were written as custom scripts to perform metadynamics using Plumed^{39} and OpenMM.^{1} The metadynamics simulations were well tempered^{42} with the parameters shown in Table 1 of the supplementary material. The simulations were started from the $\beta$ basin, run for a total length of 45 ns apiece, and saved every 10 ps.
All other simulation parameters were kept the same as before.^{14} The trajectories were analyzed and plotted using standard Python libraries^{19,29,30,43} while reweighting was done using the time-independent estimator of Tiwary.^{11}
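The sin-cosine featurization is a few lines of NumPy once the dihedrals are in hand; the dihedral values below are made up, and in practice they would be computed from the trajectory with a library such as MDTraj.

```python
import numpy as np

# Hypothetical per-frame backbone dihedrals, in radians.
phi = np.array([-2.8, -1.2, 1.0])
psi = np.array([2.9, 0.4, -0.3])

# The sin-cosine transform removes the periodic discontinuity at
# +/- pi, so each frame becomes 4 smoothly varying numbers.
features = np.column_stack([np.sin(phi), np.cos(phi),
                            np.sin(psi), np.cos(psi)])
print(features.shape)   # (3, 4)
```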

The results from our *SVM*_{cv} simulations are shown in Fig. 2. Based upon the 4 ns of training data, the SVM model is able to find a hyperplane separating the $\beta$ and $\alpha_L$ regions [Fig. 2(a)]. The distance to this hyperplane [Fig. 2(a), color bar] now becomes our CV for accelerating sampling via metadynamics [Fig. 2(c)]. Over a 45 ns run, we sample the slowest coordinate, the $\beta$ to $\alpha_L$ transition, tens of times, allowing for robust estimation of the free energy surface via reweighting [Fig. 2(d)]. Additionally, since the simulations converged, we are able to efficiently sample the faster orthogonal $\psi$ degree of freedom as well. Similar to the SVM, we also used the logistic regression (LR) classifier to define CVs (*LR*_{cv}) for accelerated simulations. Note 1 and Fig. 6 of the supplementary material contain details on training and using these models for defining CVs. We recommend that modelers start with simpler linear models before moving to the more complex neural network based approaches shown in the following section.

### Application to alanine dipeptide using neural networks

We next looked into using neural networks to find non-linear boundaries between the start and end states. To show this, we trained a simple 5-layered neural network and used the resulting model as a collective variable for enhanced sampling via Plumed and OpenMM.^{1,39} The layers alternated [Fig. 3(b)] between fully connected affine transformations and non-linear activation functions. For the non-linear activation layer, we used the recently developed Swish^{35} function. We chose this combination arbitrarily, though it is possible to systematically cross-validate various network architecture choices by maximizing the model’s performance on the validation set.

We trained the model [Fig. 3(b)] on half of the 4 ns alanine dipeptide dataset using PyTorch. We note that the output layer of this model gives two values, which are the un-normalized probabilities of the input frame being in either state. Either (or both) of these output nodes can be used as a collective variable since, once normalized, they sum to 1. To train the model, we minimized the cross-entropy between the model’s output and the training labels. This was done using the Adam optimizer^{44} with an initial learning rate of 0.1. We trained the model for 1 epoch using a batch size of 32. The optimization was performed using PyTorch on the CPU platform, and training took less than a minute. After 1 epoch, our model’s training error was below 0.1, and the model reported 100% accuracy on a held out 2 ns test set. Therefore, we stopped training at this point and used this model going forward. Once the model had been trained [Fig. 3(a)], we wrote custom string expressions to convert the PyTorch neural network model into a format that Plumed can process. Again, the metadynamics simulations were well tempered^{42} with the parameters shown in Table 1 of the supplementary material. The online GitHub repo provides examples of how such models can be trained and transferred to Plumed.
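The training recipe above amounts to a short PyTorch loop; the sketch below substitutes synthetic separable data for the alanine features but keeps the stated hyper-parameters (cross-entropy loss, Adam at lr = 0.1, one epoch, batch size 32).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for the featurized training set: two separable states.
X = torch.cat([torch.randn(200, 4) - 2.0, torch.randn(200, 4) + 2.0])
y = torch.cat([torch.zeros(200, dtype=torch.long),
               torch.ones(200, dtype=torch.long)])

net = nn.Sequential(nn.Linear(4, 16), nn.SiLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(net.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One epoch over shuffled mini-batches of 32.
perm = torch.randperm(len(X))
for start in range(0, len(X), 32):
    idx = perm[start:start + 32]
    opt.zero_grad()
    loss_fn(net(X[idx]), y[idx]).backward()
    opt.step()

accuracy = (net(X).argmax(dim=1) == y).float().mean().item()
print(accuracy)
```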

The results are in Figs. 3(c) and 3(d); running metadynamics simulations along the output probability, *DNN*_{cv}, allows us to observe >15 transitions [Fig. 3(c)] along alanine’s slower $\varphi $ coordinate in just 45 ns of sampling. Similar to the SVM model, the robust transition statistics combined with the reweighting method of Tiwary^{11} now allows us to recover the free-energy landscape [Fig. 3(d)] across other coordinates of interest.

### Application to alanine dipeptide using multiclass supervised machine learning

We next extended our method to the multiple state scenario. This is necessary because most large systems, such as kinases, G-protein coupled receptors (GPCRs), or ion channels, exhibit multiple metastable states that are often available as crystallographic starting points.^{4} Simulations around these local points can be used to define a set of decision functions that separate each state from the rest. This set of decision functions can then be used in a multi-dimensional enhanced sampling run.

To test the method, we generated three 1 ns trajectories in the $\beta$, $\alpha_L$, and $\alpha_R$ regions of alanine dipeptide [Fig. 4(a)]. Based on the cross-validation results from the previous 2-state SVM model, we built an L1 regularized 3-state SVM with the regularization strength set to 1.0 and trained it using a squared-hinge loss (see the supplementary material). The model parameterizes 3 hyperplanes, and the distance to each of those hyperplanes was used in a 3-dimensional metadynamics simulation. It is trivial to separate these into three one-dimensional metadynamics simulations, connected via Hamiltonian replica exchange.^{38} All metadynamics parameters were the same except that we used last-bias reweighting, in place of Tiwary’s time-dependent estimator, due to numerical stability issues. We limited our simulations to 12 ns because the simulation showed excellent convergence [Fig. 4(c)] well before that.

The resulting multiclass SVM model [Fig. 4(a)] learnt a set of dividing boundaries or functions that separate each of the $\beta$, $\alpha_L$, and $\alpha_R$ regions from the other two. The coefficients for the respective hyperplanes are shown in Fig. 4(b). For example, as intuitively expected, the $\alpha_L$ plane [Fig. 4(b), orange] is separated from the rest of the ensemble only via the $\varphi$ dihedral features. In this instance, regularization forces the other two coefficients to 0. By contrast, the $\alpha_R$ plane has non-zero weights on both $\varphi$ and $\psi$ features. The simulations quickly reach the diffusive regime [Fig. 4(c)], showing tens of transitions in just 10 ns and highlighting the effectiveness of the multiclass SML algorithm as a set of collective variables. Last-bias reweighting gives us a similar free-energy surface as before [Fig. 4(d)].

To compare our simulations to a baseline, we ran two control 12+ ns simulations (Fig. 1 of the supplementary material) in which we accelerated the dynamics of the $\psi$ or $\varphi$ dihedral alone. $\psi$ is a bad starting CV since the slowest transitions in alanine are dominated by the $\varphi$ dihedral. Thus, in our first control simulation, we only observed three transitions along the $\varphi$ coordinate. By contrast, both variants of *SML*_{cv} were able to observe at least 5 transitions in the same amount of simulation time. Additionally, all simulations, including the control, were also able to drive the slow uphill $\beta$ to $\alpha_L$ transition much more quickly (<5 ns) than the unbiased 43 ns mean first passage time estimate previously reported.^{45} However, it is worth noting that some of our SML based CVs are not as efficient as simply directly accelerating the $\varphi$ dihedral (Fig. 2 of the supplementary material). This likely indicates that we do not have the most efficient possible CV, but it is entirely possible to turn our initial SML based CV into a better CV by combining it with the SGOOP formalism.^{21} Intriguingly, knowledge about the importance of the $\varphi$ dihedral could also potentially be incorporated as a Bayesian prior into the *SML*_{cv} formalism.

### Application to protein folding using SVMs

We next sought to fold the 10-residue Chignolin mini-protein^{46} using the Amber forcefield,^{47} TIP3P^{48} water model, and vanilla NVT simulations. To this end, we obtained the all-atom Chignolin trajectories from D. E. Shaw Research.^{46,49} We then sliced 1000 frames each from 2 $\mu s$ trajectories in the folded and unfolded [Fig. 5(a)] states (4 $\mu s$ total), ensuring that no actual folding event was seen in those trajectories. To featurize these 2000 frames, we used a combination of the sin-cosine transform of the backbone $\varphi, \psi$ dihedrals, $\alpha$ carbon contact distances for residues at least 3 apart in sequence, and the cosine transform of the pseudo dihedral formed by four consecutive $\alpha$ carbon atoms. Thus each frame in our simulation was represented using 78 features. These features were normalized to have zero mean and unit variance by applying a standard feature scaling transformation.^{7,46} This scaling is necessary because most SML algorithms have issues with varied feature scales: for example, dihedrals range from $-\pi$ to $\pi$ while distances can reach tens of angstroms. Decision trees and random forests do not require this feature scaling because, unlike SVMs, LR, or NNs, they do not learn a “distance” metric and are invariant to monotonic feature scaling.
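The scaling step can be sketched with scikit-learn’s `StandardScaler`; the feature ranges below are illustrative of the dihedral/distance mismatch, not the actual Chignolin features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical mixed-scale features: dihedral-like columns in
# [-pi, pi] next to distance-like columns spanning tens of angstroms.
rng = np.random.default_rng(5)
X = np.hstack([rng.uniform(-np.pi, np.pi, (2000, 2)),
               rng.uniform(3.0, 40.0, (2000, 2))])

# Zero-mean, unit-variance scaling so no feature dominates the learnt
# hyperplane merely because of its units.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

print(X_scaled.mean(axis=0).round(6))   # ~0 in every column
print(X_scaled.std(axis=0).round(6))    # ~1 in every column
```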

We trained an SVM [Fig. 5(b)] on these 2000 frames (1000 folded and 1000 unfolded), using cross-validated parameters (Fig. 5 of the supplementary material), via scikit-learn.^{29} For this model, we used L2 regularization with the regularization strength set to 1.0. However, changing the regularization strength by a factor of 10 in either direction had minimal effect on the final decision boundary (Fig. 5 of the supplementary material). Then, similar to the alanine example, we used the distance to the SVM’s hyperplane as our collective variable. In this instance, we started 25 well-tempered metadynamics^{9,42} walkers. Each of our walkers was run for 100 ns (∼18 h of parallelized sampling on commodity K40 GPUs), for a total of 2.5 $\mu s$ of sampling. We ran vanilla NVT metadynamics simulations using a 2 fs time step with a Langevin integrator (friction coefficient of 1/ps) at the estimated melting temperature of 340 K.^{46} The metadynamics simulations used an initial hill height of 1 kJ/mol, a bias factor of 8, and a deposition rate of 2 ps. The sigma parameter of 0.2 was chosen based on the training simulation data in the folded state. The simulation frames were saved every 50 ps. See Table 2 of the supplementary material for the rest of the metadynamics parameters.

Over the 100 ns simulations, at least 5 out of 25 walkers naturally folded to the correct Chignolin topology, with the insets of Figs. 5(c) and 5(d) showing representative traces and structures. The supplementary material movie shows an example of such a folding event. Clearly, our *SVM*_{cv} is able to efficiently drive the simulations between the folded and unfolded ensembles, since we obtain multiple folding events in a total simulation time comparable to the training simulations in a single basin. We also used the time-independent free energy estimator^{11} to re-weight the simulations [Fig. 5(d)], showing, similar to previous studies,^{46,50} that the folded state is the dominant free energy state. Unfortunately, the flattened barrier [Fig. 5(d)] likely indicates that the transition-state region was over-biased during the course of our simulations, making kinetic estimates difficult.

## CONCLUSION

We believe that the power of our *SML*_{cv} method comes from automating the CV selection problem, providing a systematic way to include non-linearity and regularization, and minimizing the work needed to find starting points for enhanced sampling runs. Our results further open up significant opportunities both for benchmarking existing SML algorithms and for developing novel algorithms or loss functions inspired by statistical mechanics. For example, other SML techniques, such as the naïve Bayes classifier, might present interesting starting points for building classification schemes better suited for accelerated sampling. However, not all SML algorithms are amenable to being used as CVs. For example, while decision trees and random forest classifiers^{53,54} are useful for understanding molecular systems,^{54} their piecewise-constant, indicator-function based outputs are not differentiable, making their decision functions unsuitable for use as CVs.

Our proposed method (*SML*_{cv}) is intended to produce a first estimate of CVs from very limited data collected by locally sampling each basin and, as such, will likely not generate an ideal CV. In all likelihood, the model might accidentally introduce orthogonal degrees of freedom in either state, thereby slowing convergence. However, this problem exists for any of the existing methods in the literature.

While we have focused on pre-training and generation of CVs, we also believe that supervised machine learning based collective variables (*SML*_{cv}) might be excellent starting coordinates for further optimization via SGOOP^{21} or VAC.^{22} This would turn the entire process into an online learning setup where each additional round of simulation improves the dividing hyper-boundary. Additionally, these coordinates are likely to be transferable^{15,17} across related systems, which would make them useful for investigating drug binding/unbinding kinetics, mutational effects, modest forcefield effects, etc. However, when and where transfer learning might fail is an unsolved problem. Finally, we hypothesize that automated order parameter identification might make it easier to converge chain-of-states based methods such as the nudged elastic band or string based methods^{25,55–57} since the separating hyperplane can now be used to more efficiently guide the iterative minimum free energy path finding algorithms—similar to the kernel-SVM approach of Pozun *et al.*^{58} We hope that these results provide stimulating starting points for automatic CV optimization protocols, allowing modelers to focus more on the results of free-energy simulations rather than their initial design.

## SUPPLEMENTARY MATERIAL

See supplementary material for several additional figures which are comparisons of the *SML*_{cv} results against hand-picked collective variables for alanine dipeptide, results of 3-fold cross-validation for the given models, and results from using a logistic regression classifier’s decision function as a collective variable for accelerated sampling. Due to size restrictions, the supporting video has been uploaded online: https://www.youtube.com/watch?v=jgor4xbAai8 and https://youtu.be/D1cXDKrHMCA.

## ACKNOWLEDGMENTS

The authors would like to thank various members of the Pande lab for useful discussions and feedback on the manuscript. M.M.S. would like to acknowledge support from the National Science Foundation Grant No. NSF-MCB-0954714. This work used the XStream computational resource, supported by the National Science Foundation Major Research Instrumentation program (No. ACI-1429830). The authors would like to thank D. E. Shaw and D. E. Shaw Research (DESRES) for graciously providing the Chignolin folding trajectories. V.S.P. is a consultant and an SAB member of Schrodinger, LLC, and Globavir, sits on the Board of Directors of Apeel, Inc., Freenome, Inc., Omada Health, Patient Ping, and Rigetti Computing, and is a General Partner at Andreessen Horowitz.

All the models and code needed to reproduce the main results of this paper are available at https://github.com/msultan/SML_CV/.