A data-driven modeling scheme is proposed for the conformational dynamics of biomolecules based on molecular dynamics (MD) simulations and experimental measurements. In this scheme, an initial Markov State Model (MSM) is constructed from MD simulation trajectories, and the MSM parameters are then refined using experimental measurements through machine learning techniques. The second step can reduce the bias in MD simulation results caused by inaccurate force-field parameters. Either time-series trajectories or ensemble-averaged data can serve as the training data set in this scheme. Using a coarse-grained model of a dye-labeled polyproline-20, we compare the performance of machine learning estimations from the two types of training data. Machine learning from time-series data can provide the equilibrium populations of conformational states as well as their transition probabilities. It infers hidden conformational states more robustly than learning from ensemble-averaged data, although it has limitations in estimating the transition probabilities between minor states. We discuss how to apply the machine learning scheme to various experimental measurements, including single-molecule time-series trajectories.

## I. INTRODUCTION

Conformational dynamics of biomolecules are closely related to their molecular functions.^{1} Regulating the dynamics of biomolecules is essential in many biological reactions. Well-known examples are unidirectional movements of molecular motors,^{2} active transport of pumps or transporters,^{3} specific responses of enzymes,^{1} and protein-protein interactions in signal transduction pathways.^{4} To observe conformational dynamics directly in atomistic detail, molecular dynamics (MD) simulation has been a powerful computational tool. In such simulations, the classical Newtonian equations of motion are solved numerically using atomic forces derived from molecular force fields.^{5,6} The accuracy of MD simulations largely depends on the quality of the force-field parameters, since they are determined empirically from experimental data or *ab initio* quantum calculations of small compounds. The parameters have been improved continuously and can provide reliable data for the native states of proteins.^{7,8} However, discrepancies in MD simulation results due to inaccurate force-field parameters have been noted, particularly for unstructured proteins such as intrinsically disordered proteins (IDPs) or denatured states.^{9}

To resolve issues related to molecular force fields, system-dependent and data-driven approaches have been developed by several research groups.^{10–15} Roux and Weare recently showed that the maximum entropy principle is satisfied by adding restraint potentials to incorporate ensemble-averaged experimental data.^{11} Perez and co-workers proposed a Bayesian approach to incorporate experimental data into MD simulations.^{15} Recently, we proposed a machine-learning scheme to refine conformational dynamics by data-driven modeling after MD simulations (Fig. 1).^{16} This scheme consists of two steps. In the first step, which can be regarded as supervised machine learning,^{17} extensive MD simulations are performed to construct an initial Markov State Model (MSM).^{18} The MSM parameters are then refined against time-series trajectory data through an unsupervised machine learning algorithm.^{17,19} The folding dynamics of the WW domain protein based on the refined MSM were in good agreement with single-molecule Förster resonance energy transfer (FRET) data^{20} as well as mutagenesis data (Φ-value analysis),^{21} the latter of which were independent of both the MD simulations and the FRET experiments.

The framework of our machine learning approach is generally applicable to various types of experimental data: it is not restricted to time-series trajectory data and is also suitable for ensemble-averaged measurements. Possible advantages of using time-series trajectory data in a machine learning approach are as follows:^{16,22} (i) more latent states can be uncovered by inferring states from temporal information, and (ii) dynamic parameters can be estimated from time-series data in a straightforward way.^{23} The first point has also been discussed by Komatsuzaki and co-workers.^{24,25} To date, however, there has been no numerical comparison of the performance of model parameter refinement from time-series trajectory data versus that from ensemble-averaged measurements. Here, we compare the performances of the two refinements using simulations of a coarse-grained (CG) model of a dye-labeled polyproline-20,^{26–28} which was used in our previous work on data assimilation.^{22} We performed two MD simulations: one based on the “correct” force field and one based on an “inaccurate” force field whose torsional-angle parameters are scaled by a factor of 0.6. Two MSMs were then constructed from the individual simulation data sets. The MSM from the “inaccurate” simulation was refined using either the time-series trajectory data or the ensemble-averaged data obtained from the “correct” MSM. We discuss the performance of the MSM refinement and its applicability to actual experimental data.

## II. METHODS

### A. Models and simulations

A CG model of a FRET dye-labeled polyproline-20 was used for evaluating the performances of our machine learning scheme (shown in Fig. 2). The details of the model were described in our previous study.^{22} The polyproline-20 is represented with the Cα-based Gō-like model developed by Karanicolas and Brooks.^{29} Each amino-acid residue is treated as a single bead centered on the Cα-atom position. Most potential energy parameters are determined from a reference structure (the all-*trans* form shown in Fig. 2), while sequence-specific parameters are introduced in the torsional terms and interactions between native contact pairs. This model allows conformational changes between all-*trans* and *cis*-isomers during simulation. The lifetimes of all-*trans* isomers are typically longer than those of *cis*-isomers. In order to simulate ensemble averages and single-molecule FRET-like time-series data, we explicitly attached FRET dyes to the terminal residues. The attached dyes are represented with the “Gō-dye” model developed by Best and co-workers.^{30} This model has four beads for each linker, and each dye has three beads bonded together to mimic the dipole of an Alexa dye.

We carried out two MD simulations of the peptide using different sets of potential energy parameters. One set is the same as the original Karanicolas and Brooks Gō-model; this parameter set was used as the reference to construct the “correct” MSM. The other set was created by systematically scaling the torsional potential terms by *f*_{scale} = 0.6.^{22} Scaling the torsional potential by such a small factor decreases the energy barriers between the all-*trans* and *cis*-isomers and also reduces the energy offsets between them, yielding conformational dynamics that differ from those of the original parameter set (*f*_{scale} = 1.0). We used the MD simulation data with this parameter set (*f*_{scale} = 0.6) for constructing the initial MSM of the “inaccurate” dynamics.

All the simulations were performed with the GENESIS software package.^{31,32} For temperature control (300 K), the Langevin thermostat was used with the Brooks-Brünger-Karplus integrator.^{33} All bonds were constrained at their native length using SHAKE.^{34} A low friction coefficient (1 ps^{−1}) was used to accelerate the dynamics with a time step of 20 fs. For the system with *f*_{scale} = 1.0, a long simulation of 10^{9} steps was performed, while another simulation of 10^{8} steps was conducted for the system with *f*_{scale} = 0.6.

### B. Construction of the initial Markov state models

We first constructed an MSM from the MD simulation data of the system with *f*_{scale} = 1.0 (“correct” dynamics). Multivariate data composed of distances between proline beads were reduced to a 20-dimensional space by applying time structure-based Independent Component Analysis (tICA)^{35,36} to capture slow motions related to conformational transitions. The simulation data were clustered into four states by *k*-means clustering in the reduced space. Then, the transition probabilities **T**(*τ*) = {*T*_{ij}(*τ*)} between the states were determined using the reversible maximum-likelihood estimator.^{37} A lag time of τ = 1 ns (5 × 10^{4} simulation steps) was used.
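As a concrete illustration, the count-based part of such an estimation can be sketched as follows. This minimal Python sketch uses a toy two-state trajectory and a simple row-normalized estimator rather than the reversible maximum-likelihood estimator used in this work (the function name and pseudocount are ours):

```python
import numpy as np

def estimate_transition_matrix(dtraj, n_states, lag):
    """Row-normalized transition matrix from a discrete state
    trajectory at a given lag time (simple, non-reversible MLE).
    The detailed-balance constraint of the reversible estimator
    is omitted here for brevity."""
    counts = np.zeros((n_states, n_states))
    for t in range(len(dtraj) - lag):
        counts[dtraj[t], dtraj[t + lag]] += 1.0
    # small pseudocount keeps empty rows well-defined
    counts += 1e-8
    return counts / counts.sum(axis=1, keepdims=True)

# toy 2-state trajectory (illustrative only)
dtraj = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0])
T = estimate_transition_matrix(dtraj, n_states=2, lag=1)
```

Each row of the returned matrix is a probability distribution over the next state, so the rows sum to one by construction.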

By using the MSM with *f*_{scale} = 1.0, we simulated single-molecule FRET-like time-series trajectory data and ensemble-averaged data. We first calculated the empirical normalized distribution *h*_{i}(*r*) of the distance *r* between the donor and acceptor FRET dyes for each MSM state *i* [shown in Fig. 3(d)]. Throughout this study, these distributions {*h*_{i}(*r*)} are used as reference mappings from MSM state *i* to donor-acceptor distance *r*. To emulate FRET-like donor-acceptor distance data, we generated trajectories of the states by a stochastic simulation of the MSM. Specifically, a random number between 0 and 1 was drawn at every step to determine, according to **T**(*τ*), which state the system jumps to in the next step. Then, along the trajectory of states, the donor-acceptor distance *r* was randomly drawn from the empirical normalized distribution *h*_{i}(*r*) of the current state. The generated time-series trajectory of the donor-acceptor distance {*r*_{1}, …, *r*_{N}} and its distribution *H*(*r*) were used as the training data sets (time-series and ensemble data, respectively) in the subsequent machine learning.
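The stochastic simulation described above can be sketched as follows. This is a minimal Python sketch with a hypothetical two-state model and three distance bins; the function names and all numbers are illustrative, not the fitted polyproline MSM:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_msm(T, p0, n_steps, rng):
    """Generate a state trajectory: the next state is drawn from
    the row of T corresponding to the current state."""
    states = np.empty(n_steps, dtype=int)
    states[0] = rng.choice(len(p0), p=p0)
    for t in range(1, n_steps):
        states[t] = rng.choice(T.shape[0], p=T[states[t - 1]])
    return states

def emit_distances(states, h, bin_centers, rng):
    """Map each state to a donor-acceptor distance drawn from that
    state's binned empirical distribution h[i, m]."""
    return np.array([rng.choice(bin_centers, p=h[i]) for i in states])

# toy 2-state model with well-separated distance distributions
T = np.array([[0.9, 0.1], [0.2, 0.8]])
p0 = np.array([0.5, 0.5])
h = np.array([[0.8, 0.2, 0.0], [0.0, 0.2, 0.8]])
bin_centers = np.array([1.0, 2.0, 3.0])
states = simulate_msm(T, p0, 200, rng)
r = emit_distances(states, h, bin_centers, rng)
```

The array `r` plays the role of the training time-series {*r*_{1}, …, *r*_{N}}, and a histogram of `r` plays the role of the ensemble data *H*(*r*).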

We also constructed an “inaccurate” MSM from the MD simulation data of the system with *f*_{scale} = 0.6. This MSM is treated as the initial MSM to be refined with our machine-learning scheme. Distances between proline beads of the system with *f*_{scale} = 0.6 were projected onto the 20-dimensional space spanned by the same time structure-based independent component (tIC) modes as those of *f*_{scale} = 1.0. For the definition of the MSM states, we also reused the same cluster centers as those of *f*_{scale} = 1.0. The same state definition was employed to eliminate external factors (such as inconsistencies between the empirical distributions of the donor-acceptor distance) that could affect the performance of the estimates. The transition probabilities **T**(*τ*) were estimated using the same lag time τ as for *f*_{scale} = 1.0.

### C. Refinement of Markov state models via machine learning

In learning from the distribution data of the donor-acceptor distance *H*(*r*), we use the conventional maximum likelihood method.^{17} With this method, we estimate the equilibrium probabilities of the MSM states, *p*_{i} (*i* = 1, …, *I*), by maximizing the likelihood function, i.e., the probability of observing the training distribution data *H*(*r*) given the equilibrium probabilities *p*_{i}. The probability of observing a donor-acceptor distance *r* is
$$ p(r) \;=\; \sum_{i=1}^{I} p(r|i)\, p_i \;=\; \sum_{i=1}^{I} h_i(r)\, p_i, \qquad (1) $$

where *p*(*r*|*i*) is the conditional probability of observing *r* given state *i* and *h*_{i}(*r*) is the empirical normalized distribution of *r* for state *i* (already given from the MD data with *f*_{scale} = 1.0).

For numerical calculation, let us discretize the donor-acceptor distance *r* by *M* bins and represent the training distribution data *H*(*r*) and the empirical normalized distribution *h*_{i}(*r*) by binned histograms, *H*_{m} and *h*_{im} (*i* = 1, …, *I* and *m* = 1, …, *M*), respectively. Then, the likelihood function *L*_{histogram}(*p*) of observing *H*(*r*) ∼ *H*_{m} is given by a multinomial distribution,
$$ L_{\mathrm{histogram}}(p) \;=\; \frac{N!}{\prod_{m=1}^{M} H_m!} \prod_{m=1}^{M} \left[ \sum_{i=1}^{I} h_{im}\, p_i \right]^{H_m}, \qquad (2) $$

where $N=\sum_{m=1}^{M} H_m$ is the total number of counts in the histogram. In this study, *M* = 100 bins were used for discretizing the donor-acceptor distance *r*. As mentioned above, the training distribution data *H*(*r*) ∼ *H*_{m} were generated with a stochastic simulation based on the MSM with *f*_{scale} = 1.0. The log of the likelihood function was numerically maximized in an iterative manner with the expectation-maximization-like algorithm.^{17} After the convergence of the log likelihood function, the values of the equilibrium probabilities *p*_{i} were taken as the estimates.
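The iterative maximization described here amounts to an expectation-maximization update of the mixture weights *p*_{i} with the per-state histograms *h*_{im} held fixed. A minimal Python sketch on a toy two-state problem with known weights (the function name and all numbers are illustrative):

```python
import numpy as np

def em_equilibrium_probs(H, h, n_iter=500):
    """EM-like iteration maximizing the multinomial log-likelihood
    of the observed histogram H (length M) under the mixture
    sum_i p_i * h[i, m], with per-state histograms h held fixed.
    Returns the estimated equilibrium probabilities p."""
    I = h.shape[0]
    p = np.full(I, 1.0 / I)            # uniform initial guess
    N = H.sum()
    for _ in range(n_iter):
        mix = p @ h                     # predicted bin probabilities
        # responsibility of state i for bin m
        gamma = (p[:, None] * h) / np.maximum(mix, 1e-300)
        # M-step: reweight by observed counts
        p = (gamma * H).sum(axis=1) / N
    return p

# toy problem: two states with true weights 0.7 / 0.3
h = np.array([[0.7, 0.3, 0.0], [0.0, 0.3, 0.7]])
p_true = np.array([0.7, 0.3])
H = 10000 * (p_true @ h)                # ideal (noise-free) histogram
p_est = em_equilibrium_probs(H, h)
```

With noise-free data the iteration recovers the true weights; with finite sampled histograms, the estimates fluctuate around them, which is the overfitting behavior discussed in Sec. III C.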

In learning from the time-series data of the donor-acceptor distance {*r*_{1}, …, *r*_{N}}, we use hidden Markov modeling.^{17,19} With this method, we estimate the transition probabilities **T**(*τ*) = {*T*_{ij}(*τ*)} of the MSM by maximizing the likelihood function, i.e., the probability of observing the training time-series data {*r*_{1}, …, *r*_{N}} given the transition probabilities **T**(*τ*). In the hidden Markov model, it is assumed that the observed donor-acceptor distance *r*_{t} depends only on the MSM state at the same time step. Since the time evolution of the MSM state is described by the transition probabilities **T**(*τ*), the likelihood function is written as
$$ L \;=\; \sum_{i_1, \ldots, i_N} \left[ \prod_{t=1}^{N} p(r_t | i_t) \right] p(i_1, \ldots, i_N), \qquad (3) $$

where *i*_{t} denotes the MSM state at time *t* and *p*(*i*_{1}, …, *i*_{N}) is the joint probability of the sequence of MSM states.

For numerical calculation, let us discretize the time-series data {*r*_{1}, …, *r*_{N}} of the donor-acceptor distance into a sequence of binned values {*m*_{1}, …, *m*_{N}}, where $m_t \in \mathbb{Z}$ and 1 ≤ *m*_{t} ≤ *M*. The likelihood function *L*_{time-series}(**T**(*τ*)) of observing {*m*_{1}, …, *m*_{N}} is then given by
$$ L_{\text{time-series}}(\mathbf{T}(\tau)) \;=\; \sum_{i_1, \ldots, i_N} \left[ \prod_{t=1}^{N} h_{i_t m_t} \right] p_{i_1} \prod_{t=2}^{N} T_{i_{t-1} i_t}(\tau), \qquad (4) $$

Here, the same *M* = 100 bins were used for discretizing the donor-acceptor distance *r*. The training time-series data {*r*_{1}, …, *r*_{N}} were generated with a stochastic simulation based on the MSM with *f*_{scale} = 1.0. The log of the likelihood function was numerically maximized in an iterative manner with the Baum-Welch algorithm^{19} imposing the detailed balance condition.^{38,39} After the convergence of the log likelihood function, the values of the transition probabilities **T**(*τ*) were taken as the estimates. The equilibrium probabilities *p*_{i} of the MSM states are readily obtained as the eigenvector of **T**(*τ*) corresponding to the largest eigenvalue, thanks to the detailed balance condition. By using the equilibrium probabilities *p*_{i}, we compare the performances of the estimates from two different data sets.
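The final step above, extracting the equilibrium probabilities from the leading eigenvector of **T**(*τ*), can be sketched as follows (a minimal Python sketch; the function name and the toy transition matrix are ours):

```python
import numpy as np

def stationary_distribution(T):
    """Equilibrium probabilities as the left eigenvector of T
    for eigenvalue 1, normalized to sum to one."""
    evals, evecs = np.linalg.eig(T.T)   # left eigenvectors of T
    k = np.argmin(np.abs(evals - 1.0))  # eigenvalue closest to 1
    pi = np.real(evecs[:, k])
    return pi / pi.sum()

# toy 2-state transition matrix (illustrative only)
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = stationary_distribution(T)
```

Here `pi` satisfies the stationarity condition `pi @ T == pi` up to floating-point precision; for a detailed-balanced **T**(*τ*), this eigenvector is real and non-negative.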

It is noted that we used the same number of snapshots (10^{4} stochastic simulation steps of the MSM with *f*_{scale} = 1.0) for both training data sets in comparing the performances of the two refinements. For rigorous comparison of the performances, statistics (averages and standard deviations) of machine learning estimates were calculated by using 1000 independent training data sets.

### D. Toolbox for machine learning

For the analysis of MD simulation data, including the MSM constructions and the refinement of MSM via machine learning, we have developed a data analysis platform called MDToolbox. MDToolbox is a MATLAB/Octave toolbox for statistical and machine learning analysis of MD simulation data of biomolecules. It consists of a collection of functions: machine learning functions for MSM construction/refinement, input-output (I/O) for trajectory/coordinate/topology files used for MD simulation, least-squares fitting of structures, free energy profile estimations from scattered data, weighted estimates (weighted histogram analysis method^{40} and multistate Bennett acceptance ratio^{41} methods) from biased data, and dimensional reductions (principal component analysis^{42–44} and tICA^{35,36}). All the analyses in this study were performed with MDToolbox. MDToolbox can be downloaded from a publicly available GitHub repository (https://github.com/ymatsunaga/mdtoolbox).

## III. RESULTS

### A. Molecular dynamics simulation

Figure 3(a) shows a scatter plot of the MD simulation data with a scaling factor of *f*_{scale} = 1.0, projected onto the plane of the 1st and 2nd tICs. Four major conformational states, identified by *k*-means clustering in the space of the 20 slowest tICs, are shown in different colors. Figure 3(c) shows representative structures of the dye-labeled polyproline-20 in the four states. All-*trans* conformations without any kink are observed in state 1, and compact isomers with two kinks are found in state 2. States 3 and 4 contain intermediate structures between the all-*trans* form and the compact *cis*-isomers; the polyproline-20 in these states has a single kink at different positions along the backbone. In total, 83 665 transitions between the different states are found in the MD simulation with *f*_{scale} = 1.0.

In the scatter plot shown in Fig. 3(b), the MD simulation data with *f*_{scale} = 0.6 were projected onto the same tIC plane as in Fig. 3(a). We clustered the MD snapshots into the same four conformational states. In Figs. 3(a) and 3(b), corresponding states are shown in the same colors. In this simulation, 57 346 transitions between the four states are found. Due to the scaled torsional parameters, the observed free-energy barriers between state 1 (all-*trans*) and the other states (*cis*-isomers) are lowered, and a larger population of the *cis*-isomer states is obtained in the MD simulation with *f*_{scale} = 0.6.

In Fig. 3(d), we show the distribution $hi(r)$ of the donor-acceptor distances for each state $i$, which are calculated from the MD simulation with *f*_{scale} = 1.0. The distribution of the distances for state 1 is clearly distinguishable from that for state 2. By contrast, the distributions for states 3 and 4 are similar, suggesting that conformational differences between states 3 and 4 may be difficult to estimate from the distance data. Similar situations could happen for the ensemble-averaged experimental data in discriminating conformational states.

### B. Markov state models

The implied time scales of the MSMs constructed from the MD simulations with *f*_{scale} = 1.0 and 0.6 are shown in Figs. 4(a) and 4(b), respectively. The time scales converge well at the lag time of τ = 1 ns (the value used in both MSMs), supporting the Markovian assumption. These time scales are also comparable to the decay time of the autocorrelation function of the end-to-end distance calculated directly from the MD data.
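For reference, the implied time scales are computed from the eigenvalues λ_{i}(τ) of the transition matrix as *t*_{i} = −τ/ln λ_{i}(τ). A minimal Python sketch with a hypothetical two-state matrix (the function name and numbers are ours):

```python
import numpy as np

def implied_timescales(T, lag):
    """Implied time scales t_i = -lag / ln(lambda_i) from the
    eigenvalues of the transition matrix, slowest first, excluding
    the stationary eigenvalue lambda_1 = 1."""
    evals = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    return -lag / np.log(evals[1:])

# toy 2-state transition matrix; eigenvalues are 1 and 0.85
T = np.array([[0.95, 0.05],
              [0.10, 0.90]])
ts = implied_timescales(T, lag=1.0)   # in units of the lag time
```

Convergence of these time scales with increasing lag time is the standard diagnostic for choosing τ, as done in Figs. 4(a) and 4(b).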

Graphical representations of the MSMs constructed from the MD simulations with *f*_{scale} = 1.0 and 0.6 are shown in Figs. 4(c) and 4(d), respectively. Due to the scaling factor in the torsional energy terms, the transition probabilities and equilibrium probabilities of the MSM with *f*_{scale} = 0.6 deviate from those with *f*_{scale} = 1.0. While the transition probabilities out of state 1 (all-*trans*) are rather small in the MSM with *f*_{scale} = 1.0, they increase about 10-fold with *f*_{scale} = 0.6. In accordance with this change, the equilibrium probability of state 1 decreases from 0.91 to 0.59 in the MSM with *f*_{scale} = 0.6, while the probabilities of the *cis*-isomer states increase. In both MSMs, transitions between states 2, 3, and 4 are relatively rare, so transitions mainly occur between state 1 and the other states. Therefore, to correct the MSM parameters of *f*_{scale} = 0.6 toward those of *f*_{scale} = 1.0, the transition probabilities out of state 1 need to be refined in the next step of our machine learning.

### C. Comparison of two refinements

Figure 5(a) shows the distribution and time-series data of the donor-acceptor distance generated with the MSM with *f*_{scale} = 1.0. These data were used as training data for refining the MSM with *f*_{scale} = 0.6. We generated 1000 independent training sets and compared the averages and standard deviations in the estimated MSM parameters.

Figure 5(b) shows the averages and standard deviations of the estimated equilibrium probabilities using the different types of training data. The estimated probabilities of all the states are very close to the correct values in both cases. For state 2, however, the standard deviation of the estimate from the distribution data is larger than that from the time-series data, and the same tendency is seen for states 3 and 4. The main cause of the large errors is overfitting to each training distribution of the donor-acceptor distance: the estimated model parameters changed significantly from one training data set to another. These changes happened frequently for states 3 and 4 because the overlapping distributions (or degeneracy) of the donor-acceptor distances between these states [Fig. 3(d)] make the maximization of Eq. (2) very sensitive to the individual training data.

The estimated transition probabilities from the training time-series data are shown in Fig. 5(c) [a graphical representation is also given in Fig. 4(e)]. The transitions from the all-*trans* state (state 1) to the *cis*-isomers (states 2, 3, and 4) and the self-transition of state 1 are well estimated with small relative errors because these transition events occur frequently in the training time-series data. Considering that the probability fluxes (*f*_{ij} = *p*_{i}*T*_{ij}) from the all-*trans* state are dominant for the *cis*-isomers, the accurate estimates of the transitions from the all-*trans* state are the main contributors to the robust estimates of the equilibrium probabilities of the *cis*-isomers. On the other hand, the estimates of the transition probabilities between the *cis*-isomers have relatively large errors, indicating limitations in estimating transition probabilities between minor states. However, these errors have little impact on the estimates of the equilibrium probabilities because the probability fluxes between the *cis*-isomers are very small compared with those from the all-*trans* state.
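The role of the probability fluxes can be illustrated numerically. In the following Python sketch, the three-state model is a toy stand-in (not the fitted polyproline MSM): state 0 plays the role of the dominant all-*trans* state, and the fluxes out of it dominate those between the minor states:

```python
import numpy as np

# toy equilibrium probabilities and transition matrix (illustrative):
# state 0 is dominant, as the all-trans state is here
p = np.array([0.90, 0.06, 0.04])
T = np.array([[0.96, 0.03, 0.01],
              [0.30, 0.65, 0.05],
              [0.20, 0.10, 0.70]])

# probability fluxes f_ij = p_i * T_ij
F = p[:, None] * T

# flux between minor states (1 <-> 2) vs. flux out of state 0
flux_minor = F[1, 2] + F[2, 1]
flux_from_major = F[0, 1] + F[0, 2]
```

Because `flux_minor` is tiny compared with `flux_from_major`, even large relative errors in the minor-state transition probabilities barely perturb the stationary balance, consistent with the robustness of the equilibrium-probability estimates.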

## IV. DISCUSSION

The maximum entropy method^{10–13} and Bayesian statistics^{15} have recently been utilized to connect molecular simulations with experimental observations. In those approaches, restraint potentials are introduced to match the conformational ensembles generated by the simulations to the experimental averages or distributions. These approaches are not well suited to analyzing single-molecule time-series trajectory data because it is impractical to follow the entire data sequence by introducing restraint potentials in a series of molecular simulations. Moreover, the large time-scale gap between experiments and atomistic simulations limits the applicability of data assimilation based on the sequential Monte Carlo method^{22} to the analysis of single-molecule experimental data. In this study, we propose a two-step machine learning framework: the construction of an initial MSM from extensive MD simulations and the refinement of the MSM parameters through an unsupervised machine learning algorithm. This approach is applicable not only to ensemble-averaged data but also to single-molecule time-series data in a straightforward way. We can thus compare the performance of MSM refinement from two different types of experimental data.

For fairness in the performance comparison, the same number of snapshots was used for both training data sets. We have shown that learning from time-series data is superior to learning from ensemble-averaged data in the case of the dye-labeled polyproline-20, although there are large errors in the transition probabilities between minor states. In actual experiments, however, ensemble measurements provide observations with higher resolution and less noise, whereas single-molecule measurements often suffer from low signal-to-noise ratios. The bleaching of FRET dyes makes it difficult to measure long photon trajectories in single-molecule FRET experiments. Considering these technical issues, many independent data sets collected from repeated experiments are required for our machine learning approach from time-series trajectory data. It is also possible to combine the time-series trajectory data with the ensemble-averaged data in our framework; for this purpose, the likelihood functions in Eqs. (2) and (4) should be multiplied.

An important future challenge is to increase the number of states in the MSM. Although only four states were treated in this study, microscopic states composed of thousands of clustered conformations would provide more accurate dynamics and significant structural insights. A major potential problem in this direction is overfitting. Generally, overfitting is likely when the amount of experimental time-series trajectory data is smaller than the number of parameters in a model. It is particularly likely for the transition probabilities of the MSM, since the number of parameters scales as the square of the number of discrete states. To avoid overfitting, an appropriate Bayesian prior in the likelihood function of the hidden Markov modeling (https://github.com/bhmm) and maximum entropy^{45} or caliber^{46} corrected Markov modeling could be formulated for both time-series and ensemble-averaged data sets.

## ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their comments and suggestions. Computational resources were provided by HOKUSAI GreatWave in the RIKEN Advanced Center for Computing and Communication and K computer in the RIKEN Advanced Institute for Computational Science through the HPCI System Research project (Project ID Nos. hp160022, hp170137, and ra000009). This research has been funded by the RIKEN pioneering project “Dynamic Structural Biology” (to Y.S.), strategic programs for innovation research, “Computational life science and application in drug discovery and medical development” (to Y.S.), “Novel measurement techniques for visualizing live protein molecules at work” (Grant No. 26119006 to Y.S.), JST CREST on “Structural Life Science and Advanced Core Technologies for Innovative Life Science Research” (Grant No. JPMJCR13M3 to Y.S.), and JST PRESTO (Grant No. JPMJPR1679 to Y.M.).