Molecular dynamics simulations have the potential to provide atomic-level detail and insight to important questions in chemical physics that cannot be observed in typical experiments. However, simply generating a long trajectory is insufficient, as researchers must be able to transform the data in a simulation trajectory into specific scientific insights. Although this analysis step has often been taken for granted, it deserves further attention as large-scale simulations become increasingly routine. In this perspective, we discuss the application of Markov models to the analysis of large-scale biomolecular simulations. We draw attention to recent improvements in the construction of these models as well as several important open issues. In addition, we highlight recent theoretical advances that pave the way for a new generation of models of molecular kinetics.
I. INTRODUCTION
Simulations have the hope to shed light on many areas in chemical physics. For example, molecular dynamics (MD) simulations, which numerically integrate Newton's equations to simulate physical dynamics, can provide high-resolution physics-based models of emergent biological phenomena.1–4 While promising, there are three central challenges which limit the application of MD for studying questions in biophysics. The first is the development of simplified physical models (called force fields), which avoid the intractability of solving the full, electronic Schrödinger equation. Second, while simulations are required to use a discrete time step on the order of one femtosecond, processes that are characterized by slow, large-scale, collective motions, such as protein folding, protein-ligand binding, and conformational change can take milliseconds or longer to occur. This separation in timescales requires a simulation of length at least 1012 time steps in order to observe one such event, which is difficult using current hardware. Finally, the analysis of MD simulations is itself non-trivial since the result—a set of trajectories tracking the Cartesian coordinates of every particle—can contain millions of data points in tens of thousands of dimensions. Despite these challenges, a number of recent innovations have expanded the scope of molecular simulation. In fact, because of improvements in forcefield accuracy,5–10 and the development of novel computing platforms,11–15 we believe that quantitative analysis has increasingly become a limiting factor in the application of MD.16,17
With routine MD datasets now comprising terabytes of data, the direct visualization of raw MD trajectories is neither scalable nor quantitative. Instead, we suggest that MD trajectories should not be viewed as ends in and of themselves, but instead a means of parameterizing a quantitative statistical model of the structure and dynamics of the system of interest.
These models should have the following properties:
-
The model should be oriented toward quantitatively describing the long-timescale processes in the data.
-
The model should be suitably complex, capable of smoothly adapting to—rather than assuming—the structure of the data.
-
The model should be interpretable and provide answers to specific scientific questions that may be difficult to answer via experiment alone.
Towards this end, we survey recent developments in the theory and application of Markov modeling for the analysis of MD simulations, which have all of these desired properties.
II. LONG TIMESCALES AND THE TRANSFER OPERATOR
The transfer operator provides a theoretical foundation for the description of the slow dynamical processes in stochastic systems, and is the basis for the approaches discussed herein. Indeed, at its essence, the methods described below all work to build some sort of numerical model of the transfer operator. Here we highlight some important details; we refer the interested reader to Schütte, Huisinga, and Deuflhard.18
The fundamental approach we will take is to use computational methods to build an approximate representation of the transfer operator, , from molecular dynamics simulations in a way which is systematic, automated, and statistically rigorous. One major upshot of this approach is that it allows a wide range of simulation data to be useful, ranging from many relatively short trajectories (compared to the longest timescales of interest) to a few long trajectories. Below, we lay out the challenges, current status, and future of this paradigm.
III. METHODS
A. Markov state models
Though the theory is relatively simple, the construction of an MSM is not a solved problem. Given a set of MD simulations, we need to define a state decomposition, S, as well as estimate the transition probability matrix, T. Straightforward maximum likelihood estimators for T, given S exist,34,36 but the determination of S itself is more challenging. A wide variety of strategies have been proposed for constructing S, including grid-based discretization,27 and clustering with many algorithms36,37 and distance metrics.38–43 However, no consensus has emerged in the literature on the preferred strategy, largely because these clustering schemes rest on heuristic foundations built primarily by physical intuition. No solid theoretical justifications exist for choosing one of these schemes over another, especially when statistical errors in the parameterization of T are non-negligible.
On the other hand, Sarich, Noé, and Schütte44 showed that in the absence of statistical uncertainty in the parameterization of T, an MSM's error, measured in terms of the maximum possible difference between the true and MSM-predicted probability density at a later time, is bounded. The state decomposition-dependent component of this bound quantifies the difference between the transfer operator's continuous eigenfunctions, ψi(x), and the MSM's piecewise constant approximate eigenfunctions, given by the eigenvectors of T. This result implies that a “good” state decomposition is one that finely partitions phase space particularly in the regions where the dominant eigenfunctions, ψi(x), change rapidly, such as transition regions or energy barriers. Unfortunately, these barrier regions are precisely the areas that are most poorly sampled in MD. Finely subdividing these regions into many MSM states can lead to large statistical errors in the associated transition probabilities.
In practice, even the seemingly simple problem of deciding the correct number of states given a clustering algorithm and distance metric is difficult.39,45 When the number of states is low, the discretization in Eq. (7) is very coarse, which leads to systematic underestimation of the relaxation timescales.44,46 However, as the number of states increases, we have to estimate more state to state transition probabilities. With the amount of data fixed, increasing the number of parameters leads to a corresponding increase in the statistical uncertainty of the model. Some recent work, however, has made progress toward balancing these sources of errors by defining a likelihood function for evaluating and comparing models.39,45 These likelihood functions are useful for simple systems, but can be difficult to employ in most MSM applications.
To illustrate the difficulty of building a state space more concretely, consider a simulation of a protein and a ligand where we are interested in calculating the on and off rates of the binding reaction. In this simulation, the ligand is very mobile when it is not bound to the protein, and so a typical geometric clustering will produce many unbound states. These states contribute to the statistical error since we are essentially forcing the model to describe the diffusion of the ligand in water. However, since we are only interested in the binding reaction, we do not care about this diffusion, and could instead build a more statistically robust MSM by ignoring these motions in the unbound state. This behavior is not unique to protein-ligand simulations, and in fact has been an issue when analyzing protein folding simulations. Typically these models must use tens of thousands of small states—termed microstates—in order to accurately describe the folding reaction's timescale.47–49 Not only does this increase the statistical error, but it raises major challenges for the interpretation of the resulting model. One suggested resolution has been to lump these microstate models into coarser macrostate models by grouping together states which rapidly interconvert.32,33,50,51
In our view, this microstate-to-macrostate lumping strategy can be dangerous, and we discourage its use. The lumping procedure relies directly on the quality of the microstate model, which is purposely estimated subject to large statistical errors. For example, the PCCA and PCCA+ algorithms for lumping MSMs are designed to preserve the estimates of the slow eigenfunctions as identified by the microstate model,33,50 without regard to the fact that these estimates are, by design, approximated subject to large errors. Although other methods like BACE partially remedy this problem,32,52 in our view a more direct solution is to construct accurate and interpretable models from the start.
To be able to build simultaneously accurate and interpretable MSMs, clustering must be able to focus on the important (slowly equilibrating) degrees of freedom, while ignoring the unimportant ones. One suitable method is thus to perform dimensionality reduction before clustering, explicitly removing quickly decorrelating coordinates. With this in mind, Schwantes and Pande43 and Pérez-Hernández et al.42 introduced time-structure based Independent Component Analysis (tICA) for building MSMs. The tICA method identifies the slowest decorrelating linear projections in the observed data. This technique has made it possible to build macrostate MSMs from the very beginning that better resolve long-timescale processes.45,53 In addition, these models have identified new slow motions that eluded detection previously, such as near-native register shift dynamics in β-sheet proteins.43
B. A generalization of MSMs: Hidden Markov models
Part of the difficulty in constructing the MSM state space is that there is no single quantitative criterion for comparing alternative state spaces which is suitable for practical applications. Therefore, one approach is to generalize the MSM method such that the state decomposition can be optimized simultaneously with the transition probabilities in a unified way. Hidden Markov models (HMMs) are one such extension that relax the constraint that states correspond to a discrete partition of Ω. Like MSMs, HMMs characterize the system using a Markov jump process over a finite number of states; however in the HMM, these states are not assumed to be directly observed. Instead, each hidden state is equipped with an emission distribution defined on Ω, describing the per-state conditional probability of observing a conformation at a particular point in phase space. HMMs have been widely used in a variety of fields, including signal processing, speech, and bioinformatics.54,55
One caveat in applications of HMMs to analysis of MD is that without conditioning on , the HMM's observed process Xt is strictly non-Markovian when the emission distributions have nonzero overlap. While it is precisely these overlaps that make the model (particularly θi) optimizable, they also hamper direct interpretation of the HMM as a model for the Markovian transfer operator.57,58
IV. RECENT APPLICATIONS
Markov models have been heavily applied to many different types of systems, ranging from protein folding,48,49,59 RNA folding,60 protein-ligand binding,61 and protein conformational change.4,58,62 Below, we summarize two recent highlights from the literature, which demonstrate the unique utility of Markov modeling. In particular, these approaches allow researchers to connect many short MD trajectories into a single coherent description of the dynamics. Moreover, MSMs are able to provide an interpretable picture of the kinetics that can lead to specific scientific insight.
A. β2 adrenergic receptor (β2AR) activation
β2AR is a G-protein coupled receptor (GPCR) involved in many trans-membrane signaling pathways. GPCRs are critical in medicine, representing the target for 30% of top-selling drugs on the market.63 Despite extensive experimental studies,64–67 the mechanism of β2AR activation is elusive. Kohlhoff et al.68 used Google exacycle, a cloud computing platform that enables scientific calculations to be run on Google's datacenter infrastructure, to collect over 2 ms of aggregate MD simulations of β2AR in a lipid bilayer and explicit solvent using tens of thousands of individual MD trajectories, each with a mean length of ∼10 ns. Although no individual MD trajectory traversed the full activation pathway, Kohlhoff et al.68 were able to use MSMs to understand the full activation landscape using these many short trajectories. This is a particularly powerful property of MSMs, because collecting many short trajectories is substantially easier than generating a few long trajectories.11,69
In addition, Kohlhoff et al.68 simulated these trajectories in three waves. The starting conformations for the second and third wave were selected by uniformly sampling states from an MSM built on the previous waves. Known as adaptive sampling, this technique provides a framework for efficiently covering all of phase space with many, relatively short simulations, which is only possible because of the use of MSMs.70,71
Kohlhoff et al.68 used the MSM to show that the activation of β2AR proceeded along multiple paths. Many of these pathways featured metastable intermediates that displayed unique chemotype selectivity in docking studies, suggesting that computational drug design can be enhanced by considering important intermediates.
B. β-lactamase allosteric sites
While typical structure-based computational drug design pipelines use a single protein conformation to screen for potential therapeutics,72 proteins are dynamic systems and an ensemble of structures exist in solution. Accounting for this heterogeneity could expand the range of possible methods for modulating protein activity. With this in mind, Bowman and Geissler73 set out to find cryptic druggable binding sites in β-lactamase using MD simulations and MSMs.
Using one hundred microseconds of aggregate simulation time distributed over hundreds of trajectories, each no longer than 500 ns, Bowman and Geissler73 were able to characterize slow near-native conformational changes in β-lactamase. Their MSM was able to describe the kinetics of the opening and closing of various cryptic binding sites on distal regions of the protein. In particular, MSMs allowed the researchers to study the lifetimes and equilibrium opening probabilities of these pockets. They first observed that a known cryptic binding site was predicted to be open 53% of the time according to the MSM. Furthermore, they identified 50 novel sites, which were predicted to be open more than 10% of the time, suggesting new avenues for drug design.
V. DISCUSSION
While these results highlight that MSMs have been useful for analyzing MD simulations, the definition of the state space remains a substantial hurdle in MSM construction. Although recent work has provided significant improvements in this area, one way to side-step this challenge is to build a Markov model (i.e., provide an estimate of the transfer operator) without using a discrete state decomposition. Towards this end, Noé and Nüske74 and Nüske et al.75 established a variational principle that applies to the estimation of the transfer operator's eigenspectrum. This is analogous to the variational principle for the Hamiltonian in quantum mechanics and guarantees that, when the operator is discretized using a finite basis set, its eigenvalues are underestimated in the limit of vanishing statistical errors.
Given a basis set, the variational approach searches for linear combinations of basis set elements which approximate the transfer operator eigenfunctions by solving a particular generalized eigenvalue problem, or equivalently maximizing a series Rayleigh quotients that can be interpreted as autocorrelation functions. In fact, the MSM approach is a specific instance of this method when the basis set consists of indicator functions which partition Ω, but the method more broadly can be applied to any set of basis functions including those which have nonzero overlap. Furthermore, tICA is also an instance of this approach, with a basis set instead consisting of linear functions of the input coordinates. In the HMM, however, the process observed in Ω is not strictly Markovian, and thus its interpretation in terms of the Markovian transfer operator (or its eigenspectrum) is less straightforward.
This theoretical advance suggests a tension between two distinct but related perspectives for building simplified models of molecular kinetics:
-
Parameterize a probability distribution over trajectories.
-
Find the first m eigenfunctions of .
The HMM and tICA methods are certainly different, but to what extent are probabilistically- and variationally-inspired models distinct? These views are, at least, not always different since an MSM is both a probabilistic model and one that estimates the slowest eigenfunctions of . But, outside of MSMs, when is a probabilistic model more desirable than a variational model, or vice versa? Is one model missing something that can only be found in the other? If so, is it possible to unify these views with a model other than an MSM? We do not know the answers to these questions, but both perspectives have enriched our understanding of molecular kinetics, and we anticipate future developments will clarify the apparent tension and lead to new classes of Markov models.
VI. CONCLUSIONS
The future of Markov modeling is bright. Ten years ago, conventional MD simulations could only access timescales in the nanosecond regime.76 A revolution in Graphical Processing Unit (GPU) computing,13,69,77 however, has enabled researchers around the world to perform simulations that were once impossible. But, MD trajectories alone are not enough. Quantitative analysis, which can turn a simulation into scientific insight, is a necessary component of the research process that has often been taken for granted.
Indeed, the combination of improvements in simulation and analysis has allowed for simulations of phenomena on the hundreds of microseconds to millisecond timescale;17 as a high end GPU can today simulate over 100 ns/day for a 30 000 atom system, a cluster of 100 GPUs can simulate an aggregate of 10 μs/day, yielding a millisecond of aggregate simulation in three months. While this cluster cannot simulate one long millisecond trajectory in three months, MSMs allow the use of these 10-μs trajectories to describe phenomenon on the millisecond timescale.
Markov modeling is a technology that has been rapidly maturing for MD analysis. At least two open-source software packages allow non-experts to construct these models routinely,36,37 and tutorials on their use are available.78 Nevertheless, many challenges remain in their theory and application. In particular, parameter selection continues to be difficult—e.g., how fine a discretization best balances competing sources of error? Recent work has benefited from a synthesis of concepts from machine learning and mathematical physics, and we anticipate that the future development of Markov modeling will continue to be enhanced by a cross-pollination of ideas from these communities.
Looking to the next 10 years, we expect that these approaches will become more automated, with statistical methods replacing empirical tests of model quality and adaptive sampling approaches maturing to the point where they can automatically tackle a wide range of sampling challenges. Moreover, in 10 years, GPUs (which double in speed every year) will likely be 1000 times more powerful. The combination of these technologies suggests that MSM simulations on the millisecond timescale will become routine and require only a day on a modest 10-GPU cluster, and simulations on the seconds timescale will become within reach. At that point, the field of bimolecular simulation will likely be in a golden period, where sampling is no longer the principle challenge for many biological questions of interest.
ACKNOWLEDGMENTS
The authors would like to acknowledge Thomas J. Lane for useful discussions during the preparation of this manuscript. In addition, the authors acknowledge funding from NSF-MCB-0954714 and NIH-R01-GM062868.