I. INTRODUCTION
The Journal of Chemical Physics article collection on Markov Models of Molecular Kinetics (MMMK) features recent advances in developing and using Markov State Models (MSMs)1–6 in atomistic molecular simulations and related applications—see Refs. 7–10 for recent MSM reviews. MSMs have been an important driving force in molecular dynamics (MD), as they facilitate divide-and-conquer integration of short, distributed MD simulations into long-time scale predictions and they are conceptually simple and provide readily interpretable models of kinetics and thermodynamics.
Most MSM estimation approaches proceed by a sequence, or pipeline, of data processing steps that is also represented by MSM software packages11–13 and typically includes the following:
Featurization: The MD coordinates are transformed into features, such as residue distances, contact maps, or torsion angles,11,12,14,15 which form the input of the MSM analysis.
Dimension reduction: The dimension is reduced to much fewer (typically 2–100) slow collective variables (CVs).16–26 The resulting coordinates may be scaled in order to embed them in a metric space whose distances correspond to some form of dynamical distance.27,28
Discretization: The space may be discretized by clustering the projected data,4,7,11,29–33 typically resulting in 100–1000 discrete “microstates.”
MSM estimation: A transition matrix or rate matrix describing the transition probabilities or rate between the discrete states at some lag time τ is estimated.5,6,34,35
Coarse-graining: In order to get an easier interpretable kinetic model, the MSM from step 5 is often coarse-grained to a few states.36–44
Some methods skip or combine some of these steps, and novel machine learning methods attempt to integrate most or all of them in an end-to-end learning framework.
Key in much of the methodological progress in Markov modeling has been the mathematical theory of conformation dynamics pioneered by Schütte1 and further developed by many contributors. This theory models the dynamics of molecules by a Markov propagator that describes how an ensemble of molecules ρ0 evolves in a time step τ,
The MSM transition matrix is a discrete version of . Discretization in high-dimensional spaces is difficult to impossible, and so it is important to understand the structure underlying these dynamics in order to make the MSM estimation problem feasible. If the dynamics are in equilibrium, this propagation can further be approximated by a sum of processes ψi that relax the initial distribution toward the equilibrium distribution with characteristic time scales ti,
As decays exponentially fast in the time step τ, only few terms are needed for Eq. (1) to be an accurate description if we focus on the long-time dynamics, i.e., the kinetics. The key insight from this theory is no less that Markov modeling is possible even for complicated and very high-dimensional molecular systems: We cannot sample or discretize truly high-dimensional spaces, but we can do that for metastable molecular systems because we are ultimately only interested in the low-dimensional manifold spanned by a few eigenfunctions ψi of the Markov operator. Characterizing this manifold more compactly than by modeling all relevant eigenfunctions ψi explicitly is the subject of current research.45
An important cornerstone for improving MSM estimators, developing new ones and turning MSM estimation into generic machine learning problem that can be combined with kernel machines or neural networks, is the development of variational optimization methods. The variational approach for conformation (VAC) dynamics16,17 shows that eigenvalues of MSMs [the same is true for other linear Markovian models such as time-lagged independent component analysis (TICA)] systematically underestimate time scales ti and eigenvalues and define a variational score—essentially the sum of eigenvalues of an estimated Markov model—that ought to be maximized to optimally approximate the unknown eigenvalues and eigenfunctions in (1). The variational formulation is key to many contributions in the MMMK collection and remains to be the subject of further development and application, e.g., in the context of MSM hyperparameter optimization.29,46
While the VAC only describes the scenario of equilibrium dynamics, i.e., where our dynamics have a unique equilibrium distribution and obey detailed balance, much recent research has focused on nonequilibrium Markov models.47–52 While nonequilibrium MSM studies are still in their infancy, several theoretical principles are known in order to make progress here. An important framework for the description of such processes is that of nonequilibrium work as covered by the Jarzynski fluctuation theorem.53 From a machine learning and optimization perspective, we can replace the eigenvalue decomposition in (1) with a singular value decomposition of the operator whose components can be approximated with the variational approach of Markov processes (VAMPs).51 The VAMP is exploited in several contributions in the MMMK collection.
II. FEATURE SELECTION
One step in MSM estimation that had not yet undergone systematic analysis or optimization is the selection of features used as an input. For solvated molecules, rototranslationally invariant features were usually chosen based on what works best for a given application—including intramolecular distances, angles, contact matrices, or features implicitly defined by pairwise metrics, such as minimal root mean square distance. The VAC has previously been invoked to define variationally optimal features for short peptides, in the spirit of defining optimized basis sets in quantum chemistry.54 In the MMMK collection, Scherer et al.55 proposed to use the VAMP to variationally score different candidates of features for a given MD analysis task. Considering a large list of candidate features and all 12 fast-folding protein simulations published by DESRES,56 the authors of Ref. 55 concluded that a combination of residue-residue contact signals that decay exponentially in the distance and backbone torsions performs best for protein folding.
III. SLOW COLLECTIVE VARIABLES
A major leap forward in MSM construction was the finding that machine-learning methods that identify a manifold of slow collective variables (CVs),26 such as the time-lagged independent analysis (TICA) method,20 led to superior MSMs.18,19,57 Intuitively, this success is due to the fact that MSMs aim at modeling the kinetics between metastable states, and first reducing the dimension to the manifold of slow (kinetic) processes makes subsequent geometric operations such as clustering much simpler and faster than directly working in a high-dimensional space. Theoretically, these methods can indeed be derived from the VAC16–18 and thus be showed to variationally approximate the eigenfunctions of the Markov propagator (1).
Several papers in the MMMK collection develop this approach further. Karasawa et al.58 proposed the extension to relaxation mode analysis (RMA),59 a close sibling of TICA, in which one first solves an eigenvalue problem of the time correlation matrix of features to identify the manifold of slow CVs and then finds the subspace in which the matrix is positive definite, promoting numerically stable estimate of relaxation rates.
A close relative of the VAC is the spectral gap optimization of order parameters (SGOOPs) developed by Tiwary.60 While both SGOOP and VAC find a manifold of slow CVs by maximizing the largest eigenvalues, SGOOP combines this principle with a maximum caliber based estimation of the transition or rate matrix and is thus applicable to enhanced-sampling simulations where dynamics are not readily available. In the MMMK collection, Smith et al. developed a multidimensional version of SGOOP by introducing conditional probability factorization and demonstrated its usefulness on the rare-event dissociation pathway of benzene from lysozyme.61
Paul et al.62 generalized the idea of VAC and TICA to nonequilibrium processes: They showed that the VAMP can be used to variationally find slow CVs in systems that are driven by external forces, such as an ion channel in an electric field. Operationally, the approach is as easy as TICA: time-correlation matrices between features are computed and a singular value decomposition yields the slow CVs. The MSMs estimated in this manifold reveal the circular fluxes between long-lived states driven by the external potential.62
IV. ESTIMATING TRANSITION MATRICES AND OTHER QUANTITIES
It is easy to show that the estimation of MSM transition matrices by maximizing the Markov chain likelihood is statistically unbiased if all simulations are in global equilibrium. However, MSMs are usually estimated from short simulation trajectories that may be simulated under equilibrium dynamics, but whose starting points do not start from a global equilibrium distribution. Nüske et al.63 derived the mathematical form of the MSM estimation error for such data and proposed a reweighting method that allows the user to estimate MSMs without this initial state bias. In the MMMK collection, Ref. 64 provided a new estimation method for the same aim, which is based on statistical resampling. As always in machine learning, there is a trade-off between bias and variance of estimators65 that all these bias-reducing estimators must face. While most MSM estimators have been developed with the aim of reducing the bias, a systematic account for MSM estimators with an optimal bias-variance trade-off is still an open issue for the future.
As described above, slow CVs, transition rates, and MSM transition matrices can be viewed as a result of a variational optimization process, e.g., using VAC, VAMP, or SGOOP. From a mathematical point of view, all methods which do this via some form of linear combination of basis functions—and this includes TICA and standard MSM transition matrix estimators—can also be described by the Galerkin approximation framework.8 The idea of the Galerkin approach is as follows: we define basis functions—the mean-free feature functions in TICA or indicator functions denoting where Markov states are in discrete MSMs—and consider the projection of the dynamics onto this basis set. The Galerkin approach then gives us expression for dynamical quantities, such as the Markov propagator eigenfunctions and its eigenvalues/relaxation rates, based on linear combinations of these basis functions. While the Galerkin approach is primarily a mathematical explanation of what happens in TICA or MSM estimation algorithms, Thiede et al. show in the MMMK collection how it can be exploited and expanded to develop better MSM estimators and also obtain direct estimators for quantities that are usually estimated via MSMs, such as committor functions.66
As mentioned in the context of core-based MSMs, committors, mean first-passage times (MFPTs), milestones, and MSMs are deeply connected. In the MMMK collection, Berezhkovskii and Szabo67 further our theoretical understanding of these relationships by showing why exact MFPTs can be computed via a milestoning MSM and provided a relationship between the equilibrium population of milestoning MSM states and the committor functions.
Building upon previous work done on a variational framework for the identification of Markovian transition states,68 Kells et al.69 developed a variationally optimal coarse-graining framework for MSM transition matrices that has broad applicability and for time series analysis of large datasets in general. They demonstrated that coarse-graining an MSM into two or three states with this method has a simple physical interpretation in terms of mean first passage times and fluxes between the coarse grained states. Results are presented using both analytical test potentials and MD simulations of pentalanine.
V. MARKOV MODEL ESTIMATION WITH RARE-EVENTS
While MSMs effectively turn the problem of estimating molecular kinetics and thermodynamics into an embarrassingly parallel process, estimating a statistically precise or even connected MSM is still hampered by sampling the rare transition events sufficiently often. A manifold of MSM-based approaches has been proposed to address the sampling problem, most prominently: (1) adaptive sampling approaches where an MSM is used as starting points for new simulations are most promising to discover new states or reduce statistical error12,70–76 and (2) multiensemble Markov modeling approaches that estimate MSMs with the aid of generalized ensemble simulations (multiples, temperatures, or biases), in order to exploit expedited rare-event sampling.77–82 In the MMMK collection, several new methods are proposed to construct MSMs without sampling the rare events by brute force.
Adaptive sampling is considered by Hruska et al.76 While a variety of adaptive sampling methods have been developed before, the authors conducted a systematic study of the effectiveness of different adaptive sampling strategies on several fast folding proteins. Reference 76 provided theoretical limits for the adaptive sampling speed-up and showed that different adaptive sampling strategies are optimal, depending on sampling starts without prior knowledge of the metastable states, or whether some states are already known and finding new ones is the aim.
By combining the maximum caliber approach83,84 with optimal transport theory, Dixit and Dill developed an approach to approximate MSM rate matrices from short nonequilibrium simulations.85 Maximum caliber-based estimation of MSMs was used by Meral et al.86 in combination with enhanced sampling using well-tempered metadynamics.87,88 The authors applied their framework to the challenging problem of studying the activation of a G Protein-Coupled Receptor (GPCR), here the μ-opioid receptor. They demonstrated that the caliber-derived transition rates are in agreement with those obtained from adaptive sampling, suggesting that the framework is of general usefulness.
Another approach to avoid waiting for the rare events to happen is to speed up sampling between known metastable states using transition path methods, such as transition path sampling,89 transition interface sampling (TIS),90 or Forward Flux Sampling (FFS),91 and subsequently constructing coarse-grained MSMs from the transition path statistics. Recently developed software for transition path simulations facilitates this task.92 In the MMMK collection, Qin et al.93 developed the reweighted partial path (RPP) method approach that can efficiently reweight TIS or FFS simulations in order to derive equilibrium distributions of states or free energy profiles.
Path-based sampling is also considered by Zhu et al.94 The authors developed a new path-searching method for connecting different metastable states of biomolecules that employ ideas from the traveling-salesman problem. Their traveling-salesman based automated path searching (TAPS) algorithm outperforms the string method by 5–8 times for peptides in a vacuum and solution, suggesting that it is an efficient method to obtain initial pathways and intermediates that facilitate the construction of MSMs and thereby full kinetics of complex conformational changes.
VI. CLUSTERING AND COARSE-GRAINING
A successful class of kinetic models are core-based MSMs, originally proposed in Ref. 5. Core-based MSMs directly go from a low-dimensional manifold of feature space to an MSM of few metastable states, skipping over the traditional approach of clustering that space into microstates and coarse-graining the microstate transition matrix. The basic idea of core-based MSMs is to identify dynamical cores—the most densely populated regions of state space which are parts of metastable states—and estimate an MSM from the rare transition paths between cores. Theoretically, core MSMs are closely related to milestoning95 and can be shown to approximate committor functions between metastable states, which are in turn approximating the eigenfunctions of the Markov propagator in metastable systems.96
A natural approach to identify cores is density-based clustering algorithms.97,98 In the MMMK collection, Nagel et al.99 proposed an extension of their previous density-based coring algorithm,98 which avoids misclassification of MD simulation frames to cores by requiring a minimum time spend in a new core to qualify as a core transition. They demonstrated that dynamical coring obtains better MSMs using alanine dipeptide dynamics and Villin headpiece folding as examples.
VII. NONEQUILIBRIUM MARKOV MODELS
Deviations from equilibrium can come in different forms: An ion channel in an electric field may be in steady-state, i.e., it has a unique stationary distribution, but does not obey detailed balance. A spectroscopically probed molecule may be subject to a periodic external force. When a molecular system is expanded by pulling it with a nonequilibrium optical tweezer experiment, even the dynamics themselves become time-dependent and neither a stationary distribution exists nor detailed balance is obeyed. These different degrees of nonequilibrium call for different analysis methods that are only beginning to unfold now. VAMP-based identification of the slow kinetics manifold for nonequilibrium has been discussed above.62
In Ref. 48, Reuter et al. generalized the popular robust Perron Cluster Cluster Analysis (PCCA+) method for coarse-graining transition matrices to obtain metastable states. Their generalized method (G-PCCA) decomposes the MSM transition matrix with a Schur decomposition instead of an eigenvalue decomposition and can obtain metastable states as well as slow cyclical processes from transition matrices that do not obey detailed balance.
Knoch and Speck49 developed a method to construct MSMs for systems that are periodically driven and illustrated the method using an alanine dipeptide molecule that is exposed to a periodic electric field.
VIII. HIDDEN MARKOV MODELS AND EXPERIMENTAL DATA
Hidden Markov models (HMMs) are an alternative to MSMs and have been used to obtain few-state kinetic models from MD data.43,100 They have also been used to extract kinetics directly from experimental trajectories, such as single-molecule Förster resonance energy transfer (FRET) or optical tweezer measurements, which only track one or a few experimental observables over time instead of the full configuration vector.101–103 Similar methods have been used in order to analyze the kinetics from short FRET trajectories of single molecules diffusing through a confocal volume.104,105 Much of MSM theory can be reused when dealing with HMMs, for example, in order to compute relaxation times and a hierarchical decomposition of the system kinetics into metastable states with different lifetimes.103
In the MMMK collection, Jazani et al. developed a HMM that analyzes fluorescence data from molecules diffusing through a single confocal volume.106 Although the fluorescence data stem from a single sensor, in contrast to wide-field optical microscopy, Ref. 106 showed that the intensity fluctuations resulting from the fact that the nonhomogeneity of the confocal excitation volume bears information about the spatial location of the molecule that can be exploited to reconstruct molecular diffusion paths.
IX. MACHINE LEARNING
Recently, the classical approach of constructing MSMs has been disrupted by replacing the traditional estimation pipeline (see above) by VAMPnets, where a deep neural network is trained with the VAMP to map from a high-dimensional coordinate or feature space to a few-state MSM.52 Deep neural networks are now routinely used for several MSM-related tasks, such as learning slow CVs or aiding rare event sampling.107–111
Another important machine-learning framework is kernel methods. Kernel methods have been previously used for TICA112,113 and are also underlying the diffusion map approach that is a popular MD analysis framework.114,115 Following a similar approach to VAMPnets, Klus et al. developed a VAMP-optimal kernel method116 to estimate conformation dynamics directly using a kernel function acting on molecular feature space. They showed that the established linear models such as TICA and MSMs are special cases of their kernel model and demonstrated the computation of metastable states and kinetics for alanine dipeptide dynamics and NTL9 protein folding.
X. SOFTWARE
An important technology driving MSM method development, dissemination, and application is publicly available and software. Fortunately, two large-scale and widely used open-source packages, PyEMMA11 and MSMbuilder,13 exist that implement a wide range of MSM methods and welcome contributions from the community.
Recently, new software packages have added that publish additional MSM methods and techniques, e.g., Refs. 92 and 117. In the MMMK collection, Porter et al.118 presented the Enspara library that is geared toward scalability to large data or models, i.e., MSMs with many states or from very large datasets. Enspara includes parallelized implementations of computationally intensive operations and represents a flexible framework for MSM construction and analysis.
XI. APPLICATIONS
While MSMs and related techniques in the molecular sciences have been primarily developed to study peptide and protein folding, they are now used for a wide range of dynamical processes, including the study of liquids, aggregation, and structural transitions in materials such as alloys. The MMMK article collection is no exception and contains MSM applications to various interesting molecular processes. In all of these examples, the MSM framework reveals new physical or biological insight by revealing structures and transition processes at an unprecedented degree of detail.
Liquid water is a surprisingly rich and complex dynamical system. Long-standing questions, for example, include which dynamical rearrangements lead to the picosecond dynamics observed in spectroscopic data of liquid water. The difficulty in answering such questions—even with accurate molecular models at hand—lies in data analysis: how can we define “state space” in a practical way and which molecular features are suitable in a liquid of molecules that are diffusing around, constantly switching between states that are identical up to the exchange of labels. In the MMMK collection, Schulz et al. used MSM methodology to pursue a detailed analysis of liquid water.119 They solved the permutation-invariance problem by considering each water trimer as a subsystem in a 12-dimensional space defined by aligning the coordinate system to one of the water molecules and then performed an MSM analysis using all water trimer trajectories of a solvent box simulation in this space. The analysis suggests which exact transition processes are observed by experiments and how elementary dynamical processes, such as hydrogen-bond exchange, in liquid water occur in detail.
Gopich and Szabo120 worked out a detailed analysis of diffusion-limited kinetics of a ligand to a macromolecule with two competing binding sites. Their results indicate that the kinetics of such a system are surprisingly rich, the presence of the second empty binding site can slow down binding to the first as a result of competition, or it can speed up binding when populated as direct transitions of ligands between the two binding sites are possible.
Shin and Kolomeisky employed MSM methods in order to model the kinetics of a one-dimensional walker with conformational changes that affect its transition probabilities.121 Biological systems have many examples of such processes, such as the dynamics of molecular motors along filaments, whose motion depends on the current conformation of the motor protein. The authors derived a phase diagram of such systems exhibiting several dynamical regimes of the one-dimensional search process that are determined by the ratios of the relevant length scales.
Pinamonti et al. combined an advanced clustering technique with core-based MSMs in order to analyze the process of RNA base fraying in detail.122 The dynamics of four different RNA duplexes are analyzed, and an interesting interplay between the equilibrium probability of intermediate states and the overall fraying kinetics is described.
Chakraborty and Wales123 obtained an MSM of the adenine-adenine RNA conformational switch using the discrete path sampling (DPS) technique.124 DPS allows the authors to probe very rare events, with interconversion time scale here predicted to be in the range of minutes. Several competing structures, separated by high barriers, are found, but the two main energy funnels lead to the major and minor conformations known from NMR experiments.
Similar issues with permutation invariances exist when studying aggregation and self-assembly of many identical molecules. To this end, Sengupta et al.125 constructed CVs from descriptors that are invariant with respect to permutation of identical molecules. Using these CVs, the authors constructed MSMs to describe the aggregation of a subsequence of Alzheimer’s amyloid-β peptide. The results suggest that disordered and β-sheet oligomers do not interconvert, and thus amyloid formation relies on having formed ordered aggregates from the very beginning.
XII. CONCLUSION
Markov modeling has come of age. In the molecular sciences, it has grown from an activity practiced by a handful of groups to a technique used by a large fraction—if not the majority—of MD simulation groups. Markov modeling has also gone beyond molecular sciences and found applications in other areas of dynamical systems. Recently, it has been found that key MSM techniques have evolved in parallel in other fields under different names,23–25,126 and consolidating these efforts is fruitful for all these fields.
While basic aspects of Markov modeling, such as steps of the data processing pipeline outlined in the beginning of this editorial, are now well established and largely under control, new research questions have emerged, such as how to treat nonequilibrium processes, how to deal with systems with permutation invariance such as liquid and membrane systems, and how to exploit modern machine learning methods for molecular thermodynamics and kinetics. The contributions of the MMMK collection are a cross section of this change.
An important driving force for the development of the field was, and is, the availability of open-source well-maintained software. Currently, MSM softwares are primarily developed and maintained by individual groups. We believe that a key to make this development sustainable and maintainable—and therefore to preserve the accumulated methodological knowledge for the community—is to move these softwares from single groups to communities or in other words dissociate them from individual principle investigators, e.g., by merging packages from different groups. The next years will be crucial in order to show whether this step succeeds, and therefore whether MSM research can proceed at full steam.
ACKNOWLEDGMENTS
We are grateful to the staff and editors of Journal of Chemical Physics who proposed the MMMK collection and did the heavy lifting in collecting and editing papers, especially Erinn Brigham, John Straub, and Peter Hamm. F.N. acknowledges funding from Deutsche Forschungsgemeinschaft (CRC 1114, Project Nos. C03 and A04) and European Commission (Grant No. ERC CoG 772230 “ScaleCell”). E.R. acknowledges funding from EPSRC (Grant Nos. EP/R013012/1, EP/L027151/1, and EP/N020669/1) and European Commission (Grant No. ERC StG 757850 “BioNet”).