Many state-of-the-art methods for the thermodynamic and kinetic characterization of large and complex biomolecular systems by simulation rely on ensemble approaches, where data from large numbers of relatively short trajectories are integrated. In this context, Markov state models (MSMs) are extremely popular because they can be used to compute stationary quantities and long-time kinetics from ensembles of short simulations, provided that these short simulations are in “local equilibrium” within the MSM states. However, over the 15 years since the inception of MSMs, it has been controversially discussed, and never fully resolved, how deviations from local equilibrium can be detected, whether these deviations induce a practical bias in MSM estimation, and how to correct for them. In this paper, we address these issues: We systematically analyze the estimation of MSMs from short non-equilibrium simulations, and we provide an expression for the error between unbiased transition probabilities and the expected estimate from many short simulations. We show that the unbiased MSM estimate can be obtained even from relatively short non-equilibrium simulations in the limit of long lag times and good discretization. Further, we exploit observable operator model (OOM) theory to derive an unbiased estimator for the MSM transition matrix that corrects for the effect of starting out of equilibrium, even when short lag times are used. Finally, we show how the OOM framework can be used to estimate the exact eigenvalues or relaxation time scales of the system without estimating an MSM transition matrix, which allows us to practically assess the discretization quality of the MSM. Applications to model systems and molecular dynamics simulation data of alanine dipeptide are included for illustration. The improved MSM estimator is implemented in PyEMMA as of version 2.3.

## I. INTRODUCTION

Ensemble approaches, where many fairly short simulations are produced in parallel or on distributed computer architectures, are widely used to characterize the thermodynamics and kinetics of large biological macromolecules. Markov state models (MSMs)^{1–3} have become standard tools for the analysis of such data sets generated by molecular dynamics (MD) simulations.^{4–13} An MSM approximates the underlying Markov process, which is continuous in both time and space, by a discrete-time Markov chain on finitely many states. These states are defined by partitioning the continuous state space into finitely many disjoint sets. Time is discretized by choosing a discrete time step, called the *lag time*, and the full process is replaced by a snapshot process that only keeps track of the discrete state visited at the discrete time steps, discarding any time information in between and any spatial information within the discrete sets. The quality of this approximation critically depends on the choice of both discretization and lag time.^{14} One of the strengths of Markov models is that the simulations used to construct them do not necessarily need to sample from the global equilibrium distribution, as only conditional transition probabilities between the states are required.^{4} In particular, at least in principle, these transition probabilities can be obtained without bias from simulations that are started from local equilibrium in each state and run only for the length of a single lag time step. However, it is much more practical to produce simulations that are longer than one lag time and to estimate MSMs by counting transitions along these trajectories. Even if the simulations are started from local equilibrium, the distribution deviates from local equilibrium over time until global equilibrium is restored. The estimation of transition probabilities is therefore subject to a bias.^{1} In order to keep the bias small, it must be assumed that local equilibrium is approximately restored after every time step.

The effect of the initial distribution on the MSM quality, and even the justification of using an MSM for data analysis at all, has been controversially discussed, and this issue has not been resolved yet. At least three ideas have been put forward:^{15} (1) The effect exists,^{1} but may be small and can be ignored in practice. (2) We can reduce the effect of non-equilibrium starting points by discarding the first part of each simulation trajectory, enough to reach local equilibrium.^{11} (3) We can avoid the problem by preparing local equilibrium distributions in the starting states using biased simulations and then shooting trajectories out of them.^{16–19}

Here we qualify and quantify these ideas by systematically analyzing the effect of non-equilibrium starting conditions on MSM quality, and we suggest effective correction mechanisms. Throughout the manuscript, we use the term “non-equilibrium” to describe the situation that simulations are started from a distribution which is not in global equilibrium, and that their simulation time is too short to reach that global equilibrium. Note, however, that the dynamics itself is assumed to possess a unique equilibrium distribution; if sufficiently long simulations were run, they would sample from this equilibrium distribution. Briefly, our main results are the following:

We provide an expression for the error between unbiased transition probabilities and the expected estimate from many simulations running for multiple discrete time steps, see Sec. II. We find that there is no fundamental advantage of starting simulations in local equilibrium. Rather, the estimation error depends on the discretization, the simulation length, and the lag time. In the limit of long lag times and fine discretization, MSMs are estimated without bias even when non-equilibrium starting points are used. However, for a given discretization the lag time required to practically achieve a small estimation bias might be large.

We derive an unbiased MSM estimator that corrects the error due to non-equilibrium starting conditions at short lag times by exploiting the framework of *observable operator models* (OOMs), see Sec. III. OOMs are powerful finite-dimensional models that provide unbiased estimates of stationary and kinetic properties of stochastic processes under fairly mild assumptions, see Refs. 20–22. Most importantly, OOMs can be estimated from non-equilibrium simulations^{22} and are not limited to a local equilibrium assumption. We utilize the fact that exact relaxation time scales, which are not contaminated by the MSM projection error (i.e., by the quality of the coordinates and the clustering used), can be estimated within the OOM framework. The difference between this unbiased estimate and the uncorrected or corrected MSM estimate is very insightful, as it provides an indicator of the quality of the MSM discretization. If this difference is too large, we suggest improving the coordinate selection or discretization used for MSM construction and re-analyzing the data. Note that while OOMs offer the more general theory, they are not as easy to interpret, and their estimation from finite data is not as stable and mature as MSM estimation.

As a technical advance, we provide a meaningful strategy for selecting the model rank of an OOM, which is required in order to obtain practically useful estimates, based on a statistical analysis of the singular values of the count matrix (Sec. III D).

Sec. IV demonstrates the usefulness of the OOM framework for two model systems and MD simulation data of alanine dipeptide. We show that accurate estimates of spectral and stationary properties can be obtained from short non-equilibrium simulations, even for short lag times or poor discretizations. We explain how the discretization quality is revealed by the difference between spectral estimates of the MSM and OOM. We also show that the rank selection strategy helps to choose a suitable model rank even for small lag times, when no apparent time scale separation can be utilized.

As an illustration, consider the one-dimensional model system governed by the potential shown in Fig. 1(a), see Sec. IV A for details. We study the estimation of a Markov model using the two-state discretization indicated in panel (a) of Fig. 1. For various lag times, we investigate the expected transition matrix if 90% of the simulations are started from local equilibrium within state 1, while the other 10% are started from local equilibrium within state 2. Note that we do not use any simulation data here; we only compute expected values over an ensemble of trajectories, with the trajectory length set to 2000 steps, which is shorter than the slowest relaxation time scale.

For short lag times, the standard MSM provides a strongly biased estimate of the equilibrium population of the two wells (Fig. 1(c), green curve). For longer lag times, the MSM converges towards the correct equilibrium population, but the bias only disappears as the lag time approaches the longest relaxation time scale of the system; if the initial distribution is far from equilibrium, this can entail a significant error at practically feasible lag times. In contrast, the corrected MSM estimate proposed in this paper achieves the correct estimate of equilibrium populations even at short lag times (Fig. 1(c), red curve). The standard MSM relaxation time scales are underestimated at short lag times, consistent with previous variational results,^{23–25} but they can be improved by using the unbiased MSM estimator proposed here (Fig. 1(e)). The OOM can provide a model-free estimate of the relaxation time scale that is unbiased at a relatively short lag time (Fig. 1(e), blue line). The difference between the OOM and the corrected MSM estimate (blue versus red lines in Fig. 1(e)) is an indicator of the MSM model error due to the state space discretization. Note that all MSM results in this figure can be dramatically improved by a finer clustering. For example, if the five-state partitioning from Fig. 1(b) is used instead, the estimation of stationary properties converges much faster (Fig. 1(d)), and there is hardly any difference between the time scales estimated by a direct and an unbiased MSM (Fig. 1(f)).

## II. ANALYSIS OF THE STATE OF THE ART: MSM ESTIMATION FROM SIMULATIONS WITH ARBITRARY STARTING POINTS

### A. Molecular dynamics, count matrix, and transition matrix

In this work, we consider the setting described in detail in Ref. 1, that is, an ergodic and reversible Markov process $X_t$ on a continuous state space $\Omega$, which possesses a unique stationary distribution $\pi$. We denote by $\tau > 0$ the lag time and by

$$p(x, y; \tau) \qquad (1)$$

the conditional transition density function, that is, the probability density that the process, when located at configuration $x$ at time $t$, will be found at configuration $y$ at time $t + \tau$. The corresponding transfer operator is denoted by $T(\tau)$ and is defined by its action on a function of state space $u$,

$$T(\tau)\, u(y) = \frac{1}{\pi(y)} \int_\Omega dx\, p(x, y; \tau)\, \pi(x)\, u(x). \qquad (2)$$

Its eigenvalues are called

$$\lambda_m(\tau) = \exp(-\tau / t_m), \qquad (3)$$

where $t_m$ are the implied relaxation time scales. We denote the transfer operator eigenfunctions by $\psi_m$, $m = 1, 2, \ldots$; in particular, we have that $\psi_1 \equiv 1$. If the transfer operator is of rank $M$ at lag time $\tau$, the transition density can be written as

$$p(x, y; \tau) = \sum_{m=1}^{M} \lambda_m(\tau)\, \psi_m(x)\, \psi_m(y)\, \pi(y). \qquad (4)$$

Note that exact equality in Eq. (4) is an assumption, but it is often satisfied approximately over a large range of lag times $\tau$. Throughout the paper, we will consider decompositions of state space into disjoint sets $S_1, \ldots, S_N$, where $\Omega = \bigcup_i S_i$. The indicator function of set $S_i$ is called $\chi_i$. For a simulation of the continuous dynamics which samples positions at discrete time steps, we will denote the position at the $k$th time step by $X_k$, $k = 1, \ldots, K$, such that $K$ is the total number of time steps in the simulation. We use the symbol $\mathbf{Y}$ as a shorthand notation for an entire simulation. If multiple different simulations need to be distinguished, we will denote them by $\mathbf{Y}_q$, $q = 1, \ldots, Q$, i.e., $Q$ is the total number of available simulations.

Most of this work is based on correlations between the discrete sets. For a trajectory as above, we define the empirical histograms and correlations (also called state-to-state time-correlations) as follows:

$$\mathbf{s}(i) = \frac{1}{K - 2\tau} \sum_{k=1}^{K - 2\tau} \chi_i(X_k), \qquad (5)$$

$$\mathbf{S}^\tau(i, j) = \frac{1}{K - 2\tau} \sum_{k=1}^{K - 2\tau} \chi_i(X_k)\, \chi_j(X_{k+\tau}), \qquad (6)$$

$$\mathbf{S}_r^{2\tau}(i, j) = \frac{1}{K - 2\tau} \sum_{k=1}^{K - 2\tau} \chi_i(X_k)\, \chi_r(X_{k+\tau})\, \chi_j(X_{k+2\tau}). \qquad (7)$$

Up to the normalization, the matrix $\mathbf{S}^\tau \in \mathbb{R}^{N \times N}$ is a *count matrix*, because it simply counts the number of transitions from state $S_i$ to state $S_j$ over a time window $\tau$ that have occurred in the simulation, while the entry $\mathbf{s}(i)$ of the vector $\mathbf{s} \in \mathbb{R}^N$ counts the total visits to state $S_i$ and corresponds to the $i$th row sum of $\mathbf{S}^\tau$. For each set $S_r$, the matrix $\mathbf{S}_r^{2\tau} \in \mathbb{R}^{N \times N}$ is proportional to a two-step count matrix counting subsequent transitions from state $S_i$ to $S_r$ and on to state $S_j$. At first sight, it may seem confusing that $\mathbf{S}^\tau$ and $\mathbf{s}$ only count transitions and visits up to time $K - 2\tau$, but further below, we will use all three matrices in conjunction, which requires estimating all of them over the same part of the data. We will continue to refer to these matrices as a count matrix, count vector, and two-step count matrix in what follows. Also note that in the literature, the count matrix and vector are often denoted by $\mathbf{C}^\tau, \mathbf{c}$, but we will use these symbols differently in what follows. Let us note at this point that $\mathbf{s}, \mathbf{S}^\tau, \mathbf{S}_r^{2\tau}$ can be seen as random variables that map a (stochastic) trajectory $\mathbf{Y}$ of discrete time steps to the values given in Eqs. (5)–(7). To emphasize this dependence, we will also write $\mathbf{s}(\mathbf{Y}), \mathbf{S}^\tau(\mathbf{Y}), \mathbf{S}_r^{2\tau}(\mathbf{Y})$ if appropriate.
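As a concrete illustration (a minimal sketch; the function name and array layout are ours, not a published API), the quantities of Eqs. (5)–(7) can be accumulated from a single discrete trajectory. The common normalization factor $1/(K - 2\tau)$ is omitted here, since it cancels in ratio estimators such as Eq. (11):

```python
import numpy as np

def count_matrices(dtraj, tau, n_states):
    """Unnormalized count vector s, count matrix S^tau, and two-step count
    matrices S_r^{2tau} (Eqs. (5)-(7)) from one discrete trajectory.
    All counts stop at K - 2*tau so that the three quantities are
    estimated over the same part of the data."""
    K = len(dtraj)
    s = np.zeros(n_states)
    S = np.zeros((n_states, n_states))
    S2 = np.zeros((n_states, n_states, n_states))  # layout: S2[r, i, j]
    for k in range(K - 2 * tau):
        i, r, j = dtraj[k], dtraj[k + tau], dtraj[k + 2 * tau]
        s[i] += 1          # visit of S_i at step k
        S[i, r] += 1       # transition S_i -> S_r over one lag time
        S2[r, i, j] += 1   # two-step transition S_i -> S_r -> S_j
    return s, S, S2
```

By construction, the row sums of `S` reproduce `s`, as stated in the text.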

We are concerned with the estimation of a transition probability matrix between the sets $S_i$ of a given discretization of state space. If the process is in equilibrium, the conditional transition probabilities can be expressed as

$$\mathbf{T}_{Eq}^\tau(i, j) = \mathbb{P}\left[X_{k+\tau} \in S_j \mid X_k \in S_i\right] \qquad (8)$$

$$= \frac{\int_{S_i} dx \int_{S_j} dy\, \pi(x)\, p(x, y; \tau)}{\int_{S_i} dx\, \pi(x)} \qquad (9)$$

$$= \frac{\mathbf{C}_{Eq}^\tau(i, j)}{\pi_i}. \qquad (10)$$

Here, we have defined the *equilibrium correlation* between sets $S_i$ and $S_j$ as the numerator of Eq. (9) and denoted it by $\mathbf{C}_{Eq}^\tau(i, j)$. Also, we have adopted the usual notation $\pi_i = \int_{S_i} dx\, \pi(x)$ for the equilibrium probabilities of the discrete states. Such a matrix of conditional transition probabilities is called a Markov state model (MSM) or Markov model. It can be used as a simplified model of the dynamics that allows extensive analysis, see Ref. 1.

From a long simulation $X_k$, $k = 1, \ldots, K$, that samples points from the stationary density $\pi$, the matrix $\mathbf{T}_{Eq}^\tau$ can be estimated by the formula

$$\hat{\mathbf{T}}_{Eq}^\tau(i, j) = \frac{\mathbf{S}^\tau(i, j)}{\mathbf{s}(i)}. \qquad (11)$$

### B. Starting from local equilibrium

In practice, producing simulation data that sample from the global equilibrium density $\pi$ is often not tractable. One of the strengths of Markov models is the fact that the transition matrix can also be expressed in terms of the local equilibrium densities

$$\pi_{S_i}(x) = \frac{\chi_i(x)\, \pi(x)}{\pi_i}. \qquad (12)$$

The density $\pi_{S_i}$ is the normalized restriction of $\pi$ to state $S_i$. A Markov model transition matrix can also be estimated by preparing an ensemble of trajectories in such a way that, within each state, the distribution of starting points equals the local density Eq. (12). These trajectories are simulated for a very short time, and the fraction of trajectories starting in $S_i$ and ending up in $S_j$ provides an estimate for the transition matrix entry $\mathbf{T}_{Eq}^\tau(i, j)$.^{16,26} To see this, note that in the setting just described, the initial distribution is a convex combination $\rho_L$ of the local densities $\pi_{S_i}$,

$$\rho_L(x) = \sum_{i=1}^{N} a_i\, \pi_{S_i}(x). \qquad (13)$$

Here, $a_i$ is the probability to start in state $S_i$. Upon replacing $\pi$ by $\rho_L$ in Eq. (9), it follows that

$$\frac{\int_{S_i} dx \int_{S_j} dy\, \rho_L(x)\, p(x, y; \tau)}{\int_{S_i} dx\, \rho_L(x)} = \frac{\mathbf{C}_{Eq}^\tau(i, j)}{\pi_i} = \mathbf{T}_{Eq}^\tau(i, j). \qquad (14)$$

Only very short trajectories and knowledge of the local densities are needed for the application of this method. However, the method suffers from three major disadvantages: First, the intermediate data points of the simulations cannot be used. Second, estimation of the local densities requires biased sampling methods, which is a significant extra effort and entails additional difficulties. Third, changing the discretization requires redoing the simulations, which is impractical when a suitable discretization is not easy to find.

### C. Multiple-step estimator

A common way to construct MSMs in practice is to conduct a large set of distributed simulations $\mathbf{Y}_q$, $q = 1, \ldots, Q$, whose lengths are shorter than the largest relaxation time scales of the system but longer than the lag time $\tau$. For our theoretical investigation, we will assume that each of these trajectories has the same length of $K$ stored simulation steps; for the estimators derived later, uniform trajectory length is not a requirement, see Appendix C.

The simulations are started from some arbitrary initial distribution at time $k = 1$. The transition probability matrix is estimated by replacing $\mathbf{S}^\tau(i, j)$ and $\mathbf{s}(i)$ by their empirical mean values over all simulations $\mathbf{Y}_q$. These are defined by the following equations, where we include the corresponding definition for $\mathbf{S}_r^{2\tau}$ for later use,

$$\bar{\mathbf{s}}(i) = \frac{1}{Q} \sum_{q=1}^{Q} \mathbf{s}(\mathbf{Y}_q)(i), \qquad (16)$$

$$\bar{\mathbf{S}}^\tau(i, j) = \frac{1}{Q} \sum_{q=1}^{Q} \mathbf{S}^\tau(\mathbf{Y}_q)(i, j), \qquad (17)$$

$$\bar{\mathbf{S}}_r^{2\tau}(i, j) = \frac{1}{Q} \sum_{q=1}^{Q} \mathbf{S}_r^{2\tau}(\mathbf{Y}_q)(i, j). \qquad (18)$$

In analogy to Eq. (11), the transition matrix is then estimated by

$$\hat{\mathbf{T}}^\tau(i, j) = \frac{\bar{\mathbf{S}}^\tau(i, j)}{\bar{\mathbf{s}}(i)}. \qquad (19)$$

Additional constraints can be incorporated in order to obtain more specific estimators than Eq. (19), such as estimators obeying detailed balance.^{1,10,27}
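For concreteness, here is a minimal sketch of the multiple-step estimator of Eqs. (16)–(19) (an illustrative function, not the PyEMMA interface). Transition counts are pooled over all trajectories before row normalization, which is equivalent to averaging since the factor $1/Q$ cancels in Eq. (19):

```python
import numpy as np

def multi_step_msm(dtrajs, tau, n_states):
    """Multiple-step MSM estimate, Eq. (19): pooled transition counts over
    all trajectories Y_q, followed by row normalization. Counting stops at
    K - 2*tau to match the data window of Eqs. (5) and (6)."""
    S_bar = np.zeros((n_states, n_states))
    for dtraj in dtrajs:
        for k in range(len(dtraj) - 2 * tau):
            S_bar[dtraj[k], dtraj[k + tau]] += 1
    s_bar = S_bar.sum(axis=1)                    # pooled count vector, cf. Eq. (16)
    T = S_bar / np.maximum(s_bar, 1.0)[:, None]  # guard against empty rows
    return T, S_bar, s_bar
```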

The argument from Sec. II B cannot be transferred directly to a multiple-step estimator like Eqs. (16) and (17): Even if the simulations are started from local equilibrium, this property is lost after the first simulation step, and the resulting estimates are no longer unbiased. A detailed illustration of this phenomenon has been provided by Ref. 1 [Fig. 4], and we repeat it here in Fig. 2. It can be argued that if the discretization is chosen well enough such that the dynamics equilibrates to an approximate local equilibrium within all states over a single time step, the bias can be expected to be very small. This assumption is difficult to check or quantify in practice. In Sec. II D, we analyze the bias introduced by the multiple-step estimator, as well as its dependence on the lag time and simulation length.

### D. Estimation error from non-equilibrium simulations

Now we study the effect of using an initial distribution of simulation data that is not in local equilibrium when the transitions are counted. This deviation from local equilibrium can arise either because trajectories are started from an arbitrary initial condition, or because trajectories exceed the lag time $\tau$, such that an initially prepared local equilibrium is lost for all transition counts harvested after the first one (Sec. II C).

Let $\rho$ denote the *empirical* distribution sampled by the simulations. We need to study the error between the equilibrium transition matrix $\mathbf{T}_{Eq}^\tau$ and the asymptotic limit of Eq. (19). To this end, we study the asymptotic limits of $\bar{\mathbf{S}}^\tau(i, j)$ and $\bar{\mathbf{s}}(i)$ in the limit of infinitely many simulations, $Q \to \infty$, each of finite length,

$$\lim_{Q \to \infty} \bar{\mathbf{S}}^\tau(i, j) = \mathbb{E}_\rho\left[\mathbf{S}^\tau(i, j)\right] =: \mathbf{C}_\rho^\tau(i, j), \qquad (20)$$

$$\lim_{Q \to \infty} \bar{\mathbf{s}}(i) = \mathbb{E}_\rho\left[\mathbf{s}(i)\right] =: \mathbf{c}_\rho(i). \qquad (21)$$

Thus, we use the symbols $\mathbf{C}_\rho^\tau, \mathbf{c}_\rho$ for the *expected* count matrix and vector of total counts associated with the empirical distribution $\rho$. Using the spectral decomposition, Eq. (4), the expected count matrix can be expressed in terms of the spectral components of the dynamics,

$$\mathbf{C}_\rho^\tau(i, j) = \int_\Omega dx \int_\Omega dy\, \rho(x)\, \chi_i(x)\, p(x, y; \tau)\, \chi_j(y) \qquad (22)$$

$$= \sum_{m=1}^{M} \lambda_m(\tau)\, \langle \chi_i, \psi_m \rangle_\rho\, \langle \chi_j, \psi_m \rangle_\pi. \qquad (23)$$

In matrix form, Eq. (23) can be written as

$$\mathbf{C}_\rho^\tau = \mathbf{Q}_\rho^T\, \mathbf{\Lambda}(\tau)\, \mathbf{Q}_\pi, \qquad (26)$$

where $\mathbf{\Lambda}(\tau) = \mathrm{diag}(\lambda_1(\tau), \ldots, \lambda_M(\tau))$ and the matrices $\mathbf{Q}_\rho, \mathbf{Q}_\pi \in \mathbb{R}^{M \times N}$ are given by

$$\mathbf{Q}_\rho(m, i) = \langle \chi_i, \psi_m \rangle_\rho, \qquad (27)$$

$$\mathbf{Q}_\pi(m, i) = \langle \chi_i, \psi_m \rangle_\pi. \qquad (28)$$

These matrices contain the MSM projections of the true eigenfunctions, i.e., their approximations by step functions, which are extensively discussed in Refs. 1 and 14. Let us emphasize that Eq. (23) also holds for arbitrary basis functions, i.e., the $\chi_i$ are not required to form a basis of indicator functions. Thus, it is the most general expression for a correlation matrix of a Markovian dynamics.

Summation over $j$ shows that

$$\mathbf{c}_\rho(i) = \sum_{j=1}^{N} \mathbf{C}_\rho^\tau(i, j) = \langle \chi_i \rangle_\rho, \qquad (29)$$

where we have used $\sum_j \chi_j \equiv 1$ and $\langle \psi_m, \psi_1 \rangle_\pi = \delta_{m1}$.

It follows from Eq. (23) that the spectral expansion of $\mathbf{C}_{Eq}^\tau$ is given by

$$\mathbf{C}_{Eq}^\tau(i, j) = \sum_{m=1}^{M} \lambda_m(\tau)\, \langle \chi_i, \psi_m \rangle_\pi\, \langle \chi_j, \psi_m \rangle_\pi, \qquad (30)$$

using the fact that for trajectories started from global equilibrium we have $\rho = \pi$. Combining Eqs. (23), (29), and (30), we obtain an expression for the estimation error $\mathbf{E}^\tau := \mathbf{T}_\rho^\tau - \mathbf{T}_{Eq}^\tau$,

$$\mathbf{E}^\tau(i, j) = \sum_{m=2}^{M} \lambda_m(\tau) \left[ \frac{\langle \chi_i, \psi_m \rangle_\rho}{\langle \chi_i \rangle_\rho} - q_i^m \right] \langle \chi_j, \psi_m \rangle_\pi, \qquad (31)$$

where $q_i^m = \langle \chi_i, \psi_m \rangle_\pi / \langle \chi_i \rangle_\pi$, and we were able to drop the $m = 1$ terms on both sides as they are equal. Inspecting this expression leads to a number of insights that are practically important for analyzing simulation data with MSMs:

- *MSM estimation from long trajectories*: In the limit that our trajectories are longer than the time scale of the slowest process, the empirical distribution $\rho$ converges to the equilibrium distribution $\pi$, and the bias becomes zero. This offers an explanation why MSMs built from ultra-long simulations^{28,29} are quite well-behaved and have been extensively used for benchmarking and method validation.
- *MSM estimation from short trajectories*: Even if the trajectories are not long enough to reach global equilibrium, because of Eq. (3) the bias decays multi-exponentially with the lag time $\tau$. This is an important insight because MSMs are in practice constructed in the limit of long enough lag times in which the time scale estimates converge,^{1,5} and the above equation shows that this limit is meaningful as it approaches an unbiased estimate.
- *Dependence of bias on the discretization error*: The above formula reflects the well-known insight that Markov models are free of bias if the discretization perfectly approximates the dominant eigenfunctions, meaning that the eigenfunctions are constant on the states $S_i$.^{1,5}
- *Consequences for adaptive sampling*: Previous adaptive sampling approaches have suggested preparing an initial local equilibrium distribution in order to shoot trajectories out of selected states.^{16} The above analysis shows that this strategy is effective if we only count a single transition out of the state, but is ineffective when longer trajectories are shot. In the latter case, it is simpler to ignore the initial distribution and to reduce the effect of bias by extending the lag time $\tau$, see again Fig. 1 and also the next example.

### E. Example

Before proceeding, we illustrate these findings by re-visiting the one-dimensional model system presented in the Introduction. We study the same two different discretizations, the two-state model from panel (a) of Fig. 3 and the five-state discretization shown in Fig. 3(b). Again, simulations are initiated from local equilibrium in states 1 and 2 of the coarse discretization, with *a*_{1} = 0.9, *a*_{2} = 0.1. We study the expected estimate of the equilibrium probability of state 1, which equals the equilibrium probability of states I and II for the finer state definition. Panels (c) and (d) of Fig. 3 show the respective estimates for the coarse and fine discretization as a function of the lag time, for simulation lengths *K* = 1000, 2000, 5000, 10 000, 50 000. Indeed, the estimates improve if the lag time is increased, if the simulation length is increased, or if the discretization is improved. From the coarse partitioning example, we conclude that relaxation to global equilibrium can be required in order to obtain unbiased estimates from simulations initiated out of local equilibrium.

## III. CORRECTION OF ESTIMATION BIAS USING OBSERVABLE OPERATOR MODELS

In this section, we show how to go beyond simply using a longer lag time $\tau$, and we suggest correction mechanisms to obtain the correct equilibrium transition matrix $\mathbf{T}_{Eq}^\tau$ (Eqs. (8)–(10)) from an ensemble of short simulations. This can be accomplished regardless of whether the starting distribution is in global equilibrium, in local equilibrium, or far from any equilibrium.

As discussed above, limitations of MSMs include the assumption of Markovianity, sensitivity to projection error, and sensitivity to the distribution of trajectory starting points. All of these limitations can be overcome by realizing that molecular dynamics that is observed in a chosen set of variables, reaction coordinates, or order parameters at a certain lag time $\tau$ can be *exactly* described by projected Markov models (PMMs).^{30} This insight allows us to employ estimators that are not affected by the MSM limitations, such as hidden Markov models (HMMs)^{30} or observable operator models (OOMs),^{20–22} which operate on the discretized state space.

Here, we employ OOMs in order to get improved MSM estimators that are not subject to the bias caused by a non-equilibrium distribution of the trajectories used. In a nutshell, OOMs are spectral estimators able to provide unbiased estimates of stationary and dynamical quantities for dynamical systems that can be well described by a finite number of dynamical components. Here we only summarize a few aspects of OOMs that are relevant to the present paper and present an algorithm that can be used to estimate MSMs without bias from the initial trajectory distribution. To fully understand the theoretical background and derivation, please refer to Refs. 20–22.

### A. Observable operator models

*Observable operator models* (OOMs) provide a framework that completely captures the dynamics of a stochastic dynamical system by a finite-dimensional algebraic system if only a finite number $M$ of relaxation processes contribute in Eq. (4), see Refs. 20 and 21. For molecular dynamics, this property is achieved if we observe and model the dynamics at a finite lag time $\tau$. The *full-state observable operator* $\mathbf{\Xi}_\Omega$ is an $M \times M$ matrix which contains the scalar products between the eigenfunctions,

$$\mathbf{\Xi}_\Omega(m, n) = \lambda_m(\tau) \int_\Omega dx\, \pi(x)\, \psi_m(x)\, \psi_n(x). \qquad (34)$$

In statistical terms, $\mathbf{\Xi}_\Omega$ is the expectation value of the time-lagged covariance matrix between eigenfunctions. As eigenfunctions are orthogonal with respect to the equilibrium distribution $\pi$, or in other words, statistically uncorrelated, $\mathbf{\Xi}_\Omega$ is just a diagonal matrix of the eigenvalues,

$$\mathbf{\Xi}_\Omega = \mathbf{\Lambda}(\tau) = \mathrm{diag}(\lambda_1(\tau), \ldots, \lambda_M(\tau)). \qquad (35)$$

If we do not integrate over the full state space $\Omega$ in Eq. (34), but only over a subset $A \subset \Omega$, we can define a matrix $\mathbf{\Xi}_A$ of size $M \times M$, called the *set-observable operator* for set $A$. All set-observable operators and two vectors $\bm{\omega}, \bm{\sigma} \in \mathbb{R}^M$ are the key ingredients of OOM theory. The vectors $\bm{\omega}, \bm{\sigma}$ equal the first canonical unit vector $\mathbf{e}_1$, i.e., $\bm{\omega} = \bm{\sigma} = \mathbf{e}_1 = (1, 0, \ldots, 0)^T$, and they are called *information state* and *evaluator*, respectively. If the finite-rank assumption Eq. (4) holds, these components form an algebraic system that allows us to compute equilibrium probabilities of finite observation sequences. Let $A_1, \ldots, A_l$ be arbitrary subsets of $\Omega$ that do not need to form a partition of the state space. If Eq. (4) is satisfied, we can compute the probability that a trajectory in equilibrium visits set $A_1$ at time $\tau$, set $A_2$ at time $2\tau$, ..., and set $A_l$ at time $l\tau$ by the following matrix-vector product:

$$\mathbb{P}\left[X_\tau \in A_1, X_{2\tau} \in A_2, \ldots, X_{l\tau} \in A_l\right] = \bm{\omega}^T\, \mathbf{\Xi}_{A_1}\, \mathbf{\Xi}_{A_2} \cdots \mathbf{\Xi}_{A_l}\, \bm{\sigma}. \qquad (36)$$

The proof can be found in Ref. 21; we also repeat it in Appendix B. Note that, in case $A_1, \ldots, A_l$ form a partition of state space, the probability of such an observation sequence cannot in general be computed from a Markov model transition matrix between the sets $A_1, \ldots, A_l$, unless the dynamics is Markovian on these sets. This clearly distinguishes an OOM from a Markov model: an OOM can correctly describe arbitrary projected dynamics as long as Eq. (4) holds.
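To illustrate Eq. (36) numerically (our own toy construction, not code from the paper): for a plain two-state Markov chain with transition matrix $T$ and stationary vector $\pi$, one member of the equivalent-OOM family is $\bm{\omega} = \pi$, $\bm{\sigma} = (1, \ldots, 1)^T$, and $\mathbf{\Xi}_j = T\,\mathrm{diag}(\mathbf{e}_j)$; the chained matrix-vector product then reproduces equilibrium path probabilities.

```python
import numpy as np

def path_probability(omega, Xi_seq, sigma):
    """Evaluate Eq. (36): P = omega^T Xi_{A_1} ... Xi_{A_l} sigma."""
    v = omega
    for Xi in Xi_seq:
        v = v @ Xi
    return float(v @ sigma)

# Equivalent OOM of a two-state Markov chain (toy example, values ours).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])       # stationary: pi @ T == pi
Xi = [T @ np.diag(e) for e in np.eye(2)]    # Xi_j = T diag(e_j)
sigma = np.ones(2)

# P[X_tau in S_1, X_{2 tau} in S_1] = pi_1 * T(1,1) = (2/3) * 0.9 = 0.6
p = path_probability(pi, [Xi[0], Xi[0]], sigma)
```

For a truly Markovian chain this agrees with the product of transition probabilities; for non-Markovian projected dynamics, only the OOM product remains exact.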

As a Markov process is determined entirely by finite observation probabilities like Eq. (36), it follows that we can compute several key equilibrium, kinetic, and mechanistic quantities in an unbiased fashion if we can somehow estimate the OOM components. For a fixed decomposition of state space into sets $S_r$, $r = 1, \ldots, N$, as before, let us denote the set-observable operators of the sets $S_r$ by $\mathbf{\Xi}_r$, which implies that

$$\mathbf{\Xi}_\Omega = \sum_{r=1}^{N} \mathbf{\Xi}_r. \qquad (37)$$

It follows from Eq. (36) that we can compute the unbiased equilibrium correlation matrix and the stationary probabilities by the formulas

$$\mathbf{C}_{Eq}^\tau(i, j) = \bm{\omega}^T\, \mathbf{\Xi}_i\, \mathbf{\Xi}_j\, \bm{\sigma}, \qquad (38)$$

$$\pi_i = \bm{\omega}^T\, \mathbf{\Xi}_i\, \bm{\sigma}. \qquad (39)$$

In practice, we cannot directly estimate $\mathbf{\Xi}_r$, but only a *similar* operator $\hat{\mathbf{\Xi}}_r$. However, it follows directly from Eqs. (38) and (39) that if an unknown similarity transform $\mathbf{R} \in \mathbb{R}^{M \times M}$ affects all OOM quantities via

$$\hat{\bm{\omega}}^T = \bm{\omega}^T \mathbf{R}^{-1}, \qquad (40)$$

$$\hat{\mathbf{\Xi}}_r = \mathbf{R}\, \mathbf{\Xi}_r\, \mathbf{R}^{-1}, \qquad (41)$$

$$\hat{\bm{\sigma}} = \mathbf{R}\, \bm{\sigma}, \qquad (42)$$

then Eqs. (38) and (39) remain exactly valid using $\hat{\bm{\omega}}, \hat{\mathbf{\Xi}}_r, \hat{\bm{\sigma}}$. In other words, all OOMs that can be constructed by choosing some transformation matrix $\mathbf{R}$ form a family of equivalent OOMs. A specific member of this family *can* be estimated directly from simulation data, and thus we can use it in order to obtain unbiased estimates of Eqs. (38) and (39) even from a large ensemble of trajectories that do not need to sample from global equilibrium. It has been shown in Ref. 22 that Eqs. (47)–(50) in Sec. III B indeed provide the components of an equivalent OOM, i.e., there is an invertible matrix $\mathbf{R}$ such that Eqs. (40)–(42) are satisfied in the absence of statistical noise.
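This equivalence is easy to verify numerically. The sketch below implements one convention consistent with Eqs. (40)–(42), namely $\hat{\bm{\omega}}^T = \bm{\omega}^T \mathbf{R}^{-1}$, $\hat{\mathbf{\Xi}}_r = \mathbf{R}\mathbf{\Xi}_r\mathbf{R}^{-1}$, $\hat{\bm{\sigma}} = \mathbf{R}\bm{\sigma}$; any invertible $\mathbf{R}$ leaves products of the form of Eqs. (38) and (39) unchanged because the factors $\mathbf{R}^{-1}\mathbf{R}$ cancel.

```python
import numpy as np

def transform_oom(omega, Xi_list, sigma, R):
    """Similarity-transform OOM components with an invertible matrix R,
    producing an equivalent OOM (one convention for Eqs. (40)-(42))."""
    Rinv = np.linalg.inv(R)
    omega_h = Rinv.T @ omega                  # omega^T -> omega^T R^{-1}
    Xi_h = [R @ Xi @ Rinv for Xi in Xi_list]  # Xi_r   -> R Xi_r R^{-1}
    sigma_h = R @ sigma                       # sigma  -> R sigma
    return omega_h, Xi_h, sigma_h
```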

### B. Unbiased estimation of Markov state models

To construct an *exact* unbiased estimator, we need three ingredients: (i) the expected count matrix $\mathbf{C}_\rho^\tau$ and (ii) the vector of total counts $\mathbf{c}_\rho$ from Eqs. (20) and (21), and additionally (iii) the expected two-step count matrices,

$$\mathbf{C}_{\rho, r}^{2\tau}(i, j) = \lim_{Q \to \infty} \bar{\mathbf{S}}_r^{2\tau}(i, j) = \mathbb{E}_\rho\left[\mathbf{S}_r^{2\tau}(i, j)\right]. \qquad (43)$$

As a reminder, expectation values here denote the expectation over a trajectory ensemble sampling from the empirical (non-equilibrium) distribution $\rho$. In practice, only finitely many simulations are available, and we thus replace $\mathbf{c}_\rho, \mathbf{C}_\rho^\tau$, and $\mathbf{C}_{\rho, r}^{2\tau}$ by the count vectors and matrices $\bar{\mathbf{s}}, \bar{\mathbf{S}}^\tau$, and $\bar{\mathbf{S}}_r^{2\tau}$ (Eqs. (16)–(18)), which are asymptotically unbiased estimators. The unbiased estimation algorithm can be summarized as follows:

1. Obtain the empirical mean $\bar{\mathbf{s}}$, count matrix $\bar{\mathbf{S}}^\tau$, and two-step count matrices $\bar{\mathbf{S}}_r^{2\tau}$ from simulation data using Eqs. (16)–(18).

2. Decompose the count matrix by singular value decomposition (SVD),

$$\bar{\mathbf{S}}^\tau = \mathbf{V} \Sigma \mathbf{W}^T, \qquad (44)$$

and compute weighted projections onto the leading $M$ left and right singular vectors by

$$\mathbf{F}_1 = \mathbf{V}_M \Sigma_M^{-1/2}, \qquad (45)$$

$$\mathbf{F}_2 = \mathbf{W}_M \Sigma_M^{-1/2}. \qquad (46)$$

We have used the symbols $\mathbf{V}_M, \mathbf{W}_M, \Sigma_M$ to denote the restriction of these matrices to their first $M$ columns.

3. Use $\mathbf{F}_1, \mathbf{F}_2$ to obtain the set-observable operators $\hat{\mathbf{\Xi}}_r$ and the evaluator $\hat{\bm{\sigma}}$ of an equivalent OOM via

$$\hat{\mathbf{\Xi}}_r = \mathbf{F}_1^T\, \bar{\mathbf{S}}_r^{2\tau}\, \mathbf{F}_2, \qquad (47)$$

$$\hat{\bm{\sigma}} = \mathbf{F}_1^T\, \bar{\mathbf{s}}. \qquad (48)$$

Compute the full-state observable operator $\hat{\mathbf{\Xi}}_\Omega = \sum_{r=1}^{N} \hat{\mathbf{\Xi}}_r$ and obtain the information state vector $\hat{\bm{\omega}}$ as the solution of the eigenvalue problem

$$\hat{\bm{\omega}}^T\, \hat{\mathbf{\Xi}}_\Omega = \hat{\bm{\omega}}^T, \qquad (49)$$

$$\hat{\bm{\omega}}^T\, \hat{\bm{\sigma}} = 1. \qquad (50)$$

The normalization Eq. (50) can be achieved by dividing the arbitrarily scaled solution of Eq. (49) by $\hat{\bm{\omega}}^T \hat{\bm{\sigma}}$.

4. Compute the unbiased equilibrium correlation matrix and unbiased equilibrium distribution by

$$\mathbf{C}_{Eq}^\tau(i, j) = \hat{\bm{\omega}}^T\, \hat{\mathbf{\Xi}}_i\, \hat{\mathbf{\Xi}}_j\, \hat{\bm{\sigma}}, \qquad (51)$$

$$\pi_i = \hat{\bm{\omega}}^T\, \hat{\mathbf{\Xi}}_i\, \hat{\bm{\sigma}} \qquad (52)$$

$$= \sum_{j=1}^{N} \mathbf{C}_{Eq}^\tau(i, j), \qquad (53)$$

and then obtain the unbiased MSM transition matrix $\mathbf{T}_{Eq}^\tau$ either using the nonreversible estimator

$$\mathbf{T}_{Eq}^\tau(i, j) = \frac{\mathbf{C}_{Eq}^\tau(i, j)}{\pi_i}, \qquad (54)$$

or the reversible estimator

$$\mathbf{T}_{Eq}^\tau(i, j) = \frac{\mathbf{C}_{Eq}^\tau(i, j) + \mathbf{C}_{Eq}^\tau(j, i)}{\sum_{j=1}^{N} \mathbf{C}_{Eq}^\tau(i, j) + \sum_{j=1}^{N} \mathbf{C}_{Eq}^\tau(j, i)}. \qquad (55)$$
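Steps 2–4 can be condensed into a few lines of NumPy. The sketch below is our own compact rendering of the procedure described above (not the PyEMMA implementation), and it uses only the nonreversible estimator; the reversible variant is omitted for brevity.

```python
import numpy as np

def oom_msm_estimate(s_bar, S_bar, S2_bar, M):
    """Unbiased MSM estimation from (averaged) counts.
    s_bar: (N,) count vector; S_bar: (N, N) count matrix;
    S2_bar: (N, N, N) with S2_bar[r] the two-step count matrix of set S_r;
    M: model rank."""
    N = len(s_bar)
    # Step 2: SVD of the count matrix and weighted projections, Eqs. (44)-(46).
    V, sig, Wt = np.linalg.svd(S_bar)
    F1 = V[:, :M] * sig[:M] ** -0.5
    F2 = Wt[:M].T * sig[:M] ** -0.5
    # Step 3: equivalent OOM components, Eqs. (47)-(50).
    Xi = np.array([F1.T @ S2_bar[r] @ F2 for r in range(N)])
    sigma = F1.T @ s_bar
    Xi_Omega = Xi.sum(axis=0)
    evals, evecs = np.linalg.eig(Xi_Omega.T)          # left eigenvectors
    omega = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    omega = omega / (omega @ sigma)                   # normalization, Eq. (50)
    # Step 4: unbiased correlations, stationary vector, and MSM, Eqs. (51)-(54).
    C_eq = np.einsum('m,imn,jnk,k->ij', omega, Xi, Xi, sigma)
    pi = np.array([omega @ Xi[i] @ sigma for i in range(N)])
    T_eq = C_eq / C_eq.sum(axis=1, keepdims=True)     # nonreversible estimator
    return T_eq, pi, C_eq
```

For a system that is exactly Markovian on the discrete states, this procedure recovers the true transition matrix and stationary distribution from expected counts even when the trajectories are started far from equilibrium.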

Let us briefly comment on the central idea behind this algorithm, which is the estimation of an equivalent OOM in the third step, particularly in Eq. (47). Using the path probability formula Eq. (36), it can be shown that the expected two-step count matrix is given by

$$\mathbf{C}_{\rho, r}^{2\tau} = \mathbf{Q}_\rho^T\, \mathbf{\Xi}_r\, \mathbf{\Lambda}(\tau)\, \mathbf{Q}_\pi, \qquad (56)$$

where the matrices $\mathbf{Q}_\rho, \mathbf{Q}_\pi$ are the same as in Eqs. (27) and (28). Thus, by the intermediate step, the set-observable operator is introduced into the decomposition of the two-step count matrix. Now, the idea is to find two matrices $\mathbf{F}_1, \mathbf{F}_2 \in \mathbb{R}^{N\times M}$, such that $\mathbf{R}_1 := \mathbf{F}_1^T\mathbf{Q}_\rho$ and $\mathbf{R}_2 := \mathbf{\Lambda}(\tau)\mathbf{Q}_\pi^T\mathbf{F}_2$ are inverse to each other, because this implies that

is the *r*th component of an equivalent OOM. The properties of SVD and the decomposition Eq. (26) guarantee that the choice of **F**_{1}, **F**_{2} in the second step above achieves this goal,

Similar arguments can be used to justify the equations for $\bm{\omega},\bm{\sigma}$. We also note that different choices of **F**_{1}, **F**_{2} in step 2 are possible. For detailed explanations and proofs, please refer to the previous publications.^{20–22}

### C. Recovery of exact relaxation time scales

A remarkable by-product of the procedure described above is that the transformed full-state two-step count matrix $\hat{\mathbf{\Xi}}^\Omega$ is similar to a diagonal matrix of the system eigenvalues $\lambda_m(\tau)$ *without any MSM projection error*. This has been shown for equilibrium data in Ref. 31 and also applies to non-equilibrium data,^{21}

Thus, diagonalization of $\mathbf{\Xi}^\Omega $ provides an estimate of the leading system eigenvalues, and consequently also of the relaxation rates or time scales, that is not distorted by the fact that we coarse-grain the dynamics to a Markov chain between coarse sets in state space. These eigenvalue and time scale estimates are only subject to statistical error but not to any MSM model error. It is impossible to directly build an MSM that produces these time scales—when an MSM is desired, the time scales can only be approximated, and they will only be correct in the limit of long lag times and good discretization.

However, the fact that we can get a model-free estimate of the eigenvalues and relaxation time scales can be used to assess the discretization quality: According to the variational principle of conformation dynamics,^{24} the exact system eigenvalues provide an upper bound for the eigenvalues of the equilibrium transition matrix $\mathbf{T}_{\mathrm{Eq}}^\tau$. By comparing the eigenvalues of $\mathbf{T}_{\mathrm{Eq}}^\tau$ to those from Eqs. (62) and (63), the MSM discretization error theoretically studied in Refs. 1, 14, and 23 can be practically quantified.
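As a sketch of this model-free spectral estimate (cf. Eqs. (62) and (63)), one can diagonalize $\hat{\mathbf{\Xi}}^\Omega$ and convert the eigenvalues into implied time scales $t_m = -\tau/\ln|\lambda_m(\tau)|$; the function name below is hypothetical, and eigenvalues are sorted by magnitude with the stationary eigenvalue $\lambda_1 = 1$ discarded:

```python
import numpy as np

def oom_spectral_timescales(Xi_Omega, tau):
    """Relaxation time scales from the eigenvalues of the full-state
    observable operator; no MSM transition matrix is estimated."""
    lams = np.linalg.eigvals(Xi_Omega)
    lams = np.sort(np.abs(lams))[::-1]    # lambda_1 = 1 comes first
    return -tau / np.log(lams[1:])        # implied time scales t_2, t_3, ...
```

Because $\hat{\mathbf{\Xi}}^\Omega$ is only similar to the diagonal matrix of eigenvalues, any invertible change of basis leaves these time scales untouched, which is exactly why they carry no MSM projection error.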

### D. Selection of model rank

The above method is theoretically guaranteed to work whenever the number of MSM states *N* is at least equal to the number *M* of relaxation processes in Eq. (4), and the count matrix $\mathbf{C}_\rho^\tau$ is of rank *M*. In the absence of statistical noise, the model rank *M* can then be determined as the number of non-zero singular values of $\mathbf{C}_\rho^\tau$. For finite data, the numerical rank of $\bar{\mathbf{S}}_\tau$ is not necessarily equal to *M*, as the singular values can be perturbed by noise. Classical matrix perturbation theory predicts that small singular values will be particularly affected by noise, see, e.g., Ref. 32, and also Fig. 4(a). Including small, noisy singular values can severely affect the accuracy of the method, most likely due to the presence of the matrix of inverse singular values in Eqs. (45) and (46). Conversely, we expect small singular values to have little impact on the dominant spectral and stationary properties of the final OOM, but this expectation remains to be backed up by further theoretical investigation.

Consequently, it seems appropriate to cut off small and statistically unreliable singular values and select a smaller model rank $\hat{M} < M$ in Eqs. (45) and (46). In order to determine the uncertainties of the singular values, we use a bootstrapping procedure, and we discard all singular values with a signal-to-noise ratio of less than 10. This has proven to be a useful choice in all applications presented further below. Fig. 4(b) illustrates this procedure for a simple model system.
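A possible implementation of this cutoff criterion, assuming a list of discrete trajectories and a user-supplied count-matrix function (both names hypothetical):

```python
import numpy as np

def select_rank(trajs, count_fn, n_boot=100, snr_min=10.0, seed=0):
    """Rank selection sketch: keep singular values whose bootstrapped
    signal-to-noise ratio (mean / standard deviation) is >= snr_min.

    trajs    : list of discrete trajectories
    count_fn : maps a list of trajectories to an (N, N) count matrix
    """
    rng = np.random.default_rng(seed)
    svals = []
    for _ in range(n_boot):
        # re-draw whole trajectories with replacement
        sample = [trajs[i] for i in rng.integers(len(trajs), size=len(trajs))]
        svals.append(np.linalg.svd(count_fn(sample), compute_uv=False))
    svals = np.asarray(svals)
    snr = svals.mean(axis=0) / (svals.std(axis=0) + 1e-15)
    return int(np.sum(snr >= snr_min))    # selected model rank M_hat
```

Since singular values returned by the SVD are sorted, counting the entries above the threshold directly yields the number of leading singular values to retain.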

### E. Software, algorithmic details, and analysis of computational effort

We close Section III by pointing out a few more details of practical importance. First, while it was convenient for the theoretical analysis to assume that all trajectories sample the same number of simulation steps *K*, this is not required (see Appendix C). Moreover, we also argue in Appendix C that all normalizations in Eqs. (5)–(7) and (16)–(18) can be dropped in practice. All of the matrices $\bar{\mathbf{s}}, \bar{\mathbf{S}}_\tau, \bar{\mathbf{S}}_{2\tau}^r$ used in the estimation algorithm can be replaced by integer-valued matrices that simply count the number of visits, transitions, and two-step transitions.

Second, we have suggested using a bootstrapping procedure in order to estimate uncertainties for the singular values of the count matrix. One way to realize this is to re-draw trajectories with replacement from the set of all available simulations and to re-estimate the count matrix from this modified set of simulations. As individual simulations are statistically independent, this procedure is theoretically justified and can also be used to estimate uncertainties of further derived quantities, like time scales and stationary probabilities. We used this trajectory-based bootstrapping in all examples shown below. However, if only a small number of rather long simulations is available, it may be more practical to re-draw individual transitions from the set of all available transitions in the data set. Let *T* denote the total number of data points, which equals *T* = *KQ* for uniform trajectory length, and is given by Eq. (C1) otherwise. If the transitions were statistically independent, one could simply re-sample *T* transition pairs from the set of all *N*^{2} possible pairs, where the probability of drawing the pair (*i*, *j*) is given by $\bar{\mathbf{S}}_\tau(i,j)$. In reality, however, transitions are not statistically independent. Therefore, we suggest replacing the count matrix $\bar{\mathbf{S}}_\tau$ by the effective count matrix described in Ref. 33, but it should be noted that this procedure relies on several approximations and must be improved in the future.
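The transition-based re-sampling described above could be sketched as follows, assuming independent transitions purely for illustration (in practice the effective count matrix of Ref. 33 would replace `S`); the function name is hypothetical:

```python
import numpy as np

def resample_transitions(S, T_total, rng):
    """Re-draw T_total transition pairs (i, j) with probabilities
    proportional to the entries of the count matrix S.

    Assumes transitions are statistically independent, which is only
    an approximation for counts harvested from trajectories."""
    N = S.shape[0]
    p = (S / S.sum()).ravel()                 # pair probabilities
    draws = rng.multinomial(T_total, p)       # one multinomial draw
    return draws.reshape(N, N).astype(float)  # bootstrapped count matrix
```

Each bootstrap replicate is a single multinomial draw over all $N^2$ pairs, so the re-sampled matrix has the same total number of transitions as the original data.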

Third, we present an overview of the computational cost of each step in the estimation algorithm in Table I, assuming that dense matrix algebra is used in every step. It is expressed in terms of the total number of data points *T*, the number of MSM states *N*, the OOM model rank *M*, and the number of bootstrapping samples *n*_{b}.

| Operation | Cost |
| --- | --- |
| Count matrix estimation | $\propto T$ |
| Bootstrapping | $\propto n_b T N^3$ |
| SVD of $\bar{\mathbf{S}}_\tau$ | $\propto N^3$ |
| Computation of OOM components | $\hat{\bm{\sigma}}$: $MN + N^2$; $\hat{\mathbf{\Xi}}$: $N(N^2 M + N M^2)$; $\hat{\bm{\omega}}$: $\propto M^3 + N M^2$ |
| Transition matrix $\mathbf{T}_{\mathrm{Eq}}^\tau$ | $N(2M^2 + M)$ |

The first step requires an effort which is linear in the data size and can be performed efficiently. In most cases, we can also assume the count matrices $\bar{\mathbf{S}}_\tau, \bar{\mathbf{S}}_{2\tau}^r$ to be sparse and the model rank *M* to be small. In this case, the cubic term appearing for the calculation of $\hat{\mathbf{\Xi}}$ becomes quadratic, while the contributions of the model rank are small. The only real bottleneck is the singular value decomposition of $\bar{\mathbf{S}}_\tau$, accounting for the factor *N*^{3} in the second and third steps. As we generally require all singular values of the count matrix, this step must be performed using dense matrix algebra, which can be time-consuming. Future research may provide a method that only requires the computation of the leading singular values, thus allowing for sparse algebra to be employed.

Lastly, we note that the MSM correction method described in Sec. III B is available as part of the PyEMMA package,^{34} version 2.3 or later, see http://pyemma.org.

## IV. EXAMPLES

For each of the following examples, we use the trajectory-based bootstrapping strategy to determine the OOM model rank. Mean values and standard errors for the singular values are estimated from *n*_{b} = 10 000 re-samplings; singular values with a signal-to-noise ratio of at least 10.0 are accepted. We also generate error estimates for all quantities derived from the OOM-based Markov model by trajectory bootstrapping, using 1000 re-samplings. In addition, we compute a conventional Markov model without OOM-based correction as a comparison.

### A. One-dimensional toy potential

As the first example, we study in more detail the one-dimensional system used in the Introduction. The system is defined by the double-well potential function shown in Fig. 5(a). The dynamics here is a finite state space Markov chain with 100 microstates distributed along the *x*-axis, where transitions can occur between neighboring states based on a Metropolis criterion. The system is kinetically two-state, as the slowest relaxation time scale of the system, corresponding to the transition process between the two wells, is *t*_{2} = 3708 steps and clearly dominates all others (Fig. 5(b)).
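The nearest-neighbor Metropolis dynamics used here can be sketched as follows, with a hypothetical quartic double-well standing in for the potential of Fig. 5(a), so the resulting slowest time scale will not match *t*_{2} = 3708 steps exactly:

```python
import numpy as np

def metropolis_chain(energies, beta=1.0):
    """Nearest-neighbor Metropolis transition matrix on a 1D grid of
    microstates with the given energies (in units of kT)."""
    n = len(energies)
    T = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                # propose a neighbor with prob. 1/2, accept by Metropolis rule
                T[i, j] = 0.5 * min(1.0, np.exp(-beta * (energies[j] - energies[i])))
        T[i, i] = 1.0 - T[i].sum()   # remaining probability: stay put
    return T

x = np.linspace(-1.0, 1.0, 100)
V = 5.0 * (x**2 - 0.5)**2            # hypothetical double-well potential
T = metropolis_chain(V)
lams = np.sort(np.linalg.eigvals(T).real)[::-1]
t2 = -1.0 / np.log(lams[1])          # slowest implied time scale, in steps
```

By construction, the chain satisfies detailed balance with respect to $\pi_i \propto e^{-\beta V_i}$, and its second eigenvalue yields the slow well-to-well exchange time scale.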

We investigate the estimation of a seven-state Markov model (*N* = 7) using the discretization indicated by dashed lines in Fig. 5(a). Using seven states instead of two accelerates the convergence of OOM estimates. Still, the seven-state discretization is a poor one—note that state 4 contains large parts of the transition region as well as parts of the right minimum. This choice was made deliberately in order to test the robustness of our method with respect to poor MSM clusterings. We produced two different data sets, each comprising *Q* = 5000 simulations. The first set contains short simulations of length *K* = 250, while the simulations of the second set are *K* = 2000 steps long. For the analysis of the smaller data set, we can use lag times up to $\tau =30$, while we can go up to $\tau =200$ for the larger data set. Panels (c), (e), and (g) of Fig. 5 display the results for the short simulations, while the corresponding results for the larger data set are shown in panels (d), (f), and (h). All simulations were initiated from a non-equilibrium starting distribution, where the probabilities to start in each of the seven states are given by the vector

that is, 90% of the simulations were started in the left three states, while only 10% were initialized in the deeper minimum on the right. Within each state, the actual microstate was selected from a uniform distribution.

Figs. 5(c) and 5(d) compare estimates of stationary probabilities from direct MSMs based on Eq. (19) and corrected MSMs with the transition matrix given by Eq. (55). Due to the non-equilibrium initial distribution, the simulations visit the left minimum much more frequently than a simulation in equilibrium would do. While the MSM estimates of the stationary distribution converge to the true equilibrium distribution at long lag times, they are strikingly inaccurate at short lag times, where the non-equilibrium starting distribution still has a strong effect. Even at the largest lag time $\tau = 200$, the bias is still visible. In contrast, the corrected MSM provides an excellent and stable estimate at lag times of 15 steps or longer.

In Figs. 5(e) and 5(f), we compare estimates of the slowest implied relaxation time scale *t*_{2} from three different estimators: a direct Markov model based on Eq. (19), the corrected Markov model based on Eq. (55), and the OOM-based spectral estimation, Eqs. (62) and (63). First, we notice that the direct and corrected MSMs provide different estimates because of the combination of non-equilibrium starting points and the poor discretization quality. The corrected MSM time scales converge faster to the true time scales than the uncorrected ones. Second, the OOM-based direct estimation of relaxation time scales by Eq. (63) provides accurate results already at lag time $\tau = 15$, a regime where the number of relevant relaxation processes cannot easily be determined by a time scale separation, see again panel (b) of Fig. 5. The OOM time scale estimates become very accurate at larger lag times if more data can be used. Third, the large deviation between the corrected MSM and the OOM time scales is indicative of the poor discretization quality employed here.

Finally, in Figs. 5(g) and 5(h) we show the model rank selected by the bootstrapping procedure as a function of the lag time. We can observe how our criterion based on statistical uncertainties helps to select an appropriate model rank for each lag time, even when it is not obvious from the time scale plot. As expected, the system effectively becomes of rank 2 for lag times $\tau \geq 80$.

### B. Molecular dynamics simulations of alanine dipeptide

Our second example is molecular dynamics simulation data of alanine dipeptide (Ac-A-NHMe) in explicit water. Alanine dipeptide has been used as a model system in numerous previous studies, see Refs. 25 and 35 and many others. It is well known that its dynamics can be described by the two-dimensional space of backbone dihedral angles $\varphi, \psi$. Fig. 6(a) shows the equilibrium probability distribution in this space with its three metastable minima in the upper left, central left, and central right parts of the plane. The slow dynamics consists of exchanges between the left and right parts ($t_2 \approx 1400\,\mathrm{ps}$) and between the two minima on the left ($t_3 \approx 70\,\mathrm{ps}$). We study the estimation of a Markov model using the discretization also indicated in panel (a) of Fig. 6. It was generated by k-means clustering of the data set described below using *N* = 40 cluster centers. We produced an ensemble of roughly 11 000 very short simulations of length $20\,\mathrm{ps}$ each. Simulations were initiated from eight different starting structures labelled by the numbers 1-8 in Fig. 6(b), see Appendix A for details. It can be seen that the resulting empirical distribution does not even reach local equilibrium within the three metastable regions.

Like in the previous example, we find that it is possible to obtain precise estimates of stationary probabilities as soon as the OOM-based time scales have converged. In panel (c) of Fig. 6, we compare results for the equilibrium probability of all states in the right part of the plane, from a direct MSM and the corrected MSM. For lag times $\tau \geq 500\,\mathrm{fs}$, we are able to correct the bias introduced by strong non-equilibrium sampling.

In panels (d) and (f) of Fig. 6, we present estimates of the two slowest time scales *t*_{2}, *t*_{3} produced by the same estimators as before (OOM in blue, direct MSM in green, and corrected MSM in red). Additionally, the cyan lines correspond to the time scale estimates of an MSM using equilibrium simulations and the same discretization (see Appendix A). We find that the OOM-based spectral estimation provides accurate time scale estimates for short lag times starting at $\tau = 500\,\mathrm{fs}$. Moreover, we notice that for lag times as small as these, MSM time scales are clearly lower than the true time scales, although a decent discretization is employed. The difference between OOM and MSM estimates indicates that an even finer discretization would be required to match the references at these lag times. The direct estimates, the reference equilibrium time scales, and our OOM-based estimates of equilibrium time scales are nearly identical. Only the mean values extracted from bootstrapping for *t*_{2} seem to be a bit low; this remains to be investigated further.

Finally, the selected model ranks shown in Fig. 6(e) confirm that our framework can work in situations where low-rank descriptions of the dynamics using only a few processes are not adequate.

### C. Two-dimensional model system with poor discretization

Our final example is another finite state space Markov chain in the two-dimensional energy landscape shown in Fig. 7(a), defined by $40 \times 40$ microstates. Here we show the behavior of different estimators in an extreme case, where the discretization is so poor that MSM estimates fail completely. Transitions between neighboring states are now possible in both *x*- and *y*-directions, again based on a Metropolis criterion. We study the estimation of a Markov model using a discretization into 16 MSM states, also shown in Fig. 7(a). As can be seen in Fig. 7(b), there are two dominant time scales: $t_2 \approx 144\,000$ steps and $t_3 \approx 17\,000$ steps. The next time scale is clearly separated from the first two; after that, there is no more apparent time scale separation. This time, we fix the simulation length at *K* = 5000 steps, i.e., the trajectories are approximately 30 times shorter than the slowest time scale. The simulations are started from a uniform distribution over all microstates. In panels (c)-(h) of Fig. 7, we display the results if the number of simulations is set to *Q* = 2000 ((c), (e), and (g)) and *Q* = 10 000 ((d), (f), and (h)).

In Figs. 7(c) and 7(d), we show the estimation results for the equilibrium probability of the states labeled 13, 14, and 15. We expect this probability to be difficult to estimate, as these states blend different metastable and transition regions. It can be observed that the estimation of stationary probabilities is more sensitive to noise, see the results for *Q* = 2000. This observation is not surprising, as the stationary probabilities require accurate estimation of the two-step count matrices Eq. (7) from the data, which can be more difficult for rarely visited states. Still, for *Q* = 10 000, a reliable estimate is achieved and the biased estimate of the direct MSM can be corrected. Another comparison we make is between the estimates from the corrected MSM and those from long equilibrium simulations that use the same number of total data points, i.e., $T = 2000 \cdot 5000 = 10^7$ for *Q* = 2000 and $T = 10\,000 \cdot 5000 = 5 \cdot 10^7$ for *Q* = 10 000. We show mean values and standard errors from roughly 400 long simulations for *Q* = 2000, and roughly 900 simulations for *Q* = 10 000. In both cases, the long equilibrium trajectories provide more accurate estimates. In practice, however, one needs to strike a balance between long trajectories that are more beneficial for the analysis and short trajectories that can be more efficient for sampling and state exploration.^{36–38}

Again, we also compare the estimates for the slowest time scales *t*_{2} ((e) and (f)) and *t*_{3} ((g) and (h)) from a direct MSM, the corrected MSM, and the OOM-based spectral estimation. In both cases, correct estimates of both time scales can be obtained from the OOM, while both the direct and corrected MSMs estimate time scales one order of magnitude too small. This suggests that for a bad enough discretization, correcting for the effect of the non-equilibrium starting distribution will not be sufficient to achieve convergence in the time scales. However, the poor discretization quality is revealed by a large error between the OOM-based estimate and the corrected MSM, and this observation can be exploited in order to improve the discretization and repeat the analysis.

## V. CONCLUSIONS

We have investigated the quality of Markov state models when estimated from many simulations of short length, initiated from non-equilibrium starting conditions. We have derived an expression for the error between unbiased MSM transition probabilities and the expected estimate from many short simulations. This error is shown to depend on the simulation length, the lag time, and the state discretization. If ultra-long trajectories are employed, i.e., trajectories that are long compared to the slowest relaxation time scales, then the effect of the initial distribution is negligible and no further correction is needed. For ensembles of short trajectories, the situation is more complex. Preparing simulation trajectories in such a way that they emerge from a local equilibrium distribution does not appear to be of much practical use: this would only correct the first transition count of every trajectory, while the subsequent trajectory segments are still biased, because local equilibrium is lost at intermediate times as long as the trajectory ensemble is not in global equilibrium. In a similar sense, discarding initial simulation fragments can reduce the bias but cannot systematically remove it. In particular, since the bias only decays on the time scale of the slowest relaxation processes of the system, discarding pieces of simulation trajectories is typically more harmful in terms of lost statistics than it is useful for reducing the bias. With the standard MSM estimator, the most effective and simplest way to reduce the bias from the initial trajectory distribution is in fact to use a longer lag time or a better state space discretization; these are already the usual objectives of MSM construction. However, if the discretization is poor, the estimation bias due to a non-equilibrium starting distribution can be dramatic at practically usable lag times.

The main result of this paper is an improved estimator of the MSM transition matrix which is not biased by the initial distribution. This new estimator is based on the theory of observable operator models. In contrast to the standard MSM estimator, the corrected MSM estimator uses not only the number of transitions observed between pairs of states at lag time *τ* but also the number of transitions at lag time $2\tau$. These statistics are combined to obtain a transition matrix estimate at lag time *τ* that is unbiased by the initial trajectory distribution. While having to estimate statistics at $2\tau$ may seem like a deficiency compared to standard MSM estimation when only short simulation trajectories are available, note that the corrected MSM estimator yields significantly better estimates at short lag times, so in practice the lag times needed for a converged MSM will be smaller than for the standard estimator.

Finally, we report a result from the OOM framework that shows how the model-free relaxation time scales can be computed from the same statistics used for the corrected MSM estimator (i.e., transition matrices at lag times *τ* and $2\tau $). These estimates are only impaired by statistical error but are not affected by systematic MSM error as no MSM is used in the process of obtaining them. The difference between the corrected MSM time scales and the OOM time scales can be used in order to assess the discretization quality, as this difference goes to zero in the limit of good discretization.

This paper addresses the long-standing controversy about the correct use of simulation data from short non-equilibrium simulations for MSM estimation and their effect on the estimation of equilibrium expectations and kinetics.

## ACKNOWLEDGMENTS

This work was funded by Deutsche Forschungsgemeinschaft through SFB 958 and SFB 1114 and by the European Commission through ERC starting grant “pcCell.” C.C. is supported by the National Science Foundation (Grant No. CHE-1265929) and the Welch Foundation (Grant No. C-1570).

### APPENDIX A: SIMULATION SETUP OF ALANINE DIPEPTIDE

Molecular dynamics simulations of alanine dipeptide in explicit water at temperature $300\,\mathrm{K}$ were generated with the ACEMD^{39} software using the AMBER ff-99SB-ILDN force field^{40} and an integration time step of $2\,\mathrm{fs}$. The peptide was simulated inside a cubic box of volume $(2.3222\,\mathrm{nm})^3$ containing 651 TIP3P water molecules. The Langevin thermostat was used. The electrostatics were computed every two time steps by the particle-mesh Ewald (PME) method,^{41} using a real-space cutoff of $0.9\,\mathrm{nm}$ and a grid spacing of $0.1\,\mathrm{nm}$. All bonds between hydrogens and heavy atoms were constrained.

We have produced 11 388 ultra-short simulations of length $20\,\mathrm{ps}$ each, with a $50\,\mathrm{fs}$ saving interval. The simulations were initiated from eight different structures; their projections into $\varphi$-$\psi$-space are indicated by the numbers 1-8 in Fig. 6(b). The probabilities to start in each of these structures are given by the vector

These simulations were used to perform the analyses described in Sec. IV B. Using the same setup, we produced 2363 long runs of $1\,\mathrm{ns}$ simulation time each, with a $1\,\mathrm{ps}$ saving interval. We estimated a Markov model on the 40-state k-means discretization at lag time $\tau = 100\,\mathrm{ps}$ using this data set, and extracted the reference time scales and equilibrium probabilities shown as black lines in Fig. 6. Also, we used the stationary probabilities estimated from this model to initialize 203 short equilibrium runs of $500\,\mathrm{ps}$ simulation time each, with a $100\,\mathrm{fs}$ saving interval. This data set was used to compute the equilibrium time scales of the k-means discretization shown as cyan lines in Figs. 6(d) and 6(f).

### APPENDIX B: OOM PROBABILITY OF OBSERVATION SEQUENCE

Here, we show the derivation of the path probability formula Eq. (36), that can also be found in Ref. 21. In general, the left-hand side of Eq. (36) can be expressed by repeated integrals over the transition kernel,

Note that *π* appears in the first integral as we assumed that the dynamics is in equilibrium, i.e., the initial distribution equals *π*. Next, we replace all transition kernels by the expansion in Eq. (4),

In the second equation, we have used the *π*-orthogonality of the eigenfunctions $\psi_{m_0}$ and the fact that $\psi_1 \equiv 1$ in order to replace the $x_0$-integral by $\delta_{1,m_0}$. For the last integral, we have also used that $\psi_1 \equiv 1$. This is a sequence of matrix-vector products. It remains to use $\delta_{1,m_0} = \bm{\omega}(m_0)$ and that $\mathbf{\Xi}_{A_l}(m_{l-1},1) = [\mathbf{\Xi}_{A_l}\bm{\sigma}](m_{l-1})$. In matrix notation, Eq. (36) follows:

Finally, note that this derivation also works if the dynamics is not in equilibrium. In this case, the vector $\bm{\omega}$ is given by $\bm{\omega}(m_0) = \int_\Omega \mathrm{d}x_0\, \rho_0(x_0)\, \psi_{m_0}(x_0)$, where $\rho_0$ is the non-equilibrium initial condition.

### APPENDIX C: VARIABLE SIMULATION LENGTH

Here, we verify that the estimation algorithm from Sec. III B can be applied to data sets comprising simulations of non-uniform length. We assume that for $j = 1,\ldots,J$, there is an ensemble of $Q_j$ simulations of length $K_j + 2\tau$, i.e., $K_j$ transition pairs/triples will be used from each of these trajectories. We assume that $Q_j \to \infty$ for all *j*, such that every sub-ensemble samples from an empirical distribution $\rho_j$. Define the number of data points generated by the *j*th ensemble as $T_j = Q_j K_j$ and the total number of data points by

Moreover, we assume that $T_j/T \to \alpha_j$, i.e., the fraction of data points generated by the *j*th ensemble approaches a constant for all *j*. Let us define the distribution

Trajectories of length $K_j + 2\tau$ are enumerated by $q_j$ and labelled $\mathbf{Y}_{q_j}$. Further, let $s_{K_j}(\mathbf{Y}_{q_j})$ be any of the estimators from Eqs. (5)–(7), where the subscript $K_j$ indicates that $K - 2\tau$ in Eqs. (5)–(7) must be replaced by $K_j$. In addition, denote by $s(\mathbf{Y}_{q_j})$ the same estimator without the normalization. Also, let $c_{\rho_j}$ denote the corresponding correlation from Eqs. (20), (21), and (43) with respect to the density $\rho_j$. It follows that

The convergence in Eq. (C6) is convergence in probability. Thus, if we sum up all visits/transitions/two-step transitions and divide by the total number of data points in the end, we arrive at an asymptotically correct estimator of the correlations with respect to the density *ρ*. As the OOM estimation algorithm only relies on consistent estimators for correlations with respect to some empirical density *ρ*, it can still be applied in this setting. Finally, the normalization by $1/T$ can be omitted in practice because it cancels out in Eqs. (47)–(52).
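As an illustration of this pooling, raw counts can simply be accumulated over trajectories of different lengths with no per-trajectory normalization (function and variable names hypothetical):

```python
import numpy as np

def aggregate_counts(dtrajs, n_states, tau):
    """Pool raw (unnormalized) visit, transition, and two-step transition
    counts over discrete trajectories of arbitrary length; the 1/T
    normalization cancels in Eqs. (47)-(52) and is therefore omitted."""
    s = np.zeros(n_states)
    S_tau = np.zeros((n_states, n_states))
    S_2tau = np.zeros((n_states, n_states, n_states))
    for d in dtrajs:
        K = len(d) - 2 * tau                   # usable triples per trajectory
        for k in range(K):
            i, r, j = d[k], d[k + tau], d[k + 2 * tau]
            s[i] += 1                          # visit count
            S_tau[i, r] += 1                   # transition at lag tau
            S_2tau[r, i, j] += 1               # two-step transition i -> r -> j
    return s, S_tau, S_2tau
```

Each trajectory of length $K_j + 2\tau$ contributes exactly $K_j$ triples, matching the convention of Appendix C.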