A method is proposed to identify target states that optimize a metastability index amongst a set of trial states and use these target states as milestones (or core sets) to build Markov State Models (MSMs). If the optimized metastability index is small, this automatically guarantees the accuracy of the MSM, in the sense that the transitions between the target milestones is indeed approximately Markovian. The method is simple to implement and use, it does not require that the dynamics on the trial milestones be Markovian, and it also offers the possibility to partition the system’s state-space by assigning every trial milestone to the target milestones it is most likely to visit next and to identify transition state regions. Here the method is tested on the Gly-Ala-Gly peptide, where it is shown to correctly identify the expected metastable states in the dihedral angle space of the molecule without a priori information about these states. It is also applied to analyze the folding landscape of the Beta3s mini-protein, where it is shown to identify the folded basin as a connecting hub between an helix-rich region, which is entropically stabilized, and a beta-rich region, which is energetically stabilized and acts as a kinetic trap.

Markov State Models (MSMs) have become an integral part of the toolbox used to analyze the output of molecular dynamics (MD) simulations of complex systems such as proteins and other large biomolecules.1–7 They were developed in response to the need to process ever longer MD time series data, made either of one long trajectory or very many shorter ones, generated, e.g., by special-purpose high-performance computers,8 high-performance GPUs,9 or massively parallel simulations.10 The basic idea of MSMs is to represent the original dynamics as memoryless jumps between predefined states in the configuration space of the molecular system. Under this Markovian assumption, MD time series data can then be processed via inference techniques such as maximum likelihood estimation to calculate the rate matrix between these states. This matrix defines a Markov jump process (MJP), which in turns permits the calculation of interesting kinetic quantities of the system on time scales that may be larger than those reached in the MD simulations: indeed MSMs also permit to recombine short simulations run in parallel to extract long time information about the system.

A recurrent issue in the context of MSMs is how to pick the states on which to map the original dynamics—see Refs. 11–13 for some recent works in this direction. Indeed this mapping amounts to a drastic coarse-graining of the dynamics, and the jumps between poorly chosen coarse states will not be Markovian in general. This invalidates the basic assumption of MSMs and affects their reliability and accuracy. Fortunately, MD systems typically display metastability and this offers a way around this difficulty. In metastable systems, there exist regions that play the role of hubs: after visiting one such hub, the system returns often to it before making a transition to another. This guarantees that transitions between these hubs are indeed approximately Markovian, and metastability has therefore been invoked as the key property to justify MSMs and assess their accuracy (see, e.g., Refs. 6 and 7 for modern perspectives on the topic that summarize this viewpoint). What remains mostly open, however, is how to identify these hubs in practice.

In the present paper, we aim at addressing this question in the context of milestoning-based MSMs5 that combine the core set method originally introduced in Ref. 14 with the milestoning of the trajectories developed in Refs. 15–17. Unlike standard MSMs that are based on a full partition of the configuration space of the system into blocks that are used as states,3,18–22 milestoning-based MSMs use non-adjacent core sets as states (the milestones), and assign the MD time series to the index of the last such milestone it visited. This maps the original dynamics onto a symbolic dynamics on these indices that are then used as input to build the MSM by maximum likelihood estimation of its transition matrix. In metastable systems, the proper milestones to use should be the hubs mentioned before. Here we propose to identify these hubs among a set of trial milestones via optimization of a metastability index that measures how good the hubs are. The method can be justified within the framework of the potential theoretic approach to metastability developed by Bovier and collaborators.23–25 It has the advantage that it can be used even in situations where the dynamics on the trial milestones is non-Markovian. In this sense, it alleviates a difficulty with the standard approach used to build MSMs via clustering of trial states:7 This clustering is typically done using spectral analysis of the rate matrix of the chain built on these trial states, which may lead to artifacts since the dynamics on these trial states is non-Markovian in general. The method we propose avoids this difficulty altogether. In addition it avoids the need to introduce a time-lag to process the data, which may be difficult to adjust. As we will see, our method also offers a way to partition the state space of the system by identifying regions made of configurations most likely to reach a given hub, and to identify the members of the transition state ensemble as those trial milestones that have a non-negligible probability to reach more than one target milestone next.

The remainder of this paper is organized as follows. In Sec. II we start by presenting the algorithmic aspects of the method we propose, including how to define the trial milestones (Sec. II A) and the metastability index (Sec. II B), how to identify the target milestones that optimize this metastability index (Sec. II C), and how to build the MSM on these target milestones (Sec. II D). We also introduce a variety of diagnostic tests that can be used a posteriori to analyze the output of the MSM and use the trial milestones to get additional information about the system’s dynamics (Sec. II E). A theoretical justification of this algorithm is then given in Sec. III, first in the context of Markov jump processes, which is relevant, e.g., if one assumes that the dynamics on the trial milestones is itself Markovian (Sec. III A), then in the context of systems whose configurational space is continuous, like those encountered in MD simulations, where we cannot expect the dynamics on the trial milestones to be Markovian (Sec. III B). We also test the method on a simple one-dimensional example with a multiscale energy landscape (Sec. III C). In Sec. IV we then apply the method to analyze the dynamics of a Gly-Ala-Gly peptide, and in Sec. V we use it to analyze the folding pathways of the Beta3s mini-protein. Concluding remarks are given in Sec. VI. Some technical derivations are relegated to the  Appendix.

In this section, we outline the algorithm we propose to identify the milestones (or core sets) over which to build an MSM. In a nutshell, this is done by picking among a set of trial milestones a subset of target milestones which minimizes a metastability index—this index is defined so that small values are indicative of Markovianity. The target milestones are taken as states in the MSM, while the trial milestones are used to complement the predictions of this MSM and, e.g., partition the system’s configurational space or identify transition state regions in it.

Denote by x(t) ∈ ℝ3n a trajectory containing the instantaneous position of the n atoms in a molecular system. We assume that we have generated one or several such trajectories and our goal is to build a MSM that captures their main features. To this end, we introduce a set of N trial milestones, which we will denote by S1, S2, …, SN and label them by their index, i.e., i identifies Si. These milestones are disjoint sets in the system’s configuration space that can be defined, e.g., by requiring that some of the dihedral angles of the molecules take values between specific bounds—how to actually choose the trial milestones Si will be illustrated below on specific examples. In the spirit of milestoning, we then map each trajectory x(t) onto the index of the last trial milestone Si it hit, see Fig. 1 for an illustration. This way we obtain a piecewise constant index function i(t) whose value at time t is the index of the last milestone hit by x(t). Note in this procedure we discount recrossings: we only update the index function when a new milestone is hit.

FIG. 1.

Milestoning procedure. (a) Piece of a long trajectory x(t) crossing a set of three circular milestones, S1, S2, and S3. (b) The trajectory shown in (a) is mapped onto the index of the last milestone it hit, thereby defining the piecewise constant function i(t).

FIG. 1.

Milestoning procedure. (a) Piece of a long trajectory x(t) crossing a set of three circular milestones, S1, S2, and S3. (b) The trajectory shown in (a) is mapped onto the index of the last milestone it hit, thereby defining the piecewise constant function i(t).

Close modal

We stress that at this stage we do not assume that the dynamics of the index function i(t) on the trial milestones is Markov—in general it will not be. What we would like to do next is extract from the set of N trial milestones a subset of MN target milestones such that if we map the trajectory x(t) onto this subset of target milestones, the corresponding index function will be approximately Markovian—these target milestones are shown in red in the cartoon shown in Fig. 2. In the sequel, we will denote by M = { i 1 , i 2 , , i M } { 1 , 2 , , N } the set of indices identifying the target milestones, i.e., these are {Si1, Si2, …, SiM}. We will also refer to trial milestones that are not target ones as non-target milestones.

FIG. 2.

Schematic representation of a subset of target milestones (shown in red) immersed in a set of trial milestones. A good subset of target milestone is such that transitions from non-target to target milestones are likely, while transitions between target milestones are not. This can be quantified via the metastability index defined in (4).

FIG. 2.

Schematic representation of a subset of target milestones (shown in red) immersed in a set of trial milestones. A good subset of target milestone is such that transitions from non-target to target milestones are likely, while transitions between target milestones are not. This can be quantified via the metastability index defined in (4).

Close modal

How should the target milestones be chosen? Intuitively, they should be such that they are hubs among the trial milestones towards which the trajectory is attracted but between which it seldom makes transitions, as illustrated in Fig. 2. To make this idea concrete, let us first introduce the probability Γi,j that, if the trajectory hits milestone Si, then subsequently it will hit Sj with ji before hitting Si again (discounting recrossings: recall that only hits of different milestones are counted—in other words, to hit Si again, the trajectory must have hit at least one other milestone in between). To estimate Γi,j, out of each of the piecewise constant index function i(t), we first extract the sequence {i1, i2, i3, …} of successive values that this function takes—for example, for the time series illustrated in Fig. 1 this sequence starts with i1 = 1, i2 = 2, i3 = 1, i4 = 3, etc. We then cut this sequence into the Ni pieces which start from i and contain all the indices visited after i before i appears again. For example, if the sequence is made of 3 indices, i, j, k, and reads

{ i , k , i , j , k , j , i , k , j } ,
(1)

we cut it into three pieces (Ni = 3)

{ i , k } ; { i , j , k , j } ; and { i , k , j } .
(2)

Finally we count the number of pieces Ni,j in which j appears at least once (in the example above Ni,j = 2 since j appears in the last two pieces but not in the first), we add up these numbers coming from every piece of i(t) at our disposal, and we set

Γ ˆ i , j = N i , j N i
(3)

as the estimator for Γi,j. Note that the quality of this estimator depends on the lengths of the pieces of time series x(t) that we have at our disposal, and how to assess the statistical accuracy of (3) is nontrivial. As usual, we cannot expect the estimator to be accurate if these pieces are too short to observe all the relevant events in the dynamics of the system: In the context of (3) this requires that these pieces be long enough that trajectories starting at a non-target milestone have time to reach a target one. Indeed, as we will see below, this condition is sufficient to guarantee that we will correctly identify target from non-target milestones among the trial ones.

We will now use the matrix with entries Γi,j to quantify how good a set of target milestones will be to build an MSM. Specifically, given a candidate M = { i 1 , i 2 , , i M } identifying the target milestones, we estimate the quality of these milestones via their metastability index defined as

ρ M = max i M max j M { i } Γ i , j min i M max j M Γ i , j .
(4)

The smaller ρ M , the better the set of target milestones identified by M = { i 1 , i 2 , , i M } . This claim will be justified in Sec. III by connecting ρ M to a quantity originally introduced by Bovier, but let us briefly explain here why it is true. The numerator in (4),

max i M max j M { i } Γ i , j ,
(5)

identifies the target milestone Si for which the probability is the highest that, after hitting Si, the trajectory will hit some other target milestone Sj before hitting Si again. In this sense, Si is the worst target milestone in the set since we would like that transitions between these target milestones be unlikely, and the smaller (5) the better. Correspondingly, the denominator in (4),

min i M max j M Γ i , j ,
(6)

identifies the non-target milestone Si which is such that the trajectory has the lowest probability to hit a target milestone Sj before hitting Si again. In this sense, Si is the worst non-target milestone since we would like that transitions from non-target to target milestones be likely, and the larger (6) the better. The metastability index ρ M in (4) accounts for both the desiderata that (5) be small and (6) be large, and in this sense it measures the quality of the set of target milestones identified by M = { i 1 , i 2 , , i M } .

Since good sets of target milestones are those whose metastability index ρ M is small, we can systematically search for such good sets by minimizing ρ M . In principle, this can be done by considering increasing values M = 2, M = 3, etc. of the cardinal of M , and for each compute ρ M for every choice of M = { i 1 , i 2 , , i M } so as to identify the one with minimum ρ M . Any choice M for which ρ M is small leads to a good set of target milestones. Note however that ρ M can be small for different values of M, and so several of them should be considered—this effect will be illustrated in the examples below. Of course, if the number N of trial milestones is large, this direct search strategy will quickly become inefficient as M increases since the number of sets M to consider for each M is

N ! M ! ( N M ) ! .
(7)

To avoid this difficulty, we must adopt more efficient optimization strategies, for example using Monte Carlo schemes. We have used such schemes in the examples below. However, we found that the following searching strategy was typically the most efficient.

Given Γi,j, we can identify the index j(i) of the milestone that the trajectory is most likely to hit after hitting i as the one that maximizes Γi,j over all ji, i.e.,

Γ i , j ( i ) = max k i Γ i , k .
(8)

The function j(i) can be used to define ‘trajectories’ in index space: given i, we update it to j(i), then to j(j(i)), etc. Correspondingly, given a set M = { i 1 , i 2 , , i M } we can update this set by updating every entry in it, and only keeping the ones that remain different (for example we could have j(i1) = j(i2) in which case only one of these entries is kept in the update of M ). Because this updating identifies set of indices of milestones that are likely to be hit, these sets should have small metastability index ρ M , and this is indeed what we observed in practice. Specifically, we took random sets of indices M = { i 1 , i 2 , , i M } with random values of M and updated them as described above while monitoring the metastability index ρ M of these updated sets. We observed that this metastability index typically diminishes before starting to oscillates in a periodic fashion (this is because the update has no fixed point, j(i) ≠ i by construction). When this happened we stopped the update, kept the updated set M = { i 1 , i 2 , , i M } with smallest ρ M , and restarted the procedure with a different random set of indices M = { i 1 , i 2 , , i M } . After a few such iterations, we typically got a few different sets M = { i 1 , i 2 , , i M } (with different M) with small ρ M .

Once we have identified a good set of target milestones specified by the index set M = { i 1 , i 2 , , i M } , we can build an MSM using these milestones as states. How to do so was explained in Refs. 5 and 17, so let us be brief here and refer the reader to the original papers for details. For the sake of clarity, in the sequel it will be convenient to distinguish between target and non-target milestones: we will do so by using greek letters α, β, … to refer to the indices in the index set M , and Bα, Bβ, … to refer to the target milestones Si1, Si2, ....

Similar to what was done before, we can map the trajectory x(t) onto the index of the last target milestone it hit. This defines a piecewise constant function α(t) taking values in M . Because we are now using target milestones instead of trial ones, unlike i(t), α(t) should be approximately Markov. This means that the evolution of this function can be completely specified by a rate matrix with entries kα,β: to leading order in δt ≪ 1, kα,βδt gives the probability that α(t) jumps from the value α to the value βα in the interval [t, t + δt) (i.e., that x(t) hits the target milestone Bβ in that time interval if the last target milestone it hit before time t was Bα). In particular, if pα(t) denotes the probability distribution that α(t) takes the value α at time t, then pα(t) satisfies the master equation

d p α ( t ) d t = β α p β ( t ) k β , α p α ( t ) k α , β .
(9)

Similarly, we can write down equations for the distribution of first passage time from target milestone Bα to target milestone Bβ, its mean, etc. in terms of kα,β—see, e.g., Refs. 5 and 17.

The rate matrix entries kα,β can be estimated from the time series x(t) by the method of maximum likelihood. If we denote by Tα the total time the last target milestone hit by x(t) is Bα and by Nα,β the number of times the target milestone Bβ was hit directly after Bα along this time series, the maximum likelihood estimator for kα,β is

k ˆ α , β = N α , β T α .
(10)

This estimator is unbiased in the sense that, if α(t) is indeed Markov with rate matrix entries kα,β and the length of the time series tends to infinity, then k ˆ α , β k α , β in this limit. If the length of the time series is finite, the statistical errors on k ˆ α , β can be estimated by Bayesian sampling, see Ref. 5 for detail. Another source of errors are those due to residual non-Markovian effects in α(t). The smaller ρ M , the smaller these non-Markovian effects are, as discussed in Sec. III. In practice they can also be estimated via Markovianity tests, as illustrated on examples in Secs. IV and V. Notice also that if x(t) satisfies detailed balance, then we should have that Nα,β/Nβ,α → 1 as the length of the time series goes to infinity. When this length is finite, however, Nα,βNβ,α in general, and to enforce detailed balance of the MSM, it is better to use the following symmetrized estimator for the rate matrix entries:

k ˆ α , β = N α , β + N β , α 2 T α .
(11)

With this choice, the equilibrium distribution of the MSM, i.e., the distribution π ˆ α towards which the solution to (9) with kα,β replaced by its estimator k ˆ α , β converges as t → ∞, is

π ˆ α = T α T ,
(12)

where T = α T α is the total length of the time series used to estimate k ˆ α , β . The distribution π ˆ α gives the proportion of time during which Bα was the last milestone hit by x(t) and as the length of the time series increases, T → ∞, π ˆ α converges to the true equilibrium distribution of milestone Bα—explicit expressions for this distribution in the context of a system whose dynamics is governed by a Markov jump process or a Langevin equation will be given in Secs. III A and III B, respectively. We can use π ˆ α to estimate the free energy of milestone Bα

Δ G ˆ α = β 1 ln π ˆ α ,
(13)

where β denotes the reciprocal of the thermodynamic temperature of the system.

Even though the trial milestones merely serve as an intermediary to construct the actual MSM, these milestones can still be used to analyze the MD data and partition the state space in ways that highlight important features of its dynamics.

First we can organize the trial milestones onto a network: if Ni,j denotes the number of times milestone Sj was hit directly after Si, we put an edge between node i and node j with weight

d i , j = N i , j + N j , i 2 T ,
(14)

where T is the total length of the time series for x(t). Note that, consistent with the detailed balance condition, we have symmetrized the weight di,j, i.e., the network is undirected, di,j = dj,i.

Next we can compute the free energy of the trial milestones. If T i = 0 T δ i ( t ) , i d t denotes the total time that the last milestone hit was Si (so that i T i = T , the total length of the time series), we can define the probability distribution

p ˆ i = T i T .
(15)

The distribution p ˆ i gives the proportion of time during which Si was the last milestone hit by x(t) (see Secs. III A and III B for its expression in the limit as T → ∞ when the system’s dynamics is governed by a Markov jump process or a Langevin equation, respectively), and we can estimate the free energy of the trial milestone Si via

Δ A ˆ i = β 1 ln p ˆ i .
(16)

The free energy Δ A ˆ i defines a landscape on the network of trial milestones. One may expect that the nodes associated with the target milestones will be close to the local minima of Δ A ˆ i ; however, we stress that this need not be the case, since the dynamics on the full network (rather than the one restricted to the target milestones alone) can be quite complicated (in particular, not Markovian).

To partition the state space, it is better to introduce the committor functions of the trial milestone Si with respect to the target milestone Bα, qα(i). By definition, qα(i) gives probability that, after hitting Si, the trajectory will hit the target milestone Bα before hitting any other target milestones, and it can be estimated from the time series as

q ˆ α ( i ) = N i , α N i ,
(17)

where N i = α N i , α (so that α q ˆ α ( i ) = 1 ). The committor functions can be used to do a (soft) partitioning of the network into basins of nodes that are more likely to be attracted next to one target milestone rather than any other: for example, those nodes i such that qα(i) is close to 1 are associated with milestones Si out of which the trajectory is very likely to hit Bα next. A hard partitioning can also be performed by assigning i to α ( i ) = argmax α M q α ( i ) .

The committor probability qα(i) whose estimator is given in (17) can also be used to identify the transition state ensemble (TSE), i.e., the trial milestones that lie in between the states in the target set M . If the target set contains only two states, M = { α , β } , it follows that all trial milestones i satisfy qα(i) = 1 − qβ(i) and the TSE is such that q α ( i ) 1 2 . For target sets with more than 2 states the 1 2 criterium may become less effective as the TSE as it can in general connect multiple states. To get around this difficulty, we can introduce a TSE index based on the entropy of the committor probability qα(i). Recalling that α M q α ( i ) = 1 for any milestone Si (i.e., qα(i) is a probability distribution in α), we propose to use the normalized entropy (sometimes also referred to as efficiency) of qα(i) as TSE index,

σ ( i ) = { α M q α ( i ) ln q α ( i ) / ln n ( i ) n ( i ) > 1 0 n ( i ) = 1 ,
(18)

where the sum is carried out over the non-zero entries of qα(i) and n(i) is the number of such entries (more generally, we could restrict the sum to the entries of qα(i) that are above some small threshold δ). The TSE index σ(i) is 1 if the non zero values of qα(i) are all identical, and it is 0 if only one value of qα(i) is different than zero. Therefore, the closer σ(i) is to 1, the higher the chance that state i be a member of the transition state ensemble. Once the states in the TSE have been identified, we can go back to their committor values to determine between which target states they lay.

To characterize the physical origin of the metastability of the target milestones, it is also useful to decompose their free energy into an energetic component and an entropic one. This can be done as follows. Given the system’s potential energy U(x), a mean energy can be assigned to each of the target milestones via

Δ E ˆ α = 1 T α 0 T U ( x ( t ) ) δ α ( t ) , α d t E ̄ ,
(19)

where Tα the total time the last target milestone hit by x(t) is Bα and

E ̄ = 1 T 0 T U ( x ( t ) ) d t
(20)

so that α Δ E ˆ α π ˆ α = 0 . We can then estimate the entropy Δ S ˆ α of the target milestone via

( k B β ) 1 Δ S ˆ α = Δ E ˆ α Δ G ˆ α ,
(21)

where Δ G ˆ α is the free energy estimated in (13) and kB is Boltzmann constant. Target milestones with comparable Δ Gα have similar statistical weights, and the lower Δ Gα the more thermodynamically stable they are (meaning the time series x(t) tends to return to them more often, comparatively): by comparing their values of Δ E ˆ α and Δ S ˆ α we can then determine whether their stability is of energetic or entropic origin, respectively.

The usefulness of the diagnostic tools introduced above will be illustrated in the examples treated in Secs. IV and V.

Let us now justify the use of the metastability index defined in (4) to identify target milestones over which the dynamics can be mapped in an approximately Markovian way.

We begin by discussing the simpler case when the dynamics of i(t) on the trial milestones is itself Markovian—as we mentioned in Sec. II, we do not make this assumption in general, but it is a convenient starting point for our theoretical explanation. The general situation when i(t) is not Markovian will be discussed in Sec. III B.

If i(t) is Markov, its dynamics is specified by a rate matrix L whose entries we will denote by Li,j to distinguish them from the rate matrix entries kα,β on the target milestones. Assuming detailed balance, Li,j satisfies

p i L i , j = p j L j , i , i , j = 1 , , N ,
(22)

where N is the number of trial milestones and pi is their equilibrium probability density—the estimator for pi was given in (15). Together with the assumption of ergodicity (i.e., that time averages along i(t) converge towards ensemble averages over pi), (22) implies that all the N eigenvalues of Li,j are real, with one being 0 and all the other negative. We will denote these eigenvalues by λi, and order them so that 0 = λ0 ≤ |λ1| ≤ ⋯ ≤ |λN−1|.

The eigenvalues of L permit to give a precise definition of what it means for the chain to display metastability. In turns this indicates how to coarse-grain this chain, which in our context means how to chose good target milestones. Specifically, a chain will be metastable if its eigenvalues can be separated into two well separated groups, i.e., if there exists an M < N such that

λ M 1 / λ M 1 .
(23)

If such a separation exists, it means that the M eigenvalues with index smaller than M describe relaxation processes in the chain that occur on timescales that are much slower than those described by the NM eigenvalues with index larger or equal to M. In turn, this implies that these slow processes can be approximately described by a smaller chain with only M states. The practical questions then become: (i) how to assess whether (23) is satisfied for some M without having to compute the full spectrum of L (since this computation is hard in general), and (ii) how to reduce the dynamics to a chain with only M states?

In Ref. 23, Bovier and collaborators addressed these two questions, and the answers they provided are the basis for the algorithm we presented in Sec. II. First they proved that (23) holds if and only if an indicator closely related to the metastability index ρ M in (4) is small for some M = { α 1 , α 2 , , α M } . If that is the case, the ratio in (23) is in fact proportional to ρ M 2 . Second, they showed how to reduce the chain onto a smaller chain involving only the nodes identified by M (i.e., in our context, involving only the target milestones) in such a way that the M eigenvalues of this reduced chain be close to the M first eigenvalues of the original chain (where closeness can again be measured in terms of ρ M ).

It should be stressed that (23) can be satisfied with more than one value of M. This simply means that there can be more than one low-lying group of eigenvalues. In turns this implies that there can be more than one choice of good target milestones.

For completeness, let us end this section by giving explicit expressions for some of the quantities that were defined in Sec. II in the context of a system whose dynamics is governed by an MJP with generator L. First, the probability Γi,j whose estimator was given in (3) can be expressed as

Γ i , j = k = 1 N P i , k q i , j ( k ) .
(24)

Here Pi,j (not to be confused with the entries of the transfer operator T(τ) = e, τ > 0) are the entries of the transition matrix defined as

P i , j = L i , j j i L i , j ( i j ) ,
(25)

which gives the probability that the state first visited by the chain after i is ji; and qi,j(k) is the committor probability solution of

{ l = 1 N L k , l q i , j ( l ) = 0 , k { i , j } q i , j ( k ) = 0 , k = i q i , j ( k ) = 1 , k = j .
(26)

The probabilities Γi,j can also be conveniently calculated in terms of mean recurrence times (MRTs) and mean first passage times (MFPTs) via the formula

Γ i , j = τ i τ i , j + τ j , i ,
(27)

where τi is the MRT of the state i and τi,j is the MFPT to go from state i to state j. A proof of relation (27) can be found in the  Appendix of this paper. In addition, the committor probability qα(i) whose estimator was given in (17) solves an equation similar to (26), with different boundary conditions,

{ j = 1 N L i , j q α ( j ) = 0 , i M q α ( i ) = 0 , i M { α } q α ( i ) = 1 , i = α .
(28)

We can relate the distributions πα and pi, whose estimators were given in (12) and (15) respectively, as

π α = i = 1 N p i q α ( i )
(29)

which also means that the corresponding free energies whose estimators were given in (13) and (16) respectively, are related as

e β Δ G α = i = 1 N e β Δ A i q α ( i ) .
(30)

Finally, the rate matrix entries on the target milestone, kα,β, whose estimator was given in (11) can be expressed as

k α , β = 1 π α i = 1 N p i q α ( i ) L i , β ( α β ) .
(31)

These formulas can be justified within the framework of transition path theory (TPT)26–28 and are useful for analysis. However we stress that we do not need to solve any of the equations above to apply the procedure outlined in Sec. II. Indeed, this procedure can be used with the estimators given in that section, which only require the time series x(t) as input.

In Refs. 24 and 25, the results of Ref. 23 were generalized to situations where the Markovian dynamics takes place on a continuous state-space, like x(t) does (or more generally the pair (x(t), p(t)) if the momentum p(t) needs to be added to make the description Markovian, as in (32) below). The main technical difficulty in that case is that the metastability index ρ M defined in (4) measures the probability to go to a state after leaving another one, rather than returning to that state. The problem is that, in the continuous state-space setting, these states cannot be identified with points in the state space, since the probability to hit a point is zero as soon as the space dimension is higher than 1.

To get around this difficulty, it was shown in Refs. 24 and 25 that one can redefine states by fattening any specific point into a little domain that contains it, so that a metastability index ρ M can again be defined as in (4) and allows to identify low-lying groups of eigenvalues when they exist. This fattening procedure is similar to that of defining trial milestones: they are indeed regions containing specific points in the original state-space of the system, which provides a theoretical justification to the algorithm proposed in Sec. III. It is important to note, however, that Refs. 24 and 25 only gave prescriptions on how to perform this fattening (i.e., how to define trial milestones) in very specific (and simple) cases, like that of an system governed by overdamped dynamics in the limit of very small temperature. We are obviously interested in more complicated situations here, in which case it is no a priori obvious how to define the trial milestones. A few procedures to do so will be discussed in Secs. IV and V: these procedures are by no means the only ones one could envision, but they proved sufficient in these examples and should be transportable to other ones. Note that this also means that one should verify a posteriori that the dynamics on the target milestones identified by the procedure is indeed approximately Markovian. As usual, this can be done by checking that the first passage time between target milestones adjacent on the network of the MSM are exponentially distributed. In the examples we treated below, this turned out to be the case, indicating that the MSMs we constructed were indeed accurate.

Assuming that we have picked trial milestones and identified target ones, for completeness let us give explicit formulas for some of the quantities introduced in Sec. II in the context of a system whose dynamics is governed by the Langevin equation

{ x ̇ = m 1 p p ̇ = U ( x ) γ p + 2 β 1 m 1 / 2 γ 1 / 2 η ( t ) ,
(32)

where U(x) is the potential energy of the system, m the mass matrix, γ the friction tensor, and η(t) is a white-noise satisfying 〈η(t)〉 = 0, 〈η(t)ηT(s)〉 = Id δ(ts)—other choices of thermostats are possible, and the formula below can be adapted to those straightforwardly. System (32) is ergodic with respect to Boltzmann-Gibbs probability density function

ρ ( x , p ) = Z 1 e β H ( x , p ) ,
(33)

where H ( x , p ) = 1 2 p T m 1 p + U ( x ) is the Hamiltonian and Z = ∫Ω×ℝ3neβH(x,p)dxdp. The expression for the probability Γi,j whose estimator was given in (3) is quite complicated if, unlike what we did in Sec. III A, we do not assume that the dynamics on the trial milestones is Markovian. We can, however, give explicit expression for the rate matrix entries kα,β, the distribution πα, and the free energy Gα whose estimators were given in (11)(13), respectively. These expressions were derived in Ref. 5 and they involve the committor function QαQα(x, p) solution of

0 = m 1 p x Q α U ( x ) p Q α γ p p Q α + β 1 m γ : p p Q α
(34)

with the boundary condition Qα(x, p) = 1 if x ∈ ∂Sα and n ˆ α ( x ) p > 0 and Qα(x, p) = 0 if x β M α S β and n ˆ β ( x ) p > 0 , where n ˆ α ( x ) denotes the unit normal vector pointing outward ∂Sα at point x ∈ ∂Sα and similarly for n ˆ β ( x ) . The committor function Qα(x, p) gives the probability that the trajectory initiated at (x, p) reaches Sα before any Sβ with βα; using the invariance of the dynamics under t → −t and p → −p, Qα(x, − p) also gives the probability that the trajectory arriving at (x, p) was last in Sα rather than in any Sβ with βα. We then have

π α = Ω × R 3 n ρ ( x , p ) Q α ( x , p ) d x d p
(35)

so that Δ Gα = β−1lnπα. Also

k α , β = π α 1 S β × R 3 n ρ ( x , p ) Q α ( x , p ) Q β ( x , p ) × | n β ( x ) m 1 p | d σ β ( x ) d p ,
(36)

where β(x) denotes the surface element on ∂Sβ. These formulas can again be derived from TPT.26,28 Similar expressions can be given for pi and Δ Ai, whose estimators were given in (15) and (16) by modifying the boundary conditions in (34).

It is useful to illustrate the results of this section on a simple example. Specifically, we consider the motion of a particle by overdamped Langevin dynamics on the one-dimensional potential energy depicted in Fig. 3(a). The governing equation is

γ x ̇ ( t ) = U ( x ) + 2 β 1 γ η ( t ) ,
(37)

where U′(x) denotes the derivative of potential energy U(x), γ is the friction coefficient (which we will set to γ = 1) and η(t) a white-noise such that 〈η(t)〉 = 0 and 〈η(t) η(t′)〉 = δ(tt′). Typical trajectories solution of (37) at three different temperatures are shown in Fig. 3(b).

FIG. 3.

(a) The one-dimensional multiple well energy landscape example. The red dots identify and label the milestones contained in the sets of target milestones. Their corresponding index sets are M ( 2 ) = { 1 , 2 } , M ( 3 ) = { 1 , 2 , 3 } , M ( 9 ) = { 1 , 2 , , 9 } and M ( 27 ) = { 1 , 2 , , 27 } . (b) Overdamped Langevin trajectories solution of (37) at three temperatures kBT = 0.5, 0.75, and 1.0 (black curves). The index function α(t) associated with the target sets M ( 9 ) , M ( 2 ) , and M ( 3 ) are superimposed on the trajectories x(t) (red curves).

FIG. 3.

(a) The one-dimensional multiple well energy landscape example. The red dots identify and label the milestones contained in the sets of target milestones. Their corresponding index sets are M ( 2 ) = { 1 , 2 } , M ( 3 ) = { 1 , 2 , 3 } , M ( 9 ) = { 1 , 2 , , 9 } and M ( 27 ) = { 1 , 2 , , 27 } . (b) Overdamped Langevin trajectories solution of (37) at three temperatures kBT = 0.5, 0.75, and 1.0 (black curves). The index function α(t) associated with the target sets M ( 9 ) , M ( 2 ) , and M ( 3 ) are superimposed on the trajectories x(t) (red curves).

Close modal

As can been seen in Fig. 3(a), the potential has a hierarchical structure with a total of 27 local minima separated by barriers of various heights. Correspondingly, the generator of the process governed by (37), namely, the operator

L = U ( x ) x + β 1 x 2
(38)

has a spectrum with several groups of low-lying eigenvalues. This spectrum was obtained by spatial-discretization of (38) on a grid of 200 points, and the ratio of successive eigenvalues are shown in Fig. 4(a) at several different temperatures. Small ratios identify low-lying groups of eigenvalues, and several of them can be seen: λ1/λ2, λ2/λ3, λ8/λ9, and finally λ26/λ27 are all small at temperatures ranging from β−1 = 0.5 to β−1 = 1.0. The eigenvalues involved in these groups describe processes arising on slow time scales that can be organized as follows: λ1 is associated with longest time scale of hopping over the largest barrier separating the left basin left centered around milestone 1 and right one centered around milestone 2 in Fig. 3(a); λ2 is associated with the next longest time hopping over the barrier separating the basin at the extreme left centered around milestone 1 and the one in the center centered around milestone 3; λ3, …, λ8 are associated with hoping over the barriers separating milestones 1, 4, 5, milestones 3, 6, 7, and milestones 2, 8, 9; and finally λ9, …, λ27 are associated with hoping over the barriers separating milestones 4, 10, 11, milestones 1, 12, 13, etc. From the results in Secs. III A and III B, the existence of the low lying groups also suggests that the continuous-time dynamics can be approximated by Markov chains (i.e., MSMs) containing, respectively, 2, 3, 9, and 27 states.

FIG. 4.

(a) First 30 ratios of consecutive eigenvalues of the generator (38) in a range of temperatures from 0.5 to 1.0. The small ratios are λ1/λ2, λ2/λ3, λ8/λ9, and λ26/λ27. (b) The values of the metastability index ρ M minimized over index sets M of increasing cardinal M: ρ M clearly correlates well with the eigenvalue ratios shown in panel (a), and the associated index sets M permit to identify the target milestones shown in Fig. 3 that capture the slow processes in the system.

FIG. 4.

(a) First 30 ratios of consecutive eigenvalues of the generator (38) in a range of temperatures from 0.5 to 1.0. The small ratios are λ1/λ2, λ2/λ3, λ8/λ9, and λ26/λ27. (b) The values of the metastability index ρ M minimized over index sets M of increasing cardinal M: ρ M clearly correlates well with the eigenvalue ratios shown in panel (a), and the associated index sets M permit to identify the target milestones shown in Fig. 3 that capture the slow processes in the system.

Close modal

To confirm this prediction, we used the algorithm presented in Sec. II to construct these MSMs. Specifically, we used the 200 discretization points uniformly spaced between x = − 5 and x = 5 as trial milestones and computed the matrix entries Γi,j defined in (3) from a trajectory obtained by integrating (37). We then minimized the metastability index ρ M over index sets M containing from M = 2 up to M = 30 indices. The minimum values of ρ M for each value of M are shown in Fig. 4(b) and, consistent with Bovier’s result, they correlate well with the values of the eigenvalue ratios λM−1/λM shown in Fig. 4(a), thereby confirming that we can use the metastability index ρ M to identify low-lying groups of eigenvalues. In particular, ρ M could be made small when M = 2, 3, 9, and 27, and the corresponding index sets were M ( 2 ) = { 1 , 2 } , M ( 3 ) = { 1 , 2 , 3 } , M ( 9 ) = { 1 , , 9 } , and M ( 27 ) = { 1 , , 27 } . The associated target milestones are shown in Fig. 3(a). They clearly identify the lowest point of the wells on the hierarchical potential landscape U(x). The trajectory projected onto these target milestones when M ( 2 ) = { 1 , 2 } , M ( 3 ) = { 1 , 2 , 3 } , and M ( 9 ) = { 1 , , 9 } are shown in red in Fig. 3(b): these red pieces can be used to calculate the rate matrices kα,β of the different MSMs on M ( 2 ) , M ( 3 ) , M ( 9 ) , and M ( 27 ) via maximum likelihood maximization, as explained in Sec. II D. We can then calculate the spectrum of these rate matrices and compared them to the spectrum of the generator (38) of the original process. The result of these calculation is shown in Fig. 5 which shows that the eigenvalues of these different MSMs do indeed match the low-lying ones of the generator of the original process.

FIG. 5.

The eigenvalues of the MSMs made of the states in M ( 2 ) (circles), M ( 3 ) (diamonds), M ( 9 ) (squares), and M ( 27 ) (stars) are plotted against the eigenvalues of the generator of the original process with the same indices. As can be seen, the MSMs capture the low-lying part of the spectrum accurately, with deviations observed only for the largest eigenvalues in the low-lying group, where metastability becomes weaker.

FIG. 5.

The eigenvalues of the MSMs made of the states in M ( 2 ) (circles), M ( 3 ) (diamonds), M ( 9 ) (squares), and M ( 27 ) (stars) are plotted against the eigenvalues of the generator of the original process with the same indices. As can be seen, the MSMs capture the low-lying part of the spectrum accurately, with deviations observed only for the largest eigenvalues in the low-lying group, where metastability becomes weaker.

Close modal

Note that the calculations above to identify target milestones were conducted at different temperatures, and showed that the target milestones were robust for temperature ranging from β−1 = 0.5 to β−1 = 1.0, even though their corresponding metastability index ρ M slowly grew with temperature. This is consistent with the fact that the ratios λM−1/λM also grow with temperature, as the system becomes slowly less metastable as it is heated up, with the groups associated with the lowest barriers disappearing first.

Note also that in these calculations, few transition events between the target milestones were observed (much less in particular that what is required to calculate the rate matrix entries kα,β accurately), and yet the procedure was able to identify these milestones correctly.

In this section, we use the method outlined in Sec. II to analyze a MD trajectory of a solvated Glycine-Alanine-Glycine peptide (GAG). The GAG peptide was modeled using the CHARMM 27 force field and simulated in a box of 475 TIP3P water molecules using the program NAMD version 2.8.29 After minimization the system was equilibrated for 10 ns with the peptide held constrained. The equilibration was followed by a 1.3 μs production run with Langevin dynamics at 330 K, using a friction constant of 5 ps. The bonds between hydrogens and heavy atoms were kept rigid to allow integration at 2 fs. Frames were saved every 0.5 ps and a total of 2.6 ⋅ 106 trajectory snapshots were collected.

To construct the set of trial milestones, we projected the MD trajectory onto the pair of dihedral angles (ϕ, ψ) that corresponds to the central residue alanine. We discretized the (ϕ, ψ)-space into a square grid of size 8 deg and used circles of radius 4 deg around these discretization points as trial milestones—we also used a finer grained definition of the trial milestones, a grid of points at 5° but this did not significantly change the results. Only those milestones that we were hit by the trajectory were kept: these consist of the N = 1303 milestones shown as circles in Fig. 6(a), and colored according to their free energy Δ Ai whose estimator is given in (16). This free energy defines a landscape that is divided in four macro-regions, often labeled as C5, PII, αR, αL, and αD. Interestingly, the overall topography of this landscape is consistent with that obtained in a reference study on a solvated alanine dipeptide30 where the αR conformation is the most populated (see the Ramachandran map in Ref. 31 for the conformation names and Refs. 31–33 for comparison with NMR studies on the alanine conformational preferences). Since the target milestones that emerged from our analysis were located in the regions around C5, PII, αR, αL, and αD (these target milestones are shown as filled white circles in Fig. 6(a)), we used this nomenclature to designate them. Fig. 6(b) shows a portion of the trajectory projected on the angles ϕ and ψ.

FIG. 6.

(a) Trial milestones used to analyze the trajectory from the MD simulation of the GAG peptide at T = 330 K. These trial milestones are disks centered around grid points in the (ϕ, ψ)-space, and they are colored according to their free energy Δ Ai, whose estimator is given in (16). The states composing the target milestones αR, αL, PII, and αD are represented as white filled circles. (b) A portion of a 1.3 μs MD simulation at T = 330 K projected onto ϕ and ψ (black curve). In red we show the index function α(t) of the states in the target set M ( 4 ) = { α R , α L , P II , α D } .

FIG. 6.

(a) Trial milestones used to analyze the trajectory from the MD simulation of the GAG peptide at T = 330 K. These trial milestones are disks centered around grid points in the (ϕ, ψ)-space, and they are colored according to their free energy Δ Ai, whose estimator is given in (16). The states composing the target milestones αR, αL, PII, and αD are represented as white filled circles. (b) A portion of a 1.3 μs MD simulation at T = 330 K projected onto ϕ and ψ (black curve). In red we show the index function α(t) of the states in the target set M ( 4 ) = { α R , α L , P II , α D } .

Close modal

From the MD trajectory the matrix of probabilities Γi,j was estimated from (3), and the metastability index ρ M was minimized over index sets M of different cardinals M. The results are shown in Fig. 7. Two sets of target milestones were found to be clearly metastable: M ( 2 ) = { α R , α L } with metastability index ρ M = 0 . 29 , and M ( 4 ) = { α R , α L , P II , α D } with metastability index ρ M = 0 . 45 . A third set of target milestones, M ( 3 ) = { α R , α L , P II } was also metastable, but with a higher metastability index ρ M = 0 . 8 . Interestingly, the index set minimizing ρ M when M = 5 was M ( 5 ) = { α R , α L , P II , α D , C 5 } , but its metastability index ρ M was close to 1, i.e., it was not deemed suitable by our analysis to be used to construct an MSM.

FIG. 7.

GAG peptide: Minimum values of the metastability index ρ M as a function of M, along with the index sets M associated with ρ M smaller than 1. A Monte Carlo scheme was utilized to perform the minimization.

FIG. 7.

GAG peptide: Minimum values of the metastability index ρ M as a function of M, along with the index sets M associated with ρ M smaller than 1. A Monte Carlo scheme was utilized to perform the minimization.

Close modal

To confirm that the target milestones in M ( 2 ) , M ( 3 ) , and M ( 4 ) were good core sets to build an MSM (whereas others, including M ( 5 ) , were not), we estimated the transition rate matrix entries kα,β of the corresponding MSMs using the procedure described in Sec. II D. We then estimated the empirical first passage time (FPT) distributions from the MD trajectory for the transitions between the states in these MSMs. These were found to be approximately exponential, with decay rates consistent with those deduced from the matrix with entries kα,β (data shown in Fig. S1 of the supplementary material34). This indicates that the dynamics projected on these states is approximately Markovian, and confirms that a low metastability index is indeed a sign of Markovianity. Conversely, the FPT distributions between milestones C5 and PII were non-exponential, which is indicative of the non-Markovian character of these transitions (see Fig. S1(a) in supplementary material34). Interestingly, this result is consistent with the experimental findings about the non-cooperativity of the alanine PII conformation in GGAGG peptides due to highly local hydration effects.35 It is also noteworthy that the transitions between the target milestones occur on different time scales: τ(PIIαR) = 0.48 ns, τ(αRPII) = 0.73 ns, τ(PIIαL) = 9.5 ns, τ(αRαL) = 9.1 ns, τ(αLαD) = 8.5 ns, τ(αDαL) = 2.7 ns. This explains why different sets of target milestones can be identified, similar to what we observed in the illustrative example of Sec. III C. Note that the transition time scales between PII and αR are consistent with the ≈1 ns−1 interconversion rates reported in earlier studies on a solvated alanine dipeptide.30,36,37 Note also that the target milestone αL plays the role of a hub for the transitions involving the left quadrant of the Ramachandran plot with the basin centered in αD. That is due to the fact that direct transitions from the helical region αR to the bottom right quadrant are very rare events in a solvated alanine.

Finally, the MSM with the target milestones in M ( 4 ) was used to cluster the state space of the system. These results are shown in Fig. 8. The committor probabilities qαR(i) shown in Fig. 8(a) permit one to assign each of the trial milestones to the target milestones it is most likely to reach next, thereby partitioning the dihedral space into basins (see Fig. 8(c)). In Fig. 8(b) the trial milestones are colored according to the TSE index σ(i) defined in (18). The red regions represent the trial milestones Si which are likely to part of a transition state ensemble. Also shown in Fig. 8(d) is the network of the MSM.

FIG. 8.

GAG peptide: Analysis based on the MSM with states in M ( 4 ) . (a) Committor probabilities qαR(i). (b) The transition state index σ(i), the red regions represent the milestones likely to be part of a TSE; unfilled circles are trial milestones that were visited only once (and hence for which qα(i) = 1 for some α). (c) Hard partitioning of the trial milestones using α ( i ) = argmax α M q α ( i ) . (d) Network of the MSM: The numbers of transitions observed between pairs are shown on the edges, and the equilibrium probability of the target milestones are shown next to the nodes.

FIG. 8.

GAG peptide: Analysis based on the MSM with states in M ( 4 ) . (a) Committor probabilities qαR(i). (b) The transition state index σ(i), the red regions represent the milestones likely to be part of a TSE; unfilled circles are trial milestones that were visited only once (and hence for which qα(i) = 1 for some α). (c) Hard partitioning of the trial milestones using α ( i ) = argmax α M q α ( i ) . (d) Network of the MSM: The numbers of transitions observed between pairs are shown on the edges, and the equilibrium probability of the target milestones are shown next to the nodes.

Close modal

As a last example we applied our method to analyze a 20 μs long MD trajectory of the Beta3s peptide in implicit solvent at 330 K, where multiple folding/unfolding events were observed. The Beta3s is a 20 residue peptide which is known to assume a triple stranded β-sheet fold.38–41 The simulation details of the MD data we analyzed here can be found in Ref. 42. The 20 μs long trajectory is composed of 106 microstates saved at a lag-time τ = 20 ps. In the context of the GAG peptide (Sec. IV), the torsional angles of the central alanine residue were shown to be sufficiently good descriptors of the conformational space and were used to define the trial milestones. This procedure, however, is not applicable to a 20 residues mini protein such as Beta3s, whose configurational space is much more complex than that of a tri-peptide. For the initial definition of the trial set of milestones several alternatives are in principle viable. Here we used the main dihedral angles along the chain as starting point.

Specifically, the trial milestones were defined at the level of the individual amino acids. For a protein of R residues there are R − 2 pairs of main dihedral angles (ϕr, ψr) along the chain, where r = 2, …, (R − 2) denotes the residue index. For each of the R − 2 Ramachandran plots that are associated with a residue, three trial milestones were defined as circles of radius 30°, and denoted as milestone 0, 1, and 2. The centers of these circles are located at the minima of the free energy landscape obtained from the empirical (ϕr, ψr) probability density of all the residues combined (see the free energy contour plot in Fig. 9(a)) that is estimated from the MD trajectory. The locations of the milestone centers are (−90, 110) for state 0 (beta-sheet region), (−80, − 40) for the state 1 (helix region), and (75, − 90) for state 2 (turn/loop region). In this way, the value of each pair of residue (ϕr, ψr) along the MD trajectory is first mapped on a three-letter alphabet, and these letters are then combined into a word, resulting in the re-presentation

S i = s ( 2 ) s ( 3 ) s ( R 2 ) ,
(39)

where s(r) ∈ (0, 1, 2) and r the residue index. For example, the folded state is represented as “000021000000210000.” Fig. 9(b) gives a pictorial representation of the construction of symbolic milestones. Since Beta3s is a 20 residues mini-protein, the upper limit of accessible trial milestones constructed this way is therefore 318 ≈ 4 ⋅ 108.

FIG. 9.

Beta3s mini-protein: (a) Trajectory of the dihedral angles (ϕ, ψ) of a single amino acid crossing the three circular milestones 0, 1, and 2. The part of the trajectory in red contributes to the statistics of the events 0 → 1 while the white double stroked portion contributes to the event 1 → 2. (b) Combinations of the dihedral milestones of two residues into chain milestones (strings). (c) Ranked frequencies of the dihedral milestones prior (black) and after (red) the filtering that yielded a trial set of N = 2120 milestones with a filter cutoff of 30. The unfiltered ranked frequencies follow the Zipf’s law 1/ra with exponent a ≈ 0.76.

FIG. 9.

Beta3s mini-protein: (a) Trajectory of the dihedral angles (ϕ, ψ) of a single amino acid crossing the three circular milestones 0, 1, and 2. The part of the trajectory in red contributes to the statistics of the events 0 → 1 while the white double stroked portion contributes to the event 1 → 2. (b) Combinations of the dihedral milestones of two residues into chain milestones (strings). (c) Ranked frequencies of the dihedral milestones prior (black) and after (red) the filtering that yielded a trial set of N = 2120 milestones with a filter cutoff of 30. The unfiltered ranked frequencies follow the Zipf’s law 1/ra with exponent a ≈ 0.76.

Close modal

In practice, not all these trial milestones were visited even once along the trajectory, and some were visited much more often than others. To avoid using trial milestones visited too few times, we filtered out the milestones that were visited less than a threshold value of times. The first step of dihedral milestoning gave a total number of N0 = 469 677 dihedral strings, 373 456 of which were visited only once, which is about ∼37% of the whole set. Interestingly, the ranked distribution of the visiting frequencies follows a Zipf’s law decay, namely, a power law ∼1/ra with r the rank and a = 0.76 the exponent (see Figure 9(c)). Different frequency thresholds were used to filter out trial milestones. Cutoffs corresponding to 20, 30, 40, and 50 reduced the number of trial milestones to 3439, 2120, 1516, and 1162, respectively. All these four sets of trial milestones were used to process the MD trajectory and estimate the probabilities Γi,j. The metastability index ρ M was calculated for target sets M with cardinality in the range M = 2, …, 20. Fig. 10 shows the result of this calculation. Irrespective of the cutoffs used to define the set of trial milestones, the resulting target sets were robust in the range M = 2, …, 8 where at the low values of ρ M 1 .

FIG. 10.

Metastability index for the Beta3s peptide at 330 K calculated using four different sets of trial milestones obtained with different filtering cutoff. In each case, the procedure identify the same target milestones with metastability index lower than 1.

FIG. 10.

Metastability index for the Beta3s peptide at 330 K calculated using four different sets of trial milestones obtained with different filtering cutoff. In each case, the procedure identify the same target milestones with metastability index lower than 1.

Close modal

To better understand the conformational dynamics of the Beta3s peptide, we used the tools introduced in Sec. II E. The committor probability qα(i) (obtained from (17)) permits one to perform a soft partitioning of the trial milestones. Each target set milestone Bα is the center of a partition with statistical weight πα that is obtained from (12), and from which the free energy Δ Gα is determined from (13). This free energy can then be decomposed into energetic and entropic contributions according to (19) and (21). Here the effective energy of the system is identified as the sum of two contributions: E ̄ = E charmm + E sasa where Echarmm is the total potential energy in the CHARMM force field and Esasa is the solvent accessible term due to the implicit solvation model. Using (20) we obtained E ̄ = 37 . 9 ± 11 . 1 kcal/mol as mean effective energy of the entire set of conformation sampled.

We focused our analysis on the target set M ( 8 ) that resulted the largest metastable target set with ρ M < 1 . The committor probability qα(i) is calculated from the time series of trial milestones, with α M ( 8 ) . Table I lists the statistical weights of the most populated states in the MSM build on M ( 8 ) . The mean effective energies Δ Eα are also reported in Table I, along with the entropy contributions −TΔ Sα to the free energy Δ Gα. The basin centered around the target milestone associated with the folded state (id 1 in Table I) has a statistical weight π1 of about 76%, meaning that Beta3s spends about the 3/4 of the simulation time in this basin. As expected for a folded state, it has an energetic advantage (Δ E1 = − 0.6 kcal/mol) which compensates for its entropic disadvantage (−TΔ S1 = 0.8 kcal/mol). The other basins identified in the target set M ( 8 ) include helical and beta-curl type configurations. Interestingly, the helical basins (id 106, 241, and 264 in Table I) are entropically favored (ΔSα < 0 and ΔEα > 0) while the beta-curl states are energetically favored basins (id 164, 319, 473, and 843 in Table I) with ΔSα < 0 and ΔEα < 0. The statistical error on the mean effective energies ΔEα is estimated as s t d ( Δ E α ) / N α where stdEα) is a standard deviation and Nα is the number of times the system enters a target milestone Bα. The values of Nα are reported in Table I. The statistical errors on ΔEα are reported in Fig. S4 of the supplementary material34 along with the empirical probability density functions of the effective energy for each of the target milestones Bα. While these errors are large, the information contained in Δ Eα remains statistically significant.

TABLE I.

Beta3s mini-protein: Target milestones members of the target set M ( 8 ) , along with their thermodynamical parameters.

Markov state model
id Symbolic milestone πα (%) Nα ΔGα (kcal/mol) ΔEα (kcal/mol) TΔSα (kcal/mol)
000021000000210000  76  48  0.2  −0.6  0.8 
106  111111111111100000  8.2  42  1.6  4.9  −3.3 
164  001211001000210010  3.1  2.3  −5.5  7.8 
241  111111111111111111  3.1  37  2.3  4.7  −2.5 
264  010001111111111111  4.5  33  2.0  5.1  −3.0 
319  001111000011111000  2.3  19  2.5  −3.6  6.0 
473  001111000000200000  1.4  23  2.8  −2.2  4.9 
843  001111000111210010  1.4  2.8  −2.8  5.7 
Markov state model
id Symbolic milestone πα (%) Nα ΔGα (kcal/mol) ΔEα (kcal/mol) TΔSα (kcal/mol)
000021000000210000  76  48  0.2  −0.6  0.8 
106  111111111111100000  8.2  42  1.6  4.9  −3.3 
164  001211001000210010  3.1  2.3  −5.5  7.8 
241  111111111111111111  3.1  37  2.3  4.7  −2.5 
264  010001111111111111  4.5  33  2.0  5.1  −3.0 
319  001111000011111000  2.3  19  2.5  −3.6  6.0 
473  001111000000200000  1.4  23  2.8  −2.2  4.9 
843  001111000111210010  1.4  2.8  −2.8  5.7 

Fig. 11 shows the network of the MSM built on the target set M ( 8 ) . Notably, the folded state lies in between the helical region, which is entropically favored, and the beta-curl region, which is energetically favored. Except very rare direct connections between the helical and beta-curl regions, most of the transitions between the helix and beta-curl basins occur via the folded state. Thus, as was remarked elsewhere,43 the folded state is not only the region of the configurational space where energy and entropy compensate each other but it also plays the role of a dynamical hub in the network. This observation is confirmed by the high number of times the target milestone associated with the folded state is revisited (N1 = 48 is the highest values observed, see Nα column in Table I).

FIG. 11.

Network of the MSM for the Beta3s mini-protein constructed from the target set M ( 8 ) . The nodes on the left side of the network have a high entropy while those on the right side have low energy. The node involving the triple stranded folded state lies at the center of a large basin in which energy and entropy compensate each other. The thickness of the edges between pairs of target milestones on the network is proportional to the total number of transitions observed between these milestones: these numbers are also reported in the figure. Each node of the network is represented along with the ensemble of structures associated with the most populated trial milestone it contains.

FIG. 11.

Network of the MSM for the Beta3s mini-protein constructed from the target set M ( 8 ) . The nodes on the left side of the network have a high entropy while those on the right side have low energy. The node involving the triple stranded folded state lies at the center of a large basin in which energy and entropy compensate each other. The thickness of the edges between pairs of target milestones on the network is proportional to the total number of transitions observed between these milestones: these numbers are also reported in the figure. Each node of the network is represented along with the ensemble of structures associated with the most populated trial milestone it contains.

Close modal

The high entropy character of the helical states is also reflected by the high connectivity within the helical region (left side of the network shown in Fig. 11) and the comparable values of Nα with that of the folded state (N106 = 42, N241 = 37, and N264 = 33 in Table I). On the contrary, the beta-curl states (right side of the network shown in Fig. 11) are less connected (N164 = 9, N319 = 19, N473 = 23, and N843 = 5). This is consistent with the observation reported in Ref. 42 that these beta-curl states act as kinetic traps.

The probability distributions of FPT from the folded state to the other member of the target set and back are fairly well described by an exponential decay (see Fig. S2 and S3 in the supplementary material34). This indicates that the reduced dynamics of the Beta3s to a MSM built on the target set M ( 8 ) is approximatively Markovian. Interestingly, the emerging picture from the network representation in Fig. 11 is qualitatively consistent with the simplified kinetic network representation proposed in Ref. 41.

The committor probability qα(i) calculated for the target set M ( 8 ) allows one to project the configurational space onto a simplex. Three aggregated committor probabilities are defined as qfold = q1 for the folded state, qhelix = q106 + q241 + q264 for the helix state, and qtrap = q164 + q319 + q473 + q843 for the beta-curl state (see Table I for the state indexes). By definition of committor probability one has qfold + qhelix + qtrap = 1. Fig. 12 shows the simplex representation of the aggregated committor probabilities in a triangular plot. Fig. 12(a) shows the time series of the committor probabilities qfold(i(t)), qhelix(i(t)), qtrap(i(t)) projected onto a triangular plot. Interestingly, the conformers are clustered around the three vertexes of the triangular plot. The blue lines represent direct transitions between conformer and only 4 direct transitions link the helix channel to the trap channel. Fig. 12(b) shows a triangular scatter plot of the committor probabilities with the points colored according to their TSE index σ(i) defined in (18). Red points correspond to milestones belonging to transition state ensembles. For instance, the channel connecting the helix to the fold basin is composed by milestones Si such that qfoldqhelix ≈ 0.5. Similarly, along the channel Trap↔ Fold one has milestones Si with qfoldqtrap ≈ 0.5. The simplex representation in Fig. 12 corroborates the observation that the folded state acts as a kinetic hub,42 and it also suggests that Beta3s oscillates between two extreme conformational regions: a low energy one versus a high entropy one.

FIG. 12.

Beta3s mini-protein: (a) Simplex representation of the aggregated committors such that qfold = q1, qhelix = q106 + q241 + q264 and qtrap = q164 + q319 + q473 + q843 (these indices are the same as those used in Fig. 11). The continuous blue line represent the time series qfold(i(t)), qhelix(i(t)) and qtrap(i(t)). Only 4 transition events connect directly the helix and beta regions. (b) Same representation with nodes colored according 1 to the TS index σ(i) defined in (18).

FIG. 12.

Beta3s mini-protein: (a) Simplex representation of the aggregated committors such that qfold = q1, qhelix = q106 + q241 + q264 and qtrap = q164 + q319 + q473 + q843 (these indices are the same as those used in Fig. 11). The continuous blue line represent the time series qfold(i(t)), qhelix(i(t)) and qtrap(i(t)). Only 4 transition events connect directly the helix and beta regions. (b) Same representation with nodes colored according 1 to the TS index σ(i) defined in (18).

Close modal

Despite their growing popularity to analyze and interpret molecular dynamics (MD) simulations, Markov State Models (MSMs) remain somewhat tricky to use and justify. How to pick the states over which to build the MSM and how to assess its accuracy are nontrivial issues that are related to the degree to which the dynamical coarse-graining over these states preserves the Markovianity of the original system. Typically, states in MSMs are picked in a somewhat ad hoc fashion, and their quality is assessed a posteriori via Markovianity tests. In this paper, we propose a strategy that may help systematize the operation of MSM building. It is based on the observation that, in metastable systems, the states in an MSM should be hubs to which the trajectory has a high probability to return often, and between which it seldom transitions. These intuitive properties can be captured via a metastability index that measures how good a set of target milestones chosen among trial ones is: the smaller its metastability index, the better the target milestones it contains are as hubs. This metastability index is not a new concept: it was introduced in a closely related form in potential theoretic approaches to metastability. Our main goal in this paper was to show that this index can also be turned into a practical computational tool for the identification of good target milestones upon which to build accurate MSMs. As shown here, the procedure is simple to use, it does not require to make Markovian assumptions at the intermediate stages of the construction (in particular, the jumps between the trial milestones could be correlated and non-Markovian), and it has the added advantage that it allows one to cluster the system’s state space a posteriori and identify the transition state ensembles (TSEs) between the target milestones. Here, these features were illustrated on a collection of examples of increasing complexity. In particular, our last test case involving the Beta3s mini-protein, showed that the method can reveal interesting properties of the folding landscape of proteins, including the presence of kinetic traps and misfolded states on the way to the native state, and how these features affects the folding and unfolding pathways of the protein. We certainly hope that the technique will be similarly useful to analyze the dynamics of other proteins and macromolecules.

We thank Amedeo Caflisch for sharing the MD trajectories of the Beta3s peptide. We also thank Jianfeng Lu for helpful discussions. E.V.-E.’s research was supported in part by NSF Grant No. DMS-1522767.

In this appendix, a proof of (27) is given in case of a Markov process with generator Li,j and transition probability P i , j = L i , j / k i L i , k , ij, Pi,i = 0. To begin, recall that the MFPT τi,j from state i to state j is the solution to

k = 1 N L i , k τ k , j = 1 , for i j , and τ j , j = 0
(A1)

while the MRT is given by

τ i = ( j i L i , j ) 1 + k = 1 N P i , k τ k , i ,
(A2)

where the first term at the right hand-side accounts for the mean time to exit i to some state ji, and the second for the mean time to come back from any such state ji to i. We are interested to find an expression for the probabilities Γi,j as a function of the MFPT and MRT. Since Γ i , j = k P i , k q i , j ( k ) , we can equivalently express the committor function qi,j(k) in terms of MFPT and the MRT. We claim that

q i , j ( k ) = τ k , i τ k , j + τ i , j τ i , j + τ j , i .
(A3)

Let us check that this relation holds by verifying that the function qi,j(k) defined this way satisfies qi,j(i) = 0, qi,j(j) = 1 and l S L k , l q i , j ( l ) = 0 for k ∉ {i, j}. The two boundary conditions are trivially verified and the third condition is

l = 1 N L k , l q i , j ( l ) = 1 τ i , j + τ j , i k , l = 1 N L k , l ( τ l , i τ l , j + τ i , j ) = 0 ,
(A4)

where we have used (A1). We can now calculate Γi,j using (A3) and (25)

Γ i , j = k = 1 N P i , k q i , j ( k ) = ( k i L i , k ) 1 τ i , j + τ j , i k = 1 N L i , k ( τ k , i τ k , j + τ i , j ) = ( k i L i , k ) 1 τ i , j + τ j , i k = 1 N L i , k τ k , i + 1 = τ i τ i , j + τ j , i ,
(A5)

where we used (A1) and (A2). Thus, for any pair of states (i, j), the probability Γi,j can be computed from the MFPTs and MRTs only.

1.
C.
Schütte
,
A.
Fischer
,
W.
Huisinga
, and
P.
Deuflhard
,
J. Comput. Phys.
151
,
146
(
1999
).
2.
W. C.
Swope
,
J. W.
Pitera
, and
F.
Suits
,
J. Phys. Chem. B
108
,
6571
(
2004
).
3.
J. D.
Chodera
,
N.
Singhal
,
V. S.
Pande
,
K. A.
Dill
, and
W. C.
Swope
,
J. Chem. Phys.
126
,
155101
(
2007
).
4.
F.
Noé
and
S.
Fischer
,
Curr. Opin. Struct. Biol.
18
,
154
(
2008
).
5.
C.
Schütte
,
F.
Noé
,
J.
Lu
,
M.
Sarich
, and
E.
Vanden-Eijnden
,
J. Chem. Phys.
134
,
204105
(
2011
).
6.
C.
Schütte
and
M.
Sarich
,
Metastability and Markov State Models in Molecular Dynamics: Modeling, Analysis, Algorithmic Approaches
(
American Mathematical Society
,
2013
), Vol.
24
.
7.
G. R.
Bowman
,
V. S.
Pande
, and
F.
Noé
,
An Introduction to Markov State Models and their Application to Long Timescale Molecular Simulation
(
Springer Science & Business Media
,
2013
), Vol.
797
.
8.
D.
Shaw
,
P.
Maragakis
,
K.
Lindorff-Larsen
,
S.
Piana
,
R.
Dror
,
M.
Eastwood
,
J.
Bank
,
J.
Jumper
,
J.
Salmon
,
Y.
Shan
, and
W.
Wriggers
,
Science
330
,
341
(
2010
).
9.
J. E.
Stone
,
J. C.
Phillips
,
P. L.
Freddolino
,
D. J.
Hardy
,
L. G.
Trabuco
, and
K.
Schulten
,
J. Comput. Chem.
28
,
2618
(
2007
).
10.
V. A.
Voelz
,
G. R.
Bowman
,
K.
Beauchamp
, and
V. S.
Pande
,
J. Am. Chem. Soc.
132
,
1526
(
2010
).
11.
R. T.
McGibbon
,
C. R.
Schwantes
, and
V. S.
Pande
,
J. Phys. Chem. B
118
,
6475
(
2014
).
12.
M. M.
Sultan
,
G.
Kiss
,
D.
Shukla
, and
V. S.
Pande
,
J. Chem. Theory Comput.
10
,
5217
(
2014
).
13.
L.
Martini
,
A.
Kells
,
G.
Hummer
,
N.-V.
Buchete
, and
E.
Rosta
, e-print arXiv:1605.04328 [physics.chem-ph] (
2016
).
14.
N.
Buchete
and
G.
Hummer
,
J. Phys. Chem. B
112
,
6057
(
2008
).
15.
A. K.
Faradjian
and
R.
Elber
,
J. Chem. Phys.
120
,
10880
(
2004
).
16.
E.
Vanden-Eijnden
,
M.
Venturoli
,
G.
Ciccotti
, and
R.
Elber
,
J. Chem. Phys.
129
,
174102
(
2008
).
17.
E.
Vanden-Eijnden
and
M.
Venturoli
,
J. Chem. Phys.
130
,
194101
(
2009
).
18.
S. V.
Krivov
and
M.
Karplus
,
J. Phys. Chem. B
110
,
12689
(
2006
).
19.
M.
Sarich
,
F.
Noé
, and
C.
Schütte
,
Multiscale Model. Simul.
8
,
1154
(
2010
).
20.
K. A.
Beauchamp
,
G. R.
Bowman
,
T. J.
Lane
,
L.
Maibaum
,
I. S.
Haque
, and
V. S.
Pande
,
J. Chem. Theory Comput.
7
,
3412
(
2011
).
21.
A.
Vitalis
and
A.
Caflisch
,
J. Chem. Theory Comput.
8
,
1108
(
2012
).
22.
B.
Fac̆kovec
,
E.
Vanden-Eijnden
, and
D. J.
Wales
,
J. Chem. Phys.
143
,
044119
(
2015
).
23.
A.
Bovier
,
M.
Eckhoff
,
V.
Gayrard
, and
M.
Klein
,
Commun. Math. Phys.
228
,
219
(
2002
).
24.
A.
Bovier
,
M.
Eckhoff
,
V.
Gayrard
, and
M.
Klein
,
J. Eur. Math. Soc.
6
,
399
(
2004
).
25.
A.
Bovier
,
V.
Gayrard
, and
M.
Klein
,
J. Eur. Math. Soc.
7
,
69
(
2005
).
26.
W.
E
and
E.
Vanden-Eijnden
,
J. Stat. Phys.
123
,
503
(
2006
).
27.
P.
Metzner
,
C.
Schuette
, and
E.
Vanden-Eijnden
,
Multiscale Model. Simul.
7
,
1192
(
2009
).
28.
W.
E
and
E.
Vanden-Eijnden
,
Annu. Rev. Phys. Chem.
61
,
391
(
2010
).
29.
J.
Phillips
,
R.
Braun
,
W.
Wang
,
J.
Gumbart
,
E.
Tajkhorshid
,
E.
Villa
,
C.
Chipot
,
R.
Skeel
,
L.
Kale
, and
K.
Schulten
,
J. Comput. Chem.
26
,
1781
(
2005
).
30.
P. E.
Smith
,
J. Chem. Phys.
111
,
5568
(
1999
).
31.
M.
Iwaoka
,
M.
Okada
, and
S.
Tomoda
,
J. Mol. Struct.
586
,
111
(
2002
).
32.
L.
Ding
,
K.
Chen
,
P.
Santini
,
Z.
Shi
, and
N.
Kallenbach
,
J. Am. Chem. Soc.
125
,
8092
(
2003
).
33.
C.
Weise
and
J.
Weisshaar
,
J. Phys. Chem. B
107
,
3265
(
2003
).
34.
See supplementary material at http://dx.doi.org/10.1063/1.4954769 for that include the first passage times distributions for the gag peptide and beta3s mini-protein, and the statistical errors on the effective energies of the beta3s mini protein.
35.
K.
Chen
,
Z.
Liu
, and
N.
Kallenbach
,
Proc. Natl. Acad. Sci. U. S. A.
101
,
15352
(
2004
).
36.
P. G.
Bolhuis
,
C.
Dellago
, and
D.
Chandler
,
Proc. Natl. Acad. Sci. U. S. A.
97
,
5877
(
2000
).
37.
C. A. F.
de Oliveira
,
D.
Hamelberg
, and
J. A.
McCammon
,
J. Chem. Phys.
127
,
175105
(
2007
).
38.
E.
De Alba
,
J.
Santoro
,
M.
Rico
, and
M. A.
Jiménez
,
Protein Sci.
8
,
854
(
1999
).
39.
P.
Ferrara
and
A.
Caflisch
,
Proc. Natl. Acad. Sci. U. S. A.
97
,
10780
(
2000
).
40.
F.
Rao
and
A.
Caflisch
,
J. Mol. Biol.
342
,
299
(
2004
).
41.
S. V.
Krivov
,
S.
Muff
,
A.
Caflisch
, and
M.
Karplus
,
J. Phys. Chem. B
112
,
8701
(
2008
).
42.
S.
Muff
and
A.
Caflisch
,
Proteins
70
,
1185
(
2008
).
43.
G. R.
Bowman
and
V. S.
Pande
,
Proc. Natl. Acad. Sci. U. S. A.
107
,
10890
(
2010
).

Supplementary Material