A method is proposed to identify target states that optimize a metastability index amongst a set of trial states and use these target states as milestones (or core sets) to build Markov State Models (MSMs). If the optimized metastability index is small, this automatically guarantees the accuracy of the MSM, in the sense that the transitions between the target milestones is indeed approximately Markovian. The method is simple to implement and use, it does not require that the dynamics on the trial milestones be Markovian, and it also offers the possibility to partition the system’s state-space by assigning every trial milestone to the target milestones it is most likely to visit next and to identify transition state regions. Here the method is tested on the Gly-Ala-Gly peptide, where it is shown to correctly identify the expected metastable states in the dihedral angle space of the molecule without *a priori* information about these states. It is also applied to analyze the folding landscape of the Beta3s mini-protein, where it is shown to identify the folded basin as a connecting hub between an helix-rich region, which is entropically stabilized, and a beta-rich region, which is energetically stabilized and acts as a kinetic trap.

## I. INTRODUCTION

Markov State Models (MSMs) have become an integral part of the toolbox used to analyze the output of molecular dynamics (MD) simulations of complex systems such as proteins and other large biomolecules.^{1–7} They were developed in response to the need to process ever longer MD time series data, made either of one long trajectory or very many shorter ones, generated, e.g., by special-purpose high-performance computers,^{8} high-performance GPUs,^{9} or massively parallel simulations.^{10} The basic idea of MSMs is to represent the original dynamics as memoryless jumps between predefined states in the configuration space of the molecular system. Under this Markovian assumption, MD time series data can then be processed via inference techniques such as maximum likelihood estimation to calculate the rate matrix between these states. This matrix defines a Markov jump process (MJP), which in turns permits the calculation of interesting kinetic quantities of the system on time scales that may be larger than those reached in the MD simulations: indeed MSMs also permit to recombine short simulations run in parallel to extract long time information about the system.

A recurrent issue in the context of MSMs is how to pick the states on which to map the original dynamics—see Refs. 11–13 for some recent works in this direction. Indeed this mapping amounts to a drastic coarse-graining of the dynamics, and the jumps between poorly chosen coarse states will not be Markovian in general. This invalidates the basic assumption of MSMs and affects their reliability and accuracy. Fortunately, MD systems typically display metastability and this offers a way around this difficulty. In metastable systems, there exist regions that play the role of hubs: after visiting one such hub, the system returns often to it before making a transition to another. This guarantees that transitions between these hubs are indeed approximately Markovian, and metastability has therefore been invoked as the key property to justify MSMs and assess their accuracy (see, e.g., Refs. 6 and 7 for modern perspectives on the topic that summarize this viewpoint). What remains mostly open, however, is how to identify these hubs in practice.

In the present paper, we aim at addressing this question in the context of milestoning-based MSMs^{5} that combine the core set method originally introduced in Ref. 14 with the milestoning of the trajectories developed in Refs. 15–17. Unlike standard MSMs that are based on a full partition of the configuration space of the system into blocks that are used as states,^{3,18–22} milestoning-based MSMs use non-adjacent core sets as states (the milestones), and assign the MD time series to the index of the last such milestone it visited. This maps the original dynamics onto a symbolic dynamics on these indices that are then used as input to build the MSM by maximum likelihood estimation of its transition matrix. In metastable systems, the proper milestones to use should be the hubs mentioned before. Here we propose to identify these hubs among a set of trial milestones via optimization of a metastability index that measures how good the hubs are. The method can be justified within the framework of the potential theoretic approach to metastability developed by Bovier and collaborators.^{23–25} It has the advantage that it can be used even in situations where the dynamics on the trial milestones is non-Markovian. In this sense, it alleviates a difficulty with the standard approach used to build MSMs via clustering of trial states:^{7} This clustering is typically done using spectral analysis of the rate matrix of the chain built on these trial states, which may lead to artifacts since the dynamics on these trial states is non-Markovian in general. The method we propose avoids this difficulty altogether. In addition it avoids the need to introduce a time-lag to process the data, which may be difficult to adjust. As we will see, our method also offers a way to partition the state space of the system by identifying regions made of configurations most likely to reach a given hub, and to identify the members of the transition state ensemble as those trial milestones that have a non-negligible probability to reach more than one target milestone next.

The remainder of this paper is organized as follows. In Sec. II we start by presenting the algorithmic aspects of the method we propose, including how to define the trial milestones (Sec. II A) and the metastability index (Sec. II B), how to identify the target milestones that optimize this metastability index (Sec. II C), and how to build the MSM on these target milestones (Sec. II D). We also introduce a variety of diagnostic tests that can be used *a posteriori* to analyze the output of the MSM and use the trial milestones to get additional information about the system’s dynamics (Sec. II E). A theoretical justification of this algorithm is then given in Sec. III, first in the context of Markov jump processes, which is relevant, e.g., if one assumes that the dynamics on the trial milestones is itself Markovian (Sec. III A), then in the context of systems whose configurational space is continuous, like those encountered in MD simulations, where we cannot expect the dynamics on the trial milestones to be Markovian (Sec. III B). We also test the method on a simple one-dimensional example with a multiscale energy landscape (Sec. III C). In Sec. IV we then apply the method to analyze the dynamics of a Gly-Ala-Gly peptide, and in Sec. V we use it to analyze the folding pathways of the Beta3s mini-protein. Concluding remarks are given in Sec. VI. Some technical derivations are relegated to the Appendix.

## II. ALGORITHMIC ASPECTS

In this section, we outline the algorithm we propose to identify the milestones (or core sets) over which to build an MSM. In a nutshell, this is done by picking among a set of trial milestones a subset of target milestones which minimizes a metastability index—this index is defined so that small values are indicative of Markovianity. The target milestones are taken as states in the MSM, while the trial milestones are used to complement the predictions of this MSM and, e.g., partition the system’s configurational space or identify transition state regions in it.

### A. Trial versus target milestones

Denote by ** x**(

*t*) ∈ ℝ

^{3n}a trajectory containing the instantaneous position of the

*n*atoms in a molecular system. We assume that we have generated one or several such trajectories and our goal is to build a MSM that captures their main features. To this end, we introduce a set of

*N*trial milestones, which we will denote by

*S*

_{1},

*S*

_{2}, …,

*S*and label them by their index, i.e.,

_{N}*i*identifies

*S*. These milestones are disjoint sets in the system’s configuration space that can be defined, e.g., by requiring that some of the dihedral angles of the molecules take values between specific bounds—how to actually choose the trial milestones

_{i}*S*will be illustrated below on specific examples. In the spirit of milestoning, we then map each trajectory

_{i}**(**

*x**t*) onto the index of the last trial milestone

*S*it hit, see Fig. 1 for an illustration. This way we obtain a piecewise constant index function

_{i}*i*

_{∗}(

*t*) whose value at time

*t*is the index of the last milestone hit by

**(**

*x**t*). Note in this procedure we discount recrossings: we only update the index function when a new milestone is hit.

We stress that at this stage we do not assume that the dynamics of the index function *i*_{∗}(*t*) on the trial milestones is Markov—in general it will not be. What we would like to do next is extract from the set of *N* trial milestones a subset of *M* ≤ *N* target milestones such that if we map the trajectory ** x**(

*t*) onto this subset of target milestones, the corresponding index function will be approximately Markovian—these target milestones are shown in red in the cartoon shown in Fig. 2. In the sequel, we will denote by $M= { i 1 , i 2 , \u2026 , i M } \u2282 { 1 , 2 , \u2026 , N} $ the set of indices identifying the target milestones, i.e., these are {

*S*

_{i1},

*S*

_{i2}, …,

*S*

_{iM}}. We will also refer to trial milestones that are not target ones as non-target milestones.

### B. Metastability index

How should the target milestones be chosen? Intuitively, they should be such that they are hubs among the trial milestones towards which the trajectory is attracted but between which it seldom makes transitions, as illustrated in Fig. 2. To make this idea concrete, let us first introduce the probability Γ_{i,j} that, if the trajectory hits milestone *S _{i}*, then subsequently it will hit

*S*with

_{j}*j*≠

*i*before hitting

*S*again (discounting recrossings: recall that only hits of different milestones are counted—in other words, to hit

_{i}*S*again, the trajectory must have hit at least one other milestone in between). To estimate Γ

_{i}_{i,j}, out of each of the piecewise constant index function

*i*

_{∗}(

*t*), we first extract the sequence {

*i*

_{1},

*i*

_{2},

*i*

_{3}, …} of successive values that this function takes—for example, for the time series illustrated in Fig. 1 this sequence starts with

*i*

_{1}= 1,

*i*

_{2}= 2,

*i*

_{3}= 1,

*i*

_{4}= 3, etc. We then cut this sequence into the

*N*pieces which start from

_{i}*i*and contain all the indices visited after

*i*before

*i*appears again. For example, if the sequence is made of 3 indices,

*i*,

*j*,

*k*, and reads

we cut it into three pieces (*N _{i}* = 3)

Finally we count the number of pieces *N*_{i,j} in which *j* appears at least once (in the example above *N*_{i,j} = 2 since *j* appears in the last two pieces but not in the first), we add up these numbers coming from every piece of *i*_{∗}(*t*) at our disposal, and we set

as the estimator for Γ_{i,j}. Note that the quality of this estimator depends on the lengths of the pieces of time series ** x**(

*t*) that we have at our disposal, and how to assess the statistical accuracy of (3) is nontrivial. As usual, we cannot expect the estimator to be accurate if these pieces are too short to observe all the relevant events in the dynamics of the system: In the context of (3) this requires that these pieces be long enough that trajectories starting at a non-target milestone have time to reach a target one. Indeed, as we will see below, this condition is sufficient to guarantee that we will correctly identify target from non-target milestones among the trial ones.

We will now use the matrix with entries Γ_{i,j} to quantify how good a set of target milestones will be to build an MSM. Specifically, given a candidate $M= { i 1 , i 2 , \u2026 , i M } $ identifying the target milestones, we estimate the quality of these milestones via their metastability index defined as

The smaller $ \rho M $, the better the set of target milestones identified by $M= { i 1 , i 2 , \u2026 , i M } $. This claim will be justified in Sec. III by connecting $ \rho M $ to a quantity originally introduced by Bovier, but let us briefly explain here why it is true. The numerator in (4),

identifies the target milestone *S _{i}* for which the probability is the highest that, after hitting

*S*, the trajectory will hit some other target milestone

_{i}*S*before hitting

_{j}*S*again. In this sense,

_{i}*S*is the worst target milestone in the set since we would like that transitions between these target milestones be unlikely, and the smaller (5) the better. Correspondingly, the denominator in (4),

_{i} identifies the non-target milestone *S _{i}* which is such that the trajectory has the lowest probability to hit a target milestone

*S*before hitting

_{j}*S*again. In this sense,

_{i}*S*is the worst non-target milestone since we would like that transitions from non-target to target milestones be likely, and the larger (6) the better. The metastability index $ \rho M $ in (4) accounts for both the desiderata that (5) be small and (6) be large, and in this sense it measures the quality of the set of target milestones identified by $M= { i 1 , i 2 , \u2026 , i M } $.

_{i}### C. Target milestones identification

Since good sets of target milestones are those whose metastability index $ \rho M $ is small, we can systematically search for such good sets by minimizing $ \rho M $. In principle, this can be done by considering increasing values *M* = 2, *M* = 3, etc. of the cardinal of $M$, and for each compute $ \rho M $ for every choice of $M= { i 1 , i 2 , \u2026 , i M } $ so as to identify the one with minimum $ \rho M $. Any choice $M$ for which $ \rho M $ is small leads to a good set of target milestones. Note however that $ \rho M $ can be small for different values of *M*, and so several of them should be considered—this effect will be illustrated in the examples below. Of course, if the number *N* of trial milestones is large, this direct search strategy will quickly become inefficient as *M* increases since the number of sets $M$ to consider for each *M* is

To avoid this difficulty, we must adopt more efficient optimization strategies, for example using Monte Carlo schemes. We have used such schemes in the examples below. However, we found that the following searching strategy was typically the most efficient.

Given Γ_{i,j}, we can identify the index *j*^{†}(*i*) of the milestone that the trajectory is most likely to hit after hitting *i* as the one that maximizes Γ_{i,j} over all *j* ≠ *i*, i.e.,

The function *j*^{†}(*i*) can be used to define ‘trajectories’ in index space: given *i*, we update it to *j*^{†}(*i*), then to *j*^{†}(*j*^{†}(*i*)), etc. Correspondingly, given a set $M= { i 1 , i 2 , \u2026 , i M } $ we can update this set by updating every entry in it, and only keeping the ones that remain different (for example we could have *j*^{†}(*i*_{1}) = *j*^{†}(*i*_{2}) in which case only one of these entries is kept in the update of $M$). Because this updating identifies set of indices of milestones that are likely to be hit, these sets should have small metastability index $ \rho M $, and this is indeed what we observed in practice. Specifically, we took random sets of indices $M= { i 1 , i 2 , \u2026 , i M } $ with random values of *M* and updated them as described above while monitoring the metastability index $ \rho M $ of these updated sets. We observed that this metastability index typically diminishes before starting to oscillates in a periodic fashion (this is because the update has no fixed point, *j*^{†}(*i*) ≠ *i* by construction). When this happened we stopped the update, kept the updated set $M= { i 1 , i 2 , \u2026 , i M } $ with smallest $ \rho M $, and restarted the procedure with a different random set of indices $M= { i 1 , i 2 , \u2026 , i M } $. After a few such iterations, we typically got a few different sets $M= { i 1 , i 2 , \u2026 , i M } $ (with different *M*) with small $ \rho M $.

### D. MSM building on target milestones

Once we have identified a good set of target milestones specified by the index set $M= { i 1 , i 2 , \u2026 , i M } $, we can build an MSM using these milestones as states. How to do so was explained in Refs. 5 and 17, so let us be brief here and refer the reader to the original papers for details. For the sake of clarity, in the sequel it will be convenient to distinguish between target and non-target milestones: we will do so by using greek letters *α*, *β*, … to refer to the indices in the index set $M$, and *B*_{α}, *B*_{β}, … to refer to the target milestones *S*_{i1}, *S*_{i2}, ....

Similar to what was done before, we can map the trajectory ** x**(

*t*) onto the index of the last target milestone it hit. This defines a piecewise constant function

*α*

_{∗}(

*t*) taking values in $M$. Because we are now using target milestones instead of trial ones, unlike

*i*

_{∗}(

*t*),

*α*

_{∗}(

*t*) should be approximately Markov. This means that the evolution of this function can be completely specified by a rate matrix with entries

*k*

_{α,β}: to leading order in

*δt*≪ 1,

*k*

_{α,β}

*δt*gives the probability that

*α*

_{∗}(

*t*) jumps from the value

*α*to the value

*β*≠

*α*in the interval [

*t*,

*t*+

*δt*) (i.e., that

**(**

*x**t*) hits the target milestone

*B*

_{β}in that time interval if the last target milestone it hit before time

*t*was

*B*

_{α}). In particular, if

*p*

_{α}(

*t*) denotes the probability distribution that

*α*

_{∗}(

*t*) takes the value

*α*at time

*t*, then

*p*

_{α}(

*t*) satisfies the master equation

Similarly, we can write down equations for the distribution of first passage time from target milestone *B*_{α} to target milestone *B*_{β}, its mean, etc. in terms of *k*_{α,β}—see, e.g., Refs. 5 and 17.

The rate matrix entries *k*_{α,β} can be estimated from the time series ** x**(

*t*) by the method of maximum likelihood. If we denote by

*T*

_{α}the total time the last target milestone hit by

**(**

*x**t*) is

*B*

_{α}and by

*N*

_{α,β}the number of times the target milestone

*B*

_{β}was hit directly after

*B*

_{α}along this time series, the maximum likelihood estimator for

*k*

_{α,β}is

This estimator is unbiased in the sense that, if *α*_{∗}(*t*) is indeed Markov with rate matrix entries *k*_{α,β} and the length of the time series tends to infinity, then $ k \u02c6 \alpha , \beta \u2192 k \alpha , \beta $ in this limit. If the length of the time series is finite, the statistical errors on $ k \u02c6 \alpha , \beta $ can be estimated by Bayesian sampling, see Ref. 5 for detail. Another source of errors are those due to residual non-Markovian effects in *α*_{∗}(*t*). The smaller $ \rho M $, the smaller these non-Markovian effects are, as discussed in Sec. III. In practice they can also be estimated via Markovianity tests, as illustrated on examples in Secs. IV and V. Notice also that if ** x**(

*t*) satisfies detailed balance, then we should have that

*N*

_{α,β}/

*N*

_{β,α}→ 1 as the length of the time series goes to infinity. When this length is finite, however,

*N*

_{α,β}≠

*N*

_{β,α}in general, and to enforce detailed balance of the MSM, it is better to use the following symmetrized estimator for the rate matrix entries:

With this choice, the equilibrium distribution of the MSM, i.e., the distribution $ \pi \u02c6 \alpha $ towards which the solution to (9) with *k*_{α,β} replaced by its estimator $ k \u02c6 \alpha , \beta $ converges as *t* → ∞, is

where $T= \u2211 \alpha T \alpha $ is the total length of the time series used to estimate $ k \u02c6 \alpha , \beta $. The distribution $ \pi \u02c6 \alpha $ gives the proportion of time during which *B*_{α} was the last milestone hit by ** x**(

*t*) and as the length of the time series increases,

*T*→ ∞, $ \pi \u02c6 \alpha $ converges to the true equilibrium distribution of milestone

*B*

_{α}—explicit expressions for this distribution in the context of a system whose dynamics is governed by a Markov jump process or a Langevin equation will be given in Secs. III A and III B, respectively. We can use $ \pi \u02c6 \alpha $ to estimate the free energy of milestone

*B*

_{α}

where *β* denotes the reciprocal of the thermodynamic temperature of the system.

### E. State-space partitioning, transition state ensemble identification, and other diagnostic tools

Even though the trial milestones merely serve as an intermediary to construct the actual MSM, these milestones can still be used to analyze the MD data and partition the state space in ways that highlight important features of its dynamics.

First we can organize the trial milestones onto a network: if *N*_{i,j} denotes the number of times milestone *S _{j}* was hit directly after

*S*, we put an edge between node

_{i}*i*and node

*j*with weight

where *T* is the total length of the time series for ** x**(

*t*). Note that, consistent with the detailed balance condition, we have symmetrized the weight

*d*

_{i,j}, i.e., the network is undirected,

*d*

_{i,j}=

*d*

_{j,i}.

Next we can compute the free energy of the trial milestones. If $ T i = \u222b 0 T \delta i \u2217 ( t ) , i dt$ denotes the total time that the last milestone hit was *S _{i}* (so that $ \u2211 i T i =T$, the total length of the time series), we can define the probability distribution

The distribution $ p \u02c6 i $ gives the proportion of time during which *S _{i}* was the last milestone hit by

**(**

*x**t*) (see Secs. III A and III B for its expression in the limit as

*T*→ ∞ when the system’s dynamics is governed by a Markov jump process or a Langevin equation, respectively), and we can estimate the free energy of the trial milestone

*S*via

_{i}The free energy $\Delta A \u02c6 i $ defines a landscape on the network of trial milestones. One may expect that the nodes associated with the target milestones will be close to the local minima of $\Delta A \u02c6 i $; however, we stress that this need not be the case, since the dynamics on the full network (rather than the one restricted to the target milestones alone) can be quite complicated (in particular, not Markovian).

To partition the state space, it is better to introduce the committor functions of the trial milestone *S _{i}* with respect to the target milestone

*B*

_{α},

*q*

_{α}(

*i*). By definition,

*q*

_{α}(

*i*) gives probability that, after hitting

*S*, the trajectory will hit the target milestone

_{i}*B*

_{α}before hitting any other target milestones, and it can be estimated from the time series as

where $ N i = \u2211 \alpha N i , \alpha $ (so that $ \u2211 \alpha q \u02c6 \alpha ( i ) =1$). The committor functions can be used to do a (soft) partitioning of the network into basins of nodes that are more likely to be attracted next to one target milestone rather than any other: for example, those nodes *i* such that *q*_{α}(*i*) is close to 1 are associated with milestones *S _{i}* out of which the trajectory is very likely to hit

*B*

_{α}next. A hard partitioning can also be performed by assigning

*i*to $\alpha ( i ) = argmax \alpha \u2208 M q \alpha ( i ) $.

The committor probability *q*_{α}(*i*) whose estimator is given in (17) can also be used to identify the transition state ensemble (TSE), i.e., the trial milestones that lie in between the states in the target set $M$. If the target set contains only two states, $M= { \alpha , \beta } $, it follows that all trial milestones *i* satisfy *q*_{α}(*i*) = 1 − *q*_{β}(*i*) and the TSE is such that $ q \alpha ( i ) \u2248 1 2 $. For target sets with more than 2 states the $ 1 2 $ criterium may become less effective as the TSE as it can in general connect multiple states. To get around this difficulty, we can introduce a TSE index based on the entropy of the committor probability *q*_{α}(*i*). Recalling that $ \u2211 \alpha \u2208 M q \alpha ( i ) =1$ for any milestone *S _{i}* (i.e.,

*q*

_{α}(

*i*) is a probability distribution in

*α*), we propose to use the normalized entropy (sometimes also referred to as efficiency) of

*q*

_{α}(

*i*) as TSE index,

where the sum is carried out over the non-zero entries of *q*_{α}(*i*) and *n*(*i*) is the number of such entries (more generally, we could restrict the sum to the entries of *q*_{α}(*i*) that are above some small threshold *δ*). The TSE index *σ*(*i*) is 1 if the non zero values of *q*_{α}(*i*) are all identical, and it is 0 if only one value of *q*_{α}(*i*) is different than zero. Therefore, the closer *σ*(*i*) is to 1, the higher the chance that state *i* be a member of the transition state ensemble. Once the states in the TSE have been identified, we can go back to their committor values to determine between which target states they lay.

To characterize the physical origin of the metastability of the target milestones, it is also useful to decompose their free energy into an energetic component and an entropic one. This can be done as follows. Given the system’s potential energy *U*(** x**), a mean energy can be assigned to each of the target milestones via

where *T*_{α} the total time the last target milestone hit by ** x**(

*t*) is

*B*

_{α}and

so that $ \u2211 \alpha \Delta E \u02c6 \alpha \pi \u02c6 \alpha =0$. We can then estimate the entropy $\Delta S \u02c6 \alpha $ of the target milestone via

where $\Delta G \u02c6 \alpha $ is the free energy estimated in (13) and *k _{B}* is Boltzmann constant. Target milestones with comparable

*Δ G*

_{α}have similar statistical weights, and the lower

*Δ G*

_{α}the more thermodynamically stable they are (meaning the time series

**(**

*x**t*) tends to return to them more often, comparatively): by comparing their values of $\Delta E \u02c6 \alpha $ and $\Delta S \u02c6 \alpha $ we can then determine whether their stability is of energetic or entropic origin, respectively.

## III. THEORETICAL JUSTIFICATION

Let us now justify the use of the metastability index defined in (4) to identify target milestones over which the dynamics can be mapped in an approximately Markovian way.

### A. The case of Markov chains

We begin by discussing the simpler case when the dynamics of *i*_{∗}(*t*) on the trial milestones is itself Markovian—as we mentioned in Sec. II, we do not make this assumption in general, but it is a convenient starting point for our theoretical explanation. The general situation when *i*_{∗}(*t*) is not Markovian will be discussed in Sec. III B.

If *i*_{∗}(*t*) is Markov, its dynamics is specified by a rate matrix *L* whose entries we will denote by *L*_{i,j} to distinguish them from the rate matrix entries *k*_{α,β} on the target milestones. Assuming detailed balance, *L*_{i,j} satisfies

where *N* is the number of trial milestones and *p _{i}* is their equilibrium probability density—the estimator for

*p*was given in (15). Together with the assumption of ergodicity (i.e., that time averages along

_{i}*i*

_{∗}(

*t*) converge towards ensemble averages over

*p*), (22) implies that all the

_{i}*N*eigenvalues of

*L*

_{i,j}are real, with one being 0 and all the other negative. We will denote these eigenvalues by

*λ*, and order them so that 0 =

_{i}*λ*

_{0}≤ |

*λ*

_{1}| ≤ ⋯ ≤ |

*λ*

_{N−1}|.

The eigenvalues of *L* permit to give a precise definition of what it means for the chain to display metastability. In turns this indicates how to coarse-grain this chain, which in our context means how to chose good target milestones. Specifically, a chain will be metastable if its eigenvalues can be separated into two well separated groups, i.e., if there exists an *M* < *N* such that

If such a separation exists, it means that the *M* eigenvalues with index smaller than *M* describe relaxation processes in the chain that occur on timescales that are much slower than those described by the *N* − *M* eigenvalues with index larger or equal to *M*. In turn, this implies that these slow processes can be approximately described by a smaller chain with only *M* states. The practical questions then become: (i) how to assess whether (23) is satisfied for some *M* without having to compute the full spectrum of *L* (since this computation is hard in general), and (ii) how to reduce the dynamics to a chain with only *M* states?

In Ref. 23, Bovier and collaborators addressed these two questions, and the answers they provided are the basis for the algorithm we presented in Sec. II. First they proved that (23) holds if and only if an indicator closely related to the metastability index $ \rho M $ in (4) is small for some $M= { \alpha 1 , \alpha 2 , \u2026 , \alpha M } $. If that is the case, the ratio in (23) is in fact proportional to $ \rho M 2 $. Second, they showed how to reduce the chain onto a smaller chain involving only the nodes identified by $M$ (i.e., in our context, involving only the target milestones) in such a way that the *M* eigenvalues of this reduced chain be close to the *M* first eigenvalues of the original chain (where closeness can again be measured in terms of $ \rho M $).

It should be stressed that (23) can be satisfied with more than one value of *M*. This simply means that there can be more than one low-lying group of eigenvalues. In turns this implies that there can be more than one choice of good target milestones.

For completeness, let us end this section by giving explicit expressions for some of the quantities that were defined in Sec. II in the context of a system whose dynamics is governed by an MJP with generator *L*. First, the probability Γ_{i,j} whose estimator was given in (3) can be expressed as

Here *P*_{i,j} (not to be confused with the entries of the transfer operator *T*(*τ*) = *e*^{Lτ}, *τ* > 0) are the entries of the transition matrix defined as

which gives the probability that the state first visited by the chain after *i* is *j* ≠ *i*; and *q*_{i,j}(*k*) is the committor probability solution of

The probabilities Γ_{i,j} can also be conveniently calculated in terms of mean recurrence times (MRTs) and mean first passage times (MFPTs) via the formula

where *τ _{i}* is the MRT of the state

*i*and

*τ*

_{i,j}is the MFPT to go from state

*i*to state

*j*. A proof of relation (27) can be found in the Appendix of this paper. In addition, the committor probability

*q*

_{α}(

*i*) whose estimator was given in (17) solves an equation similar to (26), with different boundary conditions,

We can relate the distributions *π*_{α} and *p _{i}*, whose estimators were given in (12) and (15) respectively, as

which also means that the corresponding free energies whose estimators were given in (13) and (16) respectively, are related as

Finally, the rate matrix entries on the target milestone, *k*_{α,β}, whose estimator was given in (11) can be expressed as

These formulas can be justified within the framework of transition path theory (TPT)^{26–28} and are useful for analysis. However we stress that we do not need to solve any of the equations above to apply the procedure outlined in Sec. II. Indeed, this procedure can be used with the estimators given in that section, which only require the time series ** x**(

*t*) as input.

### B. Generalization to continuous state-spaces

In Refs. 24 and 25, the results of Ref. 23 were generalized to situations where the Markovian dynamics takes place on a continuous state-space, like ** x**(

*t*) does (or more generally the pair (

**(**

*x**t*),

**(**

*p**t*)) if the momentum

**(**

*p**t*) needs to be added to make the description Markovian, as in (32) below). The main technical difficulty in that case is that the metastability index $ \rho M $ defined in (4) measures the probability to go to a state after leaving another one, rather than returning to that state. The problem is that, in the continuous state-space setting, these states cannot be identified with points in the state space, since the probability to hit a point is zero as soon as the space dimension is higher than 1.

To get around this difficulty, it was shown in Refs. 24 and 25 that one can redefine states by fattening any specific point into a little domain that contains it, so that a metastability index $ \rho M $ can again be defined as in (4) and allows to identify low-lying groups of eigenvalues when they exist. This fattening procedure is similar to that of defining trial milestones: they are indeed regions containing specific points in the original state-space of the system, which provides a theoretical justification to the algorithm proposed in Sec. III. It is important to note, however, that Refs. 24 and 25 only gave prescriptions on how to perform this fattening (i.e., how to define trial milestones) in very specific (and simple) cases, like that of an system governed by overdamped dynamics in the limit of very small temperature. We are obviously interested in more complicated situations here, in which case it is no *a priori* obvious how to define the trial milestones. A few procedures to do so will be discussed in Secs. IV and V: these procedures are by no means the only ones one could envision, but they proved sufficient in these examples and should be transportable to other ones. Note that this also means that one should verify *a posteriori* that the dynamics on the target milestones identified by the procedure is indeed approximately Markovian. As usual, this can be done by checking that the first passage time between target milestones adjacent on the network of the MSM are exponentially distributed. In the examples we treated below, this turned out to be the case, indicating that the MSMs we constructed were indeed accurate.

Assuming that we have picked trial milestones and identified target ones, for completeness let us give explicit formulas for some of the quantities introduced in Sec. II in the context of a system whose dynamics is governed by the Langevin equation

where *U*(** x**) is the potential energy of the system,

*m*the mass matrix,

*γ*the friction tensor, and

**(**

*η**t*) is a white-noise satisfying 〈

**(**

*η**t*)〉 = 0, 〈

**(**

*η**t*)

*η*^{T}(

*s*)〉 = Id

*δ*(

*t*−

*s*)—other choices of thermostats are possible, and the formula below can be adapted to those straightforwardly. System (32) is ergodic with respect to Boltzmann-Gibbs probability density function

where $H ( x , p ) = 1 2 p T m \u2212 1 p+U ( x ) $ is the Hamiltonian and *Z* = ∫_{Ω×ℝ3n}*e*^{−βH(x,p)}*d**x**d*** p**. The expression for the probability Γ

_{i,j}whose estimator was given in (3) is quite complicated if, unlike what we did in Sec. III A, we do not assume that the dynamics on the trial milestones is Markovian. We can, however, give explicit expression for the rate matrix entries

*k*

_{α,β}, the distribution

*π*

_{α}, and the free energy

*G*

_{α}whose estimators were given in (11)–(13), respectively. These expressions were derived in Ref. 5 and they involve the committor function

*Q*

_{α}≡

*Q*

_{α}(

**,**

*x***) solution of**

*p* with the boundary condition *Q*_{α}(** x**,

**) = 1 if**

*p***∈ ∂**

*x**S*

_{α}and $ n \u02c6 \alpha ( x ) \u22c5p>0$ and

*Q*

_{α}(

**,**

*x***) = 0 if $x\u2208 \u222a \beta \u2208 M \u2216 \alpha \u2202 S \beta $ and $ n \u02c6 \beta ( x ) \u22c5p>0$, where $ n \u02c6 \alpha ( x ) $ denotes the unit normal vector pointing outward ∂**

*p**S*

_{α}at point

**∈ ∂**

*x**S*

_{α}and similarly for $ n \u02c6 \beta ( x ) $. The committor function

*Q*

_{α}(

**,**

*x***) gives the probability that the trajectory initiated at (**

*p***,**

*x***) reaches**

*p**S*

_{α}before any

*S*

_{β}with

*β*≠

*α*; using the invariance of the dynamics under

*t*→ −

*t*and

**→ −**

*p***,**

*p**Q*

_{α}(

**, −**

*x***) also gives the probability that the trajectory arriving at (**

*p***,**

*x***) was last in**

*p**S*

_{α}rather than in any

*S*

_{β}with

*β*≠

*α*. We then have

so that *Δ G*_{α} = *β*^{−1}ln*π*_{α}. Also

### C. Illustrative example

It is useful to illustrate the results of this section on a simple example. Specifically, we consider the motion of a particle by overdamped Langevin dynamics on the one-dimensional potential energy depicted in Fig. 3(a). The governing equation is

where *U*′(*x*) denotes the derivative of potential energy *U*(*x*), *γ* is the friction coefficient (which we will set to *γ* = 1) and *η*(*t*) a white-noise such that 〈*η*(*t*)〉 = 0 and 〈*η*(*t*) *η*(*t*′)〉 = *δ*(*t* − *t*′). Typical trajectories solution of (37) at three different temperatures are shown in Fig. 3(b).

As can been seen in Fig. 3(a), the potential has a hierarchical structure with a total of 27 local minima separated by barriers of various heights. Correspondingly, the generator of the process governed by (37), namely, the operator

has a spectrum with several groups of low-lying eigenvalues. This spectrum was obtained by spatial-discretization of (38) on a grid of 200 points, and the ratio of successive eigenvalues are shown in Fig. 4(a) at several different temperatures. Small ratios identify low-lying groups of eigenvalues, and several of them can be seen: *λ*_{1}/*λ*_{2}, *λ*_{2}/*λ*_{3}, *λ*_{8}/*λ*_{9}, and finally *λ*_{26}/*λ*_{27} are all small at temperatures ranging from *β*^{−1} = 0.5 to *β*^{−1} = 1.0. The eigenvalues involved in these groups describe processes arising on slow time scales that can be organized as follows: *λ*_{1} is associated with longest time scale of hopping over the largest barrier separating the left basin left centered around milestone 1 and right one centered around milestone 2 in Fig. 3(a); *λ*_{2} is associated with the next longest time hopping over the barrier separating the basin at the extreme left centered around milestone 1 and the one in the center centered around milestone 3; *λ*_{3}, …, *λ*_{8} are associated with hoping over the barriers separating milestones 1, 4, 5, milestones 3, 6, 7, and milestones 2, 8, 9; and finally *λ*_{9}, …, *λ*_{27} are associated with hoping over the barriers separating milestones 4, 10, 11, milestones 1, 12, 13, etc. From the results in Secs. III A and III B, the existence of the low lying groups also suggests that the continuous-time dynamics can be approximated by Markov chains (i.e., MSMs) containing, respectively, 2, 3, 9, and 27 states.

To confirm this prediction, we used the algorithm presented in Sec. II to construct these MSMs. Specifically, we used the 200 discretization points uniformly spaced between *x* = − 5 and *x* = 5 as trial milestones and computed the matrix entries Γ_{i,j} defined in (3) from a trajectory obtained by integrating (37). We then minimized the metastability index $ \rho M $ over index sets $M$ containing from *M* = 2 up to *M* = 30 indices. The minimum values of $ \rho M $ for each value of *M* are shown in Fig. 4(b) and, consistent with Bovier’s result, they correlate well with the values of the eigenvalue ratios *λ*_{M−1}/*λ _{M}* shown in Fig. 4(a), thereby confirming that we can use the metastability index $ \rho M $ to identify low-lying groups of eigenvalues. In particular, $ \rho M $ could be made small when

*M*= 2, 3, 9, and 27, and the corresponding index sets were $ M ( 2 ) = { 1 , 2} $, $ M ( 3 ) = { 1 , 2 , 3} $, $ M ( 9 ) = { 1 , \u2026 , 9} $, and $ M ( 27 ) = { 1 , \u2026 , 27} $. The associated target milestones are shown in Fig. 3(a). They clearly identify the lowest point of the wells on the hierarchical potential landscape

*U*(

*x*). The trajectory projected onto these target milestones when $ M ( 2 ) = { 1 , 2} $, $ M ( 3 ) = { 1 , 2 , 3} $, and $ M ( 9 ) = { 1 , \u2026 , 9} $ are shown in red in Fig. 3(b): these red pieces can be used to calculate the rate matrices

*k*

_{α,β}of the different MSMs on $ M ( 2 ) $, $ M ( 3 ) $, $ M ( 9 ) $, and $ M ( 27 ) $ via maximum likelihood maximization, as explained in Sec. II D. We can then calculate the spectrum of these rate matrices and compared them to the spectrum of the generator (38) of the original process. The result of these calculation is shown in Fig. 5 which shows that the eigenvalues of these different MSMs do indeed match the low-lying ones of the generator of the original process.

Note that the calculations above to identify target milestones were conducted at different temperatures, and showed that the target milestones were robust for temperature ranging from *β*^{−1} = 0.5 to *β*^{−1} = 1.0, even though their corresponding metastability index $ \rho M $ slowly grew with temperature. This is consistent with the fact that the ratios *λ*_{M−1}/*λ _{M}* also grow with temperature, as the system becomes slowly less metastable as it is heated up, with the groups associated with the lowest barriers disappearing first.

Note also that in these calculations, few transition events between the target milestones were observed (much less in particular that what is required to calculate the rate matrix entries *k*_{α,β} accurately), and yet the procedure was able to identify these milestones correctly.

## IV. GLY-ALA-GLY PEPTIDE

In this section, we use the method outlined in Sec. II to analyze a MD trajectory of a solvated Glycine-Alanine-Glycine peptide (GAG). The GAG peptide was modeled using the CHARMM 27 force field and simulated in a box of 475 TIP3P water molecules using the program NAMD version 2.8.^{29} After minimization the system was equilibrated for 10 ns with the peptide held constrained. The equilibration was followed by a 1.3 *μ*s production run with Langevin dynamics at 330 K, using a friction constant of 5 ps. The bonds between hydrogens and heavy atoms were kept rigid to allow integration at 2 fs. Frames were saved every 0.5 ps and a total of 2.6 ⋅ 10^{6} trajectory snapshots were collected.

To construct the set of trial milestones, we projected the MD trajectory onto the pair of dihedral angles (*ϕ*, *ψ*) that corresponds to the central residue alanine. We discretized the (*ϕ*, *ψ*)-space into a square grid of size 8 deg and used circles of radius 4 deg around these discretization points as trial milestones—we also used a finer grained definition of the trial milestones, a grid of points at 5° but this did not significantly change the results. Only those milestones that we were hit by the trajectory were kept: these consist of the *N* = 1303 milestones shown as circles in Fig. 6(a), and colored according to their free energy *Δ A _{i}* whose estimator is given in (16). This free energy defines a landscape that is divided in four macro-regions, often labeled as C5,

*P*

_{II},

*α*,

_{R}*α*, and

_{L}*α*. Interestingly, the overall topography of this landscape is consistent with that obtained in a reference study on a solvated alanine dipeptide

_{D}^{30}where the

*α*conformation is the most populated (see the Ramachandran map in Ref. 31 for the conformation names and Refs. 31–33 for comparison with NMR studies on the alanine conformational preferences). Since the target milestones that emerged from our analysis were located in the regions around C5,

_{R}*P*

_{II},

*α*,

_{R}*α*, and

_{L}*α*(these target milestones are shown as filled white circles in Fig. 6(a)), we used this nomenclature to designate them. Fig. 6(b) shows a portion of the trajectory projected on the angles

_{D}*ϕ*and

*ψ*.

From the MD trajectory the matrix of probabilities Γ_{i,j} was estimated from (3), and the metastability index $ \rho M $ was minimized over index sets $M$ of different cardinals *M*. The results are shown in Fig. 7. Two sets of target milestones were found to be clearly metastable: $ M ( 2 ) = { \alpha R , \alpha L } $ with metastability index $ \rho M =0.29$, and $ M ( 4 ) = { \alpha R , \alpha L , P II , \alpha D } $ with metastability index $ \rho M =0.45$. A third set of target milestones, $ M ( 3 ) = { \alpha R , \alpha L , P II } $ was also metastable, but with a higher metastability index $ \rho M =0.8$. Interestingly, the index set minimizing $ \rho M $ when *M* = 5 was $ M ( 5 ) = { \alpha R , \alpha L , P II , \alpha D , C 5} $, but its metastability index $ \rho M $ was close to 1, i.e., it was not deemed suitable by our analysis to be used to construct an MSM.

To confirm that the target milestones in $ M ( 2 ) $, $ M ( 3 ) $, and $ M ( 4 ) $ were good core sets to build an MSM (whereas others, including $ M ( 5 ) $, were not), we estimated the transition rate matrix entries *k*_{α,β} of the corresponding MSMs using the procedure described in Sec. II D. We then estimated the empirical first passage time (FPT) distributions from the MD trajectory for the transitions between the states in these MSMs. These were found to be approximately exponential, with decay rates consistent with those deduced from the matrix with entries *k*_{α,β} (data shown in Fig. S1 of the supplementary material^{34}). This indicates that the dynamics projected on these states is approximately Markovian, and confirms that a low metastability index is indeed a sign of Markovianity. Conversely, the FPT distributions between milestones *C*5 and *P*_{II} were non-exponential, which is indicative of the non-Markovian character of these transitions (see Fig. S1(a) in supplementary material^{34}). Interestingly, this result is consistent with the experimental findings about the non-cooperativity of the alanine *P*_{II} conformation in GGAGG peptides due to highly local hydration effects.^{35} It is also noteworthy that the transitions between the target milestones occur on different time scales: *τ*(*P*_{II} → *α _{R}*) = 0.48 ns,

*τ*(

*α*→

_{R}*P*

_{II}) = 0.73 ns,

*τ*(

*P*

_{II}→

*α*) = 9.5 ns,

_{L}*τ*(

*α*→

_{R}*α*) = 9.1 ns,

_{L}*τ*(

*α*→

_{L}*α*) = 8.5 ns,

_{D}*τ*(

*α*→

_{D}*α*) = 2.7 ns. This explains why different sets of target milestones can be identified, similar to what we observed in the illustrative example of Sec. III C. Note that the transition time scales between

_{L}*P*

_{II}and

*α*are consistent with the ≈1 ns

_{R}^{−1}interconversion rates reported in earlier studies on a solvated alanine dipeptide.

^{30,36,37}Note also that the target milestone

*α*plays the role of a hub for the transitions involving the left quadrant of the Ramachandran plot with the basin centered in

_{L}*α*. That is due to the fact that direct transitions from the helical region

_{D}*α*to the bottom right quadrant are very rare events in a solvated alanine.

_{R}Finally, the MSM with the target milestones in $ M ( 4 ) $ was used to cluster the state space of the system. These results are shown in Fig. 8. The committor probabilities *q*_{αR}(*i*) shown in Fig. 8(a) permit one to assign each of the trial milestones to the target milestones it is most likely to reach next, thereby partitioning the dihedral space into basins (see Fig. 8(c)). In Fig. 8(b) the trial milestones are colored according to the TSE index *σ*(*i*) defined in (18). The red regions represent the trial milestones *S _{i}* which are likely to part of a transition state ensemble. Also shown in Fig. 8(d) is the network of the MSM.

## V. BETA3S MINI PROTEIN

As a last example we applied our method to analyze a 20 *μ*s long MD trajectory of the Beta3s peptide in implicit solvent at 330 K, where multiple folding/unfolding events were observed. The Beta3s is a 20 residue peptide which is known to assume a triple stranded *β*-sheet fold.^{38–41} The simulation details of the MD data we analyzed here can be found in Ref. 42. The 20 *μ*s long trajectory is composed of 10^{6} microstates saved at a lag-time *τ* = 20 ps. In the context of the GAG peptide (Sec. IV), the torsional angles of the central alanine residue were shown to be sufficiently good descriptors of the conformational space and were used to define the trial milestones. This procedure, however, is not applicable to a 20 residues mini protein such as Beta3s, whose configurational space is much more complex than that of a tri-peptide. For the initial definition of the trial set of milestones several alternatives are in principle viable. Here we used the main dihedral angles along the chain as starting point.

Specifically, the trial milestones were defined at the level of the individual amino acids. For a protein of *R* residues there are *R* − 2 pairs of main dihedral angles (*ϕ _{r}*,

*ψ*) along the chain, where

_{r}*r*= 2, …, (

*R*− 2) denotes the residue index. For each of the

*R*− 2 Ramachandran plots that are associated with a residue, three trial milestones were defined as circles of radius 30°, and denoted as milestone 0, 1, and 2. The centers of these circles are located at the minima of the free energy landscape obtained from the empirical (

*ϕ*,

_{r}*ψ*) probability density of all the residues combined (see the free energy contour plot in Fig. 9(a)) that is estimated from the MD trajectory. The locations of the milestone centers are (−90, 110) for state 0 (beta-sheet region), (−80, − 40) for the state 1 (helix region), and (75, − 90) for state 2 (turn/loop region). In this way, the value of each pair of residue (

_{r}*ϕ*,

_{r}*ψ*) along the MD trajectory is first mapped on a three-letter alphabet, and these letters are then combined into a word, resulting in the re-presentation

_{r} where *s*^{(r)} ∈ (0, 1, 2) and *r* the residue index. For example, the folded state is represented as “000021000000210000.” Fig. 9(b) gives a pictorial representation of the construction of symbolic milestones. Since Beta3s is a 20 residues mini-protein, the upper limit of accessible trial milestones constructed this way is therefore 3^{18} ≈ 4 ⋅ 10^{8}.

In practice, not all these trial milestones were visited even once along the trajectory, and some were visited much more often than others. To avoid using trial milestones visited too few times, we filtered out the milestones that were visited less than a threshold value of times. The first step of dihedral milestoning gave a total number of *N*_{0} = 469 677 dihedral strings, 373 456 of which were visited only once, which is about ∼37% of the whole set. Interestingly, the ranked distribution of the visiting frequencies follows a Zipf’s law decay, namely, a power law ∼1/*r ^{a}* with

*r*the rank and

*a*= 0.76 the exponent (see Figure 9(c)). Different frequency thresholds were used to filter out trial milestones. Cutoffs corresponding to 20, 30, 40, and 50 reduced the number of trial milestones to 3439, 2120, 1516, and 1162, respectively. All these four sets of trial milestones were used to process the MD trajectory and estimate the probabilities Γ

_{i,j}. The metastability index $ \rho M $ was calculated for target sets $M$ with cardinality in the range

*M*= 2, …, 20. Fig. 10 shows the result of this calculation. Irrespective of the cutoffs used to define the set of trial milestones, the resulting target sets were robust in the range

*M*= 2, …, 8 where at the low values of $ \rho M \u22721$.

To better understand the conformational dynamics of the Beta3s peptide, we used the tools introduced in Sec. II E. The committor probability *q*_{α}(*i*) (obtained from (17)) permits one to perform a soft partitioning of the trial milestones. Each target set milestone *B*_{α} is the center of a partition with statistical weight *π*_{α} that is obtained from (12), and from which the free energy *Δ G*_{α} is determined from (13). This free energy can then be decomposed into energetic and entropic contributions according to (19) and (21). Here the effective energy of the system is identified as the sum of two contributions: $ E \u0304 = E charmm + E sasa $ where *E*_{charmm} is the total potential energy in the CHARMM force field and *E*_{sasa} is the solvent accessible term due to the implicit solvation model. Using (20) we obtained $ E \u0304 =\u221237.9\xb111.1$ kcal/mol as mean effective energy of the entire set of conformation sampled.

We focused our analysis on the target set $ M ( 8 ) $ that resulted the largest metastable target set with $ \rho M <1$. The committor probability *q*_{α}(*i*) is calculated from the time series of trial milestones, with $\alpha \u2208 M ( 8 ) $. Table I lists the statistical weights of the most populated states in the MSM build on $ M ( 8 ) $. The mean effective energies *Δ E*_{α} are also reported in Table I, along with the entropy contributions −*TΔ S*_{α} to the free energy *Δ G*_{α}. The basin centered around the target milestone associated with the folded state (id 1 in Table I) has a statistical weight *π*_{1} of about 76%, meaning that Beta3s spends about the 3/4 of the simulation time in this basin. As expected for a folded state, it has an energetic advantage (*Δ E*_{1} = − 0.6 kcal/mol) which compensates for its entropic disadvantage (−*TΔ S*_{1} = 0.8 kcal/mol). The other basins identified in the target set $ M ( 8 ) $ include helical and beta-curl type configurations. Interestingly, the helical basins (id 106, 241, and 264 in Table I) are entropically favored (Δ*S*_{α} < 0 and Δ*E*_{α} > 0) while the beta-curl states are energetically favored basins (id 164, 319, 473, and 843 in Table I) with Δ*S*_{α} < 0 and Δ*E*_{α} < 0. The statistical error on the mean effective energies Δ*E*_{α} is estimated as $std ( \Delta E \alpha ) / N \alpha $ where *std*(Δ*E*_{α}) is a standard deviation and *N*_{α} is the number of times the system enters a target milestone *B*_{α}. The values of *N*_{α} are reported in Table I. The statistical errors on Δ*E*_{α} are reported in Fig. S4 of the supplementary material^{34} along with the empirical probability density functions of the effective energy for each of the target milestones *B*_{α}. While these errors are large, the information contained in *Δ E*_{α} remains statistically significant.

Markov state model . | ||||||
---|---|---|---|---|---|---|

id . | Symbolic milestone . | π_{α} (%)
. | N_{α}
. | ΔG_{α} (kcal/mol)
. | ΔE_{α} (kcal/mol)
. | −TΔS_{α} (kcal/mol)
. |

1 | 000021000000210000 | 76 | 48 | 0.2 | −0.6 | 0.8 |

106 | 111111111111100000 | 8.2 | 42 | 1.6 | 4.9 | −3.3 |

164 | 001211001000210010 | 3.1 | 9 | 2.3 | −5.5 | 7.8 |

241 | 111111111111111111 | 3.1 | 37 | 2.3 | 4.7 | −2.5 |

264 | 010001111111111111 | 4.5 | 33 | 2.0 | 5.1 | −3.0 |

319 | 001111000011111000 | 2.3 | 19 | 2.5 | −3.6 | 6.0 |

473 | 001111000000200000 | 1.4 | 23 | 2.8 | −2.2 | 4.9 |

843 | 001111000111210010 | 1.4 | 5 | 2.8 | −2.8 | 5.7 |

Markov state model . | ||||||
---|---|---|---|---|---|---|

id . | Symbolic milestone . | π_{α} (%)
. | N_{α}
. | ΔG_{α} (kcal/mol)
. | ΔE_{α} (kcal/mol)
. | −TΔS_{α} (kcal/mol)
. |

1 | 000021000000210000 | 76 | 48 | 0.2 | −0.6 | 0.8 |

106 | 111111111111100000 | 8.2 | 42 | 1.6 | 4.9 | −3.3 |

164 | 001211001000210010 | 3.1 | 9 | 2.3 | −5.5 | 7.8 |

241 | 111111111111111111 | 3.1 | 37 | 2.3 | 4.7 | −2.5 |

264 | 010001111111111111 | 4.5 | 33 | 2.0 | 5.1 | −3.0 |

319 | 001111000011111000 | 2.3 | 19 | 2.5 | −3.6 | 6.0 |

473 | 001111000000200000 | 1.4 | 23 | 2.8 | −2.2 | 4.9 |

843 | 001111000111210010 | 1.4 | 5 | 2.8 | −2.8 | 5.7 |

Fig. 11 shows the network of the MSM built on the target set $ M ( 8 ) $. Notably, the folded state lies in between the helical region, which is entropically favored, and the beta-curl region, which is energetically favored. Except very rare direct connections between the helical and beta-curl regions, most of the transitions between the helix and beta-curl basins occur via the folded state. Thus, as was remarked elsewhere,^{43} the folded state is not only the region of the configurational space where energy and entropy compensate each other but it also plays the role of a dynamical hub in the network. This observation is confirmed by the high number of times the target milestone associated with the folded state is revisited (*N*_{1} = 48 is the highest values observed, see *N*_{α} column in Table I).

The high entropy character of the helical states is also reflected by the high connectivity within the helical region (left side of the network shown in Fig. 11) and the comparable values of *N*_{α} with that of the folded state (*N*_{106} = 42, *N*_{241} = 37, and *N*_{264} = 33 in Table I). On the contrary, the beta-curl states (right side of the network shown in Fig. 11) are less connected (*N*_{164} = 9, *N*_{319} = 19, *N*_{473} = 23, and *N*_{843} = 5). This is consistent with the observation reported in Ref. 42 that these beta-curl states act as kinetic traps.

The probability distributions of FPT from the folded state to the other member of the target set and back are fairly well described by an exponential decay (see Fig. S2 and S3 in the supplementary material^{34}). This indicates that the reduced dynamics of the Beta3s to a MSM built on the target set $ M ( 8 ) $ is approximatively Markovian. Interestingly, the emerging picture from the network representation in Fig. 11 is qualitatively consistent with the simplified kinetic network representation proposed in Ref. 41.

The committor probability *q*_{α}(*i*) calculated for the target set $ M ( 8 ) $ allows one to project the configurational space onto a simplex. Three aggregated committor probabilities are defined as *q*_{fold} = *q*_{1} for the folded state, *q*_{helix} = *q*_{106} + *q*_{241} + *q*_{264} for the helix state, and *q*_{trap} = *q*_{164} + *q*_{319} + *q*_{473} + *q*_{843} for the beta-curl state (see Table I for the state indexes). By definition of committor probability one has *q*_{fold} + *q*_{helix} + *q*_{trap} = 1. Fig. 12 shows the simplex representation of the aggregated committor probabilities in a triangular plot. Fig. 12(a) shows the time series of the committor probabilities *q*_{fold}(*i*_{∗}(*t*)), *q*_{helix}(*i*_{∗}(*t*)), *q*_{trap}(*i*_{∗}(*t*)) projected onto a triangular plot. Interestingly, the conformers are clustered around the three vertexes of the triangular plot. The blue lines represent direct transitions between conformer and only 4 direct transitions link the helix channel to the trap channel. Fig. 12(b) shows a triangular scatter plot of the committor probabilities with the points colored according to their TSE index *σ*(*i*) defined in (18). Red points correspond to milestones belonging to transition state ensembles. For instance, the channel connecting the helix to the fold basin is composed by milestones *S _{i}* such that

*q*

_{fold}≈

*q*

_{helix}≈ 0.5. Similarly, along the channel Trap↔ Fold one has milestones

*S*with

_{i}*q*

_{fold}≈

*q*

_{trap}≈ 0.5. The simplex representation in Fig. 12 corroborates the observation that the folded state acts as a kinetic hub,

^{42}and it also suggests that Beta3s oscillates between two extreme conformational regions: a low energy one versus a high entropy one.

## VI. CONCLUSIONS

Despite their growing popularity to analyze and interpret molecular dynamics (MD) simulations, Markov State Models (MSMs) remain somewhat tricky to use and justify. How to pick the states over which to build the MSM and how to assess its accuracy are nontrivial issues that are related to the degree to which the dynamical coarse-graining over these states preserves the Markovianity of the original system. Typically, states in MSMs are picked in a somewhat *ad hoc* fashion, and their quality is assessed *a posteriori* via Markovianity tests. In this paper, we propose a strategy that may help systematize the operation of MSM building. It is based on the observation that, in metastable systems, the states in an MSM should be hubs to which the trajectory has a high probability to return often, and between which it seldom transitions. These intuitive properties can be captured via a metastability index that measures how good a set of target milestones chosen among trial ones is: the smaller its metastability index, the better the target milestones it contains are as hubs. This metastability index is not a new concept: it was introduced in a closely related form in potential theoretic approaches to metastability. Our main goal in this paper was to show that this index can also be turned into a practical computational tool for the identification of good target milestones upon which to build accurate MSMs. As shown here, the procedure is simple to use, it does not require to make Markovian assumptions at the intermediate stages of the construction (in particular, the jumps between the trial milestones could be correlated and non-Markovian), and it has the added advantage that it allows one to cluster the system’s state space *a posteriori* and identify the transition state ensembles (TSEs) between the target milestones. Here, these features were illustrated on a collection of examples of increasing complexity. In particular, our last test case involving the Beta3s mini-protein, showed that the method can reveal interesting properties of the folding landscape of proteins, including the presence of kinetic traps and misfolded states on the way to the native state, and how these features affects the folding and unfolding pathways of the protein. We certainly hope that the technique will be similarly useful to analyze the dynamics of other proteins and macromolecules.

## Acknowledgments

We thank Amedeo Caflisch for sharing the MD trajectories of the Beta3s peptide. We also thank Jianfeng Lu for helpful discussions. E.V.-E.’s research was supported in part by NSF Grant No. DMS-1522767.

### APPENDIX: DERIVATION OF (27)

In this appendix, a proof of (27) is given in case of a Markov process with generator *L*_{i,j} and transition probability $ P i , j = L i , j / \u2211 k \u2260 i L i , k $, *i* ≠ *j*, *P*_{i,i} = 0. To begin, recall that the MFPT *τ*_{i,j} from state *i* to state *j* is the solution to

while the MRT is given by

where the first term at the right hand-side accounts for the mean time to exit *i* to some state *j* ≠ *i*, and the second for the mean time to come back from any such state *j* ≠ *i* to *i*. We are interested to find an expression for the probabilities Γ_{i,j} as a function of the MFPT and MRT. Since $ \Gamma i , j = \u2211 k P i , k q i , j ( k ) $, we can equivalently express the committor function *q*_{i,j}(*k*) in terms of MFPT and the MRT. We claim that

Let us check that this relation holds by verifying that the function *q*_{i,j}(*k*) defined this way satisfies *q*_{i,j}(*i*) = 0, *q*_{i,j}(*j*) = 1 and $ \u2211 l \u2208 S L k , l q i , j ( l ) =0$ for *k* ∉ {*i*, *j*}. The two boundary conditions are trivially verified and the third condition is