Reversibility is a key concept in Markov models and master-equation models of molecular kinetics. The analysis and interpretation of the transition matrix encoding the kinetic properties of the model rely heavily on the reversibility property. The estimation of a reversible transition matrix from simulation data is, therefore, crucial to the successful application of the previously developed theory. In this work, we discuss methods for the maximum likelihood estimation of transition matrices from finite simulation data and present a new algorithm for the estimation when reversibility with respect to a given stationary vector is desired. We also develop new methods for the Bayesian posterior inference of reversible transition matrices with and without a given stationary vector, taking into account the need for a suitable prior distribution that preserves the metastable features of the observed process during posterior inference. All algorithms here are implemented in the PyEMMA software — http://pyemma.org — as of version 2.0.

## I. INTRODUCTION

Markov models, Markov state models (MSMs), or master-equation models are a powerful framework to reduce the great complexity of bio-molecular dynamics to a simple kinetic description that represents the underlying transitions between distinct conformations.^{1–7} These models allow us to analyze the longest-living (metastable) sets of structures,^{8} the effective transition rates between them,^{9,10} the kinetic relaxation processes and their relationship to equilibrium kinetics experiments,^{7,11–14} and the thermodynamics and kinetics over multiple thermodynamic states.^{15–18} A key advantage of MSMs is that they are estimated from conditional transition statistics between states, and they thus do not require the data to be in global equilibrium across all states. As a result, they are an excellent tool to integrate the data of multiple simulation trajectories that have been run independently and from different initial states into a single informative model.^{19–21}

A variety of complex molecular processes have been successfully described using MSMs. Examples include the folding of proteins into their native folded structure,^{20,22,23} the dynamics of natively unstructured proteins,^{24,25} and the binding of a ligand to a target protein.^{21,26–30}

There are two key steps in the construction of an MSM. First, a suitable discretization of the continuous conformation space has to be obtained. In most cases, no good *a priori* discretization is known and the discretization has to be found based on the simulation data. The appropriate choice of discretization is a topic of ongoing research.^{24,31,32} The error incurred by the discretization and by the subsequent approximation of the jump-process as a Markov process can be systematically controlled and evaluated.^{7,33–35}

In the second step, one estimates the transition probabilities between pairs of states based on the transition statistics. The most common approach to estimating Markov models from data is by means of a Bayesian framework. One first harvests the transition counts *c*_{ij}(*τ*) from the data, i.e., how often trajectories were found in discrete state *i* at some time *t* and in discrete state *j* at some later time *t* + *τ*. The parameter *τ* is called the lag time and is crucial for the quality of the Markov model.^{7,33} Next, one computes the transition matrix either by maximizing the likelihood, i.e., the probability over all possible Markov model transition (or rate) matrices that may have generated the observed transition counts,^{7,36,37} or by sampling Markov models from the posterior distribution.^{38–42}

A maximum likelihood estimate gives a single-point estimate, i.e., a single Markov model that is “most representative” given the data. However, if some transition events are rare compared to the total simulation length — and this is the typical case in molecular dynamics simulation — this maximum likelihood model might be very uncertain and thus far away from the model that one would converge to with an increasing amount of simulation data. The Bayesian posterior ensemble is a natural approach to quantify such statistical uncertainties and thus to make meaningful comparisons between Markov models obtained from different sets of simulations or to experimental data.

A key property of molecular dynamics at thermal equilibrium, and a necessary consequence of the second law of thermodynamics, is microscopic reversibility of the equations of motion. This property is ensured by many simulation procedures^{43,44} and carries over to detailed balance between discrete states, i.e., it formally leads to a (time-) reversible Markov model. A reversible Markov model is not only physically desirable; it also offers statistical advantages, as it has only about half as many independent parameters as a nonreversible model, and it allows the equilibrium kinetics to be analyzed in a straightforward and meaningful manner. Furthermore, imposing detailed balance with respect to a given stationary vector can be used to aid the efficient estimation of rare-event processes from MSMs.^{45}

Algorithms imposing detailed balance during likelihood maximization have been discussed in Refs. 7, 16, 36, and 37. First methods for sampling the posterior distribution of reversible transition matrices have been suggested in Ref. 39 and later in Ref. 46. A method working with natural priors for reversible chains was proposed in Ref. 47. The sampling of transition matrices reversible with respect to a fixed stationary distribution was also presented in Ref. 39, while a Gibbs sampling algorithm with a significantly improved convergence rate has been developed in Ref. 42. Reference 48 discusses methods for goodness-of-fit tests for Markov chains.

This article is deliberately broad and presents new concepts, insights, and algorithms for reversible Markov model estimation in general, maximum likelihood estimators, and Bayesian estimators that mutually benefit from each other. For this reason, we first give a survey of principles and consequences of reversible Markov models. We then extend the framework of maximum likelihood estimation of transition matrices by giving a simplified maximum likelihood estimator (MLE) for reversible transition matrices and a new estimator for reversible transition matrices with a fixed equilibrium distribution. The main part of the paper comprises new algorithms for the full Bayesian analysis of the posterior ensemble of reversible Markov models. As yet, three fundamental problems have not been satisfactorily solved: (i) How can one harvest statistically uncorrelated transition counts from trajectories in which subsequent transitions are correlated, so as to give rise to meaningful uncertainty intervals? (ii) Which prior should be used in a Bayesian analysis so as to get error intervals that envelop the true value even for Markov models with many states? (iii) How can we design efficient sampling algorithms for the reversible posterior ensemble, i.e., algorithms that allow one to quickly compute reliable error bars for Markov models with many states? In this paper, we discuss (i) and give a rather complete treatment of problems (ii) and (iii). Efficient sampling algorithms are derived for reversible Markov models and reversible Markov models with fixed equilibrium distribution.

## II. REVERSIBLE MARKOV MODELS

In this section, we will show that microscopic reversibility carries over to the discretized situation and discuss the desirable properties of a reversible Markov state model.

### A. From microscopic reversibility to discrete-state detailed balance

Let *μ*(*x*) denote the equilibrium distribution on the microscopic degrees of freedom *x* ∈ Ω, e.g., all-atom coordinates of the system of interest, and let *p*_{τ}(*x*, *y*) denote the conditional transition density of the MD implementation. *p*_{τ}(*x*, *y*) is the probability that the system is found in state *y* at time *t* + *τ* given that it has been in state *x* at time *t*. The MD implementation fulfills microscopic reversibility if the following detailed balance equation,

$$\mu(x)\, p_\tau(x, y) = \mu(y)\, p_\tau(y, x), \tag{1}$$

holds for all pairs of states *x*, *y* ∈ Ω. Hence, the terms “detailed balance” and “reversible” are equivalent in our context. Since *μ*(*x*) *p*_{τ}(*x*, *y*) is the unconditional probability to find the transition (*x*, *t*) → (*y*, *t* + *τ*), Eq. (1) means that the system is on average time-reversible — the absolute number of transitions from *x* to *y* is equal to the reverse. Microscopic detailed balance is desirable to have in any MD implementation when the aim is to perform simulations in thermodynamic equilibrium. If (1) were violated, that would imply the existence of cycles *x* → *y* → *z* → *x* along which there is a net transport. Since such cycles could be used to generate work, their existence in a system that is driven by purely thermal energy would be inconsistent with the second law of thermodynamics.

Dynamical models that are commonly employed in MD implementations fulfill detailed balance. Brownian (overdamped Langevin) dynamics fulfills detailed balance. Hamiltonian and non-overdamped Langevin dynamics fulfill generalized detailed balance with respect to momentum inversion in phase space, but when integrating over the distribution of momenta they do fulfill ordinary detailed balance in position space.^{49} In practice, some finite time-stepping integrators do not obey exact detailed balance with respect to the Boltzmann distribution, but we here consider that the MD implementation has been chosen such that detailed balance is at least approximately fulfilled.

Now suppose that the state space Ω is partitioned into non-overlapping subsets *S*_{1}, …, *S*_{n} that we shall call *discrete states* here. Each set has an equilibrium probability given by

$$\pi_i = \int_{S_i} \mathrm{d}x\; \mu(x), \tag{2}$$

and the transition density gives rise to a discrete state transition matrix *P*(*τ*) with entries

$$p_{ij}(\tau) = \frac{1}{\pi_i} \int_{S_i} \mathrm{d}x \int_{S_j} \mathrm{d}y\; \mu(x)\, p_\tau(x, y). \tag{3}$$

Inserting the microscopic detailed balance equation (1) into (3) shows that *P*(*τ*) fulfills detailed balance with respect to *π* on the discrete states,

$$\pi_i\, p_{ij}(\tau) = \pi_j\, p_{ji}(\tau). \tag{4}$$

Note that (4) holds independently of the choice of the lag time *τ*. Moreover, (4) implies that *π* is the equilibrium probability vector of *P*(*τ*). By defining the diagonal matrix Π = diag(*π*_{1}, …, *π*_{n}), we can alternatively write (4) as a matrix equation,

$$\Pi P(\tau) = P(\tau)^\top \Pi. \tag{5}$$

As a result, if the microscopic dynamics are reversible, the Markov model transition matrix must also be reversible. However, a direct estimate of *P* from a finite amount of simulation data cannot be expected to fulfill (4) exactly. Thus, the validity of detailed balance in principle motivates us to *enforce* (4) in the process of estimating *P*.

Enforcing detailed balance helps to reduce the statistical error of an estimator for *P* because it reduces the number of independent variables by approximately one half (see Table I). However, there are other consequences of having (4): Given detailed balance, we can compute the molecular equilibrium kinetics in a physically meaningful way and employ some useful analysis tools that are not defined for nonreversible Markov models. Moreover, we can employ more efficient and robust matrix algebra routines when exploiting that *P* is a reversible matrix.

**TABLE I.** Number of independent parameters (degrees of freedom, dof) of an *n*-state transition matrix under different constraints.

| Constraints | dof | ≈ |
|---|---|---|
| None | $n(n-1)$ | $n^2$ |
| Reversible | $\frac{1}{2}n(n-1) + n - 1$ | $\frac{1}{2}n^2$ |
| Reversible, fixed π | $\frac{1}{2}n(n-1)$ | $\frac{1}{2}n^2$ |


### B. Eigenvalues and eigenvectors

Many methods to analyze the molecular kinetics based on a Markov model transition matrix rely on the eigenvalue decomposition of *P*. Using the diagonal eigenvalue matrix Λ = diag(*λ*_{1}, …, *λ*_{n}), we can formulate a right eigenvalue problem with right column eigenvectors *R* = (*r*_{1}, …, *r*_{n}), *r*_{i} ∈ ℝ^{n}, and left row eigenvectors *L* = (*l*_{1}, …, *l*_{n})^{⊤},

$$P R = R \Lambda, \tag{6}$$

$$L P = \Lambda L. \tag{7}$$

From (7), we can obtain a generalized eigenvalue problem,

$$(\Pi P)\, r_i = \lambda_i\, \Pi\, r_i. \tag{8}$$

Π is symmetric positive definite and, as a result of detailed balance, Π*P* is symmetric. Hence, all eigenvalues *λ*_{1}, …, *λ*_{n} are real, and the eigenvectors are orthogonal with respect to the equilibrium distribution,^{50}

$$\langle r_i, r_j \rangle_\pi = \delta_{ij}, \tag{9}$$

where we have used the weighted scalar product $\langle u, v \rangle_\pi = \sum_i \pi_i u_i v_i$. We can make *R* orthonormal by scaling an arbitrarily obtained eigenvector *r*_{i} by $\langle r_i, r_i \rangle_\pi^{-1/2}$. With an orthonormal *R*, the transition matrix can be written as the spectral decomposition

$$P = R \Lambda R^\top \Pi, \tag{10}$$

which solves the left eigenvalue problem (7) with the choice

$$L = R^\top \Pi.$$
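These spectral properties can be checked numerically. The following sketch (plain NumPy; the 3 × 3 matrix is a hypothetical example of ours, not from the paper) builds a reversible transition matrix from a symmetric flux matrix and verifies detailed balance, the real spectrum, π-orthonormality, and the decomposition (10):

```python
import numpy as np

# Hypothetical 3-state example (not from the paper): build a reversible P
# from a symmetric "flux" matrix X via pi_i = x_i / sum(X), p_ij = x_ij / x_i.
X = np.array([[6.0, 2.0, 1.0],
              [2.0, 8.0, 2.0],
              [1.0, 2.0, 10.0]])
pi = X.sum(axis=1) / X.sum()
P = X / X.sum(axis=1, keepdims=True)
Pi = np.diag(pi)

assert np.allclose(Pi @ P, (Pi @ P).T)     # detailed balance, eq. (5)

w, R = np.linalg.eig(P)                    # Pi P symmetric => real spectrum
w, R = w.real, R.real
# Normalize right eigenvectors w.r.t. the weighted scalar product <.,.>_pi.
R = R / np.sqrt(np.einsum("i,ij,ij->j", pi, R, R))
assert np.allclose(R.T @ Pi @ R, np.eye(3))        # pi-orthonormality
assert np.allclose(P, R @ np.diag(w) @ R.T @ Pi)   # decomposition (10)
```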

#### 1. Example 1

Consider the following reversible 3 × 3 transition matrix:

Suppose we generate a Markov chain of length 20 starting from state 1, resulting in the count matrix at lag *τ* = 1,

Now we conduct a nonreversible and a reversible maximum likelihood estimation of the transition matrix given *C*. Eigenvalues for the exact transition matrix in (11) and both nonreversible and reversible estimates for the given count matrix in (12) are shown in Fig. 1.

It is seen that the nonreversible estimate contains complex eigenvalues. These generally come in complex conjugate pairs. Fig. 1 shows a much higher accuracy of the reversible estimate compared to the nonreversible estimate. In order to explore the statistical significance of this observation, we run *N* = 1000 chains of length *L* = 20 using transition matrix (11). The reversible and nonreversible estimation results, together with the true eigenvalues, are reported in Table II.

**TABLE II.** Eigenvalues of the exact transition matrix (11) and mean ± standard deviation of the reversible and nonreversible estimates over *N* = 1000 chains of length *L* = 20.

| | λ_{1} | Re{λ_{2}} | Im{λ_{2}} | Re{λ_{3}} | Im{λ_{3}} |
|---|---|---|---|---|---|
| Exact | 1 | 0.42 | 0.0 | 0.18 | 0.0 |
| Reversible | 1 | 0.36 ± 0.31 | 0.0 | 0.18 ± 0.04 | 0.0 |
| Non-reversible | 1 | 0.32 ± 0.29 | −0.04 ± 0.08 | 0.07 ± 0.21 | 0.04 ± 0.09 |


It is seen that the reversible estimates not only have the correct real-valued structure but can also have smaller uncertainties (here, especially for *λ*_{3}). This is expected to be a general result due to the smaller number of degrees of freedom in the reversible estimate.

#### 2. Example 2

Fig. 2 shows the distribution of eigenvalues from nonreversible and reversible Markov models from simulation data for the alanine-dipeptide molecule (see Sec. V B). The eigenvalues of the reversible estimate are purely real while the non-reversible estimate has eigenvalues with non-zero imaginary part.

### C. Equilibrium kinetics analyses

Since detailed balance is a consequence of simulating a system in thermodynamic equilibrium, it is not surprising that detailed balance in the transition matrix *P* is a prerequisite for analyzing the equilibrium kinetics given *P*. Since kinetics are related to slow processes, we will here only consider the *m* largest eigenvalues *λ*_{1}, …, *λ*_{m} and assume that they are positive. Here are a few examples of equilibrium kinetics properties computed from reversible transition matrices:

- The dominant relaxation rates of the molecular system are
  $$\kappa_i = -\frac{1}{\tau} \ln \lambda_i,$$
  where *i* ≥ 2 (*i* = 1 has a relaxation rate of zero and corresponds to the equilibrium distribution). The inverse quantities are the relaxation time scales $t_i = \kappa_i^{-1}$. These rates or time scales are of special interest because they are often detectable in kinetic experiments such as fluorescence time-correlation spectroscopy, two-dimensional IR spectroscopy, or temperature jump experiments — see Ref. 11 for a discussion.
- Decomposition (10) can be used to write kinetic experimental observables in an illuminating form.^{11,13,14} For example, the long-time scale part of the autocorrelation of a molecular observable *a* ∈ ℝ^{n}, e.g., containing the fluorescence values of every Markov state of a molecule, can be written as
  $$\operatorname{acf}(a;\, \tau) \approx \langle a,\, \pi \rangle^2 + \sum_{i=2}^{m} \langle a,\, r_i \rangle_\pi^2\, e^{-\kappa_i \tau}.$$
- The PCCA+ method for seeking *m* metastable sets of Markov states and its variants^{8,51} assumes *m* real-valued eigenvalues and eigenvectors. It is thus only reliably applicable to reversible transition matrices.
- Discrete transition path theory^{20,52,53} computes the statistics of transition pathways from a set of states *A* to a set of states *B* given a transition matrix. Discrete TPT can be used with nonreversible and reversible transition matrices. However, in the reversible case, the forward committor and the backward committor are complementary,
  $$q_i^+ = 1 - q_i^-,$$
  and the net fluxes, when ordering states such that $q_i^+ \le q_j^+$, are given by
  $$f_{ij}^+ = (q_j^+ - q_i^+)\, \pi_i\, p_{ij},$$
  which is analogous to an electric current *I* = *UG*, where $I = f_{ij}^+$ is the current, $U = (q_j^+ - q_i^+)$ is the potential difference, and $G = \pi_i p_{ij}$ is the conductivity.^{54}
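As a small illustration of the first point above, the following sketch (plain NumPy; matrix values are hypothetical, not from the paper) computes the relaxation time scales $t_i = \kappa_i^{-1}$ from the eigenvalues of a reversible transition matrix:

```python
import numpy as np

def relaxation_timescales(P, tau=1.0):
    """Relaxation rates kappa_i = -ln(lambda_i)/tau and time scales
    t_i = 1/kappa_i for the eigenvalues of a reversible transition matrix P,
    assuming the retained eigenvalues are positive."""
    w = np.sort(np.linalg.eigvals(P).real)[::-1]  # reversible P: real spectrum
    kappa = -np.log(w[1:]) / tau                  # skip lambda_1 = 1
    return 1.0 / kappa

# Hypothetical reversible 3-state matrix (illustration only).
X = np.array([[8.0, 1.0, 1.0],
              [1.0, 8.0, 1.0],
              [1.0, 1.0, 8.0]])
P = X / X.sum(axis=1, keepdims=True)
print(relaxation_timescales(P, tau=1.0))
```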

## III. LIKELIHOOD, COUNTING, AND MAXIMUM LIKELIHOOD ESTIMATION

We restate the transition matrix likelihood and formulate the maximum likelihood estimation problem for Markov model transition matrices. We present new estimation algorithms for reversible Markov models with known or unknown equilibrium distribution *π*.

### A. Likelihood

Suppose we have a discrete sequence *S* = {*s*_{1}, …, *s*_{N}} with *s*_{t} ∈ {1, …, *n*}. If we assume that this sequence is the realization of a Markov chain with lag time *τ* = 1, the probability that a transition matrix *P* has generated *S* is proportional to the product of individual transition probabilities along the trajectory,

$$\mathbb{P}(S \mid P) \propto \prod_{t=1}^{N-1} p_{s_t s_{t+1}}. \tag{15}$$

We have neglected the proportionality constant because we will not need it in order to maximize or sample from (15). This is very handy because one component in this constant is the probability of generating the first state, ℙ(*s*_{1}), which is often unknown, but is constant for a fixed data set *S*.

Suppose we have *c*_{ij} transitions from *i* to *j*. Then, we can group all *p*_{ij} terms together and get a factor $p_{ij}^{c_{ij}}$. Doing this for all pairs results in the equivalent likelihood formulation

$$\mathbb{P}(S \mid P) \propto \prod_{i,j=1}^{n} p_{ij}^{c_{ij}}. \tag{16}$$

We can see that the count matrix *C* ∈ ℝ^{n×n} is a sufficient statistic for the Markov model likelihood ℙ(*S* ∣ *P*) — it generates the same likelihood although we have discarded the information in which sequence the transitions have occurred.

If multiple trajectories are available, their count matrices are simply added up.
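The counting step and the sufficiency of *C* can be sketched as follows (function names and toy trajectories are ours, not PyEMMA's): evaluating the log of (15) transition-by-transition gives the same value as evaluating the log of (16) from the count matrix, and count matrices of multiple trajectories are simply added.

```python
import numpy as np

def count_matrix(S, n):
    """Transition count matrix c_ij at lag 1 from a discrete trajectory S."""
    C = np.zeros((n, n))
    for s, t in zip(S[:-1], S[1:]):
        C[s, t] += 1.0
    return C

def log_likelihood(C, P):
    """log P(S | P) up to a constant, eq. (16): sum_ij c_ij log p_ij."""
    mask = C > 0
    return np.sum(C[mask] * np.log(P[mask]))

# Two hypothetical trajectories over n = 2 states; counts are added up.
trajs = [[0, 0, 1, 1, 0], [1, 0, 0, 1, 1]]
C = sum(count_matrix(S, 2) for S in trajs)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
# Eq. (15) evaluated along the sequences equals eq. (16) from the counts.
ll_seq = sum(np.log(P[s, t]) for S in trajs for s, t in zip(S[:-1], S[1:]))
assert np.isclose(ll_seq, log_likelihood(C, P))
```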

### B. Counting

How should we count transitions for longer lag times *τ* > 1, or if *S* is not Markovian at lag time *τ*? Regarding the first case, if *S* is Markovian at lag time *τ*, a safe approach seems to be to subsample the trajectory at time steps of *τ* and then treat the subsampled trajectory as above.^{7,9} However, this approach is statistically inefficient: If *S* is also Markovian for shorter lag times than *τ*, then we are using less information than we could. Even if *S* only becomes Markovian at lag times of *τ* or longer, transitions such as 1 → *τ* + 1 and *τ*/2 → *τ* + *τ*/2 are usually only partially correlated, such that discarding the second transition also does not fully exploit the data. In practical MD simulations, the lag times required for a Markov model to be a good approximation are quite long (often in the range of nanoseconds), such that subsampling the data at *τ* will create severe problems with data and connectivity loss. Regarding the second case, if *S* is not Markovian at lag time *τ*, then treating every *c*_{ij} as an independent count is incorrect.

Both cases can, in principle, be treated with the following formalism: We always obtain the count matrix *c*_{ij} in a *sliding window* mode,^{7} i.e., we harvest all *N* − *τ* available transition counts from time pairs (1 → 1 + *τ*), (2 → 2 + *τ*), …, (*N* − *τ* → *N*). Unless *S* is Markovian at lag time 1, we will now harvest more transition counts than are statistically independent. We can formally correct for this by introducing a statistical inefficiency *I*_{ij}(*τ*) for every count at a given lag time, such that $c_{ij}^{\mathrm{eff}}(\tau) = I_{ij}(\tau)\, c_{ij}(\tau)$ is the effective number of counts, resulting in the likelihood

$$\mathbb{P}(S \mid P) \propto \prod_{i,j=1}^{n} p_{ij}^{c_{ij}^{\mathrm{eff}}}. \tag{17}$$

The determination of statistical inefficiencies for univariate signals is well established.^{55} Determining *I*_{ij}(*τ*) for transition count matrices is an open problem. A first approach that allows, for the first time, to estimate consistent (although somewhat too small) uncertainty intervals for practical MD data is discussed in Ref. 56. Note that the validity of the estimation algorithms described in the present paper is independent of the choice of the count matrix, such that future methods for estimating the effective count matrix can be adopted without changing the estimation algorithms.
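The two counting modes can be contrasted in a short sketch (helper name and toy data are ours): sliding-window counting harvests *N* − *τ* transitions, while subsampling at *τ* keeps only about *N*/*τ* of them.

```python
import numpy as np

def count_matrix_lag(S, n, tau, sliding=True):
    """Count matrix at lag tau, either in sliding-window mode
    (pairs (t, t + tau) for all t) or from the tau-subsampled trajectory."""
    S = np.asarray(S)
    C = np.zeros((n, n))
    if sliding:
        pairs = zip(S[:-tau], S[tau:])
    else:
        sub = S[::tau]
        pairs = zip(sub[:-1], sub[1:])
    for i, j in pairs:
        C[i, j] += 1.0
    return C

S = [0, 0, 1, 2, 2, 1, 0, 1, 2, 2]          # toy trajectory, N = 10
C_slide = count_matrix_lag(S, 3, tau=2, sliding=True)    # N - tau = 8 counts
C_sub = count_matrix_lag(S, 3, tau=2, sliding=False)     # 4 counts
assert C_slide.sum() == len(S) - 2
```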

### C. Maximum likelihood estimation

We will now assume that the effective counts are given. For better readability, we will subsequently omit the superscript eff and just use *C* = (*c*_{ij}) to indicate counts. We now ask which transition matrix is most likely given the observation *C*, i.e., we seek the MLE that maximizes (15) over the set of transition matrices.

#### 1. Non-reversible estimation

It is well known that the non-reversible MLE for the transition probability from state *i* to state *j* is simply given by the ratio of observed counts from *i* to *j* divided by the total number of outgoing transitions from state *i*,^{57}

$$\hat{p}_{ij}^{\mathrm{nonrev}} = \frac{c_{ij}}{\sum_k c_{ik}}. \tag{18}$$

We use the hat in order to denote an estimator. The term non-reversible implies that reversibility has not been used as a constraint in the estimation of $\hat{P}^{\mathrm{nonrev}}$. Of course, $\hat{P}^{\mathrm{nonrev}}$ can be coincidentally reversible, and it will be reversible if the count matrix *C* is symmetric. For this reason, some early contributions in the field forced symmetry in *C* by counting *S* forward and backward. This practice is strongly discouraged as it will create a large bias unless the trajectories used are very long compared to the slowest time scales of the molecule.

#### 2. Reversible estimation

Now we consider the problem of finding the reversible MLE $\hat{P}^{\mathrm{rev}}$ by enforcing detailed balance (4) with respect to an unknown equilibrium distribution (*π*_{i}) as a constraint in the estimation procedure. Note that the count matrix used for this approach is not modified, i.e., it comes from a forward-only or nonreversible counting and is generally not symmetric. Constraints (4) can be more conveniently handled by defining the new set of variables

$$x_{ij} = \pi_i\, p_{ij}. \tag{19}$$

Note that detailed balance then simply reads *x*_{ij} = *x*_{ji}, and that

$$x_i = \sum_j x_{ij} = \pi_i. \tag{20}$$

We can thus recover the transition matrix from *X* = (*x*_{ij}) by

$$p_{ij} = \frac{x_{ij}}{x_i}. \tag{21}$$

Inserting (21) into (16) and adding constraints for detailed balance and stochasticity leads to the reversible maximum likelihood problem,

$$\hat{X} = \arg\max_X \sum_{i,j} c_{ij} \left( \log x_{ij} - \log x_i \right) \quad \text{s.t.} \quad x_{ij} = x_{ji} \ge 0, \quad \sum_{i,j} x_{ij} = 1. \tag{22}$$

Ignoring the inequality constraints, the optimality conditions are

$$x_{ij} = \frac{c_{ij} + c_{ji}}{\dfrac{c_i}{x_i} + \dfrac{c_j}{x_j}}, \tag{23}$$

with $c_i = \sum_j c_{ij}$ and $x_i = \sum_j x_{ij}$. There is no closed-form solution when including the detailed balance constraint, so that (22) has to be solved numerically. One option is to directly solve (23) for *x*_{ij} and turn it into a fixed-point iteration, as first proposed in Ref. 37,

$$x_{ij}^{(k+1)} = \frac{c_{ij} + c_{ji}}{\dfrac{c_i}{x_i^{(k)}} + \dfrac{c_j}{x_j^{(k)}}}, \tag{24}$$

where *k* counts the iteration number in the algorithm. For a starting iterate $x_{ij}^{(0)}$ fulfilling the constraints in (22), for example, $x_{ij}^{(0)} = (c_{ij} + c_{ji}) / \sum_{i,j} (c_{ij} + c_{ji})$, the iterates will be symmetric and fulfill the inequality constraints for all *k* > 0.

If we sum over *j* on both sides of (24) and use (20), we can instead reduce the problem to an iterative estimation of the equilibrium distribution,

$$\pi_i^{(k+1)} = \sum_j \frac{c_{ij} + c_{ji}}{\dfrac{c_i}{\pi_i^{(k)}} + \dfrac{c_j}{\pi_j^{(k)}}}. \tag{25}$$

The iteration is terminated when ||*π*^{(k+1)} − *π*^{(k)}|| < *ϵ*. The final estimate $\hat{\pi}$ is then inserted into (23) to recover the reversible transition matrix estimate,

$$\hat{p}_{ij}^{\mathrm{rev}} = \frac{c_{ij} + c_{ji}}{c_i + \dfrac{\hat{\pi}_i}{\hat{\pi}_j}\, c_j}. \tag{26}$$

Note that the optimum sought by Eqs. (25) and (26) exhibits $\hat{p}_{ij} = 0$ if *c*_{ij} + *c*_{ji} = 0. Thus, in both optimization algorithms, the sparsity structure of the matrix **C** + **C**^{⊤} can be used in order to restrict all iterations to the elements that will result in a nonzero element $\hat{p}_{ij} > 0$.

Furthermore, note that (25) and (26) are special cases of the transition-based reweighing analysis (TRAM) method — see Ref. 16, Eqs. (29) and (30) — for the special case of a single thermodynamic state. An example for the progress of the self-consistent iteration using an alanine dipeptide simulation is shown in Fig. 3.
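A compact NumPy sketch of iteration (25) with back-substitution (26) follows (dense matrices only, our own implementation; PyEMMA ships production versions of these estimators). The example count matrix is hypothetical.

```python
import numpy as np

def mle_rev(C, maxiter=100000, tol=1e-12):
    """Reversible MLE: fixed-point iteration (25) for pi, then eq. (26) for P.
    Assumes the graph of C + C^T is connected and all row counts are > 0."""
    C = np.asarray(C, dtype=float)
    c = C.sum(axis=1)                  # row counts c_i
    S = C + C.T                        # c_ij + c_ji
    pi = c / c.sum()                   # initial guess for the stationary vector
    for _ in range(maxiter):
        d = c / pi                     # c_i / pi_i
        pi_new = (S / (d[:, None] + d[None, :])).sum(axis=1)   # eq. (25)
        pi_new /= pi_new.sum()
        done = np.linalg.norm(pi_new - pi) < tol
        pi = pi_new
        if done:
            break
    # Eq. (26): p_ij = (c_ij + c_ji) / (c_i + (pi_i / pi_j) c_j)
    P = S / (c[:, None] + (pi[:, None] / pi[None, :]) * c[None, :])
    return P, pi

C = np.array([[10.0, 2.0, 0.0],
              [3.0, 10.0, 4.0],
              [0.0, 3.0, 10.0]])      # hypothetical counts
P, pi = mle_rev(C)
```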

#### 3. Reversible estimation for given stationary vector

While unbiased MD simulations are useful to estimate state-to-state transition probabilities *p*_{ij}, enhanced sampling algorithms such as umbrella sampling and replica-exchange MD can be much more efficient for gaining insight into the equilibrium distribution *π*. Reference 45 demonstrates how an uncertain estimate of *π* can be combined with unbiased “downhill” trajectories in order to estimate rare event kinetics. A key ingredient in such a procedure is a way to estimate a reversible Markov model that is most likely given the transition counts observed from MD simulations, but at the same time has a fixed equilibrium distribution *π*. Here we derive a new, efficient estimation algorithm for this task.

Enforcing reversibility with respect to a given stationary vector results in the following constrained optimization problem:

$$\hat{P} = \arg\max_P \sum_{i,j} c_{ij} \log p_{ij} \tag{27}$$

$$\text{s.t.} \quad \sum_j p_{ij} = 1, \tag{28}$$

$$p_{ij} \ge 0, \tag{29}$$

$$\pi_i\, p_{ij} = \pi_j\, p_{ji}. \tag{30}$$

*π* can only be the unique stationary distribution of *P* if *P* is irreducible. To ensure irreducibility, we restrict the state space to the largest (weakly) connected set of the undirected graph that is defined by the adjacency matrix *C* + *C*^{⊤}. For a system with *n* states, Eqs. (27)–(30) constitute a convex optimization problem in $O(n^2)$ unknowns with $O(n^2)$ equality and inequality constraints. Solving this with a standard interior-point method requires the solution of a linear system with $O(n^2)$ unknowns to compute the search direction at each step. The resulting computational effort of $O(n^6)$ operations for solving the linear system quickly becomes unfeasible for increasing *n*. Therefore, we propose a fixed-point iteration that remains feasible for large values of *n*.

To solve the maximization problem, we ignore inequality constraint (29) at first. Row-stochasticity constraint (28) is enforced by introducing Lagrange multipliers *λ*_{i} and adding penalty terms $-\lambda_i \left( \sum_j p_{ij} - 1 \right)$ for all *i* = 1, …, *n* to the objective function. Detailed balance constraint (30) is included into the likelihood explicitly by the change of variables

$$p_{ij}' = \pi_i\, p_{ij}, \qquad p_{ij}' = p_{ji}'. \tag{31}$$

These substitutions result in the Lagrange function

$$F = \sum_{i,j} c_{ij} \left( \log p_{ij}' - \log \pi_i \right) - \sum_i \lambda_i \left( \sum_j \frac{p_{ij}'}{\pi_i} - 1 \right)$$

that we seek to maximize. By setting the gradient of *F* with respect to all $p_{ij}'$ to zero and subsequently reversing the change of variables, we find the following expression for the maximum likelihood estimate:

$$\hat{p}_{ij} = \frac{(c_{ij} + c_{ji})\, \pi_j}{\lambda_i\, \pi_j + \lambda_j\, \pi_i}. \tag{32}$$

Note the similarity of this equation with maximum likelihood result (26), where *π* has been self-consistently computed from the counts. The row counts *c*_{i} are here replaced by the yet unknown Lagrange multipliers *λ*_{i}. In order to find the Lagrange multipliers, we sum Eq. (32) over *j*,

$$\lambda_i = \sum_j \frac{(c_{ij} + c_{ji})\, \lambda_i\, \pi_j}{\lambda_i\, \pi_j + \lambda_j\, \pi_i}. \tag{33}$$

This does not give a closed-form expression for *λ*_{i}. However, based on this equation, we propose the following fixed-point iteration for the Lagrange multipliers:

$$\lambda_i^{(k+1)} = \sum_j \frac{(c_{ij} + c_{ji})\, \lambda_i^{(k)}\, \pi_j}{\lambda_i^{(k)}\, \pi_j + \lambda_j^{(k)}\, \pi_i}. \tag{34}$$

Motivated by the analogy between Lagrange multipliers and row counts described above, we set the starting point to

$$\lambda_i^{(0)} = c_i = \sum_j c_{ij}. \tag{35}$$

Taking the limit *λ*_{i} → 0^{+} in (34) still leads to a consistent solution. Choosing strictly positive starting parameters according to (35) results in valid iterates from (34). In analogy to the reversible case, we iterate (34) until $\|\lambda^{(k+1)} - \lambda^{(k)}\| < \epsilon$. An example of the progress of the self-consistent iteration using alanine dipeptide simulation data is shown in Fig. 4. Note that the convergence is nearly three orders of magnitude faster compared to the estimation with unknown equilibrium distribution (Fig. 3).

Given converged Lagrange multipliers, we can exploit (32) to find the maximum likelihood transition matrix $\hat{P}$. For this algorithm, inequality constraints (29) are automatically fulfilled when *c*_{ij} ≥ 0 for all *i*, *j*. Care must be taken in two situations: (i) *λ*_{i} = *λ*_{j} = 0 — one can show that the simultaneous limit *λ*_{i} → 0^{+} and *λ*_{j} → 0^{+} can only occur for *c*_{ij} + *c*_{ji} = 0, but then we know that $\hat{p}_{ij} = 0$. (ii) A diagonal element *c*_{ii} is zero. Depending on the values of *π*, the solution *λ*_{i} may take the value of zero, such that Equation (32) for *i* = *j* becomes $\hat{p}_{ii} = c_{ii}/\lambda_i = 0/0$, which is meaningless and is not the correct limit of *p*_{ii} as *c*_{ii} goes to zero. However, this can be fixed easily by using $\hat{p}_{ii} = 1 - \sum_{j \neq i} \hat{p}_{ij}$ for the diagonal elements of *P*. In summary, we use the following equation for computing $\hat{P}$ from converged Lagrange multipliers:

$$\hat{p}_{ij} = \begin{cases} 0 & i \neq j,\ c_{ij} + c_{ji} = 0, \\[2pt] \dfrac{(c_{ij} + c_{ji})\, \pi_j}{\lambda_i\, \pi_j + \lambda_j\, \pi_i} & i \neq j,\ c_{ij} + c_{ji} > 0, \\[2pt] 1 - \sum_{k \neq i} \hat{p}_{ik} & i = j. \end{cases} \tag{36}$$
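The full procedure — iteration (34) from start (35), then assembly (36) — can be sketched in NumPy as follows (dense matrices, our own simplified implementation with a hypothetical count matrix and stationary vector; PyEMMA provides production versions):

```python
import numpy as np

def mle_rev_fixed_pi(C, pi, maxiter=100000, tol=1e-12):
    """Reversible MLE with fixed stationary vector pi: iterate the Lagrange
    multipliers via eq. (34) from start (35), then assemble P via eq. (36)."""
    C = np.asarray(C, dtype=float)
    pi = np.asarray(pi, dtype=float)
    S = C + C.T                                  # c_ij + c_ji
    lam = C.sum(axis=1)                          # eq. (35): lambda_i^(0) = c_i
    for _ in range(maxiter):
        D = lam[:, None] * pi[None, :] + lam[None, :] * pi[:, None]
        with np.errstate(invalid="ignore", divide="ignore"):
            T = np.where(D > 0.0, S * lam[:, None] * pi[None, :] / D, 0.0)
        lam_new = T.sum(axis=1)                  # eq. (34)
        done = np.linalg.norm(lam_new - lam) < tol
        lam = lam_new
        if done:
            break
    # Eq. (36): off-diagonal elements, then diagonal from row normalization.
    D = lam[:, None] * pi[None, :] + lam[None, :] * pi[:, None]
    with np.errstate(invalid="ignore", divide="ignore"):
        P = np.where(D > 0.0, S * pi[None, :] / D, 0.0)
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))
    return P

C = np.array([[10.0, 2.0, 0.0],
              [3.0, 10.0, 4.0],
              [0.0, 3.0, 10.0]])                # hypothetical counts
pi = np.array([0.3, 0.4, 0.3])                  # hypothetical fixed pi
P = mle_rev_fixed_pi(C, pi)
```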

**Uniqueness of the estimator**: Since the above estimation algorithm is iterative, it is fair to ask whether the estimator $\hat{P}$ it converges to is unique, or whether there might be multiple local maxima in which we could get stuck. In this case, it is easy to show that the estimator is unique: Let *P*^{∗} be an optimal transition matrix. $p_{ij}^{*} = 0$ exactly if *c*_{ij} + *c*_{ji} = 0. Let Ω = {*p*_{ij} | *c*_{ij} + *c*_{ji} > 0}. Then, the negative log-likelihood $f(P) = -\sum_{i,j} c_{ij} \log p_{ij}$ is strictly convex on Ω, and the constraints restrict the solution to a convex subset $\tilde{\Omega} \subset \Omega$. The minimization of a strictly convex function over a convex set has a unique solution.

## IV. BAYESIAN ESTIMATION

We introduce new algorithms for sampling the full posterior probability distribution of Markov models, and, in particular, for estimating uncertainties of quantities of interest, such as relaxation time scales or mean first passage times. A key ingredient in these algorithms is the choice of a suitable prior that constrains the sampled matrices to have the same sparsity pattern as the transition count matrix, as this allows the credible intervals to lie around the true value even for large transition matrices. The relevance of the prior is first demonstrated for nonreversible Markov models, for which an efficient sampling algorithm is known. We then introduce new Gibbs sampling algorithms for reversible Markov models with and without constraints on the equilibrium distribution that vastly outperform previous algorithms for sampling reversible Markov models.

### A. Bayes’ theorem and Monte Carlo sampling

Bayes’ formula relates the likelihood of an observed effective count matrix *C* given a probability model *P* to the posterior probability of the model given the observation,

$$\mathbb{P}(P \mid C) \propto \mathbb{P}(C \mid P)\, \mathbb{P}(P). \tag{37}$$

The posterior accounts for the uncertainty coming from a finite observation. It incorporates *a priori* knowledge about the quantity of interest using the prior probability ℙ(*P*). We will see that a suitable choice of the prior is essential for the success of a Bayesian description for high-dimensional systems.

In general, if we are interested in an observable that is a function of a transition matrix, *f*(*P*), we would like to compute its posterior moments, such as the mean and the variance,

$$\mathbb{E}[f] = \int \mathrm{d}P\; f(P)\, \mathbb{P}(P \mid C), \tag{38}$$

$$\operatorname{Var}(f) = \int \mathrm{d}P\; \left( f(P) - \mathbb{E}[f] \right)^2\, \mathbb{P}(P \mid C). \tag{39}$$

While we usually use the maximum likelihood transition matrix $\hat{P}$ to provide “best” estimates, $f(\hat{P})$, the above integrals are of interest because $\sigma(f) = \sqrt{\operatorname{Var}(f)}$ gives us an estimate of the statistical uncertainty of *f*. Alternatively, we might be interested in credible intervals which encompass the true value of *f* with some probability, such as 0.683 (1*σ* intervals) or 0.95 (2*σ* intervals). As integrals (38) and (39) are high-dimensional, we need to use Monte Carlo methods to approximate them.

In Monte Carlo methods, we generate a sample of transition matrices $\{P^{(k)}\}_{k=1}^{N}$ distributed according to the posterior and evaluate *f* at each element *P*^{(k)} of the ensemble. We then approximate posterior expectation value (38) and posterior variance (39) by
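To make these sample statistics concrete, the following is a minimal sketch of estimators (40) and (41) plus a percentile-based credible interval. The toy "posterior sample" of random stochastic matrices and the observable (an implied time scale at lag *τ* = 1) are illustrative stand-ins, not the sampler developed in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_summary(samples, f, level=0.90):
    """Monte Carlo estimates of posterior mean, standard deviation, and a
    percentile-based credible interval for an observable f evaluated on
    each sampled transition matrix."""
    values = np.array([f(P) for P in samples])
    mean = values.mean()
    std = values.std(ddof=1)
    lo, hi = np.percentile(values, [50 * (1 - level), 50 * (1 + level)])
    return mean, std, (lo, hi)

# Toy "posterior sample": random 2x2 row-stochastic matrices.
samples = []
for _ in range(1000):
    P = rng.random((2, 2)) + 1e-3
    P /= P.sum(axis=1, keepdims=True)   # row-normalize
    samples.append(P)

# Observable: slowest implied time scale at lag tau = 1,
# t_2 = -tau / log |lambda_2| (second largest eigenvalue magnitude).
def t2(P):
    lam = np.sort(np.abs(np.linalg.eigvals(P)))[-2]
    return -1.0 / np.log(lam)

mean, std, ci = posterior_summary(samples, t2)
```

The credible interval here is the equal-tailed percentile interval; the paper reports 90% intervals in the same spirit.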

Obtaining good and reliable samples of the posterior ℙ(*P*|*C*) is very difficult. Previous approaches have suffered from some or all of the following difficulties, which are addressed here:

- **Choice of the prior**: Given *n* Markov states (typically hundreds to thousands), transition matrices have on the order of *n*^{2} elements and are thus extremely high dimensional. Most priors used in the past populate all of these elements *p _{ij}*, including those for which no transition has been observed. Although the effect of the prior can be overcome with enough simulation data, for any practical amount of simulation data, such priors lead to posterior distributions whose probability mass is far away from the true model. This problem was first addressed in Ref. 20 by designing a prior that equates mean and MLE for nonreversible transition matrices, leading to credible intervals that nicely envelop the true value. Here, we design corresponding priors for reversible Markov models.
- **Uncorrelated transition counts *C***: As discussed in Sec. III B, the likelihood, and thus the posterior, depends on how transition counts are harvested from the discrete trajectories, which are generally time-correlated and not exactly Markovian at any particular lag time *τ*. While the MLE is often little affected by the exact way of counting *C*, the uncertainties will be dramatically different if *C* is, e.g., counted in a sliding-window mode (using transitions starting at all times *t* = 0, 1, 2, …) or by subsampling the trajectory (using transitions starting at times *t* = 0, *τ*, 2*τ*, …). Whereas the first approach underestimates the uncertainties, the second approach often overestimates them and is often not practical for large lag times *τ*. Here, we suggest using the effective number of uncorrelated transition counts, *C* = *C*^{eff}; a first approach to compute them is described in Ref. 56.
- **Efficiency of the sampler**: Finally, given a choice of prior and *C*, a sampling algorithm needs to explore the high-dimensional space of transition matrices in a reasonable time. This is especially problematic for reversible Markov models.
The first Monte Carlo algorithm for sampling the reversible posterior, described in Ref. 39, suffers from poor mixing due to small acceptance probabilities of the individual steps. In Ref. 42, an improved sampler was proposed. Here, we propose sampling algorithms for reversible Markov models with and without fixed equilibrium distribution whose efficiencies go far beyond previous approaches.

### B. Non-reversible sampling

Let us first illustrate the effect of prior choice on Bayesian estimation of nonreversible Markov models. A convenient functional form for the prior is the Dirichlet prior

where *B* = (*b _{ij}*) is the matrix of prior counts. For this choice, the posterior is given by

where *z _{ij}* = *c _{ij}* + *b _{ij}* is the matrix of posterior pseudo-counts. In the non-reversible case, we can generate independent samples from (43) by drawing the rows of *P*^{(k)} independently from Dirichlet distributions $\prod_j p_{ij}^{\alpha_{ij}-1}$ with Dirichlet parameters *α _{ij}* = *z _{ij}* + 1 = *c _{ij}* + *b _{ij}* + 1.^{9}

Choosing a uniform prior, *b _{ij}* = 0, assigns equal prior probability to all entries *p _{ij}* in the posterior ensemble. But this *a priori* assumption can lead to serious problems when estimating quantities for meta-stable systems.
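As an illustration of the row-wise Dirichlet sampling, here is a minimal sketch with the sparse prior *b _{ij}* = −1, so that *α _{ij}* = *c _{ij}* and entries with *c _{ij}* = 0 remain pinned to zero. The small count matrix is a made-up example, and the masking works around the requirement that NumPy's Dirichlet parameters be strictly positive.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_nonreversible(C, b=-1.0, n_samples=3):
    """Draw independent transition matrices from the Dirichlet posterior,
    row by row, with parameters alpha_ij = c_ij + b_ij + 1.
    With the sparse prior b_ij = -1, entries with c_ij = 0 stay zero."""
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    samples = []
    for _ in range(n_samples):
        P = np.zeros((n, n))
        for i in range(n):
            mask = C[i] + b + 1.0 > 0      # entries with positive Dirichlet weight
            alpha = C[i, mask] + b + 1.0
            P[i, mask] = rng.dirichlet(alpha)
        samples.append(P)
    return samples

# Hypothetical count matrix of a metastable 3-state system.
C = np.array([[90, 10, 0],
              [10, 85, 5],
              [0, 5, 95]])
samples = sample_nonreversible(C)
P = samples[0]
# Each sample is row-stochastic and shares the sparsity pattern of C.
```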

Consider, for example, the following transition matrix for a birth-death chain consisting of two meta-stable sets, *A* = {1, …, *m*} and *B* = {*m* + 2, …, *n*}, separated by a kinetic bottleneck in the form of a single transition state,

For barrier parameter *b* = 3 and sets with *m* = 50 and *n* = 101, the expected time for hitting *B* from state *x* = 1 is 2 ⋅ 10^{5} steps. We are now interested in the Bayesian estimator for a simulation of length *L* = 10^{7}. The true distribution can be estimated with arbitrary precision by repeating the simulation many times. Here, 10^{3} repetitions led to an estimated 90% interval for the mean first passage time of [1.5, 2.7] ⋅ 10^{5} (see Table III).
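For readers who want to reproduce such numbers, the expected hitting time of a set *B* can be computed from a transition matrix by standard first-step analysis: it vanishes on *B* and solves a linear system on the complement. The 5-state chain below is a hypothetical miniature of the birth-death construction, not the paper's 101-state model.

```python
import numpy as np

def mean_first_passage_time(P, target):
    """Expected hitting time of the target set from each state via
    first-step analysis: h = 0 on the target set, and
    (I - P_restricted) h = 1 on the remaining states."""
    n = P.shape[0]
    rest = np.setdiff1d(np.arange(n), np.asarray(target))
    A = np.eye(len(rest)) - P[np.ix_(rest, rest)]
    h = np.zeros(n)
    h[rest] = np.linalg.solve(A, np.ones(len(rest)))
    return h

# Hypothetical 5-state birth-death chain with a weak link (state 2)
# separating A = {0, 1} from B = {3, 4}.
p = 0.01  # barrier crossing probability
P = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.5 - p, p, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, p, 0.5 - p, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5],
])
h = mean_first_passage_time(P, target=[3, 4])
# h[0] is the expected number of steps to reach B from state 0.
```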

**Table III.** Mean first passage time estimates (point estimate with 90% interval).

Method | Estimate
---|---
True | 2.0 [1.5, 2.7] ⋅ 10^{5}
Uniform prior *b _{ij}* = 0 | 1.95 [1.9, 2.0] ⋅ 10^{3}
Sparse prior *b _{ij}* = −1 | 2.0 [1.5, 2.7] ⋅ 10^{5}

In practice, we cannot afford to repeat the simulation many times but would like to approximate the true value and its statistical uncertainty from the given simulation data. Sampling the nonreversible posterior given expected counts for a single chain of length *L* = 10^{7} with a uniform prior, *b _{ij}* = 0, results in non-zero transition probabilities for elements *p _{ij}* which are zero in the true transition matrix. As a result, artificial kinetic pathways circumventing the bottleneck appear in the posterior ensemble, which leads to a dramatic underestimate of the mean first passage time. The Bayesian estimate with 90% credible interval obtained from 10^{3} posterior samples is [1.9, 2.0] ⋅ 10^{3}, and thus two orders of magnitude smaller than the true value 2 ⋅ 10^{5} (Table III).

Using the prior *b _{ij}* = −1 suggested in Ref. 20 results in 90% credible intervals, [1.5, 2.7] ⋅ 10^{5}, which clearly cover the true value 2 ⋅ 10^{5} (Table III). The choice *b _{ij}* = −1 leads to a posterior distribution in which sampled transition matrices *P* have the same sparsity structure as the count matrix *C*, i.e., *p _{ij}* = 0 if *c _{ij}* = 0. As count matrices in the present context are generally sparse, we briefly call this prior the sparse prior. Apparently, the sparse prior leads to consistent credible intervals covering the true value.

Fig. 5 shows the convergence of the 90% credible interval for the sparse and the uniform prior. The credible interval for the sparse prior envelops the true value already for little data. Achieving consistency with the uniform prior requires simulations orders of magnitude longer than the time scale of the slowest process, rendering inference under this prior impractical.

Note that our prior induces a fixed sparsity structure. This concept should not be confused with other sparsity-inducing priors used, e.g., in the context of Bayesian compressed sensing,^{58} where the sparsity pattern is itself subject to uncertainty.

### C. A prior for reversible Markov models

We now present a new method for sampling reversible transition matrices. In this approach, we replace Dirichlet prior (42) with a new prior suited for reversible sampling.

As in reversible maximum likelihood estimation, we define our reversible transition matrix sampler in the space of unconditional transition probabilities *x _{ij}*. For convenience, we restrict ourselves to the independent set of variables with *i* ≥ *j* (remember that *x _{ij}* = *x _{ji}* for reversible matrices) and keep them normalized to 1,

Although *X* is defined slightly differently than in the maximum-likelihood case, the mapping from *X* back to *P* is still given by Eq. (21). We define a prior for reversible sampling on the set of *X* matrices rather than on *P*: choosing *x _{ij}* as the set of independent variables has the advantage that obeying detailed balance amounts to sampling symmetric matrices, *X* = *X*^{T}. The posterior for reversible sampling is then given by

Below we first consider how to sample from (48) using general prior counts *b _{ij}*. Then, we consider the specific choice *b _{ij}* = −1 for all *i* ≥ *j* and show that this choice has similar properties as the sparse prior in the nonreversible case.

### D. Sampling reversible transition matrices

There is no known method to generate independent samples from the posterior under the reversibility requirement. Instead, we use a Markov chain Monte Carlo (MCMC) method to generate samples from the posterior, ensuring that each sampled transition matrix fulfills detailed balance condition (4). Our Markov chain generates the ensemble $\{X^{(k)}\}_{k=1}^{N}$ via a set of updates advancing the chain from *X*^{(k)} → *X*^{(k+1)}, starting from a valid initial state *X*^{(0)}. A simple row-normalization of the *X* matrices then yields the desired ensemble $\{P^{(k)}\}_{k=1}^{N}$. Expectation values and variances are again estimated using (40) and (41).
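The row-normalization mapping from a symmetric *X* to a reversible *P* can be sketched as follows; the random symmetric *X* is a stand-in for an actual posterior sample.

```python
import numpy as np

rng = np.random.default_rng(2)

def transition_matrix_from_X(X):
    """Map a symmetric, normalized matrix X of unconditional transition
    probabilities to a reversible transition matrix by row normalization,
    p_ij = x_ij / sum_j x_ij."""
    return X / X.sum(axis=1, keepdims=True)

# Random symmetric X with unit total mass (stand-in for a sample).
A = rng.random((4, 4))
X = A + A.T
X /= X.sum()

P = transition_matrix_from_X(X)
pi = X.sum(axis=1)          # stationary distribution, pi_i = x_i

# Detailed balance follows from symmetry: pi_i p_ij = x_ij = x_ji = pi_j p_ji.
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)
```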

As in Refs. 39 and 42, we construct our Markov chain using a Gibbs sampling procedure, in which we sample a single element of *X* in each step while leaving the other elements unchanged. Repeating this sampling procedure for every element of *X* completes a Gibbs sweep. As detailed in the Appendix, we can use the following general Gibbs step to sample posterior (48):

1. Select an arbitrary element *x _{kl}*. Propose a new (unscaled) matrix *X* → *X*′ by sampling this element from the proposal density $q(x'_{kl}|X)$,
$$x'_{ij} = \begin{cases} \sim q(x'_{kl}|X), & (i,j) = (k,l), \\ x_{ij}, & \text{else.} \end{cases}$$
Here, $q(x'_{kl}|X)$ is an arbitrary, scale-invariant density. Scale-invariance means that $q(x'_{kl}|X) \propto q(c\,x'_{kl}|cX)$ for any positive constant *c*.
2. Accept $\bar{X}'$ as a new step in our Markov chain with probability min{1, *p*_{acc}}, where
$$p_{\mathrm{acc}} = \left(1 - x_{kl} + x'_{kl}\right)^{-\frac{n(n+1)}{2} - b_0}\, \frac{q(x_{kl}|X')}{q(x'_{kl}|X)}\, \frac{\gamma(x'_{kl}|X)}{\gamma(x_{kl}|X')},$$
where $b_0 = \sum_{k \geq l} b_{kl}$ and *γ* is the marginal density,
$$\gamma(x'_{kl}|X) \propto \begin{cases} (x'_{kk})^{c_{kk}+b_{kk}}\, (x_k - x_{kk} + x'_{kk})^{-c_k}, & k = l, \\ (x'_{kl})^{c_{kl}+c_{lk}+b_{kl}}\, (x_k - x_{kl} + x'_{kl})^{-c_k}\, (x_l - x_{kl} + x'_{kl})^{-c_l}, & k \neq l. \end{cases}$$
3. Renormalize the matrix $X' \rightarrow \bar{X}'$ such that it fulfills (46),
$$\bar{x}'_{ij} = \frac{x'_{ij}}{1 - x_{kl} + x'_{kl}}.$$

While this approach will work for any choice of prior counts, we will now use the sparse prior *b _{ij}* = −1 for all *i*, *j*, with the hope of achieving similarly good results as in the nonreversible case. For this choice, $\gamma(x'_{kl}|X)$ is scale-invariant, i.e., $\gamma(x'_{kl}|X) = \gamma(c\,x'_{kl}|cX)$, and the Jacobian pre-factor in (50) is one. Thus, we have

and the ideal choice of the proposal density is *q* ≡ *γ*, which would guarantee that the acceptance probability is always 1. This proposal density degenerates to a point probability at zero if *c _{kl}* + *c _{lk}* = 0, which means that *b _{ij}* = −1 encodes the *a priori* belief that any transition for which neither the forward nor the backward direction has ever been observed in the data has zero probability in the posterior ensemble. Thus, this prior enforces *P* to have the same sparsity structure as the count matrix, like the choice *b _{ij}* = −1 for nonreversible sampling. Note that the reversible MLE has the same sparsity structure, as can be seen from update rule (24).

We will choose proposal densities $q(x'_{kl}|X)$ that are also scale-invariant. In this case, normalization step 3 above ($X' \rightarrow \bar{X}'$) can be omitted, i.e., if we accept *X*′, we can directly set it as our new sample *X*^{(k)} and obtain *P*^{(k)} by row normalization. We now outline how to design the proposal density such that the acceptance probability is 1 or nearly 1. For *k* = *l*, sampling $x'_{kk} \sim \gamma(x'_{kk}|X)$ is equivalent to sampling the following transformed variable (see the Appendix):

So we can simply define $q(x'_{kk}|X) \equiv \gamma(x'_{kk}|X)$ and generate $x'_{kk}$ by

For *k* ≠ *l*, there is no known way to draw independent samples, but $\gamma(x'_{kl}|X)$ can be well approximated by a gamma distribution by matching the maximum point and the second derivative at the maximum. A gamma distribution can be efficiently sampled, so we use it as proposal density and accept the resulting $x'_{kl}$ with probability min{1, *p*_{acc}}. Specifically, our proposal step is

with the parameters

using

which matches the value and the first two derivatives of the true marginal density at the maximum (see the Appendix for the derivation) and leads to acceptance probabilities close to one for most values of *x _{kl}*. However, if the current value of *x _{kl}* lies in one of the heavy tails of the distribution $\gamma(x'_{kl}|X)$, the acceptance probability can be much smaller than 1 and the Markov chain can get stuck. To avoid this problem, we utilize a second step to generate $x'_{kl}$: after sampling $x'_{kl}$ from proposal density (55), we sample $x'_{kl}$ according to

where $N(x|m,s)$ denotes the normal distribution of *x* with mean *m* and standard deviation *s*. The proposal density defined by the above update is

and the corresponding acceptance probability is

In summary, the proposed Algorithm 1 is a Metropolis within Gibbs MCMC algorithm with adapted proposal probabilities for each Gibbs sampling step. For efficiency reasons, transition matrix elements (*i*, *j*) for which no forward or backward transition counts have been observed can be neglected in the sampling algorithm, in order to account for the effect of the sparse prior.

**Algorithm 1.** One Gibbs sweep for sampling reversible transition matrices.

    Input: C, X^(j)
    Output: X^(j+1)
    for all indexes (k, l) with k ≥ l and c_kl + c_lk > 0 do
        if k = l then
            Sample x_kk^(j+1) from (54)
        else
            Calculate α and β by (56) and (57), and sample x'_kl from Gamma(α, β)
            p_acc = [γ(x'_kl | X) / γ(x_kl | X)] · [Gamma(x_kl | α, β) / Gamma(x'_kl | α, β)]
            Accept x'_kl as x_kl^(j+1) with probability min{1, p_acc}
            Sample x'_kl by log x'_kl ~ N(log x_kl, 1)
            p_acc = [γ(x'_kl | X) / γ(x_kl | X)] · [x'_kl / x_kl]
            Accept x'_kl as x_kl^(j+1) with probability min{1, p_acc}
        end
    end
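As an illustration, the fallback move for an off-diagonal element (the log-normal proposal with acceptance ratio *γ*(*x*′)/*γ*(*x*) ⋅ *x*′/*x*) can be sketched in a few lines; the marginal *γ* follows Eq. (50) for *k* ≠ *l* with the sparse prior. The gamma-proposal step and the diagonal update are omitted here since their closed forms (Eqs. (54)–(57)) are given in the Appendix, and the small count matrix is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_gamma_kl(x_new, X, C, k, l, b_kl=-1.0):
    """Log of the marginal density gamma(x'_kl | X) of Eq. (50) for k != l,
    up to an additive constant; sparse prior b_kl = -1 by default."""
    xk = X[k].sum() - X[k, l] + x_new   # row sum of row k with x_kl replaced
    xl = X[l].sum() - X[k, l] + x_new   # row sum of row l with x_lk replaced
    ck, cl = C[k].sum(), C[l].sum()
    return ((C[k, l] + C[l, k] + b_kl) * np.log(x_new)
            - ck * np.log(xk) - cl * np.log(xl))

def gibbs_step_lognormal(X, C, k, l):
    """Fallback Metropolis move of Algorithm 1 for an off-diagonal element:
    propose log x'_kl ~ N(log x_kl, 1) and accept with probability
    min{1, gamma(x') / gamma(x) * x' / x}."""
    x_old = X[k, l]
    x_new = np.exp(rng.normal(np.log(x_old), 1.0))
    log_acc = (log_gamma_kl(x_new, X, C, k, l)
               - log_gamma_kl(x_old, X, C, k, l)
               + np.log(x_new) - np.log(x_old))
    if np.log(rng.random()) < log_acc:
        X[k, l] = X[l, k] = x_new       # keep X symmetric
    return X

# Hypothetical counts; symmetrized counts give a valid initial X.
C = np.array([[50, 5, 0], [5, 60, 4], [0, 4, 70]])
A = (C + C.T).astype(float)
X = A / A.sum()
for _ in range(100):
    X = gibbs_step_lognormal(X, C, 0, 1)
```

Because the proposal is scale-invariant, *X* need not be renormalized after each accepted move; *P* is obtained by row normalization at the end.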

### E. A prior for reversible Markov models with fixed equilibrium distribution

As before, we will be working with variables *x _{ij}* = *π _{i}* *p _{ij}*, related to transition matrix entries *p _{ij}* via (45). In contrast to the previous algorithm, *π* is not a function of *P* but fixed. Thus, single normalization condition (46) is replaced by a condition for each row,

in order to ensure reversibility with respect to the given *π*.

All *x _{kl}* in the lower triangle (*k* > *l*) are used as independent variables. Given a valid *X* matrix, an update that respects the constraints is given by

Here, (67b) and (67d) ensure that normalization condition (66) holds for the new sample and will thus keep *π* constant, while (67c) restores the symmetry and thus ensures reversibility of *P*. We will again use prior (47) defined on the set of *X* matrices and sample from posterior (48). The ideal proposal density of $x'_{kl}$ is

which is the conditional density of $x'_{kl}$ given all off-diagonal elements of *X* (except *x _{kl}*), *π*, and the counts *C*.

We have seen that a correct choice of prior parameters is essential for successfully applying posterior sampling to meta-stable systems. As in the reversible case, we will use *b _{kl}* = −1 for *k* > *l* to enforce *x _{kl}* = 0 whenever *c _{kl}* + *c _{lk}* = 0.

However, the choice of prior counts for the diagonal elements *b _{kk}* is less straightforward. In our experience, a good choice is to determine the value of *b _{kk}* based on the maximum likelihood estimate $\hat{p}_{kk}$ of the *k*th diagonal element as

which ensures that the posterior expectation of *p _{kk}* is zero if and only if $\hat{p}_{kk} = 0$, and that the conditional expectation of (67e),

matches the MLE of the one-dimensional likelihood function for *x _{kl}* given *X* if $\hat{p}_{kk} > 0$ and *c _{kk}* = 0. (Note that for the MLE of *P*, there is at most one *k* which satisfies $\hat{p}_{kk} > 0$ and *c _{kk}* = 0; see the proof in the Appendix.)

However, in the case that $\hat{p}_{kk} = 0$, conditional (67e) would degenerate so that $x'_{kk} = 0$ with probability one, and the *k*th row and column of *X* would remain fixed during sampling. This effect can break ergodicity of the sampled Markov chain and therefore prevent convergence of the algorithm. The problem is avoided by regularizing the prior: we choose the prior parameter as *b _{kk}* = −1 + *ϵ* for $\hat{p}_{kk} = 0$, where *ϵ* > 0 is a small number, such that (67e) does not degenerate. In addition, we need to ensure that the Markov chain is started from an initial state *X*^{(0)} with $x^{(0)}_{kk} > 0$. In summary, we select the prior of *X* for reversible sampling with fixed *π* as

This choice of prior will again ensure that *c _{kl}* + *c _{lk}* = 0 results in *p _{kl}* = 0 and *p _{lk}* = 0 for all *k* < *l* and for all posterior samples, a property shared by the reversible MLE with fixed stationary vector. This ensures that the posterior mass is located around the maximum likelihood estimate $\hat{P}$ and again prevents the occurrence of artificial kinetic pathways in the posterior ensemble.

### F. Sampling reversible Markov models with fixed equilibrium distribution

We now investigate how to efficiently sample conditional (67e). Here, we assume without loss of generality that *x _{kk}* < *x _{ll}* and transform $x'_{kl} \in (0,\, x_{kk} + x_{kl})$ into a new variable *v*′ ∈ (0, +∞) via

The ideal proposal density of *v*′ is then

with

As in the previous algorithm, we can approximate the conditional of *v* by a gamma distribution,

with

and

See the Appendix for the derivation. Then, *v*′ can be sampled by a Metropolis step with proposal density Gamma(*v*′|*α*, *β*) and acceptance ratio

where *v* = *x _{kl}*/*x _{kk}* denotes the original value of *v*. In addition, in order to keep the sampler from getting stuck at an extremely small or large value of *v*, we utilize the same strategy as in Section IV D and generate *v*′ by

The proposed Algorithm 2 for sampling reversible transition matrices with fixed stationary vector can again be characterized as a Metropolis within Gibbs MCMC algorithm with adapted proposal probabilities.

**Algorithm 2.** One Gibbs sweep for sampling reversible transition matrices with fixed stationary vector.

    Input: C, π, X^(j)
    Output: X^(j+1)
    for all indexes (k, l) with k > l and c_kl + c_lk > 0 do
        if x_kk < x_ll then
            v = x_kl / x_kk
            s = (x_ll + x_kl) / (x_kk + x_kl)
            a1 = c_kl + c_lk + b_kl
            a2 = c_kk + b_kk
            a3 = c_ll + b_ll
        else
            v = x_kl / x_ll
            s = (x_kk + x_kl) / (x_ll + x_kl)
            a1 = c_kl + c_lk + b_kl
            a2 = c_ll + b_ll
            a3 = c_kk + b_kk
        end
        Calculate α and β by (74) and (75)
        Sample v' from Gamma(α, β)
        Let x'_kl = min{x_kk + x_kl, x_ll + x_kl} · v' / (1 + v')
        p_acc = [γ_v(v' | X) Gamma(v | α, β)] / [γ_v(v | X) Gamma(v' | α, β)], using (67a)–(67d)
        Accept x'_kl as x_kl^(j+1) with probability min{1, p_acc}
        Sample v' by log v' ~ N(log v, 1)
        p_acc = [γ_v(v' | X) v'] / [γ_v(v | X) v]
        Let x'_kl = min{x_kk + x_kl, x_ll + x_kl} · v' / (1 + v')
        Accept x'_kl as x_kl^(j+1) with probability min{1, p_acc}
    end
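The variable transformation underlying Algorithm 2 can be checked numerically: the inverse map *x*′_{kl} = *m* ⋅ *v*′/(1 + *v*′), with *m* = min{*x _{kk}* + *x _{kl}*, *x _{ll}* + *x _{kl}*}, is a monotone bijection from (0, ∞) onto (0, *m*). The forward map below is our inference from this inverse (the displayed definition in the text is not reproduced here), so treat it as an assumption.

```python
import numpy as np

def v_to_x(v, m):
    """Inverse transform used in Algorithm 2: maps v in (0, inf)
    to x'_kl in (0, m), where m = min{x_kk + x_kl, x_ll + x_kl}."""
    return m * v / (1.0 + v)

def x_to_v(x, m):
    """Forward transform (our inference, consistent with v_to_x):
    maps x'_kl in (0, m) back to v in (0, inf)."""
    return x / (m - x)

m = 0.3                      # hypothetical bound min{x_kk+x_kl, x_ll+x_kl}
v = np.logspace(-3, 3, 7)
x = v_to_x(v, m)
# The map is monotone, stays inside (0, m), and round-trips exactly.
```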

## V. RESULTS

### A. Validation

We first demonstrate the validity of the reversible sampling algorithm for the following 2 × 2 count-matrix:

In Fig. 6, we compare the sampled histograms using Algorithm 1 with analytical values for the non-reversible posterior with Dirichlet-prior-counts *b _{ij}* = − 1. Any 2 × 2 transition matrix automatically fulfills detailed balance, and therefore the analytical and sampled densities are expected to be equal. The histogram for samples of the reversible algorithm is indeed in agreement with the analytical posterior.
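The statement that any 2 × 2 transition matrix fulfills detailed balance is easy to verify numerically: with *π* ∝ (*p*_{21}, *p*_{12}), detailed balance holds identically. A quick check with random, hypothetical entries:

```python
import numpy as np

rng = np.random.default_rng(4)

# Any 2x2 transition matrix fulfills detailed balance: with
# pi = (p21, p12) / (p12 + p21), we have pi_1 p12 = pi_2 p21.
for _ in range(5):
    p12, p21 = rng.random(2)
    P = np.array([[1 - p12, p12], [p21, 1 - p21]])
    pi = np.array([p21, p12]) / (p12 + p21)
    assert np.allclose(pi @ P, pi)                        # stationarity
    assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])   # detailed balance
```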

In Fig. 7, the sampled histogram for count matrix (81) with fixed stationary distribution $\pi = (0.25,\, 0.75)^{\top}$ using Algorithm 2 is compared with the exact posterior distribution. Detailed balance relation (4) with fixed stationary vector enforces a linear dependency between the transition matrix elements *p*_{12} and *p*_{21}. The resulting posterior is, therefore, restricted to the line *π*_{1}*p*_{12} = *π*_{2}*p*_{21}, such that the projection on *p*_{12} in Fig. 7 already contains the full information about the one-dimensional posterior. A comparison between histogram frequency and analytical density demonstrates the validity of the algorithm.

### B. Applications

To demonstrate the usefulness of the proposed algorithms, we apply them to molecular dynamics simulation data. Here, two systems are chosen to illustrate our methods: (1) the alanine-dipeptide molecule and (2) the bovine pancreatic trypsin inhibitor (BPTI) molecule.

We start by discussing the alanine dipeptide results. The system was simulated on GPU hardware using the OpenMM simulation package^{59} with the *amber99sb-ildn* forcefield^{60} and the *tip3p* water model.^{61} The cubic box of length 2.7 nm contained a total of 652 solvent molecules. We used Langevin dynamics at *T* = 300 K with a time step of 2 fs. A total of 10 *μ*s of simulation data was generated. The *ϕ* and *ψ* dihedral angles were discretized using a 20 × 20 regular grid to obtain a matrix of transition counts *C* = (*c _{ij}*), here by sampling one count per lag time *τ*. Below we show histograms for two important observables, the *largest implied time scales* *t _{i}* and the *expected hitting times* *τ*(*A* → *B*) for pairs *A*, *B* of meta-stable sets. We compute the posterior sample mean and 90% credible intervals for 1 *μ*s of simulation data and show that the credible intervals nicely envelop a reference value obtained from the MLE transition matrix for the total simulation data, supporting the proposed prior as a "good" choice for reversible sampling in meta-stable systems.

#### 1. Alanine dipeptide, reversible sampling

In Fig. 8, we show histograms of implied time scales computed from a reversible posterior sample. The mean values estimated from the posterior sample are in good agreement with the reference values. Tables IV and V compare the reference values with the sample mean *μ* and sample standard deviation *σ* for each observable.

**Table IV.** Implied time scales, reversible sampling.

 | $\hat{t}_i$ (ps) | *μ* (ps) | *σ* (ps) | *ϵ* (ps) | *t*_{corr}
---|---|---|---|---|---
*t*_{2} | 1462 | 1556 | 303 | 19.00 | 197
*t*_{3} | 71 | 73 | 1 | 0.01 | 10
*t*_{4} | 36 | 43 | 5 | 0.06 | 7

**Table V.** Expected hitting times, reversible sampling.

 | $\hat{\tau}$ (ns) | *μ* (ns) | *σ* (ns) | *ϵ* (ns) | *t*_{corr}
---|---|---|---|---|---
$\tau(C_5 \rightarrow C_{7ax})$ | 60.4 | 56.7 | 12.0 | 0.77 | 202
$\tau(C_5 \rightarrow \alpha_L)$ | 43.6 | 40.6 | 8.3 | 0.53 | 206
$\tau(C_5 \rightarrow \alpha_R)$ | 0.253 | 0.250 | 0.005 | 0.0004 | 218

In order to gain a first impression of the efficiency of the sampling and the quality of our estimates, we compute the integrated autocorrelation time *t*_{corr} for each sampled quantity (here implied time scales and hitting times). The error of the sample mean *m*[*f*] compared to the true mean ⟨*f*⟩ can then be estimated as

where *N*_{eff} = *N*/(1 + 2*t*_{corr}) is the effective number of samples, with *N* the total number of samples. See Ref. 62 for a thorough discussion. *t*_{corr} and *ϵ* are also reported in Table IV.
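A minimal sketch of this error estimate follows: *t*_{corr} is accumulated from the empirical autocorrelation function (with a simple truncate-at-first-negative cutoff, which is our simplification; see Ref. 62 for proper windowing), and $\epsilon = \sqrt{\mathrm{Var}(f)/N_{\mathrm{eff}}}$. The AR(1) test chain is a made-up example whose true *t*_{corr} is *φ*/(1 − *φ*).

```python
import numpy as np

rng = np.random.default_rng(5)

def integrated_autocorr_time(x, max_lag=None):
    """Integrated autocorrelation time t_corr: sum of the normalized
    autocorrelation function over positive lags, truncated at the first
    non-positive value (a simple heuristic cutoff)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag is None:
        max_lag = n // 2
    c0 = np.dot(x, x) / n
    t_corr = 0.0
    for lag in range(1, max_lag):
        c = np.dot(x[:-lag], x[lag:]) / n
        if c <= 0:
            break
        t_corr += c / c0
    return t_corr

def error_of_mean(x):
    """epsilon = sqrt(Var(x) / N_eff) with N_eff = N / (1 + 2 t_corr)."""
    n_eff = len(x) / (1.0 + 2.0 * integrated_autocorr_time(x))
    return np.sqrt(np.var(x, ddof=1) / n_eff)

# Correlated toy chain: AR(1) with t_corr = phi / (1 - phi) = 9.
phi = 0.9
x = np.empty(20000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = phi * x[t - 1] + rng.normal()
eps = error_of_mean(x)
```

For correlated samples, *ϵ* is larger than the naive standard error of the mean, reflecting the reduced effective sample size.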

Figs. 8(d)–8(f) show histograms of expected hitting times for the three transitions $C_5 \rightarrow C_{7ax}$, *C*_{5} → *α _{L}*, and *C*_{5} → *α _{R}* between meta-stable sets in the *ϕ* and *ψ* dihedral angle plane. Again, mean values are in good agreement with the corresponding reference values. Table V summarizes the computed results. The table columns again contain the reference value $\hat{\tau}$, the mean value *μ*, the standard deviation *σ*, the estimated correlation time *t*_{corr}, and the error of the mean value *ϵ*.

#### 2. Alanine dipeptide, reversible sampling with fixed equilibrium distribution

Below we report results for sampling with fixed stationary distribution. The stationary distribution *π _{i}* was, for the sake of simplicity, computed from the relative frequencies of state occurrences,

It should be noted that a more useful and independent source of *π* is enhanced sampling simulation targeted at rapidly generating a good estimate of the equilibrium probabilities alone. See Ref. 45 for methods and applications that combine MD simulations and enhanced sampling simulations in order to efficiently compute rare-event kinetics.

Results are shown in Table VI and Fig. 9. The sample mean is again in good agreement with the reference value. For the computation of the reference value, we use the MLE transition matrix of the full simulation data reversible with respect to the input stationary distribution for the posterior sampling. The additional constraint imposed by fixing the stationary distribution is clearly reflected in smaller standard deviations for all shown observables compared to the reversible case.

**Table VI.** Implied time scales, reversible sampling with fixed stationary distribution.

 | $\hat{t}_i$ (ps) | *μ* (ps) | *σ* (ps) | *ϵ* (ps) | *t*_{corr}
---|---|---|---|---|---
*t*_{2} | 1594 | 1520 | 196 | 0.6 | 1
*t*_{3} | 72 | 73 | 1 | 0.003 | 1
*t*_{4} | 38 | 41 | 3 | 0.01 | 1

Histograms for expected hitting times between meta-stable sets are shown in Figs. 9(d)–9(f). The sample mean is again in good agreement with the reference value. Again, we summarize our results, cf. Table VII.

**Table VII.** Expected hitting times, reversible sampling with fixed stationary distribution.

 | $\hat{\tau}$ (ns) | *μ* (ns) | *σ* (ns) | *t*_{corr} | *ϵ* (ns)
---|---|---|---|---|---
$\tau(C_5 \rightarrow C_{7ax})$ | 56.0 | 54.5 | 5.0 | 1 | 0.02
$\tau(C_5 \rightarrow \alpha_L)$ | 41.5 | 39.5 | 5.1 | 1 | 0.02
$\tau(C_5 \rightarrow \alpha_R)$ | 0.249 | 0.251 | 0.003 | 1 | 9.7 ⋅ 10^{−6}

#### 3. Bovine pancreatic trypsin inhibitor, reversible sampling

For BPTI, we used the 1 ms simulation generated on the Anton supercomputer.^{63} Please refer to that paper for system setup and simulation details. We prepared data as follows: *C*_{α} atom positions were oriented to the mean structure and saved every 10 ns, resulting in about 100 000 configurations with 174 dimensions. Time-lagged independent component analysis (TICA)^{24,32} was applied to reduce this 174-dimensional space to the two dominant ICs as a spectral gap was found after the second nontrivial eigenvalue. *k*-means clustering with *k* = 100 was used to discretize this space.

Effective count matrices were obtained using the method described in Ref. 56 at a range of lag times up to 2 *μ*s. Fig. 10 shows the implied relaxation time scales obtained from a maximum likelihood estimate with values comparable to the Hidden Markov model analysis in Ref. 10. The figure also shows uncertainties computed from a reversible transition matrix sampling as described above with *N* = 1000 samples. Only every 20th transition matrix sample was used to compute time scales in order to reduce the computational effort to 50 eigenvalue decompositions. It is seen that the MLE is nicely contained in the 2*σ* (95%) credible interval. The entire transition matrix sampling for Fig. 10 took about 12.5 s on a 1.7 GHz Intel Core i7. Given that 8 lag times were used for 1000 samples of 100 × 100 matrices that contained about 40% non-zeros, about 2.56 × 10^{6} elements are sampled per second, and about 640 full transition matrix samples are generated per second. Below, a more systematic analysis of the computational efficiency is made.

### C. Efficiency

We compute acceptance probabilities of the Metropolis-Hastings steps and compare the statistical efficiency of the proposed sampling algorithm with the algorithm proposed in Ref. 39, which uses uniform proposal densities. Efficiency is measured in terms of the autocorrelation times achieved when sampling transition matrices of different sizes. As a representative observable, we choose the largest relaxation time scale *t*_{2} for the alanine dipeptide molecule and compute autocorrelation functions and autocorrelation times from a large sample of size *N* = 10^{6}. Two discretizations of different resolution were used, resulting in *n* × *n*-shaped transition matrices with *n* = 233 and *n* = 1108.

Fig. 11 shows autocorrelation functions of the second largest relaxation time scale, *t*_{2}, for a reversible posterior ensemble. The autocorrelation function for the reversible sampling algorithm with posterior-adapted proposals shows a much faster decay than that for the algorithm in Ref. 39. Table VIII compares acceptance probabilities and autocorrelation times. The present proposal steps lead to very high acceptance probabilities, *p* > 0.99, for the sampling of off-diagonal entries. The main advance, however, comes from the fact that the step for sampling diagonal transition matrix elements in Ref. 39 suffered from a very poor acceptance probability. As that step was the only step that modified the equilibrium distribution, the sampler in Ref. 39 had very poor mixing properties. In contrast, our current algorithm generates independent samples for the diagonal elements, resulting in an acceptance probability of *p* = 1.0.

**Table VIII.** Acceptance probabilities and autocorrelation times for reversible transition matrix sampling with the algorithm of Ref. 39 ("Old") and the present algorithm ("New").

| Algorithm | *n* | *p*_{offdiag} | *p*_{diag} | *t*_{corr} |
|---|---|---|---|---|
| Old | 233 | 0.216 | 0.011 | 1088.1 |
| Old | 1108 | 0.271 | 0.005 | 3241.9 |
| New | 233 | 0.994 | 1.0 | 194.7 |
| New | 1108 | 0.995 | 1.0 | 242.6 |


The autocorrelation times for the new sampler are more than a factor of 5 shorter for the small (233-state) matrix and more than a factor of 13 shorter for the large (1108-state) matrix, indicating a much higher efficiency of the new approach. The autocorrelation time of the new algorithm increases only mildly for matrices of increased dimension, indicating that the present algorithm will be useful for very large Markov models.

Fig. 12 shows autocorrelation functions for reversible sampling with a fixed stationary vector. The posterior-adapted proposals of the reversible sampling algorithm with fixed stationary distribution again result in a much faster decay of the autocorrelation function than the uniform proposals of the algorithm in Ref. 39. Table IX compares acceptance probabilities and autocorrelation times for the two algorithms. For sampling with a fixed stationary vector, there is no sampling step for the diagonal elements. Although the average acceptance probabilities are only a factor of 3-4 better for our new algorithm, the autocorrelation times are decreased by a factor of 35 for the small system (233 states) and by more than a factor of 300 for the large system (1108 states). Again, there is only a mild increase in autocorrelation time when the dimension of the sampled space is increased.

## VI. CONCLUSION

In this work, we have described and significantly extended the state of the art in reversible Markov model estimation. Reversible Markov models are expected to arise naturally from molecular dynamics implementations that fulfill microscopic reversibility, and reversibility is an essential property for analyzing the equilibrium kinetics of a molecule. In a Markov model estimated from finite data, however, reversibility does not arise automatically; it must be enforced in the estimation procedure. When done correctly, reversible estimation does not bias the model but rather reduces statistical errors as a result of the smaller number of degrees of freedom.

We have presented minor improvements to an existing self-consistent estimation algorithm for reversible Markov models. Then, we have presented a new and efficient algorithm to estimate reversible maximum likelihood Markov models given a fixed equilibrium distribution.

The main part of the presented work focuses on the long-standing problem of Bayesian estimation of the posterior ensemble of reversible transition matrices. Although several algorithms to sample reversible Markov models have been presented in the past, they have been hampered by three fundamental problems, two of which are addressed here: (i) Which prior should be chosen such that the posterior is located around the true value rather than completely elsewhere? (ii) How should transition counts be obtained when time series are correlated and not truly Markovian at any given lag time *τ*? (iii) How can the sampling algorithm be made efficient enough that even large transition matrices can be sampled in reasonable time?

To address problem (i), we develop priors that ensure that reference values and sample mean are similar. The key property of these priors is that they make the *a priori* assumption that transitions between states that have not been sampled in the trajectory in either direction have zero probability. This is a sparse prior, i.e., an improper prior enforcing that the sampled transition matrix has the same sparsity structure as the maximum likelihood estimate and as induced by the observation. In contrast to most other priors that have been previously suggested, these sparse priors achieve the desired property of creating error bars that nicely envelop the reference estimates.
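The sparsity structure assumed *a priori* can be expressed as a simple mask over the count matrix. The helper below is a hypothetical illustration, not the PyEMMA API: element (*i*, *j*) is allowed to carry probability only if a transition between *i* and *j* was observed in at least one direction (diagonal elements are handled separately by the sampler).

```python
import numpy as np

def sparsity_mask(C):
    """Allowed transitions under the sparse prior: (i, j) may carry
    probability only if i -> j or j -> i was observed at least once."""
    C = np.asarray(C)
    return (C + C.T) > 0
```

A sampler respecting this prior then only updates the elements where the mask is `True`, which also keeps the cost per sweep proportional to the number of non-zeros rather than *n*².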

For problem (ii), we have described the principles of how it can be addressed. We suggest that effective count matrices are obtained using the concept of statistical inefficiencies. A separate preprint^{56} suggests an initial solution towards this aim that is successfully applied to simulation data of the bovine pancreatic trypsin inhibitor in the present paper. The solution of problem (ii) is still in its infancy and needs further investigation.
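As a rough illustration of the idea (Ref. 56 develops the actual estimator), a global statistical inefficiency *g* can be estimated from an observable's time series, e.g. by block averaging, and used to rescale the observed counts; the function names below are hypothetical.

```python
import numpy as np

def statistical_inefficiency(x, block_length=100):
    """Block-averaging estimate of the statistical inefficiency g: for
    sufficiently long blocks, var(block means) ~ g * var(x) / block_length."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // block_length) * block_length
    blocks = x[:n].reshape(-1, block_length).mean(axis=1)
    return block_length * blocks.var() / x.var()

def effective_counts(C, g):
    """Rescale observed transition counts by the statistical inefficiency
    so that downstream error estimates see approximately independent counts."""
    return np.asarray(C, dtype=float) / g
```

For uncorrelated data *g* ≈ 1 and the counts are unchanged; for correlated data *g* > 1 and the effective counts shrink, widening the posterior accordingly.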

For problem (iii), we present highly efficient Gibbs sampling algorithms for reversible transition matrices and reversible transition matrices with fixed equilibrium distribution. Both methods are demonstrated to have acceptance probabilities close to 1 in their individual update steps. Autocorrelation times from samples of the slowest relaxation time scale are one or two orders of magnitude shorter than with a previous Gibbs sampling algorithm, indicating a high statistical efficiency of our sampler.

Implementations of all algorithms described here are available in PyEMMA^{64} as of version 2.0 or later — www.pyemma.org.

## Acknowledgments

This work was funded by the Deutsche Forschungsgemeinschaft (DFG) Project Nos. SFB 1114/C03 (F.P.), SFB 1114/A04 (H.W.), Grant No. 825/3-1 (B.T.S.), and the European Research Council (ERC) starting grant “pcCell” (F.N., B.T.S., F.P.). The authors would like to thank the anonymous referees for their valuable comments.

### APPENDIX: DETAILS FOR TRANSITION MATRIX SAMPLING

#### 1. Reversible transition matrix sampling: Derivation of marginal densities

We first pick a single element (*k*, *l*) of *X* (diagonal or off-diagonal) and sample it from a proposal density $x'_{kl} \sim q(x'_{kl} \mid X)$ that is scale-invariant, $q(x'_{kl} \mid X) \propto q(c\,x'_{kl} \mid cX)$ for all *c* > 0, and then renormalize the matrix such that it retains an element sum of 1,

$$\bar{x}'_{ij} = \frac{x'_{ij}}{\sum_{p,q} x'_{pq}}.$$

Since $q(x'_{kl} \mid X)$ is a probability density function, we can obtain from its scale-invariance that

and

According to Theorem 13.1 in Ref. 65, the posterior distribution ℙ(*X*|*C*) is the invariant distribution of the proposed update step if we accept $\bar{X}'$ as the new sample with probability min{1, *p*_{acc}} and

where $q_x(\bar{x}'_{kl} \mid X)$ denotes the proposal density of $\bar{x}'_{kl}$ given *X*. Note that *X* contains only *n*(*n* + 1)/2 − 1 free variables, so we select $\{x_{ij} \mid i \ge j,\ (i,j) \ne (i',j')\}$ as the free variable set of *X*, where $x_{i'j'}$ is an arbitrary element of *X* with $i' \ge j'$ and $(i',j') \ne (k,l)$. Let us consider each term on the right hand side of (A5).

From the definition of $\bar{X}'$, we have

and

The proposal density of $\bar{x}'_{kl}$ given *X* can be expressed as

Therefore,

and

The partial derivative of $\bar{x}'_{ij}$ with respect to $x_{ij}$ for $(i,j) \ne (k,l)$ can be calculated according to (A2) as

Combining all the above results leads to

with

#### 2. Reversible transition matrix sampling: Efficient proposal densities

##### a. Diagonals

Let us define the variable $s' = \frac{x'_{kk}}{x_k - x_{kk} + x'_{kk}}$. If $x'_{kk}$ is sampled from the proposal density $\gamma(x'_{kk} \mid X)$, the corresponding proposal density of *s*′ can be expressed as

Note that

Therefore,

and

The above equation implies that *s*′ follows the beta distribution with parameters $c_{kk}$ and $c_k - c_{kk}$, so we can sample $s' \sim \mathrm{Beta}(c_{kk},\, c_k - c_{kk})$ and obtain $x'_{kk}$ by the back-transform.
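The diagonal update can thus be drawn as an independence sample. A minimal sketch (hypothetical function name; assumes $c_{kk} > 0$ and $c_k - c_{kk} > 0$ so the beta parameters are valid):

```python
import numpy as np

def sample_diagonal(X, C, k, rng):
    """Independence sample of the diagonal element x'_kk:
    draw s' ~ Beta(c_kk, c_k - c_kk) and invert the transform
    s' = x'_kk / (x_k - x_kk + x'_kk)."""
    c_kk, c_k = C[k, k], C[k].sum()
    x_kk, x_k = X[k, k], X[k].sum()
    s = rng.beta(c_kk, c_k - c_kk)
    return s * (x_k - x_kk) / (1.0 - s)  # back-transform to x'_kk
```

Because the draw does not depend on the previous value of $x_{kk}$, every proposal is accepted, consistent with the acceptance probability *p* = 1.0 reported in Table VIII.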

##### b. Off-diagonals

We consider how to approximate $\gamma(x'_{kl} \mid X)$ with *k* ≠ *l* and $b_{kl} = -1$ by a gamma distribution density function. Define

The function $f(x'_{kl})$ is then given by

We approximate *f* using a three parameter family of functions

so that the corresponding approximate $\gamma(x'_{kl} \mid X)$ is

The three parameters *α*, *β*, *f*_{0} are obtained by matching *f* and $\hat{f}$ up to second derivatives at the maximum point

This leads to the following linear system:

with solution

and

The maximum point can be computed as the root of a quadratic equation in the usual way,

with parameters *a*, *b*, *c* given by

The second solution of (A23g), with negative sign in front of the square root, can be safely excluded since $\bar{y}$ is required to be non-negative.
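Since (A20) and (A23) are not reproduced above, the following sketch assumes the common gamma-type family $\hat{f}(x) = f_0 + \alpha \log x - \beta x$; matching the value, first, and second derivatives of *f* at its maximum $\bar{y}$ then fixes the three parameters (a numerical second derivative stands in for the analytical one):

```python
import numpy as np

def fit_gamma_proposal(f, y_bar, h=1e-4):
    """Fit f_hat(x) = f0 + alpha*log(x) - beta*x to f at its maximum y_bar.
    f_hat'(y_bar) = alpha/y_bar - beta = 0 and f_hat''(y_bar) = -alpha/y_bar**2
    determine alpha and beta; f0 matches the function value."""
    f2 = (f(y_bar + h) - 2.0 * f(y_bar) + f(y_bar - h)) / h**2  # f''(y_bar)
    alpha = -y_bar**2 * f2
    beta = alpha / y_bar
    f0 = f(y_bar) - alpha * np.log(y_bar) + beta * y_bar
    return alpha, beta, f0
```

The resulting proposal density, proportional to $e^{\hat{f}(x)} = e^{f_0}\, x^{\alpha} e^{-\beta x}$, is a gamma density with shape *α* + 1 and rate *β* and can therefore be sampled directly.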

#### 3. Reversible sampling with fixed stationary distribution: Efficient proposal densities

We first prove that there exists a maximum likelihood estimate $\hat{P} = [\hat{p}_{ij}]$ of the transition matrix which satisfies $|\{k \mid \hat{p}_{kk} > 0,\ c_{kk} = 0\}| \le 1$. Suppose that $X' = [x'_{ij}]$ is a maximum likelihood estimate of **X** and there are two different indices *k*, *l* which satisfy $x'_{ll} \ge x'_{kk} > 0$ and $c_{kk} = c_{ll} = 0$. We can then construct a new matrix $X'' = [x''_{ij}]$ with

It can be verified that the likelihood of **X**″ is no smaller than that of **X**′, which implies that **X**″ is also a maximum likelihood estimate of **X**. Repeating this procedure, we finally obtain a maximum likelihood estimate $\hat{X}$ of **X** which has at most one *k* with $\hat{x}_{kk} > 0$ and $c_{kk} = 0$; the corresponding $\hat{P}$ satisfies $|\{k \mid \hat{p}_{kk} > 0,\ c_{kk} = 0\}| \le 1$.

We will now investigate how to approximate the density

with *s* > 1 by a gamma distribution density.

As in the reversible case, we will use the representation

and approximate *f*(*v*′) using the three parameter family $\hat{f}(v' \mid \alpha, \beta, f_0)$ given in (A20). The resulting approximate density, $\hat{\gamma}_v(v' \mid X)$, has the same nice properties as the one from (A21).

Parameters *α*, *β*, and *f*_{0} are given by (A23d), (A23e), and (A23f). The maximum point $\bar{v}$ is given by (A23g). The parameters *a*, *b*, *c* are