Markov state models are a widely used method for approximating the eigenspectrum of the molecular dynamics propagator, yielding insight into the long-timescale statistical kinetics and slow dynamical modes of biomolecular systems. However, the lack of a unified theoretical framework for choosing between alternative models has hampered progress, especially for non-experts applying these methods to novel biological systems. Here, we consider cross-validation with a new objective function for estimators of these slow dynamical modes, a generalized matrix Rayleigh quotient (GMRQ), which measures the ability of a rank-*m* projection operator to capture the slow subspace of the system. It is shown that a variational theorem bounds the GMRQ from above by the sum of the first *m* eigenvalues of the system’s propagator, but that this bound can be violated when the requisite matrix elements are estimated subject to statistical uncertainty. This overfitting can be detected and avoided through cross-validation. These result make it possible to construct Markov state models for protein dynamics in a way that appropriately captures the tradeoff between systematic and statistical errors.

## I. INTRODUCTION

Conformational dynamics are central to the biological function of macromolecular systems such as signaling proteins, enzymes, and channels. The molecular description of processes as diverse as protein folding, kinase activation, voltage-gating of ion channels, and ubiquitin signaling involves not only just the structure of a unique single conformation but also the conformational dynamics between a multitude of states accessible on the potential energy surface.^{1–4} These dynamics occur on a range of timescales and have varying degrees of structural complexity: localized vibrations may occur on the 0.1 ps timescale, while large-scale structural changes like protein folding can take seconds or longer.^{5} Although many experimental techniques—most notably X-ray crystallography and nuclear magnetic resonance spectroscopy—can yield detailed structural information on functional conformations, the experimental characterization of the dynamical processes, intermediate conformations, and transition pathways in macromolecular systems remains exceptionally challenging.^{6,7}

Atomistic molecular dynamics (MD) simulations can complement experiment and provide a powerful tool for probing conformational dynamics, allowing researchers to directly visualize and analyze the time evolution of macromolecular systems in atomic detail. Three major challenges for MD simulation of complex systems are the accuracy of the potential energy functions, adequate sampling of conformational space, and quantitative analysis of simulation results. The state-of-the-art on all three fronts has advanced rapidly in recent years. A new generation of increasingly accurate forcefields has recently emerged, such as those which include explicit polarizability and have been parameterized more systematically.^{8–12} On the sampling problem, the introduction of graphical processing units (GPUs) has dramatically expanded the timescales accessible with MD simulation at modest cost, and specialized MD-specific hardware and distributed computing networks have yielded further gains.^{13–17} In this work, we focus on the remaining challenge, the quantitative analysis of MD simulations.

Despite, or perhaps because of their detail, MD simulations require further analysis in order to yield insight into macromolecular dynamics or quantitative predictions capable of being tested experimentally. The direct result of a simulation, a MD trajectory, is a time series of Cartesian positions (and perhaps momenta) of dimension 3*N* (6*N* if momenta are retained), where *N* is the number of atoms in the system. Because routine MD simulations may contain tens or hundreds of thousands of atoms, these time series are extremely high-dimensional. A multitude of methods have been proposed for reducing the dimensionality or complexity of MD trajectories and enabling the analysis of the system’s key long-lived conformational states, dynamical modes, transition pathways, and essential degrees of freedom.^{18–24}

In this work, we combine two central ideas from machine learning and chemical physics—hyperparameter selection via cross-validation and variational approaches for linear operator eigenproblems—to create a new method for discriminating between alternative simplified models for molecular kinetics constructed from MD simulations. Towards this end, we prove a new variational theorem concerning the simultaneous approximation of the first *m* eigenfunctions of the propagator of a high-dimensional reversible stochastic dynamical system, which mathematically formalize the slow collective dynamical motions we wish to identify in molecular systems.

## II. CROSS VALIDATION

In seeking to estimate the slowest collective dynamical modes of a molecular system from a finite set of stochastic trajectories, statistical noise is unavoidable. These dynamical modes, which we formally identify as the first *m* eigenfunctions, *ϕ _{i}*, of the propagator, an integral operator associated with the dynamics of a molecular system (vide infra), are functions on ℝ

^{3N}to ℝ. Like the ground state wave function in quantum chemistry, these eigenfunctions can only be approximately represented in any finite basis set. Reducing this approximation error, a statistical bias, motivates the use of larger and more flexible basis sets. Unfortunately, in an effect known as the

*bias-variance tradeoff*,

^{25,26}larger basis sets tend to exacerbated a competing source of error, the model variance: with a fixed simulation data set but additional adjustable parameters due to a larger basis set, the model estimates of these eigenfunctions become more unstable and uncertain.

As has been known since at least the early 1930s, training a statistical algorithm and evaluating its performance on the same data set generally yield overly optimistic results.^{27} For this reason, standard practice in supervised machine learning is to divide a data set into separate training and test sets. The model parameters are estimated using the training data set, but performance is evaluated separately by scoring the now-trained model on the separate test set, consisting of data points that were left out during the training phase. To avoid overfitting, the choice between alternative models is made using test set performance not training set performance.

However, because researchers may expend significant effort to collect data sets, the exclusion of a large fraction of the data set from the training phase can be a costly proposition. *k*-fold cross-validation is one alternative that can be more data-economical, where the data are split into *k* disjoint subsets, each of which is cycled as both the training and test sets.

Let *X* be a collection of molecular dynamics trajectories (the data set), which we assume for simplicity to consist of a multiple of *k* independent MD trajectories of equal length. In *k*-fold cross validation, the trajectories are split into *k* equally sized disjoint subsets, called folds, denoted *X*^{(i)}, for *i* ∈ {1, 2, …, *k*}. These will serve as the test sets. Let *X*^{(−i)} = *X*∖*X*^{(i)} denote the set of all trajectories excluded from fold *i*; these will serve as the training sets.

Consider an algorithm to estimate the *m* slowest dynamical modes of the system, *g*. Examples of such estimators include Markov state models (MSMs)^{28} and time-structured independent components analysis (tICA).^{29,30} The result of this calculation, the estimated eigenfunctions, $ \varphi \u02c6 1 : m $, is taken to be a function of both an input dataset, *X*, and a set of hyperparameters, *θ*, which may include settings such as the number of states or clustering algorithm in a MSM or the basis set used in tICA,

Furthermore, consider an objective function, $O ( \varphi \u02c6 1 : m , X \u2032 ) $, which evaluates a set of proposed eigenfunctions, $ \varphi \u02c6 1 : m $, and a (possibly new) dataset *X*′, returning a single scalar measuring the performance or accuracy of these eigenfunctions. The mean cross validation performance of a set of hyperparameters is defined by the following expression, which builds *k* models on each of the training sets and scores them on the corresponding test sets,

Model selection can be performed by finding the hyperparameters, *θ*^{∗} = arg max_{θ}*MVC*(*θ*), which maximize the cross validation performance. Many variants of this protocol with different procedures for splitting the training and test sets, such as repeated random subsampling cross-validation and leave-one-out cross validation, are also possible.^{31}

The remainder of this work seeks to develop a suitable objective function, *O*, for estimates of the slow dynamical modes in molecular kinetics that can be used as shown above in a cross-validation protocol. This task is complicated by the fact that no ground-truth values of true eigenfunctions, *ϕ _{i}*, are available either in the training or test sets. Nevertheless, our goal is to construct a

*consistent*objective function, such that as the size of a molecular dynamics data set,

*X*, grows larger, the maximizer of

*O*(⋅,

*X*) converges in probability to the true propagator eigenfunctions,

*ϕ*

_{1:m},

## III. THEORY BACKGROUND

We begin by introducing the notion of the propagator and its eigenfunctions from a mathematical perspective, introducing the key variables and notation that will be essential for the remainder of this work. We largely follow the order of presentation in Prinz *et al.*^{28} which contains a longer and more thorough discussion.

Consider a time-homogeneous, ergodic, continuous-time Markov process **x**(*t*) ∈ Ω which is reversible with respect to a stationary distribution *μ*(**x**) : Ω → ℝ^{+}. Where necessary for concreteness, we take the phase space, Ω, to be ℝ^{3N}, where *N* is the number of atoms of a molecular system. The system’s stochastic evolution over an interval *τ* > 0 is described by a transition probability density

where *B*_{ϵ}(**y**) is the open *ϵ*-ball centered at **y** with infinitesimal measure *d***y**.

Consider an ensemble of such systems at time *t*, distributed according to some probability distribution *p _{t}*(

**x**). After waiting for a duration

*τ*, the distribution evolves to a new distribution,

which defines a continuous integral operator, $P ( \tau ) $, called the propagator with lag time *τ*. The propagator, $P ( \tau ) $, admits a natural decomposition in terms of its eigenfunctions and eigenvalues,

Furthermore, due to the reversibility of the underlying dynamics, $P ( \tau ) $ is compact and self-adjoint with respect to a *μ*^{−1} weighted scalar product,^{32}

where *f* and *g* are arbitrary scalar functions on Ω. The propagator has a unique largest eigenvalue λ_{1} = 1 with corresponding eigenfunction *ϕ*_{1}(**x**) = *μ*(**x**). The remaining eigenvalues are real and positive, can be sorted in descending order, and can be normalized to be *μ*^{−1}-orthonormal. Using the spectral decomposition of $P ( \tau ) $, the conformational distribution of an ensemble at arbitrary multiples of *τ* can be written as a sum of exponentially decaying relaxation processes,

where $ t i =\u2212 \tau ln \lambda i $. The eigenfunctions *ϕ _{i}*(

**x**) for

*i*= 2, … can thus be interpreted as dynamical modes, along each of which the system relaxes towards equilibrium with a characteristic timescale,

*τ*. Many molecular systems are characterized by

_{i}*m*individual

*slow*timescales with eigenvalues close to one, separated from the remaining eigenvalues by a

*spectral gap*. These slowest collective degrees of freedom, such as protein folding coordinates and pathways associated with enzyme activation/deactivation, are often identified with key functions in biological systems. The remaining small eigenvalues correspond to faster dynamical processes that rapidly decay to equilibrium. Under these conditions, the long-time dynamics induced by the propagator can be well described by consideration of only these slow eigenfunctions — that is, a rank-

*m*low-rank approximation to the propagator.

Furthermore, not only do these slow eigenfunctions form a convenient basis but also in fact they lead to an *optimal* reduced-rank description of the dynamics. That is, each of the partial sums formed by truncating the expansion in Eq. (8) at its first *m* terms is the closest possible rank-*m* approximation to $P ( \tau ) $ in spectral norm. This statement is made precise by the following theorem.

*Let*$P$

*be compact linear operator which is self-adjoint with respect to an inner product*〈 ⋅ , ⋅ 〉

_{μ−1}

*. Assume that the eigenvalues*λ

_{i}

*and associated eigenfunctions*

*ϕ*

_{i}*are sorted in descending order by eigenvalue. Define the rank- m*

*operator*$ P m $

*such that*$ P m \u2218f= \u2211 i = 1 m \lambda i \u3008f, \varphi i \u3009 \mu \u2212 1 \varphi i $

*and let*$ A m $

*be an arbitrary rank*

*m*

*operator. Then,*

*Proof.*This is the extension of the familiar Eckart-Young theorem to self-adjoint linear operators. The original result is by Schmidt.

^{33}See Courant and Hilbert (pp. 161)

^{34}and Micchelli and Pinkus

^{35}for further details.□

While Theorem 1 is a statement about operator approximation, it can also be viewed as a statement about optimal dimensionality reduction for description of slow dynamics. Over all *m*-dimensional dimensionality reductions, the one which projects the dynamics onto its first *m* propagator eigenfunctions preserves to the largest degree information about the long-timescale evolution of the original system.

## IV. OBJECTIVE FUNCTION AND SUBSPACE VARIATIONAL PRINCIPLE

In this section, we introduce the objective function discussed abstractly in Sec. II. We show that both the existing tICA^{29,30} and MSM^{28,36–39} methods can be interpreted as procedures which directly optimize this criteria using different restricted families of basis functions. Furthermore, we show that in the infinite-data limit, when the requisite matrix elements can be computed without error, a variational bound governs this objective function: *ansatz* eigenfunctions, $ \varphi \u02c6 1 : m $, which differ from the propagator’s true eigenfunctions, *ϕ*_{1:m}, are always assigned a score which is less than the score of the true eigenfunctions.

Unfortunately, in the more typical finite-data regime, this variational bound can be violated in a pernicious manner: as the size of the basis set increases past some threshold, models can give continually increasing training set performance (even breaking the variational bound), even as they get *less* accurate when measured on independent test data sets. This observation underscores the practical value of cross-validation in estimators for the slow dynamical processes in molecular kinetics.

Our results build on the important contributions of Noé and Nüske^{40} and Nüske *et al.*,^{41} who introduced a closely related variational approach for characterizing the slow dynamics in molecular systems. Our novel contribution stems from an approach to the problem through the lens of cross-validation, with its need for a single scalar objective function. While previous work focuses on estimators of each of the propagator eigenfunctions, *ϕ _{i}*, one at time, we focus instead on measures related to the simultaneous estimation of all of the first

*m*eigenfunctions collectively.

*Let*$P$

*be compact linear operator whose eigenvalues*λ

_{1}> λ

_{2}≥ λ

_{3}, …

*are bounded from above and which is self-adjoint with respect to an inner product*〈 ⋅ , ⋅ 〉

_{μ−1}

*. Furthermore, let*

*f*

*be an arbitrary set of*

*m*

*linearly independent functions on*Ω → ℝ

*,*$f= { f i ( \u22c5 ) } i = 1 m $

*. Let*𝕊

^{m}

*and*$ S + + m $

*be the space of*

*m*×

*m*

*real symmetric matrices and positive definite matrices, respectively. Define a matrix*

*P*∈ 𝕊

^{m}

*with*$ P i j =\u3008 f i ,P\u2218 f j \u3009 \mu \u2212 1 $

*and a matrix*$Q\u2208 S + + m $

*with*

*Q*= 〈

_{ij}*f*,

_{i}*f*〉

_{j}_{μ−1}

*. Define*$ R P [ f ] $

*as*

*Then,*

*The equality in Eq. (12) holds for* *f* = {*ϕ*_{1}, *ϕ*_{2}, …, *ϕ _{m}*}

*and any set of*

*m*

*functions,*

*f, such that*span(

*f*) = span({

*ϕ*

_{1},

*ϕ*

_{2}, …,

*ϕ*})

_{m}*.*

The proof of Theorem 2 follows from the Ky Fan theorem.^{42,43} Its proof, as well as the proof of Lemma 3, can be found in Appendix A.

This result implies that the slow eigenspace of the molecular propagator can be numerically determined by simultaneously varying a set of *ansatz* functions *f* to maximize $ R P [ f ] $. If the maxima is found, then *f* are the desired eigenfunctions up to a rotation. The matrix *P* has the form of a time-lagged covariance matrix between the *ansatz* functions, describing the tendency of the system to move from regions of phase space strongly associated one *ansatz* function to another in time *τ*. The matrix *Q* acts like a normalization, giving the equilibrium overlap between *ansatz* functions. Note that when the trial functions, *f*, are *μ*^{−1}-orthonormal, *Q* is simply the identity. Under these conditions, $ R P [ f ] $ then assumes a simple form as the sum of the individual Ritz values of the trial functions.

Physically, $ R P [ f ] $ can be interpreted as the “slowness” of the lower-dimensional dynamical process formed by projecting a high-dimensional process through the *m* *ansatz* functions. The maximization of $ R P [ f ] $ corresponds to a search for the coordinates along which the system decorrelates as slowly as possible.

Because it is bounded by the sum of the first *m* true eigenfunctions of the propagator, $ R P [ f ] $ is the foundation of the sought objective function for cross-validation of estimators for the slow dynamical modes in molecular kinetics. Unfortunately, it cannot be calculated exactly from a molecular dynamics simulation. Next, we show how the requisite matrix elements, *P _{ij}* and

*Q*, can be approximated from MD. The noise in these approximations will be a function of both the amount of available simulation data and the size and flexibility of a basis set, leading to the bias variance tradeoff discussed earlier. By the continuous mapping theorem and Theorem 2, consistency of the objective function (in the sense of Eq. (3)) is established if these estimators for

_{ij}*P*and

*Q*are consistent.

### A. Basis function expansion

Equipped with this variational theorem, we now consider the construction of an approximation to the dominant eigenspace of $P ( \tau ) $ using linear combinations of functions from a finite basis set. This reduces the problem of searching over the space of sets of *m* functions to a problem of finding a weight matrix that linearly mixes the basis functions.

Let $ { \phi a } a = 1 n $ be a set of *n* functions on Ω → ℝ, which will be used as basis functions in which to expand the slowest *m* propagator eigenfunctions. Physically motived basis functions for protein dynamics might include measurements of protein backbone dihedral angles, the distances between particular atoms, or similar structural coordinates. The basis can also be indicator functions for specific regions of phase space—the “Markov states” in a MSM.

Following Nüske *et al.*,^{41} we expand the *m* *ansatz* eigenfunctions as *μ*-weighted linear combinations of the basis functions, $ f i ( x ) = \u2211 a A i a \mu ( x ) \phi a ( x ) $, where *A* ∈ ℝ^{n×m} is a weight matrix of arbitrary expansion coefficients. From the basis functions, we define the time-lagged covariance and overlap matrices *C* ∈ 𝕊^{n} and $S\u2208 S + + n $, respectively, such that $ C i j =\u3008\mu \phi i ,P\u2218\mu \phi j \u3009 \mu \u2212 1 $ and *S _{ij}* = 〈

*μφ*,

_{i}*μφ*〉

_{j}_{μ−1}.

Then, by exploiting the linearity of the basis function expansion, the matrices *P* and *Q* can be written as matrix products involving the expansion coefficients, correlation, and overlap matrices,

These equations can be interpreted in a simple way: the time-lagged correlation and overlap of the *ansatz* functions with respect to one another can be formed from two similar matrices involving only the basis functions, *C* and *S*, and the expansion coefficients, *A*. When the *ansatz* functions, *f*, are constructed this way, $ R P [ f ] $ reduces to the generalized matrix Rayleigh quotient (GMRQ), $ R P [ f ] =R ( A ; C , S ) =R ( A ) $,

Following Lemma 3, we note that $R ( A ) $ is a function only of column span of *A* and is not affected by rescaling or the application of any invertible transformation of the columns. Therefore, the optimization of $R ( A ) $ can be seen as a single optimization problem over the set of all *m*-dimensional linear subspaces of Ω. This space is referred to as a *Grassmann manifold*.^{44} Note that when *m* = 1, *P* and *Q* are scalars and $R ( A ) $ reduces to a standard generalized Rayleigh quotient.

Furthermore, with fixed basis functions, the training problem, $ A \u2217 =arg\u2009 max A R ( A ; C , S ) $, is solved directly by a matrix *A*^{∗} with columns that are the *m* generalized eigenvectors of *C* and *S* with the largest eigenvalues, and this eigenproblem is identical to the one introduced for the tICA method^{29,30} and the Ritz method.^{40}

### B. Estimation of matrix elements from MD

Equipped with a collection of basis functions, {*φ*}, how can *C* and *S* be estimated from a MD dataset? As previously shown by Nüske *et al.*,^{41} the matrix elements *C* and *S* can be estimated from an equilibrium molecular dynamics simulations, $X= { x t } t = 1 T $, by exploiting the ergodic theorem and measuring the correlation between the basis functions, with or without a time lag,

Note that for Theorem 2 to be applicable, *C* is required to be symmetric, a property which is likely to be violated by the estimator Eq. (18). For this reason, in practice, we use an estimator that averages the matrix computed in Eq. (18) with its transpose. We call this method as transpose symmetrization, and it amounts to including each trajectory twice in the dataset, once in the forward and once in the reversed direction, as discussed by Schwantes and Pande.^{29}

MSMs^{28,36–39} are particular case of the proposed method, which have been widely applied to the analysis of biomolecular simulations,^{45–53} where the basis functions are chosen to indicator functions on a collection of non-overlapping subsets of the conformation space. Given a set of discrete non-overlapping states which partition Ω, $S= { s i } i = 1 n $, such that *s _{i}* ⊆ Ω, $ \u22c3 i = 1 n s i =\Omega $ and

*s*∩

_{i}*s*= 0̸, and define

_{j}Using this basis set, as previously shown by Nüske *et al.*,^{41} estimates of the correlation matrix elements *C _{ij}* can be obtained following Eq. (18) by counting the number of observed transitions between sets

*s*and

_{i}*s*. The overlap matrix

_{j}*S*is diagonal with entries,

*S*, that estimate the stationary probabilities of the sets,

_{ii}*S*≈ ∫

_{ii}_{x∈si}

*d*

**x**

*μ*(

**x**).

For the particular case of MSM basis sets, in contrast to the somewhat crude transpose symmetrization method, a more elegant enforcement of symmetry of *C* can be accomplished via a maximum likelihood estimator following Algorithm 1 of Prinz *et al.*^{28}

Equipped with these estimators for *C* and *S* from MD data, Eq. (15) now has a form which is suitable for use as a cross-validation objective function, $O ( \varphi \u02c6 1 : m , X \u2032 ) $. The proposed eigenfunctions, which may have been trained on a *different* dataset, are numerically represented by expansion coefficients, $ A \u02c6 $, and *C* and *S* act as sufficient statistics from the test dataset *X*′; the GMRQ objective function is $R ( A \u02c6 ; C ( X \u2032 ) , S ( X \u2032 ) ) $.

## V. ALGORITHMIC REALIZATION

The central practical purpose of cross-validation with GMRQ is, given a MD dataset, to select a set of appropriate basis functions with which to construct MSMs for system’s kinetics. Note that Eq. (22) leaves substantial flexibility in the definition of the basis set, since the partitioning of phase space into states is left unspecified.

Methodologies for constructing these states include clustering the conformations in the dataset using a variety of distance metrics, clustering algorithms, and dimensionality reduction techniques. Let *θ* be a variable containing the settings for these procedures, which parameterizes a function, $ g \theta MSM ( X ) $, that, given a collection of MD trajectories, returns a set of *n* states, $S$.

Procedurally, GMRQ-based cross-validation for MSMs is a protocol for assigning a scalar score, *MCV*(*θ*), to the MSM hyperparameters, *θ*, with the following steps.

Separate the full data set into

*k*disjoint folds, as described in Sec. II.For each fold,

*i*, use the training data set,*X*^{(−i)}, to construct a set of states, $ S ( \u2212 i ) = g \theta ( X ( \u2212 i ) ) $.Use the states $ S ( \u2212 i ) $ and the training data set

*X*^{(−i)}to build a Markov state model. This entails clustering the dataset to obtain the basis functions (states), {*φ*}, estimating the training set correlation and overlap matrices*C*^{(−i)}and*S*^{(−i)}from the trajectories, and computing their first*m*generalized eigenvectors, $A=arg\u2009 max A R ( A ; C \u2212 ( i ) , S ( \u2212 i ) ) $, with a standard generalized symmetric eigensolver (e.g., LAPACK’s dsygv subroutine).^{54}- These eigenvectors maximize the GMRQ on the training set, but how do they perform when tested on new data? Using the test set data,
*X*^{(i)}, and the states, $ S ( \u2212 i ) $,*as determined from the training set*, compute the test set correlation and overlap matrices,*C*^{(i)}and*S*^{(i)}. These trained eigenvectors,*A*, are scored on the test set by $R ( A ; C ( i ) , S ( i ) ) $. The key metric for model selection, the cross-validation mean test set generalized matrix Rayleigh quotient isAs an overfitting diagnostic, we also calculate the cross-validation mean training set GMRQ,$ M V C ( \theta ) = k \u2212 1 \u2211 i = 1 k R ( A ; C ( i ) , S ( i ) ) . $$ M V C \u2032 ( \theta ) = k \u2212 1 \u2211 i = 1 k R ( A ; C ( \u2212 i ) , S ( \u2212 i ) ) . $ Finally, the entire procedure is repeated for many choices of

*θ*, and the hyperparameter set that maximizes the mean cross validation score is chosen as the best model,*θ*^{⋆}= arg max_{θ}*MVC*(*θ*).

For this approach, one symptom of overfitting—the construction of models that describe the statistical noise in *X* rather than the underlying slow dynamical processes—is an overestimation of the eigenvalues of the propagator and overestimation of the GMRQ. Related statistical methods, such as kernel principal components analysis which also involve spectral analysis of integral operators under non-negligible statistical error suffer from the same effect, which has been termed variance inflation.^{55–57}

Left unspecified in this protocol are three important parameters: the degree of cross validation, *k*, the number of desired eigenfunctions, *m*, and the correlation lag time, *τ*. In our experiments, following common practice in the machine learning community, we use *k* = 5. Especially in the data-limited regime, the tradeoffs involving the choice of *k* are not entirely clear, as the objective lacks the form of an empirical risk minimization problem.^{26,58} For MSMs, substantial attention in the literature has been paid to the selection of the lag time, *τ*.^{28,38,59} With fixed basis function, it has been shown that the eigenfunction approximation error is a decreasing function of the *τ*, which motivates the use of larger values.^{60} On the other hand, larger values of *τ* limit the temporal resolution of the model. For MSMs of protein folding, the authors’ experience suggests that appropriate values for *τ* are often in the range between 1 and 100 ns. Finally, we suggest that *m*, the rank of GMRQ, be selected based on the number of slow dynamical processes in the system, as determined by an apparent gap in the eigenvalue spectrum of $P ( \tau ) $, or heuristically to a value between 2 and ∼10.

## VI. SIMULATIONS

### A. Double well potential

In order to gain intuition about the method, we begin by considering one of the simplest possible systems: Brownian dynamics on a double well potential. We consider a one dimensional Markov process in which a single particle evolves according to the stochastic differential equation,

where *V* is the reduced potential energy, *D* is the diffusion constant, and *R*(*t*) is a zero-mean delta-correlated stationary Gaussian process. For simplicity, we consider the potential

with reflecting boundary conditions at *x* = − *π* and *x* = *π*. Using an Euler integrator, a time step of Δ*t* = 10^{−3} and diffusion constant *D* = 10^{3}, we simulated 10 trajectories starting from *x* = 0 of length 10^{5} steps and saved the position every 100 steps. The potential and histogram of the resulting data points are shown in Fig. 1(b). We computed the true eigenvalues of the system’s propagator to machine precision by discretizing the propagator matrix elements on a dense grid (see Appendix C). The timescale of the slowest relaxation process in this system is *t*_{2} ≈ 7115.3 steps and the dataset contains approximately 94 transition events.

We now consider the construction of Markov state models for this system, and in particular, the selection of the number of states, *n*, using states, $S= { s i } i = 1 n $, which evenly partition the region between *x* = − *π* and *x* = *π* into *n* equally spaced intervals,

When *n* is too low, we expect that the discretization error in the MSM will dominate, and our basis will not be flexible enough to capture the first eigenfunction of the propagator. On the other hand, because the number of parameters estimated by the MSM is proportional to *n*^{2}, we expect that for *n* too large, our models will be overfit. We therefore use 5-fold cross validation with the GMRQ to select the appropriate number of states, balancing these competing effects. The cross-validation GMRQ for the first two eigenvectors (*m* = 2, *τ* = 100 steps) of the MSMs is shown in Fig. 1(a), along with the exact value of the GMRQ. The blue training curve gives the average GMRQ over the folds when scoring the models on the *same* trajectories that they were fit with, and is simply equal to the mean sum of the first two eigenvalues of the MSMs, whereas the red curve shows the mean GMRQ evaluated on the left-out test trajectories.

The training GMRQ increases monotonically, and we note with particular emphasis that it increases *past* the exact value when using a large number of states. This indicates that the models built with more than 200 states predict *slower* dynamics than the true propagator. This effect is impossible in the limit of infinite data as demonstrated by Eq. (12)—it is a direct manifestation of overfitting and indicates why straightforward variational optimization without testing on held-out data or consideration of statistical error fails in a data-limited regime. The training set eigenvectors, the maximizers of the training set GMRQ, are actually exploiting noise in the dataset more so than modeling the propagator eigenfunctions. On the other hand, the test GMRQ displays an inverted U-shaped behavior and achieves a maximum at *k* = 61. These models thus achieve the best *predictive* accuracy in capturing the systems slow dynamics, given the finite data available.

### B. Comparison of clustering procedures: Octaalanine

What methods of MSM construction are most robustly able to capture the long-timescale dynamics of protein systems? To address this question, we performed a series of analyses of 27 molecular dynamics trajectories of terminally blocked octaalanine, a small helix forming peptide. We used 8 different methods to construct the state discretization using clustering with three distance metrics and three clustering algorithms.

For clustering, we considered three distance metrics. The first was the backbone *ϕ* and *ψ* dihedral angles. Each conformation was represented by the sine and cosine of these torsions for a total of 32 features per frame, and distances for clustering were computed using a Euclidean metric. Second, we considered the distribution of reciprocal interatomic distances (DRID) distance metric introduced by Zhou and Caflisch,^{61} using the *Cα*, *Cβ*, *C*, *N*, and *O* atoms in each residue. Finally, we considered the Cartesian minimal root mean square deviation (RMSD) using the same set of atoms per residue.^{62} We also considered three clustering algorithms; *k*-centers,^{63} a landmark version of UPGMA hierarchical clustering (see Appendix D), and *k*-means.^{64}

For each pair of distance metric and clustering algorithm (excluding *k*-means and RMSD which is incompatible),^{65} we performed 5-fold cross validation using between 10 and 500 states for the clustering. For this experiment, we heuristically chose a lag time of *τ* = 10 ps and *m* = 6, to capture the first five dynamical processes in addition to the stationary distribution. The results are shown in Fig. 2, with blue curves indicating the mean GMRQ on the training set and red curves indicating the mean performance on the held-out sets. We find that in all cases, the performance on the training set is *optimistic*, in the sense that the *ansatz* eigenvectors fit during training score more poorly when reëvaluated on held out data. Furthermore, although the training curves all continue to increase with respect to the number of states within the parameter range studied—which might be interpreted from a variational perspective as the quality of the models continually increasing—the performance on the test sets tends to peak at a moderate number of states and then decrease. We interpret this as a sign of overfitting: when the number of states is too large, with models fitting the statistical noise in the dataset rather than the underlying slow dynamical processes. Of the parameters studied, the best performance appears to be using the combination of *k*-means clustering with the dihedral distance metric, using between 50 and 200 states. We also note that *k*-centers appears to yield particularly poor models for all distance metrics, which may be rationalized on the basis that, by design, the algorithm selects outlier conformations to serve as cluster centers.^{63}

## VII. DISCUSSION

Some amount of summarization, coarse-graining, or dimensionality reduction of molecular dynamics data sets is a necessary part of their use to answer questions in biological physics. In this work, we argue that the goal of this effort should essentially be to find the dominant eigenfunctions of the system’s propagator, an unknown integral operator controlling the system’s dynamics. We show that this goal can be formulated as the variational optimization of a single scalar functional, which can be approximated using trajectories obtained from simulation and a parametric basis set. Although overfitting is a concern with finite simulation data, this risk can be mitigated by the use of cross-validation.

When the basis sets are restricted to mutually orthogonal indicator functions or linear functions of the input coordinates, this method corresponds to the existing MSM and tICA methods. Unlike previous formulations, it provides a method by which MSM and tICA solutions can be “scored” on new data sets that were not used during parameterization, making it possible to measure the generalization performance of these methods and choose the various hyperparameters required for each method, such as the number of MSM states or clustering method. Furthermore, the extension to other families of basis functions (e.g., Gaussians) is straightforward, and GMRQ provides a natural quantitative basis on which to conclude whether these new methods are superior to existing basis sets.

### A. Connections to quantum mechanics and machine learning

The variational theorem for eigenspaces in this work has strong connections to work in two other related fields: excited state electronic structure theory in quantum mechanics and multi-class Fisher discriminant analysis in machine learning. In quantum mechanics, Theorem 2 is analogous to what has been called the ensemble or trace variational principle in that field,^{66–69} which bounds the sum of the energy of the first *m* eigenstates of the Hamiltonian by the trace of a matrix of Ritz values. While the goal of finding just the ground-state eigenfunction (*m* = 1) is more common in computational quantum chemistry, the simultaneous optimization of many eigenstates is critical for many applications including band-structure calculations for materials in solid state physics.

Furthermore, in machine learning, this work has an analog in the theory multi-class Fisher discriminant analysis.^{70} Here, the goal is to find a low-rank projection of a labeled multi-class dataset which maximizes the between-class variance of the dataset while controlling the within-class variances. The optimal discriminant vectors are shown to be the first *k* generalized eigenvectors of an eigenproblem involving these two variance matrices—the problem shares the same structure as Eq. (15) in this work.^{71} We anticipate that this parallel will aid the development of improved algorithms for the identification of slow molecular eigenfunctions, especially with respect to regularization and sparse formulations.^{72,73}

### B. Comparison to likelihood maximization

While we focus on the identification of the dominant eigenfunctions of the system’s propagator, a different viewpoint is that analysis of MD should essentially entail the construction of probabilistic, generative models over trajectories, fit, for example, by maximum likelihood or Bayesian methods.

As we show in Sec. IV B, and Nüske *et al.*^{41} have shown earlier, MSMs arise naturally from a maximization of Eq. (12) when the *ansatz* eigenfunctions are constrained to be linear combinations of a set of mutually orthogonal indicator functions. However, MSMs can also be viewed directly as probabilistic models, constructed by maximizing a likelihood function of the trajectories with respect to the model parameters. This probabilistic view has, in fact, been central to the field, driving the development of improved methods, for example, in model selection,^{25,74} parameterization,^{75} and coarse-graining.^{76,77} To what extent does this imply that the variational and probabilistic views are equivalent?

In Appendix B, we show that while these two views may coincide for the particular choice of basis set with MSMs, they need not be equivalent in general. In fact, the GMRQ-optimal model formed by the first *m* eigenfunctions of the propagator need not be positivity preserving, which is essential to form a probabilistic likelihood function in the sense of Kellogg, Lange, and Baker^{74} or McGibbon, Schwantes, and Pande.^{25} On the other hand, the two views *are* tightly coupled; their connection is given by the error bounds proved by Sarich, Noé and Schütte.^{60} When the model gives a good approximation to the slow propagator eigenspace (low eigenfunction approximation error, high GMRQ), a good approximation to the long-timescale transition probabilities is obtained.

Cross validation with the log likelihood requires either a generative model for the high dimensional data, such as a hidden Markov model (HMM)^{78} or dimensionality reduction before model comparison. This is a major disadvantage, because accurate and tractable generative models for time series with tens or hundreds of thousands dimensions are not generally available. However, treating dimensionality reduction as a preprocessing and applying probabilistic models afterwards, as done by McGibbon, Schwantes, and Pande,^{25} does not enable quantitative comparison between alternative competing dimensionality reduction protocols. With the GMRQ, on the other hand, the need for a reference state decomposition or high-dimensional generative model is eliminated,^{76} and different dimensionality reduction procedures can easily be compared in a quantitative manner, as shown in Fig. 2.

## VIII. CONCLUSIONS

The proliferation of new and improved methods for constructing low-dimensional models of molecular kinetics given a set of high-resolution MD trajectories has been a boon to the field, but the lack of a unified theoretical framework for choosing between alternative models has hampered progress, especially for non-experts applying these methods to novel biological systems. In this work, we have presented a new variational theorem governing the estimation of the space formed by the span of multiple eigenfunctions of the molecular dynamics propagator. With this method, a single scalar-valued functional scores a proposed model on a supplied data set, and the use of separate testing and training data sets makes it possible to quantify and avoid statistical overfitting. During the training step, tICA and MSMs are specific instance of this method with different types basis functions. This method extends these tools, making it possible to score trained models on new datasets and to perform hyperparameter selection.

We have applied this approach to compare eight different protocols for Markov state model construction on a set of MD simulations of the octaalanine peptide. We find that of the methods tested, *k*-means clustering with the dihedral angles using between 50 and 200 states appears to outperform the other methods and that the *k*-centers cluster method can be particularly prone to poor generalization performance. To our knowledge, this work is the first to enable such quantitative and theoretically well-founded comparisons of alternative parameterization strategies for MSMs.

We anticipate that this work will open the door to more complete automation and optimization of MSM construction. Free and open source software fully implementing these methods is available in the MDTraj and MSMBuilder3 packages from http://mdtraj.org and http://msmbuilder.org, along with example and tutorials.^{79} While the lag time, *τ* and rank, *m*, of the desired model must be manually specified, other key hyperparameters that control difficult-to-judge statistical tradeoffs, such as the number of states in a MSM, can be chosen to be optimizing the cross-validation performance. Furthermore, given recent advances in automated hyperparameter optimization in machine learning, we anticipate that this search itself can be fully automated.^{80}

## Acknowledgments

This work was supported by the National Science Foundation and National Institutes of Health under Grant Nos. NIH R01-GM62868, NIH S10 SIG 1S10RR02664701, and NSF MCB-0954714. We thank the anonymous reviewers for their many suggestions for improving this work, Christian R. Schwantes, Mohammad M. Sultan, Sergio Bacallado, and Krikamol Muandet for helpful discussions, and Jeffrey K. Weber for access to the octaalanine simulations.

### APPENDIX A: PROOFS OF THEOREM2 AND LEMMA 3

*ϕ*, of $P ( \tau ) $ form a complete basis. Expand each $ f i = \u2211 a w a i \varphi a $ with coefficients

_{i}*W*∈ ℝ

^{∞×m}with

*W*= 〈

_{ni}*f*,

_{i}*ϕ*〉

_{n}_{μ−1},

*D*(λ) with

*D*= λ

_{ii}_{i}. Then, Eqs. (A3) and (A6) can be rewritten in matrix form,

*Q*, which is guaranteed to exist because

*Q*is positive definite and

*B*=

*WF*

^{−1}. Then, rearrange the objective function using the cyclic property of the trace,

*B*=

^{T}B*F*

^{−1}

*W*

^{T}WF^{−1}=

*I*. Therefore, by application of the Ky Fan theorem,

_{m}^{42,43}

*f*= {

*ϕ*

_{1},

*ϕ*

_{2}, …,

*ϕ*}.

_{m}*et al.*,

^{44}let

*f*= {

*f*

_{1},

*f*

_{2}, …,

*f*} and

_{m}*M*∈ ℝ

^{m×m}be an arbitrary invertible matrix. Define a new collection of functions

*g*= {

*g*

_{1},

*g*

_{2}, …,

*g*}, such that $ g j = \u2211 i = 1 m M i j f i $ and a matrix

_{m}*W*′ ∈

*R*

^{∞×m}such that $ W n i \u2032 =\u3008 g i , \varphi n \u3009 \mu \u2212 1 $. Expanding the matrix elements of

*W*′, we note that

*W*′ and subsequently rewritten involving

*W*and

*M*. Expansion of the matrix products and application of the cyclic property of the trace confirms that $ R P [ g ] = R P [ f ] $,

The significance of this result is that it demonstrates $ R P $ to be invariant to linear transformations of *f* which preserve the space spanned by the functions. Much like the Ritz value of an trial vector is invariant to rescaling, or the angle between two planes is invariant to linear transformations of the basis vectors defining the planes, $ R P [ f ] $ is only a functional of the space spanned by *f*. This space—the set of all *m*-dimensional linear subspaces of a vector or Hilbert space—is referred to as a *Grassmann manifold*.^{44}

### APPENDIX B: TENSION BETWEEN SPECTRAL AND PROBABILISTIC APPROACHES

Here, we show, by way of a simple analytical example, the extent to which the variational and probabilistic approaches to the analysis of molecular dynamics data are indeed distinct. By explicitly constructing the propagator eigenfunctions for a Brownian harmonic oscillator, we show that the rank-*m* truncated propagator, $ P m ( \tau ) $, built from the first *m* eigenpairs of $P ( \tau ) $ is not in general a nonnegativity-preserving operator. That is, for some valid initial distributions, *p _{t}*(

**x**), the propagated distribution, $ p \u0303 t + \tau ( m ) ( x ) = P m ( \tau ) \u2218 p t ( x ) $, fails to be non-negative throughout Ω and thus does not represent a valid probability distribution,

This indicates that variational and probabilistic approaches have the potential to be almost contradictory in what they judge to be “good” models of molecular kinetics.

Consider the diffusion of a Brownian particle in the potential *U*(*x*) = *x*^{2}. For simplicity, we take the temperature and diffusion constant to be unity. This is an Ornstein-Uhlenbeck process and the dynamics are described by the Smoluchowski equation

with infinitesimal generator $L$ given by

and stationary distribution *μ*(*x*) = *π*^{−1/2}*e*^{−x2}.

We can expand the generator in terms of its eigenfunctions, *ϕ _{n}*(

*x*), and eigenvalues,

*ξ*, defined by

_{n} which can be recognized as the Hermite equation whose solutions are related to the Hermite polynomials, *H _{n}*. For

*n*= {0, 1, …}, the solutions are

where the normalizing constants, *c _{n}*, are chosen such that 〈

*ϕ*,

_{n}*ϕ*〉

_{m}_{μ−1}=

*δ*.

_{nm}The propagator $P ( \tau ) $ can be formed by integrating Eq. (B1) with respect to *t*, giving

$P ( \tau ) $ shares the same eigenfunctions as $L$. Its eigenvalues, λ_{n}, are related to the eigenvalues of $L$ by

We now define the rank-*m* truncated propagator, $ P m ( \tau ) $, such that

Consider an initial distribution, *p _{t}*(

*x*) =

*δ*(

*x*−

*x*

_{0}), propagated forward in time by $ P m $. Let $ p \u0303 \tau ( m ) = P m ( \tau ) \u2218\delta ( x \u2212 x 0 ) $. Then, Eq. (B11) simplifies to

Consider now the specific case of *m* = 2. Using the explicit expansion *H*_{0}(*x*) = 1 and *H*_{1}(*x*) = 2*x*, we have

Note that Eq. (B13) has a zero when *x* = − *e*^{2τ}/2*x*_{0}, and that

Because of this non-positivity, $ p \u0303 \tau ( 2 ) ( x ) $ is not a valid probability distribution.

This example demonstrates that the rank-*m* truncated propagator need not, in general, preserve the positivity of distributions it acts on. Therefore, if such a model of the dynamics are fit or assessed via maximum-likelihood methods on datasets consisting of observed transitions, despite being *optimal* by spectral norm, the true rank-*m* truncated propagator may appear to give a log likelihood of −∞. The variational and probabilistic approaches to modeling molecular kinetics can indeed be very different.

### APPENDIX C: DOUBLE-WELL POTENTIAL INTEGRATOR AND EIGENFUNCTIONS

To discretize the Brownian dynamics stochastic differential equation in Eq. (25) with reflecting boundary conditions at −*π* and *π*, we used the Euler integrator,

where steps that went outside the boundaries by a given distance were reflected back into the interval with a matching distance to the boundary

We computed the propagator eigenvalues by discretizing the interval into *n* MSM states $ { s i } i = 1 n $, following Eq. (27) and computing the matrix elements without stochastic sampling. This calculation is more straightforward by working with the transition matrix *T* ∈ ℝ^{n×n},

instead of the correlation and overlap matrices, *C* and *S*, directly. Note that as shown by Nüske *et al.*^{41} and Prinz *et al.*,^{28} *T* = *S*^{−1}*C*. Thus, the eigenvalues of *T* are identical to the generalized eigenvalues of (*C*, *S*).

To calculate the matrix elements of *T*, we consider each state, *s _{i}*, represented by its left endpoint,

*x*. For each pair of states (

_{i}*i*,

*j*), we calculate the probability of the random force required to transition between them in one step, from Eq. (C1), taking into account the fact that because of the reflecting boundary conditions, the transition could have taken place via a transition outside the interval followed by a reflection.

Let *δx* be the width of the states, *δx* = *n*^{−1}2*π*. For each *i* ∈ {1, …*n*} and *k* ∈ { − *n*, …, *n*}, we calculate the action for a step from *x _{i}* to

*x*+

_{i}*kδx*,

Let $ j i k \u2032 \u2208 { 1 , \u2026 , n} $ be the index of state which contains *bc*(*x _{i}* +

*kδx*). We calculate

*T*by summing the appropriate action terms which, because of the boundary conditions, get reflected into the same ending state,

_{ij} where *δ*_{i,j} is the Kronecker delta. This calculation is implemented in the file brownian1d.py in the MSMBuilder3 package. The eigenvalues of *T* converge rapidly as *n* increases. Our results in Fig. 1 use *n* = 500.

### APPENDIX D: LANDMARK UPGMA CLUSTERING

Landmark-based UPGMA (Unweighted Pair Group Method with Arithmetic Mean) agglomerative clustering is a simple scalable hierarchical clustering which does not require computing the full matrix of pairwise distances between all data points. The procedure first subsamples *l* “landmark” data points at regular intervals from the input data. These data points are then clustered using the standard algorithm, resulting in *n* clusters.^{81} Let *S _{n}* be the set of landmark data points assigned by the algorithm to the cluster

*n*and

*d*(

*x*,

*x*′) be the distance metric employed. Then, each remaining data point in the training set as well as new data points from the test set,

*x*

^{∗}, are assigned to cluster,

*s*(

*x*

^{∗}) ∈ {1, …,

*n*}, whose landmarks are on average closest to

### APPENDIX E: SIMULATION SETUP

We performed all-atom molecular dynamics simulations of terminally blocked octaalanine (Ace-(Ala)_{8}-NHMe) in explicit solvent using the GROMACS 4 simulation package,^{82} the AMBER ff99SB-ILDN-NMR forcefield,^{83} and the TIP3P water model.^{84} The system was energy minimized, followed by 1 ns of equilibration using the velocity rescaling thermostat (reference temperature of 298 K, time constant of 0.1 ps),^{85} Parrinello-Rahman barostat (reference pressure of 1 bar, time constant of 1 ps, isotropic compressibility of 5 × 10^{−5} bar),^{86} and Verlet integrator (time step of 2 fs). Production simulations were performed in the canonical ensemble using the same integrator and thermostat. Nonbonded interactions in all cases were treated with the particle mesh Ewald method, using a real space cutoff distance for Ewald summation as well as for van der Waals interactions of 10.0 Å.^{87} Twenty six such simulations were performed, with production lengths between 20 and 150 ns each. The total aggregate sampling was 1.74 *μ*s.