The identification of meaningful reaction coordinates plays a key role in the study of complex molecular systems whose essential dynamics are characterized by rare or slow transition events. In a recent publication, precise defining characteristics of such reaction coordinates were identified and linked to the existence of a so-called transition manifold. This theory gives rise to a novel numerical method for the pointwise computation of reaction coordinates that relies only on short parallel MD simulations, yet yields an accurate approximation of the long-time behavior of the system under consideration. This article presents an extension of the method towards practical applicability in computational chemistry. It links the newly defined reaction coordinates to concepts from transition path theory and Markov state model building. The main result is an alternative computational scheme that allows for a global computation of reaction coordinates based on commonly available types of simulation data, such as single long molecular trajectories or the push-forward of arbitrary canonically distributed point clouds. It is based on a Galerkin approximation of the transition manifold reaction coordinates that can be tuned to individual requirements by the choice of the Galerkin ansatz functions. Moreover, we propose a ready-to-implement variant of the new scheme, which computes data-fitted, mesh-free ansatz functions directly from the available simulation data. The efficacy of the new method is demonstrated on a small protein system.

## I. INTRODUCTION

In recent years, it has become possible to numerically explore the chemically relevant slow transition processes in systems with several thousands of atoms. This was enabled by the increase in raw computational power and the deployment of specialized computing architectures,^{1} as well as by the development of accelerated integration schemes that bias the dynamics in favor of the slow transition processes yet preserve the original statistics.^{2–4}

To obtain chemical insight into the essential dynamics of the system, this vast amount of high-dimensional data has to be adequately processed and filtered. A desirable goal often is a simplified model of the mechanism of action, in which the fast, unimportant processes are averaged out or otherwise disregarded. One approach is to construct *kinetic* models of the system, i.e., to identify metastable reactant, product, and possibly intermediate states, and to reduce the dynamics to a jump process between them. Under certain regularity assumptions on the root model that are readily fulfilled, such a model can be built in an automated, data-driven fashion.^{5,6} However, the simplicity of the resulting so-called *Markov state model* (MSM) comes at a price: since the long-time relaxation kinetics is described just by jumps between finitely many discrete states, any information about the transition process and its dynamical features is lost.

An alternative collection of approaches, to which this paper ultimately contributes, thus aims at the automated identification of good *reaction coordinates* or *order parameters*, mappings from the full to some lower-dimensional, but still *continuous* state space, onto which the full dynamics can be projected without loss of the essential processes. Often enough, this reaction coordinate alone (i.e., without the corresponding dynamical model) already contains more valuable chemical information than the kinetic models, as, for example, the free energy profile along the reaction coordinate allows the determination of the activation energy of the respective transition process.^{7}

The systematic and mathematically rigorously motivated construction of reaction coordinates is an area of active research; for an overview, see Ref. 7. Where it is available, chemical expert knowledge can be used to guide the construction.^{8,9} In the context of *transition path theory* (TPT),^{10,11} the committor function is known to be an ideal reaction coordinate^{12} for transitions between preselected metastable sets. Related to this, approximations to the dominant eigenfunctions of the transfer operator are also often considered ideal reaction coordinates,^{6,13,14} which has been confirmed in Ref. 15 for a subclass of time scale separated systems. However, the computation of both committor functions and transfer operator eigenfunctions is infeasible for very high-dimensional systems. Moreover, the authors have recently shown that said eigenfunctions yield redundant reaction coordinates, in the sense that often a further reduction is possible.^{16}

In the same work, the authors identified necessary characteristics that reaction coordinates have to exhibit in order to retain the slow processes (a “quality criterion”). In short, it must be possible to relate them to the dominant transfer operator eigenfunctions in a specific non-linear way. However, as we will see, the criterion is also interpretable in the context of TPT.

What is more, it was shown that the existence of reaction coordinates that fulfill the quality criterion is tied to the existence of a so-called *transition manifold* $M$, a low-dimensional manifold in the function space *L*^{1}. The property that defines $M$ is that, on moderate time scales *t*_{fast} < *t* ≪ *t*_{slow}, the *transition density functions* of the dynamics concentrate around $M$. A firm mathematical theory for the existence and identification of reaction coordinates was developed around this transition manifold.

The main practical result of Ref. 16 was the insight that any parametrization of $M$ can be turned into a good reaction coordinate. A numerical algorithm was proposed that allows the pointwise computation of this reaction coordinate and only requires the ability to generate trajectories of the aforementioned moderate length that start at the desired evaluation point.

While the method has a solid theoretical foundation and is directly applicable in many cases, there still exists a certain gap between the theoretical advantages and the practical applications of the proposed scheme: While the ability to efficiently compute the reaction coordinate only at specific points is quite remarkable, in practice one often wishes to learn the reaction coordinate in *all* of the accessible state space (i.e., wherever pre-generated simulation data are available), as the location of the “interesting” points is unknown in advance. The originally proposed method cannot compute the reaction coordinate from dynamical “bulk data”—such as long equilibrated trajectories or the push-forward of point clouds that sample the canonical ensemble—which are the types of data preferably generated by contemporary simulation methods and software.

In the present work, we attempt to close this gap by proposing an alternative, purely data-driven algorithm for computing the transition manifold reaction coordinate. It is based on a classical Galerkin approximation of the reaction coordinate with freely selectable ansatz space. Its numerical realization requires only a so-called transition matrix between its discretization elements. A wide variety of techniques for building MSMs and similar algorithms are available for the construction of this matrix from the aforementioned types of bulk data.^{14,17,18} This makes it possible to transfer many techniques from the extensive toolbox of MSMs, as, for example, the use of customized Galerkin ansatz spaces explicitly adapted for molecular dynamical problems.^{19} Furthermore, this makes our approach instantly applicable whenever the construction of an (arguably less informative) MSM is possible.

Finally, with the objective to create an algorithm that requires only a minimum of *a priori* information about the system, we propose a very practical implementation of this Galerkin approximation that constructs a mesh-free set of Voronoi cell-based ansatz functions directly from the available simulation data. Interestingly, the task of optimally choosing the Voronoi centers leads to two well-known and highly scalable algorithms from data mining, namely, the k-means clustering algorithm and Poisson disk sampling algorithm, depending on the chosen error measure. We demonstrate the efficacy of this method by identifying chemically interpretable essential degrees of freedom of a 66-dimensional model of alanine dipeptide and a 1600-dimensional model of the fast-folding protein NTL9.

The paper is organized as follows: In Sec. II, the basic concepts of time scale-separated systems and reaction coordinates are introduced. Also, our central quality criteria for the characterization of good reaction coordinates are derived and a comparison with TPT is drawn. Section III introduces the concept of transition manifolds and explains the local burst-based algorithm. In Sec. IV, the new Galerkin approximation of the transition manifold reaction coordinate is derived as well as the Voronoi-based implementation. Section V demonstrates the application of our new method to a simple synthetic example system, as well as to the realistic molecular systems. Concluding remarks and an outlook can be found in Sec. VI.

## II. CHARACTERIZATION OF GOOD REACTION COORDINATES

### A. Metastable molecular dynamics

We model our molecular dynamical system as a continuous-time stochastic process *X*_{t} on some high-dimensional state space $\mathbb{X} \subset \mathbb{R}^n$. Here $\mathbb{X}$ may consist of either full Cartesian atomic coordinates or some other suitable degrees of freedom that adequately describe the micro state of the system. We require the process to fulfill common technical assumptions from the Markov approach to molecular dynamics,^{18,20} namely, Markovianity, ergodicity, and time-reversibility. Aside from that, the specific dynamical law that governs the evolution of *X*_{t} is ultimately arbitrary, but we in general think of *X*_{t} as a “random walk in a potential energy landscape.” The first example that comes to mind would be the Smoluchowski dynamics (also called *overdamped Langevin dynamics*)

$$dX_t = -\nabla V(X_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t, \tag{1}$$

where *V* denotes the potential energy function, *β* = 1/*k*_{B}*T* denotes the inverse temperature, and *W*_{t} denotes a standard Wiener process. However, our theory can also be applied to the non-overdamped Langevin dynamics (projected onto the positional degrees of freedom), or any other thermostated molecular dynamics that samples the stationary probability density

$$\rho(x) = \frac{1}{Z}\, e^{-\beta V(x)}. \tag{2}$$

Here, $Z = \int_{\mathbb{X}} e^{-\beta V(x)}\, dx$ is a normalizing constant.
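To make the dynamical model concrete, the Smoluchowski dynamics (1) can be integrated with the simple Euler–Maruyama scheme. The following sketch uses a one-dimensional double-well potential *V*(*x*) = (*x*² − 1)² as an assumed stand-in system; it is not one of the examples treated later in this paper.

```python
import numpy as np

def simulate_smoluchowski(grad_V, x0, beta, dt, n_steps, rng):
    """Euler-Maruyama discretization of dX_t = -grad V(X_t) dt + sqrt(2/beta) dW_t."""
    x = np.array(x0, dtype=float)
    traj = np.empty((n_steps + 1, x.size))
    traj[0] = x
    noise_scale = np.sqrt(2.0 * dt / beta)
    for k in range(n_steps):
        x = x - grad_V(x) * dt + noise_scale * rng.standard_normal(x.size)
        traj[k + 1] = x
    return traj

# Assumed example: 1D double well V(x) = (x^2 - 1)^2, so grad V(x) = 4x(x^2 - 1)
rng = np.random.default_rng(0)
traj = simulate_smoluchowski(lambda x: 4 * x * (x**2 - 1), [1.0],
                             beta=2.0, dt=1e-3, n_steps=10000, rng=rng)
```

For small *dt*, the trajectory samples the stationary density (2) of the double well in the long run.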

### B. Reaction coordinates

Formally, a reaction coordinate is a low-dimensional variable of the full system, i.e., a smooth function $\xi: \mathbb{R}^n \to \mathbb{R}^k$ with *k* ≪ *n*. In practice, *k* will often be only one or two, and *ξ* will correspond to some chemically interpretable quantity, e.g., a certain collection of backbone dihedral angles in a peptide, or the distance between important functional groups. The *reduced* or *projected system* is then given by *ξ*(*X*_{t}), which is now a stochastic process on $\mathbb{R}^k$.
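As a concrete illustration of such a chemically interpretable coordinate, the sketch below evaluates a dihedral angle from four atom positions via the standard atan2 construction; the coordinates are hypothetical, and in a peptide *ξ* would collect, e.g., the backbone angles φ and ψ.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral angle (radians) defined by four atom positions -- a typical
    one-dimensional, chemically interpretable reaction coordinate."""
    b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond b1
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    return np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w))

# xi maps the full 3N-dimensional state to a single angle in (-pi, pi]
atoms = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [1., 1., 1.]])
phi = dihedral(*atoms)
```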

While the map $x \mapsto \xi(x)$ describes the pointwise projection of the system, the projection of densities that evolve with the system is described by the *Zwanzig projection operator*,^{21} denoted by $Q_\xi$,

$$\big(Q_\xi u\big)(z) = \frac{1}{W(z)} \int_{\mathbb{X}} \delta_z\big(\xi(x)\big)\, u(x)\, dx,$$

where *δ*_{z} is the delta distribution and *W*(*z*) is a normalization term. Its action can be described as follows: if for some time *t* the random variable *X*_{t} is distributed according to some density *p*_{t}, then the random variable *ξ*(*X*_{t}) is distributed according to $Q_\xi p_t$, which is a density over $\mathbb{R}^k$.
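Numerically, the action of $Q_\xi$ can be approximated directly from samples: if a point cloud samples *p*_{t}, then its image under *ξ* samples $Q_\xi p_t$. A minimal sketch, assuming a Gaussian sample cloud in $\mathbb{R}^2$ and *ξ* chosen as the projection onto the first coordinate:

```python
import numpy as np

# Assumed setup: p_t is sampled by a Gaussian point cloud in R^2,
# and xi projects a state onto its first coordinate.
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mean=[0.0, 0.0],
                                  cov=[[1.0, 0.3], [0.3, 0.5]], size=50000)
xi = lambda x: x[:, 0]   # reaction coordinate xi: R^2 -> R

# A histogram of xi(X_t) approximates the projected density Q_xi p_t
hist, edges = np.histogram(xi(samples), bins=50, range=(-4, 4), density=True)
```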

By the definition above, any function over $\mathbb{R}^n$ may be called a reaction coordinate. Thus, one of the key questions we aim to answer in this article is as follows: what criterion distinguishes “good” from “bad” reaction coordinates?

In many MD systems, the chemically interesting reaction processes correspond to transitions between two or more *metastable states*, regions of state space that “trap” the dynamics for long times before a sudden transition to another metastable state occurs. Typical examples include protein- and peptide folding, receptor-ligand binding, and conformational change of large biomolecules. It is customary to picture these transitions as occurring along certain *transition pathways* in the potential energy landscape, but there is no uniformly accepted definition of these pathways. Proposed variants include the minimum energy path,^{22} minimum free energy path,^{23} and the principal curve.^{24} By a first intuitive definition, reaction coordinates should thus describe the “progress of the reaction along the transition pathway.” A common computational scheme thus goes as follows:

1. Compute the transition pathway (using, for example, the string method^{25}).
2. Parameterize the transition pathway.
3. Project the state space onto the transition pathway.

The value of the parametrization at a projected point then gives the reaction coordinate value at that point.

However, due to the ambiguous concept of transition pathways, this approach lacks rigor. Variants of transition pathways that are based on local features of the energy landscape only, such as the minimum energy pathway, can be shown to fail to describe the (global) slow transition processes.^{24} Moreover and most importantly, the question of how to globally project state space onto the transition path in a “dynamically correct” way remains unanswered. A nearest-point projection, as is, for example, used in the definition of principal curves, can be shown to fail with simple example systems.

Thus, in order to find a rigorous criterion for good reaction coordinates, we need to take a closer look at the “global” stochastic evolution of *X*_{t} and its slow parts. We will, however, eventually come back to the picture of potential energy surfaces and interpret our criterion with regard to transition pathways (see Example 1.1).

### C. The transfer operator

Regardless of the specific dynamical model, the stochastic evolution of *X*_{t} is entirely described by its transition probability density *p*^{t}: given a starting point *x*, the probability density for finding the system at some point *y* after time *t* ≥ 0 is denoted by *p*^{t}(*x*, *y*),

$$X_0 = x \;\Longrightarrow\; X_t \sim p^t(x, \cdot),$$

where “∼” means “distributed according to.” *p*^{t}(*x*, ·) can be estimated by starting a large number of parallel simulations of the stochastic dynamics, all with starting point *x*, and estimating the resulting end point density (for example, using histogram or kernel density estimation methods).
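This burst-based estimation can be sketched as follows, again assuming the one-dimensional double well *V*(*x*) = (*x*² − 1)² as a stand-in system and a histogram as the density estimator:

```python
import numpy as np

def burst_endpoints(x0, grad_V, beta, dt, n_steps, M, rng):
    """End points of M independent Euler-Maruyama simulations started at x0;
    these are samples of the transition density p^t(x0, .)."""
    x = np.full(M, x0, dtype=float)
    noise = np.sqrt(2.0 * dt / beta)
    for _ in range(n_steps):
        x += -grad_V(x) * dt + noise * rng.standard_normal(M)
    return x

# Assumed example: 1D double well, burst started at the barrier top x = 0
rng = np.random.default_rng(2)
ends = burst_endpoints(0.0, lambda x: 4 * x * (x**2 - 1),
                       beta=4.0, dt=1e-3, n_steps=2000, M=1000, rng=rng)
# Histogram estimate of p^t(0, .)
p_est, edges = np.histogram(ends, bins=40, range=(-2.5, 2.5), density=True)
```

Started at the barrier top, the estimated density is bimodal: the bursts fall into both wells.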

With *p*^{t}, the evolution of a general starting density *X*_{0} ∼ *u*_{0} can then be expressed as

$$u_t(y) = \int_{\mathbb{X}} p^t(x, y)\, u_0(x)\, dx \;=:\; \big(\mathcal{P}^t u_0\big)(y).$$

The operator $\mathcal{P}^t$ is known as the *Perron-Frobenius operator* or *transfer operator* of the system. In the case of the Smoluchowski dynamics (1), it is equal to the solution operator of the associated Fokker-Planck equation.

We see that *p*^{t} and, by extension, $\mathcal{P}^t$ describe the complete stochastic evolution. While the analytical derivation of *p*^{t} and $\mathcal{P}^t$ is possible only for the simplest of systems (for example, for the Ornstein-Uhlenbeck process^{26}), they will play the central role in the description of slow sub-processes of the dynamics and the computation of optimal reaction coordinates.

*Remark.* The evolution of *observables* of the system is described by the closely related *Koopman operator*, defined by

$$\big(\mathcal{K}^t \eta_0\big)(x) = \mathbb{E}\big[\eta_0(X_t) \,\big|\, X_0 = x\big] = \int_{\mathbb{X}} p^t(x, y)\, \eta_0(y)\, dy, \tag{3}$$

i.e., $(\mathcal{K}^t \eta_0)(x)$ is the conditional expectation of *η*_{0}(*X*_{t}), provided we started in *x* at time *t* = 0.

### D. Dominant time scales

Under fairly general conditions, it can be shown that the spectrum of $\mathcal{P}^t$ consists of discrete real eigenvalues

$$1 = \lambda_0^t > \lambda_1^t \ge \lambda_2^t \ge \cdots,$$

and that the eigenvalue *λ*_{0} = 1 is simple and belongs to the eigenfunction *ρ* (the stationary density). We denote by $v_i$ the eigenfunction belonging to $\lambda_i^t$. Except for *λ*_{0}, all eigenvalues decay exponentially as *t* → *∞*, which corresponds to the relaxation of the process towards the stationary ensemble, regardless of the starting density. The relaxation rate of the *i*th slowest process, known as the *i-th implied time scale*,^{18} is given by

$$t_i = -\frac{t}{\ln \lambda_i^t}. \tag{4}$$
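In a discretized (MSM) setting, the $\lambda_i^t$ become the eigenvalues of a row-stochastic transition matrix estimated at lag time *t*, and (4) applies verbatim. A small sketch with an assumed 3-state toy matrix:

```python
import numpy as np

# Assumed toy example: reversible 3-state chain, lag time t = 1
t_lag = 1.0
P = np.array([[0.98, 0.02, 0.00],
              [0.01, 0.98, 0.01],
              [0.00, 0.02, 0.98]])

evals = np.sort(np.linalg.eigvals(P).real)[::-1]   # 1 = lambda_0 > lambda_1 >= ...
timescales = -t_lag / np.log(evals[1:])            # implied time scales, Eq. (4)
```

For this matrix the eigenvalues are 1, 0.98, and 0.96, so the two implied time scales are roughly 49.5 and 24.5 lag units.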

From now on, we assume the system to possess *d* slow sub-processes, typically (but not necessarily) corresponding to the rare transitions between *d* metastable sets, and that we are primarily interested in accurately describing these slow processes. In this case, the dominant *d* + 1 eigenvalues $\{\lambda_0^t, \ldots, \lambda_d^t\}$ will be positive and separated from the remaining eigenvalues by a *spectral gap*, i.e., $\lambda_d^t \gg \lambda_{d+1}^t$. We can then express the action of the operator $\mathcal{P}^t$, and thus the stochastic evolution of the process, in terms of the dominant eigenfunctions,

$$\mathcal{P}^t u_0 \approx \sum_{i=0}^{d} \lambda_i^t\, c_i\, v_i,$$

where $c_i = \int u_0\, v_i$. This means that the information about the long-term evolution of the slow processes is entirely contained in the *d* dominant eigenpairs $(\lambda_i^t, v_i)$. Consequently, we consider the preservation of the dominant eigenpairs under projection onto the reaction coordinate a suitable objective for optimally choosing the reaction coordinate.
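In a finite-state analogue, this truncation can be checked directly: propagate a distribution with a row-stochastic matrix and compare against the reconstruction from the dominant eigenpairs alone. The 3-state matrix below is an assumed toy example with one slow and one fast relaxation process.

```python
import numpy as np

# Assumed toy example: reversible 3-state chain (eigenvalues 1, 0.98, 0.96)
P = np.array([[0.98, 0.02, 0.00],
              [0.01, 0.98, 0.01],
              [0.00, 0.02, 0.98]])

evals, R = np.linalg.eig(P.T)           # columns of R play the role of the v_i
order = np.argsort(evals.real)[::-1]
evals, R = evals.real[order], R.real[:, order]
L = np.linalg.inv(R)                    # dual basis, so that P.T = R @ diag(evals) @ L

u0 = np.array([1.0, 0.0, 0.0])          # initial distribution
n = 200                                 # number of lag steps
u_exact = np.linalg.matrix_power(P.T, n) @ u0

# Truncated expansion over the d + 1 = 2 dominant eigenpairs only
u_trunc = sum(evals[i] ** n * (L[i] @ u0) * R[:, i] for i in range(2))
```

After 200 steps the fastest mode (eigenvalue 0.96) has decayed to roughly 3 × 10⁻⁴, so the two dominant eigenpairs already reproduce the propagated distribution.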

*Remark.*

The dominant eigenpairs of the transfer operator are also the primary object of interest in the Markov approach to coarse graining molecular dynamics, as mentioned in the Introduction. Here, the goal is to use the eigenfunctions to build a discrete *Markov State Model* (MSM),^{5,6,27} which replaces the original molecular dynamics by a finite-state Markov jump process between the metastable states. Though all information about the transition regions and paths is lost by this approach, the long-time transition rates between the states are preserved. These models have been successfully applied to a wide range of real-life molecular systems.^{13,27–29}

The reaction coordinate we will ultimately define and compute will preserve the dominant eigenfunctions; thus, the projected process $\xi(X_t)$ also still contains all the information about the long-term transition processes. In this sense, the motivations of our approach and the MSM approach are deeply linked.

### E. A criterion for good reaction coordinates

Our investigation so far points out an apparent discrepancy in the current understanding of what criterion defines “good” reaction coordinates: On the one hand, it is a common perception that good reaction coordinates should parameterize some sort of transition pathway, along which a reaction event progresses “most likely.” On the other hand, if one is interested in the long-term behavior of the system, the projection onto the reaction coordinate must preserve the slowest processes; so, a definition based on the dominant eigenpairs of the transfer operator seems natural. However, this second requirement is applicable to very general and not necessarily metastable systems and thus does not even require the existence of a transition pathway in the classical sense.

We will now see that these two viewpoints can still be unified and that there exists a criterion for good reaction coordinates based on the transfer operator that also leads to the parametrization of the transition pathway.

Let the *projected transfer operator*, transporting probability densities of the projected process *ξ*(*X*_{t}), be denoted by $\mathcal{P}_\xi^t$. Let $(\mu_i^t, w_i)$ denote the eigenpairs of $\mathcal{P}_\xi^t$. By the preceding reasoning, we now call *ξ* a *good reaction coordinate* if for the dominant eigenpairs holds

$$\mu_i^t \approx \lambda_i^t, \quad i = 0, \ldots, d,$$

i.e., the full and projected dominant eigenvalues are similar, and

$$v_i \approx w_i \circ \xi, \quad i = 0, \ldots, d, \tag{CI}$$

i.e., the eigenfunctions of $\mathcal{P}^t$ can be approximately reconstructed from the eigenfunctions of $\mathcal{P}_\xi^t$ and *ξ*. This way, all information about the *d* slowest processes is contained in *ξ*(*X*_{t}).

It has been shown in Ref. 16 that (5) follows from (CI), so (CI) is a sufficient criterion for good reaction coordinates (in the sense of preserving the long time scales). If the approximation in (CI) is sharp, we call *ξ* an *optimal reaction coordinate*.

The first idea that comes to mind is to define the reaction coordinate directly via the dominant eigenfunctions (weighted by the stationary density for technical reasons),

$$\xi(x) = \left(\frac{v_1(x)}{\rho(x)}, \ldots, \frac{v_d(x)}{\rho(x)}\right)^{\!\top}. \tag{6}$$

This reaction coordinate is indeed optimal, as was shown in Ref. 16. The authors of Ref. 15 have also identified (6) as an ideal reaction coordinate, though only for a narrower sub-class of time scale-separated systems. However, there are two major practical disadvantages in choosing the eigenfunctions as reaction coordinates that ultimately prevent us from computing and using them:

1. The eigenproblem is global and thus prohibitively expensive to solve numerically in high dimensions. If we wish to compute the value of an eigenfunction at only a single position in $\mathbb{X}$, we need an approximation of $\mathcal{P}^t$ that is accurate on all of $\mathbb{X}$. There have been attempts to mitigate this, but the conceptual problem remains.

2. The eigenfunction reaction coordinate is often redundant. In systems where the slow processes correspond to the transitions between *d* metastable sets, i.e., *d* potential wells, (6) would define a *d*-dimensional reaction coordinate. However, in practice, many of these potential wells often lie along the same transition path, and consequently the transitions between those wells would be describable by just a one-dimensional reaction coordinate. See the example in Fig. 1 for an illustration.

We will now reformulate criterion (CI) in a way that addresses the concerns above and at the same time makes it compatible with the transition path intuition of reaction coordinates. Consider a reaction coordinate *ξ* of some dimension *r* ≤ *d*, fulfilling (CI). Furthermore, assume that for each starting point *x*, we can write the transition density *p*^{t}(*x*, ·) as a linear combination of the eigenfunctions $v_i$,

$$p^t(x, \cdot) = \sum_{i=0}^{\infty} d_i(x, t)\, v_i,$$

where *p*^{t}(*x*, ·) denotes the *y*-dependent function *p*^{t}(*x*, *y*) for fixed *x*. It can be shown (see Appendix A) that the prefactors are again connected to the eigenpairs: $d_i(x, t) = \lambda_i^t\, v_i(x)/\rho(x)$. As we are still interested only in long lag times *t*, where the non-dominant eigenvalues have already decayed, we can truncate the series,

$$p^t(x, \cdot) \approx \sum_{i=0}^{d} \lambda_i^t\, \frac{v_i(x)}{\rho(x)}\, v_i.$$

Finally, we use that *ξ* fulfills the criterion (CI) and that *ρ* is the 0-th dominant eigenfunction of $\mathcal{P}^t$, and get

$$p^t(x, \cdot) \approx \sum_{i=0}^{d} \lambda_i^t\, \frac{w_i\big(\xi(x)\big)}{w_0\big(\xi(x)\big)}\, v_i.$$

The right-hand side of this equation depends only on the reaction coordinate value *ξ*(*x*) and not on the full state space coordinate *x*. This means that the left-hand side also can depend only on *ξ*. Thus, in order for *ξ* to be a good reaction coordinate, the transition density function *p*^{t}(*x*, ·) must depend only on the *r*-dimensional *ξ*(*x*) and not on the full *n*-dimensional *x*. We thus get the following equivalent criterion: *ξ* is a good reaction coordinate if and only if

$$p^t(x, \cdot) \approx \tilde{p}^t\big(\xi(x)\big) \tag{CII}$$

for some function $\tilde{p}^t$, all *x*, and “intermediate” lag times *t*. Intermediate here means that *t* must be larger than the equilibration time scale of the fast processes, but can be chosen much smaller than the equilibration time scale of the slow processes. In terms of the implied time scales (4), this reads *t*_{d} > *t* > *t*_{d+1}.

### F. Connection to transition path theory

Unlike (CI), criterion (CII) now allows an interpretation in the context of Transition Path Theory (TPT). To be precise, we argue that the committor function, which is seen in TPT as the optimal reaction coordinate,^{30} fulfills criterion (CII).

In a system with two metastable sets *A* and *B*, the forward committor function *q*_{A}(*x*) is defined as the probability that the process *X*_{t} first visits *A* rather than *B*, given the starting point *X*_{0} = *x*. For a starting point outside the metastable sets, and for “intermediate” lag times *t* as required by (CII), the probability of finding the system in one of the metastable sets after time *t* is essentially 1, as the process quickly leaves the transition region. Moreover, the system equilibrates quickly inside the metastable sets. Thus, the transition density essentially depends only on whether it is more likely to find the evolved system in *A* or in *B*,

$$p^t(x, \cdot) \;\approx\; c_A^t(x)\, \rho_A + c_B^t(x)\, \rho_B, \qquad \rho_A := \frac{\mathbb{1}_A\, \rho}{\int_A \rho}, \quad \rho_B := \frac{\mathbb{1}_B\, \rho}{\int_B \rho}.$$

Here, $\mathbb{1}_A$ denotes the indicator function over *A* and $c_A^t(x),\ c_B^t(x)$ are the probabilities to find the evolved system in *A* and *B*, respectively,

$$c_A^t(x) = \int_A p^t(x, y)\, dy, \qquad c_B^t(x) = \int_B p^t(x, y)\, dy.$$

As we have chosen *t* as *intermediate*, i.e., so short that it is unlikely to leave a metastable set within time *t* once it has been reached, $c_A^t(x)$ is essentially equal to the committor function. Thus we have

$$p^t(x, \cdot) \;\approx\; q_A(x)\, \rho_A + \big(1 - q_A(x)\big)\, \rho_B,$$

where we see that the right-hand side now depends only on *q*_{A}(*x*) and not on the full value *x*. With the function

$$\tilde{p}^t(s) := s\, \rho_A + (1 - s)\, \rho_B,$$

the reaction coordinate *ξ*(*x*) ≔ *q*_{A}(*x*) thus fulfills criterion (CII). Our new criterion thus confirms the committor function as a good reaction coordinate in the sense of preserving the slow transition process between the metastable sets.
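In a finite-state (MSM) setting, *q*_{A} can be computed exactly by solving a linear system: *q* = 1 on *A*, *q* = 0 on *B*, and (*Pq*)_{i} = *q*_{i} on the transition states. A sketch with an assumed 5-state nearest-neighbour chain:

```python
import numpy as np

def committor(P, A, B):
    """Probability q_A(i) that the chain hits set A before set B, per state i.
    Solves (P q)_i = q_i on transition states, with q = 1 on A and q = 0 on B."""
    n = P.shape[0]
    q = np.zeros(n)
    q[list(A)] = 1.0
    C = [i for i in range(n) if i not in A and i not in B]  # transition states
    M = np.eye(len(C)) - P[np.ix_(C, C)]
    b = P[np.ix_(C, list(A))].sum(axis=1)
    q[C] = np.linalg.solve(M, b)
    return q

# Assumed example: symmetric 5-state birth-death chain, A = {0}, B = {4}
P = np.zeros((5, 5))
for i in range(1, 4):
    P[i, i - 1] = P[i, i + 1] = 0.5
P[0, 0] = P[4, 4] = 1.0
q = committor(P, A={0}, B={4})
```

For the symmetric walk, the committor interpolates linearly between the two sets, *q* = (1, 0.75, 0.5, 0.25, 0).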

Note, however, that while the definition of committor functions depends on the existence (and the knowledge) of metastable states, criterion (CII) can also be applied in systems where the slowest processes do not correspond to transitions between metastable states (such as systems with explicit time scale separation). Criterion (CI) does not even require a spectral gap at all, i.e., reaction coordinates fulfilling (CI) will preserve the *d* slowest processes even if the subsequent processes live on similar time scales. Thus, our theory offers a much more general characterization of good reaction coordinates that, however, agrees with the concept of committor functions in the special cases where the latter is applicable. What is more, the usage of committor functions as reaction coordinates is susceptible to the same computational problems as transfer operator eigenfunctions that were detailed in Sec. II E.

The following example demonstrates that criterion (CII) can formally distinguish “intuitively good” from “intuitively bad” reaction coordinates:

*Example 1.1.* Consider a two-dimensional model system with a transition pathway between two metastable sets at inverse temperature *β* = 0.5, together with two candidate one-dimensional reaction coordinates *ξ* and *ζ*. The coordinate *ξ* is injective along the transition pathway, i.e., no two points on the transition pathway take the same value under *ξ*; furthermore, the isolines (sets of constant value) of *ξ* intersect the transition pathway perpendicularly. Such a *ξ* was identified in Ref. 31 as the ideal reaction coordinate and can also be considered “intuitively good” from the standpoint of transition pathways. Note, however, that *ξ* is not equal to the committor function. The coordinate *ζ*, in contrast, lacks these properties and will serve as an “intuitively bad” reaction coordinate.

We can distinguish *ξ* and *ζ*, without any knowledge of the transition pathway, the metastable sets, or the potential, by considering only the transition density functions *p*^{t}(*x*, ·) along their isolines. We observe that for different starting points along any isoline of *ξ*, the densities *p*^{t}(*x*, ·) for intermediate lag time *t* look very similar. That means that *p*^{t}(*x*, ·) effectively depends only on *ξ*(*x*). The same property does not hold for the bad reaction coordinate *ζ*: here the densities *p*^{t}(*x*, ·) differ substantially for starting points along a single isoline. In conclusion, *ξ* fulfills criterion (CII), whereas *ζ* does not, i.e., criterion (CII) can distinguish good from bad reaction coordinates (in this example).

*Summary.*

The equivalent criteria (CI) and (CII) allow for a rigorous characterization of good reaction coordinates such that the long time scales of the full molecular system are inherited by the projection of the dynamics onto the low-dimensional state space spanned by the reaction coordinates. At the same time, these criteria agree with, but extend and refine, the understanding of good reaction coordinates that is prevalent in transition path theory.

## III. NUMERICAL COMPUTATION OF GOOD REACTION COORDINATES

### A. The transition manifold

From now on, we always assume *t* to be an “intermediate” lag time as required by criterion (CII). This criterion implies that two transition density functions *p*^{t}(*x*_{1}, ·) and *p*^{t}(*x*_{2}, ·) are close to each other for two points *x*_{1}, *x*_{2} of similar reaction coordinate value, even if *x*_{1} and *x*_{2} themselves are not close. We will now render this “neighborhood relation” of densities more precise and exploit it in order to efficiently compute good reaction coordinates.

For each state space point *x*, the transition density *p*^{t}(*x*, ·) is a function in the infinite-dimensional function space *L*^{1}, i.e., the space of absolutely integrable functions. However, the insight that *p*^{t}(*x*, ·) effectively depends only on *ξ*(*x*), i.e., an *r*-dimensional coordinate, implies that the set of all transition density functions,

$$M := \big\{\, p^t(x, \cdot) \;\big|\; x \in \mathbb{X} \,\big\} \subset L^1,$$

effectively forms an only *r*-dimensional manifold in this function space. In the common case of *r* = 1, $M$ is effectively a curve in *L*^{1}. We call $M$ the *transition manifold* of the system.

*Remark.*

While there is a connection between transition path theory and transition manifolds as shown in Sec. II E, to the best of the authors’ knowledge, there is no formal equivalence between the transition manifold and any existing definition of transition pathway.

Moreover, it was shown in Ref. 16 that the reaction coordinate defined as $\xi(x) := \mathcal{E}\big(p^t(x, \cdot)\big)$, where $\mathcal{E}$ is any parametrization of $M$, fulfills the quality criterion and hence preserves the slow time scales.

### B. Embedding of the transition manifold

In order to find a parametrization of the transition manifold $M$, we employ the general-purpose *Diffusion Maps* manifold learning algorithm.^{32,33} Explaining the algorithm in detail would go well beyond the scope of this article, so we only coarsely sketch its usage: Let a sufficiently large collection of data points {*z*_{1}, …, *z*_{M}} on or near a manifold in some vector space be given. The algorithm then detects the dimension *r* of this manifold and returns for each data point *z* an *r*-dimensional vector $\big(\mathcal{E}_1(z), \ldots, \mathcal{E}_r(z)\big)^\top$ that represents a parametrization $\mathcal{E}$ of the manifold, evaluated at *z*. Application of Diffusion Maps requires the choice of a certain kernel bandwidth parameter that essentially determines what distance should be considered “far away.” We assume from now on that this parameter can be chosen reliably; an optimal strategy has been detailed in Ref. 34.
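The following is a minimal self-contained sketch of the Diffusion Maps construction (Gaussian kernel, density normalization, row normalization, eigendecomposition), applied to an assumed test data set on a circle; production use should rely on an established implementation together with the bandwidth strategy of Ref. 34.

```python
import numpy as np

def diffusion_maps(Z, eps, n_coords):
    """Minimal Diffusion Maps sketch: Gaussian kernel with bandwidth eps,
    density-normalized and row-normalized to a Markov matrix; the dominant
    non-trivial eigenvectors parameterize the manifold underlying the rows of Z."""
    D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-D2 / eps)
    d = K.sum(axis=1)
    K = K / np.outer(d, d)                    # alpha = 1 density normalization
    P = K / K.sum(axis=1, keepdims=True)      # row-stochastic diffusion matrix
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(evals.real)[::-1]
    evecs = evecs.real[:, order]
    return evecs[:, 1:1 + n_coords]           # skip the trivial constant eigenvector

# Assumed test data: points on a circle embedded in R^3 (a closed 1D manifold)
rng = np.random.default_rng(3)
theta = np.sort(rng.uniform(0, 2 * np.pi, 300))
Z = np.column_stack([np.cos(theta), np.sin(theta), np.zeros(300)])
coords = diffusion_maps(Z, eps=0.1, n_coords=2)
```

For a closed curve such as this circle, two diffusion coordinates are needed to parameterize the manifold; for an open curve, the first non-trivial coordinate alone already orders the points along the curve.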

The Diffusion Maps algorithm, in principle, works in arbitrary metric spaces, as it requires only an appropriate notion of distance between data points. We will, however, not attempt to parameterize the transition manifold directly in *L*^{1}, as the calculation of distances between *L*^{1} functions is numerically costly. Instead, we will first *embed* the transition manifold $M$ into a Euclidean space and use the standard Euclidean distance there.

Surprisingly, constructing such an embedding requires virtually no knowledge about $M$. Let $\mathcal{F}: L^1 \to \mathbb{R}^{2r+1}$ be an *arbitrarily chosen* map from the function space *L*^{1} to the Euclidean space of dimension 2*r* + 1 (or greater), where *r* is the dimension of the transition manifold. Then—slightly simplified—the famous Whitney embedding theorem^{35,36} states that for any such $\mathcal{F}$, the probability of $\mathcal{F}(M)$ again being an *r*-dimensional manifold in $\mathbb{R}^{2r+1}$ is *exactly one*. For the purpose of this article, this means that we can effectively choose $\mathcal{F}$ randomly—if only its image dimension is large enough—and be sure that the manifold structure of $M$ is preserved under $\mathcal{F}$. We can then compute a parametrization of the *embedded* manifold $\mathcal{F}(M)$ using the Diffusion Maps algorithm, which then corresponds to a parametrization of the original manifold $M$. A sketch of the overall embedding procedure is shown in Fig. 3.

Specifically, we will work with the 2*r* + 1 embedding functions

$$\mathcal{F}_i(p) := \int_{\mathbb{X}} \eta_i(y)\, p(y)\, dy, \quad i = 1, \ldots, 2r + 1, \tag{9}$$

with linear observables

$$\eta_i(y) = \sum_{j=1}^{n} a_{ij}\, y_j, \tag{10}$$

where the factors *a*_{ij} are chosen randomly (e.g., uniformly drawn from the interval [0, 1]). Note, however, that we have great freedom in the choice of the functions *η*_{i}. Linear functions were chosen simply out of convenience.

We see immediately that, by this choice of the embedding, the embedded density (9) then is the Koopman operator (3) applied to *η*_{i}, i.e., the expectation value of *η*_{i} under the evolved dynamics,

$$\mathcal{F}_i\big(p^t(x, \cdot)\big) = \int_{\mathbb{X}} \eta_i(y)\, p^t(x, y)\, dy = \big(\mathcal{K}^t \eta_i\big)(x).$$

The right-hand side can now be computed numerically by a simple Monte Carlo sampling procedure. Let $\Phi_j^t(x),\ j = 1, \ldots, M$, denote the end points of *M* independent trajectories of length *t*, all starting in *x*. Then

$$\big(\mathcal{K}^t \eta_i\big)(x) \approx \frac{1}{M} \sum_{j=1}^{M} \eta_i\big(\Phi_j^t(x)\big).$$
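The embedding of a single point *x* can thus be estimated from burst end points alone. In the sketch below, the end points are stand-in Gaussian samples (in a real application they would come from MD bursts), and the dimensions *n* = 2, *r* = 1 are chosen purely for illustration:

```python
import numpy as np

# Random linear observables eta_i(y) = sum_j a_ij y_j, Eq. (10);
# toy dimensions n = 2 and r = 1, so 2r + 1 = 3 observables.
rng = np.random.default_rng(4)
n, r = 2, 1
A = rng.uniform(0.0, 1.0, size=(2 * r + 1, n))   # factors a_ij

def embed(endpoints, A):
    """Monte Carlo estimate of the embedded density: average the observables
    eta over the end points Phi_j^t(x) of M bursts started at x."""
    return (endpoints @ A.T).mean(axis=0)        # vector in R^(2r+1)

# Stand-in "end points" of M = 500 bursts (Gaussian samples for illustration)
endpoints = rng.normal(loc=[1.0, -0.5], scale=0.1, size=(500, n))
z = embed(endpoints, A)
```

Since the observables are linear, the embedded point is simply *A* applied to the empirical mean of the end points.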

Thus, we create the data points in $\mathbb{R}^{2r+1}$ that we apply the Diffusion Maps algorithm to as follows:

*Remark.*

The above algorithm requires the knowledge of two intrinsic parameters of the system: (1) the “intermediate” lag time *t*, in order to simulate trajectories of the right length, and (2) the expected dimension *r* of the reaction coordinate, in order to choose the right number of embedding observables. For both quantities, rough estimates can be used in practice.

The weak requirement $t_{\text{slow}} > t > t_{\text{fast}}$ on the lag time $t$ permits a high tolerance with respect to numerical errors. Rough Markov models can, for example, be used to estimate the time scales. Also, in real-life chemical systems, one often has a general idea about the nature of the fast and slow processes (e.g., whether one is interested in the re-configuration of individual dihedral angles or in the forming of higher-level structures) that can guide the choice.

For the dimension *r*, an iterative procedure can be used: first start with a low estimate for *r* (e.g., *r* = 1) and perform Algorithm 1. If the chosen *r* was equal to or higher than the correct dimension of the transition manifold, the Diffusion Maps algorithm should detect an *r*- or lower-dimensional manifold in the embedded data points. If it fails to do so, increase *r* by choosing additional observables and restart the embedding procedure. This strategy generates little overhead, as the simulations $\Phi_j^t(x_i)$ and the previously embedded points can be reused.

**Algorithm 1.**

Input: —

1: Choose $L$ points $\{x_1, \ldots, x_L\}$ that cover the relevant parts of state space, i.e., the metastable sets and the transition regions.

2: Choose the factors $a_{ij}$ in (10), e.g., uniformly randomly in [0, 1].

3: For each $x_i$, simulate $M$ independent trajectories of length $t$. Let the end points be denoted by $\Phi_j^t(x_i)$.

4: Compute the data points in $\mathbb{R}^{2r+1}$ as
$$z_i \leftarrow \frac{1}{M} \sum_{j=1}^{M} \eta\big(\Phi_j^t(x_i)\big).$$

5: Apply the Diffusion Maps algorithm to $\{z_1, \ldots, z_L\}$.

Output: Approximation to the $r$-dimensional reaction coordinate (8), evaluated at the points $\{x_1, \ldots, x_L\}$, i.e., $\{\xi(x_1), \ldots, \xi(x_L)\}$.

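To make the procedure concrete, steps 2–4 of Algorithm 1 can be sketched in a few lines of NumPy. The `simulate` argument is a placeholder for an MD integrator returning the endpoints $\Phi_j^t(x)$ of $M$ independent trajectories started in $x$; it is an assumption of this sketch, not part of the method itself.

```python
import numpy as np

def embed_points(simulate, X, r, t, M, seed=0):
    # simulate(x, t, M): hypothetical sampler returning an (M, d) array of
    # endpoints Phi_j^t(x) of M independent trajectories started in x
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A = rng.uniform(0.0, 1.0, size=(2 * r + 1, d))  # random factors a_ij of (10)
    Z = np.empty((len(X), 2 * r + 1))
    for i, x in enumerate(X):
        ends = simulate(x, t, M)
        # Monte Carlo average of the linear observables eta(y) = A y
        Z[i] = (ends @ A.T).mean(axis=0)
    return Z
```

The rows of the returned array are the points $z_i$ to which the Diffusion Maps algorithm is subsequently applied.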

Alternatively, assuming the rough Markov model mentioned above can correctly identify the number *d* of dominant time scales, this can be used as an upper bound for *r*. Even if *d* vastly overestimates *r*, the final reaction coordinates *ξ* (after application of the Diffusion Maps algorithm) will have the correct low dimension *r*.
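For reference, the Diffusion Maps step used throughout can be sketched as follows. This is a minimal variant with the common α = 1 density normalization; the kernel bandwidth `eps` is a free parameter that would need tuning in practice.

```python
import numpy as np

def diffusion_maps(Z, eps, n_coords=1):
    # Gaussian kernel on pairwise squared Euclidean distances
    D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-D2 / eps)
    q = K.sum(axis=1)
    K = K / np.outer(q, q)                  # alpha = 1 density normalization
    P = K / K.sum(axis=1, keepdims=True)    # row-stochastic diffusion matrix
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)
    # discard the constant first eigenvector, keep the next n_coords
    return evecs[:, order[1:n_coords + 1]].real
```

The returned columns serve as the low-dimensional parametrization of the point cloud.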

## IV. GALERKIN APPROACH FOR COMPUTING REACTION COORDINATES

As we have seen, the transition manifold-based reaction coordinate (8) fulfills rigorous optimality criteria regarding the preservation of the long time scales and being of the smallest possible dimension. Unfortunately, the above algorithm to compute it has two major practical shortcomings:

1. *ξ* can only be computed *pointwise* and has no closed analytic form. For every new evaluation point, many numerical MD simulations have to be started. Furthermore, the evaluation points have to be chosen in regions relevant to the slow transition processes (i.e., in the transition regions and metastable sets), which is a non-trivial task, especially in high-dimensional systems.

2. The computation of *ξ* is based on multiple short MD simulations instead of one long one. Although this can also be seen as an advantage, the way modern MD software works often favors the simulation of single long trajectories. Furthermore, there is a vast archive of already pre-computed trajectories for many interesting metastable molecular systems. If this data could be used to compute *ξ*, those systems could be coarse-grained with minimal effort.

In the following, we will thus describe a Galerkin discretization of the embedding function (9). Importantly, this discretization will be very similar to the discretization of the dominant transfer operator eigenfunctions performed in MSM analysis, further emphasizing the close connection of the methods. Moreover, this will allow us to calculate our reaction coordinates from the same data sources also used in MSM building and utilize a wide range of analogous discretization techniques.

### A. Galerkin approximation of reaction coordinates

We first write the embedded density (11) directly as a function of the starting point $x$,

$$\tilde{\xi}(x) := \mathcal{E}(p_x^t) = (\mathcal{K}^t \eta)(x).$$

We make the weak assumption that all components of $\tilde{\xi}$ are square-integrable with respect to the stationary density, i.e., $\tilde{\xi}$ lies in the function space $L^2_\rho$ with inner product

$$\langle f, g \rangle_\rho = \int_X f(x)\, g(x)\, \rho(x)\, \mathrm{d}x. \tag{12}$$

*Remark.*

The function $\tilde{\xi}$ can already be understood as a $(2r+1)$-dimensional reaction coordinate; that is, we could in theory accept a reaction coordinate of higher than optimal dimension in order to save ourselves the application of the Diffusion Maps algorithm. We will therefore refer to $\tilde{\xi}$ as the “pre-reaction coordinate.”

We now discretize $\tilde{\xi}$ using a Galerkin approximation,^{6} i.e., we seek the function $\tilde{\xi}_N$ inside a finite-dimensional function space $V_N$ that best approximates $\tilde{\xi}$. Classical choices of the ansatz space $V_N$ are, for example, the space of all polynomials over $X$ up to a certain degree, the space spanned by $N$ characteristic functions over a finite partition of $X$, or some other finite element space. The Galerkin approximation is performed independently on the $2r+1$ individual components of $\tilde{\xi}$. However, we will omit the subscripts in order to aid readability and simply treat $\tilde{\xi}$ and $\eta$ as one-dimensional functions for the remainder of this section.

Let $\{\varphi_1, \ldots, \varphi_N\}$ be a basis of $V_N$. Then the Galerkin approximation $\tilde{\xi}_N$ has the following closed form:

$$\tilde{\xi}_N(x) = \sum_{j=1}^{N} \varphi_j(x) \sum_{k=1}^{N} (S^{-1})_{kj} \sum_{l=1}^{N} T_{kl}\, c_l, \tag{13}$$

with the *Gram matrix* $S_{kl} = \langle \varphi_k, \varphi_l \rangle_\rho$, the *transition matrix* $T_{kl} = \langle \varphi_k, \mathcal{K}^t \varphi_l \rangle_\rho$, and the coefficients $c_l = \sum_{k=1}^{N} (S^{-1})_{kl} \langle \eta, \varphi_k \rangle_\rho$, where $\eta$ is the randomly chosen observable from (9). The precise derivation of Eq. (13) is given in Appendix B. The quantities $S$, $T$, and $c$ can now be computed numerically, and thus $\tilde{\xi}_N$ can be evaluated at any state space point $x$.

*Remark.*

Exactly these matrices *S* and *T* are also commonly found at the heart of methods that aim to reconstruct the long-term dynamics directly via the transfer operator eigenfunctions,^{14,28} such as Markov state models. The Galerkin approximation of $\tilde{\xi}$ is thus applicable whenever those methods are.

### B. Data-based computation of the transition matrix

The entries of the transition matrix *T* and the Gram matrix *S* can now be approximated from simulation data. Consider two sets of data points in $X$,

$$X_M = \{x_1, \ldots, x_M\}, \qquad Y_M = \{y_1, \ldots, y_M\}, \tag{14}$$

where $X_M$ samples the stationary density *ρ* and $Y_M \subset X$ is the time-*t* evolution of $X_M$ under the dynamics; to be precise, $y_i = \Phi^t x_i$, with *t* again being the “intermediate” lag time. These data can, for example, be obtained from a single equilibrated numerical trajectory of step size *τ* (assuming that *t* is a multiple of *τ*),

$$x_i = \Phi^{i\tau} x_0, \qquad y_i = \Phi^{i\tau + t} x_0, \tag{15}$$

or the concatenation of multiple trajectories that *together* sufficiently sample *ρ*. Alternatively, $X_M$ could be the output of an enhanced sampling algorithm, such as Markov chain Monte Carlo methods,^{37} and $Y_M$ the endpoints of individual trajectories starting in $X_M$.

As frequently used in the Markov state approach,^{38} the inner product ⟨·,·⟩_{ρ} can be approximated from *ρ*-distributed data via Monte Carlo quadrature. *T* and *S* can thus be approximated as

$$T_{kl} \approx \frac{1}{M} \sum_{j=1}^{M} \varphi_k(x_j)\, \varphi_l(y_j), \qquad S_{kl} \approx \frac{1}{M} \sum_{j=1}^{M} \varphi_k(x_j)\, \varphi_l(x_j).$$

Moreover, the factors *c*_{l} become

$$c_l \approx \sum_{k=1}^{N} (S^{-1})_{kl}\, \frac{1}{M} \sum_{j=1}^{M} \eta(x_j)\, \varphi_k(x_j).$$

Subsequently, (13) can be evaluated at arbitrary state space points without significant additional cost. Choosing evaluation points $x_i$ that again cover the relevant parts of state space—for example, a subsample of the data points $X_M$—we can apply the Diffusion Maps algorithm to the embedded points $\{\tilde{\xi}_N(x_1), \ldots, \tilde{\xi}_N(x_L)\}$ and again extract the final $r$-dimensional reaction coordinate. Algorithm 2 shows an accordingly modified version of Algorithm 1.

**Algorithm 2.**

Input: Data sets $X_M$, $Y_M$ as in (14).

1: Choose a Galerkin basis $\{\varphi_1, \ldots, \varphi_N\}$ that adequately approximates smooth functions over the relevant parts of state space.

2: Choose the factors $a_{ij}$ in (10), e.g., uniformly randomly in [0, 1].

3: Compute the matrices $T$, $S$ and the vector $c$ via
$$T_{kl} \leftarrow \frac{1}{M} \sum_{j=1}^{M} \varphi_k(x_j)\, \varphi_l(y_j),$$
$$S_{kl} \leftarrow \frac{1}{M} \sum_{j=1}^{M} \varphi_k(x_j)\, \varphi_l(x_j),$$
$$c_l \leftarrow \sum_{k=1}^{N} (S^{-1})_{kl}\, \frac{1}{M} \sum_{j=1}^{M} \eta(x_j)\, \varphi_k(x_j).$$

4: Choose $L$ evaluation points $\{x_1, \ldots, x_L\}$ that cover the relevant parts of state space, i.e., the metastable sets and the transition regions.

5: Compute the data points in $\mathbb{R}^{2r+1}$ as
$$z_i \leftarrow \sum_{k,j=1}^{N} \varphi_j(x_i)\, (S^{-1})_{kj} \sum_{l=1}^{N} T_{kl}\, c_l.$$

6: Apply the Diffusion Maps algorithm to $\{z_1, \ldots, z_L\}$.

Output: Approximation to the $r$-dimensional reaction coordinate (8), evaluated at the points $\{x_1, \ldots, x_L\}$, i.e., $\{\xi(x_1), \ldots, \xi(x_L)\}$.

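Steps 3–5 of Algorithm 2 reduce to a handful of matrix operations. The following sketch treats a single one-dimensional observable *η*, as in the text; the function `phi`, which evaluates all *N* basis functions at once, is an assumed ingredient of this sketch.

```python
import numpy as np

def galerkin_rc(phi, eta, X, Y):
    # phi(P): (len(P), N) matrix of basis evaluations; eta(P): (len(P),) observable
    PX, PY = phi(X), phi(Y)
    M = len(X)
    T = PX.T @ PY / M                            # Monte Carlo transition matrix
    S = PX.T @ PX / M                            # Monte Carlo Gram matrix
    c = np.linalg.solve(S, PX.T @ eta(X) / M)    # c = S^{-1} <eta, phi>_rho
    w = np.linalg.solve(S, T @ c)                # basis coefficients of the pre-RC
    return lambda P: phi(P) @ w                  # evaluates xi_N at arbitrary points
```

The returned closure evaluates the pre-reaction coordinate $\tilde{\xi}_N$ at arbitrary state space points, as step 5 requires.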

*Remark.*

Another advantage of Algorithm 2 is that when adding a new evaluation point *x*_{L+1}, no new simulations have to be started; only the Diffusion Maps algorithm has to be re-applied to the now extended set of embedded points {*z*_{1}, …, *z*_{L+1}}.

### C. Implementation: Voronoi-based Galerkin approximation

For Markov state model construction, there exists an extensive collection of elaborate Galerkin basis sets that have been successfully applied to real-world biomolecular systems, and all of them can, in principle, be used to approximate the reaction coordinate *ξ*. Examples are hierarchical wavelet bases,^{39} mesh-free basis functions based on Shepard’s approach,^{40,41} and specialized problem-adapted basis sets, such as a tensor basis for peptide chains.^{19} In this section, we detail a simple, yet practical algorithm that constructs a particular mesh-free ansatz space directly from the available simulation data. Similar basis functions have been explored in the context of MSMs in Ref. 42.

Let $\{A_1, \ldots, A_N\}$ be sets that partition $X$, i.e., $\bigcup_i A_i = X$ and $A_i \cap A_j = \emptyset$ for $i \neq j$. Choosing the indicator functions over the sets $A_i$,

$$\varphi_i(x) = \mathbb{1}_{A_i}(x) = \begin{cases} 1, & x \in A_i, \\ 0, & \text{otherwise}, \end{cases}$$

as the basis of $V_N$, the entry $T_{kl}$ of the transition matrix is effectively just the relative number of transitions from set $A_k$ to set $A_l$ within the data sets $X_M$, $Y_M$. The Gram matrix is diagonal, with $S_{kk}$ being the relative number of data points in $X_M$ that lie in $A_k$. This partition-based Galerkin approximation of the transfer operator is known as Ulam’s method in the MSM literature.^{43}
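For an indicator basis over Voronoi cells, the estimates of *T* and *S* thus reduce to transition counting between nearest-center assignments. A minimal sketch, assuming Euclidean nearest-center assignment:

```python
import numpy as np

def ulam_matrices(X, Y, centers):
    # nearest-center assignment defines the Voronoi cell of each data point
    ix = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    iy = np.argmin(((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    N, M = len(centers), len(X)
    T = np.zeros((N, N))
    np.add.at(T, (ix, iy), 1.0 / M)             # relative counts A_k -> A_l
    s = np.bincount(ix, minlength=N) / M        # relative occupation of each cell
    return T, np.diag(s)                        # S is diagonal for indicators
```

Note that the diagonal structure of *S* makes its inversion in (13) trivial.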

The evaluation of $\tilde{\xi}_N$ at a specific point $x \in A_k$ then becomes

$$\tilde{\xi}_N(x) = \frac{1}{S_{kk}} \sum_{l=1}^{N} T_{kl}\, c_l.$$

#### 1. Choice of the partition sets

Choosing the partition sets naively, for example, as a regular box grid, invokes the infamous *curse of dimensionality*, as the number of boxes rises exponentially with the system’s dimension. We thus propose a partition into grid-free Voronoi cells {*A*_{1}, …, *A*_{N}} with center points adapted to the dynamical data $XM$. With this, we will also be able to avoid the explicit construction (and storage) of the transition matrix.

Our objective is to approximate $\tilde{\xi}$ in the region of state space that is covered by the available data points $X_M$. The question is then how the Voronoi centers $E = \{e_1, \ldots, e_N\} \subset X$ should be chosen in order to achieve this. In the following, we demonstrate that two different criteria on the approximation quality of $\tilde{\xi}$ lead to two different algorithms for selecting the Voronoi centers.

##### a. Minimizing the L^{2} error.

Since $\tilde{\xi} \in L^2_\rho$, we may ask to minimize the error

$$\big\| \tilde{\xi} - \tilde{\xi}_N \big\|_\rho,$$

where ∥·∥_{ρ} is the norm induced by the inner product (12). In Appendix C, we show that under weak assumptions, this error is minimized by choosing as the Voronoi centers the output of the *k*-means clustering algorithm^{44} applied to the data $X_M$ with *k* = *N*. *k*-means is highly scalable, both in the number of clusters *N* and in the number of data points *M*, and is readily available in many software packages.
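A plain Lloyd iteration suffices to illustrate the *k*-means step; in practice, one would use a scalable library implementation, as noted above. This sketch assumes Euclidean distances and a fixed iteration count.

```python
import numpy as np

def kmeans(X, N, n_iter=100, seed=0):
    # plain Lloyd iteration: assign points to nearest center, then recenter
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=N, replace=False)]
    for _ in range(n_iter):
        ix = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(N):
            if np.any(ix == k):
                centers[k] = X[ix == k].mean(axis=0)  # keep empty clusters fixed
    return centers
```

The resulting centers concentrate in densely sampled (metastable) regions, consistent with the behavior discussed in Sec. V.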

##### b. Minimizing uniform error.

Thinking of $\tilde{\xi}$ as an observable, it is natural to minimize the *uniform* observable error

$$\sup_{x \in X} \big| \tilde{\xi}(x) - \tilde{\xi}_N(x) \big|.$$

In Appendix C, we show that, again under weak assumptions, the minimum is achieved if the centers evenly cover the region of $X$ where data are available, such that the Voronoi cells all have similar diameters. This problem is related to *Poisson disk* or *blue noise (sub)sampling* in computer vision.^{45} The following *picking algorithm*^{41} computes an approximately equidistant subsample of $X_M$.

In conclusion, minimizing the *L*^{2} error of $\tilde{\xi}$ leads to *k*-means clustering as the algorithm for picking the Voronoi centers, while minimizing the uniform error of $\tilde{\xi}$ leads to the farthest point picking Algorithm 3. In Sec. V, we compare both alternatives. In general, *k*-means will place Voronoi cells more densely in metastable regions, while Algorithm 3 will lead to evenly sized Voronoi cells.

**Algorithm 3.**

Input: $X_M$, $N$.

1: $e_1 \leftarrow$ random point from $X_M$

2: for $j = 2, \ldots, N$ do

3: &nbsp;&nbsp;pick the point with the maximum distance from the previously chosen points:
$$e_j \leftarrow \arg\max_{x \in X_M} \, \min_{i=1,\ldots,j-1} \|x - e_i\|$$

4: end for

Output: Voronoi centers $E = \{e_1, \ldots, e_N\}$.

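In NumPy, the picking algorithm can be implemented in O(*NM*) time by maintaining the running minimum distance of every data point to the already chosen centers:

```python
import numpy as np

def picking(X, N, seed=0):
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]              # random first center
    dmin = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(1, N):
        j = int(np.argmax(dmin))                   # farthest point from chosen set
        idx.append(j)
        dmin = np.minimum(dmin, np.linalg.norm(X - X[j], axis=1))
    return X[idx]
```

Already chosen points have zero running distance, so they are never picked twice.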

In order to compute *T*, *S*, and *c*, the data points from $X_M$ and $Y_M$ have to be assigned to their respective partition sets. For Voronoi cells, this is easily done by a nearest-point search between $X_M$ and *E*, and between $Y_M$ and *E*, respectively.

*Summary.*

The criteria (CI) and (CII) and the concept of transition manifolds offer a new perspective on optimal reaction coordinates. Reaction coordinates that fulfill these criteria can in fact be computed using the same data sources and state space discretization techniques as classical MSMs, which means that the entire machinery invented for building MSMs can be utilized for their computation.

## V. EXAMPLES

### A. Curved double well potential

As our first demonstration, we compute the reaction coordinate of the simple curved double-well potential from Example 1.1 using Algorithm 2. It will allow us to visualize the (embedded) transition manifold and compare the computed reaction coordinate with the minimum energy pathway and the committor function.

In this low-dimensional example, the relaxation time scales associated with the slowest processes of the full system can be computed numerically; see Table I. They were obtained from a sufficiently fine approximation of the transfer operator by computing its eigenvalues and applying formula (4).

TABLE I. Dominant time scales.

| | t_{1} | t_{2} | t_{3} |
|---|---|---|---|
| Full system | 5.9332 | 0.9021 | 0.6031 |
| ξ, k-means alg. | 5.8899 | 0.8615 | 0.5625 |
| ξ, picking alg. | 5.9034 | 0.8789 | 0.5838 |
| ζ(x_{1}, x_{2}) = x_{1} | 5.7130 | 0.7964 | 0.5380 |


As expected, the system is time scale-separated, with the single slow time scale representing the mean expected waiting time^{46} for a single transition between the two wells. The lag time *t* = 2 falls in between the slow and fast time scales, so we use it as the “intermediate” lag time for Algorithm 2. Moreover, we assume the dimension *r* = 1 of the transition manifold to be known.

As the source of dynamical data, we utilize a single well-equilibrated trajectory

$$x_0,\ \Phi^\tau x_0,\ \Phi^{2\tau} x_0,\ \ldots$$

of the dynamics with step size $\tau = 10^{-2}$ and overall $2 \cdot 10^7$ steps. This trajectory is used to construct the data sets $X_M$, $Y_M$ via formula (15).

We partition the interesting region of $\mathbb{R}^2$ into 1000 Voronoi cells. The characteristic functions over the cells then form the Galerkin basis for Algorithm 2, as detailed in Sec. IV C. The centers of the cells can be chosen using either the k-means algorithm or the picking algorithm (Algorithm 3); we compare both methods in the following. As the evaluation points {*x*_{1}, …, *x*_{L}} required by Algorithm 2, we simply re-use the 1000 Voronoi center points, as they already cover the interesting state space regions.

#### 1. Results

Figure 4(a) shows the computed Voronoi center points. While the points based on the picking algorithm cover $XM$ evenly (by construction), the k-means-based center points appear to emphasize the metastable regions and slightly under-sample the transition regions.

Figure 4(b) shows the approximation to $\tilde{\xi}$ evaluated at {*x*_{1}, …, *x*_{L}}, computed via (13). These are the $(2r+1)$-dimensional data points *z*_{i} in Algorithm 2. The points quite obviously concentrate around a one-dimensional manifold: the embedding of the transition manifold $M$ into $\mathbb{R}^3$ via (9).

The Diffusion Maps algorithm applied to the points indeed finds the correct dimension *r* = 1 of the embedded manifold and parameterizes it. The coloring in Fig. 4(b) indicates the value of the one-dimensional parametrization at the respective embedded point. This is also the value of the final reaction coordinate at the respective evaluation point, i.e., *ξ*(*x*_{i}). Assigning this value to the whole Voronoi cell that *x*_{i} belongs to yields the final reaction coordinate *ξ*, defined on all of $\mathbb{R}^2$, shown as the coloring in Fig. 4(c).

*ξ* clearly parameterizes the minimum energy pathway, with a smooth gradient in the transition region. Moreover, *ξ* qualitatively resembles the system’s committor function that is shown in Fig. 4(d).

#### 2. Time scale analysis

To quantitatively verify the quality of the computed reaction coordinate, we compare the time scales of the full process *X*_{t} and the projected process *ξ*(*X*_{t}). Note that this is equivalent to comparing the dominant eigenvalues (5), i.e., it is a necessary condition for criterion (CI). The time scales of the projected process were computed by first approximating the eigenvalues of the projected transfer operator [using the projected trajectory $\xi(x_0), \xi(\Phi^\tau x_0), \xi(\Phi^{2\tau} x_0), \ldots$] and then again using formula (4).

Table I shows that our Galerkin-approximated reaction coordinate *ξ* approximates the dominant time scale *t*_{1} of the full system very well, both for Voronoi centers chosen by the k-means and the picking algorithm. In fact, even the non-dominant time scales *t*_{2}, *t*_{3}, … are reproduced quite well, even though our theory only holds for the dominant time scales. Compared to the naively chosen reaction coordinate, *ζ*(*x*_{1}, *x*_{2}) = *x*_{1}, our approximation error is noticeably lower, although *ζ* still preserves the time scales surprisingly well.
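The time scale computation just described can be sketched as follows, assuming formula (4) is the usual implied-time-scale relation $t_i = -t / \ln \lambda_i$; the input is a trajectory discretized into partition indices.

```python
import numpy as np

def implied_timescales(dtraj, n_states, lag):
    # count transitions at the given lag along a discretized trajectory
    C = np.zeros((n_states, n_states))
    np.add.at(C, (dtraj[:-lag], dtraj[lag:]), 1.0)
    P = C / C.sum(axis=1, keepdims=True)        # row-stochastic transition matrix
    ev = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return -lag / np.log(ev[1:])                # drop the stationary eigenvalue 1
```

Here the lag is measured in trajectory steps; multiplying by the step size *τ* converts to physical time.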

### B. Alanine dipeptide

We demonstrate that with Algorithm 2, one can successfully use long-time simulation data to identify quantitatively good reaction coordinates in realistic molecular systems, that the resulting reaction coordinates are chemically interpretable, and that they can be used to quantitatively restore information about the long-time transition processes (in the form of the transfer operator eigenfunctions).

For this, we consider a single alanine dipeptide molecule in aqueous solution at temperature 400 K. The molecule consists of 22 atoms (including hydrogen atoms); thus the full Cartesian state space $X$ is 66-dimensional. We chose to analyze this rather small example system as it still possesses a clearly defined time scale separation that bigger systems often lack. Furthermore, the system possesses a chemically intuitive reaction coordinate that will serve as a benchmark: usually, two backbone dihedral angles *φ*, *ψ* are considered responsible for the long-term kinetics of alanine dipeptide, with four configurations of these angles forming metastable states (see Fig. 5). We emphasize, however, that this information is used only for illustration and comparison purposes and that we compute our reaction coordinate *ξ* based on the full 66-dimensional data.

The relaxation time *t* = 20 ps and the embedding dimension *r* = 2 are assumed to be known. We will see later that *t* indeed falls into a time scale gap. For the dynamical data, a single 40 ns long trajectory of the system was generated using the MD software Gromacs. The trajectory was stripped of the solvent molecules, downsampled to step width *τ* = 0.02 ps, and its center of mass was fixed at the center of the simulation box, yielding the 66-dimensional trajectory

$$x_0,\ \Phi^\tau x_0,\ \ldots,\ \Phi^{(M-1)\tau} x_0$$

with *M* = 2 · 10^{6}. Using (15), we generated the data sets $X_M$, $Y_M$.

We computed 2000 Voronoi centers in the region covered by the trajectory using both the k-means and the picking algorithm. The projection of these points onto the (*φ*, *ψ*)-plane can be seen in Fig. 6(b). While this projection offers only an incomplete insight into the distribution of the full 66-dimensional center points, it indicates that the k-means algorithm again emphasizes the metastable sets, whereas Algorithm 3 covers the total range of values more evenly. Again, for the evaluation points {*x*_{1}, …, *x*_{L}}, we re-purposed the 2000 Voronoi center points.

For the embedding functions $\eta: \mathbb{R}^{66} \to \mathbb{R}^5$, linear functions with coefficients drawn uniformly at random from [0, 1] were chosen, just as in Sec. V A.

#### 1. Results

Figure 6 visualizes the computed reaction coordinates with Voronoi center points chosen by the k-means algorithm (left) and picking algorithm (right).

As the dimension of the transition manifold was assumed to be *r* = 2, the dimension of the embedding space, and thus of the values $\tilde{\xi}(x_i)$, $i = 1, \ldots, L$, is 2*r* + 1 = 5, which makes it impossible to visualize the embedded transition manifold directly. However, plotting just the first three of the five components still offers good insight into its structure; see Fig. 6(a). Unlike in the first example, the two-dimensional manifold structure of the embedded points is not obviously apparent. Instead, the points $\tilde{\xi}(x_i)$ appear to be concentrated mainly around four clusters that form two connected pairs. The Diffusion Maps algorithm nevertheless recognizes the point cloud as parametrizable by a two-dimensional coordinate and computes the parametrization, i.e., our final reaction coordinate *ξ* at the evaluation points. Figures 6(a) and 6(b) show in color the two components of *ξ* at the embedded evaluation points and at the (*φ*, *ψ*)-projection of the evaluation points, respectively. The latter confirms that the observed four clusters correspond to the four metastable states, and that the connections between the pairs of clusters correspond to points located along the transition pathways. It also explains why there is seemingly no connection between the two pairs of clusters: the transition pathway connecting clusters A and C is too sparsely populated by evaluation points—especially in the k-means case—to show the connection. Overall, we see a clear correlation between the computed reaction coordinate *ξ* and the reference reaction coordinate (*φ*, *ψ*).

#### 2. Time scale analysis

We again compute the implied time scales of the reduced process *ξ*(*X*_{t}). To yield the highest accuracy possible for the given data set, we utilize the PyEMMA software package^{47} with its built-in methods to discretize the transfer operator, estimate its eigenvalues, and compute the time scales.

Computing the time scales of the full 66-dimensional process with the necessary accuracy is not possible, so we cannot conduct a rigorous error analysis for this system. Instead, we utilize the variational principle of conformation dynamics^{48} which states that the time scales of the full process are always *underestimated* by those of any projection of the process. Thus, larger dominant time scales of the projected process in general correspond to a better reaction coordinate. However, due to the possibility of systematic errors in approximating the projected time scales (discretization of the transfer operator, finite amount of dynamical data), this variational principle might be violated. Thus, we additionally offer a comparison to the time scales of a manually chosen two-dimensional reaction coordinate that can generally be considered “good,” namely, the backbone dihedrals (*φ*, *ψ*). Still, we emphasize that these time scales do not represent the “ground truth.” The coordinate (*φ*, *ψ*) is also not necessarily optimal in the sense of the variational principle, and thus again gives only an approximation of the full system’s true dominant time scales.

Using these two error estimators, we compare our reaction coordinates *ξ* for both the k-means and the picking algorithm to a two-dimensional TICA (time-lagged independent component analysis) projection, a dimensionality reduction method popular in MD analysis.^{14} TICA finds the directions in the data set with maximal global autocorrelation for a specified lag time and thus always yields *linear* reaction coordinates. For TICA, the lag time *τ* = 120 ps was chosen, as it maximizes the cumulative kinetic variance (95.5%).^{49}
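TICA itself amounts to a generalized eigenvalue problem between the instantaneous and time-lagged covariance matrices. A minimal, unregularized sketch (library implementations add further estimators, e.g., reversible symmetrization and regularization):

```python
import numpy as np

def tica(X, lag, n_components=2):
    Xc = X - X.mean(axis=0)
    C0 = Xc.T @ Xc / len(Xc)                       # instantaneous covariance
    Ct = Xc[:-lag].T @ Xc[lag:] / (len(Xc) - lag)  # time-lagged covariance
    Ct = 0.5 * (Ct + Ct.T)                         # symmetrize
    # generalized eigenproblem Ct v = lambda C0 v
    evals, evecs = np.linalg.eig(np.linalg.solve(C0, Ct))
    order = np.argsort(-evals.real)
    return Xc @ evecs[:, order[:n_components]].real
```

The leading components are the linear coordinates with the slowest decorrelation at the chosen lag.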

The three (nontrivial) dominant time scales and their deviation from the benchmark (*φ*, *ψ*)-projection can be seen in Table II. The remaining time scales *t*_{i}, *i* ≥ 4 are significantly smaller (<5 ps) and are considered non-dominant and thus irrelevant.

TABLE II. Dominant time scales (ps).

| | t_{1} | t_{2} | t_{3} |
|---|---|---|---|
| ξ, k-means alg. | 194.58 | 62.50 | 41.80 |
| ξ, picking alg. | 194.41 | 62.25 | 41.63 |
| TICA | 191.78 | 61.27 | 29.84 |
| (φ, ψ) | 194.71 | 62.93 | 41.27 |


Judging by both the variational principle and the comparison to the benchmark projection, both of our new reaction coordinates provide a measurably better approximation of the dominant time scales than the TICA reaction coordinate, though the latter remains competitive.

#### 3. Eigenfunction reconstruction

As the reaction coordinate *ξ* was constructed to fulfill criterion (CI), it should be possible to reconstruct the full system’s dominant transfer operator eigenfunctions $v_i$, $i = 1, 2, 3$, which are functions over the 66-dimensional state space $X$, from the eigenfunctions $w_i$ of the projected transfer operator, which are functions over $\mathbb{R}^2$. As the reaction coordinates computed with the k-means and the picking algorithm variants of Algorithm 2 are qualitatively equal, we limit the investigation to the k-means reaction coordinate.

Even though the state space is 66-dimensional, the eigenfunctions of the full transfer operator can still be approximated with reasonable accuracy by a Galerkin method if an appropriate mesh-free basis is used. Luckily, we have already constructed such a basis, namely, the Voronoi basis used for computing the reaction coordinates. Thus, we are able to re-use exactly the same transition matrix *T* and Gram matrix *S* assembled in Algorithm 2. Computing the Galerkin approximation of the eigenfunctions $v_i$ then corresponds to solving a 2000 × 2000 eigenvector problem.
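Under the Voronoi discretization, this eigenvector problem can be sketched as the generalized eigenproblem $T v = \lambda S v$ built from the matrices of Algorithm 2; this is one common convention, and normalization details vary between transfer and Koopman formulations.

```python
import numpy as np

def transfer_eigenfunctions(T, S, k=3):
    # generalized eigenproblem T v = lambda S v, solved via S^{-1} T v = lambda v
    evals, evecs = np.linalg.eig(np.linalg.solve(S, T))
    order = np.argsort(-evals.real)
    # column i holds the basis coefficients of the i-th dominant eigenfunction
    return evals.real[order[:k]], evecs[:, order[:k]].real
```

For the diagonal *S* of an indicator basis, $S^{-1}T$ is exactly the row-stochastic MSM transition matrix, whose leading eigenvalue is 1.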

On the other hand, as *ξ* is only two-dimensional, computing the eigenfunctions $w_i$ of the projected transfer operator $T_\xi^t$ is possible with a fine grid-based Galerkin method. To construct the corresponding transition matrix, the projected trajectory

$$\xi(x_0),\ \xi(\Phi^\tau x_0),\ \xi(\Phi^{2\tau} x_0),\ \ldots$$

is used. The functions $\hat{v}_i(\cdot) := w_i(\xi(\cdot))$ should then reconstruct the $v_i$.

Of course, being functions over the 66-dimensional state space, the $v_i$ and $\hat{v}_i$ are difficult to visualize. We thus again project them onto the (*φ*, *ψ*)-plane using a simple interpolation procedure. The result can be seen in Fig. 7. We observe excellent qualitative agreement between the full and the reconstructed eigenfunctions, or at least between their (*φ*, *ψ*)-projections.

*Remark.*

This last section has again shown the close relationship between the transfer operator eigenfunctions and the newly defined reaction coordinates, both in their expressive power and the data required to compute them. In this concrete example, even the computational effort is identical, as both the computation of the full transfer operator eigenfunctions and the application of the Diffusion Maps algorithm to the embedded evaluation points require the solution of a 2000 × 2000 eigenproblem. Therefore, our proposed numerical method is not necessarily computationally advantageous over directly computing the eigenfunctions.

However, we want to stress again that the newly defined transition manifold-based reaction coordinates are advantageous on a conceptual level. First, they obey a rigorous optimality criterion and thus are guaranteed to preserve the system’s slowest time scales. Second, they are interpretable in the context of transition pathways, as detailed in Sec. II F. For the alanine dipeptide, the computed two-dimensional reaction coordinate *ξ* is directly interpretable as a transformation of the two principal dihedral angles, whereas using the three dominant eigenfunctions as reaction coordinates would yield a three-dimensional reaction coordinate that is redundant in describing the molecule’s internal slow dynamics.

### C. Conformational analysis of NTL9

Finally, we demonstrate the applicability of our method to a realistic high-dimensional system. For this, we analyze a 1.11 ms long molecular trajectory of the fast-folding protein NTL9, generated on the Anton supercomputer.^{50} Instead of Cartesian coordinates, we use the amino acid chain’s *contact map*, i.e., the matrix containing the pairwise distances of the residues, as coordinates for the further analysis. This eliminates the need to remove the global translational and rotational motion from the trajectory. As the protein consists of 40 residues, this results in a 1600-dimensional state space (although this could be reduced by exploiting the symmetry of the contact map matrix).
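The contact-map featurization described above can be computed directly from per-residue coordinates. A sketch, assuming one representative coordinate (e.g., the C-alpha atom) per residue:

```python
import numpy as np

def contact_map_features(residue_xyz):
    # residue_xyz: (n_residues, 3) representative coordinates per residue
    diff = residue_xyz[:, None, :] - residue_xyz[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))   # (n, n) pairwise residue distances
    return D.reshape(-1)                    # flattened n^2-dimensional feature
```

For the 40 residues of NTL9, this yields the 1600-dimensional representation used here; keeping only the upper triangle of the symmetric distance matrix would roughly halve that dimension.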

As we are interested in the formation of secondary structures such as *α* helices and *β* sheets, we choose a lag time one to two orders of magnitude shorter than the time scales of those processes, *τ* = 10 ns. To generate the data set $X_M$, *M* = 1.11 · 10^{6} frames were uniformly subsampled from the trajectory. $Y_M$ was generated the same way, only shifted by the lag time *τ*. From $X_M$, we drew *L* = 5550 Voronoi center points *x*_{i} using the picking algorithm.
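The construction of the time-lagged data sets can be sketched as follows. This is a minimal illustration, not the actual NTL9 pipeline: the helper name `lagged_pairs` and the toy trajectory are our own, and frames are assumed equally spaced in time so that the lag can be given in frames.

```python
import numpy as np

def lagged_pairs(traj, lag, m):
    """Subsample m reference frames x_i uniformly from a trajectory and
    pair each with the frame `lag` steps later, y_i (its push-forward).

    traj : (T, d) array of frames; lag is given in frames.
    """
    T = len(traj)
    idx = np.linspace(0, T - 1 - lag, m).astype(int)  # uniform subsampling
    X = traj[idx]          # data set X_M
    Y = traj[idx + lag]    # time-lagged data set Y_M
    return X, Y

# toy trajectory: 1000 one-dimensional frames
traj = np.arange(1000, dtype=float).reshape(-1, 1)
X, Y = lagged_pairs(traj, lag=10, m=100)
```

For the NTL9 analysis, `traj` would hold the 1600-dimensional contact-map vectors, with `lag` corresponding to *τ* = 10 ns.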

For the expected transition manifold dimension *r*, and the corresponding number of embedding functions 2*r* + 1, we used a simplified version of the iterative procedure proposed at the end of Sec. III B: Start with a low value (*r* = 1 in this example) and see if useful structure can be identified in the embedded transition manifold. If not, increase *r* and repeat the embedding procedure.
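The simplified iterative procedure can be written as a small loop. The predicate `embed_and_cluster` below is a hypothetical stand-in for the manual step of embedding into $\mathbb{R}^{2r+1}$ and inspecting the result for structure:

```python
def choose_embedding_dim(embed_and_cluster, r_max=5):
    """Increase the assumed transition manifold dimension r until the
    embedding into R^(2r+1) reveals usable structure.

    embed_and_cluster(r) is assumed to return True when the embedded
    points show identifiable structure (e.g., well-separated clusters).
    """
    for r in range(1, r_max + 1):
        if embed_and_cluster(r):
            return r
    return None  # no usable structure found up to r_max

# toy stand-in: structure becomes visible at r = 2
r = choose_embedding_dim(lambda r: r >= 2)
```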

#### 1. Results

Quite surprisingly, the transition manifold in this case already reveals its structure under an embedding into $\mathbb{R}^3$. In the embedded Voronoi center points $\tilde{\xi}(x_i)$, four clusters are clearly visible [Fig. 8(a)]. The clusters are robust under the choice of the embedding functions.

For simplicity, i.e., in order to avoid the parameter tuning of an automated clustering algorithm, we assigned the points to the clusters manually. Their average contact maps and secondary structure are shown in Figs. 8(b) and 8(c).

Interpreting the four clusters as conformations, our results are to a large degree consistent with those of Mardt *et al.*,^{51} who performed analysis on the same dataset using deep learning methods. Our conformations “Unfolded,” “Folded 1,” and “Folded 2” correspond very well to the main conformations identified by their algorithm. Note that “Unfolded” is not a conformation in the chemical sense, but rather a loose collection of various unfolded configurations. The populations of the conformations [percentages in Fig. 8(c)] are also comparable to those in Ref. 51. Our slightly lower values can be explained by the difference in how the populations are calculated. However, our conformation “Folded 3” does not appear in their analysis. While its population is quite low, its structure subtly yet distinctively differs from the other conformations, so we do not consider its existence a statistical artifact. Furthermore, we were not able to find the finer sub-structures of the “Unfolded” conformation that were identified in Ref. 51.

Let *ξ*(*x*_{i}) denote the first Diffusion Maps coordinate on the embedded points, which indicates the direction of largest variance. We see a strong correlation between *ξ* and the mean inter-residuum distance, i.e., the average of all entries of the contact map matrix (Fig. 9). Thus, *ξ* describes the "degree of foldedness" of the protein, which can be considered a reasonable one-dimensional reaction coordinate of this system. However, unlike in the dialanine example, the second and higher Diffusion Maps coordinates did not correspond to any easily interpretable physical property that would resolve the transitions between the identified conformations more finely; instead, they seemed to consist only of higher modes of the first coordinate.
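A correlation check of this kind is straightforward to perform on the contact-map trajectory. The sketch below uses synthetic stand-in data: the arrays `maps` and `xi` are illustrative only; in the real analysis, `xi` would be the first Diffusion Maps coordinate of the embedded points.

```python
import numpy as np

rng = np.random.default_rng(0)
M, R = 500, 40                         # frames, residues
# synthetic stand-in for the contact-map trajectory: (M, R, R) distance matrices
maps = rng.uniform(1.0, 10.0, size=(M, R, R))
maps = 0.5 * (maps + maps.transpose(0, 2, 1))   # symmetrize, like a distance matrix

# mean inter-residuum distance per frame: average over all matrix entries
mean_dist = maps.mean(axis=(1, 2))

# stand-in for the first Diffusion Maps coordinate: a noisy copy of the
# mean distance, so that the correlation diagnostic has something to find
xi = mean_dist + 0.01 * rng.standard_normal(M)

corr = np.corrcoef(xi, mean_dist)[0, 1]         # Pearson correlation
print(f"correlation(xi, mean distance) = {corr:.2f}")
```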

*Remark.*

Although the results are already very encouraging and show the potential usefulness of the method for very high-dimensional systems, the setup can be refined in a number of ways. Most importantly, instead of the simple Voronoi cell-based Galerkin method, specialized ansatz spaces such as meshfree basis functions with global support might be able to better approximate the reaction coordinate, in particular, in the undersampled transition regions. We are planning to explore these and other refinements of the method as well as its application to further high-dimensional molecular systems in an upcoming study.

## VI. CONCLUSION

In this paper, we reviewed a novel framework for the characterization and computation of optimal reaction coordinates, originally introduced in Ref. 16, and presented efficient algorithms for the computational identification of such reaction coordinates that allow for direct application to real-world molecular systems. Moreover, we found that the new framework agrees with the TPT characterization of good reaction coordinates in classical metastable systems, but offers more rigorous criteria that are applicable to much broader classes of multiscale systems, with and without a time scale gap.

In particular, we introduced a discretization approach to the data-driven computation of reaction coordinates that fulfill these rigorous criteria. This approach is usable whenever a transition matrix between the discretization elements can be computed from the available simulation data, e.g., when the data represent a long equilibrated trajectory. The entire machinery developed for building MSMs can thus be utilized for the computation of reliable reaction coordinates with provable approximation quality.

As a demonstration, we provided two algorithms to construct a mesh-free basis of Voronoi ansatz functions directly from the data. Both algorithms are highly scalable and straightforward to implement, making this method readily applicable for practitioners who already have simulation data at hand. We showed that in molecular systems of medium size, the resulting approximation error in the dominant time scales is competitive with state-of-the-art dimension reduction techniques. This demonstrates that the reaction coordinates computed here can be used to build efficient coarse-grained models. Of course, the computed reaction coordinates themselves are also of independent value. In the case of the alanine dipeptide, we showed that our computed reaction coordinates and the dihedral angles, which are typically used as reaction coordinates for this system, produce a similar portrait of the system when viewed in this reduced space.

As a next step, we plan to apply our method to higher-dimensional systems with *a priori* unknown reaction coordinates using specialized Galerkin ansatz spaces borrowed from Markov state model theory. For these systems, however, the requirement to have simulation data that sample the stationary density is of course somewhat strict. We thus also plan to work on relaxing this requirement to a “local” version, i.e., we will work with samples that are equilibrated only in some smaller region of the state space, in order to compute reaction coordinates in that region.

## ACKNOWLEDGMENTS

This research has been funded by Deutsche Forschungsgemeinschaft (DFG) through Grant No. CRC 1114 “Scaling Cascades in Complex Systems.”

### APPENDIX A: SPECTRAL DECOMPOSITION OF THE TRANSITION DENSITY FUNCTION

We derive the exact form of the coefficients *d*_{i}(*x*, *t*) in the decomposition of the transition density function *p*^{t}(*x*, ·),

$$p^t(x,\cdot) = \sum_{i=0}^{\infty} d_i(x,t)\, v_i. \tag{A1}$$

Recall that the stochastic process was assumed to be *reversible*, which formally equates to the transition density function and the stationary density *ρ* fulfilling the *detailed balance condition*

$$\rho(x)\, p^t(x,y) = \rho(y)\, p^t(y,x) \quad \text{for all } x, y.$$

With that, it is easy to see that the transfer operator $P^t$, defined in (2), is self-adjoint with respect to the weighted inner product

$$\langle f, g \rangle_{1/\rho} := \int f(x)\, g(x)\, \rho(x)^{-1}\, dx,$$

as

$$\langle P^t f, g \rangle_{1/\rho} = \iint p^t(x,y)\, f(x)\, g(y)\, \rho(y)^{-1}\, dx\, dy \overset{(*)}{=} \iint p^t(y,x)\, f(x)\, g(y)\, \rho(x)^{-1}\, dx\, dy = \langle f, P^t g \rangle_{1/\rho},$$

where in (*) the detailed balance condition was used. Thus, the eigenfunctions $v_i$ of $P^t$ form an orthogonal basis of the associated inner product space; we take them to be normalized, $\langle v_i, v_j \rangle_{1/\rho} = \delta_{ij}$.

We assume from now on that the function *p*^{t}(*x*, ·) lies in (or can be approximated with sufficient accuracy within) this space. Then we have

$$p^t(x,\cdot) = \sum_{i=0}^{\infty} \big\langle p^t(x,\cdot),\, v_i \big\rangle_{1/\rho}\, v_i.$$

Now, *p*^{t}(*x*, ·) can be seen as the time-*t* evolution of the Dirac density *δ*_{x} under the dynamics; thus,

$$\big\langle p^t(x,\cdot),\, v_i \big\rangle_{1/\rho} = \big\langle P^t \delta_x,\, v_i \big\rangle_{1/\rho}.$$

Using the self-adjointness of $P^t$, we get

$$\big\langle P^t \delta_x,\, v_i \big\rangle_{1/\rho} = \big\langle \delta_x,\, P^t v_i \big\rangle_{1/\rho}.$$

As $v_i$ is an eigenfunction of $P^t$ to eigenvalue $\lambda_i^t$, this is

$$\big\langle \delta_x,\, P^t v_i \big\rangle_{1/\rho} = \lambda_i^t\, \big\langle \delta_x,\, v_i \big\rangle_{1/\rho}.$$

Finally, taking the inner product with *δ*_{x} is equivalent to a point evaluation at *x* (weighted by *ρ*(*x*)^{−1}),

$$\big\langle \delta_x,\, v_i \big\rangle_{1/\rho} = \frac{v_i(x)}{\rho(x)}.$$

The overall decomposition thus reads

$$p^t(x,\cdot) = \sum_{i=0}^{\infty} \lambda_i^t\, \frac{v_i(x)}{\rho(x)}\, v_i, \qquad \text{i.e.,} \quad d_i(x,t) = \lambda_i^t\, \frac{v_i(x)}{\rho(x)}.$$

### APPENDIX B: DERIVATION OF THE GALERKIN-APPROXIMATED REACTION COORDINATE

Let $V_N$ be a finite-dimensional function space spanned by the basis {*φ*_{1}, …, *φ*_{N}}. The Galerkin projection of $\tilde{\xi}$ onto $V_N$ with respect to the inner product ⟨·,·⟩_{ρ} is defined as

$$\tilde{\xi}_N = \sum_{i,j=1}^{N} (G^{-1})_{ij}\, \big\langle \tilde{\xi}, \varphi_j \big\rangle_{\rho}\, \varphi_i \tag{B1}$$

with the nonnegative, symmetric Gram matrix

$$G_{ij} = \langle \varphi_i, \varphi_j \rangle_{\rho}.$$

The Galerkin projection of the observable *η* is analogously defined and with the factors

$$c_i = \sum_{j=1}^{N} (G^{-1})_{ij}\, \langle \eta, \varphi_j \rangle_{\rho}$$

can be written as

$$\eta_N = \sum_{i=1}^{N} c_i\, \varphi_i.$$

We assume that the Galerkin ansatz space is suitable to approximate *η*, i.e., ∥*η* − *η*_{N}∥_{ρ} is small, where ∥·∥_{ρ} is the norm induced by ⟨·,·⟩_{ρ}.

Recall that $\tilde{\xi}$ can also be written as the Koopman operator applied to *η*: $\tilde{\xi}(x) = K^t \eta(x)$. With this, the scalar product in (B1) can be estimated as follows:

$$\big\langle \tilde{\xi}, \varphi_j \big\rangle_{\rho} = \big\langle K^t \eta, \varphi_j \big\rangle_{\rho} \overset{(*)}{\approx} \big\langle K^t \eta_N, \varphi_j \big\rangle_{\rho} = \sum_{k=1}^{N} c_k\, \big\langle K^t \varphi_k, \varphi_j \big\rangle_{\rho}.$$

For (*), it was used that $\Vert K^t \Vert_{\rho} = 1$, i.e., $K^t$ does not amplify the approximation error of *η*_{N}. Define the *N* × *N* *transition matrix* by

$$T^t_{jk} = \sum_{l=1}^{N} (G^{-1})_{jl}\, \big\langle K^t \varphi_k, \varphi_l \big\rangle_{\rho}.$$

Then the Galerkin approximation (B1) becomes

$$\tilde{\xi}_N \approx \sum_{i=1}^{N} \bigg( \sum_{k=1}^{N} T^t_{ik}\, c_k \bigg)\, \varphi_i = \sum_{i=1}^{N} \big( T^t c \big)_i\, \varphi_i.$$
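For a basis of Voronoi indicator functions, *G* is diagonal and $T^t$ reduces to the familiar row-stochastic MSM transition matrix, which can be estimated by counting transitions between cells. The following sketch illustrates this special case; the helper name `voronoi_galerkin_rc` and the counting estimators are our own, and the cell averages of *η* are taken over the time-lagged samples, which are assumed to be *ρ*-distributed.

```python
import numpy as np

def voronoi_galerkin_rc(cell_x, cell_y, eta_y, N):
    """Estimate the cellwise values of xi_N = T @ c.

    cell_x[i] : Voronoi cell index of sample x_i
    cell_y[i] : Voronoi cell index of the time-lagged sample y_i
    eta_y     : (M, k) values of the embedding observable eta at the y_i
    Returns an (N, k) array whose j-th row is the (constant) value of the
    approximate reaction coordinate on cell A_j.
    """
    cell_x, cell_y = np.asarray(cell_x), np.asarray(cell_y)
    eta_y = np.asarray(eta_y, dtype=float)
    M, k = eta_y.shape
    # row-stochastic transition matrix: T[j, l] ~ P(A_j -> A_l over one lag)
    T = np.zeros((N, N))
    np.add.at(T, (cell_x, cell_y), 1.0)
    T /= np.maximum(T.sum(axis=1, keepdims=True), 1.0)
    # c[l] : empirical average of eta over the samples found in cell A_l
    c = np.zeros((N, k))
    counts = np.zeros(N)
    np.add.at(c, cell_y, eta_y)
    np.add.at(counts, cell_y, 1.0)
    c /= np.maximum(counts, 1.0)[:, None]
    return T @ c

# toy check: two cells that swap deterministically, eta constant per cell
cell_x = [0, 0, 1, 1]
cell_y = [1, 1, 0, 0]
eta_y = [[10.0], [10.0], [0.0], [0.0]]
xi_N = voronoi_galerkin_rc(cell_x, cell_y, eta_y, N=2)
```

For a long trajectory, `cell_x` and `cell_y` are simply the Voronoi assignments of the frames at times *t* and *t* + *τ*.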

### APPENDIX C: ERROR MEASURE FOR THE VORONOI CENTER POINTS

We show that different choices of the error measure for approximating $\tilde{\xi}$ by its Voronoi Galerkin projection $\tilde{\xi}_N$ lead to different optimal strategies in choosing the Galerkin center points.

#### 1. Minimizing the L^{2} error

Assume that by choosing the Voronoi center points {*e*_{1}, …, *e*_{N}} that define $\tilde{\xi}_N$, we want to minimize the error

$$\big\Vert \tilde{\xi} - \tilde{\xi}_N \big\Vert_{\rho}. \tag{C1}$$

The difficulty is that neither $\tilde{\xi}$ nor $\tilde{\xi}_N$ is known in advance. We thus construct a Monte Carlo estimator of $\Vert \tilde{\xi} - \tilde{\xi}_N \Vert_{\rho}$ based on the sampled data $X_M = \{x_1, \ldots, x_M\}$. Here

$$\bar{\xi}_{A_k} := \frac{\big\langle \mathbb{1}_{A_k}, \tilde{\xi} \big\rangle_{\rho}}{\big\langle \mathbb{1}_{A_k}, \mathbb{1} \big\rangle_{\rho}} \tag{C2}$$

is the mean of $\tilde{\xi}$ in cell *A*_{k}. First, since *A*_{1}, …, *A*_{N} partition $\mathbb{X}$,

$$\big\Vert \tilde{\xi} - \tilde{\xi}_N \big\Vert_{\rho}^2 = \sum_{k=1}^{N} \int_{A_k} \big\Vert \tilde{\xi}(x) - \bar{\xi}_{A_k} \big\Vert^2\, \rho(x)\, dx.$$

The integral can be approximated by a Monte Carlo sum over the *M* *ρ*-distributed samples *x*_{i},

$$\big\Vert \tilde{\xi} - \tilde{\xi}_N \big\Vert_{\rho}^2 \approx \frac{1}{M} \sum_{k=1}^{N} \sum_{x_i \in A_k} \big\Vert \tilde{\xi}(x_i) - \bar{\xi}_{A_k} \big\Vert^2,$$

where ∥ · ∥ is the Euclidean norm in $\mathbb{R}^{2r+1}$. Finally, since $\tilde{\xi}_N = \sum_k \frac{\langle \mathbb{1}_{A_k}, \tilde{\xi} \rangle_{\rho}}{\langle \mathbb{1}_{A_k}, \mathbb{1} \rangle_{\rho}} \mathbb{1}_{A_k}$, we may approximate $\tilde{\xi}_N(x)$ for *x* ∈ *A*_{k} with another Monte Carlo sum,

$$\tilde{\xi}_N(x) = \bar{\xi}_{A_k} \approx \frac{1}{|X_M \cap A_k|} \sum_{x_j \in A_k} \tilde{\xi}(x_j).$$

Combining everything gives the estimator

$$S_{\xi}(A_1, \ldots, A_N) := \frac{1}{M} \sum_{k=1}^{N} \sum_{x_i \in A_k} \bigg\Vert \tilde{\xi}(x_i) - \frac{1}{|X_M \cap A_k|} \sum_{x_j \in A_k} \tilde{\xi}(x_j) \bigg\Vert^2.$$

*S*_{ξ}(*A*_{1}, …, *A*_{N}) can be recognized as the objective function of *k*-means clustering in the image space of the reaction coordinate $\tilde{\xi}$. To minimize this objective function directly, one would have to know $\tilde{\xi}$. If we, however, additionally assume that $\tilde{\xi}$ is Lipschitz continuous with Lipschitz constant *L*, then

$$S_{\xi}(A_1, \ldots, A_N) \le \frac{L^2}{M} \sum_{k=1}^{N} \sum_{x_i \in A_k} \Vert x_i - e_k \Vert^2,$$

where *e*_{k} is such that $\tilde{\xi}(e_k) = \bar{\xi}_{A_k}$ and ∥ · ∥ now is the Euclidean norm in $\mathbb{R}^n$. Minimizing this upper bound is now achieved by *k*-means clustering the data set $X_M$ in the original state space.
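This strategy can be sketched with a plain Lloyd iteration, used here as a naive, self-contained stand-in for a production *k*-means implementation; the initialization from the first *N* samples is a simplification.

```python
import numpy as np

def kmeans_centers(X, N, iters=50):
    """Plain Lloyd iteration: the returned centers e_k are Voronoi center
    points that (locally) minimize the k-means objective, i.e., the upper
    bound on the L2 error of the reaction coordinate approximation."""
    X = np.asarray(X, dtype=float)
    centers = X[:N].copy()                     # naive deterministic initialization
    for _ in range(iters):
        # assign every sample to its nearest center (= its Voronoi cell)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        # move every center to the mean of its cell
        for k in range(N):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers

# two well-separated Gaussian blobs: the two centers should find the blob means
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (100, 2)),
               rng.normal(5.0, 0.1, (100, 2))])
centers = kmeans_centers(X, N=2)
```

In practice, one would rather use a library implementation with *k*-means++ initialization and apply it to the full sample set $X_M$.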

#### 2. Minimizing the uniform error

Assume that we now want to minimize the uniform error

$$\big\Vert \tilde{\xi} - \tilde{\xi}_N \big\Vert_{\infty} = \sup_{x} \big\Vert \tilde{\xi}(x) - \tilde{\xi}_N(x) \big\Vert.$$

Assume again that $\tilde{\xi}$ is Lipschitz continuous with Lipschitz constant *L*. Evidently, we have

$$\tilde{\xi}_N(x) = \bar{\xi}_{A_k} \quad \text{for } x \in A_k,$$

with $\bar{\xi}_{A_k}$ as defined in (C2). Let now *e*_{k} ∈ *A*_{k} be such that $\tilde{\xi}(e_k) = \bar{\xi}_{A_k}$ (such an *e*_{k} exists by continuity of $\tilde{\xi}$). Then, with $\Vert \cdot \Vert_{\infty, A_k}$ denoting the uniform norm in *A*_{k},

$$\big\Vert \tilde{\xi} - \tilde{\xi}_N \big\Vert_{\infty, A_k} = \sup_{x \in A_k} \big\Vert \tilde{\xi}(x) - \tilde{\xi}(e_k) \big\Vert \le L \sup_{x \in A_k} \Vert x - e_k \Vert \le L\, \mathrm{diam}(A_k),$$

where diam(*A*_{k}) is the diameter of the Voronoi cell *A*_{k}. Since *A*_{1}, …, *A*_{N} partition $\mathbb{X}$, we have

$$\big\Vert \tilde{\xi} - \tilde{\xi}_N \big\Vert_{\infty} \le L \max_{k = 1, \ldots, N} \mathrm{diam}(A_k).$$

Minimizing this upper bound then means looking for Voronoi centers such that the diameter of the largest Voronoi cell is minimized. Since the number of Voronoi cells and the volume of the set $X_M$ to be covered are fixed, the minimum is achieved if the centers cover $X_M$ evenly, such that the Voronoi cells all have similar diameters. Therefore, we may alternatively maximize the diameter of the smallest Voronoi cell, which is bounded from below by half the minimal internal point distance,

$$\mathrm{diam}(A_i) \ge \frac{1}{2} \min_{j \ne i} \Vert e_i - e_j \Vert.$$

The inequality holds because $\frac{1}{2} \min_{j \ne i} \Vert e_i - e_j \Vert$ is the distance from *e*_{i} to that face of *A*_{i} which is closest to *e*_{i}, while the diameter of *A*_{i} is by definition larger. Maximizing the lower bound then leads to the objective function of maximal minimal internal point distance
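A standard greedy heuristic for a max-min distance objective of this kind is farthest-point sampling, sketched below; the helper name `pick_centers` and the deterministic start from the first sample are our own choices.

```python
import numpy as np

def pick_centers(X, L):
    """Greedy farthest-point sampling: start from the first sample and
    repeatedly add the sample farthest from all centers chosen so far.
    This heuristically maximizes the minimal pairwise center distance."""
    X = np.asarray(X, dtype=float)
    idx = [0]                                     # deterministic start
    d = np.linalg.norm(X - X[0], axis=1)          # distance to nearest chosen center
    for _ in range(L - 1):
        j = int(d.argmax())                       # farthest remaining sample
        idx.append(j)
        d = np.minimum(d, np.linalg.norm(X - X[j], axis=1))
    return X[idx]

# on a uniform grid over the unit square, 4 centers land on the 4 corners
g = np.linspace(0.0, 1.0, 11)
grid = np.array([(a, b) for a in g for b in g])
centers = pick_centers(grid, L=4)
```

Each added center costs one distance computation over the whole data set, so the scheme scales linearly in both *M* and *L*.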