Understanding dynamics in complex systems is challenging because there are many degrees of freedom, and those that are most important for describing events of interest are often not obvious. The leading eigenfunctions of the transition operator are useful for visualization, and they can provide an efficient basis for computing statistics, such as the likelihood and average time of events (predictions). Here, we develop inexact iterative linear algebra methods for computing these eigenfunctions (spectral estimation) and making predictions from a dataset of short trajectories sampled at finite intervals. We demonstrate the methods on a low-dimensional model that facilitates visualization and a high-dimensional model of a biomolecular system. Implications for the prediction problem in reinforcement learning are discussed.

## I. INTRODUCTION

Modern observational, experimental, and computational approaches often yield high-dimensional time series data (trajectories) for complex systems. In principle, these trajectories contain rich information about dynamics and, in particular, the infrequent events that are often most consequential. In practice, however, high-dimensional trajectory data are often difficult to parse for useful insight. The need for more efficient statistical analysis tools for trajectory data is critical, especially when the goal is to understand rare events that may not be well represented in the data.

We consider dynamics that can be treated as Markov processes. A common starting point for statistical analyses of Markov processes is the transition operator, which describes the evolution of function expectations. The eigenfunctions of the transition operator characterize the most slowly decorrelating features (modes) of the system.^{1–5} These can be used for dimensionality reduction to obtain a qualitative understanding of the dynamics,^{6,7} or they can be used as the starting point for further computations.^{8–10} Similarly, prediction functions, which provide information about the likelihood and timing of future events as a function of the current state, are defined through linear equations of the transition operator.^{10,11}

A straightforward numerical approach to obtaining these functions is to convert the transition operator to a matrix by projecting onto a finite basis for Galerkin approximation.^{1,2,10–15} The performance of such a linear approximation depends on the choice of basis,^{10,11,15} and previous work often resorts to a set of indicator functions on a partition of the state space (resulting in a Markov state model or MSM^{14}) for lack of a better choice. While Galerkin approximation has yielded many insights,^{16,17} the limited expressivity of the basis expansion has stimulated interest in (nonlinear) alternatives.

In particular, artificial neural networks can be harnessed to learn eigenfunctions of the transition operator and prediction functions from data.^{5,18–25} However, existing approaches based on neural networks suffer from various drawbacks. As discussed in Ref. 5, their performance can often be very sensitive to hyperparameters, requiring extensive tuning and varying with random initialization. Many use loss functions that are estimated against the stationary distribution,^{25–30} so that metastable states contribute most heavily, which negatively impacts performance.^{24,30} Assumptions about the dynamics (e.g., microscopic reversibility) limit applicability. In Ref. 24, we introduced an approach that overcomes the issues above, but it uses multiple trajectories from each initial condition; this limits the approach to analysis of simulations and moreover requires specially prepared datasets.

The need to compute prediction functions from observed trajectory data also arises in reinforcement learning. There the goal is to optimize an expected future reward (the prediction function) over a policy (a Markov process). For a fixed Markov process, the prediction problem in reinforcement learning is often solved by temporal difference (TD) methods, which allow the use of arbitrary ensembles of trajectories without knowledge of the details of the underlying dynamics.^{31} TD methods have a close relationship with an inexact form of power iteration, which, as we describe, can perform poorly on rare-event related problems.

Motivated by this relationship, as well as by an inexact power iteration scheme previously proposed for approximating the stationary probability distribution of a Markov process using trajectory data,^{32} we propose a computational framework for spectral estimation and rare-event prediction based on inexact iterative numerical linear algebra. Our framework includes an inexact Richardson iteration for the prediction problem, as well as an extension to inexact subspace iteration for the prediction and spectral estimation problems. The theoretical properties of exact subspace iteration suggest that eigenfunctions outside the span of the approximation will contribute significantly to the error of our inexact iterative schemes.^{33} Consistent with this prediction, we demonstrate that learning additional eigenvalues and eigenfunctions simultaneously through inexact subspace iteration accelerates convergence dramatically relative to inexact Richardson and power iteration in the context of rare events. While we assume the dynamics can be modeled by a Markov process, we do not require knowledge of their form or a specific underlying model. The method shares a number of further advantages with the approach discussed in Ref. 24 without the need for multiple trajectories from each initial condition in the dataset. This opens the door to treating a wide range of observational, experimental, and computational datasets.

The remainder of the paper is organized as follows: In Sec. II, we describe the quantities that we seek to compute in terms of linear operators. In Secs. III and IV, we introduce an inexact subspace iteration algorithm that we use to solve for these quantities. Section V illustrates how the loss function can be tailored to the known properties of the desired quantity. Section VI summarizes the two test systems that we use to illustrate our methods: a two-dimensional potential, for which we can compute accurate reference solutions, and a molecular example that is high-dimensional but still sufficiently tractable that statistics for comparison can be computed from long trajectories. In Sec. VII, we explain the details of the invariant subspace iteration and then demonstrate its application to our two examples. Finally, Sec. VIII details how the subspace iteration can be modified to compute prediction functions and compares the effect of different loss functions, as well as the convergence properties of power iteration and subspace iteration. We conclude with implications for reinforcement learning.

## II. SPECTRAL ESTIMATION AND THE PREDICTION PROBLEM

We first consider estimating the *dominant eigenfunctions and eigenvalues* of the transition operator $\mathcal{T}^t$ for a Markov process $X^t \in \mathbb{R}^d$, defined as

$$\mathcal{T}^t f(x) = \mathbb{E}_x[f(X^t)],$$

where $f$ is an arbitrary real-valued function and the subscript $x$ indicates the initial condition $X^0 = x$. The transition operator encodes how expectations of functions evolve in time. The right eigenfunctions of $\mathcal{T}^t$ with the largest eigenvalues characterize the most slowly decorrelating features (modes) of the Markov process.^{1,2,4,5}

We also seek to compute *prediction functions* of the general form

$$u(x) = \mathbb{E}_x\!\left[\Psi(X^T) + \sum_{t=0}^{T-1} \Gamma(X^t)\right],$$

where $T$ is the first time $X^t \notin D$ for some domain $D$, and $\Psi$ and $\Gamma$ are functions associated with the escape event (rewards in reinforcement learning). Prototypical examples of prediction functions that appear in our numerical results are the mean first passage time (MFPT) from each $x$,

$$m(x) = \mathbb{E}_x[T],$$

corresponding to $\Psi = 0$ and $\Gamma = 1$, and the (forward) committor,

$$q_+(x) = \mathbb{P}_x[X^T \in B],$$

corresponding to $\Psi = 1_B$ and $\Gamma = 0$, where $A$ and $B$ are disjoint sets ("reactant" and "product" states) and $D = (A \cup B)^c$. The MFPT is important for estimating rates of transitions between regions of state space, while the committor can serve as a reaction coordinate^{29,34–36} and as a key ingredient in transition path theory statistics.^{24,37,38} For any $\tau > 0$, the prediction function $u(x)$ satisfies the linear equation

$$(\mathcal{I} - \mathcal{S}^\tau)\, u(x) = \mathbb{E}_x\!\left[\Psi(X^T)\,1_{T \le \tau} + \sum_{t=0}^{(\tau \wedge T) - 1} \Gamma(X^t)\right] \quad (5)$$

for $x \in D$, with boundary condition $u(x) = \Psi(x)$ for $x \notin D$. In (5), $\mathcal{I}$ is the identity operator and $\mathcal{S}^\tau$ is the stopped transition operator,^{10}

$$\mathcal{S}^\tau u(x) = \mathbb{E}_x\!\left[u(X^\tau)\,1_{T > \tau}\right].$$

We use the notation $\tau \wedge T = \min\{\tau, T\}$. The parameter $\tau$ is known as the lag time. While it is an integer corresponding to a number of integration steps, in our numerical examples, we often express it in terms of equivalent time units.
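To make these definitions concrete, the following sketch (ours, not part of the original text) computes the committor and the MFPT for a hypothetical five-state random walk by solving the corresponding linear systems directly: for a finite-state chain with transition matrix $P$, the committor solves $(I - P)q = 0$ on $D$ with $q = 1_B$ on $A \cup B$, and the MFPT solves $(I - P)m = 1$ on $D$ with $m = 0$ outside $D$.

```python
import numpy as np

# Hypothetical 5-state birth-death chain: A = {0}, B = {4}.
P = np.zeros((5, 5))
for i in range(1, 4):
    P[i, i - 1] = P[i, i + 1] = 0.5
P[0, 0] = P[4, 4] = 1.0            # make the endpoint states absorbing

D = [1, 2, 3]                      # states outside A and B
I = np.eye(len(D))
P_DD = P[np.ix_(D, D)]

# Committor: (I - P_DD) q = (probability of stepping directly into B).
b = P[np.ix_(D, [4])].ravel()
q = np.linalg.solve(I - P_DD, b)

# MFPT to A u B: (I - P_DD) m = 1 on D.
m = np.linalg.solve(I - P_DD, np.ones(len(D)))

print(q)  # -> [0.25, 0.5, 0.75]
print(m)  # -> [3., 4., 3.]
```

On this symmetric walk, the committor interpolates linearly between the absorbing states and the MFPT is largest at the midpoint, which provides a quick sanity check of the linear-system formulation.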

We seek to compute these objects for $X^t$ in high dimensions and without direct access to a model governing its evolution. Instead, we have access to trajectories of $X^t$ of a fixed, finite duration $\tau$. A natural and generally applicable approach to finding an approximate solution to the prediction problem is to attempt to minimize the residual of (5) over parameters $\theta$ of a neural network $u_\theta(x)$ representing $u(x)$. For example, we recently suggested an algorithm that minimizes the residual norm

$$\big\|(\mathcal{I} - \mathcal{S}^\tau)\, u_\theta - r\big\|_\mu^2, \quad (8)$$

where $r(x)$ is the right-hand side of (5) and $\mu$ is an arbitrary distribution of initial conditions $X^0$ (boundary conditions were enforced by an additional term).^{24} The norm $\|\cdot\|_\mu$ is defined by $\|f\|_\mu^2 = \langle f, f \rangle_\mu$, where $\langle f, g \rangle_\mu = \int f(x)\, g(x)\, \mu(dx)$ for arbitrary functions $f$ and $g$. The gradient of the residual norm in (8) with respect to the neural-network parameters $\theta$ can be written as

$$\nabla_\theta \big\|(\mathcal{I} - \mathcal{S}^\tau)\, u_\theta - r\big\|_\mu^2 = 2\big\langle \nabla_\theta u_\theta,\ (\mathcal{I} - \mathcal{S}^\tau)\, u_\theta - r \big\rangle_\mu - 2\big\langle \mathcal{S}^\tau \nabla_\theta u_\theta,\ (\mathcal{I} - \mathcal{S}^\tau)\, u_\theta - r \big\rangle_\mu. \quad (9)$$

Given samples $X_j^0$ drawn from $\mu$ and a single sample trajectory $\{X_j^t\}_{t=0}^\tau$ of $X^t$ for each $X_j^0$, the first term in the gradient (9) can be approximated without bias as

$$\frac{2}{n} \sum_{j=1}^n \nabla_\theta u_\theta(X_j^0) \bigg[ u_\theta(X_j^0) - 1_{T_j > \tau}\, u_\theta(X_j^\tau) - \Psi(X_j^{T_j})\, 1_{T_j \le \tau} - \sum_{t=0}^{(\tau \wedge T_j) - 1} \Gamma(X_j^t) \bigg],$$

where $T_j$ is the first time $X_j^t \notin D$.

In contrast, unbiased estimation of the second term in (9) requires access to two independent trajectories of $X^t$ for each sample initial condition since it is quadratic in $\mathcal{S}^\tau$.^{24,31} In the reinforcement learning community, TD methods were developed for the purpose of minimizing a loss of a very similar form to (8), and they perform a "semigradient" descent by following only the first term in (9).^{31} As in the semigradient approximation, the algorithms proposed in this paper only require access to inner products of the form $\langle f, \mathcal{A} g \rangle_\mu$ for an operator $\mathcal{A}$ related to $\mathcal{T}^\tau$ or $\mathcal{S}^\tau$, and they avoid terms that are non-linear in $\mathcal{A}$. As we explain, such inner products can be estimated using trajectory data. However, rather than attempting to minimize the residual directly by an approximate gradient descent, we view the eigenproblem and prediction problems through the lens of iterative numerical linear algebra schemes.
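The inner products just mentioned can be estimated from trajectory data. As a sketch (ours, with a hypothetical one-dimensional dataset), an estimator of $\langle f, \mathcal{S}^\tau g \rangle_\mu$ needs only one trajectory per initial condition because the estimator is linear in $\mathcal{S}^\tau$:

```python
import numpy as np

def inner_product_S_tau(f, g, X0, Xtau, survived):
    """Monte Carlo estimate of <f, S^tau g>_mu from trajectory data.

    X0, Xtau: arrays of initial and time-tau states.
    survived: boolean array, True where the trajectory stayed in D up to
    time tau (T_j > tau).  Because the estimator is linear in S^tau, a
    single trajectory per initial condition gives an unbiased estimate.
    """
    return np.mean(f(X0) * g(Xtau) * survived)

# Hypothetical check: data with correlation 0.5 and D the whole line,
# so every trajectory "survives" and S^tau is the plain transition operator.
rng = np.random.default_rng(0)
X0 = rng.normal(size=100_000)
Xtau = 0.5 * X0 + np.sqrt(1 - 0.25) * rng.normal(size=100_000)
est = inner_product_S_tau(lambda x: x, lambda x: x, X0, Xtau,
                          np.ones_like(X0, dtype=bool))
# est approximates E[X^0 X^tau] = 0.5
```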

## III. AN INEXACT POWER ITERATION

To make the connection with TD methods precise, consider defining new parameters $\theta'$ as an approximate minimizer of the loss

$$\big\| u_{\theta'} - (\mathcal{S}^\tau u_\theta + r) \big\|_\mu^2, \quad (11)$$

with the current parameters $\theta$ held fixed. The semigradient used by TD methods is exactly the gradient of this loss with respect to $\theta'$, evaluated at $\theta' = \theta$. The result of many steps of gradient descent (later, "inner iterations") on this loss with $\theta$ held fixed can then be viewed as an inexact Richardson iteration for (5), resulting in a sequence

$$u_{\theta^{s+1}} \approx \mathcal{S}^\tau u_{\theta^s} + r, \quad (12)$$

where, for each $s$, $u_{\theta^s}$ is a parameterized neural-network approximation of the solution to (5). To unify our discussion of the prediction and spectral estimation problems, it is helpful to observe that (12) can be recast as an equivalent inexact power iteration

$$\begin{pmatrix} 1 \\ u_{\theta^{s+1}} \end{pmatrix} \approx \mathcal{A} \begin{pmatrix} 1 \\ u_{\theta^s} \end{pmatrix} \quad (13) \qquad \text{with} \qquad \mathcal{A} = \begin{pmatrix} \mathcal{I} & 0 \\ r & \mathcal{S}^\tau \end{pmatrix}, \quad (14)$$

where $(1, u)^\top$ is the dominant eigenfunction of $\mathcal{A}$ and has eigenvalue equal to 1.

Reference 32 introduced an inexact power iteration to compute the stationary probability measure of $\mathcal{T}^\tau$, i.e., its dominant left eigenfunction. As those authors note, an inexact power update, such as (13), can be enforced using a variety of loss functions. In our setting, the $L_\mu^2$ norm in (11) can be replaced by any other measure of the difference between $u_{\theta'}$ and $\mathcal{S}^\tau u_\theta + r$, as long as an unbiased estimator of its gradient with respect to $\theta'$ is available. This flexibility is discussed in more detail in Sec. V, and we exploit it in applications presented later in this article. For now, we focus instead on another important implication of this viewpoint: the flexibility in the form of the iteration itself.

We will see that when the spectral gap of $\mathcal{A}$, the difference between its largest and second largest eigenvalues, is small, the inexact power iteration (or the equivalent Richardson iteration) described above fails to reach an accurate solution. The largest eigenvalue of $\mathcal{A}$ in (14) is 1, and its next largest eigenvalue is the dominant eigenvalue of $\mathcal{S}^\tau$. For rare-event problems, the difference between these two eigenvalues is expected to be very small. Indeed, when $X^0$ is drawn from the quasi-stationary distribution of $X^t$ in $D$, the logarithm of the largest eigenvalue of $\mathcal{S}^\tau$ is exactly $-\tau/\mathbb{E}[T]$.^{39} Thus, when the mean exit time from $D$ is large, we can expect the spectral gap of $\mathcal{A}$ in (14) to be very small. In this case, classical convergence results tell us that exact power iteration converges slowly, with the largest contributions to the error coming from the eigenfunctions of $\mathcal{S}^\tau$ with largest magnitude eigenvalues.^{33} Iterative schemes that approximate multiple dominant eigenfunctions simultaneously, such as subspace iteration and Krylov methods, can converge much more rapidly.^{33} In Sec. IV, we introduce an inexact form of subspace iteration to address this shortcoming. Beyond dramatically improving performance on the prediction problem for rare events when applied with $\mathcal{A}$ in (14), it can also be applied with $\mathcal{A} = \mathcal{T}^\tau$ to compute multiple dominant eigenfunctions and values of $\mathcal{T}^\tau$ itself.
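The convergence contrast invoked here can be reproduced with a small synthetic example (ours, not from the original text): for a symmetric matrix with leading eigenvalues 1.0 and 0.999, power iteration stalls because the error decays like $(0.999)^s$, while a two-dimensional subspace iteration with QR re-orthogonalization captures the dominant eigenvector after the same number of matrix applications.

```python
import numpy as np

rng = np.random.default_rng(1)
# Symmetric test matrix with a tiny gap between the two leading eigenvalues
# (1.0 vs 0.999) and a large gap to the rest of the spectrum.
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
evals = np.concatenate(([1.0, 0.999], 0.5 * rng.random(48)))
A = Q @ np.diag(evals) @ Q.T
v1 = Q[:, 0]                       # dominant eigenvector

def power_iteration(A, steps):
    u = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(steps):
        u = A @ u
        u /= np.linalg.norm(u)
    return u

def subspace_iteration(A, steps, k=2):
    U = rng.normal(size=(A.shape[0], k))
    for _ in range(steps):
        U, _ = np.linalg.qr(A @ U)  # re-orthogonalize between iterations
    return U

u_pow = power_iteration(A, 100)
U_sub = subspace_iteration(A, 100)
pow_err = np.linalg.norm(v1 - np.sign(u_pow @ v1) * u_pow)
sub_err = np.linalg.norm(v1 - U_sub @ (U_sub.T @ v1))
# After 100 applications of A, the 2D subspace contains v1 essentially to
# machine precision, while the power-iteration error decays only as 0.999^100.
```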

## IV. AN INEXACT SUBSPACE ITERATION

Our inexact subspace iteration for computing the $k$ dominant eigenfunctions/values of $\mathcal{A}$ proceeds as follows. Let $\{\phi_{\theta^s}^a\}_{a=1}^k$ be a sequence of $k$ basis functions parameterized by $\theta^s$ (these can be scalar or vector valued functions depending on the form of $\mathcal{A}$). We can represent this basis as the vector valued function

$$U_{\theta^s} = \big(\phi_{\theta^s}^1, \phi_{\theta^s}^2, \ldots, \phi_{\theta^s}^k\big).$$

We obtain an updated set of $k$ basis functions by approximately applying $\mathcal{A}$ to each of the components of $U_{\theta^s}$,

$$U_{\theta^{s+1}} K^{s+1} \approx \mathcal{A}\, U_{\theta^s},$$

where $K^{s+1}$ is an invertible, upper-triangular $k \times k$ matrix that does not change the span of the components of $U_{\theta^{s+1}}$ but is included to facilitate training. One way the approximate application of $\mathcal{A}$ can be accomplished is by minimizing

$$\sum_{a=1}^k \Big\| \sum_{b=1}^k \phi_\theta^b K_{ba} - \mathcal{A} \phi_{\theta^s}^a \Big\|_\mu^2 \quad (17)$$

over $\theta$ and $K$ with $\theta^s$ held fixed. After the final subspace iteration, we estimate the dominant eigenvalues and eigenfunctions of $\mathcal{A}$ by solving the generalized eigenproblem

$$C^\tau W = C^0 W \Lambda, \quad (19)$$

where

$$C_{ab}^0 = \big\langle \phi_{\theta^s}^a, \phi_{\theta^s}^b \big\rangle_\mu \qquad \text{and} \qquad C_{ab}^\tau = \big\langle \phi_{\theta^s}^a, \mathcal{A} \phi_{\theta^s}^b \big\rangle_\mu, \quad (20)$$

and $W$ and $\Lambda$ are $k \times k$ matrices. The matrix $\Lambda$ is diagonal, and the $a$th eigenvalue $\lambda_a$ of $\mathcal{A}$ is approximated by $\Lambda_{aa}$; the corresponding eigenfunction $v_a$ is approximated by $\sum_{b=1}^k W_{ab} \phi_{\theta^s}^b$.

Even when sampling is not required to estimate the matrices in (19) and (20), the numerical rank of *C*^{τ} becomes very small as the eigenfunctions become increasingly aligned with the single dominant eigenfunction. To overcome this issue, we apply an orthogonalization step between iterations (or every few iterations). Just as the matrices *C*^{0} and *C*^{τ} can be estimated using trajectory data, the orthogonalization procedure can also be implemented approximately using data.
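A data-driven version of this orthogonalization can be sketched as follows (our illustration): the basis functions are represented by their values on the sampled initial conditions, and the QR factors of that matrix define a change of basis, with the diagonal matrix of pre-orthogonalization column norms preserving the overall scale (cf. lines 13–16 of Algorithm 1).

```python
import numpy as np

def orthogonalize(phi_vals):
    """Orthogonalize basis functions represented by their values on data.

    phi_vals: (n, k) array whose column a holds phi^a evaluated at the n
    sampled initial conditions.  Returns the (k, k) change-of-basis matrix
    R^{-1} N; the transformed functions phi_vals @ (R^{-1} N) have mutually
    orthogonal columns on the data while retaining the original scale.
    """
    N = np.diag(np.linalg.norm(phi_vals, axis=0))  # pre-QR column norms
    Q, R = np.linalg.qr(phi_vals)                  # reduced QR, Q is (n, k)
    return np.linalg.solve(R, N)

# Hypothetical check with two nearly collinear basis functions.
rng = np.random.default_rng(2)
x = rng.normal(size=(1000, 1))
phi = np.hstack([x, x + 1e-3 * rng.normal(size=(1000, 1))])
T = orthogonalize(phi)
G = (phi @ T).T @ (phi @ T)   # Gram matrix of transformed basis is diagonal
```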

Finally, we damp the iteration by replacing the target $\mathcal{A}\phi_{\theta^s}^a$ in (17) with the mixture $\alpha_s \mathcal{A}\phi_{\theta^s}^a + (1 - \alpha_s)\phi_{\theta^s}^a$, where $0 < \alpha_s \le 1$. Decreasing $\alpha_s$ as the number of iterations increases results in better generalization properties of the final solution and helps ensure convergence of the iteration. For our numerical experiments, we use a schedule in which $\alpha_s = 1$ for the first $\sigma$ iterations and then decreases gradually toward zero; $\sigma$ is a user-chosen hyperparameter that sets the number of iterations performed before damping begins.

The details, including estimators for all required inner products, in the case of the eigenproblem ($\mathcal{A} = \mathcal{T}^\tau$) are given in Sec. VII and Algorithm 1. For the prediction problem with $\mathcal{A}$ as in (14), they are given in Sec. VIII and Algorithm 2.

**Algorithm 1.** Inexact subspace iteration for spectral estimation.

**Require:** Subspace dimension $k$, transition data $\{(X_j^0, X_j^\tau)\}_{j=1}^n$, batch size $B$, learning rate $\eta$, number of subspace iterations $S$, number of inner iterations $M$, regularization parameters $\gamma_1$ and $\gamma_2$

1: Initialize $\{\phi_\theta^a\}_{a=1}^k$ and $\{\tilde\phi_1^a\}_{a=1}^k$
2: **for** $s = 1 \ldots S$ **do**
3: &nbsp;&nbsp;**for** $m = 1 \ldots M$ **do**
4: &nbsp;&nbsp;&nbsp;&nbsp;Sample a batch of data $\{(X_j^0, X_j^\tau)\}_{j=1}^B$
5: &nbsp;&nbsp;&nbsp;&nbsp;$\hat L_1 \leftarrow \frac{1}{B}\sum_{j=1}^{B}\sum_{a=1}^{k}\Big[\frac{1}{2}\big(\sum_{b=1}^{k}\phi_\theta^b(X_j^0)K_{ba}\big)^2 - \sum_{b=1}^{k}\phi_\theta^b(X_j^0)K_{ba}\big(\alpha_s\tilde\phi_s^a(X_j^\tau) + (1-\alpha_s)\tilde\phi_s^a(X_j^0)\big)\Big]$
6: &nbsp;&nbsp;&nbsp;&nbsp;$\hat L_K \leftarrow \gamma_1\|K - \operatorname{diag}(K)\|_F^2$
7: &nbsp;&nbsp;&nbsp;&nbsp;$\hat L_{\text{norm}} \leftarrow \gamma_2\sum_{a=1}^{k}\big(2\nu_a\big(\frac{1}{B}\sum_{j=1}^{B}\phi_\theta^a(X_j^0)^2 - 1\big) - \nu_a^2\big)$
8: &nbsp;&nbsp;&nbsp;&nbsp;$\hat L \leftarrow \hat L_1 + \hat L_K + \hat L_{\text{norm}}$
9: &nbsp;&nbsp;&nbsp;&nbsp;$\theta \leftarrow \theta - \eta\nabla_\theta\hat L$
10: &nbsp;&nbsp;&nbsp;&nbsp;$K \leftarrow K - \eta\operatorname{triu}(\nabla_K\hat L)$
11: &nbsp;&nbsp;&nbsp;&nbsp;$\nu_a \leftarrow \nu_a + \eta\nabla_{\nu_a}\hat L$
12: &nbsp;&nbsp;**end for**
13: &nbsp;&nbsp;Compute the matrix $\Phi_{ia} = \phi_\theta^a(X_i^0)$ ⊳ $\Phi\in\mathbb{R}^{n\times k}$
14: &nbsp;&nbsp;Compute the diagonal matrix $N_{aa}^2 = \sum_i \phi_\theta^a(X_i^0)^2$
15: &nbsp;&nbsp;Compute the QR-decomposition $\Phi = QR$ ⊳ $Q\in\mathbb{R}^{n\times k}$ and $R\in\mathbb{R}^{k\times k}$
16: &nbsp;&nbsp;$\tilde\phi_s^a \leftarrow \sum_{b=1}^{k}\phi_\theta^b(R^{-1}N)_{ba}$
17: &nbsp;&nbsp;$K_{1:i,i} \leftarrow \operatorname{argmin}_c \frac{1}{n}\sum_{j=1}^{n}\big(\sum_{a=1}^{i}\phi_\theta^a(X_j^0)c_a - \tilde\phi_s^i(X_j^\tau)\big)^2 + \gamma_2\sum_{a=1}^{i-1}c_a^2$
18: **end for**
19: Compute the matrices $C_{ab}^t = \frac{1}{n}\sum_{j=1}^{n}\tilde\phi_s^a(X_j^0)\tilde\phi_s^b(X_j^t)$ for $t = 0, \tau$ ⊳ $C^t\in\mathbb{R}^{k\times k}$
20: Solve the generalized eigenproblem $C^\tau W = C^0 W\Lambda$ for $W$ and $\Lambda$
21: **return** eigenvalues $\Lambda$, eigenfunctions $v_a = \sum_{b=1}^{k}W_{ab}\tilde\phi_s^b$
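The final steps of Algorithm 1 reduce to dense linear algebra once the basis values are tabulated. The following sketch (our illustration, using a hypothetical two-state chain with an indicator basis, not the paper's code) estimates $C^0$ and $C^\tau$ from sampled transitions and solves the generalized eigenproblem with `scipy.linalg.eig`:

```python
import numpy as np
from scipy.linalg import eig

def spectral_estimates(phi0, phitau):
    """Steps 19-20 of Algorithm 1: estimate C^0 and C^tau from data and
    solve the generalized eigenproblem C^tau W = C^0 W Lambda.

    phi0, phitau: (n, k) arrays of basis values at X_j^0 and X_j^tau.
    """
    n = phi0.shape[0]
    C0 = phi0.T @ phi0 / n
    Ctau = phi0.T @ phitau / n
    lam, W = eig(Ctau, C0)
    order = np.argsort(-lam.real)        # sort by decreasing eigenvalue
    return lam[order].real, W[:, order]

# Hypothetical check: a two-state chain whose transition operator has
# eigenvalues 1 and 1 - (p + q) for hopping probabilities p, q.
p, q = 0.3, 0.1
pi = np.array([q, p]) / (p + q)          # stationary distribution
rng = np.random.default_rng(3)
X0 = rng.choice(2, size=200_000, p=pi)
hop = rng.random(200_000) < np.array([p, q])[X0]
X1 = np.where(hop, 1 - X0, X0)
basis = np.eye(2)                        # indicator basis on the two states
lam, W = spectral_estimates(basis[X0], basis[X1])
# lam approximates (1.0, 0.6)
```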


**Algorithm 2.** Inexact subspace iteration for prediction.

**Require:** Subspace dimension $k$, stopped transition data $\{(X_j^0, X_j^{\tau\wedge T_j})\}_{j=1}^n$, reward data $\{\rho(X_j)\}_{j=1}^n$, batch size $B$, learning rate $\eta$, number of subspace iterations $S$, number of inner iterations $M$, regularization parameters $\gamma_1$ and $\gamma_2$

1: Initialize $\{\phi_\theta^a\}_{a=1}^k$ and $\{\tilde\phi_1^a\}_{a=1}^k$
2: **for** $s = 1 \ldots S$ **do**
3: &nbsp;&nbsp;**for** $m = 1 \ldots M$ **do**
4: &nbsp;&nbsp;&nbsp;&nbsp;Sample a batch of data $\{(X_j^0, X_j^{\tau\wedge T_j})\}_{j=1}^B$, $\{\rho(X_j)\}_{j=1}^B$
5: &nbsp;&nbsp;&nbsp;&nbsp;$\hat L_1 \leftarrow \frac{1}{B}\sum_{j=1}^{B}\sum_{a=1}^{k}\Big[\frac{1}{2}\big(\sum_{b=1}^{k}\phi_\theta^b(X_j^0)K_{ba}\big)^2 - \alpha_s\sum_{b=1}^{k}\phi_\theta^b(X_j^0)K_{ba}\big(\tilde\phi_s^a(X_j^{\tau\wedge T_j}) + \rho(X_j)I_{a1}\big)\Big]$
6: &nbsp;&nbsp;&nbsp;&nbsp;$\hat L_2 \leftarrow -\frac{1}{B}\sum_{j=1}^{B}\sum_{a=1}^{k}(1-\alpha_s)\sum_{b=1}^{k}\phi_\theta^b(X_j^0)K_{ba}\tilde\phi_s^a(X_j^0)$
7: &nbsp;&nbsp;&nbsp;&nbsp;$\hat L_K \leftarrow \gamma_1\|K - \operatorname{diag}(K)\|_F^2$
8: &nbsp;&nbsp;&nbsp;&nbsp;$\hat L_{\text{norm}} \leftarrow \gamma_2\sum_{a=2}^{k}\big(2\nu_a\big(\frac{1}{B}\sum_{j=1}^{B}\phi_\theta^a(X_j^0)^2 - 1\big) - \nu_a^2\big)$
9: &nbsp;&nbsp;&nbsp;&nbsp;$\hat L \leftarrow \hat L_1 + \hat L_2 + \hat L_K + \hat L_{\text{norm}}$
10: &nbsp;&nbsp;&nbsp;&nbsp;$\theta \leftarrow \theta - \eta\nabla_\theta\hat L$
11: &nbsp;&nbsp;&nbsp;&nbsp;$K \leftarrow K - \eta\operatorname{triu}(\nabla_K\hat L)$
12: &nbsp;&nbsp;&nbsp;&nbsp;$\nu_a \leftarrow \nu_a + \eta\nabla_{\nu_a}\hat L$
13: &nbsp;&nbsp;**end for**
14: &nbsp;&nbsp;**if** $\Psi(x) = 0$ **then**
15: &nbsp;&nbsp;&nbsp;&nbsp;Compute the matrix $\Phi_{ia} = \phi_\theta^a(X_i^0)$ ⊳ $\Phi\in\mathbb{R}^{n\times k}$
16: &nbsp;&nbsp;**else**
17: &nbsp;&nbsp;&nbsp;&nbsp;Compute the matrix $\Phi_{ia} = \phi_\theta^a(X_i^0)$ for $a > 1$ ⊳ $\Phi\in\mathbb{R}^{n\times(k-1)}$
18: &nbsp;&nbsp;**end if**
19: &nbsp;&nbsp;Compute the QR-decomposition $\Phi = QR$
20: &nbsp;&nbsp;Compute the diagonal matrix $N_{aa}^2 = \sum_i\phi_\theta^a(X_i^0)^2$
21: &nbsp;&nbsp;$\tilde\phi_s^a \leftarrow \sum_{b=1}^{k}\phi_\theta^b(R^{-1}N)_{ba}$ ⊳ if $\Psi(x) = 0$ exclude $a = 1$
22: &nbsp;&nbsp;$K_{1:i,i} \leftarrow \operatorname{argmin}_c \frac{1}{n}\sum_{j=1}^{n}\big(\sum_{a=1}^{i}\phi_\theta^a(X_j^0)c_a - \tilde\phi_s^i(X_j^{\tau\wedge T_j})\big)^2 + \gamma_2\sum_{a=1}^{i-1}c_a^2$
23: **end for**
24: Compute the matrices $C_{ab}^t = \frac{1}{n}\sum_{j=1}^{n}\phi_\theta^a(X_j^0)\phi_\theta^b(X_j^t)$ for $t = 0, \tau\wedge T_j$ ⊳ $C^t\in\mathbb{R}^{k\times k}$
25: Compute the vector $p_a = \frac{1}{n}\sum_{j=1}^{n}\phi_\theta^a(X_j^0)\rho(X_j)$
26: Solve the linear system $(C^0 - C^\tau)w = p$ ⊳ if $\Psi(x) = 0$ enforce $w_1 = 1$
27: **return** $u = \sum_{a=1}^{k}w_a\phi_\theta^a$


## V. ALTERNATIVE LOSS FUNCTIONS

As mentioned above, the inexact application of the operator $\mathcal{A}$ can be accomplished by minimizing loss functions other than (17). The key requirement for a loss function in the present study is that $\mathcal{A}$ appears in its gradient only through terms of the form $\langle f, \mathcal{A} g \rangle_\mu$ for some functions $f$ and $g$, so that the gradient can be estimated using trajectory data. As a result, we have flexibility in the choice of loss and, in turn, the representation of $u$. In particular, we consider the representation $u_\theta = \zeta(z_\theta)$, where $\zeta$ is an increasing function, and $z_\theta$ is a function parameterized by a neural network. An advantage of doing so is that the function $\zeta$ can restrict the output values of $u_\theta$ to some range. For example, when computing a probability such as the committor, a natural choice is the sigmoid function $\zeta(x) = (1 + e^{-x})^{-1}$.

Specifically, at each subspace iteration, we minimize over $\theta$ the loss function

$$\mathbb{E}_{X^0 \sim \mu}\Big[ V\big(z_\theta(X^0)\big) - z_\theta(X^0)\, \mathcal{A} u_{\theta^s}(X^0) \Big], \quad (22)$$

where $V$ is an antiderivative of $\zeta$, and $\theta^s$ is fixed. The subscript $X^0 \sim \mu$ in this expression indicates that $X^0$ is drawn from $\mu$. Note that, as desired, $\mathcal{A}$ appears in the gradient of (22) only in an inner product of the required form, and an approximate minimizer, $\theta^{s+1}$, of this loss satisfies $\zeta(z_{\theta^{s+1}}) \approx \mathcal{A} u_{\theta^s}$. This general form of loss function is adapted from variational expressions for the divergence of two probability measures.^{32,40}

The $L_\mu^2$ loss in (17), which we use in several of our numerical experiments, corresponds to the choice $\zeta(x) = x$ and $V(x) = x^2/2$. The choice of $\zeta(x) = (1 + e^{-x})^{-1}$ mentioned above corresponds to $V(x) = \log(1 + e^x)$; we refer to the loss in (22) with this choice of $V$ as the "softplus" loss. We note that in the context of committor function estimation, in the limit of infinite $\tau$, the softplus loss corresponds directly to the maximum log-likelihood loss (for independent Bernoulli random variables) that defines the logistic regression estimate of the probability of reaching $B$ before $A$. Logistic regression has previously been used in conjunction with datasets of trajectories integrated all the way until reaching $A$ or $B$ to estimate the committor function.^{41–46}
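As a one-dimensional sanity check (ours, not from the original text), gradient descent on $V(z) - z\,g$ for the softplus choice drives $\zeta(z)$ to a fixed target $g \in (0, 1)$, illustrating that minimizing (22) pointwise inverts $\zeta$ on the target $\mathcal{A}u_{\theta^s}$:

```python
import numpy as np

# Softplus loss: V(z) = log(1 + e^z), so zeta(z) = V'(z) is the sigmoid.
# Minimizing V(z) - z*g over z for a fixed 0 < g < 1 should give
# zeta(z) = g, i.e., z = logit(g).
def zeta(z):
    return 1.0 / (1.0 + np.exp(-z))

g = 0.8          # hypothetical pointwise target value of A u
z = 0.0
for _ in range(5000):
    z -= 0.1 * (zeta(z) - g)   # gradient of V(z) - z*g is zeta(z) - g
# zeta(z) is now close to g, and z is close to log(g / (1 - g)) = log 4
```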

## VI. TEST PROBLEMS

We illustrate our methods with two well-characterized systems that exemplify features of molecular transitions. In this section, we provide key details of these systems.

### A. Müller-Brown potential

The first system is governed by the Müller-Brown potential^{47} (Fig. 1),

$$V(y, z) = \sum_{i=1}^4 C_i \exp\!\big[a_i (y - y_i)^2 + b_i (y - y_i)(z - z_i) + c_i (z - z_i)^2\big],$$

with $C_i = \{-200, -100, -170, 15\}$, $a_i = \{-1, -1, -6.5, 0.7\}$, $b_i = \{0, 0, 11, 0.6\}$, $c_i = \{-10, -10, -6.5, 0.7\}$, $y_i = \{1, -0.27, -0.5, -1\}$, and $z_i = \{0, 0.5, 1.5, 1\}$. In Sec. VIII B, we tune the parameters to make transitions between minima rarer; there, the parameters are $C_i = \{-250, -150, -170, 15\}$, $a_i = \{-1, -3, -6.5, 0.7\}$, $b_i = \{0, 0, 11, 0.6\}$, $c_i = \{-10, -30, -6.5, 0.7\}$, $y_i = \{1, -0.29, -0.5, -1\}$, and $z_i = \{0, 0.5, 1.5, 1\}$. For prediction, we analyze transitions between the upper left minimum ($A$) and the lower right minimum ($B$) in Fig. 1; these states are defined as neighborhoods of the respective minima. We simulate the dynamics according to the Euler–Maruyama discretization of the overdamped Langevin equation,

$$X_j^{t+1} = X_j^t - \delta_t \nabla V(X_j^t) + \sqrt{2\delta_t/\beta}\; \xi_j^t$$

for $t \le \tau$, where the $\xi_j^t$ are independent standard Gaussian random variables and the timestep is $\delta_t = 0.001$. We use an inverse temperature of $\beta = 2$, and we vary $\tau$ as indicated below (note that $\tau$ is an integer number of steps, but we present our results for the Müller–Brown model in terms of the dimensionless scale used for $\delta_t$ immediately above). For each test, we use independent random numbers and redraw the initial conditions because it is faster to generate the trajectories than to read them from disk in this case. All error bars are computed from the empirical standard deviation over ten replicate datasets.
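A minimal sketch of this setup (our code; the starting point is chosen arbitrarily near the upper-left minimum) implements the potential, its gradient, and the Euler–Maruyama update:

```python
import numpy as np

# Muller-Brown parameters as listed in the text.
C = np.array([-200., -100., -170., 15.])
a = np.array([-1., -1., -6.5, 0.7])
b = np.array([0., 0., 11., 0.6])
c = np.array([-10., -10., -6.5, 0.7])
y0 = np.array([1., -0.27, -0.5, -1.])
z0 = np.array([0., 0.5, 1.5, 1.])

def potential(y, z):
    return np.sum(C * np.exp(a * (y - y0) ** 2
                             + b * (y - y0) * (z - z0)
                             + c * (z - z0) ** 2))

def grad(y, z):
    e = C * np.exp(a * (y - y0) ** 2 + b * (y - y0) * (z - z0)
                   + c * (z - z0) ** 2)
    gy = np.sum(e * (2 * a * (y - y0) + b * (z - z0)))
    gz = np.sum(e * (b * (y - y0) + 2 * c * (z - z0)))
    return np.array([gy, gz])

def simulate(x, n_steps, dt=0.001, beta=2.0, seed=0):
    """Euler-Maruyama discretization of overdamped Langevin dynamics."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    traj = [x.copy()]
    for _ in range(n_steps):
        x = x - dt * grad(*x) + np.sqrt(2 * dt / beta) * rng.normal(size=2)
        traj.append(x.copy())
    return np.array(traj)

traj = simulate([-0.5, 1.5], 1000)   # short trajectory near the A basin
```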

For reference solutions, we discretize the dynamics on a grid. We assign transition probabilities between each pair of neighboring sites $x = (y, z)$ and $x' = (y \pm \delta_x, z)$ or $(y, z \pm \delta_x)$ on a grid with spacings $\delta_x$ based on the potential-energy difference between the sites (we then scale $P - I$, where $I$ is the identity matrix, accordingly to set the time units below). The diagonal entry $P(x|x)$ is fixed to make the transition matrix row-stochastic. We use $\delta_x = 0.0125$; the grid is a rectangle with $y \in [-1.5, 1.0]$ and $z \in [-0.5, 2.0]$. The reference invariant subspaces are computed by diagonalizing the matrix $2(P - I)/\beta\delta_x^2$ with a sparse eigensolver; we use scipy.sparse.linalg. We obtain the reference committor $\hat q_+$ for grid sites in $(A \cup B)^c$ by solving

$$\big[(I - P)\,\hat q_+\big](x) = 0,$$

and for grid sites in $A \cup B$ by setting $\hat q_+ = \hat 1_B$. Here, $\hat q_+$ and $\hat 1_B$ are vectors corresponding to the functions evaluated at the grid sites.
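The reference computation can be sketched as follows (our illustration). Because the exact neighbor transition rule and state definitions are not reproduced above, this sketch substitutes a Metropolis-type rule based on the potential difference between sites and hypothetical disk-shaped states of radius 0.2 around the two minima; with those assumptions, the committor solve proceeds as:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, identity
from scipy.sparse.linalg import spsolve

# Vectorized Muller-Brown potential (parameters as in the text).
C = np.array([-200., -100., -170., 15.])
a = np.array([-1., -1., -6.5, 0.7])
b = np.array([0., 0., 11., 0.6])
c = np.array([-10., -10., -6.5, 0.7])
y0 = np.array([1., -0.27, -0.5, -1.])
z0 = np.array([0., 0.5, 1.5, 1.])

def V(y, z):
    dy, dz = y[..., None] - y0, z[..., None] - z0
    return np.sum(C * np.exp(a * dy**2 + b * dy * dz + c * dz**2), axis=-1)

dx, beta = 0.0125, 2.0
ys = np.arange(-1.5, 1.0 + dx / 2, dx)
zs = np.arange(-0.5, 2.0 + dx / 2, dx)
Y, Z = np.meshgrid(ys, zs, indexing="ij")
v = V(Y, Z)
ny, nz = v.shape
n = ny * nz
idx = np.arange(n).reshape(ny, nz)

# Assumed Metropolis-type neighbor rule (a stand-in for the exact rule,
# which is likewise based on the potential difference between sites).
rows, cols, vals = [], [], []
for di, dj in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
    src = idx[max(0, -di):ny - max(0, di), max(0, -dj):nz - max(0, dj)].ravel()
    dst = idx[max(0, di):ny - max(0, -di), max(0, dj):nz - max(0, -dj)].ravel()
    rows.append(src)
    cols.append(dst)
    vals.append(0.25 * np.minimum(1.0, np.exp(-beta * (v.flat[dst] - v.flat[src]))))
P = csr_matrix((np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
               shape=(n, n))
P = P + diags(1.0 - np.asarray(P.sum(axis=1)).ravel())  # row-stochastic

# Hypothetical disk-shaped states around the two minima.
inA = ((Y + 0.558)**2 + (Z - 1.442)**2 < 0.2**2).ravel()
inB = ((Y - 0.623)**2 + (Z - 0.028)**2 < 0.2**2).ravel()
D = np.flatnonzero(~(inA | inB))
bnd = np.flatnonzero(inA | inB)

# Committor: (I - P) q = 0 on D, with q = 1_B on A u B.
M = (identity(n, format="csr") - P)[D][:, D]
rhs = P[D][:, bnd] @ inB[bnd].astype(float)
q = inB.astype(float)
q[D] = spsolve(M, rhs)
```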

### B. AIB_{9} helix-to-helix transition

The second system is a peptide of nine *α*-aminoisobutyric acids (AIB_{9}; Fig. 2). Because AIB is achiral around its *α*-carbon atom, AIB_{9} can form left- and right-handed helices with equal probabilities, and we study the transitions between these two states. This transition was previously studied using MSMs and long unbiased molecular dynamics simulations.^{49,50} AIB_{9} poses a stringent test due to the presence of many metastable intermediate states.

Each amino acid can be characterized by its backbone $\phi$ and $\psi$ dihedral angles. We classify an amino acid as being in the "left" state if its dihedral angle values are within a circle of radius 25° centered at (41°, 43°), that is,

$$(\phi - 41°)^2 + (\psi - 43°)^2 < (25°)^2,$$

and in the "right" state if they are within the mirror-image circle centered at (−41°, −43°). The states $A$ and $B$ are defined by the amino acids at sequence positions 3–7 being all left or all right, respectively. We do not use the two residues on each end of AIB_{9} in defining the states as these are typically more flexible.^{50} The states can be resolved by projecting onto dihedral angle principal components (dPCs; Fig. 2, right) as described previously.^{51}

Following a procedure similar to that described in Ref. 50, we generate a dataset of short trajectories. From each of the 691 starting configurations in Ref. 50, we simulate ten trajectories of duration 20 ns with initial velocities drawn randomly from a Maxwell–Boltzmann distribution at a temperature of 300 K. The short trajectory dataset thus contains 6910 trajectories, corresponding to a total sampling time of 138.2 *µ*s. We use a timestep of 4 fs together with a hydrogen mass repartitioning scheme,^{52} and configurations are saved every 40 ps. We employ the AIB parameters from Forcefield_NCAA^{53} and the GBNeck2 implicit-solvent model.^{54} Simulations are performed with the Langevin integrator in OpenMM 7.7.0^{55} using a friction coefficient of 1 ps^{−1}. To generate a reference for comparison to our results, we randomly select 20 configurations from the dataset above and, from each, run a simulation of 5 *µ*s with the same simulation parameters. For all following tests on AIB_{9}, batches consist of pairs of frames separated by *τ* drawn randomly with replacement from the short trajectories (i.e., from all possible such pairs in the dataset). From each frame, we use only the atom positions because the momenta decorrelate within a few picoseconds, which is much shorter than the lag times that we consider. However, in principle, the momenta could impact the prediction functions^{56} and be used as neural-network inputs as well.
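Sampling batches of frame pairs separated by a lag $\tau$ from a set of short trajectories can be sketched as follows (our illustration, with synthetic array data in place of molecular configurations):

```python
import numpy as np

def sample_pairs(trajs, lag, batch_size, seed=0):
    """Draw (X^0, X^lag) pairs uniformly with replacement from all frame
    pairs separated by `lag` in a set of short trajectories.

    trajs: array of shape (n_traj, n_frames, n_features).
    """
    rng = np.random.default_rng(seed)
    n_traj, n_frames, _ = trajs.shape
    j = rng.integers(0, n_traj, size=batch_size)           # trajectory index
    t = rng.integers(0, n_frames - lag, size=batch_size)   # start frame
    return trajs[j, t], trajs[j, t + lag]

# Hypothetical data: 10 trajectories, 501 saved frames, 3 features each.
trajs = np.arange(10 * 501 * 3, dtype=float).reshape(10, 501, 3)
x0, xlag = sample_pairs(trajs, lag=25, batch_size=64)
```

For the synthetic array above, advancing one frame increases every feature by 3, so each sampled pair differs by exactly `lag * 3`, which makes the indexing easy to verify.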

## VII. SPECTRAL ESTIMATION

To implement the orthogonalization, we evaluate the basis functions on the sampled initial conditions, forming the matrix $\Phi_{ia} = \phi_\theta^a(X_i^0)$, and compute its QR-decomposition $\Phi = QR$, where $Q$ is an $n \times k$ matrix with orthogonal columns and $R$ is an upper triangular $k \times k$ matrix. Finally, we set $\tilde\phi_s^a = \sum_{b=1}^k \phi_\theta^b (R^{-1}N)_{ba}$, where $N$ is a diagonal matrix with entries equal to the norms of the columns of $\Phi$ (before orthogonalization). To ensure that the networks remain well-separated (i.e., the eigenvalues of $C^0$ remain away from zero), we penalize large off-diagonal entries of $K$ by adding to the loss

$$\hat L_K = \gamma_1 \|K - \operatorname{diag}(K)\|_F^2,$$

where the parameter $\gamma_1$ allows us to tune the strength of this term relative to others, and $\|\cdot\|_F$ is the Frobenius norm. We control the scale of each network output using the strategy from Ref. 32; that is, we add to the loss a term of the form

$$\hat L_{\text{norm}} = \gamma_2 \sum_{a=1}^k \big( 2\nu_a \big( \|\phi_\theta^a\|_\mu^2 - 1 \big) - \nu_a^2 \big)$$

with parameters $\nu_a$, which we maximize with gradient ascent (or similar optimization). In general, our numerical experiments suggest that it is best to keep $\gamma_1$ and $\gamma_2$ relatively small. We find that stability of the algorithm over many subspace iterations is improved if the matrix $K$ is set at its optimal value before each inner loop. To do this, we set each column $i$ of $K$ to the minimizer of a ridge regression of $\tilde\phi_s^i$ evaluated at time $\tau$ onto the first $i$ basis functions (Algorithm 1, line 17). We use the Adam optimizer^{57} for all numerical tests. We summarize our procedure for spectral estimation in Algorithm 1.
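Setting $K$ at its optimal value before each inner loop amounts to a sequence of ridge regressions, one per column, keeping $K$ upper triangular (cf. line 17 of Algorithm 1). A sketch (ours, not the paper's code) with a check on synthetic data:

```python
import numpy as np

def optimal_K(phi0, phitau, gamma2):
    """Column-by-column ridge regression for the upper-triangular K.

    phi0: (n, k) basis values at X_j^0; phitau: (n, k) values of the
    target (propagated) basis at X_j^tau.  Column i regresses the i-th
    target onto the first i+1 basis functions, penalizing only the
    off-diagonal (first i) coefficients.
    """
    n, k = phi0.shape
    K = np.zeros((k, k))
    for i in range(k):
        Phi_i = phi0[:, : i + 1]
        reg = gamma2 * n * np.eye(i + 1)
        reg[i, i] = 0.0            # no penalty on the diagonal entry
        K[: i + 1, i] = np.linalg.solve(Phi_i.T @ Phi_i + reg,
                                        Phi_i.T @ phitau[:, i])
    return K

# Hypothetical check: if phitau = phi0 @ M for an upper-triangular M and
# gamma2 = 0, the regression recovers M exactly.
rng = np.random.default_rng(4)
phi0 = rng.normal(size=(500, 3))
M = np.triu(rng.normal(size=(3, 3)))
K = optimal_K(phi0, phi0 @ M, gamma2=0.0)
```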

### A. Müller–Brown model

As a first test of our method, we compute the $k = 3$ dominant eigenpairs for the Müller-Brown model. Since we know that the dominant eigenfunction of the transition operator is the constant function $v_1(y, z) = 1$ with eigenvalue $\lambda_1 = 1$, we directly include this function in the basis as a non-trainable function, i.e., $\phi_\theta^1(y, z) = 1$. To initialize $\tilde\phi_1^a$ for each $a > 1$, we choose a standard Gaussian vector $(Y_a, Z_a) \in \mathbb{R}^2$ and set $\tilde\phi_1^a(y, z) = y Y_a + z Z_a$. This ensures that the initial basis vectors are well-separated and the first QR step is numerically stable. Here and in all subsequent Müller-Brown tests, batches of trajectories are drawn from the entire dataset with replacement. Other hyperparameters are listed in Table I.

**TABLE I.** Hyperparameters for the numerical examples.

| Hyperparameter | Spectral estimation: Müller-Brown | Spectral estimation: AIB_{9} | Committor: Müller-Brown | Committor: modified Müller-Brown | Committor: AIB_{9} | MFPT: AIB_{9} |
|---|---|---|---|---|---|---|
| Subspace dimension k | 3 | 5 | 1 | 2, 1^a | 1 | 5 |
| Input dimensionality | 2 | 174 | 2 | 2 | 174 | 174 |
| Hidden layers | 6 | 6 | 6 | 6 | 6 | 6 |
| Hidden layer width | 64 | 128 | 64 | 64 | 150 | 150 |
| Hidden layer nonlinearity | CeLU | CeLU | ReLU | ReLU | ReLU | ReLU |
| Output layer nonlinearity | None | tanh | Sigmoid/none | None | None | None |
| Outer iterations S | 10 | 100 | 100 | 4 + 10^a | 100 | 300 |
| Inner iterations M | 5000 | 2000 | 200 | 5000 | 2000 | 1000 |
| σ | 2 | 50 | 50 | 0 | 50 | 0 |
| Batch size B | 2000 | 1024 | 5000 | 2000 | 1024 | 2000 |
| Learning rate η | 0.001 | 0.0001 | 0.001 | 0.001 | 0.001 | 0.001 |
| γ_{1} | 0.15 | 0.001 | ⋯ | ⋯ | ⋯ | 0.1 |
| γ_{2} | 0.01 | 0.01 | ⋯ | ⋯ | ⋯ | 10 |
| Loss for $\phi_\theta^1$ | $L_\mu^2$ | $L_\mu^2$ | $L_\mu^2$/softplus | Softplus | Softplus | $L_\mu^2$ |
| Loss for $\phi_\theta^a$, a > 1 | $L_\mu^2$ | $L_\mu^2$ | ⋯ | $L_\mu^2$ | ⋯ | $L_\mu^2$ |


^{a}Four subspace iterations with *k* = 2 followed by ten iterations with *k* = 1.

Figure 3 shows that we obtain good agreement between the estimate produced by the inexact subspace iteration in Algorithm 1 and reference eigenfunctions. Figure 4 (upper panels) shows how the corresponding eigenvalues vary with lag time; again there is good agreement with the reference. Furthermore, there is a significant gap between *λ*_{3} and *λ*_{4}, indicating that a three-dimensional subspace captures the dynamics of interest for this system.

We also construct an MSM for comparison by using *k*-means to cluster the configurations into 400 states and counting the transitions between clusters. This is a very fine discretization for this system, and the MSM is sufficiently expressive to yield eigenfunctions in good agreement with the reference. The relative error of 1 − *λ*_{2} is comparable for the two methods (Fig. 4, lower left). To compare two finite-dimensional subspaces, $U$ and $V$, we define the subspace distance as the Frobenius norm ‖·‖_{F} of the difference between their orthogonal projection matrices.^{4} Figure 4 (lower right) shows the subspace distances from the reference as functions of lag time. We see that the inexact subspace iteration better approximates the three-dimensional dominant eigenspace at moderate to long lag times, even though the eigenvalues are comparable.
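A common realization of such a Frobenius-norm subspace distance, via orthogonal projectors, is sketched below; the paper's exact definition and normalization may differ.

```python
import numpy as np

def subspace_distance(U, V):
    """Distance between the column spans of U and V as the Frobenius norm of
    the difference of their orthogonal projectors. This is one standard
    choice; the paper's exact normalization may differ."""
    Qu, _ = np.linalg.qr(U)                # orthonormal basis for span(U)
    Qv, _ = np.linalg.qr(V)                # orthonormal basis for span(V)
    Pu, Pv = Qu @ Qu.T, Qv @ Qv.T          # orthogonal projectors
    return np.linalg.norm(Pu - Pv, "fro")
```

The distance vanishes when the two spans coincide and grows as they tilt away from one another, independent of the particular bases chosen.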

### B. AIB_{9}

For the molecular test system, we compute the dominant five-dimensional subspace. We compare the inexact subspace iteration in Algorithm 1 with MSMs constructed on dihedral angles (“dihedral MSM”) and on Cartesian coordinates (“Cartesian MSM”). We expect the dihedral MSM to be accurate given that the dynamics of AIB_{9} are well-described by the backbone dihedral angles,^{49,50} and we thus use it as a reference. It is constructed by clustering the sine and cosine of each of the backbone dihedral angles (*ϕ* and *ψ*) for the nine residues (for a total of 2 × 2 × 9 = 36 input features) into 1000 clusters using *k*-means and counting transitions between clusters. The Cartesian MSM is constructed by similarly counting transitions between 1000 clusters from the *k*-means algorithm, but the clusters are based on the Cartesian coordinates of all non-hydrogen atoms after aligning the backbone atoms of the trajectories, for a total of 174 input features. Because of the difficulty of clustering high-dimensional data, we expect the Cartesian MSM basis to give poor results. The neural network for the inexact subspace iteration receives the same 174 Cartesian coordinates as input features. We choose to use Cartesian coordinates rather than dihedral angles as inputs because it requires the network to identify nontrivial representations for describing the dynamics.
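The MSM baselines described above can be sketched as follows. This is a simplified version that assumes one long trajectory rather than the paper's many short ones, and `msm_eigenpairs` is our name for the hypothetical helper; the input `features` would be the 36 dihedral sin/cos features or the 174 aligned Cartesian coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans

def msm_eigenpairs(features, lag, n_clusters=1000, n_eig=5, seed=0):
    """Cluster frames with k-means, count transitions at the given lag,
    row-normalize to get a transition matrix, and diagonalize."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(features)
    counts = np.zeros((n_clusters, n_clusters))
    for i, j in zip(labels[:-lag], labels[lag:]):   # transition counting
        counts[i, j] += 1
    # Row-stochastic transition matrix (guard against empty rows).
    T = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    evals, evecs = np.linalg.eig(T)
    order = np.argsort(-np.abs(evals))[:n_eig]      # leading eigenpairs
    return evals[order].real, evecs[:, order].real
```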

As in the Müller–Brown example, we use $\phi_\theta^1 = 1$ and a random linear combination of coordinate functions to initialize $\tilde{\phi}_1^a$ for *a* > 1. Other hyperparameters are listed in Table I. With these choices, the neural-network training typically requires about 20 minutes on a single NVIDIA A40 GPU; this is much longer than the time required for diagonalization of the 1000 × 1000 MSM transition matrix, which is nearly instantaneous. However, the time for neural-network training is negligible compared with the time to generate the dataset, which is the same for both approaches.

Taking the dihedral MSM as a reference, the Cartesian MSM systematically underestimates the eigenvalues (Fig. 5). The subspace iteration is very accurate for the first four eigenvalues, but the estimates for the fifth are low and vary considerably from run to run. A very small gap between *λ*_{4} and *λ*_{5} may contribute to the difficulty in estimating *λ*_{5}. In Fig. 6, we plot the first two non-trivial eigenfunctions (*v*_{2} and *v*_{3}), which align with the axes of the dPC projection. The eigenfunction *v*_{2} corresponds to the transition between the left- and right-handed helices; the eigenfunction *v*_{3} is nearly orthogonal to *v*_{2} and corresponds to transitions between intermediate states. It is challenging to visualize the remaining two eigenfunctions by projecting onto the first two dPCs because *v*_{4} and *v*_{5} are orthogonal to *v*_{2} and *v*_{3}. The estimates for *v*_{2} are in qualitative agreement for all lag times tested (Fig. 6 shows results for *τ* corresponding to 40 ps), but the subspace iteration results are less noisy for the shortest lag times. Moreover, the estimate for *v*_{3} from subspace iteration agrees more closely with that from the dihedral MSM than does the estimate for *v*_{3} from the Cartesian MSM. The subspace distance for *v*_{2} and *v*_{3} between the subspace iteration and the dihedral MSM is 0.947, compared with a value of 0.969 for the subspace distance between the two MSMs. Together, our results indicate that the neural networks are able to learn the leading eigenfunctions and eigenvalues of the transition operator (dynamical modes) of this system despite being presented with coordinates that are not the natural ones for describing the dynamics.

## VIII. PREDICTION

Here, *T*_{j} is the first time $X_j^t$ enters $D^c$, and *r* is the right-hand side of (5), as previously.

The Richardson iterate, $\phi_\theta^1$, must satisfy the boundary condition $\phi_\theta^1(x) = \Psi(x)$ for *x* ∉ *D*. The other basis functions should satisfy $\phi_\theta^a(x) = 0$ for *x* ∉ *D*. In practice, we enforce the boundary conditions by explicitly setting $\phi_\theta^1(x) = \Psi(x)$ and $\phi_\theta^a(x) = 0$ for *a* > 1 when *x* ∉ *D*.
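Enforcing the boundary conditions by explicit overwriting might look like this sketch; the array layout (column 0 for the Richardson iterate, remaining columns for the other basis functions) is an assumption.

```python
import numpy as np

def apply_boundary_conditions(raw_outputs, in_domain, psi_values):
    """Enforce the boundary conditions by explicit overwriting.

    raw_outputs: (n, k) array of network outputs; column 0 plays the role of
    phi_theta^1 and columns 1..k-1 the role of phi_theta^a for a > 1.
    in_domain: (n,) boolean mask, True where x is in D.
    psi_values: (n,) values of the boundary function Psi.
    (Array layout is an assumption; the overwriting rule follows the text.)
    """
    out = raw_outputs.copy()
    out[~in_domain, 0] = psi_values[~in_domain]   # phi^1(x) = Psi(x) outside D
    out[~in_domain, 1:] = 0.0                      # phi^a(x) = 0 outside D, a > 1
    return out
```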

The coefficients are obtained by solving a *k*-dimensional linear system built from the correlation matrices in (39) and (40) for *a*, *b* ≥ 1 and *t* = 0, *τ*. When the boundary conditions are non-zero, we instead solve the (*k* − 1)-dimensional linear system obtained by excluding the indices *a* = 1 and *b* = 1 in (39) and (40) and setting the coefficient of $\phi_\theta^1$ to one. This is equivalent to a Galerkin approximation^{10,11} with the basis $\{\phi_\theta^a\}_{a=2}^k$ and a “guess” function of $\phi_\theta^1$.

When the boundary conditions are zero, the orthogonalization procedure and the matrix *K* are applied to all basis functions as in Sec. VII. When the boundary conditions are non-zero, the orthogonalization procedure is applied only to the basis functions $\{\phi_\theta^a\}_{a=2}^k$, and *K*_{a1} = *I*_{a1}, the *a*1 element of the identity matrix. We summarize our procedure for prediction in Algorithm 2.

### A. Müller–Brown committor

In this section, we demonstrate the use of our method for prediction by estimating the committor for the Müller–Brown model with a shallow intermediate basin at (−0.25, 0.5) (Fig. 1). Here, the sets *A* and *B* are defined as in Eq. (24), and *T* is the time of first entrance to $D^c = A \cup B$. In this case, a one-dimensional subspace iteration (i.e., *k* = 1 in Algorithm 2) appears sufficient to solve the prediction problem accurately. Figure 7 shows the largest eigenvalue of the stopped transition operator $S_\tau$ [the second largest of $A$ in (14)] computed from our grid-based reference scheme (Sec. VI A). Richardson iteration should converge geometrically in this eigenvalue,^{33} so for moderate lag times we can expect our method to converge in a few dozen iterations. To initialize the algorithm, we choose $\tilde{\phi}_1^1 = 1_B$. All other hyperparameters are listed in Table I.

For comparison, we construct an MSM by using *k*-means to cluster the configurations outside *A* and *B* into 400 states and counting the transitions between clusters. In addition to the root-mean-square error (RMSE) for the committor itself, we show the RMSE of the regularized logit function in (44), which amplifies the importance of values close to zero and one, i.e., near *A* and *B*. We include *ɛ* because we want to assign only a finite penalty if the procedure estimates *q* to be exactly zero or one; we use *ɛ* = *e*^{−20}.
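The two error metrics might be computed as follows. The precise form of the logit in Eq. (44) is an assumption here, but the *ɛ* = *e*^{−20} regularization matches the text.

```python
import numpy as np

EPS = np.exp(-20.0)   # epsilon = e^{-20}, as in the text

def committor_rmse(q_est, q_ref):
    """Root-mean-square error of the committor estimate itself."""
    return np.sqrt(np.mean((q_est - q_ref) ** 2))

def logit_rmse(q_est, q_ref, eps=EPS):
    """RMSE of a regularized logit, which amplifies errors near q = 0 and
    q = 1; epsilon keeps the penalty finite when an estimate is exactly
    zero or one. The precise form of Eq. (44) is an assumption here."""
    def logit(q):
        return np.log(q + eps) - np.log(1.0 - q + eps)
    return np.sqrt(np.mean((logit(q_est) - logit(q_ref)) ** 2))
```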

Results as a function of lag time are shown in Fig. 8. We see that the Richardson iterate is more accurate than the MSM for all but the shortest lag times. When using the $L^2_\mu$ loss, the results are comparable, whereas the softplus loss allows the Richardson iterate to improve the RMSE of the logit function in (44) with no decrease in performance with respect to the RMSE of the committor. Results as a function of the size of the dataset are shown in Fig. 9 for a fixed lag time of $\tau = 0.1\,\delta t^{-1}$. The Richardson iterate generally does as well as or better than the MSM. Again, the differences are more apparent in the RMSE of the logit function in (44). By that measure, the Richardson iterate obtained with both loss functions is significantly more accurate than the MSM for small numbers of trajectories. The softplus loss maintains an advantage even for large numbers of trajectories.

### B. Accelerating convergence by incorporating eigenfunctions

As discussed in Sec. III, we expect Richardson iteration to converge slowly when the largest eigenvalue of $S_\tau$, *λ*_{1}, is close to 1. More precisely, the number of iterations required to reach convergence should scale with $-1/\log\lambda_1 = \mathbf{E}T/\tau$, the mean escape time from the quasi-stationary distribution to the boundary of *D* divided by the lag time. With this in mind, we can expect inexact Richardson iteration for the Müller–Brown model to perform poorly if we deepen the intermediate basin at (−0.25, 0.5) as in Fig. 10 (top left). Again, the sets *A* and *B* are defined as in (24), and *T* is the time of first entrance to $D^c = A \cup B$. In this case, −1/log *λ*_{1} is on the order of 100 for the lag times we consider and, as expected, inexact Richardson iteration converges slowly (Fig. 10, bottom left). Estimates of the committor by inexact Richardson iteration do not reach the correct values even after hundreds of iterations (Fig. 10, bottom right).
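The scaling −1/log *λ*_{1} translates into concrete iteration counts; here is a small sketch assuming the error decays geometrically with rate *λ*_{1}.

```python
import math

def iterations_to_tolerance(lam1, tol=1e-2):
    """With geometric convergence at rate lam1, the error after m iterations
    scales like lam1**m, so reaching a given tolerance takes about
    log(tol)/log(lam1) iterations, a count proportional to -1/log(lam1)."""
    return math.ceil(math.log(tol) / math.log(lam1))
```

For *λ*_{1} = 0.5 this gives 7 iterations, while *λ*_{1} = 0.99 (so that −1/log *λ*_{1} is on the order of 100) requires several hundred, consistent with the slow convergence seen in Fig. 10.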

We now show that convergence can be accelerated dramatically by incorporating additional eigenfunctions of $S_\tau$ (i.e., *k* > 1 in Algorithm 2). For the Müller–Brown model with a deepened intermediate basin, the second eigenvalue of $S_\tau$ is of order 10^{−4} for a lag time of *τ* = 1000 steps, or $1\,\delta t^{-1}$ (while the first is near one, as discussed above). We therefore choose *k* = 2, with $\tilde{\phi}_1^2$ initialized as a random linear combination of coordinate functions as in previous examples. We run the subspace iteration for four iterations, compute the committor as a linear combination of the resulting functions, and then refine this result with a further ten Richardson iterations (i.e., *k* = 1 with the starting vector given by the output of the *k* = 2 subspace iteration). To combine the functions, we use a linear solve that incorporates memory (Algorithm 3).^{58,59} We find that the use of memory improves the data efficiency substantially for poorly conditioned problems. For our tests here, we use three memory kernels, corresponding to *τ*_{mem} = ⌊*τ*/4⌋.

**Algorithm 3** (linear solve with memory).

Require: stopped transition data $\{X_j^0, X_j^{1\wedge T_j}, \ldots, X_j^{\tau\wedge T_j}\}_{j=1}^n$, guess function $h$, reward data $\{\rho(X_j)\}_{j=1}^n$, basis set $\{f_a\}_{a=1}^k$, lag between memory kernels $\tau_{\text{mem}}$.

1: for $s = 0, \ldots, \tau/\tau_{\text{mem}}$ do
2: &emsp;Initialize the matrix $C^s$ with zeros ⊳ $C^s \in \mathbb{R}^{(k+1)\times(k+1)}$
3: &emsp;$C^s_{11} \leftarrow 1$
4: &emsp;for $a = 2, \ldots, k$ do
5: &emsp;&emsp;$C^s_{a1} \leftarrow \frac{1}{n}\sum_{j=1}^n f_a(X_j^0)\,\rho(X_j^{s\tau_{\text{mem}} \wedge T_j})$
6: &emsp;&emsp;for $b = 2, \ldots, k$ do
7: &emsp;&emsp;&emsp;$C^s_{ab} \leftarrow \frac{1}{n}\sum_{j=1}^n f_a(X_j^0)\,f_b(X_j^{s\tau_{\text{mem}} \wedge T_j})$
8: &emsp;&emsp;end for
9: &emsp;end for
10: end for
11: $A \leftarrow C^1 - C^0$
12: for $s = 0, \ldots, (\tau/\tau_{\text{mem}}) - 2$ do
13: &emsp;$M^s \leftarrow C^{s+2} - C^{s+1} - A(C^0)^{-1}C^{s+1} - \sum_{j=0}^{s-1} M^j (C^0)^{-1} C^{s-j}$
14: end for
15: $A_{\text{mem}} \leftarrow A + \sum_{s=0}^{(\tau/\tau_{\text{mem}})-2} M^s$
16: Solve $A_{\text{mem}} w = w$
17: return $u = h + \sum_{a=2}^k w_a f_a$
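A literal numpy transcription of Algorithm 3 might look as follows. Several points are labeled assumptions: the data arrays are taken as precomputed evaluations of the basis and reward at the stopped states, the memory-kernel sum is taken over *j* < *s* (an upper limit of *s* would make the update self-referential), and the solve in line 16 is interpreted as extracting the eigenvector of *A*_{mem} with eigenvalue closest to one.

```python
import numpy as np

def galerkin_with_memory(F0, F_s, rho_s):
    """Sketch of Algorithm 3 on precomputed data arrays (assumed layout):

    F0[j, a]     : f_a evaluated at X_j^0 (1-based basis index; column 0 unused)
    F_s[s, j, a] : f_a at the stopped state X_j^{s*tau_mem ^ T_j}
    rho_s[s, j]  : reward rho at the same stopped state
    """
    S, n, kp1 = F_s.shape                  # kp1 = k + 1, index 0 unused
    C = np.zeros((S, kp1, kp1))
    for s in range(S):
        C[s, 1, 1] = 1.0
        for a in range(2, kp1):
            C[s, a, 1] = np.mean(F0[:, a] * rho_s[s])
            for b in range(2, kp1):
                C[s, a, b] = np.mean(F0[:, a] * F_s[s, :, b])
    A = C[1] - C[0]
    C0inv = np.linalg.pinv(C[0])           # C^0 has an empty row/column 0
    M = np.zeros((S - 2, kp1, kp1))
    for s in range(S - 2):
        M[s] = C[s + 2] - C[s + 1] - A @ C0inv @ C[s + 1]
        for j in range(s):                 # sum over j < s (assumption, see text)
            M[s] -= M[j] @ C0inv @ C[s - j]
    A_mem = A + M.sum(axis=0)
    # "Solve A_mem w = w": eigenvector with eigenvalue closest to 1 (assumption).
    evals, evecs = np.linalg.eig(A_mem)
    w = evecs[:, np.argmin(np.abs(evals - 1.0))].real
    return w                               # u = h + sum_{a>=2} w_a f_a
```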


The bottom row of Fig. 11 illustrates the idea of the subspace iteration. The second eigenfunction (Fig. 11, center) is peaked at the intermediate. As a result, the two neural-network functions linearly combined by the Galerkin approach with memory can yield a good result for the committor (Fig. 11, bottom right). Figure 12 compares the RMSE for the committor and the RMSE for the logit in (44) for Algorithm 2 with *k* = 1 (pure Richardson iteration) and *k* = 2 (incorporating the first non-trivial eigenfunction), and an MSM with 400 states. We see that the Richardson iteration suffers large errors at all lag times; as noted previously, this error is mainly in the vicinity of the intermediate. The MSM cannot accurately compute the small probabilities, but does as well as the subspace iteration in terms of RMSE.

### C. AIB_{9} prediction results

As an example of prediction in a high-dimensional system, we compute the committor for the transition between the left- and right-handed helices of AIB_{9} using the inexact Richardson iteration scheme (*k* = 1 in Algorithm 2) with the softplus loss function. Specifically, for this committor calculation, *T* is the time of first entrance to $D^c = A \cup B$ with *A* and *B* defined in Sec. VI B. As before, we initialize $\tilde{\phi}_1^1 = 1_B$.

To validate the results, we use the long (*µ*s-scale) reference trajectories to compute an empirical committor as a function of the neural-network outputs, binned into intervals [*s*, *s* + Δ*s*] for *s* ∈ [0, 1 − Δ*s*]. Here, we use Δ*s* = 0.05. The overall error in the committor estimate is defined in Eq. (46).

Results for the full dataset with *τ* corresponding to 400 ps are shown in Fig. 13. The projection on the principal components is consistent with the symmetry of the system, and the predictions show good agreement with the empirical committors. As *τ* decreases, the results become less accurate (Fig. 14, top left); at shorter lag times we would expect further increases in the error. We also examine the dependence of the results on the size of the dataset by subsampling the short trajectories and then training neural networks on the reduced set of trajectories (Fig. 14, top right). We find that the performance steadily drops as the number of trajectories is reduced and degrades rapidly for the datasets subsampled more than 20-fold (Fig. 14, bottom right), corresponding to about 7 *µ*s of total sampling.

Finally, we compute the MFPT to reach the right-handed helix using the same dataset. For the MFPT calculation *T* is the time of first entrance to $Dc=B$. Note that the time of first entrance to *B* includes long dwell times in *A* and is expected to be much larger than the time of first entrance to *A* ∪ *B*.

We similarly compute an empirical MFPT as a function of the neural-network outputs, binned into intervals [*s*, *s* + Δ*s*] for *s* ∈ [0, *m*_{max} − Δ*s*], where Δ*s* = 3 and *m*_{max} = 57 ns. The overall error is defined analogously to Eq. (46).

In Fig. 15, we show the MFPT obtained from Algorithm 2 with *k* = 5 and the $L^2_\mu$ loss function. Initially, we set $\tilde{\phi}_1^1$ equal to an arbitrary positive function (we use $5 \cdot 1_A$) and $\tilde{\phi}_1^a$ for *a* > 1 to a random linear combination of coordinate functions. In Fig. 16, we examine the convergence of the MFPT from the left-handed to the right-handed helix computed with *k* = 1 (pure Richardson iteration) and *k* = 5. To obtain the results shown, we train neural networks on the short-trajectory dataset and then average their predictions for structures in the left-handed helix state in the long reference trajectories. The horizontal line in Fig. 16 indicates an MFPT of about 56 ns estimated from the long reference trajectories. We see that the algorithm with *k* = 5 converges much faster (note the differing scales of the horizontal axes) and yields accurate results at all lag times other than the shortest shown. The need to choose *k* > 1 for this MFPT calculation is again consistent with the theoretical convergence behavior of exact subspace iteration. Because the typical time of first entrance to *B* from points in *A* is very large, we expect the dominant eigenvalue of $S_\tau$ to be very near one when $D = B^c$. In contrast, the committor calculation benefits from the fact that the time of first entrance to *A* ∪ *B* is much shorter, implying a smaller dominant eigenvalue of $S_\tau$ when $D = (A \cup B)^c$.

## IX. CONCLUSIONS

In this work, we have presented a method for spectral estimation and rare-event prediction from short-trajectory data. The key idea is that we use the data as the basis for an inexact subspace iteration. For the test systems that we considered, the method not only outperforms high-resolution MSMs, but it can be tuned through the choice of loss function to compute committor probabilities accurately near the reactants, transition states, and products. Other than the Markov assumption, our method requires no knowledge of the underlying model and puts no restrictions on its dynamics.

As discussed in prior neural-network based prediction work,^{24,30} our method is sensitive to the quality and distribution of the initial sampling data. However, our work shares with Ref. 24 the major advantage of allowing the use of arbitrary inner products. This enables adaptive sampling of the state space^{24,60} and—together with the features described above—the application to observational and experimental data, for which the stationary distribution is generally unavailable.

In the present work, we focused on statistics of transition operators, but our method can readily be extended to solve problems involving their adjoint operators as well. By this means, we can obtain the stationary distribution as well as the backward committor. The combination of forward and backward predictions allows the analysis of transition paths using transition path theory without needing to generate full transition paths^{37,38,48} and has been used to understand rare transition events in molecular dynamics^{10,13,16,17,61,62} and geophysical flows.^{63–67} We leave these extensions to future work.

In cases in which trajectories that reach the reactant and product states are available, it would be interesting to compare our inexact iterative schemes against schemes for committor approximation based on logistic regression and related approaches.^{35,41–46,68} These schemes are closely related to what is called “Monte Carlo” approximation in reinforcement learning,^{31} and also to the inexact Richardson iteration that we propose here with *τ* → ∞.

We have seen that temporal difference (TD) learning, a workhorse for prediction in reinforcement learning, is closely related to an inexact form of Richardson iteration. Variants such as TD(*λ*) have similar relationships to inexact iterative schemes. As we showed, subspace iteration is a natural way of addressing slow convergence. We thus anticipate that our results have implications for reinforcement learning, particularly in scenarios in which the value function depends on the occurrence of some rare event. Finally, we note that our framework can be extended to a wider range of iterative numerical linear algebra algorithms. In particular, Krylov or block Krylov subspace methods may offer further acceleration. In fact, an approach along these lines was introduced very recently for value function estimation in reinforcement learning.^{69}

## ACKNOWLEDGMENTS

We are grateful to Arnaud Doucet for pointing our attention to TD methods and the inexact power iteration in Ref. 32, both of which were key motivations for this work. We also thank Joan Bruna for helpful conversations about reinforcement learning. This work was supported by the National Institutes of Health (Award No. R35 GM136381) and the National Science Foundation (Award No. DMS-2054306). S.C.G. acknowledges support by the National Science Foundation Graduate Research Fellowship under Grant No. 2140001. J.W. acknowledges support from the Advanced Scientific Computing Research Program within the DOE Office of Science through Award No. DE-SC0020427. This work was completed in part with resources provided by the University of Chicago Research Computing Center, and we are grateful for their assistance with the calculations. “Beagle-3: A Shared GPU Cluster for Biomolecular Sciences” was supported by the National Institutes of Health (NIH) under the High-End Instrumentation (HEI) Grant Program Award No. 1S10OD028655-0.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

J.S., S.C.G., A.R.D., and J.W. conceptualized research. J.S. developed the method. C.L. adapted the linear solve used for prediction from Refs. 58 and 59. J.S. and S.C.G. performed the numerical tests. A.R.D. and J.W. supervised the research. All authors wrote and edited the manuscript.

**John Strahan**: Conceptualization (equal); Investigation (equal); Writing – original draft (equal); Writing – review & editing (equal). **Spencer C. Guo**: Conceptualization (equal); Investigation (equal); Writing – original draft (equal); Writing – review & editing (equal). **Chatipat Lorpaiboon**: Methodology (equal); Writing – review & editing (equal). **Aaron R. Dinner**: Conceptualization (equal); Funding acquisition (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal). **Jonathan Weare**: Conceptualization (equal); Funding acquisition (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available within the article. Code implementing the algorithms and a Jupyter notebook illustrating use of the method on the Müller–Brown example are available at https://github.com/dinner-group/inexact-subspace-iteration.

## REFERENCES

*An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation*

*Volume 797 of Advances in Experimental Medicine and Biology*

_{6}hexamer

*Reinforcement Learning: An Introduction*

*Matrix Computations*

*Computer Simulations in Condensed Matter Systems: From Materials to Chemical Biology*

*Probability and its Applications*