Understanding dynamics in complex systems is challenging because there are many degrees of freedom, and those that are most important for describing events of interest are often not obvious. The leading eigenfunctions of the transition operator are useful for visualization, and they can provide an efficient basis for computing statistics, such as the likelihood and average time of events (predictions). Here, we develop inexact iterative linear algebra methods for computing these eigenfunctions (spectral estimation) and making predictions from a dataset of short trajectories sampled at finite intervals. We demonstrate the methods on a low-dimensional model that facilitates visualization and a high-dimensional model of a biomolecular system. Implications for the prediction problem in reinforcement learning are discussed.
I. INTRODUCTION
Modern observational, experimental, and computational approaches often yield high-dimensional time series data (trajectories) for complex systems. In principle, these trajectories contain rich information about dynamics and, in particular, the infrequent events that are often most consequential. In practice, however, high-dimensional trajectory data are often difficult to parse for useful insight. There is thus a critical need for more efficient statistical tools for analyzing trajectory data, especially when the goal is to understand rare events that may not be well represented in the data.
We consider dynamics that can be treated as Markov processes. A common starting point for statistical analyses of Markov processes is the transition operator, which describes the evolution of function expectations. The eigenfunctions of the transition operator characterize the most slowly decorrelating features (modes) of the system.1–5 These can be used for dimensionality reduction to obtain a qualitative understanding of the dynamics,6,7 or they can be used as the starting point for further computations.8–10 Similarly, prediction functions, which provide information about the likelihood and timing of future events as a function of the current state, are defined through linear equations of the transition operator.10,11
A straightforward numerical approach to obtaining these functions is to convert the transition operator to a matrix by projecting onto a finite basis for Galerkin approximation.1,2,10–15 The performance of such a linear approximation depends on the choice of basis,10,11,15 and previous work often resorts to a set of indicator functions on a partition of the state space (resulting in a Markov state model or MSM14) for lack of a better choice. While Galerkin approximation has yielded many insights,16,17 the limited expressivity of the basis expansion has stimulated interest in (nonlinear) alternatives.
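For concreteness, the indicator-basis (MSM) construction can be sketched as follows. This is a minimal illustration assuming the frames have already been assigned to cluster labels; it is not the full MSM pipeline used later in this paper, and the function and variable names are ours.

```python
import numpy as np

def msm_spectral_estimate(labels, n_states, tau, k=3):
    """Galerkin approximation with an indicator basis (an MSM):
    count transitions at lag tau between cluster labels, row-normalize
    to obtain a transition matrix, and return its k leading eigenpairs.
    The right eigenvectors give the eigenfunction values on the sets."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(labels[:-tau], labels[tau:]):
        counts[i, j] += 1.0
    P = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(evals))[:k]
    return evals[order].real, evecs[:, order].real
```

The quality of such an approximation is limited by how well the partition resolves the eigenfunctions, which motivates the nonlinear alternatives discussed next.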
In particular, artificial neural networks can be harnessed to learn eigenfunctions of the transition operator and prediction functions from data.5,18–25 However, existing approaches based on neural networks suffer from various drawbacks. As discussed in Ref. 5, their performance can often be very sensitive to hyperparameters, requiring extensive tuning and varying with random initialization. Many use loss functions that are estimated against the stationary distribution,25–30 so that metastable states contribute most heavily, which negatively impacts performance.24,30 Assumptions about the dynamics (e.g., microscopic reversibility) limit applicability. In Ref. 24, we introduced an approach that overcomes the issues above, but it uses multiple trajectories from each initial condition; this limits the approach to analysis of simulations and moreover requires specially prepared datasets.
The need to compute prediction functions from observed trajectory data also arises in reinforcement learning. There the goal is to optimize an expected future reward (the prediction function) over a policy (a Markov process). For a fixed Markov process, the prediction problem in reinforcement learning is often solved by temporal difference (TD) methods, which allow the use of arbitrary ensembles of trajectories without knowledge of the details of the underlying dynamics.31 TD methods have a close relationship with an inexact form of power iteration, which, as we describe, can perform poorly on problems involving rare events.
Motivated by this relationship, as well as by an inexact power iteration scheme previously proposed for approximating the stationary probability distribution of a Markov process using trajectory data,32 we propose a computational framework for spectral estimation and rare-event prediction based on inexact iterative numerical linear algebra. Our framework includes an inexact Richardson iteration for the prediction problem, as well as an extension to inexact subspace iteration for the prediction and spectral estimation problems. The theoretical properties of exact subspace iteration suggest that eigenfunctions outside the span of the approximation will contribute significantly to the error of our inexact iterative schemes.33 Consistent with this prediction, we demonstrate that learning additional eigenvalues and eigenfunctions simultaneously through inexact subspace iteration accelerates convergence dramatically relative to inexact Richardson and power iteration in the context of rare events. While we assume the dynamics can be modeled by a Markov process, we do not require knowledge of their form or a specific underlying model. The method shares a number of further advantages with the approach discussed in Ref. 24 without the need for multiple trajectories from each initial condition in the dataset. This opens the door to treating a wide range of observational, experimental, and computational datasets.
The remainder of the paper is organized as follows: In Sec. II, we describe the quantities that we seek to compute in terms of linear operators. In Secs. III and IV, we introduce an inexact subspace iteration algorithm that we use to solve for these quantities. Section V illustrates how the loss function can be tailored to the known properties of the desired quantity. Section VI summarizes the two test systems that we use to illustrate our methods: a two-dimensional potential, for which we can compute accurate reference solutions, and a molecular example that is high-dimensional but still sufficiently tractable that statistics for comparison can be computed from long trajectories. In Sec. VII, we explain the details of the invariant subspace iteration and then demonstrate its application to our two examples. Finally, Sec. VIII details how the subspace iteration can be modified to compute prediction functions and compares the effect of different loss functions, as well as the convergence properties of power iteration and subspace iteration. We conclude with implications for reinforcement learning.
II. SPECTRAL ESTIMATION AND THE PREDICTION PROBLEM
In contrast, unbiased estimation of the second term in (9) requires access to two independent trajectories of Xt for each sampled initial condition since it is quadratic in .24,31 In the reinforcement learning community, TD methods were developed to minimize a loss very similar in form to (8), and they perform a “semigradient” descent by following only the first term in (9).31 As in the semigradient approximation, the algorithms proposed in this paper require access only to inner products of the form for an operator related to or , and they avoid terms that are nonlinear in . As we explain, such inner products can be estimated using trajectory data. However, rather than attempting to minimize the residual directly by an approximate gradient descent, we view the eigenproblem and the prediction problem through the lens of iterative numerical linear algebra schemes.
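To make the connection to TD methods concrete, the following sketch shows a single semigradient update for a parametric prediction function. The network, optimizer, and data format are illustrative assumptions rather than a description of the TD implementations cited above.

```python
import torch

def semigradient_td_step(u_net, optimizer, x0, x1, reward):
    """One TD-style semigradient step: regress u(x0) toward the target
    reward + u(x1), holding the target fixed (i.e., following only the
    first term of the residual gradient)."""
    optimizer.zero_grad()
    with torch.no_grad():                    # stop gradients through the target
        target = reward + u_net(x1).squeeze(-1)
    loss = 0.5 * torch.mean((u_net(x0).squeeze(-1) - target) ** 2)
    loss.backward()
    optimizer.step()
    return loss.item()
```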
III. AN INEXACT POWER ITERATION
Reference 32 introduced an inexact power iteration to compute the stationary probability measure of , i.e., its dominant left eigenfunction. As those authors note, an inexact power update, such as (13), can be enforced using a variety of loss functions. In our setting, the norm in (11) can be replaced by any other measure of the difference between uθ′ and , as long as an unbiased estimator of its gradient with respect to θ′ is available. This flexibility is discussed in more detail in Sec. V, and we exploit it in applications presented later in this article. For now, we focus instead on another important implication of this viewpoint: the flexibility in the form of the iteration itself.
We will see that when the spectral gap of , the difference between its largest and second largest eigenvalues, is small, the inexact power iteration (or the equivalent Richardson iteration) described above fails to reach an accurate solution. The largest eigenvalue of in (14) is 1, and its next largest eigenvalue is the dominant eigenvalue of . For rare-event problems, the difference between these two eigenvalues is expected to be very small. Indeed, when X0 is drawn from the quasi-stationary distribution of Xt in D, the logarithm of the largest eigenvalue of is exactly .39 Thus, when the mean exit time from D is large, we can expect the spectral gap of in (14) to be very small. In this case, classical convergence results tell us that exact power iteration converges slowly, with the largest contributions to the error coming from the eigenfunctions of with the largest-magnitude eigenvalues.33 Iterative schemes that approximate multiple dominant eigenfunctions simultaneously, such as subspace iteration and Krylov methods, can converge much more rapidly.33 In Sec. IV, we introduce an inexact form of subspace iteration to address this shortcoming. Beyond dramatically improving performance on the prediction problem for rare events when applied with in (14), it can also be applied with to compute multiple dominant eigenfunctions and eigenvalues of itself.
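Before introducing the subspace generalization, it may help to see the structure of an inexact power (Richardson) iteration in which each application of the operator is replaced by a regression solved with stochastic gradient steps. The sketch below is schematic: it omits rewards, stopping, and normalization, and the sampling routine and network are placeholders.

```python
import copy
import torch

def inexact_power_iteration(u_net, sample_pairs, n_outer, n_inner, lr):
    """Inexact power iteration: at each outer step, train the network so
    that u_new(X_0) matches u_old(X_tau) on sampled transition pairs,
    i.e., apply the transition operator approximately."""
    for _ in range(n_outer):
        u_old = copy.deepcopy(u_net)
        for p in u_old.parameters():
            p.requires_grad_(False)
        opt = torch.optim.Adam(u_net.parameters(), lr=lr)
        for _ in range(n_inner):
            x0, xtau = sample_pairs()                # batch of (X_0, X_tau) pairs
            target = u_old(xtau).squeeze(-1)         # inexact operator application
            loss = 0.5 * torch.mean((u_net(x0).squeeze(-1) - target) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # For the eigenproblem, a normalization (and, in Sec. IV, an
        # orthogonalization against the other basis functions) follows here.
    return u_net
```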
IV. AN INEXACT SUBSPACE ITERATION
Even when sampling is not required to estimate the matrices in (19) and (20), the numerical rank of Cτ becomes very small as the basis functions become increasingly aligned with the single dominant eigenfunction. To overcome this issue, we apply an orthogonalization step between iterations (or every few iterations). Just as the matrices C0 and Cτ can be estimated using trajectory data, the orthogonalization procedure can also be implemented approximately using data.
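A minimal sketch of such a data-driven orthogonalization is given below, assuming the basis functions have been evaluated on a set of sampled initial points; the array layout and the use of a thin QR factorization are our choices for illustration.

```python
import numpy as np

def orthogonalization_mixing(U0):
    """U0 is an (n_samples, k) array of basis-function values at sampled
    points X_0. Return a k x k matrix R_inv such that the linear
    combinations U0 @ R_inv are approximately orthonormal with respect
    to the sampling measure."""
    n = U0.shape[0]
    _, R = np.linalg.qr(U0 / np.sqrt(n))      # thin QR of the data matrix
    # When the basis is nearly degenerate, R is ill-conditioned and some
    # regularization is needed before inverting.
    return np.linalg.inv(R)
```

How this mixing is carried over to the neural-network basis is specified in Algorithms 1 and 2.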
For the eigenproblem, the details, including estimators for all required inner products, are given in Sec. VII and Algorithm 1; for the prediction problem with as in (14), they are given in Sec. VIII and Algorithm 2.
Algorithm 1. Inexact subspace iteration for the eigenproblem (Sec. VII).
Require: Subspace dimension k, transition data, batch size B, learning rate η, number of subspace iterations S, number of inner iterations M, regularization parameters γ1 and γ2
1: Initialize the k neural-network basis functions
2: for s = 1 … S do
3:  for m = 1 … M do
4:   Sample a batch of B transition pairs
5:   Estimate the required inner products on the batch and take a gradient step of size η on the (regularized) loss for each basis function
6:  end for
7:  Compute the matrix K
8:  Compute the diagonal normalization matrix
9:  Compute the QR-decomposition Φ = QR
10:  Update the basis functions accordingly
11: end for
12: Estimate the matrices Ct for t = 0, τ from the data
13: Solve the generalized eigenproblem CτW = C0WΛ for W and Λ
14: return eigenvalues Λ, eigenfunctions
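As a complement to the pseudocode, the closing steps of Algorithm 1 (estimating C0 and Cτ and solving the small generalized eigenproblem) can be sketched as follows; the array names and shapes are our assumptions.

```python
import numpy as np
from scipy.linalg import eig

def small_generalized_eigenproblem(U0, Utau):
    """U0 and Utau hold the basis-function values at the two ends of the
    sampled transition pairs (each n_samples x k). Estimate C0 and
    C_tau and solve C_tau W = C0 W Lambda."""
    n = U0.shape[0]
    C0 = U0.T @ U0 / n
    Ctau = U0.T @ Utau / n
    evals, W = eig(Ctau, C0)
    order = np.argsort(-evals.real)
    return evals[order].real, W[:, order].real
```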
Algorithm 2. Inexact subspace iteration for the prediction problem (Sec. VIII).
Require: Subspace dimension k, stopped transition data, reward data, batch size B, learning rate η, number of subspace iterations S, number of inner iterations M, regularization parameters γ1 and γ2
1: Initialize the k neural-network basis functions
2: for s = 1 … S do
3:  for m = 1 … M do
4:   Sample a batch of B stopped transitions and the corresponding rewards
5:   Estimate the required inner products on the batch and take a gradient step of size η on the loss for each basis function
6:  end for
7:  if Ψ(x) = 0 then
8:   Compute the matrix K
9:  else
10:   Compute the matrix K for a > 1
11:  end if
12:  Compute the QR-decomposition Φ = QR
13:  Compute the diagonal normalization matrix
14:  Update the basis functions accordingly ⊳ if Ψ(x) = 0, exclude a = 1
15: end for
16: Estimate the matrices Ct for t = 0, τ ∧ Tj from the data
17: Compute the vector p
18: Solve the linear system (C0 − Cτ)w = p ⊳ if Ψ(x) = 0, enforce w1 = 1
19: return the prediction function (the corresponding linear combination of the basis)
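In the same spirit, the closing linear solve of Algorithm 2 can be sketched as follows, with the array contents again reflecting our assumptions (basis values at the initial and stopped times and the accumulated reward).

```python
import numpy as np

def prediction_coefficients(U0, Ustop, r):
    """Assemble and solve the small linear system (C0 - C_tau) w = p.
    U0: basis values at X_0 (n_samples x k); Ustop: basis values at the
    stopped time tau ^ T; r: accumulated reward per sample. The
    prediction function is then the linear combination of the basis
    (plus any guess function) with coefficients w."""
    n = U0.shape[0]
    C0 = U0.T @ U0 / n
    Ctau = U0.T @ Ustop / n
    p = U0.T @ r / n
    return np.linalg.solve(C0 - Ctau, p)
```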
V. ALTERNATIVE LOSS FUNCTIONS
As mentioned above, the inexact application of the operator can be accomplished by minimizing loss functions other than (17). The key requirement for a loss function in the present study is that appears in its gradient only through terms of the form for some functions f and g, so that the gradient can be estimated using trajectory data. As a result, we have flexibility in the choice of loss and, in turn, the representation of u. In particular, we consider the representation uθ = ζ(zθ), where ζ is an increasing function, and zθ is a function parameterized by a neural network. An advantage of doing so is that the function ζ can restrict the output values of uθ to some range. For example, when computing a probability such as the committor, a natural choice is the sigmoid function .
The loss in (17), which we use in several of our numerical experiments, corresponds to the choice ζ(x) = x and V(x) = x2/2. The choice of mentioned above corresponds to V(x) = log(1 + ex); we refer to the loss in (22) with this choice of V as the “softplus” loss. We note that in the context of committor function estimation, in the limit of infinite τ the “softplus” loss corresponds directly to the maximum log-likelihood loss (for independent Bernoulli random variables) that defines the logistic regression estimate of the probability of reaching B before A. Logistic regression has previously been used in conjunction with datasets of trajectories integrated all the way until reaching A or B to estimate the committor function.41–46
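For illustration, assuming the loss in (22) takes the generic form of an expectation of V(zθ(X0)) minus the target times zθ(X0), where the target stands in for the inexactly applied operator, the two choices of V discussed above can be written as batch losses as follows.

```python
import torch
import torch.nn.functional as F

def l2_loss(z, target):
    """V(x) = x^2/2 with zeta(x) = x: ordinary least-squares regression
    of u_theta = z onto the target."""
    return 0.5 * torch.mean((z - target) ** 2)

def softplus_loss(z, target):
    """V(x) = log(1 + e^x) with zeta = sigmoid, so u_theta = sigmoid(z).
    For targets in [0, 1] this is the Bernoulli log-likelihood
    (cross-entropy) in the logit z; relative to the L2 loss it penalizes
    errors near 0 and 1 more heavily."""
    return torch.mean(F.softplus(z) - target * z)
```

In both cases the gradient involves only ζ(zθ) − target multiplied by ∇θzθ, so it can be estimated from the same trajectory averages.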
VI. TEST PROBLEMS
We illustrate our methods with two well-characterized systems that exemplify features of molecular transitions. In this section, we provide key details of these systems.
A. Müller-Brown potential
B. AIB9 helix-to-helix transition
The second system is a peptide of nine α-aminoisobutyric acids (AIB9; Fig. 2). Because AIB is achiral around its α-carbon atom, AIB9 can form left- and right-handed helices with equal probabilities, and we study the transitions between these two states. This transition was previously studied using MSMs and long unbiased molecular dynamics simulations.49,50 AIB9 poses a stringent test due to the presence of many metastable intermediate states.
Following a procedure similar to that described in Ref. 50, we generate a dataset of short trajectories. From each of the 691 starting configurations in Ref. 50, we simulate ten trajectories of duration 20 ns with initial velocities drawn randomly from a Maxwell–Boltzmann distribution at a temperature of 300 K. The short trajectory dataset thus contains 6910 trajectories, corresponding to a total sampling time of 138.2 µs. We use a timestep of 4 fs together with a hydrogen mass repartitioning scheme,52 and configurations are saved every 40 ps. We employ the AIB parameters from Forcefield_NCAA53 and the GBNeck2 implicit-solvent model.54 Simulations are performed with the Langevin integrator in OpenMM 7.7.055 using a friction coefficient of 1 ps−1. To generate a reference for comparison to our results, we randomly select 20 configurations from the dataset above and, from each, run a simulation of 5 µs with the same simulation parameters. For all following tests on AIB9, batches consist of pairs of frames separated by τ drawn randomly with replacement from the short trajectories (i.e., from all possible such pairs in the dataset). From each frame, we use only the atom positions because the momenta decorrelate within a few picoseconds, which is much shorter than the lag times that we consider. However, in principle, the momenta could impact the prediction functions56 and be used as neural-network inputs as well.
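A hedged sketch of this simulation setup in OpenMM, for one trajectory from one starting configuration, is given below. The input file names are placeholders, and the specific integrator class, constraint choice, and repartitioned hydrogen mass are our assumptions consistent with a 4 fs timestep; the AIB parameters from Forcefield_NCAA are assumed to have been prepared in Amber format.

```python
import openmm as mm
import openmm.app as app
import openmm.unit as unit

# Load a prepared AIB9 topology and coordinates (placeholder file names).
prmtop = app.AmberPrmtopFile("aib9.prmtop")
inpcrd = app.AmberInpcrdFile("aib9.inpcrd")

# GBNeck2 implicit solvent (GBn2 in OpenMM) with hydrogen mass
# repartitioning to allow a 4 fs timestep.
system = prmtop.createSystem(
    implicitSolvent=app.GBn2,
    constraints=app.HBonds,
    hydrogenMass=1.5 * unit.amu,
)

# Langevin dynamics at 300 K with a friction coefficient of 1/ps.
integrator = mm.LangevinMiddleIntegrator(
    300 * unit.kelvin, 1.0 / unit.picosecond, 4.0 * unit.femtoseconds
)
simulation = app.Simulation(prmtop.topology, system, integrator)
simulation.context.setPositions(inpcrd.positions)
simulation.context.setVelocitiesToTemperature(300 * unit.kelvin)

# Save configurations every 40 ps (10,000 steps at 4 fs) for 20 ns.
simulation.reporters.append(app.DCDReporter("traj.dcd", 10_000))
simulation.step(5_000_000)
```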
VII. SPECTRAL ESTIMATION
A. Müller–Brown model
As a first test of our method, we compute the k = 3 dominant eigenpairs for the Müller–Brown model. Since we know that the dominant eigenfunction of the transition operator is the constant function v1(y, z) = 1 with eigenvalue λ1 = 1, we directly include this function in the basis as a non-trainable function, i.e., . To initialize for each a > 1, we choose a standard Gaussian vector , and set . This ensures that the initial basis vectors are well-separated and the first QR step is numerically stable. Here and in all subsequent Müller–Brown tests, batches of trajectories are drawn from the entire dataset with replacement. Other hyperparameters are listed in Table I.
TABLE I. Hyperparameters used for the numerical examples.

| Hyperparameter | Spectral estimation: Müller-Brown | Spectral estimation: AIB9 | Committor: Müller-Brown | Committor: Modified Müller-Brown | Committor: AIB9 | MFPT: AIB9 |
|---|---|---|---|---|---|---|
| Subspace dimension k | 3 | 5 | 1 | 2, 1ᵃ | 1 | 5 |
| Input dimensionality | 2 | 174 | 2 | 2 | 174 | 174 |
| Hidden layers | 6 | 6 | 6 | 6 | 6 | 6 |
| Hidden layer width | 64 | 128 | 64 | 64 | 150 | 150 |
| Hidden layer nonlinearity | CeLU | CeLU | ReLU | ReLU | ReLU | ReLU |
| Output layer nonlinearity | None | tanh | Sigmoid/none | None | None | None |
| Outer iterations S | 10 | 100 | 100 | 4 + 10ᵃ | 100 | 300 |
| Inner iterations M | 5000 | 2000 | 200 | 5000 | 2000 | 1000 |
| σ | 2 | 50 | 50 | 0 | 50 | 0 |
| Batch size B | 2000 | 1024 | 5000 | 2000 | 1024 | 2000 |
| Learning rate η | 0.001 | 0.0001 | 0.001 | 0.001 | 0.001 | 0.001 |
| γ1 | 0.15 | 0.001 | ⋯ | ⋯ | ⋯ | 0.1 |
| γ2 | 0.01 | 0.01 | ⋯ | ⋯ | ⋯ | 10 |
| Loss for the first basis function | ⋯ | ⋯ | /softplus | Softplus | Softplus | |
| Loss for basis functions a > 1 | | | ⋯ | | ⋯ | |
ᵃFour subspace iterations with k = 2 followed by ten iterations with k = 1.
Figure 3 shows that we obtain good agreement between the estimate produced by the inexact subspace iteration in Algorithm 1 and reference eigenfunctions. Figure 4 (upper panels) shows how the corresponding eigenvalues vary with lag time; again there is good agreement with the reference. Furthermore, there is a significant gap between λ3 and λ4, indicating that a three-dimensional subspace captures the dynamics of interest for this system.
B. AIB9
For the molecular test system, we compute the dominant five-dimensional subspace. We compare the inexact subspace iteration in Algorithm 1 with MSMs constructed on dihedral angles (“dihedral MSM”) and on Cartesian coordinates (“Cartesian MSM”). We expect the dihedral MSM to be accurate given that the dynamics of AIB9 are well-described by the backbone dihedral angles,49,50 and we thus use it as a reference. It is constructed by clustering the sine and cosine of each of the backbone dihedral angles (ϕ and ψ) for the nine residues (for a total of 2 × 2 × 9 = 36 input features) into 1000 clusters using k-means and counting transitions between clusters. The Cartesian MSM is constructed by similarly counting transitions between 1000 clusters from the k-means algorithm, but the clusters are based on the Cartesian coordinates of all non-hydrogen atoms after aligning the backbone atoms of the trajectories, for a total of 174 input features. Because of the difficulty of clustering high-dimensional data, we expect the Cartesian MSM basis to give poor results. The neural network for the inexact subspace iteration receives the same 174 Cartesian coordinates as input features. We choose to use Cartesian coordinates rather than dihedral angles as inputs because it requires the network to identify nontrivial representations for describing the dynamics.
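A minimal sketch of the dihedral-MSM construction for a single trajectory is shown below; the clustering implementation and the handling of multiple trajectories are simplified relative to the actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def dihedral_msm(phi, psi, tau, n_clusters=1000, seed=0):
    """phi and psi are (n_frames, 9) arrays of backbone dihedrals (radians)
    for one trajectory. Featurize each frame by the sine and cosine of
    each angle (36 features for AIB9), cluster with k-means, and count
    transitions at lag tau to form an MSM transition matrix."""
    feats = np.hstack([np.sin(phi), np.cos(phi), np.sin(psi), np.cos(psi)])
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(feats)
    counts = np.zeros((n_clusters, n_clusters))
    for i, j in zip(labels[:-tau], labels[tau:]):
        counts[i, j] += 1.0
    return counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
```

The Cartesian MSM is built in the same way but with the 174 aligned heavy-atom coordinates as features.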
As in the Müller–Brown example, we use and a random linear combination of coordinate functions to initialize for a > 1. Other hyperparameters are listed in Table I. With these choices, the neural-network training typically requires about 20 minutes on a single NVIDIA A40 GPU; this is much longer than the time required for diagonalization of the 1000 × 1000 MSM transition matrix, which is nearly instantaneous. However, the time for neural-network training is negligible compared with the time to generate the dataset, which is the same for both approaches.
With the dihedral MSM as reference, we find that the Cartesian MSM systematically underestimates the eigenvalues (Fig. 5). The subspace iteration is very accurate for the first four eigenvalues, but the estimates for the fifth are low and vary considerably from run to run. A very small gap between λ4 and λ5 may contribute to the difficulty in estimating λ5. In Fig. 6, we plot the first two non-trivial eigenfunctions (v2 and v3), which align with the axes of the dPC projection. The eigenfunction v2 corresponds to the transition between the left- and right-handed helices; the eigenfunction v3 is nearly orthogonal to v2 and corresponds to transitions between intermediate states. It is challenging to visualize the remaining two eigenfunctions by projecting onto the first two dPCs because v4 and v5 are orthogonal to v2 and v3. The estimates for v2 are in qualitative agreement for all lag times tested (Fig. 6 shows results for τ corresponding to 40 ps), but the subspace iteration results are less noisy for the shortest lag times. Moreover, the estimate for v3 from subspace iteration agrees more closely with that from the dihedral MSM than does the estimate for v3 from the Cartesian MSM. The subspace distance for v2 and v3 between the subspace iteration and the dihedral MSM is 0.947, compared with a value of 0.969 for the subspace distance between the two MSMs. Together, our results indicate that the neural networks are able to learn the leading eigenfunctions and eigenvalues of the transition operator (dynamical modes) of this system despite being presented with coordinates that are not the natural ones for describing the dynamics.
VIII. PREDICTION
The Richardson iterate, , must satisfy the boundary condition for x ∉ D. The other basis functions should satisfy for x ∉ D. In practice, we enforce the boundary conditions by explicitly setting and for a > 1 when x ∉ D.
When the boundary conditions are zero, the orthogonalization procedure and the matrix K are applied to all basis functions as in Sec. VII. When the boundary conditions are non-zero, the orthogonalization procedure is applied only to the basis functions , and Ka1 = Ia1, where Ia1 denotes the (a, 1) entry of the identity matrix. We summarize our procedure for prediction in Algorithm 2.
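A minimal sketch of enforcing these boundary conditions by masking the network outputs is given below; the indicator of D and the boundary values are assumed to be available for each sample, and the names are ours.

```python
import torch

def apply_boundary_conditions(z_net, x, in_domain, boundary_value):
    """Return basis-function values satisfying the boundary conditions:
    outside D, the first (Richardson) component takes the prescribed
    boundary value and the remaining components are zero.
    in_domain: (batch,) boolean mask, True for x in D.
    boundary_value: (batch,) values outside D (e.g., 1 on B, 0 on A)."""
    z = z_net(x)                                   # (batch, k) raw outputs
    mask = in_domain.float().unsqueeze(-1)         # 1 inside D, 0 outside
    u = mask * z                                   # all components vanish outside D
    bc = (1.0 - in_domain.float()) * boundary_value
    return torch.cat([u[:, :1] + bc.unsqueeze(-1), u[:, 1:]], dim=-1)
```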
A. Müller–Brown committor
In this section, we demonstrate the use of our method for prediction by estimating the committor for the Müller–Brown model with a shallow intermediate basin at (−0.25, 0.5) (Fig. 1). Here, the sets A and B are defined as in Eq. (24) and T is the time of first entrance to . In this case, a one-dimensional subspace iteration (i.e., k = 1 in Algorithm 2) appears sufficient to accurately solve the prediction problem. Figure 7 shows the largest eigenvalue of the stopped transition operator [the second largest of in (14)] computed from our grid-based reference scheme (Sec. VI A). Richardson iteration should converge geometrically in this eigenvalue,33 and so, for moderate lag times, we can expect our method to converge in a few dozen iterations. To initialize the algorithm we choose . All other hyperparameters are listed in Table I.
Results as a function of lag time are shown in Fig. 8. We see that the Richardson iterate is more accurate than the MSM for all but the shortest lag times. When using the loss, the results are comparable, whereas the softplus loss allows the Richardson iterate to improve the RMSE of the logit function in (44) with no decrease in performance with respect to the RMSE of the committor. Results as a function of the size of the dataset are shown in Fig. 9 for a fixed lag time of . The Richardson iterate generally does as well or better than the MSM. Again, the differences are more apparent in the RMSE of the logit function in (44). By that measure, the Richardson iterate obtained with both loss functions is significantly more accurate than the MSM for small numbers of trajectories. The softplus loss maintains an advantage even for large numbers of trajectories.
B. Accelerating convergence by incorporating eigenfunctions
As discussed in Sec. III, we expect Richardson iteration to converge slowly when the largest eigenvalue of , λ1, is close to 1. More precisely, the number of iterations required to reach convergence should scale with , the mean escape time from the quasi-stationary distribution to the boundary of D divided by the lag time. With this in mind, we can expect inexact Richardson iteration for the Müller–Brown to perform poorly if we deepen the intermediate basin at (−0.25, 0.5) as in Fig. 10 (top left). Again, the sets A and B are defined as in (24), and T is the time of first entrance to . In this case, −1/log λ1 is on the order of 100 for the lag times we consider and, as expected, inexact Richardson iteration converges slowly (Fig. 10, bottom left). Estimates of the committor by inexact Richardson iteration do not reach the correct values even after hundreds of iterations (Fig. 10, bottom right).
We now show that convergence can be accelerated dramatically by incorporating additional eigenfunctions of (i.e., k > 1 in Algorithm 2). For the Müller–Brown model with a deepened intermediate basin, the second eigenvalue of is of order 10−4 for a lag time of τ = 1000 steps or (while the first is near one as discussed above). We therefore choose k = 2 with initialized as a random linear combination of coordinate functions as in previous examples. We run the subspace iteration for four iterations, compute the committor as a linear combination of the resulting functions, and then refine this result with a further ten Richardson iterations (i.e., k = 1 with the starting vector as the output of the k = 2 subspace iteration). To combine the functions, we use a linear solve which incorporates memory (Algorithm 3).58,59 We find that the use of memory improves the data-efficiency substantially for poorly conditioned problems. For our tests here, we use three memory kernels, corresponding to τmem = ⌊τ/4⌋.
Algorithm 3. Linear solve with memory used to combine the basis functions (Sec. VIII B).
Require: Stopped transition data, guess function h, reward data, basis set, lag between memory kernels τmem
1: for s = 0 … (τ/τmem) do
2:  Initialize the matrix Cs with zeros
3:  Estimate the entries of Cs involving the guess function h and the reward data
4:  for a = 2 … k do
5:   Estimate the entries of Cs involving basis function a and the guess/reward
6:   for b = 2 … k do
7:    Estimate the entry of Cs for the basis-function pair (a, b)
8:   end for
9:  end for
10: end for
11: A ← C1 − C0
12: for s = 0 … (τ/τmem) − 2 do
13:  Accumulate the memory-kernel contribution at lag s
14: end for
15: Form Amem from A and the accumulated memory kernels
16: Solve Amem w = w for w
17: return the prediction function assembled from w, the basis, and the guess h
The bottom row of Fig. 11 illustrates the idea of the subspace iteration. The second eigenfunction (Fig. 11, center) is peaked at the intermediate. As a result, the two neural-network functions linearly combined by the Galerkin approach with memory can yield a good result for the committor (Fig. 11, bottom right). Figure 12 compares the RMSE for the committor and the RMSE for the logit in (44) for Algorithm 2 with k = 1 (pure Richardson iteration) and k = 2 (incorporating the first non-trivial eigenfunction), and an MSM with 400 states. We see that the Richardson iteration suffers large errors at all lag times; as noted previously, this error is mainly in the vicinity of the intermediate. The MSM cannot accurately compute the small probabilities, but does as well as the subspace iteration in terms of RMSE.
C. AIB9 prediction results
As an example of prediction in a high-dimensional system, we compute the committor for the transition between the left- and right-handed helices of AIB9 using the inexact Richardson iteration scheme (k = 1 in Algorithm 2) with the softplus loss function. Specifically, for this committor calculation, T is the time of first entrance to with A and B defined in Sec. VI B. As before, we initialize .
Results for the full dataset with τ corresponding to 400 ps are shown in Fig. 13. The projection on the principal components is consistent with the symmetry of the system, and the predictions show good agreement with the empirical committors. As τ decreases, the results become less accurate (Fig. 14, top left); at shorter lag times we would expect further increases in the error. We also examine the dependence of the results on the size of the dataset by subsampling the short trajectories and then training neural networks on the reduced set of trajectories (Fig. 14, top right). We find that the performance steadily drops as the number of trajectories is reduced and degrades rapidly for the datasets subsampled more than 20-fold (Fig. 14, bottom right), corresponding to about 7 µs of total sampling.
Finally, we compute the MFPT to reach the right-handed helix using the same dataset. For the MFPT calculation T is the time of first entrance to . Note that the time of first entrance to B includes long dwell times in A and is expected to be much larger than the time of first entrance to A ∪ B.
In Fig. 15, we show the MFPT obtained from Algorithm 2 with k = 5 and the loss function. Initially, we set equal to an arbitrary positive function (we use ) and for a > 1 to a random linear combination of coordinate functions. In Fig. 16, we examine the convergence of the MFPT from the left-handed helix to the right-handed helix computed with k = 1 (pure Richardson iteration) and k = 5. To obtain the results shown, we train neural networks on the short-trajectory dataset and then average their predictions for structures in the left-handed helix state in the long reference trajectories. The horizontal line in Fig. 16 indicates an MFPT of about 56 ns estimated from the long reference trajectories. We see that the algorithm with k = 5 converges much faster (note the differing scales of the horizontal axes) and yields accurate results at all lag times other than the shortest shown. The need to choose k > 1 for this MFPT calculation is again consistent with the theoretical convergence behavior of exact subspace iteration. Because the typical time of first entrance to B from points in A is very large, we expect the dominant eigenvalue of to be very near one when . In contrast, the committor calculation benefits from the fact that the time of first entrance to A ∪ B is much shorter, implying a smaller dominant eigenvalue of when .
IX. CONCLUSIONS
In this work, we have presented a method for spectral estimation and rare-event prediction from short-trajectory data. The key idea is to use the data to apply the relevant operators inexactly within a subspace iteration. For the test systems that we considered, the method not only outperforms high-resolution MSMs but also can be tuned, through the choice of loss function, to compute committor probabilities accurately near the reactants, transition states, and products. Other than the Markov assumption, our method requires no knowledge of the underlying model and places no restrictions on its dynamics.
As discussed in prior neural-network based prediction work,24,30 our method is sensitive to the quality and distribution of the initial sampling data. However, our work shares with Ref. 24 the major advantage of allowing the use of arbitrary inner products. This enables adaptive sampling of the state space24,60 and—together with the features described above—the application to observational and experimental data, for which the stationary distribution is generally unavailable.
In the present work, we focused on statistics of transition operators, but our method can readily be extended to solve problems involving their adjoint operators as well. By this means, we can obtain the stationary distribution as well as the backward committor. The combination of forward and backward predictions allows the analysis of transition paths using transition path theory without needing to generate full transition paths37,38,48 and has been used to understand rare transition events in molecular dynamics10,13,16,17,61,62 and geophysical flows.63–67 We leave these extensions to future work.
In cases in which trajectories that reach the reactant and product states are available, it would be interesting to compare our inexact iterative schemes against schemes for committor approximation based on logistic regression and related approaches.35,41–46,68 These schemes are closely related to what is called “Monte Carlo” approximation in reinforcement learning,31 and also to the inexact Richardson iteration that we propose here with τ → ∞.
We have seen that temporal difference (TD) learning, a workhorse for prediction in reinforcement learning, is closely related to an inexact form of Richardson iteration. Variants such as TD(λ) have similar relationships to inexact iterative schemes. As we showed, subspace iteration is a natural way of addressing slow convergence. We thus anticipate that our results have implications for reinforcement learning, particularly in scenarios in which the value function depends on the occurrence of some rare event. Finally, we note that our framework can be extended to a wider range of iterative numerical linear algebra algorithms. In particular, Krylov or block Krylov subspace methods may offer further acceleration. In fact, an approach along these lines was very recently introduced for value function estimation in reinforcement learning.69
ACKNOWLEDGMENTS
We are grateful to Arnaud Doucet for pointing our attention to TD methods and the inexact power iteration in Ref. 32, both of which were key motivations for this work. We also thank Joan Bruna for helpful conversations about reinforcement learning. This work was supported by the National Institutes of Health (Award No. R35 GM136381) and the National Science Foundation (Award No. DMS-2054306). S.C.G. acknowledges support by the National Science Foundation Graduate Research Fellowship under Grant No. 2140001. J.W. acknowledges support from the Advanced Scientific Computing Research Program within the DOE Office of Science through Award No. DE-SC0020427. This work was completed in part with resources provided by the University of Chicago Research Computing Center, and we are grateful for their assistance with the calculations. “Beagle-3: A Shared GPU Cluster for Biomolecular Sciences” was supported by the National Institutes of Health (NIH) under the High-End Instrumentation (HEI) Grant Program Award No. 1S10OD028655-0.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
J.S., S.C.G., A.R.D., and J.W. conceptualized research. J.S. developed the method. C.L. adapted the linear solve used for prediction from Refs. 58 and 59. J.S. and S.C.G. performed the numerical tests. A.R.D. and J.W. supervised the research. All authors wrote and edited the manuscript.
John Strahan: Conceptualization (equal); Investigation (equal); Writing – original draft (equal); Writing – review & editing (equal). Spencer C. Guo: Conceptualization (equal); Investigation (equal); Writing – original draft (equal); Writing – review & editing (equal). Chatipat Lorpaiboon: Methodology (equal); Writing – review & editing (equal). Aaron R. Dinner: Conceptualization (equal); Funding acquisition (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal). Jonathan Weare: Conceptualization (equal); Funding acquisition (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available within the article. Code implementing the algorithms and a Jupyter notebook illustrating use of the method on the Müller–Brown example are available at https://github.com/dinner-group/inexact-subspace-iteration.