The generalized master equation (GME) provides a powerful approach to study biomolecular dynamics via non-Markovian dynamic models built from molecular dynamics (MD) simulations. Previously, we have implemented the GME, namely the quasi Markov State Model (qMSM), where we explicitly calculate the memory kernel and propagate dynamics using a discretized GME. qMSM can be constructed with much shorter MD trajectories than the MSM. However, since qMSM needs to explicitly compute the time-dependent memory kernels, it is heavily affected by the numerical fluctuations of simulation data when applied to study biomolecular conformational changes. This can lead to numerical instability of predicted long-time dynamics, greatly limiting the applicability of qMSM in complicated biomolecules. We present a new method, the Integrative GME (IGME), in which we analytically solve the GME under the condition when the memory kernels have decayed to zero. Our IGME overcomes the challenges of the qMSM by using the time integrations of memory kernels, thereby avoiding the numerical instability caused by explicit computation of time-dependent memory kernels. Using our solutions of the GME, we have developed a new approach to compute long-time dynamics based on MD simulations in a numerically stable, accurate and efficient way. To demonstrate its effectiveness, we have applied the IGME in three biomolecules: the alanine dipeptide, FIP35 WW-domain, and *Taq* RNA polymerase. In each system, the IGME achieves significantly smaller fluctuations for both memory kernels and long-time dynamics compared to the qMSM. We anticipate that the IGME can be widely applied to investigate biomolecular conformational changes.

## I. INTRODUCTION

Biological molecules are often large and flexible, and their conformational changes are important for understanding their biological functions. Molecular dynamics (MD) simulations have been widely adopted in recent decades to investigate the conformational dynamics of biomolecules at the atomic level. However, the long timescale of biomolecular processes poses a challenge for MD simulation. Some processes reach milliseconds or even seconds, and atomistic simulations typically only reach sub-milliseconds for large biomolecules.

The Markov State Model (MSM)^{1–27} is a powerful approach used to bridge the timescale gap between atomistic MD simulations and biological processes. In MSMs, conformations are grouped into states according to their structural or dynamical similarities, and the conformational dynamics are modelled as Markovian transitions between these stable and metastable states at a discrete time sequence *t* = *nτ*_{T} (*τ*_{T} is the lag time). The Markovian transitions between each pair of states can be computed from many short MD trajectories, and the long-time dynamics can be predicted with the first-order master equation. To achieve Markovian dynamics, one often needs to construct an MSM with many small states, so that the relaxation within each state is easier to achieve, hence the transitions between pairs of states become Markovian within the affordable lag time (bound by length of MD trajectories). However, these MSMs usually contain a large number of (e.g., hundreds to thousands) states,^{28} which are useful to make quantitative predictions to be attested with experiments, but often hinder the comprehension of biological insights.^{29}

To address the above-mentioned challenge facing MSMs, we have developed a non-Markovian framework^{30,31} based on the generalized master equation (GME),^{32,33} in which we explicitly consider the memory kernels to predict the long-time dynamics of biomolecular conformational changes. In particular, we developed a straightforward implementation of the GME, the quasi-MSM (qMSM),^{30} to explicitly consider the memory kernels with the GME. The qMSM is constructed from short-time dynamics within the memory kernel relaxation time *τ*_{K}, and it can predict the long-time dynamics with the GME. The qMSM has been successfully applied in two large biomolecular systems: the gate opening dynamic of RNA polymerase^{34} and the conformational dynamics of messenger RNA (mRNA) recognition of RNA Induced Silencing Complex.^{35} To fit the single-molecular force spectroscopy data, Dill and co-workers^{36,37} have also developed a non-Markovian memory kernel (NMMK) method based on GME, where they applied the maximum entropy principle to obtain the memory kernels from experimental data.

In addition to the GME, the generalized Langevin equation (GLE) has also been recently applied to study biomolecular dynamics.^{38–46} The GLE is a useful tool for investigating the non-Markovian dynamics of biomolecules along one or several collective variables.^{43,44} Unlike the Langevin equation, friction in the GLE becomes a time integral of historical dynamics that involves memories. The memory effect in friction can significantly affect the conformational dynamics of biomolecules as revealed in the Grote–Hynes theory.^{44,45} Furthermore, it has been shown that the decay of the memory kernels in GLE heavily rely on the choice of collective variables. If good collective variables are selected, the memory kernel decay time is significantly shorter than the slowest timescale of the system (e.g., for the folding of Ala9^{44}). In addition to GME and GLE, Nóe and co-workers have developed HMM^{47} that provides an alternative way to model long-timescale biomolecular dynamics using a few states at relatively short lag times. HMMs can effectively approximate the projected Markov models containing hundreds of small states (microstates) using only a few hidden states. Rather than explicitly considering memory kernels as in GME or GLE, HMMs allow fuzzy state boundaries, where a MD conformation has a probability to be assigned to multiple states.

The initial success demonstrates that GME represents a promising approach to study conformational dynamics of complexed biomolecules. However, there still exists a major challenge in the robustness of the computed memory kernels when applying GME to study complex conformational changes. The qMSM often suffers from the numerical instability. For example, small numerical fluctuations of the short-time dynamics (the input of the qMSM) may lead to large fluctuations or even failures when predicting the long-time dynamics in many large biomolecular systems.^{34,35} To mitigate these fluctuations, we previously adopted a smoothing technique for the qMSM when dealing with large biomolecular systems.^{34,35} While the smoothing technique helps prevent failure of the qMSM, it still suffers from large uncertainty for predicted dynamics of complexed conformational changes (see Fig. 5). To improve the robustness of algorithms, the integrations often provide a useful approach by reducing the numerical fluctuations. For example, Dinner and co-workers^{48} have previously developed a method by integrating over multiple lag times when applying the variational approach to MSMs. They show that it can significantly reduce numerical fluctuations and yield more robust results than the variational approach.

We here introduce a new method, the Integrative GME (IGME), to overcome the above-mentioned challenge of the qMSM. The IGME utilizes the time integrals of memory kernels instead of the memory kernel tensor as used in the qMSM, thus the numerical fluctuation of the IGME is much smaller than that of the qMSM. In this paper, we first derive the analytical solution of the GME under the condition where memory kernels have decayed to zero. We then develop a two-step method to build IGME models and obtain parameters of this analytical solution by fitting to the MD simulations. Three biomolecular systems are presented to demonstrate the strength of the IGME: the alanine dipeptide, the FIP35 WW domain, and the gate opening dynamic of a bacterial RNA polymerase (RNAP). We show that the IGME has comparable performance with the qMSM for small alanine dipeptide that have sufficient samplings; while the IGME displays much less uncertainty in calculating the memory kernels and predicting long-time dynamics than the qMSM for the dynamics of WW domain and RNAP that contain larger fluctuations.

## II. THEORY AND METHOD

### A. The generalized master equation (GME) formulism of conformational dynamics

The dynamics of a molecular system satisfy the Liouville equation. The Liouville equation describes the dynamics in a high dimensional phase space, which is difficult to analyze or visualize. As a result, many tools have been developed to analyze the MD simulations in a low dimensional space, e.g., collective variables. In most cases, the dynamics of interest lie on a low dimensional manifold.

^{29}to project the continuous phase space onto discreate states. In this projection operator, the projection from the phase space to the state space would satisfy the following form:

^{29,49}

**belongs to state**

*x**j*, and $\chi jx=0$ indicates

**doesn’t belong to state**

*x**j*].

**is a vector containing the position of all atoms in the system.**

*x**χ*

_{j}and

*χ*

_{k}are orthogonal when

*j*≠

*k*: $\chi jx\chi kx=\delta jk\chi jx$. We then define the bracket operator as an ensemble average over equilibrium populations [$\rho eqx$] for the entire system:

*L*is the Liouville operator. Upon the projection, the dynamics follows the Nakajima-Zwanzig equation (which can be derived from the Liouville’s equation):

^{50,51}

*I*is an identity matrix). The second term on the right-hand side of Eq. (3) is called the inhomogeneous term, which is zero when $\rho 0$ is initiated from the equilibrium sampling. As demonstrated in our previous work,

^{30}the above Nakajima–Zwanzig equation can be rewritten as the GME:

*τ*

_{K}is the memory kernel decay time: $Kt\u2265\tau K=0$. The GME given by Eq. (4) determines the evolution of the transition probability matrix (TPM) at any time point. In Eq. (4), $Tt$ is referred to as the TPM, and $Tijt$ represents the probability of finding the system in state

*j*after time

*t*, given that it started from state

*i*:

*T*is a row-normalized matrix: $\u2211jTij=1$. $T\u03070$ is defined as follows:

^{30}we have developed a numerical implementation of the GME, qMSM, to compute the memory kernel tensor

*K*

_{ij}(

*t*) directly from the GME. In the qMSM, the equation used to compute memory kernels contains the first order time derivative of $Tt$, which may induce numerical fluctuations in the memory kernels. Consequently, the qMSM is heavily affected by the numerical fluctuations of $Tt$, and generates substantial numerical fluctuations when predicting long-time dynamics. Previously, we have developed a smoothing scheme to reduce the numerical fluctuations,

^{34}but the numerical fluctuations can still be large for large biomolecular systems (see Fig. 5). In this work, we introduce a new theory to utilize the integrals of memory kernels to solve the GME.

### B. The derivation of the integrative-GME (IGME) theory

*T*(

*t*) and

*M*

_{n}(

*t*). However, we notice that there often exist timescale separations for biomolecular dynamics, where the memory kernels decay to zero at significantly shorter timescale

*τ*

_{K}than the slowest dynamics of the system: $Kt\u2265\tau K\u22480$. Under this conditions (

*t*≥

*τ*

_{K}), the integrals of memory kernels become constants: $Mnt>\tau K=Mn\tau K\u2261Mn$. This subsequently allows us to analytically solve the Taylor expansion form of the GME, which can be rewritten as:

*T*(

*t*) alone, which can be analytically solved in a self-consistent manner. The self-consistent form of Eq. (11) is:

*m*-th iteration. First, we need to choose an initial guess $m=0$) to begin the iterative process. We here adopted an initial guess [$T0(t)$] as the solution of the zeroth order truncation of the Taylor-expansion form of GME [Eq. (11)]:

*M*

_{0}are constant matrices, $lnT\u03020\u2261T\u03070\u2212M0$ is also a constant matrix. Next, we move on to the next iteration (

*m*= 1). In this iteration, Eq. (12) becomes:

*m*= 0 [i.e., $T0t$] on the right-hand side of Eq. (14),

*m*= 1. For further interactions, we can perform a similar operation to obtain the general form for the solution at the

*m*-th iteration:

*m*→ ∞, the solution should be converged: $Tmt=Tm\u22121t$. We can then obtain the solution of $Tt$ at

*m*-th iteration [i.e., $T(m\u2192\u221e)t$] can be derived:

*A*, $T\u0302$ and

*M*

_{n}are all

*N*×

*N*matrices, where

*N*corresponds to the number of states. Our self-consistent solutions are based on the initial guess [$T0(t)$ in Eq. (13)]. In fact, this initial guess corresponds to the physical systems containing memory kernels as delta functions as discussed in our previous work.

^{30}Under this condition, the memory kernel decays infinitely fast in comparison to the time scale of interest, representing the limit of extreme timescale separation. In this section, we use the self-consistent approach to derive an analytical solution for the GME at

*t*≥

*τ*

_{K}, and we obtain the converged solution at an infinite number of interactions [$T(m\u2192\u221e)t$]. We note that Eqs. (18) and (19) can also be obtained via an alternative approach by solving the characteristic equations of the delayed differential equations (see Appendix A for details).

Matrices $T\u03070$, *A*, $T\u0302$ and *M*_{n} in Eq. (19) can be obtained by numerically fitting them to MD simulations. Specifically, Eq. (18) enables us to determine *A* and $T\u0302$, and Eq. (19) would allow us to obtain the time integrals of memory kernels at various orders: *M*_{n}. Interestingly, with *A* and $T\u0302$, we can already obtain the long-timescale dynamics [*T*(*t* ≥ *τ*_{K})] using Eq. (18) without explicitly knowing the integrals of memory kernels, *M*_{n}. Indeed, in our numerical implementations of the IGME model (Sec. II C), we fit Eq. (18) to MD simulation data to obtain two constant matrices *A* and $T\u0302$, and subsequently propagate the dynamics using Eq. (18). Equation (18) also indicates that a Markovian model (i.e., $Tn\tau T=T\tau Tn$) can only be reached when the lag time is long enough so that ln *A* (containing the contributions from the fast dynamics) becomes negligible, i.e., $ln\u2061A\u226a\tau T\u2061lnT\u0302$. Therefore, *τ*_{K} in IGME is usually shorter than the Markovian lag time *τ*_{T} in an MSM. We also notice that when *t* → *∞* in Eq. (18), $limt\u2192\u221elnTt=tlnT\u0302$, and then $T\u0302=limt\u2192\u221eTt1/t$. As $T\u0302$ describes the dynamics at the infinite long lag time, it provides an efficient way to estimate the timescales of the slowest dynamic modes.

*M*

_{n}using Eq. (19) is not a trivial task. In particular, obtaining higher-order memory kernels is only possible with the knowledge of the mathematical forms of the memory kernels. However, we show that it is possible to estimate the time integral of the memory kernels at the zeroth order, $M0=\u222b0\tau KKsds$. We notice that the last two terms in the right-hand side of Eq. (19) is originated from the Taylor expansion of the convolution term in the GME [Eq. (11)]. Specifically, the zeroth order term is

*M*

_{0}, while the higher-order terms are: $\u2212lnT\u0302M1+12lnT\u03022M2+\cdots $. As discussed in the previous paragraph, the constant matrix $T\u0302$ is related to the MSM at the infinite long lag time [$lnT\u0302=limt\u2192\u221elnTt/t$]. The implied timescales (

*ITS*) of this MSM [$Tt\u2192\u221e$] are defined as $ITSi=\u2212limt\u2192\u221et/ln\lambda it$, and the eigenvalues of $T\u0302$ thus satisfy $ln\lambda \u0302i=\u22121/ITSi$. For biomolecular systems, the implied timescales correspond to the slowest dynamics of the system, and the values of −1/

*ITS*

_{i}are typically close to zero. As a result, $ln\lambda \u0302i$ approaches zero and $lnT\u0302$ is close to the zero matrix. Therefore, the zero-order term (

*M*

_{0}) would represent a good approximation of this Taylor expansion. If we only keep the zero-order term, we could have:

For all three biomolecular systems tested in this work, we show that this approximation provides an efficient way to estimate the integrals of the memory kernels (*M*_{0}) in reasonable accuracy, which may benefit future studies that depend on this quantity (see Sec. II D and Fig. 6).

Using alanine dipeptide as the example, we demonstrate how Eq. (18) works for biomolecular dynamics (see Fig. 1). We generate a four-state model for the conformational dynamics of alanine dipeptide, and the four states are shown in green, blue, red, and magenta regions of the Ramachandran plot in Fig. 1(a). At *t* ≥ *τ*_{K} (in Fig. 1, *τ*_{K} = 0.5 ps), a linear relationship between $lnTt$ and *t* following Eq. (18) could be seen as the red lines [Fig. 1(b)]. The intercepts between the red lines and the vertical axis (i.e., ln *A*) denote the fast dynamics that fully relax at *t* ∼ *τ*_{K}.

### C. Construction of an IGME model

An IGME model follows Eq. (18) and can be constructed by finding the two constant matrices, *A* and $T\u0302$. Once these two constant matrices are obtained, the time dependent dynamics of the system can be predicted by computing $Tt$ from Eq. (18). In particular, the IGME model is constructed from TPMs at multiple lag times: $T\tau K,T(\tau K+\Delta t),T(\tau K+2\Delta t),\u2026T\tau K+L$. Here, two hyper parameters are involved for building an IGME: *τ*_{K} and *L*. We need to choose *τ*_{K} to be sufficient long so that the memory kernels decay to zero at *t* > *τ*_{K} and integrals of memory kernels become constant matrices. The *L* of IGME is another hyper-parameter that requires an additional approach to optimize. For well sampled data, any values of *L* will lead to identical IGMEs. However, when the sampling is insufficient, the TPMs counted from MD trajectories may contain numerical fluctuations, and this can significantly affect the IGME. To address this situation, we use a range of TPM data (spanning the lag time from *τ*_{K} to *τ*_{K} + *L*) to reduce fluctuations of the input TPMs. Below, we have outlined our procedure for obtaining *A* and $T\u0302$ for a given set of hyperparameters: *τ*_{K} and *L*. In Sec. II D, we will introduce our protocol to select the optimal values for *τ*_{K} and *L*.

Compute TPMs from MD simulation trajectories at lag times within $\tau K,\tau K+L$: $TMD\tau K$, $TMD\tau K+\Delta t$, …,$TMD\tau K+L$.

Obtain the initial values of

*A*and $T\u0302$ using the least square fitting (LSF) method. First, compute the logarithms of TPMs from the previous step (i.e., $lnTMD\tau K$,…, $lnTMD\tau K+L$). Then, perform the least square fitting for each element of the logarithm of TPMs: $lnTtij=\alpha ij+\beta ijt$ to obtain the initial estimates: $Ainit=exp\alpha $ and $T\u0302init=exp\beta $.Optimize

*A*and $T\u0302$ using the steepest descent method. In each optimization step, we evaluate the loss function [Eq. (22)] and employ the steep descent to optimize all elements in the matrices*A*and $T\u0302$. We then output the optimized*A*^{opt}and $T\u0302opt$, along with the rms error between the prediction of $IGMEAopt,T\u0302opt$ and MD data.

*A*and $T\u0302$. While the LSF method offers a straightforward and efficient approach for fitting

*A*and $T\u0302$, it experiences several numerical issues. LSF utilizes the linear relationship of Eq. (18), which consists of

*N*

^{2}linear equations (

*N*is the number of states) that can be fitted separately. To compute the logarithm of $lnTt$, we have implemented a spectrum-based approach, which involves combing the original eigenvectors with the logarithms of eigenvalues as below:

**is an arbitrary**

*X**N*×

*N*matrix,

**Λ**is a diagonal matrix of eigenvalues of

**,**

*X***is the identity matrix, and**

*I***is a matrix of right eigenvectors of**

*V***. $ln\Lambda ij=\delta ijln\Lambda ii$. This LSF implementation may suffer from several numerical issues. Firstly, the eigenvalues of $Tt$ are required to be positive in Eq. (18), as negative eigenvalues will induce complex numbers in $lnTt$. As a result, LSF is not applicable when $Tt$ contains negative eigenvalues. Secondly, the TPMs directly obtained by LSF, $TLSFt$, may not satisfy the normalization requirement of the TPM [i.e., $\u2211jTijt=1$]. Thirdly, the TPMs obtained by LSF is not greatened to satisfy the detailed balance (i.e.,**

*X**π*

_{i}

*T*

_{ij}=

*π*

_{j}

*T*

_{ji}).

*A*and $T\u0302$ obtained from the LSF method. In our steepest descent optimization, we also introduce two necessary constraints to avoid the negative eigenvalues. First, we employed the exponential form in Eq. (18) to avoid the negative eigenvalue issue. Then, we enforced $\u2211jTijt=1$ and detailed balance to the IGME via a steepest descent method with the following loss function:

*F*represent the Frobenius norm. The second term of Eq. (22) corresponds to the $\u2211jTijt=1$ requirement of TPM, and the third term of Eq. (22) corresponds to the requirement of detailed balance $\pi Tt=TtT\pi T$.

### D. Finding the optimal hyper-parameters for IGME models

*τ*

_{K}and

*L*, we adopted a systematic search approach by scanning all possible values of

*τ*

_{K}and

*L*to find the best IGME models. A good IGME should be able to well reproduce the TPMs, $Tt$. Therefore, we can use the Root Mean Square Error (RMSE) between the predicted TPMs [$TIGMEt=elnA+tlnT\u0302$] and the original TPMs from MD data [$Tt$] to evaluate the quality of the IGME:

^{30}

*L*

_{max}is the maximum length of lag time. In Eq. (23),

*T*

^{IGME}is built from TPMs between

*τ*

_{K}and

*τ*

_{K}+

*L*(0 <

*τ*

_{K}<

*τ*+

*K*

_{L}≤

*L*

_{max}), while the RMSE is computed from

*t*= 0 to

*t*=

*L*

_{max}.

### E. MD simulation setups

#### 1. Alanine dipeptide

AMBER99SB force field^{52} was used for alanine dipeptide, and the TIP3P model^{53} was used for water. After energy minimization, the box size was equilibrated with a 20 ns NPT simulation. Next, a 20 ns NVT trajectory was performed to generate 100 initial conformations. Finally, 100 MD trajectories started with the 100 initial conformations were performed, each containing 10 ns trajectories that saved with 0.01 ps interval between frames. The simulation was performed with GROMACS 4.5.3.^{54} Other details of the MD setup can be found in our previous paper.^{30}

The four-state model was generated from a splitting-and-lumping approach. In this approach, we first split all MD conformations into 1000 microstates by K-Centers clustering,^{55–57} then we lumped the microstates into four microstates with the PCCA+ (Perron Cluster-Cluster Analysis).^{58} The error bars of the residence probability of MD data were obtained from a 100-round bootstrapping. In each bootstrapping we randomly chose 100 trajectories to compute TPMs.^{30} When computing the TPMs, the detailed balance was enforced by directly symmetrizing the transition count matrix.

#### 2. WW domain

The simulation dataset of WW domain was provided by Shaw *et al.* research.^{59} The dataset contains two 100 *µ*s trajectories saved in 200 ps interval between frames. We first performed tICA (Time-lagged Independent Components)^{60–62} using a tICA lagtime of 18 ns based on the pairwise distances of the 35 alpha carbon atoms of the protein backbone to obtain three collective variables (i.e., three tICs) that can describe the slowest dynamics. Next, we adopted a split-and-lumping approach to obtain the four-state model. We first split all conformations into 800 microstates with the K-Centers algorithm,^{55–57} then lumped them into four metastable states with the PCCA+ algorithm.^{58} The error bars of the residence probability of MD data were obtained from a ten-round bootstrapping. In the bootstrapping, we first divided each trajectory into five segments, then in each bootstrapping we randomly chose ten segments of the trajectories to compute TPMs.^{30} When computing the TPMs, the detailed balance was enforced by directly symmetrizing the transition count matrix.

#### 3. Taq RNAP

The MD simulation dataset of this system was taken from our previous study.^{34} This dataset consists of 306 MD simulation trajectories, and each has 200 ns with the saving interval of 100 ps. The TICA analysis was performed to reduce the number of features from 1,770 pairwise distances (involving alpha Carbon atoms of the clamp, *β*-lobe, *β*-protrusion, Switch regions, and active site) to three TICs containing the slowest dynamics of the system. Then the four-state model^{34} was built with a splitting-and-lumping approach, where a 100-state microstate model was built from the K-Centers clustering in the TICA space, and the four-state model was obtained with PCCA+ algorithm. The error bars of the residence probability of MD data were obtained from a 200-round bootstrapping. In each bootstrapping we randomly chose 100 trajectories to compute TPMs.^{34} When computing the TPMs, the detailed balance was enforced by the maximum likelihood estimator^{63} method.

## III. RESULTS AND DISCUSSIONS

In this section, we demonstrate the performance of the IGME in three systems: dynamics of alanine dipeptide, folding of FIP35 WW domain, and the gate opening of a bacterial RNAP.

### A. The alanine dipeptide

We first applied the MSM, qMSM and IGME to a four-state model of the conformational dynamics of alanine dipeptide in explicit solvent. The four states are shown in green, blue, red, and magenta regions of the Ramachandran plot in Fig. 2(a). The MD simulation dataset of the alanine dipeptide is a well-sampled one; thus, the numerical fluctuation for the memory kernels is small, and both the IGME and qMSM yield good models that can perfectly match with MD simulations [Fig. 2(b)]. In particular, the Root Mean Square Error (RMSE) of the qMSM could reach very small values below 1.4 × 10^{−3} when *τ*_{K} ≥ 1.5 ps, while the IGME yields even better models [RMSE is below 3.4 × 10^{−4} when *τ*_{K} ≥ 1.5 ps, see Fig. 2(b)]. The Chapman–Kolmogorov test in Fig. 2(c) also shows that the qMSM and IGME built at *τ*_{K} = 1.5 ps perfectly match with the MD simulations. In contrast, the accuracy of the MSM is significantly lower in *τ* = 0 ∼ 8 ps, as the RMSEs of the MSM are much larger than those of the qMSM or IGME.

As shown in Fig. 2(b), the extensive scanning of the hyperparameters of *τ*_{K} and *L* can provide highly accurate results and help build IGME models with small RMSEs. In Fig. 2(b), the purple region marks the top 5% IGME models with the smallest RMSEs ($<8.1\xd710\u22125$). These top 5% IGME models can give the most accurate predictions of dynamics. For example, in Fig. 2(d), the top 5% IGME models can correctly predict the variational bound of the slowest implied timescales (compared to the qMSM), with even smaller numerical uncertainties.

### B. The FIP35 WW domain

To test the performance of the IGME in a more complex system, we examined the conformational dynamics of FIP35 WW domain. Figure 3(a) shows a four-state model of the folding dynamics of FIP35 WW domain. In our previous work, we have shown that a qMSM constructed at *τ*_{K} = 15 ns was able to accurately predict long-timescale dynamics, while the MSM could only be built at the Markovian lag time of *τ* = 200 ns for this four-state model of WW domain.^{30} This can also be seen in Fig. 3(c), where the qMSM would work at *τ*_{K} = 15 ns but the MSM has significant deviations at *τ* = 15 ns. Figure 3(b) shows that the MSM is generally less accurate than the qMSM. In Fig. 3(b), we can see the RMSEs of the MSM are much larger than those of the qMSM or IGME.

The simulation data of WW domain contains larger fluctuations than the alanine dipeptide dataset. In Fig. 2(c), the uncertainty of the residence probability of alanine dipeptide is around 4%, while in Fig. 3(c), the uncertainty could be 40%. Consequently, the qMSM of FIP35 WW domain is less stable than alanine dipeptide. In the short-time dynamics (shorter than length of trajectories) shown in Figs. 2(b) and 3(b), the RMSE of FIP35 WW domain already contains larger fluctuation than alanine dipeptide at short time scales (shorter than length of trajectories). While in the long-time dynamics as shown in Figs. 2(d) and 4(a), the uncertainty of qMSM predicted dynamics for WW domain contains large uncertainties (∼17% at *τ*_{K} ∼ 70 ns), which is significantly larger than alanine dipeptide ($<1.7%$).

The IGME can provide models with much greater numerical stability than the qMSM for the dynamics of WW domain. In Fig. 3(b), the scanning of *τ*_{K} and *L* shows smooth distributions of RMSE in the $\tau K,L$ space, significantly less fluctuating than the qMSM. In certain regions of $\tau K,L$, the IGME can reach even much smaller errors than qMSM. The numerical stability of IGME also greatly facilitates the prediction of long-time dynamics. In Figs. 4(a) and 4(c), the top 5% IGME models [selected from the purple region of Fig. 3(b)] can generate much more accurate results for the variational bound of the slowest implied timescales ($<2%$) as well as the MFPT of unfolding process ($<2%$), when compared with the qMSM ($\u223c17%$ at *τ*_{K} ∼ 70 ns). Therefore, the IGME is better than the qMSM when computing the long-time dynamics of a simulation system which contains less sufficient sampling.

### C. The gate opening dynamics of *Taq* RNAP

Finally, we demonstrate the power of the IGME in a large biomolecular system, the gate opening of Taq RNAP, which is shown in Fig. 5(a). Here we adopt the same four-state models as our previous paper.^{34} For this system, the qMSM at certain values of *τ*_{K} fail to converge, while the converged qMSMs contain large fluctuations. Therefore, we have introduced a smoothing technique,^{34,35} the qMSM_{sm}, to reduce the fluctuation of the TPMs used to build qMSMs. This smoothing technique could greatly improve the convergence of the qMSM from 51% (the original qMSM at $\tau K\u22080,90$ ns) to 92% (qMSM_{sm}). For the converged data as shown in Fig. 5(b), the smoothing technique could further reduce the error of predicted TPMs from 1.6‰ (averaged RMSE of the qMSM within $\tau K\u220830,90$ ns) to 0.7‰ (qMSM_{sm}).

However, for complex systems like *Taq* RNAP, qMSMs with the smoothing scheme still contain large fluctuations that hinder accurate predictions of long-time dynamics. The uncertainty of the variational bound of slowest implied timescales of the qMSM_{sm} varies between 23% at *τ*_{K} ∼ 40 ns and 108% at *τ*_{K} ∼ 10 ns, as shown in the blue bars of Fig. 5(d). In addition, another issue of the smoothing scheme can be seen in Fig. 6(c), where the smoothed TPMs will also reduce the short-time memory kernels and lead to divergence of memory kernels at long lag times.

In sharp contrast, the IGME can generate significantly more accurate results with smaller numerical fluctuations. As seen from the accuracy of the IGME in Fig. 5(b), the best qMSM_{sm} has RMSE = 0.4‰, which is larger than the RMSE of the green, cyan, blue, and magenta regions of IGME that containing 30% of all IGME models in the exhaustive scan. The purple region consisting of the top 5% IGME models could reach an even smaller RMSE (0.23‰–0.28‰) than the qMSM. The improvement of numerical stability can be seen Fig. 5(d): when *τ*_{K} ≥ 20 ns, the uncertainty of ITS bound calculated by IGME varies from 1% at *τ*_{K} ∼ 30 ns to 7% at *τ*_{K} ∼ 50 ns.

### D. Discussions

As discussed in Sec II B, we can also evaluate the integral of the memory kernels using the zero-order term of the Eq. (19) ($M0\u2248T\u03070\u2212lnT\u0302$). *M*_{0} represents the integral matrix of memory kernels, and the mean integral of memory kernels $MIK\tau K$ can be simply computed from *M*_{0} via $MIK\tau K=M0F/N$. $M0F$ is the Frobenius norm of *M*_{0}, and *N* is the number of states. We show that the MIKs computed from the qMSM and IGME are in reasonable agreement for Alanine Dipeptide [Fig. 6(a)], WW-domain [Fig. 6(b)], and Taq RNAP [Fig. 6(c)]. When the MIK reaches a plateau, meaning the memory kernels have completely relaxed and the MIK will not change when extending the lag time, the MIK computed from the IGME could perfectly match with the qMSM. The MIK computed from the IGME is also much smoother than from the qMSM. Even at long lag times, fluctuations always exist even for the alanine dipeptide system. While in the IGME, the MIK is numerically more stable, as the numerical fluctuations are much smaller than in the qMSM.

In the IGME, we have employed the Taylor series to expand the convolution term of the GME. Previously, Laplacian transform was used in many papers^{33,37,43} to compute memory kernels from the time dependent transition matrixes [e.g., $Tt$]. For a dynamical system, the Laplacian transform at frequency *s* → 0 provides the Markovian approximation to the GME, while the Laplacian transform at *s* → ∞ describes the short-time dynamics. In systems with timescale separations, there should exist a characteristic frequency *s*_{K} that corresponds to *τ*_{K} in our IGME theory, where the memory kernel decays to zero at *t* ≥ *τ*_{K} or $s\u2265sK$. By employing the Laplacian transform, one could potentially develop an analogous method to IGME in the future by considering all contributions of frequencies within $s\u2264sK$.

Constructing IGMEs requires longer MD trajectories compared to GME (or qMSMs). This is because the validity of IGME’s solution relies on the time integrals of memory kernels becoming constant matrices (*t* ≥ *τ*_{K}) and obtaining a numerically stable estimation of memory kernel integrals generally necessitates slightly longer simulations than $\tau KqMSM$ (required for qMSM). Let’s consider the WW-domain as an example. Our previous work^{30} has shown that qMSMs outperform MSMs by predicting correct long-timescale dynamics of the WW-domain folding, while being built from significantly shorter MD simulations: $\tau KqMSM=25ns$ v.s. $\tau TMSM=200ns$. For this WW-domain system, our IGME requires 33 ns (*τ*_{K} = 31 ns and *L* = 2 ns) vs $\tau KqMSM=25ns$ to build models with comparable RMSEs of ∼1.5 × 10^{−3} [see Fig. 3(b)]. For the alanine dipeptide, the IGME requires 1.6 ps (*τ*_{K} = 1.3 ps and *L* = 0.3 ps) vs $\tau KqMSM=1.5ps$ vs $\tau TMSM=12.6ps$ to build models with comparable errors (RMSE ∼ 1.4 × 10^{−3}). In all three systems, the sum of *τ*_{K} + *L* in our IGME is still significantly shorter than the Markovian lag times: $\tau TMSM$.

Recently, the time convolutionless GME theory $U$-GME^{64} has been developed and it is shown that the rate matrix [$Rt$] becomes time invariant when the lag time is longer than a characteristic time. This is also consistent with our IGME theory. In particular, the solutions of IGME (currently as a function of TPMs) could be re-written as a function of the rate matrix and we show that this rate matrix becomes constant when the memory kernel fully decays (*t* ≥ *τ*_{K}, see Appendix B for details). IGME and UGME also exhibit comparable performance in predicting the slowest implied timescales for the WW-domain and RNAP systems.^{31} Different from GME-based methods, HMM^{47} allows fuzzy state boundaries rather than considering memory kernels. We have demonstrated that GME-based methods (e.g., IGME) are numerically more robust than the existing implementation of HMM, because IGME is fitted to several TPMs at lag times longer than *τ*_{K} while HMM maximizes the likelihood function defined by entire input trajectories.^{31}

## IV. CONCLUSIONS

In this paper, we present a new method, the Integrative GME (IGME). This method adopts a Taylor expansion of memory term in the GME and utilizes the time integrals of memory kernels to reduce the complexity of the GME. Under the condition (*t* ≥ *τ*_{K}) where the memory kernels have decayed to zero ($Kt\u2265\tau K\u22480$), the time integrals of memory kernels of different moments become constant matrices, allowing for the analytical solution of the GME [Eq. (18)] through a self-consistent approach. This analytical solution can be fitted with the MD simulation data, providing accurate and numerically stable predictions of the long-time dynamics. The two hyperparameters, *τ*_{K} and *L*, are obtained with an exhaustive scan approach. Choosing the best IGME models allows for more accurate and numerically more stable predictions compared to qMSMs. Beside the analytical solution of $Tt$ [Eq. (18)], the self-consistent equation of $T\u0302$, Eq. (19), can provide an alternative way of computing integrals of memory kernels. The memory kernel computed from Eq. (19) is much more stable than in the qMSM. To demonstrate the power of IGME, we have applied it to study the dynamics of the alanine dipeptide, the folding of FIP35 WW domain, and the gate opening of *Taq* RNAP. For all three systems, we show that the IGME achieves significantly smaller fluctuations for both memory kernels and long-time dynamics compared to the qMSM. Biomolecular dynamics often exhibit timescale separations, where the memory kernels decay to zero at significantly shorter timescales than the slowest dynamics of the system. Therefore, we expect that the IGME holds great promise to be widely applied to study complex conformational changes of biomolecules.

## ACKNOWLEDGMENTS

We would like to thank Dr. Jianfeng Lu for helpful discussions. We would also like to express our gratitude to an anonymous reviewer for bringing to our attention an alternative approach to solving the GME. We acknowledge the support from the NIH/NIGMS under Award No. 1R01GM147652-01A1, the support from the Office of the Vice-Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation, and the support from the Hirschfelder Professorship Fund.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Siqin Cao**: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Methodology (lead); Software (equal); Validation (lead); Visualization (lead); Writing – original draft (equal); Writing – review & editing (equal). **Yunrui Qiu**: Methodology (supporting); Software (equal); Writing – original draft (supporting); Writing – review & editing (supporting). **Michael L. Kalin**: Writing – original draft (equal); Writing – review & editing (equal). **Xuhui Huang**: Conceptualization (supporting); Funding acquisition (lead); Investigation (supporting); Methodology (supporting); Resources (lead); Supervision (lead); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The source code of IGME is available at https://github.com/xuhuihuang/IGME. Other simulation datasets are available from the corresponding authors upon request.

### APPENDIX A: AN ALTERNATIVE APPROACH TO SOLVE THE GME

*t*≥

*τ*

_{K}and $Kt\u2265\tau K=0$] via the characteristic equation. We first consider the condition of

*t*→ ∞, and adopt a nontrivial solution of the homogeneous Delayed Differential Equation: $Tt\u223cAeBt$, where

*A*and

*B*are constant matrices. The characteristic equation of Eq. (4) is:

*e*

^{−Bs}. In addition, when

*t*≥

*τ*

_{K}, it is given that the integrals of memory kernels become constants: $Mnt>\tau K=Mn\tau K\u2261Mn$.

*t*→ ∞, there should be a distinctive model [i.e.,

*T*(

*t*)consisted of a set of unique dynamic modes] to describe their conformational dynamics. In fact, when the lag time

*t*>

*τ*

_{T}≥

*τ*

_{K}(

*τ*

_{T}is the Markovian lag time),

^{30}such a unique MSM already emerges to describe the system’s conformational dynamics. Therefore, the solution of the Eq. (A2) contains a single term:

*t*→ ∞: $Tt\u2192\u221e=AT\u0302t$. We next solve the GME at

*t*≥

*τ*

_{K}. Similar to the solution at

*t*→ ∞, we can also substitute the non-trivial solution $Tt\u223cAeBt$ into Eq. (4) and obtain the corresponding characteristic equation at

*t*≥

*τ*

_{K},

*t*≥

*τ*

_{K}):

*t*≥

*τ*

_{K}[Eq. (A6)] should be consistent with the solution at

*t*→ ∞ [Eq. (A4)]. Therefore, Eq. (A6) contains only one exponential term $ALeBLt$. If we also define $BL=lnT\u0302$, we can obtain the same solution as Eqs. (18) and (19) in the main text:

### APPENDIX B: THE RATE MATRIX OF THE $U$-GME

*t*≥

*τ*

_{K}), all the memory kernel terms become invariant of the lag time: $Mnt\u2261Mn\tau K=Mn$. Therefore, the second and third term in the right-hand side of the above equation vanish when

*t*≥

*τ*

_{K}. In addition, according to Eqs. (B3) and (18) at

*t*≥

*τ*

_{K}, $T\u22121tTnt=lnT\u0302n$, the fourth term also vanishes. As a result, the rate matrix $Rt$ becomes a constant matrix at

*t*≥

*τ*

_{K}and is related to$T\u0302$:

## REFERENCES

*An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation*

*c*oxidase and functional implications

*k*-center problem