Recently, dynamic analysis methods in signal processing have been applied to the analysis of molecular dynamics (MD) trajectories of biopolymers. In the context of a relaxation mode analysis (RMA) method, based on statistical physics, it is explained why the signal-processing methods work well for the simulation trajectories of biopolymers. A distinctive difference between the RMA method and the signal-processing methods is the introduction of an additional parameter, called an evolution time parameter. This parameter enables us to better estimate the relaxation modes and rates, although it increases computational difficulty. In this paper, we propose a simple and effective extension of the RMA method, which is referred to as the positive definite RMA method, to introduce the evolution time parameter robustly. In this method, an eigenvalue problem for the time correlation matrix of physical quantities relevant to slow relaxation in a system is first solved to find the subspace in which the matrix is numerically positive definite. Then, we implement the RMA method in the subspace. We apply the method to the analysis of a 3-μs MD trajectory of a heterotrimer of an erythropoietin protein and two of its receptor proteins, and we demonstrate the effectiveness of the method.
I. INTRODUCTION
Conformational fluctuations in biopolymers have been extensively investigated via molecular simulations and mode identifications.1–15 Advances in computer hardware and software have enabled us to simulate the motion of a large number of atoms on the time scale of femtoseconds to milliseconds. Consequently, it has become possible to analyze long trajectories of biopolymers. Among the analysis methods, those using second-order statistics and linear transformations are mathematically easier and computationally less expensive. One such method is the principal component analysis (PCA) method, which has been widely used for the study of conformational fluctuations in biopolymers.2–4 PCA uses an orthogonal transformation that maps mean-centered variables, e.g., the relative positions to average positions of atoms in biopolymers, to principal components {Φm(Q)}, where Q denotes the state of a system. The principal components are uncorrelated,
where ⟨·⟩ denotes the average over the equilibrium ensemble of Q and Λm denotes the mth largest variance of Φm. However, PCA is a static analysis in the sense that the correlation in Eq. (1) does not depend on time, that is, this analysis simply provides time-averaged pictures. For the understanding of conformational fluctuations in biopolymers, dynamic analysis is important.
When time series are analyzed, we can consider a dynamically extended transformation, which maps mean-centered variables to variables uncorrelated at any time lag t (>0):
In the methods for identifying modes that satisfy Eq. (2), certain mathematical assumptions are necessary. A method was first proposed for blind signal separation (BSS) in signal processing.16 The goal of BSS is to recover source signals from observed signals that are linear combinations of the source signals. This method, which is referred to as AMUSE, is based on the assumption that observed signals are linear mixtures of stationary source signals that are uncorrelated at any time lag; Eq. (2) itself is satisfied for the source signals. A method proposed later by L. Molgedey and H. G. Schuster17 is based on a stronger assumption than AMUSE. The assumption is that the observed signals are linear mixtures of the equilibrium source signals, which are independent at any time lag. Because the source signals are assumed to be independent, this method is categorized as an independent component analysis (ICA) method. ICA is a special case of BSS. These methods have been applied to the analysis of molecular dynamics (MD) trajectories of biopolymers,7,11,12 and, according to the reports, they enable us to identify “slow” modes in biopolymers. However, there are some issues regarding the applications of these methods to the analysis of the simulation trajectories of biopolymers.
In general, time series of positions and velocities of atoms are not linear mixtures of source signals that are uncorrelated or independent at any time lag. In fact, the separated signals are not independent. Nevertheless, how are these methods able to make the separated signals uncorrelated?
In the signal-processing methods, the number of source signals is assumed to be equal to or less than that of the observed signals. Because of this assumption, the separated signals are uniquely determined with no dependence on a delay time parameter τ, which is used in these analyses.16,17 However, in reality, the separation depends on the choice of τ. Is the assumption adequate in the analysis of simulation trajectories of biopolymers?
These methods do not have any quantity characterizing slowness. Thus, the identified modes are not necessarily “slow” modes. Why are “slow” modes apparently identified?
These questions can be answered by using a relaxation mode analysis (RMA) method,18 based on statistical physics. RMA is a dynamic analysis for the identification of slow relaxation modes in a system. If the time evolution of the state of a system is described by a Markov process with detailed balance, relaxation modes {ϕm(Q)} satisfy
where denotes the time reversal of ϕm(Q) and λm denotes the mth slowest relaxation rate of ϕm(Q). In the RMA method, approximate relaxation modes and their relaxation rates are estimated from time series through variational optimization as explained in Sec. II A. This method was applied to the analysis of Monte Carlo (MC) data in random spin systems18 first and MC and MD trajectories in homopolymers19,20 and heteropolymers8,9 later (see Ref. 21 for a review of RMA especially in biomolecules).
The differences between the BSS, ICA, and RMA methods are summarized in Table I. As listed in Table I, the algorithms of BSS, ICA, and RMA are formulated by sequential diagonalization, simultaneous diagonalization, and variational optimization, respectively. Although the formulations of BSS, ICA, and RMA are different, the BSS and ICA methods are equivalent to each other11,12 and the RMA method, except the differences between Eqs. (2) and (3) and an evolution time parameter explained later.7,8
Analysis . | Assumption . | Formulation . | Parameter . |
---|---|---|---|
BSS16 | Stationary source signals | Sequential | Delay time |
uncorrelated at any time lag | diagonalization | ||
ICA17 | Equilibrium source signals | Simultaneous | Delay time |
independent at any time lag | diagonalization | ||
RMA18 | Markov process | Variational | Delay time |
with detailed balance | optimization | and evolution time |
Analysis . | Assumption . | Formulation . | Parameter . |
---|---|---|---|
BSS16 | Stationary source signals | Sequential | Delay time |
uncorrelated at any time lag | diagonalization | ||
ICA17 | Equilibrium source signals | Simultaneous | Delay time |
independent at any time lag | diagonalization | ||
RMA18 | Markov process | Variational | Delay time |
with detailed balance | optimization | and evolution time |
We now discuss the above questions in the context of the RMA method.
The RMA method needs the assumption that the time evolution of the state of a system is described by a Markov process with detailed balance. In the context of signal processing, this assumption is not about source signals but about observed signals. The space and time scales of current interest are much larger than those of atomic motions in biopolymers. Therefore, the assumption of a Markov process with detailed balance is justified in the simulations of biopolymers.
In the RMA method, we variationally estimate approximate relaxation modes through the observation of a limited number of physical quantities that include numerous relaxation modes. In terms of signal processing, the number of source signals is much greater than that of observed signals, and thus, the separated signals do not completely reproduce the source signals. Therefore, the mode identification naturally depends on a delay time parameter τ.
In the RMA method, relaxation rates or their reciprocals, that is, relaxation times, are rigorously defined in Eq. (3), and the slowness of relaxations is determined. The RMA method enables us to identify not only modes uncorrelated at any time lag but also slow relaxation modes in a system.
Because the variational problem in the RMA method is equivalent to a generalized eigenvalue problem, in the analysis, we solve the generalized eigenvalue problem of the two time correlation matrices of physical quantities relevant to slow relaxation in a system, C(t0 + τ) and C(t0). Here, in addition to a delay time parameter τ, another parameter t0, denoting evolution time, is introduced. The evolution time parameter t0 plays a role in the reduction of the contribution of faster relaxation modes in the physical quantities and enables us to better estimate the relaxation modes and rates (see Refs. 8, 14, and 21 for more detail on the role of t0). However, the introduction of this parameter increases the difficulty in the usage of the RMA method in systems with a relatively large number of degrees of freedom and limited simulations. Relatively low statistical precision of the time correlation matrices because of the lack of the number of samples gives rise to this difficulty, numerically breaking the positive definiteness of real symmetric matrices, preventing us from solving the generalized eigenvalue problem. In many previous studies using the RMA method,8,13,18–20,22–27 the considered systems are relatively small, and thus, the sufficient number of samples for calculating the time correlation matrices with high statistical precision is obtained from long-time simulations. On the other hand, biopolymers are systems with a relatively large number of degrees of freedom, and it is difficult to perform the long-time simulations in general.
In order to solve this problem, we propose a new dynamic analysis method, which is referred to as the positive definite RMA (PDRMA) method. In this method, we first find a subspace in which the time correlation matrix C(t0) is numerically positive definite. Then, we implement the RMA method in the subspace. Because the time correlation matrix in the subspace is positive definite, the generalized eigenvalue problem becomes solvable in any system. In this paper, we first present a summary of the conventional RMA method and a formulation of the PDRMA method. Then, in order to demonstrate the effectiveness of the PDRMA method, we apply this method to the analysis of a 3-μs MD trajectory of a heterotrimer of an erythropoietin (EPO) protein and two of its receptor proteins. The protein trimer is a relatively large system and is suitable for this demonstration. Finally, we briefly discuss the identified modes.
II. THEORY
A. RMA method
Let us consider the case where the time evolution of the state of a system is described by a Markov process. In this case, the time evolution of the system is described by a master equation
where P is a time-dependent probability density/mass function of a state Q. The left and right eigenvalue problems for the matrix whose (Q, Q′) entry is Γ(Q|Q′) are defined by
and
respectively.28 The left and right eigenvectors are chosen to satisfy the orthonormal condition
Here, , where Peq is the probability density/mass function of the equilibrium distribution. Then, the left and right eigenvectors satisfy
If the detailed balance condition holds, equals the time reversal of ϕm(Q), and the real part of λm becomes positive.29 Thus, as stated in Sec. I, ϕm(Q) denotes the relaxation mode, and λm denotes the relaxation rate. In order to obtain the relaxation modes and rates, we should solve the eigenvalue problems defined by Eqs. (5) and (6) under the condition of Eq. (7). However, in general, we do not know the explicit form of the matrix in the master equation. Even if the form is known, the eigenvalue problems are difficult to solve analytically.
In the RMA method, approximate relaxation modes and rates are estimated without knowing the explicit form of the matrix in the master equation. When the detailed balance condition holds, Eqs. (5) and (6) are formulated as a variational problem for
where the stationary value of is . Thus, we can find approximate eigenvalues and eigenvectors as with the variational method in quantum mechanics. For simplicity, we consider the case where . This holds, for example, when Q denotes a set of positions of atoms (for more general cases, see Refs. 9, 20, and 21). An approximate relaxation mode Xp(Q) is defined as a linear combination of physical quantities {Ri(Q)}, which are relevant to slow relaxation in a system,
where Ri(t; Q′) = ∑QRi(Q)e−Γt(Q|Q′) and evolution time t0 is introduced to reduce the contribution of faster relaxation modes in Ri(Q).8,14,21 Substituting the approximate relaxation mode (10) into the variational problem for Eq. (9), we obtain
where Ci,j(t) denotes the time correlation of Ri(Q) and Rj(Q) at time lag t, which is calculated directly from the time series. The orthonormal condition of Eq. (7) becomes
In the analysis, we solve a generalized eigenvalue problem defined by Eq. (11) under the condition of Eq. (12).
The RMA method is related to Markov state models.10,11,13,21,30 Eigenvalue problems in Markov state models can be formulated as a variational problem in the RMA method, where t0 = 0 and Boolean-valued variables {δi(Q)} are used instead of physical quantities {Ri(Q)}. Here, if a state Q of a system belongs to the ith coarse-grained state, then δi(Q) = 1 or else 0. The introduction of t0 into conventional Markov state models provides better eigenvalues and eigenvectors.13 The signal-processing and RMA methods can be used for the definition of the coarse-grained states.10,11,13
B. PDRMA method
In the PDRMA method, we first solve an eigenvalue problem defined by
under the orthonormal condition
to find the subspace in which C(t0) becomes numerically positive definite. Nc positive eigenvalues and corresponding eigenvectors are used for the following procedure. Note that this first procedure is regarded as time-lagged PCA. Thus, large fluctuations at t0 are described by the principal components. Because the fluctuations described by faster relaxation modes relax at t0, the principal components include slower relaxation modes compared with those identified by conventional PCA.
A new time correlation is defined as
where plays a role in the simplification of the following procedure. Then, we solve a generalized eigenvalue problem defined by
under the orthonormal condition
and
respectively. In the PDRMA method, we solve the eigenvalue problem defined by Eqs. (13) and (14), and then, we solve the eigenvalue problem defined by Eqs. (18) and (19). Thus, we solve two eigenvalue problems for time correlation matrices at time lag t0 and τ in order. Substituting Eq. (15) into Eqs. (16) and (17), we find that this process corresponds to solving a generalized eigenvalue problem defined by
under the orthonormal condition
Here, . Therefore, the PDRMA process corresponds to solving the generalized eigenvalue problem defined by Eq. (11) under the orthonormal condition of Eq. (12) in the limited space. When Nc equals the original degrees of freedom, these processes are equivalent.
If the estimated relaxation modes and their relaxation rates approximately reproduce the property of the original relaxation modes and rates of Eq. (8), they satisfy
Then, a time correlation is reproduced as
where . Comparing the correlations with those directly calculated from the time series, we confirm whether the chosen parameters obtain a proper estimation of the relaxation modes and rates. Although it is difficult to know the appropriate value of t0a priori, we can estimate the value from the long-time behavior of the original {Ci,i(t)}. In the case of one dimension, the relaxation rate is obtained from the slope of the straight line through two points (t0 + τ, lnC(t0 + τ)) and (t0, ln C(t0)).14 Thus, the value of t0 needs to be large enough to measure the relatively gentle gradient of the original {Ci,i(t)}. However, in the conventional RMA method, the chosen value of t0 is limited to the value at which the fastest decaying correlation is statistically precise enough to solve the generalized eigenvalue problem defined by Eq. (11). On the other hand, in PDRMA, we ignore the subspace in which C(t0) becomes numerically nonpositive definite because of the lack of statistical precision. Therefore, we can use the large value of t0 in the PDRMA method.
From Eq. (23), a physical quantity Ri(Q) is approximately expanded as
Here, the variance of equals one. To correspond with PCA, needs to be multiplied by the norm of . In this case, Eq. (24) is rewritten as
where and .
The PDRMA method is a straightforward extension of the principal component RMA (PCRMA) method, which is proposed to handle a relatively large number of degrees of freedom in proteins.9 In the process, we first perform PCA, whose procedure corresponds to solving the eigenvalue problem defined by Eqs. (13) and (14) for t0 = 0, and apply the RMA method in which principal components with large variances are used as physical quantities relevant to slow relaxation in a system. Because an evolution time t0 is used in the RMA procedure after PCA, there is a possibility that the generalized eigenvalue problem is numerically unsolvable in PCRMA. In fact, for coping with the t0 problem, the method for RMA using two evolution times, which is an improved version of the conventional RMA method, is used in addition to the PCRMA method.9 In PDRMA, an evolution time t0 in PCRMA moves to the first eigenvalue problem. Therefore, the PDRMA method enables us to naturally avoid the t0 problem.
The process of the PDRMA method is similar to that of AMUSE. In fact, when t0 = 0 and Nc equals the original degrees of freedom in a system, the processes are equivalent, although the theoretical backgrounds are completely different (t0 is a distinctive example because of the differences). Therefore, the PDRMA program can be implemented with a slight modification of the AMUSE program, viz., the introduction of t0 and Nc.
III. COMPUTATIONAL DETAILS
We selected a heterotrimer of an EPO protein and two of its receptor proteins as a target for analysis. The initial structure was prepared from a structure of an EPO complexed with extracellular domains of its receptor [Protein Data Bank (PDB) ID: 1CN431]. The prepared protein trimer is composed of 596 amino acid residues or 9258 atoms. It was solvated with a 10-Å buffer of TIP3P water (49 315 molecules) in each direction and neutralized with four Na+ ions for the particle mesh Ewald (PME) method, whose cutoff was set to 10 Å. The total number of atoms in a rectangular box was 157 207. After energy optimization, heating, and equilibration, we performed a 3-μs MD simulation with a 2-fs time step at 1 bar and 298.15 K, where the AMBER ff14SB force field32 was used. All the MD simulations were performed with AMBER.33 In the production run, AMBER on the GPUs was used for acceleration.34–36
For analysis, we used the positions of all atoms except hydrogen atoms of the protein trimer for every 10 ps. Hence, the number of degrees of freedom is 13 935 (= 4645 × 3), and the number of samples is 3 × 105. After removing the translational and rotational motions, we implemented the PDRMA methods for t0 = 0, 100 ps, and 1 ns. Each Nc was chosen so that the proportion of relative contribution of the variance in the subspace to the total variance was approximately 80%. Thus, Nc = 50, 26, and 17 for t0 = 0, 100 ps, and 1 ns, respectively. A delay time τ was determined to be the largest one at which a modified time correlation matrix D(τ) became positive definite; thus, τ = 8 ns, 20.7 ns, and 37.4 ns for t0 = 0, 100 ps, and 1 ns, respectively.
IV. RESULTS AND DISCUSSION
We implement the PDRMA methods for t0 = 0, 100 ps, and 1 ns, successfully solving two eigenvalue problems defined by Eqs. (13) and (14) and Eqs. (18) and (17). First, we investigate how t0 affects the estimated relaxation modes and rates. As stated in Sec. II, original time correlations are reproduced when approximate relaxation modes and their relaxation rates are properly estimated. The representative time-displaced autocorrelations calculated directly from the time series and reproduced through PDRMA are shown in Fig. 1. As the value of t0 becomes larger, the reproduced autocorrelations are in better agreement with those calculated directly from the time series at a long time interval. This is because t0 reduces the contribution of fast relaxation modes, which describe the short-time behavior of the time correlations, in Ri(Q). Because the 100th correlation calculated directly from the time series is skewed owing to the lack of statistical precision, it is difficult to judge the agreement precisely. Nevertheless, the overall trend at a long time interval seems to be reproduced both for t0 = 100 ps and 1 ns. Therefore, we find that relaxation modes and rates are estimated better by using the PDRMA method.
The reproduced time correlations are given by Eq. (23), which depend on the eigenvectors and eigenvalues in the generalized eigenvalue problem defined by Eq. (20). Because the estimated relaxation modes and rates are derived from these eigenvectors and eigenvalues, respectively, the improvement of the reproduced time correlations means that either or both of the estimated relaxation modes and rates are improved. In order to investigate which of the relaxation modes and rates dominate the reproduced correlations, we show time-displaced autocorrelations of the first–ninth slowest relaxation modes for t0 = 0, 100 ps, and 1 ns in Fig. 2 and list their relaxation times in Table II. From Fig. 2, the slower relaxation modes are nearly identified with little dependence on the delay and evolution times. From Table II, on the other hand, the slower relaxation times estimated for t0 = 1 ns are approximately twice as large as those evaluated for t0 = 0. Therefore, we find that the relaxation times predominantly affect the reproduced correlations given by Eq. (23). We observed earlier that, relaxation modes estimated for small τ and t0 include slower relaxation modes, although the relaxation times are underestimated.8 In the present study, this tendency is also observed.
. | Relaxation time . | Relaxation time . | Relaxation time . |
---|---|---|---|
Mode . | [ns] . | [ns] . | [ns] . |
number p . | (t0 = 0) . | (t0 = 100 ps) . | (t0 = 1 ns) . |
1 | 1.0 × 103 | 1.7 × 103 | 2.4 × 103 |
2 | 3.9 × 102 | 6.6 × 102 | 7.8 × 102 |
3 | 2.2 × 102 | 3.4 × 102 | 4.4 × 102 |
4 | 1.9 × 102 | 3.3 × 102 | 3.4 × 102 |
5 | 1.4 × 102 | 2.0 × 102 | 2.1 × 102 |
6 | 9.2 × 101 | 1.3 × 102 | 1.6 × 102 |
7 | 8.5 × 101 | 1.2 × 102 | 1.3 × 102 |
8 | 7.6 × 101 | 1.1 × 102 | 1.1 × 102 |
9 | 6.8 × 101 | 8.9 × 101 | 8.7 × 101 |
. | Relaxation time . | Relaxation time . | Relaxation time . |
---|---|---|---|
Mode . | [ns] . | [ns] . | [ns] . |
number p . | (t0 = 0) . | (t0 = 100 ps) . | (t0 = 1 ns) . |
1 | 1.0 × 103 | 1.7 × 103 | 2.4 × 103 |
2 | 3.9 × 102 | 6.6 × 102 | 7.8 × 102 |
3 | 2.2 × 102 | 3.4 × 102 | 4.4 × 102 |
4 | 1.9 × 102 | 3.3 × 102 | 3.4 × 102 |
5 | 1.4 × 102 | 2.0 × 102 | 2.1 × 102 |
6 | 9.2 × 101 | 1.3 × 102 | 1.6 × 102 |
7 | 8.5 × 101 | 1.2 × 102 | 1.3 × 102 |
8 | 7.6 × 101 | 1.1 × 102 | 1.1 × 102 |
9 | 6.8 × 101 | 8.9 × 101 | 8.7 × 101 |
Next, we investigate the relaxation modes obtained through PDRMA for t0 = 1 ns. In Fig. 3, we plot the collectivity of the modes.5 The collectivity is given by the normalized information entropy of the distribution of contributions,
where N denotes the number of atoms and , which denotes the contribution of an atom to all atoms. The collectivity takes the value 0 when only one contribution is nonzero and the value 1 when all the contributions are equivalent. From Fig. 3, we find that the relaxation modes are classified into lower collective modes (p = 1–4, 8) and higher collective modes (otherwise) with the line = 0.83.
Here, we focus on the first and fifth slowest relaxation modes with lower collectivity and higher collectivity, respectively. From Fig. 4(a), the first slowest relaxation mode (RM1) describes a rare phenomenon in the sense that the phenomenon occurs once in a 3-μs simulation. From Fig. 4(b), it can be found that the Cα atom in Asp553 has the largest contribution and Cα contributions near Asp553 correspond to the arrows depicted on EPO in Fig. 4(c). Figure 4(c) shows that the motion near Asp553 is a flap motion of a loop in EPO. Furthermore, the flap motion is related to the distant edges of domains. Therefore, RM1 describes not a local motion but a rare collective motion with relatively low collectivity. On the other hand, the fifth slowest relaxation mode (RM5) describes a domain motion as seen in Figs. 5(a)–5(c). The domains shown in Fig. 5(c) are intracellular domains. The PDRMA method enables us to identify slow relaxation modes in biopolymers, which describes not only domain motions but also rare phenomena.
V. CONCLUSIONS
In this paper, we proposed a new dynamic analysis method, the PDRMA method, which is a simple and effective extension of the conventional RMA method, for the robust introduction of an evolution time parameter. In the PDRMA method, an eigenvalue problem for a time correlation matrix of physical quantities relevant to slow relaxation in a system is first solved to find a subspace in which the matrix is numerically positive definite. Then, the RMA method is implemented in the subspace. Herein, we applied the method to the analysis of a 3-μs MD trajectory of a heterotrimer of an EPO protein and two of its receptor proteins. Although the degrees of freedom are relatively large, we successfully solved the eigenvalue problems in PDRMA for nonzero t0 values. The introduction of t0 enabled us to better estimate the relaxation modes and rates. Especially, the estimation of relaxation rates improved reasonably. The estimated relaxation modes were classified into lower collective modes and higher collective modes. The first relaxation mode, which has lower collectivity, described a rare phenomenon, which is a flap motion of a loop in EPO. The fifth relaxation mode, with higher collectivity, described intracellular domain motion. The PDRMA method can be implemented in systems with a relatively large number of degrees of freedom and limited simulations for the investigation of conformational fluctuations of biopolymers including rare phenomena and domain motions. With the proposed method, we can calculate free energy landscapes8 and define coarse-grained Markov states13 as with the conventional RMA method. Furthermore, the PDRMA method can be used for dimensionality reduction before computationally expensive analysis with the methods using higher-order statistics or nonlinear transformations.
ACKNOWLEDGMENTS
The authors would like to thank Dr. Hiroshi Nakagawa for cooperating during the initial structure preparation. The authors would also like to thank Dr. Satoshi Natori for discussing the RMA methods. This work was partially supported by JST PRESTO (No. JP-MJPR13LB). This work was also supported, in part, by JSPS KAKENHI Grant No. JP24540441. N. Karasawa acknowledges a research grant from the Keio Leading-edge Laboratory of Science and Technology.