Molecular Dynamics (MD) simulations of proteins implicitly contain the information connecting the atomistic molecular structure and proteins’ biologically relevant motion, where large-scale fluctuations are deemed to guide folding and function. In the complex multiscale processes described by MD trajectories, it is difficult to identify, separate, and study those large-scale fluctuations. This problem can be formulated as the need to identify a small number of collective variables that guide the slow kinetic processes. The most promising method among the ones used to study the slow leading processes in proteins’ dynamics is the time-structure based on time-lagged independent component analysis (tICA), which identifies the dominant components in a noisy signal. Recently, we developed an anisotropic Langevin approach for the dynamics of proteins, called the anisotropic Langevin Equation for Protein Dynamics or LE4PD-XYZ. This approach partitions the protein’s MD dynamics into mostly uncorrelated, wavelength-dependent, diffusive modes. It associates with each mode a free-energy map, where one measures the spatial extension and the time evolution of the mode-dependent, slow dynamical fluctuations. Here, we compare the tICA modes’ predictions with the collective LE4PD-XYZ modes. We observe that the two methods consistently identify the nature and extension of the slowest fluctuation processes. The tICA separates the leading processes in a smaller number of slow modes than the LE4PD does. The LE4PD provides time-dependent information at short times and a formal connection to the physics of the kinetic processes that are missing in the pure statistical analysis of tICA.
I. INTRODUCTION
Large-scale fluctuations and global structural rearrangements play an essential role in the biological functions of biopolymers. DNA transcription and replication involve the self-assembly of large multiprotein complexes that spontaneously form through step-by-step processes where binding of proteins is facilitated by the molecular flexibility.1 At the single molecule level, folding of the proteins to their most probable conformation involves large-scale molecular fluctuations and slow global structural rearrangements that are guided by cooperative dynamics.2–5 Proteins’ slow fluctuations may have important biological implication in the mechanism of protein binding and function.6–13 Following the hypothesis of the Monod–Wyman–Changeux model, a protein’s spontaneous fluctuations can lead the conformation selection mechanism where the binding partner selects the most favorable conformation among the ones made available by fluctuations.14–17 Thus, fluctuations in the unbound protein can signal the regions that are involved in protein binding and function.
Molecular dynamics (MD) simulations of proteins in solvent are a powerful method to identify fluctuations and investigate the role that the chemical structure, or primary sequence, of a protein play in multiscale dynamics. However, the information contained in the simulation trajectory is difficult to analyze because dynamical processes are often coupled on multiple length scales. Therefore, it is crucial to devise statistical procedures that conveniently separate the multidimensional trajectory of a simulation into a set of independent dynamical processes that, when added together, form the observed data. These different contributions should be as independent as possible for one to be able to analyze and classify their dynamical response separately. Traditionally, this issue has been addressed by adopting statistical tools from signal processing to extract from a noisy response the most critical information, which is usually a slowly fluctuating signal or a collection of slowly fluctuating signals.
The most widely used analysis method for simulation trajectories is the principal component analysis or PCA method based on the definition of a covariance matrix of the selected variables.18 Correlation is a linear association measure, and uncorrelated processes are defined as having the cross terms in the covariance matrix equal to zero. However, the goal is to isolate from the trajectory independent fluctuations. Independent processes and uncorrelated processes are different from the mathematical point of view. Independent processes are defined as having a joint probability distribution that can be separated into a product of individual distributions.19 Independence includes not only linear (uncorrelated) but also nonlinear relationships.20 In a nutshell, linear models, such as PCA, in general, cannot decompose the dynamics into independent motions because those motions can be nonlinear. Independent, nonlinear, mode dynamics can be identified by an Independent Component Analysis (ICA).19
In 2011, Naritomi and Fuchigami introduced to the field of protein dynamics the time-structure based independent component analysis or tICA, which is a specific type of ICA method.21 In tICA, the dynamics is defined by a time-lagged covariance matrix and its limit at zero lag time, which is the conventional covariance matrix. By solving the generalized eigenvalue problem, the protein’s dynamics are separated by tICA into modes that are uncorrelated at both zero lag time and at τtICA.22 These constraints substitute for the stringent independence criteria normally required; i.e., the mode independence at τtICA substitutes for the independence of nonlinear zero-lag correlations.19,23 The time-structure based method was later revisited by Pande and co-workers and with the name of “time-lagged independent component analysis” by Noe and co-workers.24,25 We will use the terms “time-structure based” and “time-lagged” interchangeably. When tICA is paired with a Markov State Model (MSM) of the kinetics of transition between modes, it accurately detects dominant slow modes of motion.24,25 The tICA modes identify the slowest dynamical decorrelation and thus are considered to be the optimal linear coordinates to represent the slow dynamics:26 for example, the tICA modes have been used as variationally optimal collective coordinates for enhanced sampling in metadynamics.27,28
While tICA remains a rigorous statistical analysis of the multidimensional simulation trajectory, it still does not provide a physical interpretation of the slow dynamics or the connection between slow motions and protein’s atomistic structure and interactions. Naritomi and Fuchigami partially addressed this issue by examining the slow dynamics of protein’s domains and backbone using tICA. Until now, an equation of motion that related the structure and interaction potential to the dynamics was missing in their study.21,29 The degrees of freedom or “features” input to the tICA are chosen based on their ability to predict the slowest dynamics but are not necessarily connected to an equation of motion for describing the time evolution of the input coordinates. One determines the efficacy of the chosen features a posteriori using cross-validation methods.30,31 In this study, we relate the tICA formalism to our Langevin Equation for Protein Dynamics (LE4PD).32–36 In the limit in which the tICA lag time is zero, the two are formally equivalent.32 Furthermore, the LE4PD is a formal extension of the equation of motion for a macromolecule, obtained from the first-principles Liouville equation using Mori–Zwanzig projection operators.37–40 Thus, albeit involved, there is a formal connection between the tICA fluctuations and the fundamental equation of motion for the dynamics of a protein.
The LE4PD formalism is based on the Rouse–Zimm equation for the dynamics of a polymer in solution,41–43 which we extended to include physical characteristics that are specific to folded proteins. Typically (i) inside the hydrophobic core of a protein, where atoms are not exposed to the solvent, the hydrodynamic interaction is screened, but atoms still experience friction; thus, the LE4PD adopts a residue-dependent friction coefficient calculated from the extent that each amino acid is solvent-exposed; (ii) protein dynamics are non-linear and molecular rearrangements of the protein during fluctuations involve the crossing of energy barriers that play a major role in protein dynamics and folding. The LE4PD approximately accounts a posteriori for the nonlinearities in the dynamics through the construction of free-energy landscapes for each mode and the rescaling of the timescale of barrier-crossing.32,34,44
Through normal mode diagonalization, the LE4PD separates the dynamics sampled in a long MD simulation, or in a set of short MD simulations, into a set of diffusive normal modes that are largely independent. These modes directly depend on real-space information, as the amino acid-specific local frictions, the water’s viscosity, the potential of mean force, and the internal energy barriers are included. From the mode-dependent free-energy landscape, one can identify the energy minima and the pathways that cross energy barriers, thus isolating the mode-dependent local fluctuations. Along the pathway, one can sample the protein conformation, thus depicting the conformational transitions during barrier-crossing. A simple Kramers’ rescaling, or applying a MSM analysis to the mode-dependent free-energy surface (FES), provides the transition times of the mode-dependent fluctuations. Relaxation dynamics of time correlation functions predicted by LE4PD have been shown to be accurate when compared with experimental data of T1, T2, and nuclear Overhauser effect (NOE) nuclear magnetic resonance (NMR) relaxation,34,35 as well as to short-time Debye–Waller factors from x-ray scattering experiments.33 Recently, Beyerle and Guenza extended the LE4PD approach to describe anisotropic fluctuations in the LE4PD-XYZ model.32
In this study, we focus on tICA, given that at zero lag time, the covariance matrix is formally consistent with the inverse of the LE4PD-XYZ force matrix. We compare the position, timescale, and pathway of slow fluctuations measured by tICA with the ones described by the LE4PD-XYZ. The question we aim to address is if a Langevin-mode decomposition can be as effective as tICA in isolating the leading dynamical processes from a protein MD trajectory.
We analyze an extensive, 1-µs long MD simulation of the protein ubiquitin in a solution of sodium chloride at physiological conditions. Ubiquitin is a regulatory protein in eukaryotic cells, known for its role as a post-translational modifier of other proteins through mono- and poly-ubiquitination processes. It can bind to substrates either covalently45,46 or non-covalently.47 By identifying the slowest fluctuations in the isolated protein, we can locate important regions for ubiquitin’s binding where a partner’s selection can be guided by the thermodynamics and the kinetics of the different fluctuation processes.
This study addresses several relevant questions related to protein fluctuations and the tICA method: Are the tICA’s slow fluctuations similar to the ones that LE4PD-XYZ identifies? How are the results of this study dependent on the choice of the tICA lag time? Are there different but compatible procedures to correctly identify the best tICA lag time? Can a Langevin equation that adopts the tICA covariance matrix to build the intramolecular force matrix describe the correct dynamics of the system as measured by time correlation functions? How important is the role of hydrodynamics in tICA? How significant are the internal energy barriers that are only approximately accounted for by tICA?
By comparing LE4PD-XYZ to tICA, this study formally connects the tICA method to a Langevin equation of motion. Along similar lines of thinking, Takano and co-workers recently proposed the Relaxation Mode Analysis (RMA), which is similar to tICA.48–51 Both RMA and tICA maximize the time-dependent correlation matrix of the fluctuations at a given lag time, τtICA, and at an initial time, t0, while dynamics faster than t0 are averaged out.51 The difference between the two approaches is that RMA calculates the covariance matrix at a time t0 ≠ 0, while tICA is a particular case of RMA, where t0 = 0.51 The RMA also has some similarities with our LE4PD approach as both accurately model with a Langevin equation of motion the slow dynamics of the protein, even if the details of the two dynamical equations are different.
This paper is structured as follows: Sec. II briefly summarizes the anisotropic LE4PD-XYZ approach and the calculation of the LE4PD mode-dependent free-energy surfaces. Section III presents the tICA method while also proposing the calculation of a single-mode-dependent free-energy map, which MSM analyzes. Section IV illustrates the comparison between LE4PD-XYZ’s and tICA’s slowest fluctuations and also includes a biological interpretation of the observed fluctuations. Section V discusses the compatibility of the FES for the slowest tICA mode and the slowest LE4PD-XYZ modes with and without hydrodynamic interactions. The calculation of time correlation functions with the two approaches is in Sec. VI. Because the results of the tICA method depend on the tICA lag time selected, we analyze the dependence of the tICA single-mode FES on this parameter in Sec. VII. Finally, we assess the advantages and disadvantages of the proposed single-mode tICA method in comparison with the more traditional two-mode tICA analysis in Sec. VIII. A brief discussion with conclusions summarizes the findings of this study in Sec. IX.
II. THE ANISOTROPIC LANGEVIN EQUATION FOR PROTEIN DYNAMICS OR LE4PD-XYZ
In recent years, we have developed a coarse-grained model to describe protein fluctuations in the amino acid positions, called the Langevin Equation for Protein Dynamics (LE4PD).33–36,44,52 The original LE4PD is isotropic and is presented in Sec. S6 of the supplementary material. Beyerle and Guenza recently extended it to the related anisotropic formalism, called the LE4PD-XYZ method.32 The anisotropic LE4PD directly connects the PCA fluctuations to an equation of motion that contains the covariance matrix in the amino acids positions.32 We briefly review the LE4PD-XYZ model here, while we refer for more details on both LE4PD models to the aforementioned original manuscripts.
The first step in developing the anisotropic LE4PD is to define as the leading variables the deviations of the position of the protein’s alpha carbons from their average values, .32 Here, denotes the usual static average calculated over a trajectory of length M frames.
Each component of the position vector fluctuation follows the anisotropic LE4PD equation of motion,
where . Furthermore, kB is the Boltzmann constant, T is the temperature in kelvin, and is a stochastic velocity. The average residue friction coefficient is , where ζi is the friction coefficient of amino acid i. The matrix describes the hydrodynamic interaction between the α component of residue i and the β component of residue j,
where is the average inverse distance between residues i and j and is the average residue radius exposed to the solvent.
In this equation, the dynamics is defined in a body-fixed system of coordinates, where both translation and rotation dynamics have been eliminated. The trajectory of the protein, analyzed to build the H′ and A′ matrices, for example, is also in a body-fixed reference system, where translation and rotation are absent.53–56 The equation is solved by applying the fluctuation–dissipation condition, as described in our previous publication.32
The matrix describes the inverse of the covariance between the β component of residue j and the γ component of residue k as
where is the matrix of correlations of the bond fluctuations in Cartesian coordinates with , . I is the 3 × 3 identity matrix, and a is the N − 1 × N matrix of the amino acid connectivity (with i = 1, …, N − 1 and j = 1, …, N),
Here, δαβ is the Kronecker delta, and the “⊗” symbol denotes the Kronecker product.57
From the simulation trajectory, we calculate (i) the average fluctuations in the α-carbon positions, which enter through the U matrix the inverse of the covariance matrix [Eq. (3)]; (ii) the average inverse distance between the residues, which enter the hydrodynamic interaction matrix [Eq. (2)]; (iii) the friction coefficient of each residue, ζi, and the average residue radius exposed to the solvent, , which also enter Eq. (2). The simulation trajectory is also used to test the quality of the theoretical predictions of time correlation functions in Sec. VI.
More details on the anisotropic LE4PD model, and how it is formally related to the isotropic LE4PD, are given in Ref. 32. Equation (1) is solved using the eigenvalue decomposition of the H′A′ matrix product, Q′−1H′A′Q′ = Λ′, which gives the equation of motion for the evolution of the LE4PD-XYZ modes, ,
with being the characteristic diffusive rate of mode a58 and being the stochastic velocity in mode coordinates.
A. Building a free-energy map in anisotropic coordinates and measuring fluctuations timescales
Using the decomposition of Q′ for the anisotropic H′A′ matrix, the mode coordinate of the anisotropic LE4PD can be separated into its x–, y–, and z– components as
where are the unit vectors in the x-, y-, and z- directions, and the spherical mode coordinates and free-energy surfaces can be defined as
where the probability for the protein of adopting, in mode a, a conformation with angles is
The linear combination of all the anisotropic modes leads to structural and time-dependent properties, directly comparable to simulations or experimental data. From the anisotropic free-energy surfaces, we calculate fluctuations in the three spatial directions. As an example, Fig. 1 shows the LE4PD-XYZ analysis of a 1-µs long MD simulation of the protein ubiquitin (for details on the simulation, see Sec. S8 in the supplementary material). The same trajectory will be analyzed with the tICA to provide a comparison between the two methods. Figure 1 shows in panel (a) the FES in the mode coordinates for the seventh LE4PD-XYZ mode without hydrodynamics (mode seventh is the slowest one in this formalism). The FES displays two minima separated by a small energy barrier. The protein’s conformations along the transition pathway between these two minima are displayed in panel (b). Panels (c) and (d) report data from a MSM analysis of the mode trajectory (for details on the MSM, see Sec. S9 in the supplementary material).
More specifically, panel (c) shows the projection of the second MSM eigenvector, ψ2, onto the FES. The second eigenvector of the MSM transition matrix identifies the two deepest minima and the top of the energy barrier between them in an FES.59 When the second MSM eigenvector matches the population in the two minima and at the top of the barrier, the MSM lag time identified by this process provides the transition time across the barrier.44 Panel (d) shows the calculation of the transition time as a function of the MSM lag time. When a process fulfills the Chapman–Kolmogorov (CK) equation, the process follows Markovian statistics, and the transition time becomes independent of the lag time, i.e., the transition time becomes constant.59 Panel (d) shows that the transition to Markovian dynamics (when the blue line becomes flat) is reasonably close to the transition time calculated from the second MSM eigenvector (the vertical dashed line). Thus, the two procedures to evaluate the transition time give similar values for the seventh mode. Note that Fig. 1 displays results for the LE4PD-XYZ theory without hydrodynamic interactions (see for a discussion Sec. V). An identical calculation performed for the slowest LE4PD-XYZ mode with hydrodynamic interactions (mode six) gives similar free-energy maps and MSM analyses. This result suggests that the hydrodynamics interaction does not affect the dynamics of the slowest fluctuations. Note that the calculations performed take into account hydrodynamics and also include residue-dependent friction coefficients.
III. TIME-LAGGED INDEPENDENT COMPONENT ANALYSIS OR tICA
The time-lagged independent component analysis is a method extensively used in the field of signal processing, information theory, and artificial neural networks to identify hidden factors that are shared and underlie the observed multivariate data.23 This technique has been applied in several fields, including the analysis of protein dynamics to identify the prevalent large-scale motion inside a simulation trajectory. By introducing a time lag in the covariance matrix, one effectively includes the temporal dimension in the analysis of the leading fluctuations, making it possible to model kinetic processes. The time-lagged ICA is an extension of the principal component analysis (PCA) method, where one takes care of isolating the most slowly decorrelating dynamics while including the time dependence of the data as an explicit variable in the analysis. The tICA method has been reviewed in several recent publications and will be only summarized here.21,24,29,60–62
While tICA is a general approach that applies to any set of coordinates, here, we are interested in performing a tICA of the alpha-carbon trajectory of a protein with N residues. We define as tICA coordinates the , where represents the fluctuations out of the equilibrium structure of the position of the space coordinates , with and i = 1, …, N, with N being the number of amino acids in the protein. The time-lagged covariance matrix is defined, for a lag time τtICA, as
and for τtICA = 0, the covariance matrix recovers the static, structural matrix that is used in PCA, as . Here, denotes an average over the time-lagged trajectory containing M frames.
The tICA modes, or tICs, are found by solving the following generalized eigenvalue equation:19,24
where Ω is the matrix of right eigenvectors of Cr(τtICA) and ΛIC(τtICA) is the diagonal matrix of the related eigenvalues.
From the solution of the generalized eigenvalue problem, one has that the eigenvector matrix, Ω, diagonalizes both Cr(τtICA) and Cr(0),
where I is an identity matrix of the same dimensions as Cr(τtICA) and Cr(0). The tICA modes, z(t), are determined by transforming the input coordinates ΔR(t) as z(t) = ΩTΔR(t).
A. Connection between the tICA and the LE4PD-XYZ
Given the definition of the LE4PD-XYZ A matrix of Eq. (3), it is straightforward to show that
where the tICA matrix is defined in Eq. (8). A more detailed calculation of this relation is reported in Ref. 32.
It follows that in the limit of zero lag time, the tICA eigenvectors are identical to the eigenvectors of the LE4PD-XYZ matrix when hydrodynamics is discarded, i.e., the hydrodynamic matrix H = I. The tICA eigenvalues, instead, are equivalent to the inverse of the LE4PD-XYZ eigenvalues. For the hydrodynamic matrix to become equal to the identity matrix, one needs to assume that there are not solvent-mediated hydrodynamic interactions between amino acids and that the friction coefficient of each amino acid is set equal to the average friction coefficient. Note that the formal equivalence between the tICA dynamics at lag time zero and the LE4PD-XYZ approach represented by Eq. (5), with H = I, implies that the fluctuations are harmonic and the internal energy barriers are also discarded. Thus, both the LE4PD-XYZ and the tICA dynamics should include in the equation of motion the correction due to the energy barriers calculated from the mode-dependent free-energy landscapes and the hydrodynamic interaction.
B. Converting to spherical coordinates creates a free-energy surface for each tICA mode
In this section, we show how a free-energy landscape can be associated with each tICA mode. This step is important because the mode-specific free-energy barriers allow one to study the details of conformation fluctuations, thus determining the pathway of transition and the height of the energy barriers in the fluctuations associated with tICA modes. We follow here a procedure similar to the one we established for the LE4PD-XYZ modes.
The elements of the tICA eigenvectors Ω can be decomposed into their x–, y–, and z– projections,
where , and are the unit vectors in the x-, y-, and z-directions and ⊗ denotes the Kronecker product.57 This decomposition is useful as it allows for the creation of tIC-dependent free-energy surfaces, which can be compared directly with the LE4PD free-energy surfaces (see Sec. V).
To define a Free-Energy Surface (FES) for each of the tICA mode coordinates, we start by projecting the space coordinates of the fluctuations onto tICA modes using the tICA eigenvectors. For the tICA modes, the eigenvector matrix ΩT, which transforms the into the z(t) tIC coordinate system, can be decomposed into its contributions from the x–, y–, and z– components of ,
which allows for the decomposition of each tIC za(t) into its contributions from the x–, y–, and z– components of the input coordinates ,
This decomposition can be used to describe each tIC in a new spherical coordinate system,
With the definitions of θa(t), ϕa(t), and Ra(t), one can create two-dimensional free-energy surfaces in (θa, ϕa) by averaging over the radial coordinate Ra(t),
The main advantages of constructing the free-energy surfaces in this manner are that (1) each surface is tIC-specific because the dynamics among tICs are largely decoupled (an evaluation of the extent of coupling in tICA modes is reported in Sec. S1 of the supplementary material), and (2) energetic pathways and fluctuations along this surfaces are easy to visualize for each tIC. As with previous LE4PD analyses, a variant of the string method is utilized to find minimum free-energy pathways between energy wells on the surface.32,44,63 A MSM analysis can provide the time scale associated with each tICA mode. Note that the tICA can be applied once we have selected a lag time, τtICA. In this study, the tICA lag time is 2 ns; we present the procedure to establish this value in Sec. VII.
C. Characterization of the first and the second tICA modes: FES, pathways, and conformational transitions
As an example of the information inherent in F(θa, ϕa) for the tICs, Figs. 2 and 3 show the results of the analysis in the coordinate space for the two slowest tICA modes extracted from the 1-µs simulation of ubiquitin.
Figure 2(a) shows the free-energy map, F(θ1, ϕ1), for the first tICA mode, z1(t), with a pathway drawn between the two prominent minima on the surface. Figure 2(b) displays the fluctuations along the alpha-carbon backbone of ubiquitin when moving along the pathway given in Fig. 2(a); the colors of the structures in Fig. 2(b) correspond to the colors of the images along the pathway in Fig. 2(a). Movement along the minimum energy pathway for z1(t) shows fluctuations in the 50 s loop (blue arrow), the C-terminal tail (black arrow), and the Lys11 loop (red arrow), each of which is a known binding region of ubiquitin to other proteins.45,46,64
Figure 2(c) shows the projection of the most slowly decaying eigenfunction, ψ2, from the MSM transition matrix constructed on this surface starting from the MD trajectory. Note that by assuming different MSM lag times, one obtains different eigenvectors ψ2 and different projections onto the surface. By selecting τMSM = 4.0 ns, we see that the most positive projection of ψ2 lies in the minimum in the bottom half of the surface, and the maximum projection of ψ2 lies in the minimum in the top half of the surface. Thus, with τMSM = 4.0 ns selected, the spectrum indicates that the slowest process described by the MSM corresponds to transitions between the two minima on the surface, whose fluctuations should be described well by the extracted structures from the pathway given in Fig. 2(b).
To test the validity of τMSM = 4.0 ns found with this procedure, Fig. 2(d) shows the implied timescale of t2, i.e., the timescale of the process described by ψ2, as a function of MSM lag time τMSM. The vertical dashed line marks the lag time used in the construction of the MSM shown in Fig. 2(c). We see that at the selected MSM lag time, the dynamics are Markovian, and the t2 time is constant, confirming the validity of the selected τMSM. Thus, for ψ2 shown in Fig. 2(c), the MSM transition matrix, T, is constructed by sampling the trajectory with an interval of τMSM = 4.0 ns, and the predicted timescale is ns. In summary, combining the tIC free-energy surface in with the Markov state modeling analysis predicts that the timescale of movement between the two minima in Fig. 2(a) is ∼53 ns. The corresponding dynamics along the alpha-carbon backbone during this event are illustrated in Fig. 2(b).
Figure 3 illustrates the analogous analysis for the surface spanned by the second-slowest tIC. Drawing a transition pathway between the two minima on the surface [Fig. 3(a)] and extracting the structures along that pathway from the MD simulation show that this tIC describes fluctuations in the Lys11 loop and C-terminal tail regions of ubiquitin [Fig. 3(b)].45,46 Again, using the decomposition of ψ2 from the MSM on this surface to choose the lag time of the MSM [Fig. 3(c)], the process of transitioning between the minima on the surface is predicted to occur over a timescale of 6.7 ns [Fig. 3(d)]. Thus, the surface for the second-slowest tIC predicts mainly motion in the tail and Lys11 loop, occurring over a timescale of 6.7 ns.
We repeat the evaluation of the t2 times for all the tICs, and we present these results for the first ten tICs in comparison with the first ten LE4PD-XYZ modes in Tables I and II. Further information on the timescales associated with the first and second tICA modes and on the amplitude and position of their fluctuations are presented in Secs. IV and IV A, respectively.
. | . | LE4PD-XYZ . | . | |
---|---|---|---|---|
. | LE4PD . | w/HI . | W/o HI . | tICA . |
Mode . | (ns) . | (ns) . | (ns) . | (ns) . |
1 | 3.9(1.05) | 8.0(3.2) | 6.5(2.8) | 52.6 (4.0) |
2 | 0.7(0.1) | 3.7(1.1) | 4.6(1.5) | 6.7 (1.6) |
3 | 0.9(0.35) | 4.3(2.5) | 4.3(2.0) | 4.8(1.6) |
4 | 2.4(0.5) | 6.4(4.0) | 1.0(0.3) | ⋯ |
5 | 0.1(0.01) | 4.8(4.0) | 5.5(4.9) | 2.5(1.0) |
6 | 11.0(0.9) | 3.3(1.0) | 3.1(2.0) | ⋯ |
7 | 0.5(0.25) | 3.6(2.0) | 7.4(3.0) | 2.5(1.6) |
8 | 0.4(0.11) | 0.6(0.2) | ⋯ | 6.1(5.0) |
9 | 0.24(0.1) | 0.3(0.1) | 1.3(0.5) | 0.8(0.3) |
10 | 0.35(0.3) | 0.4(0.1) | 0.4(0.2) | 7.0(5.0) |
. | . | LE4PD-XYZ . | . | |
---|---|---|---|---|
. | LE4PD . | w/HI . | W/o HI . | tICA . |
Mode . | (ns) . | (ns) . | (ns) . | (ns) . |
1 | 3.9(1.05) | 8.0(3.2) | 6.5(2.8) | 52.6 (4.0) |
2 | 0.7(0.1) | 3.7(1.1) | 4.6(1.5) | 6.7 (1.6) |
3 | 0.9(0.35) | 4.3(2.5) | 4.3(2.0) | 4.8(1.6) |
4 | 2.4(0.5) | 6.4(4.0) | 1.0(0.3) | ⋯ |
5 | 0.1(0.01) | 4.8(4.0) | 5.5(4.9) | 2.5(1.0) |
6 | 11.0(0.9) | 3.3(1.0) | 3.1(2.0) | ⋯ |
7 | 0.5(0.25) | 3.6(2.0) | 7.4(3.0) | 2.5(1.6) |
8 | 0.4(0.11) | 0.6(0.2) | ⋯ | 6.1(5.0) |
9 | 0.24(0.1) | 0.3(0.1) | 1.3(0.5) | 0.8(0.3) |
10 | 0.35(0.3) | 0.4(0.1) | 0.4(0.2) | 7.0(5.0) |
. | . | LE4PD-XYZ . | . | |
---|---|---|---|---|
. | LE4PD . | w/HI . | w/o HI . | tICA . |
Mode . | (ns) . | (ns) . | (ns) . | (ns) . |
1 | 5.3(1.8) | 14.6(12.0) | 16.2(12.0) | 54.0(5.0) |
2 | 3.3(1.6) | 14.4(10.0) | 16.6(12.0) | 12.6(5.0) |
3 | 1.9(1.2) | 9.6(8.0) | 9.2(8.0) | 10.5(5.0) |
4 | 4.7(1.6) | 7.2(6.0) | 9.5(8.0) | 9.1(5.0) |
5 | 3.6(1.6) | 4.8(4.0) | 7.7(6.0) | 9.3(5.0) |
6 | 33.7(25.0) | 4.6(4.0) | 21.5(12.0) | 6.6(5.0) |
7 | 1.2(1.0) | 19.9(12.0) | 12.6(10.0) | 5.7(5.0) |
8 | 3.0(1.6) | 2.4(2.0) | 4.6(4.0) | 6.1(5.0) |
9 | 0.5(0.4) | 4.0(3.5) | 1.8(1.5) | 6.4(5.0) |
10 | 0.35(0.3) | 1.3(1.0) | 3.7(3.0) | 7.0(5.0) |
. | . | LE4PD-XYZ . | . | |
---|---|---|---|---|
. | LE4PD . | w/HI . | w/o HI . | tICA . |
Mode . | (ns) . | (ns) . | (ns) . | (ns) . |
1 | 5.3(1.8) | 14.6(12.0) | 16.2(12.0) | 54.0(5.0) |
2 | 3.3(1.6) | 14.4(10.0) | 16.6(12.0) | 12.6(5.0) |
3 | 1.9(1.2) | 9.6(8.0) | 9.2(8.0) | 10.5(5.0) |
4 | 4.7(1.6) | 7.2(6.0) | 9.5(8.0) | 9.1(5.0) |
5 | 3.6(1.6) | 4.8(4.0) | 7.7(6.0) | 9.3(5.0) |
6 | 33.7(25.0) | 4.6(4.0) | 21.5(12.0) | 6.6(5.0) |
7 | 1.2(1.0) | 19.9(12.0) | 12.6(10.0) | 5.7(5.0) |
8 | 3.0(1.6) | 2.4(2.0) | 4.6(4.0) | 6.1(5.0) |
9 | 0.5(0.4) | 4.0(3.5) | 1.8(1.5) | 6.4(5.0) |
10 | 0.35(0.3) | 1.3(1.0) | 3.7(3.0) | 7.0(5.0) |
IV. A COMPARISON BETWEEN THE tICA AND THE LE4PD-XYZ SLOWEST FLUCTUATIONS
In this section, we perform a quantitative comparison of the transition times predicted for the slow tICA modes and for the LE4PD-XYZ modes, starting from the same MD trajectory of ubiquitin in solution (for the MD simulation method, see Sec. S8 in the supplementary material).
For both the LE4PD-XYZ and the tICA free-energy surfaces, we evaluate the MSM times following the procedure presented in Secs. II A and III C. To calculate the transition times, t2, we construct the MSM for each mode and estimate the timescales, t2, using either the mapping of the second MSM eigenvector onto the FES or the Markovian criterion (i.e., the Chapman–Kolmogorov [CK] condition) for the mode trajectories (for details on the MSM, see Sec. S9 in the supplementary material). Table I presents the values of t2 calculated using the second MSM eigenvector ψ2, while Table II reports the values of the transition times calculated using the Chapman–Kolmogorov criterion for a Markovian process.59,65 In Tables I and II, we also report the values of transition times calculated for the internal modes of the isotropic LE4PD equation with the hydrodynamic interaction included (the LE4PD theory is briefly summarized in Sec. S6 in the supplementary material). The data for the LE4PD-XYZ approach are reported both with and without hydrodynamic interaction included. From these data, we can assess the importance of hydrodynamics in the time scale of the protein’s fluctuations. Note that the calculations are performed while including hydrodynamics and also account for residue-dependent friction coefficients.
Note that the isotropic LE4PD is an equation of motion in the lab reference system, while the LE4PD-XYZ starts from a body-centered trajectory where translation and rotation have been eliminated.53–56 Thus, we report the internal motion of LE4PD after the first three rotational modes are discarded. In Tables I and II, all the dashed entries denote free-energy surfaces where the extreme projections of ψ2 are never located in minima on the surface and are thus not suited for Markov state modeling in the manner desired here.
Crossing the energy barriers may slow down differently different modes. Some internal modes may have larger energy barriers than the first mode. This is, in fact, the case for ubiquitin, as one can see from reading Tables I and II. When t2 is calculated using the second MSM eigenvector (Table I), the slowest mode for the isotropic LE4PD method, with hydrodynamic interaction included, is mode 6, while the slowest mode for the anisotropic LE4PD-XYZ model without hydrodynamics is mode 7. If the Markovianity criteria are enforced (Table II), the slowest fluctuations are in mode 7 for the LE4PD-XYZ with hydrodynamics and in mode 6 for the LE4PD-XYZ without hydrodynamics.
From the timescales listed in Table I, all the LE4PD methods give roughly the same timescales for the slowest motions of the system. The first tICA mode, however, displays dynamics that are five times slower than LE4PD. The first tIC corresponds to the contemporary motion in the three flexible binding regions of ubiquitin, as shown in Fig. 2, and predicts that this motion occurs almost ten times slower than the roughly analogous motion predicted by the isotropic LE4PD mode 6 and LE4PD-XYZ mode 7 with hydrodynamics, respectively. However, when the MSM lag time is selected using the CK condition, which does not always coincide with the lag time selected by optimizing the projection of ψ2 from the MSM,44 the gap between the predicted timescales of the slow LE4PD and tICA modes is reduced, as shown in Table II.
Overall, the timescales presented are similar in magnitude, with the tICA modes being generally slower than the LE4PD-XYZ. Note that the plots to calculate t2 are in all cases on a logarithmic scale and that small changes in the selected τMSM can give large differences in the value of t2 [see Figs. 1(d) and 2(d)]. Except for the first tICA mode, all the other tICA and LE4PD modes do not show an evident Markovian nature of the dynamics, and one should take the exact values of t2 with some reservations.
A. Localization of mode-dependent fluctuations detected by tICA and LE4PD-XYZ
To compare the dynamics predicted by the slow tICs and LE4PD modes, we calculate the mode-dependent fluctuation profiles as a function of amino acid sequence along the backbone of the protein for the first ten modes of the LE4PD-XYZ theory without hydrodynamic interaction and of the tICA approach. These correspond to the last two columns on the right of Tables I and II. Local fluctuations are well represented by the local mode lengthscale (LML).36,44 In the anisotropic formalism of LE4PD-XYZ, where , the eigenvectors are partitioned into their x–, y–, and z– components, thus isolating the x–, y–, and z–projections of as
where μa,LE4PD-XYZ are the eigenvalues of A′.32
For the tICA, fluctuations are derived from the definition of the modes, , and the Moore–Penrose generalized inverse67 of ΩT, Ω−1T as
The mean-square fluctuations of residue i, given by ⟨Δxi(t)Δxi(t)⟩ + ⟨Δyi(t)Δyi(t)⟩ + ⟨Δzi(t)Δzi(t)⟩, can be written in terms of the tICs as
where the definition ⟨za(t)zb(t)⟩ = δab is used to obtain Eq. (24). The tICA LMLs are reported in Fig. 5.
Thus, Figs. 4 and 5 show the mode-dependent fluctuations calculated from the one-microsecond ubiquitin simulation using the anisotropic LE4PD without hydrodynamics and the tICA, respectively, for the first ten modes of each method. For tICA, the slowest tIC describes concerted fluctuations in the tail, Lys11, and 50 s loops of the protein. For the LE4PD-XYZ approach, most of the low-index modes describe fluctuations in the C-terminal tail of the protein. One needs to look at the fourth mode to find slow fluctuations in the Lys11 loop and to the sixth and seventh modes to find slow fluctuations in the 50 s loop. Neither of the LE4PD approaches gives a single mode describing simultaneous motion in the three important regions of the protein. However, we observe that there is good correspondence between the slowest tIC and the anisotropic LE4PD-XYZ mode 7, which is the slowest mode for the anisotropic LE4PD when hydrodynamic effects are neglected.
Because the three slowest processes all appear in the first tICA mode, while they are partitioned in six different LE4PD-XYZ modes, we observe that the tICA procedure can group the slowest, important, dynamics in a smaller number of modes than the LE4PD, which, instead, partitions the protein’s slow dynamics into several leading modes with different time and length scales. When the goal is identifying the slowest fluctuations in one mode, tICA appears to be more efficient than the LE4PD in isolating the slow fluctuations if the lag time, τtICA, is selected appropriately. However, suppose the ultimate goal is the accurate analysis of the protein’s slow dynamics. In that case, the LE4PD approach has a more desirable outcome because it maintains the information on the fast dynamics and provides better resolution at short times. As shown in Sec. VI, the LE4PD can predict the dynamics as measured by time correlation functions with higher accuracy than the tICA modes.
The quantitative comparisons of the mode-dependent fluctuations calculated using the isotropic LE4PD with hydrodynamics, the anisotropic LE4PD-XYZ without hydrodynamics, and the tICA are presented in the supplementary material (Sec. S4).
B. Biological interpretation of ubiquitin’s fluctuations
Following the conformational selection model, a protein in the absence of its binding partner samples all the energetically available states, and among those states are some that can bind to the substrates of interest.14–17 Thus, the residue fluctuations, observed in the tICA and LE4PD analyses, may provide useful information on the timescales and length scales of relevant binding modes. By direct inspection of the LMLs calculated with the LE4PD-XYZ and with tICA, we observe that there are three important regions of slow fluctuation dynamics in ubiquitin. Fluctuations in the C-terminal tail and in the flexible loop containing Lys11, which are visible in most of the slow modes, are implicated in the covalent association to other proteins and in covalent binding to lysine in polyubiquitination.45–47 Given that ubiquitin binds covalently to numerous proteins of different sizes and flexibilities, it is perhaps not surprising that these fluctuations cover a wide range of length scales and timescales. The third region of important fluctuations involves the 50 s loop, which is known to participate in the hydrophobic binding to the A20 zinc-finger motif of Ras guanine exchange factor Rabex-5, where the residue Y25 of Rabex-5 forms a hydrogen bond with residue E51 of ubiquitin, which has the largest fluctuations in the 50 s loop.64,68
Dynamics in the 50 s loop region of ubiquitin is correlated with the breaking of a hydrogen bond between G53 and E24, which helps maintain the protein’s folded structure.69 Thus, breaking this hydrogen bond serves as a “gatekeeper” to selecting different conformations. For example, in Ref. 69, the authors found that only 29% of 155 x-ray structures examined showed ubiquitin with the same hydrogen bonding pattern between G53 and E24 found in the folded structure from Ref. 70, which is also the starting structure in this study. Furthermore, in Ref. 71, the authors demonstrated that the interchange between hydrogen bonding patterns in the 50 s loop modulate large-scale conformational changes (contraction and expansion) along the entire primary sequence of ubiquitin. This affects the protein’s ability to bind to a set of ubiquitinases known as ubiquitin-specific proteases and marks the 50 s loop as a potential site of allosteric inhibition. Thus, experimental evidence indicates that local conformational changes in the 50 s loop are required for global conformational transitions in ubiquitin.
V. SIMILARITIES BETWEEN THE tICA AND THE LE4PD-XYZ FREE-ENERGY SURFACES
In this section, we quantify the agreement between the energy maps for the slowest modes identified by tICA and LE4PD-XYZ. Figure 6 displays in each row the comparison between the LE4PD slowest modes and the tICA slowest mode for the two LE4PD models we study, namely, the isotropic LE4PD and the anisotropic LE4PD-XYZ theory.
In the first column, Fig. 6 shows the FES of the LE4PD projected trajectory, which displays energy minima for the most populated conformations of the protein. For this FES, the second column of Fig. 6 presents the second eigenvector obtained from the Markov State Model (MSM) analysis of the FES. The superposition of the second MSM eigenvector to the LE4PD energy map indicates which transition represents the slowest fluctuation for the given LE4PD mode.65,72,73 The third column in Fig. 6 shows the comparison between a tICA mode and the LE4PD mode. The superposition is accomplished by projecting the first tIC onto the LE4PD free-energy map and testing if the most extreme tICA conformations are the ones that correspond to the minima in the LE4PD FES. To perform this comparison, we assign each conformation in the tICA mode trajectory to the closest MSM microstate in the LE4PD-mode FES surface using the root mean-square distance from each MSM microstate as the assignment metric. Then, the tICA mode trajectory populates the FES, giving a projection of the tICA mode that is completely analogous, in both meaning and interpretation, to the projection of an eigenvector ψi from the MSM onto the LE4PD FES (see the second column of Fig. 6). The approach of projecting a tICA mode onto a free-energy surface has been previously applied by Sultan and Pande27 to verify the interpretation for the slowest tIC from a simulation of alanine dipeptide.
When projecting the tICs, z(t), onto the surfaces, the average of z(t) within each MSM LE4PD microstate i, Si, is calculated as
with Mi being the total number of frames the z(t) trajectory resides in the Si LE4PD microstate over the course of the simulation. This local average of z(t) within each of the discrete states is what is reported in Fig. 6.
The slowest tIC is the optimal linear approximation to the full-space Markov propagator of the system.24 The ψ2 from the MSM on the slowest LE4PD modes are also estimators of the slowest processes of the system; a high similarity between the projected spectra of the slow tICs and ψ2 indicates high similarity between the predicted dynamics from the two models. That is, if the slow dynamics predicted in each approach are consistent with each other, then the spectra of both the slow tICs and ψ2 should predict probability flow between the deep minima on the surfaces of the slowest LE4PD modes. The ψ2 are already parameterized to do so,44 but the slow tICs are, in principle, ignorant of the LE4PD surface. We use this technique to confirm that the slow LE4PD modes can extract the slow dynamics compatible with tICA modes.8
The three rows in Fig. 6 represents, from top to bottom, the following calculations. In the first row, the isotropic LE4PD (mode 6) with hydrodynamics agrees with the first tICA mode. In the second row, the anisotropic LE4PD-XYZ (mode 7) without hydrodynamics compares well with the first tICA mode. The third row shows a comparison between the third tICA mode (mode 3) and the fifth mode of the anisotropic LE4PD-XYZ without hydrodynamics. It is clear from these results that the slow dynamics detected by tICA and anisotropic LE4PD-XYZ are similar, even if the slow dynamics can be distributed differently in the LE4PD, LE4PD-XYZ, and tICA modes (see Figs. 4 and 5).
Note that the technique used here of projecting the tICs onto the surfaces of the LE4PD modes is analogous to the technique used in Refs. 74–77 to model experimental observables using Markov state models. Like an experimental observable, the separation of two minima of the surfaces into “high z” and “low z” states indicates that transitions on the surface correspond to transitions between a high z state and a low z state, similar to how fluorescence experiments on a protein search for transitions between a high fluorescence state, indicating that the protein is sampling conformations where the fluorophores are far apart, and a low fluorescence state, where the protein is sampling conformations where the fluorophors are close together.78,79
In conclusion, Fig. 6 demonstrates that both LE4PD approaches are able to capture the same slow motion as the tICA. The correlation between the time series of z1 and ψ2 from the MSM of the slowest isotropic LE4PD mode is high (ρ = 0.92), indicating that both z1 and ψ2 are predictive of the slow dynamics in ubiquitin. The correlation coefficient between the time series of z1 and ψ2 from the MSM of the slowest anisotropic LE4PD-XYZ mode is ρ = 0.73, which is still acceptable. The correlation coefficient between the time series of z3 and the ψ2 for the fifth LE4PD-XYZ mode is ρ = 0.54.
VI. TESTING THE tICA AND LE4PD PREDICTIONS OF TIME CORRELATION FUNCTIONS AGAINST SIMULATIONS
The analysis of the amplitude, location, and timescale of the slow fluctuations for ubiquitin with the three methods (tICA, LE4PD, and LE4PD-XYZ) show that they correctly identify the regions in the protein where slow fluctuations occur. However, the slow fluctuations are partitioned in different modes for the two methods. To gain further details on the capabilities of the two approaches, we introduce, as the ultimate test of the tICA’s and LE4PD’s ability to predict with accuracy slow time dynamics, the comparison of their time correlation functions (tcfs) to the tcfs directly calculated from the simulation trajectory.
The normalized autocorrelation function for the fluctuations of each residue is defined as . For the LE4PD approaches, the autocorrelation function is calculated by including for each mode the slowing down of the dynamics due to the presence of an energy barrier in the FES.32,44 Recently, we have shown that neglecting the hydrodynamic interaction modifies the LE4PD-XYZ curves, leading to a (moderately) worse agreement with the simulation data.32 Figure 7 shows the fluctuation decay of the tcfs for residues sampled along the primary sequence of ubiquitin. This figure compares the LE4PD-XYZ results with hydrodynamics included to the tcfs from the simulations: the agreement is remarkable. It also shows the tcfs for the LE4PD-XYZ without hydrodynamic interactions, which are less in agreement, at least for the residues presented in the figure.
At a given lag time, C(t) can be written in terms of the tICA eigenspectra by inverting the relationship as and using the (near) independence of the tICs as
The decay timescales for each tICA mode, τa, are calculated empirically by the integration of the autocorrelation function obtained from the simulations21 and assuming that each mode is represented by a single exponential decay. This procedure should account for the barriers present along each tICA coordinate in, at least, a coarse manner.32,80 This time, τa, is, in general, different from the inverse of the eigenvalues λIC [Eq. (10)] because that time does not include the mode-dependent energy barrier. If one adopted the inverse of the eigenvalues λIC as the timescale of decay, the tcfs calculated from tICA would display an even faster and more unphysical decay than the one observed when including the mode-dependent energy barrier for tICA (see Fig. 7). Once a lag time is selected, we build the matrix Cr(τtICA) [Eq. (10)], and by diagonalization, we derive the eigenvectors and eigenvalues that enter Eq. (25).
The time correlation functions calculated from the tICs [Eq. (25)] are directly compared to the one from the simulation trajectory in Fig. 7. For each residue shown and for most residues across the primary sequence of ubiquitin, the tcfs predicted from the LE4PD-XYZ with hydrodynamics are in better agreement with the simulated tcfs than those predicted from the tICA or the LE4PD-XYZ without hydrodynamics.
The accuracy of the two approaches is quantified in Fig. 8, which shows the mean absolute error [⟨MAE(t)⟩] between the simulated and predicted C(t) for all the residues in ubiquitin. This metric of quantifying the distance between the “true” [simulated, C(t)] and “estimated” [predicted, (t)] is defined as
which is the average distance between the two autocorrelation functions (acf). M0 is the number of frames before an acf cutoff time, which is the time at which the acf first decays to a specified value. For example, using a cutoff of C(t) = 0.0 we calculate Eq. (26) over all points of the acf until the acf attains a value of 0.0 for the first time.
Figure 8 shows that for most amino acids, the error metric is lower for the anisotropic LE4PD with hydrodynamics compared to the tICA estimator of C(t). This figure shows data for four different cutoff times, indicating that the result is robust and is not affected by the choice of the cutoff time. Thus, using the anisotropic LE4PD with hydrodynamics gives a better prediction, on average, of the C(t) autocorrelation function compared to tICA (lag time 2.0 ns) across all the residues in ubiquitin.
One may assume that the disagreement of tICA with the simulated tcfs observed in Fig. 7 is related to the choice of the lag time and that choosing either a longer or shorter tICA lag time may give a better agreement in the tcf of specific bonds. This is, in fact, the case. Figure 9 shows how using a shorter lag time (2 ps) yields tcfs in good agreement with residues’ tcfs in the highly flexible Lys11 loop, especially at timescales less than 10 ns. Similarly, using a longer lag time (20 ns) gives tICs that agree well with the simulated tcfs of several residues in the 50 s loop, where the slowest fluctuations of the protein occur. This analysis supports the heterogeneity of ubiquitin dynamics and the concept that there may be a different optimal lag time for each different tICA mode, since one can locally optimize the residues’ relaxation in different regions by varying the tICA lag time.
VII. THE OPTIMUM tICA LAG TIME CORRESPONDS TO THE TIME THAT SAMPLES THE HIGHEST BARRIER IN THE FREE-ENERGY SURFACE
The tICA approach is general and applies to any time-dependent set of coordinates and any lag time, τtICA. After selecting the input coordinates to the tICA, which in this study are the coordinates of the fluctuations away from the average structure calculated over the MD trajectory, ΔR, there remains a single adjustable parameter: the observation lag time or τtICA. This time parameter is used to construct the time-lagged covariance matrix [see Eq. (9)]. The tICA modes illustrate the dynamics taking place over a timescale longer than τtICA, while dynamical phenomena that are faster are averaged out and cannot be detected. Thus, only selecting the proper lag time can lead to the correct sampling of the dynamical phenomena that one desires to study.21,24,25
The selection of τtICA is usually accomplished by performing the MSM analysis of multi-dimensional free-energy maps at different lag times. This step is followed by testing the results a posteriori to verify which lag time leads to the longest possible timescale and to the most efficient separation of those timescales. Here, we propose a procedure to select a priori the optimal tICA lag time. However, note that the traditional a posteriori verification of the optimal tICA lag time agrees with the 2 ns value used in our calculations (see Sec. S5 in the supplementary material).
We start here from the one-dimensional mode-dependent free-energy surface constructed as described in Sec. III B and select the optimum τtICA lag time as the one that samples the highest energy barrier in the FES. To illustrate how the choice of τtICA affects the tICA modes, Figs. 10 and 11 show how an increase in the lag time modifies the dynamics that the first tICA mode samples. The energy landscape displays two deep minima and two possible paths that connect the two minima at all the lag times studied. Both Figs. 10 and 11 display the two pathways in the two FES at the top and the bottom of each panel, respectively. The protein configurations that populate the two pathways are shown in the middle of the panel: the configurations on the left (blue-white-red) represent the path in the top FES (blue-white-red path). In contrast, the configurations on the right (pink-yellow-light blue) represent the path in the bottom FES (pink-yellow-light blue path). While at all the time lags studied, the dynamics of the protein involves mostly fluctuations in the C-terminal tail, at an increasing lag time, the fluctuations in the tails become less pronounced, and new fluctuations start to appear in the Lys11 loop and in the 50 s loop.
The mode-dependent FES look qualitatively similar at τtICA smaller than 20 ps, and at τtICA larger than 2 ns. If we report the barrier height of the red-white-blue pathway between minima in Figs. 10 and 11 as a function of the lag time (see Fig. 12), we observe that when the FES is calculated at increasing τtICA, the barrier height increases until τtICA ≈ 2.0 ns when it starts decreasing. Figure 12 also reports the calculated Markov State Model (MSM) time, t2, which is given by the projection of the second MSM eigenvector as described in Sec. II. t2 is the time needed by the system to cross the barrier and shows a nice correlation with the barrier height for increasing tICA lag times. This analysis agrees with the concept that the best tICA lag time is the one that leads to the slowest dynamics, and hence the longest timescales, in a MSM analysis.
Intuitively, the non-homogeneity of ubiquitin’s dynamics when changing the tICA lag time, observed in Fig. 12, seems associated with the well-known hierarchical energy landscape of proteins in the folded state.3,9 At short lag times, the tICA is sampling faster dynamics than at large lag times. Fast fluctuations cross small barriers along the pathway while sampling the energy landscape. As the tICA lag time is increased, the analysis picks up slower fluctuations, with a corresponding increase in predicted timescales and barrier heights. The fall-off in t2 (and in barrier heights) at longer tICA lag times is likely due to a loss of statistics as the lag time is made large and the system makes direct “hops” between deep minima, thus avoiding the sampling of the barriers. Because the t2 from the MSM is being reported as the timescale of the slowest processes found by the tICA, and at both long and short τtICA there are no large barriers sampled, the tICA coordinates, which are unit-free and do not encode lengthscales, return a similar quadratic or barrier-free surface to the MSM analysis.
To conclude this section, we observe that for each mode, our procedure identifies the optimal mode-dependent tICA lag time using the height of the energy barriers. Thus, different tICA modes are likely to have different optimal lag time so that the definition of an optimal lag time cannot be unique. We adopt for the optimal lag time the one measured in the slowest tICA mode.
VIII. A COMPARISON BETWEEN 1D AND 2D MAPS OF tICA MODES
In this section, we compare the outcome of this study with the results of the analysis performed using the conventional procedure, which combines a two-dimensional, or multidimensional, tICA free-energy map with an MSM analysis of the kinetics.8,24,25,81–84 In what has become a fairly typical workflow for the analysis of MD simulations of biomolecules using Markov state models, the MD trajectory is projected onto not just one mode but a number n of the slowest tICA modes. This procedure reduces the high dimensionality of the original free-energy landscape by identifying the slowest dominant modes. One then performs an MSM analysis on the reduced subspace to parse the slowest dynamics and corresponding timescales of the system.22,24,25,62,84–86 Usually, one selects the two slowest modes, but in some cases, one considers instead more than two tICA modes: the latter procedure may lead to even slower measured kinetic timescales.25 This is because the transitions among the selected modes can become even less probable, while statistical insufficiencies in the necessarily finite simulation data can also play a role. Note that if the tICA modes were fully uncoupled, as we would like them to be, there would not be transitions between the modes (or the time for the transition would be infinite). Transitions between tICA slow modes are rare and represent the slowest dynamics in the MD trajectory. See also Sec. S1 in the supplementary material for an analysis of the independence of the tICA modes.
Here, we report the results of this type of “traditional” tICA-MSM approach for the 1-µs ubiquitin simulation. We build the MSMs on the space spanned by the first two tICs calculated from the ubiquitin simulation at a tICA lag time of τtICA = 2 ns, the same as that used in the surfaces presented in Sec. III C. Then, we compare the results of this model to the analysis performed on the single-mode projections, namely, the FES of a tIC, and the LE4PD-XYZ mode-dependent FES, presented earlier in this paper.
Figure 13 shows that the free-energy surface spanned by the first two tICs has two “lobes” having two minima each. When a MSM is constructed on this surface, it predicts that the slowest motion spanned in this two-dimensional space is a transition between the two lobes, i.e., the transition between the two tICA modes, as can be seen by an examination of the spectrum of ψ2 projected onto the free-energy surface (Fig. 13, top left panel). The MSM predicts that the transition between the two lobes occurs over a timescale of ∼200 ns. Tracing a pathway between the two deepest minima in each lobe, using the same method we utilized for the surfaces shows that inter-lobe transitions correspond to dynamics in the 50 s loop of ubiquitin. The second-slowest relaxation processes on the surface spanned by the first two tICs correspond to movement between the intra-lobe minima on the right-hand lobe (Fig. 14). The MSM predicts that this transition occurs over a timescale of ∼70 ns and that the transition causes motions in the Lys11 of ubiquitin. Note that the same slow fluctuations are identified by the single mode analysis of tICA and by the LE4PD models.
Thus, although the timescales predicted using the space spanned by the first two tICs are slightly slower than using the surfaces for the first 2 tICs individually (∼40 vs 24.0 and ∼20 vs 10.7, respectively), the timescales are still within a factor of 2 in both cases, and the qualitative dynamics are predicted to be similar from both methods (comparing Figs. 2 and 3 with Figs. 13 and 14). Thus, the single-mode analysis of tICA, the two-mode analysis of tICA, and the single-mode analysis of LE4PD and LE4PD-XYZ give consistent results when identifying, in one simulation trajectory, the local fluctuations for the slowest processes.
We report in Sec. S7 of the supplementary material the analysis of the two-dimensional map of the two slowest LE4PD-XYZ modes. Analysis of this map confirms the previous calculations for the single-mode LE4PD-XYZ analysis. Because the LE4PD modes are slightly more coupled than the tICA modes, the energy barriers between the two slowest modes are smaller than in the tICA case.
IX. DISCUSSION AND CONCLUSIONS
Atomistic MD simulations of proteins have been shown to describe with accuracy relevant biological processes. However, the leading behavior that guides the properties in biological systems, including the slow cooperative dynamics involved in protein binding and function, is often hidden by the broad spectrum of phenomena and the multitude of atomistic details displayed in simulations. Many studies have begun to rely on machine learning techniques to distill the essential leading kinetic information from MD trajectories. The most straightforward and widely used analytical tool in machine learning is the time-lagged independent analysis or tICA, where the covariance matrix samples fluctuations at a given lag time, τtICA. One can obtain a careful analysis of the kinetics of slow fluctuations by combining the tICA with the MSM analysis.24,25 Similarly to tICA, the LE4PD and LE4PD-XYZ approaches accurately model nonlinear protein motions by partitioning the MD dynamics into a quasi-independent diffusive mode while including the mode-dependent free-energy surfaces.34–36,44,52 The LE4PD is an isotropic representation of protein dynamics in a lab reference frame, analogous to the celebrated Rouse–Zimm equation for synthetic polymer dynamics,41,42 extended to consider the protein’s hydrophobic core and the sequence-dependent hydrodynamic interaction. The LE4PD-XYZ is the anisotropic version of the previous equation of motion in the protein’s center-of-mass reference system, where rotation and translation are removed. The LE4PD-XYZ represents the anisotropic fluctuating dynamics of the proteins around its average structure.
Among deep learning methods, more sophisticated approaches than tICA have been proposed to model nonlinear effects in protein motions, such as kernel tICA,87 state-free reversible VAMPnets,88 and time-lagged autoencoders (TAEs).81,89 However, the use of such deep learning approaches to modeling nonlinearities in dynamics often comes with an increased computational cost, paired with a loss of physical intuition for the system under study. Thus, the tICA coordinates are considered, in general, the optimal linear approximation to the order parameters for relevant slow processes in proteins’ dynamics.21,24,60–62 However, the tICA modes and other slow processes identified by machine learning lack information on their physical origin, having no associated equation of motion.
In this study, we compare tICA predictions with both the isotropic LE4PD and with the anisotropic LE4PD or LE4PD-XYZ. To do so, we associate with each tICA mode a free-energy landscape obtained by the eigenvector projection of the simulation trajectory onto the tIC modes. This representation is convenient because it allows one to analyze the tICA’s predictions based on the time evolution of fluctuations onto the mode-dependent free-energy landscape. Because this representation depends on a single mode, it is free from the need to decide the number of modes to consider when building a free-energy map, as one usually does.
LE4PD and tICA agree in identifying the regions in the protein’s primary sequence that undergo slow dynamics. There are three regions of slow dynamics for ubiquitin, namely, the 50 s loop, the Lys 11 loop, and the C-terminal tail. All of them are known to be involved in ubiquitin’s multiple biological functions.45–47,64 Because the primary sequence of ubiquitin is highly conserved in the family of proteins with a similar function, we expect the processes identified in our study to be kinetically and thermodynamically robust and that similar mechanisms are likely to guide the binding of other proteins that perform the same function.
Both tICA and LE4PD consistently identify the leading slow modes. However, while tICA tends to collect all the slow processes in the first or a few modes, the LE4PD provides a more detailed picture of the time- and length-dependence of the slow dynamics, which are partitioned into a larger number of modes. Thus, if one aims at identifying the slowest fluctuations in one mode, tICA may be more efficient than the LE4PD if τtICA is opportunely selected. A detail to note is that when the tICs are sorted in descending order of decorrelation time (i.e., in the order of the tICA eigenvalues), the relative timescale may change after the free-energy barriers are accounted for through the Markov state modeling so that the slowest tICA mode could be different from the first tICA mode.32,34,44
In general, the tICA captures the slow fluctuations that occur at a timescale longer than the given tICA lag time, while faster dynamics are averaged out. The LE4PD method, instead, which is based on the solution of a “bead-and-spring” model of macromolecular dynamics, provides detailed information on the dynamics at the different length scales. It follows that the LE4PD is accurate in reproducing the time decay of amino acid fluctuations at all timescales when the dynamics is represented by the time correlation functions calculated from the simulation trajectory (Figs. 7 and 8). The similar calculation performed using tICA modes is, with a few exceptions, much less accurate (Figs. 7 and 9).
The tICA’s lack of accuracy in the description of the time dependence of the fluctuation decorrelation as described by the simulated tcfs is not surprising because the tICA averages out the information at times shorter than the lag time. Setting a lag time for tICA affects the modality of sampling the dynamics in the free-energy landscape. These considerations lead us to perform a study on how the tICA lag time affects the properties analyzed by tICA, namely, the slow fluctuations and the calculation of the time correlation functions. For example, if the lag time is too short or too long, the tICA cannot properly sample the free-energy barriers. Thus, we propose and test several methods to evaluate a priori an optimal lag time. The optimal τtICA calculated with the different methods is fairly consistent.
We also observed an almost quantitative agreement between the time correlation functions directly calculated from the simulation and the ones obtained by solving the LE4PD-XYZ equation when hydrodynamics is included. This result confirms the importance of hydrodynamics in the Langevin dynamics of proteins in solution, which is not surprising given that the Langevin is an equation of motion in the protein’s reduced coordinates, where the effect of the solvent enters through friction, random forces, and hydrodynamic interactions. Thus, the hydrodynamic forces that enter the LE4PD equations result from the projection of the forces due to the solvent and the protein’s atomistic fast degrees of freedom onto the reduced coordinates of the alpha carbons. Finally, hydrodynamics is more important for modes that are local, while large-scale fluctuations and slow modes are less affected.
In conclusion, if a rapid identification of the leading slow dynamics is required, the tICA analysis is a practical and valuable strategy to collect that information. However, suppose the time propagation of the slow leading dynamics is of interest. In that case, the LE4PD-XYZ with hydrodynamics provides a more accurate representation of the slow processes based on its superior ability to reproduce the protein’s dynamics at all times.
Note that different tICA modes are likely to have different optimal lag times. Thus, in principle, one cannot find one lag time that is optimal for the whole multiscale dynamics of the protein. In fact, in general, one observes that the barrier height tends to decrease with the increasing mode number, as the mode-dependent dynamics become increasingly more local and less cooperative.44 Because defining the lag time in tICA implies that dynamics at shorter lag time are not detected, the dynamics on the more local modes may not be correctly represented.
SUPPLEMENTARY MATERIAL
The supplementary material presents in Sec. S1 a study of the independence of the tICA and LE4PD modes, while Sec. S2 shows the stability of the tICA modes as a function of the selected lag time. To most efficiently identify the LE4PD-XYZ modes that best overlap to tICA modes, Sec. S3 in the supplementary material presents some additional methods to the ones described in Sec. V of the text, including the projection of the slowest tICA mode onto the LE4PD-XYZ modes with and without hydrodynamics interactions. Section S4 presents the quantitative comparison of the mode-dependent fluctuations for the different LE4PD methods and tICA. A selection of alternative methods to calculate the optimal tICA lag time is summarized in Sec. S5. Section S6 presents a brief overview of the isotropic LE4PD: the calculations for this model compare in several sections of the text together with the anisotropic LE4PD-XYZ’s data. Section S7 shows the two-dimensional free-energy maps for the two slowest modes of the anisotropic LE4PD-XYZ. Finally, the supplementary material document concludes in Sec. S8 with a presentation of the molecular dynamics methodology we used to simulate ubiquitin and the post-processing of the trajectory, followed by Sec. S9 with a brief overview of the Markov state model method.
ACKNOWLEDGMENTS
E.R.B. was supported by the John Keana Graduate Student Fellowship from the University of Oregon and by the National Science Foundation through Grant Nos. CHE-1665466 and CHE-2133464. The computational work was performed on the supercomputer Comet at the San Diego Supercomputer Center, with the support of XSEDE90 (Allocation Grant No. TG-CHE100082) (XSEDE is a program supported by the National Science Foundation under Grant No. ACI-1548562).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts of interest to disclose.
DATA AVAILABILITY
The codes used to perform the isotropic LE4PD and the anisotropic LE4PD-XYZ analyses described here are available on GitHub (https://github.com/GuenzaLab). The processed MD trajectory and trajectories for the first ten LE4PD-XYZ modes both with and without hydrodynamics included are available on Zenodo (https://doi.org/10.5281/zenodo.4312224).