Local fluctuations are important for protein binding and molecular recognition because they provide conformational states that can be trapped through a selection mechanism of binding. Thus, an accurate characterization of local fluctuations may be important for modeling the kinetic mechanism that leads to the biological activity of a protein. In this paper, we study the fluctuation dynamics of the regulatory protein ubiquitin and propose a novel theoretical approach to model its fluctuations. A coarse-grained, diffusive, mode-dependent description of fluctuations is accomplished using the Langevin Equation for Protein Dynamics (LE4PD). This equation decomposes the dynamics of a protein, simulated by molecular dynamics, into dynamical pathways that explore mode-dependent free energy surfaces. We calculate the time scales of the slow, high-amplitude fluctuations by modeling the kinetics of barrier crossing in the two-dimensional free energy surfaces using Markov state modeling. We find that the LE4PD predicts slow fluctuations in three important binding regions in ubiquitin: the C-terminal tail, the Lys11 loop, and the 50 s loop. These results suggest that the LE4PD can provide useful information on the role of fluctuations in the process of molecular recognition regulating the biological activity of ubiquitin.
Fluctuation dynamics allow proteins to modify their shape and efficiently sample conformational states that are potentially useful to perform their biological function.1 Large scale fluctuations can be precursors to unfolding2 because these fluctuations involve slow cooperative rearrangements of large portions of the protein; these cooperative motions are thought to guide and define the most relevant kinetic pathways of a protein, identifying reaction coordinates that can be important, for example, in substrate binding,1,3,4 product release,5 regulation,1 and allostery.6–8
Local fluctuations, occurring at a precise length scale, are supposed to be relevant in identifying regions of the protein likely involved in molecular recognition.3,9 Following the hypothesis of the conformation-selection model by Monod-Wyman-Changeux (or the MWC model),10 local fluctuations along the primary sequence of a protein provide information on the propensity to bind other molecules at the given segment of the protein’s primary sequence.
This hypothesized correlation between local fluctuations and binding lies at the foundation of the MWC model, where a substrate selects among a large ensemble of conformations the one that is geometrically and energetically most favorable to binding.11 Thus, spontaneous fluctuations are expected to occur even in the absence of a binding partner3,12 so that the modeling of spontaneous local fluctuations of an isolated protein may provide essential information on the kinetic mechanisms of protein binding.
These local fluctuations involve internal deformations of the protein, which require surmounting a free-energy barrier. A dynamical study of spontaneous fluctuations will likely uncover the length- and time scales over which these fluctuations occur, thus potentially highlighting the characteristic spatial and temporal parameters that set the limits for the binding process. Sometimes, internal fluctuations occur on a time scale that is comparable in magnitude with the slowest time scale of protein relaxation (for example, rotational diffusion),13 indicating that those internal motions can be important participants in the mechanisms of molecular recognition.
We study here the emergence of local fluctuations along a protein’s primary sequence and the length- and time scales associated with them, starting from the coarse-grained (CG) Langevin Equation for Protein Dynamics (LE4PD). The protein investigated is the regulatory protein in eukaryotic cells ubiquitin, which is most notable for its ability of post-translationally modifying other proteins through the process of mono- or polyubiquitination,14,15 being a key player in a number of important biological functions.14,16–19
In the process of ubiquitination, two ubiquitins bind by forming an amide bound between the carboxyl group at the C-terminus and the ε-amino group of a lysine amino acid. The reaction is catalyzed by a number of enzymes. The protein has seven lysines, and the length and shape of the chain of ubiquitins depend on which lysine in the protein participates to the binding of the C-terminus of another ubiquitin. Other parts of ubiquitin are also involved in other binding processes.14,16,20 Following the hypothesis of the MWC model of binding, local fluctuations along the primary sequence of ubiquitin provide information on the propensity of the protein to form bonds at a specific amino acid site; the height of the barriers and the kinetics of crossing these barriers will provide information on the time scale of binding. The different time- and length scales related to fluctuations in different parts of the protein may be useful in the reaction mechanism to discriminate among the different binding sites. Because ubiquitin has a highly conserved primary sequence in the family of proteins that have similar functions (for example, the primary sequence of ubiquitin has only a few residues that are different in animals, yeasts, and plants), we expect the mechanisms that guide these processes to be kinetically and thermodynamically robust.
The LE4PD approach effectively projects the dynamics of a protein onto a coarse-grained (CG) description where the protein is represented by a collection of vectors connecting pairs of α-carbons (Cα). The method starts with a molecular dynamics (MD) simulation of a protein in physiological conditions in the canonical (NVT) ensemble, where data were collected from a 1-μs equilibrium simulation. Then, it decomposes the MD dynamics by projecting the trajectory, which represents the complex dynamics of a protein with motion coupled across multiple length- and time scales, onto quasilinearly independent LE4PD normal mode coordinates derived from the CG description. While here we use MD simulations in the canonical ensemble, the LE4PD equation is equally useful when starting from other ensembles, statistical averages derived experimentally (e.g., NMR conformational ensembles21,22 or a set of X-ray crystal structures), or by Monte Carlo simulations. The conversion into normal mode coordinates yields a description of the local fluctuations that is largely uncoupled and can be analyzed independently: each normal-mode trajectory encapsulates the dynamics occurring at a selected length- and time scale in the simulation. The real-space dynamics can be reconstructed a posteriori from a linear combination of the normal modes.
When the simulated dynamics is projected onto the LE4PD normal modes, a free energy map of the conformational space is generated for each mode. The mode-dependent free energy maps display complex landscapes with energy barriers and unique pathways between minima on the surface. To calculate the time scale of transition between minima in the Free Energy Surfaces (FES), we combine, here for the first time, the LE4PD normal mode description with a mode-dependent Markov State Model (MSM) analysis of the dynamics. MSMs have been applied to the study of the kinetics of a wide range of biologically relevant systems, providing a reliable analysis of the dynamical pathways.23–28 Here, we propose a refined MSM method for the determination of the slow kinetic transitions between minima in each free energy map. Using this approach, we evaluate the characteristic time of transition between two well-defined energy minima in LE4PD mode-dependent FES of ubiquitin.
Decoupling the real dynamics by decomposition into independent normal modes is similar in purpose to the Principal Component Analysis (PCA) and the time-lagged Independent Component Analysis (tICA) approaches, which have been previously used in conjunction with MSM.29–31 Interestingly, because of their direct connection with the physical picture of the system, the LE4PD modes identify the contributions that arise from the type of amino acids and their local flexibility, hydrodynamics, and friction within the protein.2,13,21,22 Thus, the projection onto the LE4PD diffusive normal modes provides a detailed physical interpretation of the dynamics of the protein: in a given time window and at a given space scale, fluctuations occur on well defined fragments of the protein primary sequence, or, equivalently, one can see how different parts of the protein become dynamically active (fluctuate) on different time scales. This study shows that the newly proposed LE4PD-MSM method allows for the careful and accurate evaluation of mode-dependent local fluctuations and kinetic pathways.
II. FROM MOLECULAR DYNAMICS SIMULATIONS TO A LE4PD NORMAL MODE DESCRIPTION
The LE4PD is a linear Langevin equation of motion for a set of coarse-grained units (or beads) located at the position of the alpha-carbon along the primary sequence of a protein.2,13,21,22,32 The LE4PD is expressed as a function of the time-dependent Cα–Cα bond coordinates, , which allows one to discard the center-of-mass diffusion, irrelevant in the study of the internal dynamics of a protein. The LE4PD equation of motion for the bond vector i is
where kB is the Boltzmann constant, T is the temperature in Kelvin, and l2 is the mean-square bond length, which in this model is the mean-square peptide bond length.
is the stochastic force acting on bond i at time t, which is governed by a white-noise fluctuation-dissipation relation,
with and where ⟨⟩ defines the statistical average of a quantity. Here, α, β denote the Cartesian indices; δij and δ(t − t′) are the Kronecker delta and Dirac delta function, respectively. The matrix a is the matrix that transforms from bead coordinates to bond coordinates, while L is the matrix that contains the hydrodynamic interaction and U is the inverse of the bond correlation matrix , which provides information on the protein local flexibility along its primary sequence. For convenience, we define .
The statistical averages that enter the LE4PD hydrodynamic and structural matrices, as well as the amino acid friction coefficient, are calculated from trajectories of atomistic MD simulations of the protein in aqueous solvent at physiological conditions. More details on the LE4PD and on the MD simulations of ubiquitin, studied in this paper, are reported in the supplementary material.
The LE4PD equation of motion is solved by matrix diagonalization to recover the linearly independent mode representation of the dynamics. The eigenvalues of the diffusive matrix for mode a are defined as , while the normal modes are , with Q being the matrix of the eigenvectors that diagonalize the LE4PD equation [Eq. (1)]. The modes span the same space as the bond vectors with near linearity as with . Starting from the normal mode solution of Eq. (1), one can calculate any structural and dynamical property of interest. A comparison of real-space structural properties predicted by the LE4PD to those calculated directly from the simulation trajectory demonstrates the accuracy of the Langevin mode description.13
The LE4PD normal modes, , describe the dynamics of a protein over a given length scale. For a protein possessing N CG sites, the LE4PD gives N − 1 coupled equation of motion in the bond coordinates and N − 1 uncoupled equations of motion in normal modes. The modes are ordered by descending time scale, with the lowest index mode describing the slowest, highest-amplitude motion and the highest index modes describing the fastest, lowest-amplitude motions. For a well-folded protein, the first three modes describe the rotational diffusion tensor of the protein;13 these are the global modes of motion. The remaining N − 4 modes describe internal deformations of the protein and will be referred to as internal modes.
We then project the MD trajectory for each bond vector onto the LE4PD mode coordinates as , thus yielding a new trajectory in the mode coordinates. Then, an energy map for each mode is built by calculating the histogram of the trajectory projected onto the mode coordinates expressed in real space.13 The map of the free energy is reported as a function of the two spherical angles in polar coordinates: θ (describing the inclination relative to the z-axis of the simulation box) and ϕ (describing the azimuthal angle in the xy-plane of the simulation box). Note that the contour plot of the energy as a function of the modulus of the polar vector does not show relevant features and is not reported even if its contribution enters the normalization of the probability and thus in the evaluation of the mode-dependent free energy. The latter is given by the logarithm of the normalized probability of each molecular configuration in mode coordinates. This procedure yields N − 1 mode-dependent FES.
As an example, Fig. 1 displays the free energy surface of mode seven for ubiquitin. The contour map of the free energy surface displays interesting features with localized, deep minima and distinct reaction pathways between them. By applying a version of the string method,33,34 two possible pathways emerge between the two minima, given the circular symmetry of the angle ϕ. The transitions between two minima in the free energy landscape represent mode dependent local fluctuations, associated with a time scale for crossing the related energy barrier, or transition time. A more in-depth description of these local fluctuations is presented in Sec. III.
III. LOCALIZED FLUCTUATIONS IN UBIQUITIN LE4PD MODES
Ubiquitin is a regulatory protein present in eukaryotic cells, whose post-translational modification is involved in multiple biological functions.14,35 An important function of ubiquitin is to tag misfolded proteins and to signal them for degradation via the proteasome. For this reason, ubiquitin is called the “molecular kiss of death.” Degradation happens through a number of steps where first ubiquitin binds to the misfolded protein by a reaction that is catalyzed by ubiquitin ligases, and once the first ubiquitin binds, these signals the ligase for further binding of the ubiquitin proteins to form a polyubiquitin chain. The polyubiquitin chain finally binds to the proteasome, which ultimately degrades the misfolded protein: thus, ubiquitin molecules form a chain that connects the misfolded protein to the proteasome.36 The length and shape of polyubiquitin are important for the successful protein degradation.20
Binding of ubiquitin to a second protein by mono- or polyubiquitination occurs by formation of an amide bond between the carboxyl group of the last amino acid in the C-terminal tail and either the ε amino group in the side chain of a lysine residue or, alternatively, the amino group in the N-terminus. Ubiquitin itself has seven different Lys groups that can bind to the C-terminus of another ubiquitin: Lys6, Lys11, Lys27, Lys29, Lys33, Lys48, and Lys63 (see Fig. 2). The selection of the different Lys groups for binding yields different three-dimensional structures for the resultant polyubiquitin and supports different biological processes.20,37 For example, the formation of polyubiquitin that leads to protein degradation is initiated by the binding of the C-terminus of the second ubiquitin to either Lys48 or Lys29 in the first ubiquitin, i.e., the one directly bound to the protein that is being degraded. Binding of ubiquitin to Lys63, instead, leads to polyubiquitin chains that are important for other functions generally related to crossing of a membrane, including endocytosis, membrane trafficking, and signal transduction.14,19
Furthermore, it has been shown that ubiquitin may interact with other proteins through noncovalent binding and that this noncovalent binding involves conserved regions of the protein: the hydrophobic Ile44 patch on the surface of ubiquitin binds noncovalently to a number of ubiquitin binding domains, for example the ubiquitin interacting motif (UIM) and motif interacting with ubiquitin (MIU).37
Thus, given the complexity of the possible kinetic pathways involved in its binding, an accurate evaluation of the dynamics of ubiquitin is potentially enlightening in determining how this protein carries out its biological function: fluctuations may be important in characterizing the binding propensity of different regions of ubiquitin’s primary sequence in agreement with the theory of conformational selection, which postulates that a protein will sample all its relevant binding conformations even in the absence of any binding partners.1,38 As demonstrated below, the LE4PD coupled with the MSM is effective at describing dynamics at localized sites of ubiquitin involved in both covalent and noncovalent binding.
For a given LE4PD mode, the transition between two minima in the FES corresponds to well-defined local fluctuations. For each bond and a specified mode, a, the amplitude of the bond fluctuations is calculated as the Local Mode Length (LML) scale,
is the mean-squared projection of mode a onto the ith Cα–Cα bond.
Examples of Li,a are reported in Figs. 3 and 4 for a LE4PD analysis applied to a 1-μs equilibrium simulation of ubiquitin. Figure 3 displays the LML for the first ten LE4PD internal modes: those are the slowest modes that contain crossing of high energy barriers and possibly rare events. For the high-amplitude, slow global modes in Fig. 3 fluctuations are located mostly in the C-terminal tail (residues 71–76) of ubiquitin, which is involved in polyubiquitination.14 In the slower modes, smaller amplitude fluctuations are also located in the Lys11 loop (residues 6–11), while those fluctuations become dominant only for the faster and more local LE4PD modes 7, 11, and 13.
An interesting exception to the observed trend is LE4PD mode 9, which shows a large amplitude fluctuation in the stretch between residues 51 and 63; this segment is known as the 50 s loop and has been shown to be a binding site for the A20 zinc-finger binding motif.18,39,40 The specific behavior of this LE4PD mode will be elaborated more below because it is an example of the LE4PD’s ability to predict dynamics in binding regions of ubiquitin and the MSM’s ability to describe accurately the slow kinetics along the LE4PD mode’s FES.
Figure 4 shows the LML for a number of high index modes. The LML for these internal modes displays delocalized low-amplitude fluctuations across ubiquitin’s primary sequence. Dynamics on these local energy maps does not involve transition between deep wells on the free-energy landscape but rather diffusion over a rough landscape. Interestingly, this is not observed for the very last modes (modes 73–75), which describe highly localized fluctuations along the backbone that are sensitive to the chemical specificity of ubiquitin’s primary sequence (see more details in the supplementary material).
The analysis of the LML suggests that fluctuations do not appear uniformly in all the modes but that their localization along the primary sequence of the protein is specific to the mode number. Low index modes show fluctuations that are local in space and that occur by transition along a pathway between well-defined energy wells. Intermediate-index modes, instead, show an almost stochastic spreading of the fluctuations and delocalization along the primary sequence of the protein. The FES of high-index modes show a well-defined, highly conserved, localization corresponding to crossing of local energy barriers of order kBT.
Each mode-dependent FES is associated with a time scale characterizing the fluctuations described by that mode. Thus, it appears that the different binding regions, including different lysine residues, on the surface of ubiquitin are kinetically nonequivalent: this suggests preferential binding depending on the time scale of the kinetic processes involved. To calculate the time scale associated with the crossing of the free-energy barrier for a given fluctuation, we apply an MSM analysis as illustrated in Secs. IV–VI.
IV. MARKOVIAN AND NON-MARKOVIAN KINETICS OF THE MODE-DEPENDENT LE4PD FLUCTUATIONS
The trend observed in the mode-dependent fluctuations of Figs. 3 and 4 is indicative of the structure of the related FES. Thus, one observes that low-index LE4PD modes correspond to large wavelength processes and identify slow, cooperative dynamics of the amino acids along the primary sequence. For those slow, cooperative motions, free energy barriers are large enough and crossing is often a rare event. Thus, for the low-index modes, kinetic transitions are uncorrelated, a Markovian statistic applies, and the mode-dependent dynamics can be analyzed by MSMs.
As one moves to consider higher-index LE4PD modes, the dynamics becomes faster and the height of the free-energy barrier decreases.2,21 It was previously shown that the overall scaling behavior of the height of the average energy barrier scales with the mode number as , and with the characteristic mode length, , as .2 These characteristic scaling exponents are consistent with a dynamical process where the constant fluctuations in hydrogen bonding are a source of energetic disorder in the Hamiltonian of a protein, thus supporting the mapping of protein dynamics onto the Kardar-Parisi-Zhang model.2,21
Thus, the Markovian nature of the kinetics for each LE4PD mode depends on the mode that is under study. The decrease in the height of the energy barriers with increasing locality of the dynamics renders progressively more problematic the MSM analysis of the FES. For the high index modes, the free-energy barriers are not large enough and the trajectory cannot sample low free-energy regions for a sufficient length of time to completely lose memory of the previous transitions. For those modes, the dynamics is not Markovian independent of the lag time that is selected, and the MSM approach does not apply. For these high-index modes where the roughness of the landscape dominates the dynamics, it is appropriate to adopt Kramers’ diffusive renormalization of the friction coefficient, where the energy barrier is calculated through the Median Absolute Deviation (MAD) measure of the average energy roughness.21,41
In the MAD measure, the free-energy barrier is calculated as
with being the free-energy barrier predicted for mode a using the MAD and Emin,a being the global free-energy minimum of mode a.
Using the measured MAD barrier, we defined an effective friction coefficient that yielded a slowing down of the dynamics calculated with Kramers’ theory of diffusive barrier crossing.2,13,22 By assuming that the modes diffuse along their respective FES and that barrier crossing is a thermally activated process, the friction coefficient can be rescaled as , leading to a slowed down decay time for each mode: , with .
The use of this rescaling procedure for the mode-dependent time improves the agreement between the bond time autocorrelation function, calculated from the LE4PD theory, and the same function directly calculated from the MD simulations.2,13,22 The criterion that we adopt to establish the range of LE4PD modes where the MSM applies, i.e., the Markovian nature of the kinetic transition, is a standard procedure based on the Chapman-Kolmogorov condition, as illustrated in the supplementary material. For the protein ubiquitin studied here, we observe that for the tenth LE4PD internal mode and higher, a MSM analysis becomes impossible, and transition times are calculated by simple Kramers’ rescaling of the LE4PD mode-dependent time using the MAD determination of the average energy barrier. This procedure appears to be accurate also because the weight of each mode in the calculation of any time correlation function decreases with increasing LE4PD mode number: the possible error due to a less accurate evaluation of the high-index modes (by MAD instead of MSM) plays a less significant role in the calculation of the dynamical properties of the protein in the form of time correlation functions.
V. A MARKOV STATE MODEL FOR THE ANALYSIS OF THE MODE-DEPENDENT FREE ENERGY SURFACE
The MSM method models the kinetics of a molecular process as a Markov chain of uncorrelated jumps among conformational states (more details are available in the supplementary material). The mode-dependent trajectory is first partitioned into a finite number of discrete states, W, using the k-means++ clustering algorithm, as implemented in PyEMMA.30 Once a lag time τ is selected, the probability of transition between different microstates is calculated and stored in the transition matrix, T(τ). Thus, the evolution of the probability for the system to occupy a discrete state at a given time t, p(t), follows23
where the matrix T(τ) is calculated from the simulation trajectory using the reversible maximum-likelihood estimate,42
with cij = cij(τ) being the ijth element of the count matrix, which keeps track of all the transitions from state i to j in the trajectory at a lag time τ. We define ci = ∑jcij the ith row sum of the count matrix, giving the total number of observed transitions from i, while πi is the stationary (equilibrium) probability of state i. This definition of T(τ) satisfies detailed balance and implies reversibility of the kinetic process.
The right eigenvectors of T(τ) are solutions to the eigenvalue equation
Since T(τ) is a regular, stochastic matrix, the Perron-Frobenius theorem guarantees that λ1(τ)[MSM] = 1 is the maximum eigenvalue of T(τ), and its corresponding eigenvector, ψ1, has only positive entries.43 The other eigenvalues obey the condition that for i > 1, one has 0 < λi(τ)[MSM] < 1. Given the definition of the implied time scales, ti,
one observes that all the processes decay in time, except the one corresponding to the first eigenvalue (t1 = ∞). The first left eigenvector describes the stationary distribution of the configurational states, while the eigenvectors with an index higher than one describe kinetic transitions occurring at increasingly smaller time scales. The second eigenvalue, λ2(τ)[MSM], gives the time scale, t2, associated with the slowest internal motion of the protein from the given LE4PD dynamical mode. It is precisely this time scale t2 that is of interest here because it describes the slowest kinetic process occurring on a FES. Following the procedure presented in Sec. VI, time t2 is selected to correspond to the kinetic transition between the two minima in the FES of the slow LE4PD modes.
VI. A KINETICALLY INFORMED DETERMINATION OF THE MODE-DEPENDENT LE4PD TRANSITION TIME
We calculated the mode-dependent transition time for crossing a free energy barrier, τ, starting from the spectrum of the second right eigenvector, ψ2, of T(τ).44 This eigenvector has been called an “ideal reaction coordinate” because (i) it corresponds to the slowest, nonstationary process of the Markov state model, where the slowest dynamics is supposed to be representative of rare events, and (ii) it has been shown to be a good approximation to the discrete committor function45 [see Eq. (9)], which gives the probability of the trajectory to visit the product state before the reactant state, when initiated at any of the discrete states outside of the reactant and product regions44,46 (as an example of ψ2 acting as a committor function, compare Figs. 6 and 8).
To identify the relevant well-to-well transition time, we performed a series of calculations at increasing lag time, τ, and inspected the structure of the second eigenvector (ψ2) projected onto the free-energy surface. The transition time was selected such that the discrete states with the most positive projection along ψ2 were located in the deepest well in the free-energy surface (the “product” state), while the discrete states with the most negative projection were located in a second, well-defined free-energy well (the “reactant” state). This selection criterion was adopted because the goal is to construct kinetic models for each LE4PD mode describing the barrier-crossing events between the two most highly populated regions of the free-energy surface for each LE4PD mode. An example of the resulting ψ2 spectra is reported in Figs. 5 and 6, where the free energy maps for modes 4, 5, 7, and 9 are presented on the left panels and the discrete states for the second eigenvectors are superimposed on the FES in the right panels.
Figure 7 reports the kinetic parameters measured using the MSM method combined with the “kinetically informed” procedure and a comparison of these times with the results of the Kramers’ rescaling procedure with the MAD determination of the energy barrier, henceforth referred to as the “LE4PD-MAD” approach. Note that while the transition time of the slowest modes, which have pronounced and localized energy barriers, is calculated with the kinetically informed MSM procedure, for modes larger than 13 (the eleventh internal mode and higher), MSM cannot be applied and the LE4PD-MAD procedure is used. Using the proposed method to construct the MSM for each LE4PD mode, the kinetics predicted by the discrete model, which corresponds to the transition time between wells on the surface, can be seen to have a similar trend in the time scales to the LE4PD-MAD approach, while the LE4PD with the kinetically informed MSM generally predicts slightly larger time scales. For a given normal mode, the kinetically informed “LE4PD-MSM” procedure appears to provide a more accurate determination of the implied time scale.
It is worth noticing that if a too long lag time is selected in the kinetically informed MSM method, the discrete states corresponding to the minimum and the maximum projections along the second eigenvector are empirically found to no longer lie within wells in the FES (see the supplementary material). Thus, although long-time processes become naturally uncorrelated at large lag time, and thus Markovian, the characteristic time scale of a given fluctuation has to correspond to the transition between two well-defined energetic minima. Finally, we note that the eigenvalues of the transition matrix [Eq. (6)] are identical to those of the corresponding, symmetrized matrix.44 Then, the transition time for a dynamical fluctuation is uniquely defined because its value is independent of which well is assumed to be the initial state and which is assumed to be the final state in the free-energy barrier crossing.
VII. FLUCTUATION DYNAMICS AND BINDING: THE CASE OF MODE 9
Interestingly, the time scale of LE4PD mode 9 predicted from the MSM analysis has a transition time that is comparable to the rotational dynamics of the protein and the fluctuation time of the first internal mode. This indicates that important slow dynamics occurs for this internal mode. This long transition time is due to the structure of the free-energy surface for mode 9, which is shown in the left panel of Fig. 8. Visible in Fig. 8 is a prominent, third minimum in the FES. Using a committor analysis,44 this minimum is shown to lie near a committor value of 0.5 and serves as a “trap state” during transitions between the reactant and product states on the FES of LE4PD mode 9. The aforementioned committor analysis is performed using the standard approach of transition path theory.47 The committor function for a discrete state i, qi, is defined as47
with I being the set of intermediate states (all discrete states not belonging to the reactant or product states) and P being the product set. Conceptually, the committor function describes the probability that a trajectory initiated from state i will visit (or commit) next to the product state,44,45 that is, qi = 0 in the set of discrete states defined as the “reactant” states and qi = 1 in the set of discrete states defined as the “product states.” Discrete states with qi ≈ 0.5 are transitions states, with approximately equal probability of visiting either the reactant state or the product state next.
The right panel of Fig. 8 shows the committor function between the discrete states with the most positive and most negative projections along ψ2; discrete states with committor values near 0 are colored dark red, and discrete states with committor values near 1 are colored dark green. Since the trap state is located at an intermediate committor value (∼0.5–0.6), transitions between reactant and product states get trapped there. This trapping effect extends the transition time between regions of high and low projections along ψ2, leading to longer predicted time scales relative to the LE4PD-MAD procedure; the latter averages out these types of effects because it accounts only for a single characteristic barrier height. Thus, this mode provides a clear example of the extra information that may be afforded from adopting a more precise method of analysis of the dynamics, by using MSMs to model the kinetics of the slow LE4PD modes.
The fluctuations observed in the LML for LE4PD mode 9 are reproduced qualitatively when the transition between the reactant and product wells on the FES is modeled using a transition path found by a modified version of the zero-temperature string method.33,34 The minimum free-energy path is shown in the top panel of Fig. 9, and the corresponding structural deformations of ubiquitin as the trajectory moves along this pathway are shown in the bottom panel (details of how the string method is implemented and how representative structures are generated are given in the supplementary material). Figure 9 demonstrates that the minimum free-energy path between wells passes through the “trap state” mentioned above, which supports the slow time scale for this mode predicted by the MSM. The fluctuations along the primary sequence of ubiquitin, given in the bottom panel, support the localization of the fluctuations predicted by the LE4PD’s LML.
A. Biological interpretation
As previously observed, the conformational selection model of protein binding postulates that in the absence of its binding partner(s), a protein will still sample all energetically available states, including those states responsible for binding to the ligand(s) of interest.1,38 The conformational selection model implies that the Cα–Cα bond fluctuations described by the dynamics along the LE4PD FES may identify the time scales and length scales of relevant binding modes of, in the analysis shown here, ubiquitin. Thus, it is useful to summarize the results that we obtained in this study while trying to connect those data to a possible interpretation of the propensity of ubiquitin to bind at specific time scales in well-defined regions of the protein three-dimensional structure.
Re-examining the LML for LE4PD in Fig. 3, we identify the slowest and more relevant fluctuations of ubiquitin. First, we observe that LE4PD predicts large-amplitude, slow dynamics in two prominent regions: the C-terminal tail (modes 4, 5, 6, 8, and 10) and the Lys11 loop, which is the flexible loop containing Lys11 (modes 7, 11, and 13).17 The C-terminal tail and the Lys11 loop regions are implicated in covalent associations with other proteins, including other ubiquitin molecules; both regions are involved in polyubiquitination events.14,17,20 Due to their ability to bind covalently to numerous other proteins, it is perhaps not surprising that the LE4PD predicts motions in these two regions, which involve a relatively wide window of length- and time scales (Figs. 3 and 7). We speculate that perhaps the large variance of the size and kinetics of the fluctuations involving the C-terminal tail and the Lys11 loop is due to the need for the protein to interact with a variety of other proteins of different sizes and flexibilities.
Different is the result for the fluctuations involving the 50 s loop.39 In this case, fluctuations appear in a specific LE4PD mode (mode 9). X-ray crystallography, 2D-NMR experiments, and immunoblot assays have shown that this region of ubiquitin is recognized by, and is bound noncovalently to, the A20 zinc-finger (ZnF) motif of Ras guanine exchange factor Rabex-5.39,40 Perhaps more interestingly, the X-ray structure of Rabex-5 bound to ubiquitin (PDB ID: 2C7N)39 shows that Y25 on Rabex-5 forms a hydrogen bond with residue E51 of ubiquitin,39 while 2D-NMR studies show that E51 on ubiquitin undergoes a large chemical shift perturbation when Rabex-5 is added to a solution of ubiquitin.40 Interestingly, LE4PD mode 9 shows a large, local fluctuation localized on at E51 residue of ubiquitin (see Fig. 3).
Protein dynamics is hypothesized to play a central role in protein function. Molecular recognition and substrate binding often occur by a conformational selection mechanism, where protein conformations that are apt to bind a specific substrate are already populated in the isolated protein.1 Then, the binding of the substrate occurs by a simple selection of the proper protein conformation. Thus, spontaneous fluctuations in the isolated protein may provide useful information on the mechanism of protein substrate interaction and binding, where both structural information and kinetic information are important to characterize the mechanisms of binding.
In this paper, we present a detailed study of the fluctuation dynamics of the protein ubiquitin and propose a kinetically informed method to analyze the protein motion. The protein is simulated using an MD trajectory and is analyzed following a two-step procedure. In the first step, the trajectory is projected onto diffusive normal-modes of a coarse-grained Langevin representation, the LE4PD model. This step separates the dynamics into quasilinearly independent coordinates and allows for the identification of local fluctuations as a function of their length scale: large, cooperative fluctuations manifest themselves mostly in low-index LE4PD modes, but important exceptions naturally emerge even in this study. Normal-mode LE4PD free energy surfaces can be constructed, which describe the energy landscape connected with the mode-dependent fluctuations. The kinetic pathways of the mode-dependent fluctuations are analyzed by Markov state modeling, which provides the transition time for the crossing of the energy barrier for the LE4PD mode-dependent fluctuations.
To analyze the kinetics of transitions between energy wells in the free-energy surfaces of the slow, high-amplitude LE4PD modes, we identify the proper time scale of transition between two energy minima using the committor function. Since we seek to analyze the slow kinetic processes predicted by the MSM, we focus mostly on the dynamic process modeled by the first nontrivial eigenmode of the MSM, which is described by the spectrum of the second right eigenvector of the transition matrix, ψ2, and its corresponding time scale, . For low-amplitude, local modes, where the energy barriers are not high and the dynamics is not Markovian, we adopt instead a rescaling of the local friction in agreement with Kramers’ theory.
This method of combining the LE4PD normal modes to predict the dynamics over a specific length scale, measured using the LML scale, and the MSM to predict the kinetics (time scales) of each LE4PD mode is applied to a 1-μs, equilibrium simulation of the protein ubiquitin, a small, globular protein involved in the post-translational modification of many eukaryotic proteins.14,19,20 The LE4PD-MSM analysis reveals slow dynamics in three flexible regions of ubiquitin: the C-terminal tail, the Lys11 loop, and the 50 s loop (residues 51–63); these three regions all play prominent roles in ubiquitination pathways, especially in binding by ubiquitin-binding proteins.19,20,39,40 We find that a single, slow mode is dedicated to describing fluctuations in the 50 s loop, while multiple slow LE4PD modes describe motions in the C-terminal tail and the Lys11 loop. This disparity could potentially be due to the small number of proteins that bind to the 50 s loop (which all contain the same ubiquitin-recognition motif, the A20 zinc finger domain19), while many proteins bind to the Lys11 loop and C-terminal tail since these two regions are involved in both ubiquitin recognition and covalent binding to proteins targeted for degradation.14,20 Finally, the minimum free-energy pathways along the FES for the slow LE4PD modes reproduce the fluctuations predicted by the barrier-free LE4PD equation of motion, which indicates that the motion between wells of the FES of the LE4PD modes represents well the equilibrium fluctuations of ubiquitin, with the MSM analysis giving an accurate estimate of the time scales of those fluctuations.
Because MD atomistic simulations of proteins display dynamics that is coupled on multiple length scales, their interpretation is not easy. The LE4PD method of characterizing mode-dependent fluctuations, starting from atomistic MD simulations, may have some advantages with respect to other approaches commonly used, such as Principal Component Analysis (PCA) or time Independent Component Analysis (tICA). The LE4PD conveniently separates the complex dynamics of a protein into linearly independent normal modes, which can be analyzed individually. With respect to PCA and tICA, LE4PD brings a straightforward physical interpretation to the dynamics measured and directly connects the results of the analysis to the primary and secondary structures of the protein and to the free energy barriers and hydrodynamics, which affect protein fluctuations.
Furthermore, we have shown here how the LE4PD can be conveniently used as a first step in a MSM analysis. Although the MSM analysis can be mathematically performed in multiple dimensions, it is useful to identify a set of coordinates allowing for a reduction in the multidimensional space into a projected, low-dimensional space, where the free-energy landscape is represented as a function of two coordinates. The coordinates may be selected to be the first two principal components or time-lagged independent components, which are the two collective coordinates that maximize the variance and maximize the autocorrelation time, respectively. However, selecting the first two coordinates involves tracing the dynamics along a pair of collective coordinates that have different time scales. In the LE4PD case, one can perform an analysis of the dynamics along a single coordinate at a time, characterized by one time- and one length-scale, which is one of the main advantages of decomposing the protein’s dynamics into a set of normal modes.
In conclusion, the mode-dependent LE4PD description presented here appears to be an ideal framework for the analysis of protein dynamics through MSM because it decouples the dynamics into linearly independent modes, thus representing these modes in a low dimensional space that can be easily visualized and conveniently analyzed by MSM. The decomposition in LE4PD normal modes naturally separates the dynamics in independent contributions, where fluctuations occur at a given length scale, while the MSM analysis provides a precise evaluation of the time scale and the kinetic pathway associated with the local fluctuations. Finally, this mode-dependent MSM analysis can reconstruct the dynamics of the specific protein by an inverse transformation, , to reveal the real-space dynamics by LE4PD modes, as shown in Fig. 9.
The supplementary material includes an overview of the LE4PD approach, the details of the MD simulations, and the study of the convergence of the LE4PD modes and time scales and an overview of the MSM applied to LE4PD with the Variational Approach for Markov Processes (VAMP) analysis of the Markovian nature of the LE4PD modes. The supplementary material concludes with a brief description of the effect of changing the lag time on the FES analysis, a description of the behavior of the high-index LE4PD modes not reported in the main document, and a short description of the string method used to select the most probable pathway.
This work was supported by the National Science Foundation Grant No. CHE-1665466. The computational resources were provided by the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation Grant No. ACI-1548562.