To understand protein folding mechanisms from molecular dynamics (MD) simulations, it is important to explore not only folded/unfolded states but also representative intermediate structures on the conformational landscape. Here, we propose a novel approach to construct the landscape using the uniform manifold approximation and projection (UMAP) method, which reduces the dimensionality without losing data-point proximity. In the approach, native contact likelihood is used as feature variables rather than the conventional Cartesian coordinates or dihedral angles of protein structures. We tested the performance of UMAP for coarse-grained MD simulation trajectories of B1 domain in protein G and observed on-pathway transient structures and other metastable states on the UMAP conformational landscape. In contrast, these structures were not clearly distinguished on the dimensionality reduced landscape using principal component analysis or time-lagged independent component analysis. This approach is also useful to obtain dynamical information through Markov state modeling and would be applicable to large-scale conformational changes in many other biomacromolecules.

In molecular dynamics (MD) simulations of proteins, it is essential to investigate the conformational landscapes with a high dimensionality, which include all information on protein stability, fluctuation, and transitions between different states.1 We usually focus on thermodynamics and/or kinetics of proteins on the landscapes spanned with a few collective variables (CVs).2 Since a protein consisting of N atoms has 3N − 6 internal degrees of freedom, a large reduction in the landscape dimensionality is required. Principal component analysis (PCA)3–5 or time-lagged independent component analysis (tICA)6 have been often used to reduce the dimensionality of conformational landscapes and describe low-frequency modes of globular protein dynamics.4,5,7,8

Due to an increase of computational power and advances of simulation algorithms, hundreds of μs or longer simulations have become more feasible.9 Then, MD simulations have allowed us to investigate not only small fluctuations around experimental structures but also much larger transitions.10 Folding/unfolding dynamics of small proteins have also been examined using atomistic or coarse-grained (CG) MD simulations.11–13 For these long MD trajectories, PCA or tICA, which are categorized as “linear” dimensionality reduction methods, sometimes failed to describe essential dynamics of proteins.14,15 In our recent MD simulations of spike protein on the surface of SARS-CoV-2, PCA with Cartesian coordinates could not visualize up-down motions of the receptor binding domain clearly.16 It was mainly because the low-frequency motions from PCA emphasized functionally less important loop fluctuations. Therefore, we used the CG beads, each of which represents a domain motion in spike protein as feature variable of PCA17 and distinguished up, down, and other intermediate structures on the dimensionality reduced landscape.

One of the simplest approach to overcome the problems in conventional PCA or tICA is applying “non-linear” dimensionality reduction methods, such as ISOMAP,18,19 diffusion map,20,21 sketch-map,22,23 t-distributed stochastic neighbor embedding (t-SNE),15,24 or uniform manifold approximation and projection (UMAP).25,26 Based on various criteria for data point proximity, these methods map objects from high-dimensional space to lower dimensional one, without drastically changing distances between data points.24 ISOMAP and diffusion map define a distance of two data points by computing the shortest path length that connects in-between data points,18 whereas diffusion map focuses on distance-dependent transition probability rather than the distance itself.20 Sketch-map aims to reproduce the dissimilarity between any two data points in the high-dimensional space.22 t-SNE computes a weight for two data points based on the “direct” distance independent of in-between data points.24 The strategy of UMAP is similar to t-SNE, but UMAP only focuses on the pairs of data points that are defined as “neighbors” in a high-dimensional space.25 While UMAP shows similar or better performances compared to t-SNE in several applications,27 the computational cost of UMAP is much smaller than t-SNE. UMAP were already applied to protein MD trajectories by Peng Tao’s group.15,26 They performed atomistic MD simulations of several globular proteins for about 1 µs and examined thermodynamics on the conformational landscapes and kinetics with Markov State Modeling (MSM) after dimensionality reductions by PCA, tICA, t-SNE, and UMAP. They concluded that both t-SNE and UMAP outperform PCA and tICA in the structural classification on the landscapes and MSM for better description of thermodynamics and kinetics of proteins.

In both linear (PCA or tICA) and non-linear (t-SNE or UMAP) dimensionality reduction methods, the selection of feature variables to describe a macromolecular structure is essential.2 Cartesian coordinates or dihedral angles of a target system are conventionally used in PCA or tICA. However, there are two well-known problems in analyses of protein folding simulations, when we use Cartesian coordinates in PCA. The first one is that the average conformation may lose its proper chemical structures and then low-frequency motions from PCA can be meaningless. The other one is that various disordered structures away from the native one are difficult to be distinguished in the Cartesian PCA. Gerhard Stock’s group used inter-residue contact distances as feature variables for the folding of HP35 to overcome the second problem.28,29 They could not clearly distinguish the native and intermediate states using the feature variables, suggesting to use backbone dihedral angles.

In the current study, we investigate the applicability of UMAP for describing folding/unfolding dynamics of a small globular protein, B1 domain of protein G (hereafter, we call it protein G) (Fig. 1). Due to slow folding/unfolding dynamics, a CG model of the protein is used in our MD simulation. One of the key differences between Peng Tao’s works and ours is the feature variables in UMAP to describe dimensionality reduced landscapes. Here, we propose native contact likelihood as feature variables and examine the performances of PCA, tICA, and UMAP. This work demonstrates usefulness of UMAP together with proper feature variables in thermodynamic analysis on the dimensionality reduced landscape and kinetic analysis with Markov state modeling (MSM).30–34 This approach would be applicable to large-scale conformational transitions of multi-domain proteins and other biomolecular complexes.

FIG. 1.

All-atom model of protein G. Although the actual MD simulation was performed with the Cα coarse-grained model, all-atom model is displayed to clearly show the secondary structure. Red and blue colored regions are used for definition of β12 and β34 q-values, respectively. Black line is a scale bar of 10 Å. This images is visualized by PyMOL.59 

FIG. 1.

All-atom model of protein G. Although the actual MD simulation was performed with the Cα coarse-grained model, all-atom model is displayed to clearly show the secondary structure. Red and blue colored regions are used for definition of β12 and β34 q-values, respectively. Black line is a scale bar of 10 Å. This images is visualized by PyMOL.59 

Close modal

Here, we briefly describe theoretical framework of the non-linear dimensionality reduction by UMAP.25 UMAP consists of the following three steps: (i) definition of the “neighborhoods” of each data point, (ii) construction of a fuzzy topological structure, and (iii) a nonlinear projection from original space to dimensionality reduced one, preserving the data-point proximity. In linear methods, such as PCA, we can evaluate the reduced dimensionality in terms of contributions of selected components. In contrast, we cannot use such a criterion in UMAP, and the dimension of the reduced space in UMAP is arbitrary. In this study, we set it to 2, in order to understand the reduced conformational landscape intuitively.

In Fig. 2, we assume N data points (xi,i=1,2,,N in original space. Each data point corresponds to a snapshot structure in MD simulations. In UMAP, the number of neighborhood points (NN) from each point is fixed as a hyperparameter. Neighborhood points of xi for NN = 2 are xj and xk, whose distances from xi are the two smallest ones. A topological network in original space is then determined by connecting all the neighborhood points (Fig. 2, left top). To preserve relative data positions in the topological network before and after nonlinear projection, we utilize a fuzzy topological structure characterized with a set of weights pij|i,j=1,2, of the connection between xi and xj. This value, pij, is the largest (=1) if the distance between xi and xj is the smallest in all the neighborhood points. The weight becomes smaller if the distance is greater. The fuzzy topological structure in dimensionality reduced space is optimized to mimic that in original space as much as possible (Fig. 2, right top). Mathematically, UMAP minimizes a cross entropy H defined as

H=i,jpijlogpijqij+1pijlog1pij1qij,
(1)

by changing the coordinates in the dimensionality reduced space yi|i=1,2,.

FIG. 2.

Schematic illustration of UMAP framework. Left top represents the data point distribution in original three-dimensional space, and right top is the dimensionality reduced space.

FIG. 2.

Schematic illustration of UMAP framework. Left top represents the data point distribution in original three-dimensional space, and right top is the dimensionality reduced space.

Close modal

The default value of NN in UMAP is 15. If we select a larger value of NN, the global manifold in the original space is well preserved in the dimensionality reduced one, while the computational costs increase. In contrast, a smaller value of NN gives us a better description of local structures in the topological network, while leading to a failure to preserve the global manifold. Practically, we have to test several values of NN in each application.

In this study, UMAP was applied to reduce the dimensionality of conformational landscape of protein G. UMAP calculations were performed by using a Python library implemented by the original developers.35 For comparison, PCA and tICA were carried out using python libraries in PyEMMA 2, a useful library for the analysis of MD trajectories.36 In addition, we prepared and used wrapper scripts for UMAP and PyEMMA 2 to facilitate UMAP analysis of MD trajectories. We set the lag time in tICA to 2.5 ns.

In the current study, we use quantities derived from native contact pairs in the folded structure. It is defined as

expdijdij,022σij2
(2)

for one native contact pair by atoms i, j. dij,0 is the native contact distance of this pair, and dij is a distance of the same atom pair for a given structure. σij is the variance of the native contact distance, and ideally, this parameter should reflect an interaction strength of corresponding contact pair. The quantity defined in Eq. (2) is able to be interpreted as a likelihood of the contact formation between atoms i and j, given a pairwise distance dij. Therefore, we call this quantity by “contact likelihood.” The average contact likelihood for protein-G is designated to be equivalent to q-value.37 This contact likelihood becomes 1 when the native contact is formed and becomes 0 when atoms i, j are far away. The unfolded structures distribute around the origin 0,0,0,,0T in contact likelihood space, while the folded and native-like structures distribute around 1,1,1,,1T. Intermediate structures are expected to distribute between the unfolded and folded structures.

In this study, σij in Eq. (2) is determined using the crystallographic B-factors of protein G. There is a relationship between the B-factor and the mean square displacement ⟨u2⟩ of the atom that B = 8π2u2⟩. We can calculate σij for the Cα-based CG model using the reproductive property of Gaussian

σij21Nini=1NiBni8π2+1Njnj=1NjBnj8π2,
(3)

where Bni represents the B-factor of nith atom in the ith residue and Ni is the number of atoms in the residue. Details of the determination of σij values from the B-factor of protein G is explained in the supplementary material, Fig. S1.

We performed CG MD simulation of protein G, which contains 56 amino-acid residues. It consists of N-terminal β-hairpin by its two β-strands (β12), one α-helix, and C-terminal β-hairpin by the other two strands (β34) (Fig. 1). β12 and β34 contact with β1 and β4 strands, and consequently these four strands form an anti-parallel β-sheet. The KB-Go model provided to MMTSB web service was applied to protein G.38,39 This model takes into account of only Cα atoms in a protein. The crystal structure [Protein Data Bank (PDB) ID: 1PGB40] was used to set up the KB-Go model and to define σij in the native contact likelihood as defined in Eq. (2). The number of native contact pairs in protein G (1PGB) is 84.

Both the native and non-native contact interactions were calculated without any truncation. 5.4 µs MD simulation with the KB-Go model was performed using the GENESIS program.41,42 In the simulation, system temperature was kept constant at 330 K in NVT ensemble with Langevin thermostat43 of which friction coefficient was 0.01 ps−1. The time step was set to 20 fs with SHAKE constraints44 for all bonds, and the simulation snapshot was saved in a trajectory at every 5 ps.

To examine dynamical information from MD simulations, MSM has become one of the most popular methods.45,46 Suppose that structures in given simulation trajectories are discretized into n states with a clustering method. The construction of MSM is equivalent to calculate a n × n transition probability matrix Pτ. The element, Pijτ (i, j = 1, 2, …, n), is a transition probability defined as

Pijτ=pxi,t=t0+τ|xj,t=t0
(4)

with a certain lag time τ. x denotes a conformation of the target system, and n is the number of states. i and j are the state indices. The time evolution of a probability vector, pt=p1t,p2t,,pntT, each of which gives a probability that the system visits a state at time t, can be written using the matrix, such as

pt+kτ=Pτkpt,k=0,1,2,.
(5)

If we assume thermodynamic equilibrium, which imposes a symmetrical transition matrix, we obtain more detailed dynamical information by solving the eigenvalue problem of Pτ. The first eigenvalue should be 1, and its eigenvector represents the equilibrium distribution. All the other eigenvectors correspond to certain transition modes. The magnitude of each eigenvalue indicates the contribution of each mode to the conformational transition. Then, plotting eigenvectors into space on which MSM is constructed gives us an intuitive view for the conformational transitions.

Practically, the state discretization or clustering is carried out using the essential feature variables or slow coordinates instead of Cartesian coordinates of biomolecules. Linear (PCA or tICA) or non-linear (UMAP and others) dimensionality reduction methods are applied to get the slow coordinates. Frequently, tICA is used to obtain a set of slow coordinates.47–49 In this study, we compare the performance of MSM using UMAP with those using PCA and tICA.

To determine the number of discretized states n in MSM, we compute a generalized matrix Rayleigh quotient (GMRQ),50,51 defined as

GMRQ=i=1nviTPτvii=1nviTvi.
(6)

Here, vi is the ith eigenvector of Pτ, and T denotes the transposition of a vector or a matrix. The computation of GMRQ requires two independent datasets: one for training data and another for validation. For this purpose, we randomly split MD trajectories into the two sets. If we set the number of states too low, the GMRQ value becomes incorrectly small due to discretization errors. Increasing the number of states can improve the GMRQ value, while discretization with too many states declines the GMRQ value, resulting from the overfitting to the training MSM. In this study, state discretization was done by k-means clustering method,52 and all processes accompanying MSM analysis, including k-means clustering, were carried out using PyEMMA 2.36 

We performed 5.4 µs CG MD simulation of protein G. Total number of structures sampled in the MD simulation reached more than one million. In the trajectories, protein G frequently stayed at the folded or unfolded states for several hundred ns. In the current analysis, we reduced the number of snapshots in MD simulation trajectories. First, we computed root mean square deviations (RMSD) of all snapshots against the native structure defined as

1N1i=1Nriri02.
(7)

Here, N is the number of Cα atoms. ri and ri0 are coordinates of the ith Cα atom in MD snapshot and the native structure, respectively. Based on the RMSD values, we define the folded and unfolded states of protein G if the RMSD values are <5 and >10 Å, respectively. Then, we divided whole trajectories into 27 subsets, each of which includes 40 000 structures. We extracted only 10 subsets that involve the folding/unfolding transitions (totally 400 000 structures) in terms of the RMSD changes. By discarding the other 17 non-transition trajectories, the landscape in the current analysis does not show correct free-energy changes while it keeps the global shape of the folding/unfolding transition landscape. Hereafter, we call it as just a “conformational landscape.” Figure 3 shows a time course of RMSD value in the concatenation of extracted trajectories, and folding/unfolding transitions occurred more than twenty times. The RMSD values of protein G are ranging from 0 to 30 Å, indicating that various structures are sampled in our CG MD simulation.

FIG. 3.

Time course of RMSD value in the concatenated CG MD trajectory of protein G. This figure and Figs. 4, 5(a), 5(c), 5(d), 6, and 7 and Figs. S1–S8 of the supplementary material were prepared by Matplotlib.60 

FIG. 3.

Time course of RMSD value in the concatenated CG MD trajectory of protein G. This figure and Figs. 4, 5(a), 5(c), 5(d), 6, and 7 and Figs. S1–S8 of the supplementary material were prepared by Matplotlib.60 

Close modal

We compared several feature variables for the dimensionality reduction of conformational landscape. In Fig. 4, we plot histograms of the RMSD [Fig. 4(a)] and averaged contact likelihood [Fig. 4(b)]. The folded and unfolded states are well separated in both cases, but the unfolded state shows different distributions in these two histograms. In the RMSD histogram, the unfolded state is widely distributed compared to the folded one, suggesting that conformational heterogeneity of the unfolded state observed in the CG MD simulations. In the averaged contact likelihood histogram, the folded and unfolded states show similar deviations from their respective averages, reflecting that this feature variable categorizes some of different unfolded structures into one class. As a result, the unfolded state has a compact distribution in the contact likelihood space. To examine the relationship between this feature variable and conformational classifications, we calculate the silhouette score for the folded and unfolded states.53 As mentioned, the folded and unfolded states are defined in terms of RMSD values. Silhouette score si for the ith MD structure xi is calculated with these formulas,

si=biaimaxai,bi,
(8)
ai=1N11jC1,jixixj,
(9)
bi=1N2kC2xixk.
(10)

C1 is a state to which xi belongs, and C2 is the other. N1 and N2 are numbers of data points in C1 and C2, respectively. ‖⋯‖ denotes a norm. ai is a measure of the compactness of C1 in data space, and bi indicates how well xi is separated from C2. si can take on any value between −1 < si < 1, but should be larger for better separation. If si takes a negative value, the classification may be inappropriate. We average silhouette scores for the folded and unfolded states to obtain sfold and sunfold, respectively.

FIG. 4.

Histograms of RMSD (a) and averaged contact likelihood (b). Peaks corresponding to the folded and unfolded states are labeled in the both histograms.

FIG. 4.

Histograms of RMSD (a) and averaged contact likelihood (b). Peaks corresponding to the folded and unfolded states are labeled in the both histograms.

Close modal

The silhouette scores using Cartesian coordinates, torsion angles, and contact likelihood as feature variables are compared in Table I. sfold with Cartesian coordinates has a high value close to 1, whereas sunfold is almost zero. This means that unfolded structures distribute broadly around one peak for the folded state, as clearly visualized in the supplementary material, Fig. S2a. sfold and sunfold with torsion angles as feature variables show a similar tendency, while sfold is much smaller. On torsion angle space, structures in the folded state are scattered more than those with Cartesian coordinates (supplementary material, Fig. S2b). In contrast to these feature variables, both sfold and sunfold are positive when we use the contact likelihood as defined in Eq. (2), suggesting that the contact likelihood can describe conformational heterogeneity in the folded and unfolded states almost equally. Hereafter, we use the contact likelihood as feature variables in the dimensionality reduction of conformational landscapes.

TABLE I.

Silhouette scores for the folded and unfolded states of protein G, in each feature variable space.

Cartesian coordinatesTorsion anglesContact likelihood
sfold 0.85 0.37 0.66 
sunfold −0.03 −0.04 0.51 
Cartesian coordinatesTorsion anglesContact likelihood
sfold 0.85 0.37 0.66 
sunfold −0.03 −0.04 0.51 

Next, we examined two-dimensional conformational landscapes of protein G obtained from PCA, tICA, and UMAP. Regarding UMAP, NN is a user-dependent hyperparameter and the users should select a value of NN that produce a reasonable landscape. In this study, we tested five different NN values around its default value of 15 and, consequently, selected the default value, considering continuity of a dimensionality reduced landscape. In Fig. 5(a), the UMAP landscape consists of two major basins (Major1 and Major2), which are connected with two local basins, Bridge1 and Bridge2. In addition, there are two bulges, which are named as Sub1 and Sub2. In Fig. 5(b), six snapshots are taken from Major1, Major2, Bridge1, Bridge2, Sub1, and Sub2, representing key structural features of individual states. As all the structures in Major1 do not form β12 and β34 at the same time (Fig. 6), Major1 is regarded as the unfolded state. In contrast, structures in Major2 keep all the secondary structures observed in the folded state. This is consistent with the scatter plot colored with the RMSD values against the native structure (supplementary material, Fig. S3a). The structures in bridge regions represent unfolded or partial unfolded ones. In Bridge1, neither β12 nor β34 is formed, while the N- and C-terminal regions are close to each other. In Bridge2, only β34 is formed, keeping the interaction with the N-terminal region. These structures appear to be transient structures in the folding/unfolding transitions. The RMSD values from the native structure are less than 5 Å in Sub1 and Sub2, while they are slightly different from the native one. β34 in structures belongs to Sub1 is dissociated from α-helix and there is an insertion of the C-terminal region between β12 and α-helix in Sub2.

FIG. 5.

Comparison of dimensionality reduced landscapes of protein G. (a) Dimensionality reduced landscape by UMAP. Characteristic regions (Major1, Major2, Bridge1, Bridge2, Sub1, and Sub2) are labeled in the landscape. (b) Snapshots with key structural features in these regions. N- and C-terminal sides are colored red and blue, respectively. These structures are visualized by VMD.61 Dimensionality reduced landscape by (c) PCA and (d) tICA.

FIG. 5.

Comparison of dimensionality reduced landscapes of protein G. (a) Dimensionality reduced landscape by UMAP. Characteristic regions (Major1, Major2, Bridge1, Bridge2, Sub1, and Sub2) are labeled in the landscape. (b) Snapshots with key structural features in these regions. N- and C-terminal sides are colored red and blue, respectively. These structures are visualized by VMD.61 Dimensionality reduced landscape by (c) PCA and (d) tICA.

Close modal
FIG. 6.

Scattered plots of dimensionality reduced landscapes in Fig. 5, colored by q-values of β12 (top row) and β34 (bottom row), respectively.

FIG. 6.

Scattered plots of dimensionality reduced landscapes in Fig. 5, colored by q-values of β12 (top row) and β34 (bottom row), respectively.

Close modal

The landscapes obtained from PCA and tICA are shown in Figs. 5(c) and 5(d), respectively. Total contributions of two dominant components in PCA and tICA are 72.6% and 84.5%, respectively. We obtained an optimal lag time of 2.5 ns in tICA. Both landscapes involve Major1 and Major2 as two most stable basins (supplementary material, Figs. S3b and S3c). “Bridge” structures are distributed between the two major basins as we see in the UMAP landscape. Major1 and Major2 are mainly distinguished on the first independent component (IC) of tICA, indicating that the folding/unfolding transition is the slowest mode. Furthermore, the second IC resolves another metastable state, Major3 [Fig. 5(d)]. According to the supplementary material, Fig. S3c, Major3 includes almost folded structures considering the RMSD values in that state, but IC2 obviously indicates that there are transitions between Major3 and the other states including folded Major2 with a certain relaxation time. Therefore, Major3 should be distinguished from Major2, even though they are not distinguished in the PCA landscape or in terms of RMSD. In both PCA and tICA, we can interpret the obtained axes (components) as modes, which can be interpreted as structural features. For instance, PC1 and IC1 obviously correspond to the formation of native contacts. On the other hand, we cannot simply interpret Dimension1 in the UMAP landscape as native contact formation, because there are little changes of contact formation in Major1, even though the distribution of Major1 occupies almost half the size of the whole landscape along with Dimension1. This is one of the biggest differences between linear and nonlinear dimensionality reduction methods. Folding transition mechanisms are understood in UMAP landscape using representative structures (see Fig. 5) and projected structural features (see Fig. 6).

Folding/unfolding trajectories in MD simulations are reasonably mapped along the landscapes in all three cases (supplementary material, Figs. S4 and S5). Then, how the dimensionality reduced landscapes from UMAP, PCA, and tICA provide molecular mechanisms for folding dynamics of protein G? According to the previous protein G simulations based on the same CG model,38 the β34 formation occurred in advance of β12. Figure 6 clearly shows that the UMAP landscape describes that the transition from Bridge1 to Bridge2 corresponds to the β34 formation, and subsequently β12 is formed toward the native structure. The β34 formation seems to happen earlier than the β12 formation in the PCA landscape. In contrast. The tICA landscape does not distinct the order of β12 and β34 formations (Fig. 6 right), suggesting that the first IC might not describe the folding process as good as UMAP and PCA. This would be reasonable if we consider rapid folding events of protein G in CG MD simulations (typically happened within 0.05–0.25 ns, supplementary material, Figs. S4 and S5). Each folding event occurs too rapid compared to the current lag time (2.5 ns) used in the current tICA. A proper balance between the lag time and transition times is required for obtaining meaningful conformational landscape reduced with tICA.

In the UMAP landscape, Sub1 and Sub2 are separated from the folded state, even though the RMSD values of the structures are less than 5 Å. In Fig. 7, we project snapshots belong to Sub1 and Sub2 in the PCA or tICA landscape and found that Major 3 in the tICA landscape is almost equivalent to Sub1 in the UMAP landscape, suggesting that UMAP is able to distinct metastable states that are kinetically trapped. Instead, Sub1 is merged into the folded state in the PCA landscape. Sub2 is not resolved as an isolated cluster just using the first and second components in PCA and tICA. With the third component of PCA and tICA, Sub2 is successfully isolated as shown in the supplementary material, Fig. S6. These results suggest that the UMAP landscape successfully isolates the folded, unfolded, and essential intermediate states more clearly on and off the folding pathways.

FIG. 7.

Projection of Sub1 (magenta) and Sub2 (cyan) into the dimensionality reduced landscapes by UMAP (top), PCA (middle), and tICA (bottom).

FIG. 7.

Projection of Sub1 (magenta) and Sub2 (cyan) into the dimensionality reduced landscapes by UMAP (top), PCA (middle), and tICA (bottom).

Close modal

Next, we carried out MSM after reducing the dimensionality of conformational space of protein G with PCA, tICA, and UMAP. First, we compute GMRQ as a function of the number of clusters to determine the common number of clusters in MSM with different dimensionality reduction methods. A value of lag time in GMRQ analysis was fixed to 2.5 ns according to the result of tICA. In the supplementary material, Fig. S7a, we can see that the optimum number of states is 80 in case of UMAP. In contrast, there are not clear tendency in the GMRQ plots of PCA and tICA clustering. We, therefore, decided to use the same number of states 80 for the three MSMs. Note that this number of clusters is much larger than that in conventional MSM analyses. This large number of clusters is just a sufficient condition because this value was determined according to only the GMRQ convergence in the microstate MSM. An essential number of clusters for this system should be investigated through lumping microstates into a small number of macrostates.46 We also analyze characteristics of clustering with different methods, by comparing the probability distributions of intra- and inter-cluster deviations of contact likelihood defined as

1Ni,jNmMCLimCLjm2
(11)

(supplementary material, Fig. S7b). Here, CLim is contact likelihood for the mth contact in the ith snapshot. M is the number of contacts in the native structure, and N is the number of possible pairs for i and j. Snapshots i and j are in the same cluster for intra-cluster deviation, and not for inter-cluster deviation. PCA and tICA show similar probability distributions of intra-cluster deviations with a single peak around 0.2 although the distribution of PCA slightly shifts smaller than that of tICA. Unlike these two methods, probability distribution in UMAP clustering has two peaks with a split around 0.2. The supplementary material, Fig. S7c, shows PCA/tICA/UMAP landscapes with the cluster centers colored by average intra-cluster deviation values. The UMAP landscape has so many clusters, each of which intra-cluster deviation is around 0.1, resulting in the bimodal distribution. Regarding the inter-cluster deviation, there is no difference among three methods, and note that these tendencies of intra- and inter-cluster deviations also appear in the case of different number of states (supplementary material, Fig. S8).

Based on the state discretization in Fig. S7c, we construct MSMs and validate them using implied time scale (ITS) defined as

tiτ=τlogλiτ.
(12)

Here, λiτ is the ith eigenvalue of Pτ. If the constructed MSM is Markovian, ITS would be a constant value with the lag time longer than a certain value τT (τT is called Markovian lag time). In other words, by using ITS analysis, we can identify significant transition modes of which ITS value converged with a lag time longer than τT. Figure 8(a) shows the dependency of ITS on a lag time, and all three MSMs are converged with τT = 2.5 ns. Using this τT, we construct MSMs with UMAP, PCA, and tICA. In each MSM, the stationary distribution (first eigenvector) is similar to the original conformational landscape (supplementary material, Fig. S9). Other eigenvectors representing slow dynamics of the system can resolve three transition modes in the UMAP MSM [left in Fig. 8(a)], while there are only one or two transition modes in the PCA/tICA MSM [middle and right of Fig. 8(a)].

FIG. 8.

Results of MSM analysis. (a) ITS plots of the UMAP (left), PCA (middle), and tICA (right) MSMs. From second to sixth eigenmodes are colored by blue, red, green, cyan, and purple, respectively. Black curves indicate the threshold of time scale, and time scale of eigenmodes inside the shadowed region is smaller than the given lag time. (b) Projection of eigenvectors into the landscape used in MSM analysis. This figure and Fig. S9 of the supplementary material were prepared by PyEMMA 2.36 

FIG. 8.

Results of MSM analysis. (a) ITS plots of the UMAP (left), PCA (middle), and tICA (right) MSMs. From second to sixth eigenmodes are colored by blue, red, green, cyan, and purple, respectively. Black curves indicate the threshold of time scale, and time scale of eigenmodes inside the shadowed region is smaller than the given lag time. (b) Projection of eigenvectors into the landscape used in MSM analysis. This figure and Fig. S9 of the supplementary material were prepared by PyEMMA 2.36 

Close modal

We also plot the landscapes colored by eigenvector values [Fig. 8(b)]. The second eigenvector modes in these MSMs are in common: they describe folding/unfolding transition of protein G. The third eigenvector mode in the UMAP and tICA MSMs represents a transition to the metastable states, Sub1 or Major3. The fourth eigenvector mode, which is meaningful only in the UMAP MSM, describes the transition to Sub2. These transition modes are consistent with the results discussed in Sec. III C. UMAP is, thus, applicable to the preprocessing of MSM and can resolve slow dynamics of a target biomolecule with fewer dimensions than tICA.

In this study, we proposed a new approach to reduce the dimensionality of conformational landscape with UMAP and the contact likelihood. The method was applied to CG MD simulation trajectories of folding dynamics of protein G. By the comparison between UMAP and PCA/tICA, we have shown that (i) UMAP can distinct essential intermediate structures as well as the folded/unfolded states, (ii) folding mechanisms of protein G are well explained in the dimensionality reduced space, and (iii) MSM based on UMAP is better than MSM with tICA or PCA. Interestingly, UMAP is able to resolve the metastable states, which might be distinguished only by tICA with dynamical information. This suggests potential applications of this method for investigating kinetics of protein motions. Regardless of dimensionality reduction methods, the selection of feature variables is found to be essential. Here, we used native contact likelihood as feature variables of folding/unfolding transition of protein G. Considering that the native contact is a key quantity to describe the nature of protein folding funnels,11,54,55 the feature variables would be applicable to folding/unfolding dynamics of other globular proteins. Another advantage of UMAP for the dimensionality reduction of conformational landscape is that it does not require dynamical information. This opens a new possibility to apply it to MD simulation trajectories with biased samplings or enhanced conformational sampling algorithms.56–58 Since these methods allow us to explore a wider conformational space of a protein or other biomolecules, dimensionality reduced landscape via UMAP and appropriate feature variables would provide essential information on thermodynamics and kinetics of the molecules more easily than the conventional methods.

See the supplementary material for the methodological explanations on contact likelihood, Figs. S1–S9 for the UMAP, PCA, and tICA landscapes.

We used the computational resources provided by the HPCI System Research Project (Project Nos. hp210107 and hp210177) and those in RIKEN Advanced Center for Computing and Communication (Hokusai BigWaterfall). This work was supported, in part, by MEXT/JSPS KAKENHI [Grant Nos. 19H05645 and 21H05249 (to Y.S.)], RIKEN pioneering projects in “Biology of Intracellular Environments,” “Glycolipidlogue” (to Y.S.), and MEXT “Program for Promoting Research on the Supercomputer Fugaku (Biomolecular dynamics in a living cell/MD-driven Precision Medicine).” M.O. received financial support from RIKEN special post-doctoral fellowship.

The authors have no conflicts to disclose.

Mao Oide: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Funding acquisition (supporting); Investigation (lead); Methodology (lead); Software (lead); Visualization (lead); Writing – original draft (equal); Writing – review & editing (supporting). Yuji Sugita: Conceptualization (supporting); Funding acquisition (lead); Supervision (lead); Validation (lead); Writing – original draft (equal); Writing – review & editing (lead).

The data that support the findings of this study are available from the corresponding author upon reasonable request.

1.
M.
Karplus
and
J. A.
McCammon
,
Nat. Struct. Biol.
9
,
646
652
(
2002
).
2.
F.
Sittel
and
G.
Stock
,
J. Chem. Phys.
149
,
150901
(
2018
).
3.
S.
Wold
,
K.
Esbensen
, and
P.
Geladi
,
Chemom. Intell. Lab. Syst.
2
,
37
52
(
1987
).
4.
A.
Kitao
,
F.
Hirata
, and
N.
,
Chem. Phys.
158
,
447
472
(
1991
).
5.
A.
Kitao
and
N.
,
Curr. Opin. Struct. Biol.
9
,
164
169
(
1999
).
6.
Y.
Naritomi
and
S.
Fuchigami
,
J. Chem. Phys.
134
,
065101
(
2011
).
8.
Y.
Naritomi
and
S.
Fuchigami
,
J. Chem. Phys.
139
,
215102
(
2013
).
9.
D. E.
Shaw
 et al,
Science
330
,
341
346
(
2010
).
10.
R. O.
Dror
 et al,
Proc. Natl. Acad. Sci. U. S. A.
108
,
18684
18689
(
2011
).
11.
N.
,
Annu. Rev. Biophys. Bioeng.
12
,
183
210
(
1983
).
12.
C.
Clementi
,
H.
Nymeyer
, and
J. N.
Onuchic
,
J. Mol. Biol.
298
,
937
953
(
2000
).
13.
K.
Lindorff-Larsen
 et al,
Science
334
,
517
520
(
2011
).
14.
Y.
Matsunaga
 et al,
J. Phys. Chem. Lett.
7
,
1446
1451
(
2016
).
15.
H.
Zhou
,
F.
Wang
, and
P.
Tao
,
J. Chem. Theory Comput.
14
,
5499
5510
(
2018
).
16.
H. M.
Dokainish
 et al,
Elife
11
,
e75720
(
2022
).
17.
R.
Henderson
 et al,
Nat. Struct. Mol. Biol.
27
,
925
933
(
2020
).
18.
J. B.
Tenenbaum
,
V.
de Silva
, and
J. C.
Langford
,
Science
290
,
2319
2323
(
2000
).
19.
P.
Das
 et al,
Proc. Natl. Acad. Sci. U. S. A.
103
,
9885
9890
(
2006
).
20.
R. R.
Coifman
 et al,
Proc. Natl. Acad. Sci. U. S. A.
102
,
7426
7431
(
2005
).
21.
A. L.
Ferguson
 et al,
Proc. Natl. Acad. Sci. U. S. A.
107
,
13597
13602
(
2010
).
22.
M.
Ceriotti
,
G. A.
Tribello
, and
M.
Parrinello
,
Proc. Natl. Acad. Sci. U. S. A.
108
,
13023
13028
(
2011
).
23.
M.
Ceriotti
,
G. A.
Tribello
, and
M.
Parrinello
,
J. Chem. Theory Comput.
9
,
1521
1532
(
2013
).
24.
L.
van der Maaten
and
G.
Hinton
,
J. Mach. Learn. Res.
9
,
2579
2605
(
2008
).
25.
L.
McInnes
,
J.
Healy
, and
J.
Melville
, arXiv:1802.03426 (
2018
).
26.
F.
Trozzi
,
X.
Wang
, and
P.
Tao
,
J. Phys. Chem. B
125
,
5022
5034
(
2021
).
27.
E.
Becht
 et al,
Nat. Biotechnol.
37
,
38
44
(
2019
).
28.
M.
Ernst
,
F.
Sittel
, and
G.
Stock
,
J. Chem. Phys.
143
,
244114
(
2015
).
29.
A.
Jain
and
G.
Stock
,
J. Phys. Chem. B
118
,
7750
7760
(
2014
).
30.
C.
Schutte
 et al,
J. Comput. Phys.
151
,
146
168
(
1999
).
31.
N.
Singhal
,
C. D.
Snow
, and
V. S.
Pande
,
J. Chem. Phys.
121
,
415
425
(
2004
).
32.
F.
Noe
 et al,
J. Chem. Phys.
126
,
155102
(
2007
).
33.
J. D.
Chodera
 et al,
J. Chem. Phys.
126
,
155101
(
2007
).
34.
N.-V.
Buchete
and
G.
Hummer
,
J. Phys. Chem. B
112
,
6057
6069
(
2008
).
35.
See https://github.com/lmcinnes/umap for Github repository of UMAP.
36.
M. K.
Scherer
 et al,
J. Chem. Theory Comput.
11
,
5525
5542
(
2015
).
37.
R. B.
Best
,
G.
Hummer
, and
W. A.
Eaton
,
Proc. Natl. Acad. Sci. U. S. A.
110
,
17874
17879
(
2013
).
38.
J.
Karanicolas
and
C. L.
Brooks
 III
,
Protein Sci.
11
,
2351
2361
(
2002
).
39.
J.
Karanicolas
and
C. L.
Brooks
 III
,
J. Mol. Biol.
334
,
309
325
(
2003
).
40.
T.
Gallagher
 et al,
Biochemistry
33
,
4721
4729
(
1994
).
41.
J.
Jung
,
T.
Mori
 et al,
Wiley Interdiscip. Rev.: Comput. Mol. Sci.
5
,
310
323
(
2015
).
42.
C.
Kobayashi
,
J.
Jung
 et al,
J. Comput. Chem.
38
,
2193
2206
(
2017
).
43.
D.
Quigley
and
M. I. J.
Probert
,
J. Chem. Phys.
120
,
11432
11441
(
2004
).
44.
J.-P.
Ryckaert
,
G.
Ciccotti
, and
H. J. C.
Berendsen
,
J. Comput. Phys.
23
,
327
341
(
1977
).
45.
B. E.
Husic
and
V. S.
Pande
,
J. Am. Chem. Soc.
140
,
2386
2396
(
2018
).
46.
W.
Wang
 et al,
Wiley Interdiscip. Rev.: Comput. Mol. Sci.
8
,
e1343
(
2017
).
47.
G.
Perez-Hernandez
 et al,
J. Chem. Phys.
139
,
015102
(
2013
).
48.
C. R.
Schwantes
and
V. S.
Pande
,
J. Chem. Theory Comput.
9
,
2000
2009
(
2013
).
49.
F.
Noé
and
C.
Clementi
,
J. Chem. Theory Comput.
11
,
5002
5011
(
2015
).
50.
R. T.
McGibbon
and
V. S.
Pande
,
J. Chem. Phys.
142
,
124105
(
2015
).
51.
B. E.
Husic
 et al,
J. Chem. Phys.
145
,
194103
(
2016
).
52.
E. W.
Forgy
,
Biometrics
21
,
768
780
(
1965
).
53.
P. J.
Rousseeuw
,
J. Comput. Appl. Math.
20
,
53
65
(
1987
).
54.
J. N.
Onuchic
,
Z.
Luthey-Schulten
, and
P. G.
Wolynes
,
Annu. Rev. Phys. Chem.
48
,
545
600
(
1997
).
55.
K. A.
Dill
and
H. S.
Chan
,
Nat. Struct. Biol.
4
,
10
19
(
1997
).
56.
Y.
Sugita
and
Y.
Okamoto
,
Chem. Phys. Lett.
314
,
141
151
(
1999
).
57.
M.
Kamiya
and
Y.
Sugita
,
J. Chem. Phys.
149
,
072304
(
2018
).
58.
Y.
Miao
,
V. A.
Feher
, and
J. A.
McCammon
,
J. Chem. Theory Comput.
11
,
3584
3595
(
2015
).
59.
The PyMOL Molecular Graphics System, Version 2.4.0, Schrödinger, LLC.
60.
J. D.
Hunter
,
Comput. Sci. Eng.
9
,
90
95
(
2007
).
61.
W.
Humphrey
,
A.
Dalke
, and
K.
Schulten
,
J. Mol. Graphics
14
,
33
38
(
1996
).

Supplementary Material