The statistical analysis of molecular dynamics simulations requires dimensionality reduction techniques, which yield a low-dimensional set of collective variables (CVs) {*x*_{i}} = **x** that in some sense describe the essential dynamics of the system. Considering the distribution *P*(**x**) of the CVs, the primal goal of a statistical analysis is to detect the characteristic features of *P*(**x**), in particular, its maxima and their connection paths. This is because these features characterize the low-energy regions and the energy barriers of the corresponding free energy landscape Δ*G*(**x**) = −*k*_{B}*T* ln *P*(**x**), and therefore amount to the metastable states and transition regions of the system. In this perspective, we outline a systematic strategy to identify CVs and metastable states, which can subsequently be employed to construct a Langevin or a Markov state model of the dynamics. In particular, we account for the still limited sampling typically achieved by molecular dynamics simulations, which in practice seriously limits the applicability of theories (e.g., assuming ergodicity) and black-box software tools (e.g., using redundant input coordinates). We show that it is essential to use internal (rather than Cartesian) input coordinates, to employ dimensionality reduction methods that avoid rescaling errors (such as principal component analysis), and to perform density-based (rather than *k*-means-type) clustering. Finally, we briefly discuss a machine learning approach to dimensionality reduction, which highlights the essential internal coordinates of a system and may reveal hidden reaction mechanisms.

## I. INTRODUCTION

Classical molecular dynamics (MD) simulation of proteins has emerged as a complementary tool to experiment. Its appeal of fully describing biomolecular structure and dynamics at an atomistic level, combined with advancements in computer hardware and algorithms, has led to an ever-growing interest in simulations of increasing size and length.^{1} As a consequence, the interpretation of the resulting “big data” describing complicated multiscale molecular motion presents new challenges. While nanosecond trajectories typically remain close to the experimental starting structure and therefore are traditionally characterized by some representative MD snapshots, nowadays achievable microsecond simulations may give rise to numerous complex conformational transitions that require careful statistical analysis.

To this end, it is common practice to choose some (in general multidimensional) molecular observables **x** describing the process of interest and consider their mean ⟨**x**⟩ or distribution *P*(**x**) (structural analysis) as well as their time evolution or autocorrelation function (dynamical analysis). In particular, biomolecular processes are often described in terms of the free energy landscape^{2–4}

Δ*G*(**x**) = −*k*_{B}*T* ln *P*(**x**),

with *k*_{B} being Boltzmann’s constant and *T* the temperature. Given a suitable choice of **x**, the free energy landscape reveals the relevant regions of low energy (corresponding to metastable states) as well as the barriers (accounting for transition states) between these regions, and may therefore visualize the pathways of a biomolecular process.
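The relation Δ*G*(**x**) = −*k*_{B}*T* ln *P*(**x**) translates directly into a few lines of code. The following is a minimal sketch (NumPy only, all names illustrative) that estimates a 1D free energy profile from a histogram of sampled observable values:

```python
import numpy as np

def free_energy_profile(samples, bins=50, kBT=1.0):
    """Estimate DeltaG(x) = -kB*T ln P(x) from sampled data.

    Unsampled bins yield +inf, i.e., apparently infinite barriers
    that merely reflect the limited sampling."""
    hist, edges = np.histogram(samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    with np.errstate(divide="ignore"):
        dG = -kBT * np.log(hist)
    dG -= dG.min()  # shift the global minimum to zero (common convention)
    return centers, dG

# Illustrative double-well data: two Gaussians mimic two metastable states
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1, 0.2, 50_000), rng.normal(1, 0.2, 50_000)])
centers, dG = free_energy_profile(x)
```

Note that the poorly sampled barrier region dominates the statistical error of such a profile, which is one reason why barrier heights read off free energy landscapes should be treated with caution.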

As an example, let us consider the villin headpiece protein (HP35), which from experiments is known to fold on a microsecond time scale via several intermediate states.^{5–8} Adopting a 300 *µ*s MD trajectory at 360 K by Piana *et al.*^{9} (for details see the supplementary material), Fig. 1(a) shows representative molecular structures of HP35. Also shown is the time evolution of the radius of gyration *R*_{G}, which represents a commonly used one-dimensional (1D) observable of the folding process. That is, small values of *R*_{G} with little fluctuations indicate the folded state of the protein, while heavily fluctuating large values of *R*_{G} are indicative of unfolded conformations of HP35. However, the 1D observable and its free energy profile Δ*G*(*R*_{G}) do not provide any detailed information on the folding process such as intermediate states. As a second example, we consider T4 lysozyme (T4L), a 164-residue enzyme whose interaction with the substrate involves a prominent hinge-bending motion of its two domains [Fig. 1(b)]. Adopting a 50 *µ*s MD trajectory at 300 K by Ernst *et al.*^{10} (for details see the supplementary material), we find that this motion is well captured by the radius of gyration *R*_{G}. However, using *R*_{G} as a reaction coordinate to estimate the energy barrier, it becomes obvious that the resulting small value of ∼2*k*_{B}*T* cannot account for the long (∼10 *µ*s) observed transition time of T4L. In fact, it has recently been shown that the origin of this long time scale is a “locking mechanism” that involves hidden intermediate states that are not detected by standard 1D observables.^{10}

From the examples given above, it is obvious that a first and crucial step of MD analysis is the choice of suitable observables **x**. To be specific, we make the following definitions. The starting point is some high-dimensional set of *input coordinates* **r** = (*r*_{1}, …, *r*_{n}) from MD simulation, such as Cartesian atom coordinates, interatom distances, or dihedral angles. Given as a function of **r**, we wish to construct a low-dimensional observable **x** = (*x*_{1}, …, *x*_{d}) that in some sense describes the system’s essential dynamics. The *x*_{i}, henceforth referred to as *collective variables* (CVs), are meant to resolve the underlying structural and dynamical processes of the MD data, e.g., they should discriminate between different metastable states.^{11–13} Furthermore, CVs are crucial for various enhanced sampling techniques^{14–16} such as umbrella sampling,^{17} targeted MD,^{18} and metadynamics.^{19}

A critical aspect is the dimensionality *d* of the CVs. While the above examples already indicated that low-dimensional projections of the energy landscape may not yield the correct number of conformational states and their connectivity,^{20–22} *d* needs to be sufficiently low (say, *d* ≲ 10) to allow for a statistical analysis of the data, e.g., for clustering into metastable states.^{23–35} This is because when we distribute, e.g., 10^{6} data points on a 10D grid with 10 bins in each dimension, the vast majority of bins is empty or very sparsely populated. On this account, a number of efficient and systematic strategies of dimensionality reduction have been developed,^{36–53} which aim to identify the optimal CVs for a given purpose. Popular methods in the field of MD simulation include principal component analysis (PCA)^{39,40} that represents a linear transformation to coordinates that maximize the variance of the first components and time-lagged independent component analysis (TICA)^{41–44} that aims to maximize the time scales of the first components. Moreover, a variety of nonlinear techniques^{45–54} such as local linear embedding,^{50} isomap,^{51} diffusion maps,^{52} and sketch-map^{53} as well as various kinds of machine learning approaches^{55–64} have been proposed for this purpose.
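The sampling argument can be made concrete with a quick numerical experiment: even for uniformly distributed (i.e., best-case) data, 10^{6} points cannot populate more than a tiny fraction of the 10^{10} bins of a 10D grid. An illustrative sketch:

```python
import numpy as np

# 1e6 points uniformly distributed in a 10-dimensional unit cube,
# binned on a grid with 10 bins per dimension (10^10 bins in total).
rng = np.random.default_rng(0)
n_points, dim, n_bins = 1_000_000, 10, 10

points = rng.random((n_points, dim))
# Map each point to a single flat bin index (mixed-radix encoding).
idx = (points * n_bins).astype(np.int64)
flat = idx @ (n_bins ** np.arange(dim, dtype=np.int64))

occupied = np.unique(flat).size
fraction = occupied / n_bins**dim
# Even in this best case, at most 1e6 of the 1e10 bins can be occupied,
# i.e., at least 99.99% of the bins are necessarily empty.
```

For the Boltzmann-weighted, highly correlated data of an MD trajectory, the occupied fraction is smaller still, which is why clustering rather than naive binning is required in high dimensions.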

Assuming a time scale separation between the slow motion of CVs (representing the “system”) and the fast fluctuations of the remaining degrees of freedom (representing the “bath”), CVs may be employed as a multidimensional *reaction coordinate* that can be used to calculate transition rates.^{65–70} In this way, they may serve as a basis to construct a data-driven Langevin equation^{71–73} that accounts for the continuous time evolution of the system on the low-dimensional free energy landscape Δ*G*(**x**). Alternatively, we may employ some clustering of the data to reveal the metastable states of the system.^{23–35} By calculating the transition probabilities between these states, we can construct a Markov state model that describes protein dynamics in terms of memory-less jumps.^{74–78} Holding the promise to predict long-time dynamics from many short trajectories, Markov state models have been employed in a variety of applications; see Refs. 79 and 80 for recent reviews. In this way, free energy landscapes and associated Langevin equations as well as metastable states and Markov transition networks have emerged as central theoretical concepts to analyze MD data.

The statistical analysis of MD simulations has been significantly facilitated by freely available software packages such as PyEMMA^{81} and MSMBuilder,^{82} which aim to semi-automatically identify CVs and metastable states and provide means to construct Markov state models. While these programs have greatly promoted the field, they nonetheless need to be used with caution, as their black-box application may lead to suboptimal or even nonsensical results. On this account, in this perspective, we want to discuss several issues regarding the identification of CVs and metastable states.

- *Dimensionality reduction.* Focusing mainly on linear formulations such as PCA-based approaches and TICA, we discuss the effective dimension of the dynamics, the effects of projection and rescaling errors, and selection criteria for the CVs. Since biomolecular processes may involve small structural changes (thus defying variance-optimizing methods like PCA) and exhibit hierarchically coupled processes on various time scales^{83–86} (that hamper methods maximizing time scales like TICA), it is not necessarily obvious which property of the CVs is to be optimized.
- *Input coordinates.* Due to inevitable mixing of overall rotation and internal motion, Cartesian coordinates are in general not directly suited for dimensionality reduction.^{87} Careful inspection of the raw data shows what kind of internal coordinates (e.g., interatomic distances or dihedral angles) are of importance for the description of the process under consideration.^{10,88} To improve the signal-to-noise ratio, irrelevant or redundant coordinates should be omitted from the outset. As shown below, the choice of input coordinates dramatically affects all subsequent analysis.
- *Clustering.* While theoretical formulations of Markov state models often assume ergodicity of the dynamics, MD data of proteins are notoriously undersampled and by no means ergodic. To minimize statistical errors, it is essential to use robust geometrical clustering methods that correctly cut conformational states at their energy barriers and therefore appropriately describe the dynamics in terms of relatively few well-sampled metastable states.^{23–28} We also briefly discuss dynamic clustering approaches^{31–35} that aim to provide a dynamical coarse graining and correct spurious barrier transitions via coring.
- *Essential coordinates.* Given as linear combinations of high-dimensional input coordinates, standard CVs do not necessarily point to the important specific interatomic distances or dihedral angles. Employing a recently suggested machine learning algorithm,^{64} we identify these essential internal coordinates, which are shown to represent versatile reaction coordinates and may reveal previously hidden intermediate states of the system.

Adopting the folding of HP35 as a representative example, we discuss virtues and shortcomings as well as common pitfalls of the aforementioned approaches, and stress the importance of testing model results by comparing to raw data. For the sake of brevity, we focus on a few examples (mostly HP35 and T4L). Hence we cannot address the many facets of biomolecular systems but rather aim to outline issues of general importance. To broaden the discussion, we nonetheless frequently refer to other molecular systems, including proteins such as PDZ2 domain^{86} and BPTI,^{87,88} peptides like Ala_{n}^{21,27} and Aib_{9},^{84} and RNA loops.^{89} Similarly, we focus on linear dimensionality reduction approaches and do not attempt to discuss the large variety of intriguing nonlinear techniques.^{45–54} We also do not address subsequent issues related to the set-up of Markov state models (e.g., test of lag times, reversibility, and Markovianity), which are well covered by the mentioned software packages.^{81,82}

## II. DIMENSIONALITY REDUCTION

### A. General considerations

Biomolecules represent complex dynamical systems, and complexity is a well-established concept in the theory of nonlinear dynamics. It is often associated with the fact that the “effective dimension” *d*_{eff} of the system,^{90} that is, the dimension of the subspace an MD trajectory $r(t) \in \mathbb{R}^{n}$ occupies in the course of its time evolution, can be much smaller than the dimension the problem is formulated in. The origin of this dimensionality reduction is nonlinear couplings, which give rise to cooperative effects that reduce the effective number of degrees of freedom. Employing techniques from nonlinear time series analysis, the effective dimension of various small peptides and proteins was found to be *d*_{eff} ≲ 5.^{91–93} This value seems relatively small, considering that it accounts for the motion of thousands of atoms. Obtained from a nonlinear description of the dynamics, however, this small dimension applies only if a suitable nonlinear dimensionality reduction method is used.^{45–53} For example, the first eigenfunctions of the diffusion operators as well as the committor function provide suitable (but abstract) coordinates for such a description in reduced dimensionality.^{11–13} In particular, the committor (i.e., the probability of a trajectory at some position **r** to reach the product state before visiting the reactant state) represents an important theoretical concept of reaction rate theory, from which many properties of the dynamics can in principle be calculated exactly.^{12,69}

Commonly employed nonlinear dimensionality reduction methods are typically based on a distance metric (e.g., the root mean square distance, RMSD) between all data points, the quadratic scaling of which may become prohibitive for data sizes of ≳10^{6} points that are common nowadays. Moreover, the estimation of distances seriously suffers from the nonuniform (i.e., Boltzmann-type) distribution of data and the ubiquitous noise encountered in MD simulations.^{11} In practice, one therefore often invokes a simple linear ansatz,

**x** = *A* **r**

(with *A* being the transformation matrix), and subsequently uses only the first *d* components of **x** as CVs. While in general *d* ≥ *d*_{eff}, a linear formulation provides well-established methods to construct *A* and allows us to invert the transformation, which provides a direct interpretation of the CVs *x*_{i} in terms of the input coordinates **r**. In contrast to low-dimensional model examples such as eight-membered rings^{47} and coarse-grained model systems,^{45} a recent MD-based study of high-dimensional conformational dynamics has shown that linear approaches are not necessarily inferior to nonlinear methods.^{49}

### B. Projection and rescaling errors

The main goal of dimensionality reduction in the analysis of an MD trajectory is to detect nonrandom structures of the conformational distribution, that is, its maxima (representing metastable states) and the connection paths between them (yielding the energy barriers between the states). When a high-dimensional probability distribution *P*(**r**) ($r \in \mathbb{R}^{n}$) is approximated by a low-dimensional function *P*(**x**) ($x \in \mathbb{R}^{d}$), information may get lost in two ways: via projection errors and via rescaling errors. First, in any dimensionality reduction method, linear or nonlinear, high-dimensional data are projected onto a lower-dimensional manifold. This may lead to artifacts, e.g., by artificially combining clusters that are well separated in a dimension that has been integrated out, or by changing the connectivity of clusters due to the projection [Fig. 2(a)]. Hence we need to choose the dimension *d* (i.e., the number of CVs) large enough to reproduce the correct number and connectivity of the metastable states. This problem is common to all of the discussed methods and can be handled by carefully reviewing the resulting CVs in terms of nonrandom structure (see below).

Second, while unitary transformations^{94} (such as standard PCA) conserve the metric of the underlying space, various dimensionality reduction methods do not. Examples include dihedral angle PCA (dPCA)^{95} (while the improved version dPCA+ does preserve the metric^{96}), TICA,^{42} and various versions of kernel PCA.^{97–99} The associated change of the metric causes a distortion of the probability distribution, which may lead to rescaling errors. As an example, Fig. 2(b) shows the distribution *P*(*ψ*) of backbone dihedral angle *ψ* (here *ψ*_{2} of the above introduced HP35 trajectory), which reveals three peaks corresponding to the *α*_{R}, *α*_{L}, and *β* conformations of the residue. Also shown are corresponding sine and cosine transformations *P*(sin *ψ*) and *P*(cos *ψ*) used in dPCA to deal with the periodicity of the data (see below). Due to the nonlinearity of the trigonometric functions, it is evidently not possible to separate all three peaks of the original distribution in the transformed space anymore.^{96} On the one hand, the cosine overemphasizes the regions around 0 and *π* to such a degree that the separation of the two closer peaks gets lost. On the other hand, the sine falsely combines the peaks due to its symmetry. As the example shows, transformations which distort the metric of the underlying space may lead to spurious results when clustering methods are subsequently applied to separate metastable states. We note in passing, though, that a nonlinearly transformed free energy landscape may produce correct dynamics, if the position-dependent diffusion coefficient is transformed in the same way.^{100}
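Both effects are easy to verify numerically: sin maps *ψ* and *π* − *ψ* onto the same value (merging distinct peaks), while cos compresses the regions around 0 and *π* (losing the separation of nearby peaks). A minimal check:

```python
import numpy as np

# Two distinct dihedral-angle values (in rad), symmetric about pi/2;
# they are clearly separated on the circle ...
psi_a, psi_b = 0.3, np.pi - 0.3
assert abs(psi_a - psi_b) > 2.0

# ... but the sine transform maps both to the same point, so the
# corresponding peaks of P(sin psi) merge.
assert abs(np.sin(psi_a) - np.sin(psi_b)) < 1e-12

# The cosine transform, in turn, compresses regions around 0 and pi:
# an angular separation dpsi near pi shrinks to ~dpsi^2/2 under cos.
dpsi = 0.1
sep = abs(np.cos(np.pi) - np.cos(np.pi - dpsi))
assert sep < dpsi**2  # 1 - cos(dpsi) ≈ dpsi^2/2, much smaller than dpsi
```

This is exactly the mechanism by which the three-peak distribution of Fig. 2(b) cannot be fully resolved in sin/cos space.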

To summarize, we want a dimensionality reduction method that avoids projection and distortion errors, is linear (and thus readily constructed and inverted), and allows us to systematically construct typically 5-10 CVs. Here the lower bound is given by *d*_{eff} discussed above, while the upper bound ensures sufficient sampling of the *d*-dimensional space. [Recall that even for a large number of input data (say, 10^{7}), the vast majority of bins of a 10D grid would be empty or very sparsely populated.] While standard PCA fulfills all three criteria, there are various approaches such as TICA and kernel PCA, which put up with distortion errors but hold the promise to provide CVs that approximate the essential motions in a more appropriate and thus more efficient way.

### C. PCA

Principal component analysis (PCA) is a well-established approach to systematically construct a low-dimensional set of CVs, **x** = (*x*_{1}, …, *x*_{d}).^{37} Given some high-dimensional MD input data **r** = (*r*_{1}, …, *r*_{n}), the basic idea of PCA is to describe the correlated motion of the system via the covariance matrix

*σ*_{ij} = ⟨(*r*_{i} − ⟨*r*_{i}⟩)(*r*_{j} − ⟨*r*_{j}⟩)⟩,

where ⟨…⟩ represents the average over all sampled conformations. Diagonalization of the covariance matrix results in *n* eigenvalues *λ*_{i} and eigenvectors $v^{(i)}$, which describe the variances and directions of the principal motions, respectively. Projection of the input data **r** onto the eigenvectors,

*x*_{i} = $v^{(i)}$ · (**r** − ⟨**r**⟩),

yields the principal components *x*_{i} (*i* = 1, …, *n*), which are linearly uncorrelated, $\langle x_i x_j \rangle = \delta_{ij} \langle x_i^2 \rangle$, and account for the dynamics along the directions of maximum variance. Ordering the components *x*_{i} by decreasing eigenvalues, we can truncate the vector **x** at a low dimension *d* according to some convergence criterion and use only the first *d* components as CVs.^{39,40} As an alternative to the covariance, we may also consider the correlation (i.e., the normalized covariance), which emphasizes correlated motion and therefore can be advantageous to identify small-amplitude motion.^{10} As a measure of the underlying dynamics, we consider the correlation functions

*C*_{ij}(*t*) = ⟨*δx*_{i}(*t*) *δx*_{j}(0)⟩,

where *δx*_{i} = *x*_{i} −⟨*x*_{i}⟩. In particular, the decay time of the autocorrelation function *C*_{ii}(*t*) reports on the time scale of the *i*th component.
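As a minimal illustration (not the analysis code used here), the covariance–diagonalization–projection recipe can be sketched with NumPy, assuming the trajectory is stored as an (n_frames × n_coords) array:

```python
import numpy as np

def pca(data, d):
    """PCA of an (n_frames, n_coords) array; returns the first d
    principal components and all eigenvalues (decreasing order)."""
    delta = data - data.mean(axis=0)       # delta r = r - <r>
    cov = delta.T @ delta / len(data)      # covariance matrix sigma_ij
    lam, vec = np.linalg.eigh(cov)         # eigh: eigenvalues ascending
    order = np.argsort(lam)[::-1]          # reorder by decreasing variance
    lam, vec = lam[order], vec[:, order]
    x = delta @ vec[:, :d]                 # projection onto eigenvectors
    return x, lam

# Correlated 3D toy data: two coordinates share one slow collective motion
rng = np.random.default_rng(2)
z = rng.normal(size=(10_000, 1))
data = np.hstack([z + 0.1 * rng.normal(size=(10_000, 1)),
                  z + 0.1 * rng.normal(size=(10_000, 1)),
                  0.1 * rng.normal(size=(10_000, 1))])
x, lam = pca(data, d=2)
```

For this toy model the first component captures the shared collective motion and carries nearly all of the variance, while the components are mutually uncorrelated, as stated above.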

### D. TICA

Since Langevin and Markov models are based on a time scale separation, a linear transformation that maximizes the time scales of the components rather than their variance (as in PCA) appears desirable. To some extent, this goal is achieved by time-lagged independent component analysis (TICA), which originally was suggested in the field of signal processing^{41} and adopted later for the analysis of MD data.^{42–44} While TICA can be formulated as a generalized eigenvalue problem,

*C*(*τ*) $v^{(i)}$ = *λ*_{i} *C*(0) $v^{(i)}$,

with *C*(*τ*) = {*C*_{ij}(*τ*)} being the time-lagged covariance matrix [Eq. (5)], the method is best understood in terms of a three-step process. Starting with a regular PCA (step 1), the principal components are normalized (step 2) such that the covariance matrix of the normalized components is given by the unit matrix. As a consequence, it will remain diagonal during the second unitary transformation (step 3), which may diagonalize another quantity of choice, in the case of TICA the (symmetrized) time-lagged covariance matrix *C*_{ij}(*τ*). Hence TICA results in components that are linearly uncorrelated (as in PCA) and at the same time show maximal autocovariances at a fixed lag time *τ*. Using a variational principle, Noé and co-workers showed that (at least) the first TICA component provides the optimal linear approximation of the first eigenfunction of the transfer operator associated with the slowest implied time scales *t*_{i} = −*τ*/ln *λ*_{i}.^{42}
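The three-step process can be sketched as follows (an illustrative NumPy implementation, not the production TICA of the cited packages): whitening with respect to *C*(0) (steps 1 and 2), followed by diagonalization of the symmetrized time-lagged covariance of the normalized components (step 3):

```python
import numpy as np

def tica(data, lag, d):
    """TICA sketch via the three-step process: (1) PCA, (2) normalize
    the components, (3) diagonalize the symmetrized time-lagged
    covariance of the normalized components."""
    delta = data - data.mean(axis=0)
    c0 = delta.T @ delta / len(delta)                       # C(0)
    ct = delta[:-lag].T @ delta[lag:] / (len(delta) - lag)  # C(tau)
    ct = 0.5 * (ct + ct.T)                                  # symmetrize
    w, u = np.linalg.eigh(c0)
    trans = u / np.sqrt(w)            # steps 1+2: whitening transform
    lam, v = np.linalg.eigh(trans.T @ ct @ trans)           # step 3
    order = np.argsort(lam)[::-1]     # slowest (largest lambda) first
    return delta @ (trans @ v[:, order[:d]]), lam[order]

# Toy data: a slow AR(1) process mixed into both coordinates with fast noise
rng = np.random.default_rng(3)
slow = np.zeros(20_000)
for t in range(1, 20_000):
    slow[t] = 0.99 * slow[t - 1] + rng.normal()
fast = rng.normal(size=20_000)
data = np.column_stack([slow + fast, slow - fast])
x, lam = tica(data, lag=10, d=1)
```

For this toy model the first TICA component recovers the slow process (its eigenvalue approximates the lag-10 autocorrelation, ≈0.99^{10}), while PCA would be misled by the variance contributed by the fast noise.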

Diagonalizing two observables, though, comes at a price. Due to the rescaling in step 2, the overall transformation is not unitary, and consequently its eigenvectors are not orthogonal. As illustrated in Fig. 2(b), this may lead to a distortion of the free energy landscape. Moreover, the lag time *τ* represents an additional parameter, whose choice may be ambiguous.^{44}

### E. Kernel PCA

The idea of kernel PCA is to construct a matrix *C*_{ij} = ⟨*K*(*r*_{i}, *r*_{j})⟩, where *K*(*r*_{i}, *r*_{j}) describes a mapping of input coordinates *r*_{i}, *r*_{j} to some feature space. In the case of standard PCA, the kernel reduces to a bilinear form, *K*(*r*_{i}, *r*_{j}) = (*r*_{i} − ⟨*r*_{i}⟩)(*r*_{j} − ⟨*r*_{j}⟩). In the field of support vector machine learning, in particular, sigmoidal, exponential, and polynomial kernels have been employed.^{98} As in standard PCA, one may solve the associated eigenvalue problem and project the resulting eigenvectors on the input data **r** [cf. Eq. (4)]. Unlike the case of standard PCA, however, the resulting first components of the transformation are not simply given as linear combinations of the input coordinates, but are represented in the chosen feature space. Due to the rescaling problem discussed above, this may hamper a simple interpretation of the components and seriously limit their use for subsequent clustering.
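As an illustration of the idea (here with a Gaussian kernel, one common choice; all names illustrative), the kernel matrix is built between all frames, centered in feature space, and diagonalized. Note the quadratic scaling with the number of frames mentioned above:

```python
import numpy as np

def kernel_pca(data, d, sigma=1.0):
    """Kernel PCA sketch with Gaussian kernel
    K(r_i, r_j) = exp(-|r_i - r_j|^2 / (2 sigma^2)).
    data: (n_frames, n_coords); the kernel matrix is n_frames x n_frames."""
    sq = ((data[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma**2))
    n = len(data)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center in feature space
    lam, vec = np.linalg.eigh(Kc)
    order = np.argsort(lam)[::-1]
    # feature-space components, scaled by the square roots of the eigenvalues
    return vec[:, order[:d]] * np.sqrt(np.abs(lam[order[:d]]))

# Toy data: two well-separated clusters in 2D
rng = np.random.default_rng(6)
data = np.vstack([rng.normal(0, 0.1, (100, 2)),
                  rng.normal(3, 0.1, (100, 2))])
x = kernel_pca(data, d=1)
```

The first kernel component cleanly separates the two clusters, but, as stated above, it lives in feature space rather than in the space of the input coordinates, which complicates its physical interpretation.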

## III. CHOICE OF INPUT COORDINATES

### A. Cartesian coordinates

The typical output of MD simulations is a large file with 3D Cartesian coordinates of all atoms of the system (consisting, e.g., of a protein and its surrounding solvent molecules) per simulation step. While the system is described in phase space, it is generally accepted to omit the velocities because velocity autocorrelation functions decay on a rather short time scale.^{101} For simplicity, moreover, it is common practice to discard the solvent coordinates and focus on the protein motion (but see Refs. 65 and 102). We are then left with the 3*N* Cartesian coordinates of the protein, which are convenient to represent the 3D structure of the system and are also used to calculate the kinetic energy in MD simulations.

Employing Cartesian coordinates, first the translation and overall rotation need to be removed from the trajectory. The latter is commonly achieved via a “rotational fit” to a reference structure,^{103} which determines an overall 3 × 3 rotation matrix *R* that minimizes the least-squares distance between the mass-weighted instantaneous atomic positions $r_i' = R^{T} r_i$ and reference positions $\bar{r}_i$. Since the rotation depends via the moment of inertia on the molecule’s structure, however, this separation of overall and internal motion is only straightforward for relatively rigid molecules. For a flexible system such as a folding protein, on the other hand, we obtain significant mixing of overall and internal motion.
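Such a rotational fit is typically computed with the Kabsch algorithm; a minimal sketch for equal-weight atoms (mass weighting would enter via weighted means and a weighted covariance) is:

```python
import numpy as np

def kabsch_fit(coords, ref):
    """Remove translation and overall rotation by a least-squares fit
    of coords (N, 3) onto a reference structure ref (N, 3)."""
    p = coords - coords.mean(axis=0)       # remove translation
    q = ref - ref.mean(axis=0)
    u, s, vt = np.linalg.svd(p.T @ q)      # covariance of the two structures
    sign = np.sign(np.linalg.det(u @ vt))  # avoid improper rotation (reflection)
    D = np.diag([1.0, 1.0, sign])
    R = u @ D @ vt                         # optimal rotation matrix
    return p @ R                           # fitted coordinates

# Example: a randomly rotated and translated copy is fitted back onto ref
rng = np.random.default_rng(4)
ref = rng.normal(size=(30, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
Q *= np.linalg.det(Q)                         # ensure a proper rotation, det = +1
moved = ref @ Q.T + np.array([1.0, -2.0, 0.5])
fitted = kabsch_fit(moved, ref)
```

For a rigid body this fit is exact; the problem discussed in the text arises because for a flexible molecule no single rotation matrix removes the overall motion for all conformations simultaneously.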

To demonstrate this effect, we adopt the above introduced example of HP35 and consider the free energy landscape Δ*G*(*x*_{1}, *x*_{2}) along the first two principal components. Using Cartesian input coordinates, Fig. 3(a) shows that the mixing of overall and internal motion results in a free energy landscape that is rather diffuse and structureless. To provide a simple structural analysis of Δ*G*(*x*_{1}, *x*_{2}), we cluster the density into 12 structurally well-defined metastable conformational states (see Sec. V B). Depicted in the lower panel of Fig. 3(a), this clustering reveals that the eight highest-populated metastable states are completely mixed by the Cartesian PCA and lie on top of each other. This is because the Cartesian principal components reflect the dominant overall motion rather than the much smaller internal motion of the protein.^{87}

It should be stressed that—given sufficient sampling—even the small-amplitude functional motion of quite rigid proteins such as BPTI was found to result in significant mixing of overall and internal motion.^{87} Although improved separation methods have been suggested,^{104} Cartesian PCA therefore may be only useful for trajectories showing few events with small structural changes. This widely underestimated problem affects many other applications of covariance matrices, e.g., in linear response theory^{105} and allosteric networks.^{106,107} For example, a recent study of the allosteric communication in PDZ2 domain^{86} revealed that the resulting correlations are in fact mostly caused by the above-described fitting problem.^{108} We note that nonlinear approaches such as scale-invariant mutual information^{109} or machine learning techniques that eliminate invariances^{59} may elegantly circumvent this problem.

### B. Distances and contacts

Internal coordinates such as intramolecular distances and angles, on the other hand, are by definition not affected by overall motion. Moreover, they represent a natural choice in the sense that the molecular force field used in MD simulations is given in terms of internal coordinates. Hence several authors have considered PCA based on distances between closest lying atoms of each residue, hydrogen bonds, or C_{α}-atoms.^{81,110–114} Since the number of distances scales quadratically with the number of considered atoms, however, the approach is numerically expensive and thus seems prohibitive for larger systems. Moreover, the inclusion of a large number of distances may result in highly correlated coordinates, whereas a PCA benefits from relatively few and only weakly correlated input coordinates.^{115}

Hence we have recently suggested focusing on the distances of interresidue contacts that are present in the native state of a protein.^{88} While native contacts are obviously important to describe small-amplitude motions of a folded protein, they have been recently shown to also largely determine the folding pathways.^{116} We consider a contact as formed if the distance between the closest lying heavy atoms of each residue is less than 4.5 Å. Moreover, we discard contacts between residues that are less than four residues apart, thereby omitting short-range contacts. Figure S1 shows the resulting contact map of HP35, which clearly reveals diagonal secondary structure contacts as well as tertiary contacts that are either contacts of the hydrophobic core or hydrogen bonds. Also shown is a contact map based on C_{α}-distances that are shorter than 8 Å, which turns out to be quite similar.
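The contact criterion used here (closest heavy-atom distance below 4.5 Å, at least four residues of sequence separation) can be sketched as follows; the toy coordinates are purely illustrative, with a single atom per residue standing in for the heavy atoms:

```python
import numpy as np

def native_contacts(residue_coords, cutoff=4.5, min_seq_sep=4):
    """Identify contacts from per-residue heavy-atom coordinates.

    residue_coords: list of (n_atoms_i, 3) arrays, one per residue.
    A contact (i, j) is formed if the closest heavy-atom distance is
    below cutoff (in Angstrom) and the residues are at least
    min_seq_sep apart in sequence."""
    contacts = []
    n = len(residue_coords)
    for i in range(n):
        for j in range(i + min_seq_sep, n):   # skip short-range pairs
            diff = residue_coords[i][:, None, :] - residue_coords[j][None, :, :]
            dmin = np.sqrt((diff ** 2).sum(-1)).min()
            if dmin < cutoff:
                contacts.append((i, j))
    return contacts

# Toy "hairpin": a chain folded back on itself, so that residue 0
# comes close to residue 9, while sequence neighbors are excluded.
coords = [np.array([[0.0, 0.0, float(k)]]) for k in range(5)]
coords += [np.array([[3.0, 0.0, float(9 - k)]]) for k in range(5, 10)]
contacts = native_contacts(coords)
```

In the actual analysis such a contact list is computed once for the native structure, and the corresponding interresidue distances are then followed along the whole trajectory as PCA input coordinates.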

Using these definitions, we calculated the associated covariance matrix in order to perform a PCA on contact distances. The resulting free energy landscape shown in Fig. 3(b) reveals the three overall energy basins of HP35,^{117} *N*, *I*, and *U*, where *N* denotes the native state, *I* contains (mostly folded) intermediate conformations, and *U* includes unfolded conformations. While the folded/unfolded structures are well discriminated by the first two principal components, native and intermediate conformations are found to partially overlap. This is because the intermediate state *I* differs from the native state *N* mainly in residue 3, which hardly changes the distances of HP35 but results in a somewhat larger flexibility of this residue.^{117} We also considered various other contact definitions (e.g., using selected C_{α}-distances), which overall were found to yield similar results, with the exception of the case where *all* (not only the preselected) C_{α}-distances were taken into account. The latter procedure yielded a clearly inferior resolution of the energy landscape, thus emphasizing the necessity of preselecting the degrees of freedom in order to reduce the noise of the data.^{88}

### C. Dihedral angles

Reflecting the secondary structure, backbone dihedral angles (*ϕ*_{i}, *ψ*_{i}) of residues *i* are valuable conformational descriptors that directly indicate whether the protein forms helices, sheets, or loops. Side chain dihedral angles, on the other hand, are expected to report on interresidue contacts. This analysis is complicated by the fact that longer side chains exhibit several dihedral angles *χ*_{n} that often show frequent changes between several rotameric states. To describe contacts, distance-based measures therefore appear to be better suited than side chain dihedral angles.^{10}

When angles are employed as input coordinates for a PCA, their periodic nature needs to be taken into account. From the periodicity, two problems may arise. First, the arithmetic mean is ill-defined for circular sampling, e.g., the arithmetic mean of two points at 170° and −170° is 0° instead of 180°, as one would intuitively assume. Second, projections in circular spaces are ambiguous because there are always two directions in which a circular coordinate can be projected (corresponding to positive and negative rotation). If we decide on a specific direction, we introduce a borderline at which points will be projected to one side and their neighbors to the other. Thus, local neighborhoods get ripped apart, which may cause large projection artifacts in the resulting free energy landscape.
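The first problem and its standard remedy, a circular mean computed from the averaged sine and cosine, can be demonstrated directly with the 170°/−170° example from the text:

```python
import numpy as np

def circular_mean(angles):
    """Circular mean of angles in radians, returned in (-pi, pi]."""
    return np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())

# Two angles close to the periodic border at +-180 deg:
angles = np.deg2rad([170.0, -170.0])

arithmetic = np.rad2deg(angles.mean())        # 0 deg: misleading
circular = np.rad2deg(circular_mean(angles))  # +-180 deg: the intuitive result
```

The circular mean averages the points on the unit circle rather than on the cut-open interval, which is why it is insensitive to the position of the periodic border.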

While one can employ suitable definitions of circular means to solve the first problem,^{118} the second one is more fundamental and does not have a general solution.^{96} A remedy to circumvent the projection problem is to convert all angles *φ* to sine/cosine-transformed coordinates (*r*_{1} = cos *φ*, *r*_{2} = sin *φ*), in order to obtain a linear coordinate space with the usual Euclidean distance as induced metric.^{40,95} This approach—known as dPCA—has been applied successfully for various systems including peptides, proteins, and RNA.^{21,89,119–124} On the downside, the inherent duplication of coordinates and the nonlinearity of the sine and cosine transformations render it difficult to interpret the results in terms of the underlying observables [cf. Fig. 2(b)].

To avoid these issues, we recently suggested an improved approach named dPCA+, which performs the analysis directly on the dihedral angles.^{96} It exploits the well-known fact that protein backbone dihedral angles do not cover the full angular space [−*π*, *π*] but are limited to specific regions due to steric hindrance (as shown by the Ramachandran plot). Thus, natural cuts between sampled regions can be defined, which may be used to decide on where to project the given data. By shifting the original data to align the periodic border to this “maximal gap” in sampling, PCA can be performed in a standard manner, without distortion errors or artificial doubling of coordinates, and with minimal projection artifacts. For example, the distribution *P*(*ψ*) in Fig. 2(b) would be shifted by *δψ* = −0.82 rad, in order to move the maximal gap of the sampling to the periodic borders at ±*π*.
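A minimal sketch of the maximal-gap idea (illustrative, not the published dPCA+ implementation): locate the largest unsampled arc of the circular distribution and shift the data so that this gap coincides with the periodic border:

```python
import numpy as np

def maximal_gap_shift(angles):
    """Shift periodic angles (in rad, range [-pi, pi)) such that the
    largest gap in the sampling lies at the periodic border."""
    s = np.sort(angles)
    gaps = np.diff(s)                             # gaps between sorted neighbors
    wrap = s[0] + 2 * np.pi - s[-1]               # gap across the border
    if wrap >= gaps.max():
        return angles                             # border already in the max gap
    k = np.argmax(gaps)
    border = 0.5 * (s[k] + s[k + 1])              # center of the largest gap
    shifted = angles - border - np.pi             # move gap center to -pi/pi
    return (shifted + np.pi) % (2 * np.pi) - np.pi  # wrap back to [-pi, pi)

# Bimodal angle data straddling the periodic border at +-pi:
rng = np.random.default_rng(5)
a = np.concatenate([rng.normal(np.pi - 0.2, 0.05, 1000),
                    rng.normal(-np.pi + 0.2, 0.05, 1000)])
a = (a + np.pi) % (2 * np.pi) - np.pi
b = maximal_gap_shift(a)
```

After the shift, the two peaks that straddled the border form one contiguous region around zero, so a standard (linear) PCA no longer rips the neighborhood apart.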

Applying dPCA+ to the MD data of HP35, we obtain a well-resolved free energy landscape [Fig. 3(c)], which—apart from the overall basins—also exhibits numerous smaller minima. As shown in the lower panel, in particular, dPCA+ succeeds in clearly separating the metastable conformational states of HP35.

### D. Discussion

Figure 3 shows that dihedral angle PCA clearly yields the best structural resolution for HP35. This finding is by no means general, however, but depends specifically on the considered system and process. For example, the second model introduced in Fig. 1, the open-close transition of T4L, is best described by changing interresidue contacts, while only a few dihedral angles change upon this transition.^{10} Generally speaking, experience shows that backbone dihedral angles are the input coordinates of choice if we consider flexible secondary structure elements (like in short peptides or in disordered loops of a protein), while distances and interresidue contacts are advantageous to report on changes in tertiary structure when a mostly rigid protein is considered. These two choices are not mutually exclusive, however. For example, the conformational reorganization due to allosteric communication in the PDZ2 domain was shown to result in changes of distances (reporting on rearrangements of contacts) as well as of backbone dihedral angles (reflecting conformational changes of flexible loops).^{86}

The above discussion shows that the first step of an MD analysis—ahead of any dimensionality reduction or clustering—should be a check of the “raw data.” For example, assuming that we are interested in the conformational transition of a protein between two end states, we first want to explore which variables change along these transitions. In particular, this includes a quick visual inspection of the (*ϕ*_{i}, *ψ*_{i}) Ramachandran plots of all residues and of the distributions of selected distances (e.g., of the contacts specified above). All variables with a narrow distribution (reporting, e.g., on rigid *α*-helices or *β*-sheets) that hardly change upon the transition can (and should) be excluded from the analysis. Furthermore, some of the motion shown by the MD trajectory may be irrelevant for the question under consideration (e.g., dangling terminal ends) and thus should be excluded too. We emphasize that any irrelevant coordinate left in the data set adds variance to the data, which may overshadow the essential motion.

## IV. CLUSTERING

Having identified a suitable set of CVs, we are next faced with the challenge of finding high-density clusters in this low-dimensional space. Corresponding to regions of low free energy [Eq. (1)], these clusters represent conformational states, i.e., protein structures that occur predominantly. If the conformational states are metastable, in the sense that they establish a time scale separation between fast intrastate fluctuations and rarely occurring interstate transitions, they can be used in a Markov state model which describes the dynamics in terms of memory-less jumps between these states.^{74–78}

In simple cases where the dynamics is well described in only one or two dimensions, direct visual inspection of the free energy landscape is sufficient to locate the regions of low energy. As the dimension of CV space is usually higher (see Sec. II A), however, we need to resort to numerical algorithms. In practice, one typically first employs geometrical clustering techniques (that use only structural information) to define some structurally well-defined microstates. To combine several rapidly interchanging microstates in a metastable state, in a second step dynamical clustering algorithms (that also use dynamic information) may be applied. Alternatively, spectral clustering^{29} has been employed in the space of inverse interresidue distances.^{30} The discussion below shows that the quality and interpretative value of a Markov state model depends almost completely on the appropriate definition of the metastable states. In particular, the first step, the geometrical identification of well-defined microstates, turns out to be crucial.

### A. *k*-means vs. density-based clustering

Among geometric clustering approaches, the *k*-means algorithm (and variations thereof) represents the most widely used method.^{23} *k*-means starts by distributing a user-defined number *k* of points representing cluster centers randomly in the given space. Iteratively, all sampling points in the data set are assigned to the cluster whose center point is closest, and new center points are calculated as the average position of all points in the respective cluster. The iteration stops when the number of points that switch their cluster assignment falls below some convergence threshold. In effect, *k*-means thus performs a Voronoi partitioning of a given data set into *k* subsets that minimizes the sum of squared distances between the objects and their corresponding cluster centroids.
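For reference, the basic algorithm in a few lines of NumPy (a bare-bones sketch; production implementations add smarter seeding and convergence handling):

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Bare-bones k-means: random centers, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every point to its nearest center (Voronoi partitioning)
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new_centers = np.array(
            [x[labels == j].mean(0) if np.any(labels == j) else centers[j]
             for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```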

While *k*-means has been a valuable clustering tool in a variety of fields,^{23} it suffers from several drawbacks when being used for locating metastable states of MD trajectories. First, although the number of sampled metastable conformational states of a protein is essentially a property of the considered MD trajectory, *k* is a required input parameter of the algorithm. While various runs with different *k* can be performed, it may be difficult to assess which of the resulting partitionings is appropriate. Moreover, the stochastic nature of the algorithm (the initial random distribution of cluster centers) leads to different results in different clustering runs, without a proper means of controlling the outcome. Most seriously, though, *k*-means separates clusters at the geometric middle between cluster centers, which may lead to microstates that are not cut at, but rather include, energy barriers. As illustrated in Fig. 4(a), this inaccurate definition of the separating barrier may cause *intrastate* fluctuations to be mistaken for *interstate* transitions. Using ill-defined states to calculate transition matrices, subsequent dynamic clustering cannot produce appropriate metastable states and often yields results that are very sensitive to details of the parameter choice, in particular, in the case of low statistics.

A possible remedy to this problem is to choose a large number *k* of states, in the hope that the finer resolution of state space will also resolve the free energy barriers properly. In fact, Prinz *et al.*^{77} showed that the associated discretization error of the resulting Markov state model can be made arbitrarily small by making the partition finer. However, given a finite number of data points, a higher number of clusters directly leads to a lower sampling of these states, which eventually hampers the accurate estimation of cluster populations and transition probabilities. Even in the case of sufficient sampling, in practice it is difficult to estimate how many states are necessary to properly discretize the given space.

*Density-based* clustering methods,^{24–28} on the other hand, do not suffer from these problems. Rather than imposing a Voronoi partitioning on the underlying space, the approach assigns a local density to every point and separates clusters along regions of low density. To this end, the algorithm first computes a local free energy estimate for every structure in the trajectory by counting all other structures inside a *d*-dimensional hypersphere of fixed radius *R*. Normalization of these population counts yields densities or sampling probabilities *P*, which give the free energy estimate Δ*G* = −*k*_{B}*T* ln *P*. Thus, the more structures are close to the given one, the lower the free energy estimate. By reordering all structures from low to high free energy, finally the minima of the free energy landscape can be identified.
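In its simplest brute-force form, this density estimate reads as follows (a sketch, with *k*_{B}*T* = 1 by default and the function name ours):

```python
import numpy as np

def local_free_energy(x, radius, kT=1.0):
    """Free energy estimate per frame from neighbor counts inside a
    hypersphere of the given radius (brute force, O(N^2))."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    counts = (d2 <= radius ** 2).sum(1)        # includes the frame itself
    p = counts / counts.sum()                   # normalized densities
    return -kT * np.log(p)
```

A frame in a densely sampled region receives a low free energy estimate, while an isolated frame receives a high one.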

In its original form, density-based clustering is not necessarily well suited to analyze MD trajectories. First, the algorithm scales quadratically (in effort and memory) with the number of input points *N*, which becomes prohibitive given the typical size of MD data (say, 10^{5}–10^{7} points). Another issue is the problem of “bandwidth selection,” with the hypersphere radius *R* being the bandwidth of the employed density kernel that has to be provided as an input parameter. (In fact, it has been argued that choosing the right bandwidth is of higher importance than the kernel choice itself.^{125}) Finally, existing methods do not necessarily assure that the resulting clusters are cut at the energy barrier.

On this account, we have recently proposed a new density-based clustering algorithm that meets these requirements.^{27} Due to the use of a box-assisted algorithm for neighbor search,^{126} the run-time behavior reduces asymptotically to *N* log *N*. Furthermore, by using an implementation that employs computational acceleration via graphical processing units (GPU), it has become possible to cluster >10^{7} points in six dimensions in a couple of hours on a standard desktop computer. Concerning the bandwidth selection problem, we exploit the fact that an equilibrium MD simulation with proper thermostat samples a Boltzmann distribution. Hence the bandwidth parameter *R* can be adjusted such that the resulting conformational distribution shows the correct resolution of the input MD data.^{127} This renders the approach quasi parameter-free, in the sense that everything can be deduced from the given data set. Finally, as demonstrated in Fig. 5 for the example of a simple three-state system, we perform a lumping procedure that by design cuts the resulting clusters at the energy barrier. As a consequence, we obtain a minimum number of states (and thus maximally available sampling) that accurately represent all local free energy minima. Hence the partitioning is optimal in the sense that a system with *n* metastable sets is best approximated by the most metastable partition into *n* states.^{128}

A major advantage of the new approach is the high quality of the clustering results. Previously it was necessary to rely on a large number (typically thousands) of microstates to properly discretize barriers and subsequently use dynamic clustering techniques to reduce the large set of microstates to a manageable number of macrostates. Employing density-based clustering, it is often not even necessary to apply dynamic coarse-graining because geometric clustering already produces a quite limited number (<100) of microstates that are well separated by free energy barriers.

### B. Dynamical coarse graining

Notwithstanding, there are several reasons why one might want to introduce macrostates that comprise several microstates. First, rare transitions between states may lead to severe undersampling of the barriers, such that geometrically distinct but dynamically close microstates are artificially separated. Second, a poorly sampled state may be geometrically closest to a state that is not its dynamically closest neighbor. Finally, for easy interpretation of the considered biomolecular process, one often prefers a coarse-grained rather than the most detailed description of the free energy landscape.

On this account, a number of dynamical clustering methods have been proposed that use dynamical information obtained from the MD trajectory (such as transition probabilities between the microstates), in order to combine MD frames which are close in time evolution (rather than close in geometry).^{31–35} They include, for example, robust Perron cluster analysis^{32} (PCCA+), the most probable path algorithm^{31} (MPP), the Bayesian agglomerative clustering engine (BACE),^{33} and the reduced dynamical model of Hummer and Szabo.^{34} See Ref. 33 for a comparison of various methods.

Here we adopt MPP analysis^{31} as a simple, robust, and intuitively appealing method to construct metastable states. Starting with a given set of microstates, MPP first calculates the transition matrix of these states. If the self-transition probability of a given state is lower than a certain metastability criterion *Q*_{min} ∈ (0, 1], the state is lumped with the state to which its transition probability is highest. This procedure is iterated until, for a given *Q*_{min}, no further states are lumped. Repeating the procedure for various *Q*_{min}, we can construct a dendrogram that demonstrates how various metastable states merge into basins with increasing minimum metastability *Q*_{min}, which illustrates the topology and the hierarchical structure of the free energy landscape.^{31} While this diagram is similar to the lumping procedure used in density-based clustering (Fig. 5), the MPP dendrogram describes the dynamical instead of the geometrical relationship between microstates.
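A single lumping sweep of this scheme can be sketched as follows (simplified; the full MPP algorithm iterates the lumping and recomputes the transition matrix, and the function name is ours):

```python
import numpy as np

def mpp_sweep(T, q_min):
    """One sweep of most-probable-path lumping: every microstate whose
    self-transition probability T[i, i] is below q_min is lumped with
    the state it most probably jumps to."""
    n = len(T)
    target = np.arange(n)                  # default: each state keeps itself
    for i in range(n):
        if T[i, i] < q_min:
            off = T[i].copy()
            off[i] = -1.0                  # exclude the self-transition
            target[i] = np.argmax(off)
    return target
```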

To construct a Markov state model that yields accurate estimates of life times and transition times, it is often necessary to perform a final refining step. That is, even when using state-of-the-art clustering methods, sampled points in transition regions may easily be misclassified due to low sampling and large errors in defining the barriers. As a consequence, intrastate conformational fluctuations may be misinterpreted as interstate transitions [see Fig. 5(b)], leading to false transition rates in the resulting model. To correct for these errors, we employ the concept of coring^{75} or milestoning.^{129} The basic idea is to identify core regions of the metastable states and to count a transition only if the core region of the new state is reached [Fig. 5(b)].^{31} Effectively, this procedure generates a new macrostate trajectory with clear-cut state boundaries.

Since core regions of clusters may be hard to identify in high-dimensional spaces, we rather suggest using dynamic coring,^{117} which defines core regions by requiring that after a transition the trajectory spends some minimum time *t*_{min} in the new state. If this condition is not met, the trajectory points are reassigned to the last visited state. Plotting the probability distribution *P*_{n}(*t*) to stay in state *n* for duration *t*, we choose the smallest coring window *t*_{min} for which the fast initial decay of *P*_{n}(*t*) due to spurious interstate transitions vanishes.
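Dynamic coring amounts to a single pass over the discrete state trajectory; a sketch with *t*_{min} in units of frames (the function name is ours):

```python
import numpy as np

def dynamic_coring(states, t_min):
    """Accept a transition only if the trajectory stays at least t_min
    frames in the new state; otherwise reassign the frame to the
    previously visited state."""
    cored = np.asarray(states).copy()
    for t in range(1, len(cored)):
        if cored[t] != cored[t - 1]:
            window = cored[t:t + t_min]
            # the transition is spurious if the stay is shorter than t_min
            if len(window) < t_min or not np.all(window == cored[t]):
                cored[t] = cored[t - 1]
    return cored
```

A brief one-frame excursion is thus assimilated into the previous state, while a committed transition is kept.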

## V. A CASE STUDY ON HP35

To illustrate the work flow introduced above, we again adopt the example of HP35. As discussed in Sec. III, for this system it is advantageous to use backbone dihedral angles as input variables. For dimensionality reduction, we first employ dPCA+ and then compare to the results obtained for TICA.

### A. Selection of CVs

Following dimensionality reduction, we need to decide which and how many components of a transformation should be included. Although PCA selects for directions of maximum variance, this is not necessarily the main criterion when we want to analyze conformational space. Rather, the discussion of Fig. 3 revealed that it is most important to test if a component shows a multipeak distribution *P*(*x*_{i}) that indicates nonrandom structures in the conformational space. In practice, this test is readily performed for the marginal distributions *P*(*x*_{i}) and selected 2D representations *P*(*x*_{i}, *x*_{j}). Considering dPCA+ on HP35, this analysis shows that the free energy profiles along the first five and the seventh component exhibit several minima (Fig. S2). We note that this information is obtained neither from the variances of these components, nor from the negentropy,^{97} which measures the deviation of their distribution from a Gaussian of the same mean and variance (see Fig. S3).
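Such a multipeak check of the marginal distributions is easily automated; a crude histogram-based sketch (bin number, smoothing, and peak threshold are ad hoc choices of ours):

```python
import numpy as np

def count_peaks(x, bins=40, min_frac=0.2):
    """Count local maxima of a lightly smoothed histogram of x that rise
    above min_frac of the global maximum (a crude multimodality test)."""
    h, _ = np.histogram(x, bins=bins, density=True)
    h = np.convolve(h, np.ones(3) / 3.0, mode="same")   # light smoothing
    return sum(1 for i in range(1, bins - 1)
               if h[i] > h[i - 1] and h[i] >= h[i + 1]
               and h[i] > min_frac * h.max())
```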

As a second test, we consider the autocorrelation function *C*_{ii}(*t*) [Eq. (5)] of the first principal components, whose decay reports on the time scales exhibited by the components [Fig. 6(a)]. In line with the finding of multipeak components, the similar decay times (apart from component 1) suggest keeping the first five and the seventh components for further analysis. The resulting number of six CVs appears to be a reasonable compromise between a high dimension (to resolve the conformational distribution) and a low dimension (to achieve sufficient statistics for clustering).
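The autocorrelation function can be estimated directly from a component time series; a simple estimator sketch:

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Normalized autocorrelation C(t), t = 0 ... max_lag - 1, of a
    one-dimensional component time series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.mean(x ** 2)
    return np.array([np.mean(x[:len(x) - t] * x[t:]) / var
                     for t in range(max_lag)])
```

By construction C(0) = 1, and a perfectly anticorrelated (alternating) series gives C(1) = −1.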

In a last step, we want to validate the resulting CVs *x*_{i} by characterizing their motion. To this end, we analyze the eigenvectors **v**^{(i)} [Eq. (4)] of the chosen CVs. Plotting the coefficients *v*_{k}^{(1)} and *v*_{k}^{(2)} as a function of the residue number *k*, Fig. 7 shows that the first dPCA+ eigenvector of HP35 is dominated by a broad distribution of *ψ* angle changes, reflecting the fact that *ϕ* angles are of minor importance for the folding of protein helices. In particular, the peak around the turn between helix 1 and helix 2 indicates that this region is essential for the rate-limiting transition between the unfolded and intermediate energy basin.^{117} The second eigenvector, on the other hand, is clearly peaked at residue 2, which is known to be essential for discriminating intermediate and native states.^{117} While the structure of the higher eigenvectors is less obvious (Fig. S2), they are necessary to reproduce the connectivity of the metastable states. In this way, we can quickly test if the selected CVs are relevant for the considered process. In particular, we may exclude irrelevant large-amplitude motion (e.g., due to dangling ends) or single rare transitions (which are statistically irrelevant).

### B. Characterization of states

Employing the 6D space defined above, we next perform density-based clustering using a hypersphere radius *R* = 0.3, which yields 76 microstates that are structurally well defined, see Fig. S4. Applying the most probable path algorithm with minimum metastability *Q*_{min} = 0.76 and lag time *τ* = 1 ns, we obtain twelve macrostates. Figure S4 also shows the associated MPP dendrogram which nicely reveals the hierarchical structure of the free energy landscape. To remove spurious transitions, dynamical coring with a minimum residence time *t*_{min} = 2 ns is performed. To illustrate the quality of this methodology, we note that an earlier analysis of HP35 using (the old version of) dPCA and *k*-means required clustering in a 10D space, which produced *k* ≈ 12 000 microstates.^{117} Employing dPCA with density-based clustering (instead of *k*-means) reduced the number of microstates to 543.^{27} Only the combination with the projection-error-free dPCA+ reduces this to the abovementioned six components and 76 microstates.^{96}

To provide a concise structural characterization of the resulting metastable states, we use the “Ramacolor” method.^{27} A Ramacolor plot assigns to each residue of a protein a unique color that reflects its conformational distribution in a (*ϕ*, *ψ*) Ramachandran plot; see Fig. 8(b). To define a color for a residue in some structural ensemble (e.g., a metastable state), we average over the colors pertaining to all (*ϕ*, *ψ*) frames of the ensemble. Figure 8(a) shows the resulting Ramacolor plot characterizing the 12 metastable states of HP35. With green, red, and blue indicating *α*-helical, *β*-extended, and left-handed conformations, respectively, we readily identify the three *α*-helices with residues 3-10, 14-19, and 22-32. Ordered by decreasing population probability, of particular interest are the first 8 states (which comprise over 95% of the total population) that have been used to characterize the free energy landscape in Fig. 3. The Ramacolor plot reconfirms various facts anticipated in the discussion above. The unfolded basin *U* comprises states 3-5 which clearly reveal a heterogeneous conformational distribution. The transition from *U* to the intermediate basin *I* involves residues 11-13, while the transition from *I* to the native basin *N* is mediated via residues 2 and 3.

As a final result, Fig. 8(c) shows a network representation of the resulting twelve-state Markov state model of HP35, which was calculated for a lag time of 2 ns.^{96} Here states are annotated by their lifetime, their size indicates their population, and the thickness of the arrows indicates the number of transitions. Quite remarkably, we find that the connectivity of these basins and the underlying metastable states of the Markov state model is directly reflected in the dPCA+ free energy landscape in Fig. 3.

### C. Comparison to TICA

It is instructive to highlight virtues and shortcomings of TICA as compared to the above results for PCA. To facilitate a direct comparison, we performed TICA on the maximal-gap shifted dihedral angles in the same way as done for dPCA+. As an additional parameter, in TICA we first need to choose the lag time *τ* at which the autocovariances should be maximized [cf. Eq. (6)]. Considering various observables such as correlation functions and energy landscapes for *τ* = 1 and 100 ns (Fig. S5), we opted for the longer lag time, which overall performs somewhat better. Since TICA is expected to achieve an improved time scale separation, we first consider the autocorrelation function for the first few components of HP35. Figure 6 shows that for both PCA and TICA, the first component clearly accounts for the slowest time scale (∼1 *µ*s), reflecting the rate-limiting step from the unfolded to the intermediate energy basin.^{117} This is followed by two intermediate time scales (∼100 ns) exhibited by components 2 and 3, as well as several shorter time scales shown by the higher components. While the decay times of the first and the higher components are rather similar for PCA and TICA, the decays of the two intermediate components are indeed somewhat slower for TICA, which is in line with the findings of previous studies.^{42–44,81}
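At its core, TICA solves a generalized eigenproblem of the instantaneous and time-lagged covariance matrices; a minimal NumPy sketch (symmetrized estimator, no regularization, function name ours):

```python
import numpy as np

def tica(x, lag):
    """Minimal TICA: maximize the autocovariance at the given lag time by
    solving C(lag) v = lam C(0) v for data x of shape (frames, features)."""
    x = x - x.mean(0)
    a, b = x[:-lag], x[lag:]
    c0 = (x.T @ x) / len(x)
    ct = (a.T @ b + b.T @ a) / (2 * len(a))     # symmetrized lagged covariance
    # whiten with C(0)^(-1/2) and solve an ordinary symmetric eigenproblem
    s, u = np.linalg.eigh(c0)
    w = u @ np.diag(s ** -0.5) @ u.T
    lam, v = np.linalg.eigh(w @ ct @ w)
    order = np.argsort(lam)[::-1]                # slowest component first
    return lam[order], (w @ v)[:, order]
```

Applied to synthetic data with one slow (AR(1)) and one fast (white-noise) feature, the first TICA component picks out the slow coordinate.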

We next consider the free energy landscape obtained from TICA; see Fig. 3(d). At variance with all PCA landscapes shown in Fig. 3, the TICA free energy is found to split up into two well-separated basins that do not correlate with the metastable conformational states found by dPCA+. Rather, the lower panel of Fig. 3(d) shows that the structurally well-defined dPCA+ states are mixed up in the TICA energy landscape and lie on top of each other. As it may be inconsistent to consider PCA states using TICA variables, we also determined the TICA conformational states by using the first six TICA components to perform density-based and MPP clustering in the same way as done in PCA (Fig. S6). The Ramacolor plot of the resulting 45 microstates and 10 metastable states in Fig. S7 shows that in both cases only the first two states are structurally well defined, while the remaining states are very heterogeneous and sparsely populated. This is again at variance with the result of 12 structurally well-defined metastable states obtained by dPCA+.

To explain these findings, we analyzed the Ramacolor plot of the TICA metastable states (Fig. S7). Interestingly, it reveals that all seven sparsely populated states exhibit left-handed residues in the first or the third helix of HP35. Since *α*_{R}, *β* ↔ *α*_{L} transitions occur quite infrequently and TICA by design focuses on the slowest time scales of the system, Fig. S7 suggests that TICA yields first components that discriminate these rare events. This finding is supported by an analysis of the first TICA eigenvectors (Fig. S8). Unlike the case of dPCA+ shown in Fig. 7, the TICA eigenvectors contain contributions of *ϕ* and *ψ* angles with similar weights (although the *ψ* angles are much more important for the folding of *α* helices) and do not emphasize the functionally important regions such as residue 2 and the turn between helix 1 and helix 2.^{117} Rather, the TICA eigenvectors contain *ϕ* dihedral angles with high weights because these angles discriminate between left- and right-handed conformations.

As discussed in Fig. 8, the structurally well defined metastable states identified by dPCA+ allow us to characterize the pathways of the folding of HP35 and thus describe the mechanism of the reaction. On the other hand, the hardly populated left-handed states identified by TICA are of little relevance for the folding process, although they account for the slowest time scales of the system. We note that the magnitude and convergence of the implied time scales of the resulting Markov model, which are commonly adopted as quality criterion for a dimensionality reduction method,^{77,81} naturally do not reveal this problem. In line with the autocorrelation functions shown in Fig. 6, we rather find that TICA achieves somewhat longer implied time scales than PCA (see Fig. S9) which could misleadingly be interpreted as better representation.

While further work is required to study the generality of this issue, first results on small peptides and the PDZ2 domain indicate that the enforced focus on the slowest time scales, combined with the rescaling-induced distortion of the conformational density, generally leads to metastable states of larger structural heterogeneity and lower interpretative value.^{130} Another example is given by a TICA study^{81} of the 1 ms long trajectory of BPTI by Shaw *et al.*,^{1} which highlights a single rare event (of no statistical and minor functional importance) and therefore yields longer implied time scales than PCA. Employing suitable input coordinates (e.g., contacts or dihedral angles), the latter, though, focuses on the statistically relevant functional dynamics.^{88}

We conclude that, although TICA indeed may provide an improved time scale separation, it is not necessarily better suited to detect the functionally relevant metastable conformational states in subsequent clustering. The definition of physically meaningful states, though, is arguably the main cornerstone for the construction of a Markov state model accounting for conformational dynamics. If TICA is to be used, we therefore highly recommend validating the resulting CVs by characterizing their motion.

## VI. ESSENTIAL COORDINATES

This section extends the above discussion in two ways. First, we briefly introduce a recently proposed machine learning approach to dimensionality reduction, which naturally leads to the “essential coordinates” of the system.^{64} Second, we reconsider the functional dynamics of T4L, which represents a counterexample to the above introduced methodology in the sense that it defies standard approaches to identify CVs.^{10} Machine learning and nonequilibrium techniques are proposed as an alternative way to cope with this problem.

### A. Machine learning of dimensionality reduction

In Secs. II–V A, we have elaborated on a general and systematic strategy to identify CVs. Described as linear combinations of high-dimensional input coordinates, however, CVs tend to be hard to interpret as they do not necessarily point to the *essential internal coordinates*, i.e., specific interatomic distances or dihedral angles that are important for the considered process. Information on these coordinates, on the other hand, may be provided by the metastable conformational states of the system, which provide a well-defined representation of the regions of low free energy. Given such states, for example, we might ask questions like: “Which coordinates classify a state?,” “Which coordinates distinguish between states?,” or “How can we infer the physical mechanism of a transition between states?”

With this idea in mind, we have recently proposed a supervised machine learning method that constitutes an alternative approach to reduce the dimensionality of a complex molecular system.^{64} That is, given a trajectory of MD coordinates and a set of metastable states defined by these coordinates, a machine learning model is trained that assigns new MD data to the state they most likely belong to. To this end, the model learns classification rules based on certain features of the MD coordinates, e.g., a certain distance should be bigger than a trained cut-off value, or some angles should lie in a certain region. Rather than using popular deep learning networks, we have employed a boosted decision tree algorithm called XGBoost^{131} that allows us to directly determine the importance of input features. Extending previous studies,^{55,56} we have devised a new algorithm that exploits this “feature importance” via an iterative exclusion principle [Fig. 9(a)]. That is, given a trained model, we sort all coordinates by their importance, remove the coordinate with highest (or lowest) importance from the training set and reiterate the procedure by retraining the model based on all remaining coordinates. When discarding the coordinate of least importance first, for example, we can easily filter out all nonessential coordinates (that do not change the accuracy of the model when discarded) and thus obtain the desired essential coordinates.
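The iterative exclusion principle itself is independent of the particular classifier; a toy sketch that substitutes a simple class-mean separation score for XGBoost's feature importance (all names are ours, and the importance measure is deliberately crude):

```python
import numpy as np

def iterative_exclusion(x, y, n_keep):
    """Iteratively drop the least important feature until n_keep remain,
    using the (crude) importance |mean_0 - mean_1| / std per feature for
    a two-class labeling y in {0, 1}."""
    features = list(range(x.shape[1]))
    while len(features) > n_keep:
        sub = x[:, features]
        m0, m1 = sub[y == 0].mean(0), sub[y == 1].mean(0)
        importance = np.abs(m0 - m1) / (sub.std(0) + 1e-12)
        features.pop(int(np.argmin(importance)))   # discard least important
    return features
```

On synthetic data where only one feature separates the two classes, the procedure correctly retains that feature as the single "essential coordinate."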

As an example, Fig. 9 shows the application of the machine learning method to identify the essential coordinates of the folding of HP35. Based on the state definitions obtained from the clustering described in Sec. V B, we trained XGBoost on the full MD data set of dihedral angles. Figure 9(b) shows the resulting feature importance plot that quantifies how much a single dihedral angle affects the classification of a given metastable state. Remarkably, the most important dihedral angles identified by XGBoost are clearly the ones also highlighted as state-defining by the Ramacolor plot in Fig. 8(a). Employing the iterative scheme to discard the coordinate of least importance first, Fig. 9(c) shows the accuracy loss of XGBoost plotted as a function of the number of discarded coordinates. While the accuracy is found to be remarkably stable for all states when discarding up to 60 coordinates, it decreases sharply for most states when the remaining six coordinates are removed. These six variables, given by the backbone dihedral angles *ϕ*_{3}, *ϕ*_{2}, *ψ*_{13}, *ψ*_{2}, *ψ*_{10}, and *ϕ*_{33} (ordered by decreasing importance), represent the desired essential coordinates.

Discriminating the metastable states of the system, essential coordinates are expected to elucidate the mechanism of the considered process. In fact, we find that *ϕ*_{3}, *ϕ*_{2}, and *ψ*_{2} account for the structure of the N-terminus that is known to be crucial to discriminate the folded and the intermediate basins,^{27} while *ψ*_{13} and *ψ*_{10} describe the structure of the loop between helices 1 and 2 that is important to discriminate the intermediate and the unfolded basin [cf. Fig. 1(a)]. To reveal the time scales described by the essential coordinates, Fig. 6 compares their autocorrelation functions to the results obtained for the first components of dPCA+ and TICA. Remarkably, we notice that the decay times of the essential coordinates are in part significantly longer than the results for the principal components, and are overall also longer than the TICA decay times. That is, even without enforcing long time scales, essential coordinates account for rare events and are therefore promising candidates for reaction coordinates. Most interestingly, we find that the time scales of the essential coordinates can be directly related to a corresponding physical process. For example, *ψ*_{13} is associated with the relative positions of helices 1 and 2 and as such describes the overall folding-unfolding transition, which corresponds to the slowest process of the system.^{27} On the other hand, *ϕ*_{2} and *ψ*_{2} account for relatively fast fluctuations of the N-terminus, which drive a twisting motion described by *ϕ*_{3} that leads to the destabilization of the protein.^{117} The latter is associated with relatively slow rearrangements of *ψ*_{10} and *ϕ*_{33} and finally leads to unfolding involving again *ψ*_{13}. Hence essential coordinates may provide a direct view of hierarchically coupled fast and slow motions.^{83–85}

### B. Identification of hidden coordinates

So far we have mostly focused on the folding of HP35 as a representative model example. By contrast, functional dynamics of biomolecules is not necessarily mediated by large conformational changes or slow time scales. To demonstrate the virtues and limits of the methods discussed above, it is instructive to reconsider our second example introduced in Fig. 1, the open-closed transition of T4L. While this prominent hinge-bending motion is well described by the radius of gyration, we have already pointed out that the energy barrier of the corresponding free energy landscape is significantly too small to account for the microsecond transition time of T4L. In fact, Ernst *et al.*^{10} tested prospective candidates for a reaction coordinate by studying the molecule’s response to external pulling along the coordinate, using targeted MD simulations.^{18} While directly enforcing the open-closed transition did not recover the two-state behavior of T4L, this transition was found to be triggered by a “hidden” locking mechanism, by which the side chain of Phe4 changes from a solvent-exposed to a hydrophobically buried state.

It should be stressed that standard dimensionality reduction methods such as various types of PCA fell short of finding CVs that clearly indicate the locking mechanism. The best hints came from contact PCA, whose leading eigenvectors reveal frequent occurrences of Phe4. In this respect, the hinge-bending motion of T4L represents a prime example of complex functional dynamics whose underlying mechanism is not readily explained by straightforward dimensionality reduction. Apart from the strategy to identify and validate possible reaction coordinates via targeted MD simulations suggested in Ref. 10, the machine learning approach described above^{64} might help to identify hidden coordinates such as those of the locking mechanism of T4L.

The idea pursued in Ref. 64 is to employ machine learning using the known states only (here the open and closed states of T4L) and to see if the resulting essential coordinates help to clarify the reaction mechanism. To this end, we define open and closed states of T4L based on the opening distance *d*_{21,142} and train an XGBoost model. By removing the most important coordinate from the data set at every iteration, we force XGBoost to reclassify the importance of alternative descriptors. As may be expected, the first three most important distances connect the two sides of the binding pocket [see Fig. 1(b)] and therefore directly monitor the open-closed transition of T4L. Surprisingly, though, the next three distances all involve the residue Phe4, which is the key residue of the locking mechanism. Although these distances are not evidently associated with the open and closed states used to train the model, they are apparently almost as descriptive as the state-defining opening distance.
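The iterative reclassification just described can be sketched as a simple loop. This is a schematic reconstruction, not the actual code of Ref. 64: scikit-learn's random forest serves as a convenient stand-in for XGBoost, and the toy data (a binary "open/closed" label defined by a single feature) merely mimics the role of the opening distance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_by_iterative_removal(X, y, n_rounds):
    """Rank features by repeatedly training a classifier, recording the most
    important remaining feature, and removing it from the data set, thereby
    forcing the model to reclassify alternative descriptors."""
    remaining = list(range(X.shape[1]))
    ranked = []
    for _ in range(n_rounds):
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(X[:, remaining], y)
        top = remaining[int(np.argmax(clf.feature_importances_))]
        ranked.append(top)
        remaining.remove(top)
    return ranked

# Toy example: a two-state label determined by feature 3 alone, standing in
# for states defined via the opening distance; the loop then uncovers which
# alternative descriptors become important once feature 3 is removed.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 3] > 0).astype(int)
ranked = rank_by_iterative_removal(X, y, n_rounds=3)
```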

Representing this mechanism by the distance *d*_{4,60} between residues Phe4 and Lys60 and the hinge-bending motion by the distance *d*_{21,142}, the resulting free energy landscape shown in Fig. 10 reveals a four-state model of the functional dynamics of T4L that describes a hierarchical coupling^{83–86} of the fast nanosecond opening-closing motion and the slow microsecond locking transition. That is, besides the main open and closed states, we find two intermediate states associated with the locking transition, whose structures agree perfectly with the structures of the four metastable states found in the targeted MD study by Ernst *et al.*^{10} However, while the latter authors spent considerable effort to identify these states, the XGBoost algorithm found these hidden conformational states semiautomatically, starting from the naive assumption of a two-state system.
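A two-dimensional free energy landscape such as the one in Fig. 10 follows directly from the relation Δ*G*(**x**) = −*k*_{B}*T* ln *P*(**x**) applied to the histogrammed CV data. The sketch below uses placeholder Gaussian data (in arbitrary units) in place of the actual distances *d*_{4,60} and *d*_{21,142}; the function name and defaults are our own choices.

```python
import numpy as np

def free_energy_2d(x, y, bins=50, kBT=1.0):
    """Estimate dG(x, y) = -kBT ln P(x, y) on a 2D histogram grid.

    Empty bins get dG = +inf; the global minimum is shifted to zero,
    so dG is reported relative to the lowest-energy bin."""
    P, xedges, yedges = np.histogram2d(x, y, bins=bins, density=True)
    with np.errstate(divide="ignore"):
        dG = -kBT * np.log(P)
    dG -= dG[np.isfinite(dG)].min()
    return dG, xedges, yedges

# Placeholder data: two overlapping basins in a 2D CV plane, mimicking
# two metastable states separated by a barrier.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.5, 0.1, 5000), rng.normal(1.0, 0.1, 5000)])
y = np.concatenate([rng.normal(1.0, 0.1, 5000), rng.normal(1.5, 0.1, 5000)])
dG, xe, ye = free_energy_2d(x, y, bins=40)
```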

## VII. CONCLUDING REMARKS

Dimensionality reduction and the related concepts of collective variables (CVs) and clustering occur in numerous and partly diverse fields, ranging from the physical sciences to financial markets. By contrast, this perspective has focused on a very specific topic, namely, the statistical analysis of biomolecular MD trajectories with the aim of modeling structural dynamics on multiple time scales. Given the high dimensionality of biomolecular systems combined with hierarchically coupled fast and slow motions, the analysis of MD data poses specific challenges. While some methods (such as *k*-means) may be favorable in many fields, they are not necessarily the best choice for MD analysis. In practice, a successful post-simulation modeling of MD data is often hampered by the black-box use of analysis software and a missing awareness of the (typically very) limited sampling of standard MD simulations. Unfortunately, the combination of inappropriate methods with poor sampling may nonetheless yield seemingly plausible results, whose relevance is often difficult to assess.

In this perspective, we have elaborated on a general and systematic strategy to identify CVs, which involves the choice of input coordinates, dimensionality reduction methods, and clustering techniques. Adopting the folding of HP35 as a model example, we have discussed various issues, such as projection errors and rescaling errors in dimensionality reduction and various misconceptions concerning clustering. In particular, we have stressed the significance of (i) using non-redundant internal input coordinates, (ii) avoiding dimensionality reduction methods that suffer from rescaling errors, (iii) validating the resulting CVs by characterizing their motion, and (iv) using deterministic barrier-preserving clustering methods such as density-based clustering. Furthermore, we need to decide whether we seek a mathematically optimal Markov model (as aimed for by TICA) or rather a faithful characterization of the conformational distribution (as aimed for by PCA). It is the latter that has been shown to be advantageous for answering biophysical questions concerning, e.g., folding pathways or reaction mechanisms. While we have only considered a few molecular examples and mainly focused on our own methods in this perspective, these general and rather obvious principles, motivated by basic physical and statistical considerations, should be valid independent of the particular system and method employed.

Although we have mainly considered the application of CVs to analyze MD simulations, many of our conclusions also apply to their use in enhanced sampling methods.^{14–19} As a main difference, biased MD techniques are typically restricted to one or two CVs. In fact, we have found for the two examples considered here, HP35 and T4L, that already two CVs may be sufficient to resolve the main conformational states. The additional degrees of freedom required for clustering purposes are mainly needed to establish the correct connectivity between these states. The latter should be less of a problem for biased MD simulations that evolve freely in the remaining degrees of freedom. We note that various machine learning approaches have been suggested that aim for an adaptive generation of CVs.^{57–59,62,63}

At present, the development of post-simulation models is at a transition point from gathering concepts and ideas to a systematic, consistent, and well-understood methodology. As in more mature research fields such as electronic structure theory and force field development, this requires agreement within the community on well-defined benchmark problems, key observables of interest, and clear quality assessment criteria. Recalling the huge diversity of intriguing biophysical problems these methods could be applied to, it will take the rigor of mathematical definitions, the insight of physical laws, and the intuition of chemists to master the ever-new challenges that arise.

## SUPPLEMENTARY MATERIAL

See the supplementary material for 1D and 2D energy landscapes, variances, and negentropy of the first principal components of HP35, as well as contact maps and a Ramacolor plot of the microstates. For the TICA study, autocorrelations and energy landscapes for different lag times, the eigenvector content of the first two components, Ramacolor plots of the micro- and macrostates, and implied time scales are reported.

## ACKNOWLEDGMENTS

We thank Björn Bastian, Simon Brandt, Sebastian Buchenberg, Matthias Ernst, Thomas Filk, Abhinav Jain, Benjamin Lickert, Daniel Nagel, Sophia Ohnemus, Matthias Post, Anna Weber, and Steffen Wolf for providing computational data and for numerous instructive and helpful discussions, Lucie Delemotte, Peter Hamm, Jerome Henin, and Frank Noe for helpful comments on the manuscript, and D. E. Shaw Research for sharing their trajectories of HP35. This work has been supported by the Deutsche Forschungsgemeinschaft (Grant No. Sto 247/11).

The dPCA+ method^{96} and the density-based clustering algorithm^{27} were implemented in the open source software *FastPCA* and *Clustering*, respectively, which have also been embedded in the *prodyna* R-library, a toolkit for dimensionality reduction, clustering, and visualization of protein dynamics data. All programs are freely available at https://github.com/lettis.

## REFERENCES


Since PCA is applied to real-valued variables here, we can also speak of an *orthogonal* transformation instead of a *unitary* transformation (i.e., its complex generalization).