Many processes of scientific importance are characterized by time scales that extend far beyond the reach of standard simulation techniques. To circumvent this impediment, a plethora of enhanced sampling methods has been developed. One important class of such methods relies on the application of a bias that is a function of a set of collective variables specially designed for the problem under consideration. The design of good collective variables can be challenging and thereby constitutes the main bottle neck in the application of these methods. To address this problem, recently we have introduced Harmonic Linear Discriminant Analysis, a method to systematically construct collective variables as linear combinations of a set of descriptors. The method uses input information that can be gathered in short unbiased molecular dynamics simulations in which the system is trapped in the metastable states. Here, to scale up our examination of the method’s efficiency, we applied it to the folding of chignolin in water. Interestingly, already before any biased simulations were run, the constructed one-dimensional collective variable revealed much of the physics that underlies the folding process. In addition, using it in metadynamics, we were able to run simulations in which the system goes from the folded state to the unfolded one and back, where to get fully converged results, we combined metadynamics with parallel tempering. Finally, we examined how the collective variable performs when different sets of descriptors are used in its construction.

## INTRODUCTION

Simulations of complex processes such as drug binding, protein association, protein folding, phase transitions, etc., have proven to be of great value and are a pillar of contemporary scientific investigation. However, many such processes are characterized by very long time scales which prohibit their simulation using conventional simulation techniques. Hence, to circumvent this limitation, a variety of enhanced sampling methods have been developed over the years, including replica exchange based methods such as parallel tempering^{1} and bias based techniques such as umbrella sampling,^{2} metadynamics,^{3} and variationally enhanced sampling.^{4} The latter category relies on the use of collective variables (CVs) which describe the most essential degrees of freedom of the processes being considered. Constructing appropriate CVs, however, can be challenging and time consuming. Thus, and in light of the expected continuing increase in the complexity and size of the systems being studied, devising techniques for the systematic construction of efficient CVs are regarded as an important objective of the enhanced sampling community. Also, finding good CVs is not only a technical issue, but is a way of encoding in a compact and transparent way the essence of the process being considered.

In the effort to address this challenge, in recent publications,^{5,6} we have proposed a new scheme for constructing systematically viable CVs through the utilization of the supervised learning class classification paradigm and, in particular, using a modification of Fisher’s Linear Discriminant Analysis (LDA), termed as Harmonic Linear Discriminant Analysis (HLDA) (see also Ref. 7). (The LDA assumes a multivariate normal distribution of the descriptors. For this reason, we also examine how deviation from multivariate normality affects the CVs’ efficiency.) The scheme of choice requires as input only short unbiased trajectories, for each metastable state. Using these data, HLDA can estimate the direction within an *N*_{d} dimensional space of selected system descriptors upon which the projections of these sets of data are best separated. The linear combination corresponding to this direction is then utilized as the CV.

To test its applicability, HLDA has been used in Ref. 5 in two examples taken from the realm of materials science and chemistry. In both cases, HLDA was found to be able to generate good CVs, leading to biased runs characterized by a high frequency of transitions between the metastable states of interest and to a rapidly converging sampling.

The application of HLDA in Ref. 5 was, however, still confined to a set of relatively simple problems. Hence, if its use (and the class classification paradigm underlying it) is to be adopted for scientific and technological problems of increasing complexity, it would need to prove effective in such systems. Here, we would like to investigate the performance of HLDA for a relatively more complex system. We consider the case of a small protein, chignolin, for which extensive simulations on purpose build machines are available.^{8} The procedure to determine the HLDA CVs requires the use of a convenient set of descriptors *d*_{i}(*R*) that are capable of describing the initial and final states. In this study the effect of the choice of descriptors on the quality of the CVs will be examined as well.

We show also how the HLDA CVs bring out much of the physics even before performing the simulations. Moreover, we find that employing the HLDA CVs within metadynamics simulations enables sampling numerous folding and unfolding path-ways and that through their incorporation in Parallel Tempered Metadynamics (PTMetaD) simulations,^{9} estimates for the system Free Energy Surface (FES) can be obtained.

## METHODS

### Constructing the collective variable

To construct CVs which describe the chignolin folding process, we utilize the paradigm introduced in Ref. 5 that estimates the direction **W** in an *N*_{d} dimensional descriptor space in which the projections of the unbiased distributions of the folded and unfolded states are best separated. As in Ref. 5, we do this using HLDA, a modification of Fisher’s LDA. Thus, as in Fisher’s LDA, the estimation of **W** is done through the maximization of the ratio between the system’s so-called between class **S**_{b} and within class $Sw$ scatter matrices. Like LDA, the former is measured by the square of the distances between the projected means and can be written as

with

where *μ*_{A,B} are the expectation values of the two metastable states. In contrast to LDA, in which the within class matrix is estimated using the average of the two metastable states’ multivariate variances **Σ**_{A,B}, here it is estimated from the spread harmonic average

with

The HLDA objective function which has the form of a Rayleigh ratio

is then maximized by

which in turn yields the HLDA CV

### Computational details

Simulations of chignolin (sequence TYR-TYR-ASP-PRO-GLU-THR-GLY-THR-TRP-TYR) were conducted using GROMACS 5.1.2^{10,11} and the PLUMED 2.4 plugin.^{12} The CHARMM22^{*} force field^{13} and the three-site transferable inter-molecular potential (TIP3P) water model^{14} were used to make direct comparisons with Ref. 8. ASP and GLU residues were simulated in their charged states, as were the N- and C-terminal amino acids. A time step of 2 fs was used for all systems, and a constant temperature of 340 K (in agreement with Ref. 8) was maintained by the velocity rescaling thermostat of Bussi *et al.*^{15} All bonds involving H atoms were constrained with the linear constraint solver (LINCS) algorithm.^{16} Electrostatic interactions were calculated with the particle mesh Ewald scheme,^{17} and a 1 nm cutoff was applied to all non-bonded interactions. Runs were all conducted with a box containing 1649 water molecules, along with 2 sodium ions to neutralize the system. Parallel tempering simulations were run with 40 replicas, each at a different temperature. The replica with the lowest temperature was set to be at T = 340 K while the rest of the temperatures were arranged in a geometrical series with a factor *a* = 1.011. The Well Tempered Metadynamics (WTMD) Gaussian heights and widths were 2.82 KJ/mol and 0.01, respectively, and a bias factor of 8 and a Gaussian deposition rate of 500 were used in all simulations. (Details regarding the utilized descriptor sets can be found in the supplementary material.)

## RESULTS

In accordance with the basic requirements of the HLDA approach, we began the study by acquiring two unbiased trajectories taken from the Shaw unbiased simulations spanning roughly 2 *μ*s each in the system’s folded and unfolded states (these time frames were chosen rather arbitrarily and as can be seen in Fig. 1 of the supplementary material, depending on the descriptor set used, trajectories as short as 100 ns could suffice). The first step in applying HLDA requires the selection of a set of system descriptors that can be instrumental in describing the folding and unfolding of the protein. Ideally, since this step should not require an expert’s understanding of the system, we proceeded by selecting fairly naively three different sets of descriptors to observe how in the present context this selection can influence the outcome of the method implementation. Here, we also chose to assess the HLDA ability to perform beyond its strict theoretical limitations, namely, that the descriptor unbiased fluctuations are normal in form.

Thus, the first set *D*_{1} consisted of 12 distances between different atomic sites on the protein. Six of these distances were selected between atoms situated on the backbone, while six more were taken between atoms situated on the protein’s side chains. The second set of descriptors *D*_{2} consisted of 6 contacts placed on the protein’s backbone while the third *D*_{3} consisted of *αβ* functions corresponding to the protein’s backbone dihedral angles, i.e., $\alpha \beta =12(1+cos(\varphi i))$, with *i* = 1…18. (For a detailed list of the atom pairs used for the construction of *D*_{1} and *D*_{2} and the contact parameters, see the supplementary material.) The two covariance matrices **Σ**_{f}, **Σ**_{u} and two mean vectors *μ*_{f}, *μ*_{u} corresponding to the folded and unfolded states, respectively, were thus constructed for each of the descriptor sets.

Using this information and applying HLDA, we could next obtain an estimation of the hyper planes that best separated the unbiased distributions corresponding to the folded and unfolded states within the space spanned by each set of descriptors. Concomitantly, the sought after CVs were obtained using Eq. (7). The weights of the HLDA CVs attained for each of the descriptor sets are plotted in Fig. 1 along with illustrations of the utilized descriptors.

Interestingly, analysis of the weight distributions reveals much valuable information about the system showing that the main features of the folding process are encoded in the CVs themselves. Thus, we find that for both the sets *D*_{1} and *D*_{2}, most of the weight is assigned to the descriptors *d*_{1}, *d*_{2}, and *d*_{3} which correspond to the distances and contacts between facing amino acids located away from the backbone beta-turn. Similarly, in both *D*_{1} and *D*_{2}, the distances/contacts located in the beta-turn are found to be comparably less important, alluding to the fact that in the unfolded state, the beta-turn is intermittently formed. In the case of *D*_{1}, the distances between the side chains are assigned lower weights as well. Nevertheless, for the distances *d*_{10}, *d*_{11}, and *d*_{12}, non-negligible weight is assigned, reflecting the associated side-chain’s contact formation in the folded state, due to their hydrophobic nature.

Inspecting the weight distribution obtained for *D*_{3} reveals interesting trends as well. Thus, one can observe that by and large the higher weights of the CV are assigned to *αβ*s of the backbone dihedral angles situated in and near the backbone beta-turn, reflecting these angles’ importance in the folding process. In addition, we find that in comparison, it is the *αβ*(Ψ) that attain higher weights. Inspection here shows that while the *αβ*(Φ) fluctuations do not change much between the folded and unfolded states, clearly configurational changes of *αβ*(Ψ) between the folded and unfolded states are present. Another interesting feature of the CV weight distribution is that with the exception of *d*_{6}, the weights of *αβ*(Ψ) and *αβ*(Φ) tend to be in anti-phase. Here, inspection shows that this results from the correlation between the *αβ*(Ψ) and *αβ*(Φ) fluctuations in the *α*-helical basin that is visited in the unfolded state. Moreover, we attribute the single positive value of *d*_{6} (which corresponds to the glycine Φ) to the fact that unlike the other Φ backbone dihedral angles, it can also assume a left helix conformation.^{18} Finally, examination of *d*_{13}, the descriptor assigned with the largest weight, shows that it corresponds to the *αβ*(Ψ) of the proline amino acid, coinciding with the fact that such angles are associated with a relatively high energetic rotational barrier.^{19}

### Biased simulations

With the CVs at hand, we could next launch metadynamics simulations with the objective of sampling the system phase space. Monitoring these simulations, we found that several folding events could be observed with *D*_{1} and *D*_{2} taking the lead in the transition frequencies. Thus, with the little initial information with which we commenced, an assortment of folding and unfolding pathways could be harvested, thereby shedding light on the mechanisms underlying these events. Figure 2 presents segments of three simulations, each run with a CV generated by a different descriptor set, showing the *Cα* RMSD of the protein with respect to its folded crystal structure as a function of metadynamics simulation time. All three segments exhibit both folding and unfolding events.

Despite observing even multiple transitions between the folded and unfolded states in some of the simulations, attaining estimates of converged FES using metadynamics alone was not possible. Observing the simulation dynamics, we found this to be caused by the system’s phase space intricate multidimensional nature with a profusion of kinetic bottlenecks and free energy barriers. Thus, to circumvent this impediment, we resorted to the utilization of parallel tempering metadynamics which has previously been shown to be very effective for such problems.^{9} By running such simulations, now with the HLDA generated CVs, we could observe that for all three sets of descriptors, effective sampling of the relevant system phase space was achieved (see Fig. 2). Moreover, estimates of the system FES could be obtained using the WTMD^{20} relation, Eq. (8),

where *γ* is the WTMD bias factor and *V*(*s*) is the simulation bias. Figure 3 presents the FES obtained from three different simulations using the CVs generated by the three different sets of descriptors. For the sake of comparison, the corresponding FES obtained using a histogram analysis on a 100 *μ*s unbiased simulation taken from the Shaw data bank^{8} is presented as well. As can be seen in all three cases, reasonable estimates of the FES could be obtained. However, the results obtained using the sets *D*_{1} and *D*_{2} are markedly more accurate. Additionally, differences in the convergence times between the different simulations were observed, namely, *t*_{conv}(*D*_{1}) ≈ 25 ns, *t*_{conv}(*D*_{2}) ≈ 35 ns, and *t*_{conv}(*D*_{3}) ≈ 45 ns, where we defined convergence when the calculated FES fluctuations as a function of simulation time reached their minimal amplitudes around their final average result.

One probable reason for the differences in performance between the different CVs is the extent to which their underlying descriptor unbiased fluctuations deviate from multivariate normality. To assess the possible influence of such deviations on the CVs quality, we thus computed the kurtosis and skewness^{21} of the unbiased fluctuations of the covariance matrices, **Σ**_{f,u}, eigenvectors. Figure 4 presents these data for each of the descriptor sets in both the folded and unfolded states. As can be seen, the values obtained for *D*_{1} largely match those that correspond to perfect normal distributions (indicated by dashed lines). In slight contrast, the set *D*_{2} can be seen to exhibit larger deviations from normality, while the set *D*_{3} seems to exhibit the most and largest instances of such deviations. Observation of these data thus indicates that while HLDA seems to be forgiving when applied to data which is not strictly multinormal, it is likely that a price is to be paid in their quality as deviations increase.

## CONCLUSIONS

As the size and complexity of systems simulated using molecular dynamics increase, the need for systematic ways of constructing viable CVs for these systems which do not require an expert’s knowledge is becoming more evident. In the present study, we have applied HLDA, a recently developed modification of LDA to develop one-dimensional CVs for the folding problem of chignolin. In doing so, we have found that given a naive selection of descriptors, the method is able to generate CVs that, when biased, are able to drive the system back and forth between the system’s folded and unfolded states. In addition, we have found that incorporating these CVs in PTMetaD can enable obtaining of good estimates of the system FES. In both cases, we found that some deviation from multivariate normality is tolerated by HLDA, yet increasing the amount of deviation may lead to a reduction in the constructed CVs’ quality. Finally, we found that within the weight distributions of the calculated HLDA CVs themselves reside abundant useful information and physical insight about the process being studied. We thus conclude that the present study suggests that HLDA can be applied to increasingly more complex systems for the systematic construction of CVs, a path which we wish to continue and explore in the near future.

## SUPPLEMENTARY MATERIAL

See supplementary material for additional information regarding the simulations’ implementation.

## ACKNOWLEDGMENTS

We acknowledge D. E. Shaw Research for sharing data from the simulations of chignolin. This research was supported by the European Union Grant No. ERC-2014-AdG-670227/VARMET. Calculations were carried out on the Mönch cluster at the Swiss National Supercomputing Center (CSCS).

## REFERENCES

^{2}test