In this paper, we present a fast and adaptive correlation guided enhanced sampling method (CORE-MD II). The CORE-MD II technique relies, in part, on partitioning of the entire pathway into short trajectories that we refer to as instances. The sampling within each instance is accelerated by adaptive path-dependent metadynamics simulations. The second part of this approach involves kinetic Monte Carlo (kMC) sampling between the different states that have been accessed during each instance. Through the combination of the partition of the total simulation into short non-equilibrium simulations and the kMC sampling, the CORE-MD II method is capable of sampling protein folding without any *a priori* definitions of reaction pathways and additional parameters. In the validation simulations, we applied the CORE-MD II on the dialanine peptide and the folding of two peptides: TrpCage and TrpZip2. In a comparison with long time equilibrium Molecular Dynamics (MD), 1 *µ*s replica exchange MD (REMD), and CORE-MD I simulations, we find that the level of convergence of the CORE-MD II method is improved by a factor of 8.8, while the CORE-MD II method reaches acceleration factors of ∼120. In the CORE-MD II simulation of TrpZip2, we observe the formation of the native state in contrast to the REMD and the CORE-MD I simulations. The method is broadly applicable for MD simulations and is not restricted to simulations of protein folding or even biomolecules but also applicable to simulations of protein aggregation, protein signaling, or even materials science simulations.

## I. INTRODUCTION

Molecular Dynamics (MD) and Monte Carlo (MC) simulations are important theoretical tools for the investigation of biological systems on a molecular level. For the last two decades, theoretical improvements, the improvement of forcefield parameter sets, and a substantial rise of the performance of hardware established MD and MC as methods that are complementary to experiments. MD and MC strongly contributed to a better understanding of the underlying molecular processes of protein folding,^{1–4} protein aggregation,^{5,6} and protein signaling.^{7} Both methods play an essential role in the modeling of drug–host interactions,^{8} protein self-assembly,^{9} and protein membrane interactions.^{10} Despite the importance of both techniques, the timescales of biological processes can exceed the accessible computable time-ranges by many orders of magnitude due to the complexity of the underlying energy landscapes. This problem has been tackled through the improvement of the computer hardware,^{11} a reduction in the complexity through the development of coarse-grained models,^{12–14} and algorithmic improvements that raised the performance of MD and MC.^{15–19}

Leaving aside advances based on coarse-grained modeling and the development of efficient hardware and software, the group of umbrella sampling methods solves the timescale problem through projections of the trajectory space to underlying energy landscapes that contain the dimensionality of specific collective variables. Several groups and sub-groups of methods have emerged that belong to the class of umbrella sampling methods.^{20,21} Prominent members of that class are the metadynamics method,^{22} *λ*-dynamics,^{23,24} adaptive bias MD,^{25} the hyperdynamics method,^{26} conformational flooding,^{27} and the local elevation technique.^{28} Regardless of the aforementioned classes of methods, techniques that accelerate the sampling in the trajectory space are adaptive and in most cases do not contain the need of *a priori* definitions of reaction coordinates. This group of methods relies on first-principles conjectures, such as approximations of the geometric degrees of freedom of a molecule or propagation methods that coarse-grain fast thermal fluctuations of a protein. Here, we refer to specific classes of constraint methods, while the use of constraints is not interpreted as an enhanced sampling in the community, although bonded, non-bonded, and angular constraints can contribute to 2–15-fold speed-ups of the MD and MC sampling.^{29,30} More broadly, constraining the internal degrees of motion will enable even much higher accelerations, such as in simulations of the Brownian motion of protein complexes or the extended capture of drug–host interactions.^{31} Langevin and Brownian dynamics in combination with constraints contain stochastic terms in the propagation scheme and can be applied for the simulation of protein-aggregation phenomena on timescales ranging from micro- to milli-seconds. Hybrid techniques use a combination of the stochastic nature of MC with the deterministic propagation in MD and have the potential of further improvements of the sampling efficiency. Hybrid MC approaches range from the hybrid MC methodologies^{32–35} to hybrid kinetic Monte Carlo/MD (kMC/MD)^{36–38} approaches that apply an event-driven acceleration. In particular, kMC/MD approaches allow for larger propagations along the time-axis dependent on the nature of the move, which has been shown in simulations of protein folding and protein signaling.^{39} Although the classes, groups, and sub-groups of methods we discuss here have proven their performance in simulations, a larger fraction of the methods applies biases or modifications in the energy space of the system, which leads to the occurrence of non-equilibrium states and the need to define appropriate reaction coordinates.^{40,41} Considerable efforts have been made in recent work to address the problem of adaptive definitions of collective variables, most notably using artificial intelligence (AI)-based techniques.^{42–46} The new approaches use projections to the principle components of the system or generalize the trajectory-dependent variables to a dimensionless space, which is suitable for supervised machine learning methods. Other approaches employ the Riemannian geometry for a suitable transformation of the trajectory space and the identification of collective variables. Although many of these approaches have been deemed successful for the systems to which they have been applied, the new techniques are mathematically complex, difficult to implement, or require large amounts of suitable training data. In contrast to these very complex methods, trajectory space enhanced sampling methods are based on simple conjectures, which resemble an alternative to very complex AI-driven approaches, while they remain computationally easy to handle and lead to reproducible results. In a recent work, we developed a correlation-dependent enhanced sampling method (CORE-MD I)^{47} that only depends on one single energy parameter and does not need *a priori* definitions of reaction coordinates. The CORE-MD I method relies on a projection of the complex energy landscape on the dimensionless space of the path-dependent correlation functions. The CORE-MD I method accelerates the sampling through the formulation of history-dependent correlation-dependent bias potentials and an additional statistical bias, which also depends on the autocorrelation function of the adaptive paths. We successfully validated the method on the folding of TrpCage and the conformational landscape of dialanine.

In this work, we developed and implemented a novel correlation-dependent MD method (CORE-MD II), which is parameter-free and does not require an *a priori* definition of reaction coordinates. Using a kinetic Monte Carlo (kMC) formalism, the method performs a statistical sampling between configurations that have been accessed previously. While the CORE-MD I methodology is based on a correlation-dependent probability, the CORE-MD II method partitions the simulation into sub-trajectories that we refer to as instances. In an adaptive way, a correlation-dependent rate is used as a sampling parameter for the selection of the instances *on the fly*. In contrast to CORE-MD I, the CORE-MD II method uses a path-dependent metadynamics formalism and a statistical bias to accelerate the MD sampling within each instance. The set of techniques that we apply in the CORE-MD II method allow for a faster and more accurate sampling of protein folding than in the CORE-MD I simulations. We validate the CORE-MD II method on the conformational landscape of dialanine and the folding of two peptides: TrpZip2 and TrpCage.^{48,49} In the simulations, we observe a good agreement with the experiment and long time equilibrium MD simulations. We find that the novel algorithm is capable of sampling the systems with a high level of convergence compared with equilibrium MD data, while the acceleration factors range from 20 to 120.

## II. METHODS

The CORE-MD II method uses a central path-definition and calculates a path-dependent autocorrelation function of the increments along the pathway. Using a correlation-dependent path-definition, a metadynamics algorithm samples the system along adaptive pathways. The CORE-MD II method partitions the total trajectory into instances with independent path-definitions and correlations. The states that are obtained after each instance are sampled using a hybrid kinetic Monte Carlo (kMC) algorithm. The CORE-MD II method can be understood as a non-equilibrium sampling strategy relying on an adaptive rate-dependent selection of short enhanced MD trajectories. The total trajectory consists then of a large number of adaptive and path-dependent non-equilibrium simulations. In these terms, the CORE-MD II method can be interpreted as an adaptive rate-dependent state-to-state dynamics sampling method between Markovian states.^{50–52} The hybrid kinetic Monte Carlo/MD (kMC/MD) method does not require any *a priori* information on reaction coordinates or collective variables and does not need additional input parameters.

### A. Theory

We start with the definition of the global probability *P*(*x*_{i}(*t*)) that can be defined over *N* time-slices or sub-trajectories *k* with length *τ*_{k}, which are described by local probability densities *ρ*_{k}(*x*_{i}(*t*)),

where *x*_{i}(*t*) stands for the coordinate of an atom with the index *i*. As a result, we divide the total trajectory into slices with periods of *τ*_{k}, where we observe local pathways $Lik(t)$ and local correlation functions $Cik(t)$ (see Fig. 1). In analogy, we express the partition of the total pathway into instances with the index *k*,

If we then consider the averaging process of a trajectory-dependent quantity *X*(*t*), the partition into small trajectories allows for a faster formation of time-averages than the determination of the expectation value of the complete trajectory, which is linked to the timescale problem of MD simulations. Therefore, the expectation value of the complete trajectory can be approximated as

which states that the partition of the complete trajectory into a finite number of *K* sub-trajectories is approximately sufficient for the sampling of the expectation value ⟨*X*(*t*)⟩. We define the number of configurations *K* by a minimal set of the number of atoms *N*_{a} in the system, which guarantees a fast forward propagation. In the following, we introduce the expressions for the fragmented pathway and the local correlation function. Within each sub-trajectory *k*, the local pathway is described by the reduced action $Lik(t)$,

where Δ*x*_{i}(*t*) = *x*_{i}(*t*) − ⟨*x*_{i}(*t*)⟩, *p*_{i}(*t*) is the momentum, and *t* is the time. The local path $Lik(t)$ is used to define the local autocorrelation function $Cik$,^{47}

where $Lik(t\u2032)$ is determined with a frequency equal to 1 ps^{−1} and ⟨⋯⟩ denotes the time-average. In our implementation, we define a period *τ*_{k}(*t*) that separates each individual instance *k* from the preceding instance [see Fig. 2, where we show an example of the index *k* and the correlation function $Cik(t)$ as a function of time in a CORE-MD II simulation of dialanine]. The CORE-MD II algorithm samples the system along a correlation-dependent probability between states with an index *k* using a kinetic Monte Carlo (kMC) algorithm. We limit the number of kMC configurations by a minimal set of the number of atoms *N*_{a} in the system, which guarantees a fast forward propagation within a small window of possible selections in each kMC-step. With a frequency of $\tau k\u22121$, we perform a kMC-step and express a rate *r*_{k} for each instance *k* as

where *ν* is a frequency factor (we apply *ν* = *N*_{a}*τ*_{0}, where *N*_{a} is the number of atoms and *τ*_{0} is the minimal period equal to 10 ps^{−1}, which is a relation that connects to the friction terms in the prefactors of Kramer’s rate theory^{53,54}), Δ*E*_{k}(*t*) = *E*_{k}(*t*) − *E*_{k−1}(*t*), and *E*_{k}(*t*) is the energy for the instance *k*,

where $Epotk(t)$ stands for the potential energy in the instance *k*, *V*_{k}(*t*) is the bias potential, and $\u03f5=1RT$, where $R=8.314JKmol$ and *T* stands for the temperature. We then define the period *τ*_{k} as

which is the timescale for the instance *k*. We then calculate the cumulative rates $Rk(t)=\u2211j=1krj(t)$ and $RN(t)=\u2211j=1Nrj(t)$ and apply the kinetic Monte Carlo algorithm^{36,55,56} for the selection of a configuration *k* with which a configuration is used for the subsequent trajectory instance,

where *ξ* stands for a random number ranging from 0 to 1. The kMC sampling guides the trajectory between equilibrium configurations of the system, where each instance *k* resembles a state that resides close to the equilibrium state. We continue with the description of the second component of the CORE-MD II algorithm that applies the local biases. (1) At each initialization of a new trajectory-fragment, the velocities are selected from a random distribution. (2) In order to accelerate the sampling within each instance, we apply a history-dependent bias potential $Vik(t)$ that is related to metadynamics,^{22} while the history dependency is limited by the timescale of each instance. We consider the fragmented pathway that depends on the correlation function and express that the expectation value of a reaction coordinate $\u27e8\sigma ik(t)\u27e9$ equals the sum over all local path-increments of individual instances *k*,

where *β*_{k} is a normalization factor ranging from 0 to 1. Therefore, we define a bias potential consisting of an accumulation of Gaussian functions along a collective variable $\sigma ik(t)$ that we define through the correlated path,^{57}

which we normalize by the maximal correlated path occurring within each instance. The history-dependent potential $Vik(t)$ is accumulated with a frequency of 1 ps^{−1},

where W is the height of the Gaussian, which we determine using $W=10kBT1ps\tau k(t)$, and *δσ* is the width of the Gaussians, which we set equal to the width of 2 bins in the histogram (consisting of 100 bins in our implementation). We add the Gaussians to the history-dependent potential using the well-tempered metadynamics technique through a normalization of the added Gaussians by the factor $exp(\u2212Vik(t)/\Delta T)$, where we apply Δ*T* = 1000 kJ/mol.^{58} The corresponding bias $\u2202\u2202\sigma ik(t)Vik(t)$ is added throughout the simulation. Finally, we accelerate the sampling within each instance and apply the statistical bias as described in our recent work on the CORE-MD algorithm.^{47} We implemented the correlation-dependent bias by a factorization with the variable $\lambda ik(t)$ with which we scale the gradient of all atoms in the system. The factor $\lambda ik(t)$ is described by

This statistical bias enhances the decay of the correlation function and accelerates the access of new states by the system.^{47}

### B. Algorithm

Calculate the path using expression (4) for each time step and evaluate the correlation function given in (5).

Accumulate the history-dependent potential $Vik(t)$ as described in Eq. (12) and apply the statistical bias from (13).

With a frequency of $\tau k\u22121$, apply the kinetic Monte Carlo formalism and re-initialize the correlation function:

Using the kMC formalism described in the expression (9), select a new configuration and re-initialize a new sub-trajectory, while the path-dependent quantities and the bias potential $Vik(t)$ are set to values equal zero. Assign the present configuration to the array of configurations and determine the rate

*r*_{k}.

### C. Simulation parameters and system setup

For the simulations and parts of the trajectory analysis, we used the GROMACS-4.5.5 simulation package.^{59} We implemented the CORE-MD II method into the same package. In all simulations, we used the AMBER99SB forcefield^{60,61} and the generalized Born implicit solvent model using the Still algorithm with a continuum dielectric constant equal to 80.^{62} The electrostatics and van der Waals interactions were treated using the twin-range cutoff equal to 1.0/1.2 nm with a neighborlist cutoff equal to 1.0 nm. The neighborlist was updated every integration step. We applied the Nosé–Hoover thermostat with a coupling time of *τ*_{T} = 1.0 ps and a target temperature of 300 K. We modeled the starting structures in an extended conformation using the ribosome code that we downloaded from *http://folding.chemistry.msstate.edu/raj/Manuals/ribosome.html*. In validation simulations, we modeled dialanine (Ace-Ala-NMe), tryptophan cage minipeptide (TrpCage) (NLYIQWLKDGGPSSGRPPPS),^{49} and tryptophan zypper peptide 2 (TrpZip2) (SWTWENGKWTWKX).^{48} We capped both peptides N- and C-terminal with an acetyl- and a methyl-group. For the comparison of the CORE-MD II results, we applied the CORE-MD I algorithm in simulations of TrpCage, TrpZip2, and dialanine using a parameter *α* = 1.0 kJ/mol.^{47} For the generation of equilibrium MD data, we ran an equilibrium MD simulation over 2 *µ*s of dialanine and parallel tempering replica exchange MD (REMD) simulation of TrpCage and TrpZip2 over 1 *µ*s using 24 replicas in the NVT ensemble within a temperature range from 300 to 396 K.^{63} An exchange of configurations was attempted every 1000 integration steps in the conventional REMD simulation. We ran a total simulation time of 200 ns for dialanine using CORE-MD I/II and simulated the TrpCage and TrpZip2 minipeptides over 200 ns using the same algorithms. For the calculation of the root-mean square deviation of the backbone atoms C*α* to the native structure (*RMSD*_{Cα−Cα}), we used the NMR-model No. 1 from the protein data bank (PDB) structure: 1l2y (TrpCage).^{49} In parts, the trajectory analysis was performed using in-house programs. For the determination of the free energies Δ*F*, we used

where *k*_{B} is Boltzmann’s constant, *T* is the temperature, *P* is the probability, and *P*_{min} is the minimal non-zero reference value of the same function. We determined the level of convergence ⟨ΔΔ*G*⟩ by the average difference value of the free energies from equilibrium MD and the CORE-MD I/II simulations. We analyzed the frequency *ν* of transitions of the Φ-angle of dialanine from values lower than zero to positive values through counting the numbers of transitions *N*_{Φ} from negative to positive values and normalizing by the total number of frames *N*_{t},

We clustered the structures using *RMSD*_{Cα−Cα} to the native structure. We applied an RMSD threshold of ∼0.1 nm for the clustering of the structures. For TrpCage, we coarse-grained the total configuration space into eight different clusters, while we divided the conformation space of TrpZip2 into seven clusters. We define the acceleration time by the approximate central processing unit (CPU)-time that is required to sample the identical free energy partition in relation to the CPU-time in conventional MD and REMD simulations.

### D. Program

The CORE-MD II simulation code is implemented into the GROMACS-4.5.5 simulation package. The code is available at www.github.com/epeter455/.

## III. RESULTS AND DISCUSSION

### A. Simulations of dialanine

We validated the CORE-MD II algorithm on the dialanine system and compared our results with long time equilibrium MD and with results from a CORE-MD I simulation. In Fig. 3, we show the free energy landscapes as a function of dihedral angles Φ and Ψ (FEL_{Φ−Ψ}) and our results related to the dynamical behavior of dialanine. In the following, we define the regions along FEL_{Φ−Ψ} as follows: C7_{eq} (−180^{°} < Φ < −160^{°}, 135^{°} < Ψ < 170^{°}) [panels (1) and (2)], *α* (−110^{°} < Φ < −60^{°}, −50^{°} < Ψ < 5^{°}) [panel (5)], and C7_{ax} (40^{°} < Φ < 70^{°}, −5^{°} < Ψ < 40^{°}) [panel (4)]. We define the interfacial regions *C*7_{eq} ≬ *C*7_{ax} [panel (3)] (−20^{°} < Φ < 30^{°}, Ψ ≈ 90^{°}), *α* ≬ *C*7_{ax} [panel (6)] (−30^{°} < Φ < 30^{°}, Ψ ≈ −80^{°}), and *α* ≬ *C*7_{eq} [panel (7)] (−60^{°} < Φ < −90^{°}, Ψ ≈ −80^{°}). In the FEL_{Φ−Ψ} averaged over 2 *µ*s equilibrium MD, we find major minima at the C7_{eq} position [(1) and (2)] (−11 to −12 *k*_{B}*T*) at the region *α* (5) (−11 to −12 *k*_{B}*T*) and the C7_{ax} position (4) (−6 to −8 *k*_{B}*T*) [see Fig. 3(a)]. The histogram of the CORE-MD I simulation is widened at the interfacial regions [(6), (7), and (3)] by ∼10° to 20°, while *C*7_{eq} and the *α*-positions are approximately identical to the equilibrium MD result at −11 to −12 *k*_{B}*T* [see Figs. 3(e) and 3(g)]. We find a larger deviation from the equilibrium MD result at the *C*7_{ax} position (4), where the minimum is widened and the regions for values of Ψ above and below position (4) are populated with approximately −8 *k*_{B}*T*. In the CORE-MD II result, we find populations at (6), (7), and (3), with energy values ranging from −11 to −12 *k*_{B}*T*. At the C7_{ax} position (4), we observe energy values of −6 *k*_{B}*T*, which are approximately equivalent to the equilibrium MD result [see Figs. 3(b) and 3(g)]. We then measured the free energy differences ΔΔ*G*_{Φ−Ψ} between the simulations using CORE-MD I/II and the equilibrium MD result. In a comparison between CORE-MD I and II, we find that the minima at the *C*7_{eq} position (1) and (2) are shifted only in the CORE-MD I result, where we find a positive shift of ∼0.2 *k*_{B}*T* [see Figs. 3(c), 3(f), and 3(g)]. At that position, the CORE-MD II result agrees very well with equilibrium MD. We observe an identical behavior for the *α*-position (5), while we find the largest differences at the interfaces at (3), (6), and (7), where we observe deviations of up to −3.5 *k*_{B}*T* in the CORE-MD I simulation. In contrast, the CORE-MD II result shows almost no deviation. At the *C*7_{ax} position (4), we observe that the CORE-MD I result is up to 0.5–1 *k*_{B}*T* lower compared with equilibrium MD. The CORE-MD II simulation result shows approximately identical energy values in the sampling of the *C*7_{ax} position (4). We then looked at the average deviation in energy ⟨ΔΔ*G*_{Φ−Ψ}⟩, where we find a value of −0.67 *k*_{B}*T* for CORE-MD I and −0.16 *k*_{B}*T* for the CORE-MD II simulation. We further investigated the comparatively strong improvement in the CORE-MD II simulation and measured the autocorrelation function of the Φ-angle *C*_{Φ}(*t*) for all simulations [see Figs. 4(a), 4(b), and 4(g)]. In the equilibrium MD simulation, the value of *C*_{Φ}(*t*) decays from 1 to a value of 0.7 and resides at this value up to a lag time of 1 *µ*s. The correlation function *C*_{Φ}(*t*) of the CORE-MD I simulation indicates a less correlated behavior with average values of *C*_{Φ}(*t*) in the range from 0.6 to 0.65. In contrast, the CORE-MD II simulation shows a higher Φ-correlation, where *C*_{Φ}(*t*) remains at values ranging from 0.72 to 0.73. This indicates that the CORE-MD II approach enhances correlation effects within dialanine, in contrast to the CORE-MD I technique. If we consider the average correlation ⟨*C*_{Φ}(*t*)⟩ and compare the different approaches, the correlation behavior of the CORE-MD II simulation agrees better with equilibrium MD than the CORE-MD I simulation. CORE-MD I yields the highest transition rate of the Φ-angle with *ν* = 4 × 10^{−4} ps^{−1}, while the transition rates for CORE-MD II reside at ∼2 × 10^{−4} ps^{−1} [see Fig. 3(d)]. In a comparison with the equilibrium MD result, we observe an acceleration factor of ∼20 for the CORE-MD II algorithm, while the level of convergence of the free energy landscapes is 4.2 times higher than the CORE-MD I algorithm.

In the validation simulations on dialanine, we compared 2 *µ*s equilibrium MD with CORE-MD I/II simulations. The CORE-MD II REMD simulation shows an optimal sampling behavior with a level of convergence that is 4.2 higher than in the CORE-MD I simulation, while we observe an acceleration factor of ∼20. The differences between the CORE-MD II and CORE-MD I sampling are given by the separation into instances *k*, the kMC sampling between the instances, a re-evaluation of the correlation function $Cik(t)$ at each instance, and the flexible biasing expression (see Fig. 2). In general, the CORE-MD II method samples the free energy landscapes with higher internal correlations in the system than CORE-MD I, where the system is strongly decorrelated. From that correlation behavior, we deduce that the separation of the system into sub-instances *k* in CORE-MD II yields a more realistic description of the underlying collective variables that guide the system along its reaction pathways.

### B. TrpZip2 folding

As a first validation example, we performed 1 *µ*s REMD simulations with 24 replicas and enhanced MD simulations of TrpZip2 using CORE-MD I and II (see Fig. 5). We first analyzed *RMSD*_{Cα−Cα} to the native structure as a function of simulation time. In the REMD simulation, we observe a fast drop of the RMSD-value from 1.2 nm to *RMSD*_{Cα−Cα} of 0.14 nm within the first 1 ps. In this simulation, *RMSD*_{Cα−Cα} migrates between extremum values of 0.8 and 0.14 nm with population maxima within 0.3–0.42 and 0.42–0.6 nm [see Fig. 5(a)]. The REMD sampling does not form conformations with *RMSD*_{Cα−Cα} values below 0.22 nm, which are only accessed within rare-event fluctuations. The CORE-MD II result on TrpZip2 shows a different behavior [see Fig. 5(b)]. The CORE-MD II simulation forms a hairpin structure within the first 10 ps that remains stable for 13 ns (*RMSD*_{Cα−Cα} ≈ 0.22 nm). The event of the first collapse is followed by a re-opening of the hairpin and a re-orientation of the Trp-sidechains as we find through a visual analysis of the conformers. This reopening is followed by a subsequent collapse at 30 ns, where the hairpin remains stable over the next 5 ns. The remaining CORE-MD II simulation follows the order of hairpin opening and a subsequent closure within a hydrophobic collapse, where the Trp-sidechains reorient. The CORE-MD II simulation result on TrpZip2 is the only trajectory in which we observe the formation of a stable hydrophobic core consisting of four stacked Trp-sidechains^{48} [conformation (1), see Fig. 5(j)]. *RMSD*_{Cα−Cα} in the CORE-MD I simulation shows a similar fluctuation behavior as the REMD simulation on TrpZip2. Although the fluctuations in the CORE-MD I simulation are the strongest of all three different simulations, the peptide accesses RMSD-values below 0.22 nm only in rare fluctuations, while TrpZip2 mainly resides around 0.3–0.6 nm [see Fig. 5(c)]. Next, we analyzed the free energy landscapes (FEL) as a function of *RMSD*_{Cα−Cα} and the radius of gyration *R*_{g} for each of the three applied techniques [see Figs. 5(d)–5(f)]. The FEL averaged over 1 *µ*s REMD results in a population ranging from 0.13 < *RMSD*_{Cα−Cα} < 0.8 nm and 0.56 < *R*_{g} < 0.92 nm. We observe minor populations ranging from 0 to −2 *k*_{B}*T* in the range 0.13 < *RMSD*_{Cα−Cα} < 0.24 nm, 0.56 < *R*_{g} < 0.8 nm and 0.7 < *RMSD*_{Cα−Cα} < 0.8 nm, 0.65 < *R*_{g} < 0.8 nm. We find higher populations with energies from −2 to −6 *k*_{B}*T* in the range 0.24 < *RMSD*_{Cα−Cα} < 0.4 nm [conformations (2) and (3)], 0.56 < *R*_{g} < 0.8 nm and 0.6 < *RMSD*_{Cα−Cα} < 0.7 nm, 0.65 < *R*_{g} < 0.8 nm [conformations (4) and (5)] [see Fig. 5(j)]. We locate the maximal population within the range 0.4 < *RMSD*_{Cα−Cα} < 0.6 nm, 0.56 < *R*_{g} < 0.8 nm, where the energy ranges from −8 to −9 *k*_{B}*T*. In contrast to CORE-MD I and the 1 *µ*s REMD simulation, the FEL in the CORE-MD II simulation contains the only minimum in the collapsed state with a stable hydrophobic core [see Fig. 5(e), conformation (1)]. In the CORE-MD II FEL, we observe minor populations ranging from 0 to −4 *k*_{B}*T* in the range 0.11 < *RMSD*_{Cα−Cα} < 0.2 nm, 0.56 < *R*_{g} < 0.8 nm and 0.7 < *RMSD*_{Cα−Cα} < 0.9 nm, 0.9 < *R*_{g} < 1.05 nm. We find higher populations with energies from −2 to −6 *k*_{B}*T* in the range 0.6 < *RMSD*_{Cα−Cα} < 0.7 nm and 0.65< *R*_{g} < 0.8 nm. We locate two maxima in the population within the range 0.2 < *RMSD*_{Cα−Cα} < 0.25 nm, 0.56 < *R*_{g} < 0.8 nm corresponding to the collapsed native state, and a near native state at 0.4 < *RMSD*_{Cα−Cα} < 0.6 nm, 0.56 < *R*_{g} < 0.8 nm, where the energy ranges from −8 to −9 *k*_{B}*T*. The CORE-MD I result shows the widest population range and no energy minimum corresponding to the native state [see Fig. 5(f)]. In that FEL averaged over the CORE-MD I simulation, we find minor populations with energies ranging from 0 to −2 *k*_{B}*T* within the range 0.13 < *RMSD*_{Cα−Cα} < 0.22 nm, 0.6 < *R*_{g} < 0.8 nm and 0.8 < *RMSD*_{Cα−Cα} < 1 nm, 0.9 < *R*_{g} < 1.1 nm. In that validation example, the population rises toward a main maximum in the range 0.4 < *RMSD*_{Cα−Cα} < 0.6 nm and 0.65 < *R*_{g} < 0.8 nm. We then calculated the relative differences in the free energies between the CORE-MD I/II results and the 1 *µ*s REMD simulation of TrpZip2 [see Figs. 5(g)–5(i)]. For the CORE-MD II result, the differences ΔΔ*G* in the populations range from ∼5 to −2.5 *k*_{B}*T*, where the largest differences can be found for radii of gyration *R*_{g} below 0.65 nm and above 0.87 nm. In the center of the FEL, we find a good agreement of the CORE-MD II simulation and the REMD result. The CORE-MD I result differs only slightly from the CORE-MD II difference plot [see Fig. 5(i)], where we find approximately the same difference pattern. With −0.038 *k*_{B}*T*, the average deviation ⟨ΔΔ*G*⟩ of the CORE-MD II result shows that the CORE-MD II method is 1.4 times more precise than the CORE-MD I sampling with an average deviation of −0.052 *k*_{B}*T* [see Fig. 5(g)]. In a final analysis, we performed a *RMSD*_{Cα−Cα}-dependent clustering of the conformations in the CORE-MD II simulation [see Fig. 5(j)]. In contrast to the REMD and the CORE-MD I simulation, CORE-MD II samples the formation of the native state with a correct arrangement of the Trp-residues. As a general observation, we find that the folding pathways in the CORE-MD II simulation follow a first hydrophobic collapse, after which a re-ordering of the Trp-sidechain occurs leading to the final formation of the native fold. The CORE-MD II simulation shows that only the near native states can access the native state of TrpZip2, while coiled conformers have to pass through the collapsed state of that peptide toward folding events with a correct hydrophobic core. These results agree with prior simulation studies and the observation of a hydrophobic collapse mechanism of folding of TrpZip2.^{64–68} The approximate acceleration factor of the CORE-MD II method compared with the 1 *µ*s REMD simulation is ∼120, while the REMD simulation did not sample the formation of the native fold correctly.

### C. TrpCage folding

As a second validation example, we performed 1 *µ*s REMD simulations with 24 replicas and enhanced MD simulations of TrpCage using CORE-MD I and II (see Fig. 6). In a first analysis, we measured *RMSD*_{Cα−Cα} to the native structure as a function of simulation time. In the REMD simulation, the RMSD-value decreases from 1.6 nm to *RMSD*_{Cα−Cα} of 0.2 nm within the first 50 ns. *RMSD*_{Cα−Cα} fluctuates between extremum values of 0.7 and 0.13 nm with a population maximum at 0.2–0.32 nm [see Fig. 5(a)]. The CORE-MD II result on TrpCage shows an almost identical behavior [see Fig. 6(b)]. The CORE-MD II simulation forms a helical structure within the first 1–2 ns that remains stable for 9.5 ns (*RMSD*_{Cα−Cα} ≈ 0.22 nm). The event of the first collapse is followed by a re-opening of the native fold and a re-organization of the sidechain involving Trp6 and Tyr3 as we find through a visual analysis of the conformers. This reopening is followed by a subsequent collapse at 25 ns, where the peptide remains stable over the next 56 ns. The remaining CORE-MD II simulation follows the order of opening and a subsequent closure of the PPII helix and the N-terminal *α*-helix within a hydrophobic collapse. The CORE-MD II simulation result on TrpCage differs from the CORE-MD I result, where we do not observe the formation of a stable hydrophobic core [see Fig. 6(c)]. *RMSD*_{Cα−Cα} in the CORE-MD I simulation shows the strongest fluctuations in contrast to the CORE-MD II simulation. As the fluctuations in the CORE-MD I simulation are the strongest of all three different simulations, the peptide accesses RMSD-values below 0.22 nm only through rare-event fluctuations, while TrpCage mainly resides around 0.3–0.6 nm [see Fig. 6(c)]. Next, we analyzed the free energy landscapes (FEL) as a function of *RMSD*_{Cα−Cα} and the radius of gyration *R*_{g} for each of the three applied techniques [see Figs. 6(d)–6(f)]. The FEL averaged over 1 *µ*s REMD results in a population ranging from 0.13 < *RMSD*_{Cα−Cα} < 0.8 nm and 0.63 < *R*_{g} < 0.92 nm. We observe minor populations ranging from 0 to −3 *k*_{B}*T* in the range 0.13 < *RMSD*_{Cα−Cα} < 0.17 nm, 0.63 < *R*_{g} < 0.85 nm and 0.7 < *RMSD*_{Cα−Cα} < 0.93 nm, 0.63 < *R*_{g} < 0.8 nm. We find higher populations with energies from −3 to −8 *k*_{B}*T* in the range 0.17 < *RMSD*_{Cα−Cα} < 0.2 nm, 0.63 < *R*_{g} < 0.85 nm and 0.4 < *RMSD*_{Cα−Cα} < 0.93 nm, 0.65 < *R*_{g} < 0.8 nm. We locate the maximal population within the range 0.2 < *RMSD*_{Cα−Cα} < 0.3 nm, 0.63 < *R*_{g} < 0.8 nm, where the energy ranges from −9 to −10.5 *k*_{B}*T*. In agreement with the 1 *µ*s REMD simulation, the FEL in the CORE-MD II simulation contains a single minimum in the collapsed state with a stable hydrophobic core [see Fig. 6(e)]. In the CORE-MD II FEL, we observe minor populations ranging from 0 to −4 *k*_{B}*T* in the range 0.11 < *RMSD*_{Cα−Cα} < 0.2 nm, 0.63 < *R*_{g} < 0.9 nm and 0.6 < *RMSD*_{Cα−Cα} < 0.9 nm, 0.9 < *R*_{g} < 1.1 nm. We find higher populations with energies from −2 to −6 *k*_{B}*T* in the range 0.5 < *RMSD*_{Cα−Cα} < 0.8 nm and 0.65< *R*_{g} < 0.9 nm. We locate the maximum in the population within the range 0.19 < *RMSD*_{Cα−Cα} < 0.42 nm and 0.63 < *R*_{g} < 0.7 nm corresponding to the collapsed native state and a near native state, where the energy ranges from −8 to −8.5 *k*_{B}*T*. The CORE-MD I result shows the widest population range and an energy minimum corresponding to the native state [see Fig. 6(f)]. In that FEL averaged over the CORE-MD I simulation, we find minor populations with energies ranging from 0 to −2 *k*_{B}*T* within the range 0.11 < *RMSD*_{Cα−Cα} < 0.24 nm, 0.63 < *R*_{g} < 0.9 nm and 0.6 < *RMSD*_{Cα−Cα} < 1 nm, 0.9 < *R*_{g} < 1.1 nm. In that validation example, the population rises toward a main maximum in the range 0.22 < *RMSD*_{Cα−Cα} < 0.4 nm and 0.65 < *R*_{g} < 0.8 nm. We then calculated the relative differences in the free energies between the CORE-MD I/II results and the 1 *µ*s REMD simulation of TrpCage [see Figs. 6(g)–6(i)]. For the CORE-MD II result, the differences ΔΔ*G* in the populations range from ∼5 to −5 *k*_{B}*T*, where the largest differences can be found for radii of gyration *R*_{g} below 0.8 nm and above 0.9 nm. In the center of the FEL, we find a good agreement of the CORE-MD II simulation and the REMD result. The CORE-MD I result differs strongly from the CORE-MD II difference plot [see Fig. 6(i)], where we observe that the CORE-MD I result shows ΔΔ*G* values of up to −10 *k*_{B}*T* at *R*_{g} values of 0.83 nm. With −0.063 *k*_{B}*T*, the average deviation ⟨ΔΔ*G*⟩ of the CORE-MD II result shows that the CORE-MD II method is 8.8 times more precise than the CORE-MD I sampling with an average deviation of −0.53 *k*_{B}*T* [see Fig. 6(g)]. In a final analysis, we performed a *RMSD*_{Cα−Cα}-dependent clustering of the conformations in the CORE-MD II simulation [see Fig. 6(j)]. As a general observation, we find that the folding pathways in the CORE-MD II simulation follow a first hydrophobic collapse, after which a re-ordering of the N-terminal and the 3_{10}-helical segment occurs leading to the final formation of the native fold [conformations (4) and (5)], while the PPII helical element does not perform a strong internal restructuring. The CORE-MD II simulation shows that only the near native states can access the native state of TrpCage [conformations (2) and (3)], while opened conformers have to pass through the collapsed state of that peptide toward folding events with a correct hydrophobic core [(conformation (1)]. The folding pathways observed in the CORE-MD II validation simulation is dominated by the formation of the secondary structure and an internal reorganization toward the formation of the native fold.^{36,69} The CORE-MD I technique samples the formation of the native state correctly, while the fluctuations in the CORE-MD I simulation lead to a stronger relative deviation from the native state as in the CORE-MD II simulation. Related to the total convergence to the folded state, the CORE-MD II algorithm shows an 8.8 times higher convergence than the CORE-MD I technique. The approximate acceleration factor of the CORE-MD II method compared with the 1 *µ*s REMD simulation is ∼120. Our results are in agreement with our previous findings and other theoretical studies.^{36,69–78}

## IV. CONCLUSIONS

In this paper, we presented a fast and adaptive correlation guided enhanced sampling MD method (CORE-MD II) that raises the performance of the CORE-MD I methodology. The CORE-MD II technique applies a partition of the total pathway into short trajectories that we refer to as instances. Within each instance, the CORE-MD II technique samples independent states using adaptive path-dependent metadynamics. Using a detailed balance criterion, the technique applies a kinetic Monte Carlo (kMC) sampling between the different states that have been accessed in the individual instances. Through the combination of the partition of the total simulation into short non-equilibrium simulations and the kMC sampling, the CORE-MD II method is capable of sampling protein folding in an adaptive and non-parameter-dependent way. In contrast to the CORE-MD I method, the CORE-MD II technique considers the local heterogeneity of correlation patterns and reaction pathways, while the CORE-MD I method applies the global correlation function and the associated probability function. Compared to the CORE-MD I technique, the combination of short path-dependent metadynamics simulations and the kMC sampling of the instances leads to an improvement in the accuracy and the performance of the CORE-MD II algorithm. We applied the CORE-MD II state-to-state dynamics on the dialanine peptide and the folding of two peptides: TrpCage and TrpZip2. In a comparison with long time equilibrium MD and 1 *µ*s REMD simulations, we find that the level of convergence of the CORE-MD II method is up to 8.8 times higher than that of the CORE-MD I method, while the CORE-MD II method reaches acceleration factors of ∼120.

## ACKNOWLEDGMENTS

This work was supported by the Deutsche Forschungsgemeinschaft (Grant No. MA1081/23-1) (to D.J.M.). D.J.M. is a member of the Cluster of Excellence RESIST (Grant No. EXC 2155) with support from the DFG (Project No. 39 087 428-B11) and of the ERA-Net Joint Project on Rare Diseases Consortium “PredACTINg” with support from the German Federal Ministry of Education and Research under Grant Agreement No. 01 GM1922B. J.E.S. acknowledges support through Grant No. NSF MCB-1716956. Calculations have been carried out at the JUWELS Cluster at the Forschungszentrum Juelich, the Lise Cluster of the Norddeutscher Verbund für Hoch-und Höchstleistungsrechnen (HLRN-IV) (Project No. nib00016), and Nodes of the European Life-science Infrastructure for Biological Information.

The authors have no conflicts to disclose.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.