Several enhanced sampling techniques rely on the definition of collective variables to effectively explore free energy landscapes. The existing variables that describe the progression along a reactive pathway offer an elegant solution but face a number of limitations. In this paper, we address these challenges by introducing a new path-like collective variable called the “deep-locally non-linear-embedding” (DeepLNE), which is inspired by principles of the locally linear embedding technique and is trained on a reactive trajectory. The variable mimics the ideal reaction coordinate by automatically generating a non-linear combination of features through a differentiable generalized autoencoder that combines a neural network with a continuous k-nearest neighbor selection. Among the key advantages of this method is its capability to automatically choose the metric for searching neighbors and to learn the path from state A to state B without the need to handpick landmarks *a priori*. We demonstrate the effectiveness of DeepLNE by showing that the progression along the path variable closely approximates the ideal reaction coordinate in toy models, such as the Müller-Brown potential and alanine dipeptide. Then, we use it in the molecular dynamics simulations of an RNA tetraloop, where we highlight its capability to accelerate transitions and estimate the free energy of folding.

## I. INTRODUCTION

Atomistic molecular dynamics (MD) simulations have proven to be a powerful tool for investigating intricate aspects of physical, chemical, and biological systems.^{1} The power of MD investigations has increased with advances in computational resources and refinement of force field accuracy. Still, many phenomena of importance exhibit timescales well beyond the reach of conventional unbiased MD, even when used on state-of-the-art supercomputing platforms. In response to this temporal constraint, a variety of *enhanced sampling* algorithms have been developed over time to augment the sampling methods and reconstruct the corresponding free energy landscapes.^{2–6} Two of the most widely used families of enhanced sampling algorithms are based on the definition of a set of *collective variables* (CVs) that approximate the reaction coordinate^{7} or on *paths* that connect two end states of the process of interest.^{8} Each of these families of methods has their strengths and limitations.

The CV-based algorithms crucially depend on the identification of slow degrees of freedom that capture the transition under investigation and are typically limited to biasing a maximum of three variables at the same time.^{9,10} Path-based methods can collect a rather large number of degrees of freedom into a feature space to be used in the path-building process. However, the effectiveness of this process generally depends on a proper definition of the end states and the effectiveness of the path search algorithm. The original *path-collective variables* (PATHCVs) combine the various aspects of the two families of approaches in an effective manner.^{11} They are typically used in combination with an enhanced sampling algorithm, and thus, they can be used to explore paths that are separated by a high free energy barrier from the initial (guess) path. Because of their efficiency in exploring path space, PATHCVs have been successfully used to reconstruct the free energy profiles associated with many complex systems.^{12–16} In addition, their combination with partial path transition interface sampling was shown to produce accurate numerical prediction of rate constants.^{17} Recent developments include approaches where the path is updated adaptively^{18} or where multiple paths are taken into account simultaneously.^{19} Such approaches make the construction of the path more robust to user input, but at the same time, more costly because of the path optimization.

Still, the effectiveness of PATHCVs depends on the choice of the metric used to define the path, as well as on the number and position of the milestones and other parameters. We have recently explored a machine learning approach based on a spectral gap optimization procedure to find optimal linear combinations of different features for the metric of PATHCVs.^{20} However, the approach is limited by the fact that a fixed linear combination of different features does not always provide an optimal metric for all the milestones defining complex paths. A typical example of this is provided, for instance, by the binding of a ligand to a protein. In such a case, the role of ligand hydration and dehydration is crucial at a given milestone (when the ligand enters or exits the cavity), while it is less relevant in the remaining path (corresponding to the solvated ligand being free in solution or finding its way to the cavity). This was clearly shown by quantifying the relevance of various descriptors, including ligand hydration, at different stages of ligand binding.^{21} In that context, it was shown that a CV that combines a number of descriptors in a non-linear way^{22} is able to capture the relevant slow degrees of freedom (including hydration) at the right time. Thus, using a fixed metric to define the distance from the milestones defining the path can result in a sub-optimal performance. In recent years, several methods have been proposed to learn CVs directly from data, using either supervised or unsupervised methods.^{22–36} Some of the learned variables were based on atomic representation vectors, e.g., using variational autoencoders and applied to complex reactions.^{37} Others were shown to be close to paths connecting metastable states,^{38,39} without this condition being explicitly imposed. In addition, a data-driven variant of PATHCV based on kernel ridge regression has been proposed, in which the committor probability is approximated.^{40}

In this paper, we use machine learning (ML) techniques to overhaul the path CV approach and overcome the limitations associated with previous implementations. The resulting variable, called deep-locally non-linear-embedding (DeepLNE), offers a path-like behavior when trained on a time series or synthetic data describing a path connecting A to B in phase space. It offers an automatic procedure for building an optimal 1D description of the training data, inspired by the formalism of locally linear embedding^{41} (LLE) and the PATHCV. Essential elements of the algorithm are a dimensionality reduction operated by an artificial neural network (ANN), a differentiable k-nearest neighbor (k-NN) selection, and an encoding of the neighborhood into a 1D latent space via an ANN. Starting from a predefined feature set describing the system dynamics of interest, the DeepLNE algorithm creates a representation of each datapoint based exclusively on its neighbors in the training dataset. This representation is then transformed into a path-like CV locally anchored by the data provided during the training phase.

We test the DeepLNE CV on the simulations of a particle in the Müller-Brown potential, alanine dipeptide, and an RNA 8-mer using different variants of the on-the-fly probability enhanced sampling^{42–45} (OPES) method. We show that the application of DeepLNE is successful not only in the simplest examples but also in the cases of high-dimensional inputs or the combination of a set of heterogeneous features into an optimal path-like CV.

## II. METHODS

Prior to introducing the DeepLNE method, we provide a brief overview of the PATHCVs and LLE frameworks, highlighting the areas of inspiration for our work.

### A. The path collective variables

In the PATHCV framework,^{11} one encodes a path connecting two end states A and B into a progress along the path CV *s* and a distance from the path *z*. Given a *D*-dimensional feature space **X**, one can express it parametrically as **X**(*θ*) along the path so that **X**(0) = **X**_{A} and **X**(1) = **X**_{B}. Formally, the PATHCVs *s* and *z* are two functions of **X**(*θ*), which are defined as

$$s(\mathbf{X}) = \lim_{\lambda \to \infty} \frac{\int_0^1 \theta \, e^{-\lambda (\mathbf{X} - \mathbf{X}(\theta))^2} \, d\theta}{\int_0^1 e^{-\lambda (\mathbf{X} - \mathbf{X}(\theta))^2} \, d\theta}, \qquad (1)$$

$$z(\mathbf{X}) = \lim_{\lambda \to \infty} \left[ -\frac{1}{\lambda} \log \int_0^1 e^{-\lambda (\mathbf{X} - \mathbf{X}(\theta))^2} \, d\theta \right]. \qquad (2)$$

In practice, this PATHCV formulation cannot be employed directly to deposit bias during simulations because of the crucial requirement of a CV to be smooth and differentiable. One has to resort to a number of assumptions and approximations. First, the parametric path is replaced by a discretized version *X*_{i} constructed on a series of *m* path milestones, such that the integrals transform into finite sums. Choosing a number of high-quality milestones that are equidistant in **X** is a non-trivial task that requires empirical optimization.

Another challenge is the choice of *λ*. Here, *λ* → ∞ leads to a step-wise path dominated by the contributions for which $(\mathbf{X} - \mathbf{X}_i)^2$ is minimal and whose derivative is ill-behaved. A finite optimal *λ* strikes a balance between high values corresponding to sharp PATHCVs and low values that sacrifice resolution in favor of differentiability.
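As an illustration, the discretized *s* and *z* variables can be sketched in a few lines of NumPy. This is a minimal sketch, not the PLUMED implementation; the function and argument names are ours.

```python
import numpy as np

def path_cvs(X, milestones, lam):
    """Discretized PATHCVs: progress s and distance z from m milestones.

    X          -- (D,) feature vector of the current configuration
    milestones -- (m, D) array of path milestones X_i
    lam        -- smoothing parameter lambda
    """
    d2 = np.sum((X - milestones) ** 2, axis=1)   # squared distances to milestones
    w = np.exp(-lam * d2)                        # Boltzmann-like milestone weights
    m = len(milestones)
    i = np.arange(m)                             # milestone indices 0..m-1
    s = np.sum(i * w) / ((m - 1) * np.sum(w))    # progress along the path, in [0, 1]
    z = -np.log(np.sum(w)) / lam                 # distance from the path
    return s, z
```

For configurations close to the first milestone, *s* approaches 0; close to the last milestone, it approaches 1, while *z* stays near 0 on the path itself.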

An often overlooked challenge is the choice of **X** itself. For a PATHCV to be effective in enhanced sampling simulations, the feature space must contain as many degrees of freedom as possible, so that the relevant ones pertaining to the transition between states A and B are captured. At the same time, the number of dimensions of **X** should be reasonably small, as the Euclidean norm used in Eqs. (1) and (2) quickly loses resolution for an increasing dimensionality of **X**. Here, the conundrum lies in navigating these two conflicting requirements, which represents a complex optimization problem. Attempting to solve it requires finding a difficult compromise that is often problem dependent and cannot be easily generalized.

In order to construct the PATHCVs, one typically follows a multi-step procedure. The starting point is a trajectory of the system of interest in which at least one transition from A to B is observed. Next, a feature vector **X**(*θ*) is determined, which should discriminate, as uniquely as possible, the state of the system at any given time *θ*. Then, one selects snapshots along the transition that define the path **X** and picks a finite *λ*. For reasons of computational efficiency, the number of snapshots *m* must be much smaller than the number of sampled datapoints. The milestone selection can be performed in a number of ways but is typically empirically crafted, requiring expert knowledge of the system and CV-building experience. We will see how some of these steps can be automated with the help of machine learning.

### B. The locally linear embedding method

Locally linear embedding (LLE)^{41} is a non-linear dimensionality reduction technique that can be used to transform high-dimensional data into a lower-dimensional space with the constraint of preserving the local structure (i.e., the neighborhood) of the data. This is achieved in two steps.

In the first step, assuming that each datapoint *X*_{i} possesses *k* neighbors that lie on a locally linear patch of its manifold, the LLE method reconstructs *X*_{i} from its neighbors via a linear regression. At first, one minimizes the cost function,

$$E(W) = \sum_i \left| X_i - \sum_j W_{ij} X_j \right|^2, \qquad (3)$$

subject to the constraints that *∑*_{j} *W*_{ij} = 1 and that each datapoint *X*_{i} is reconstructed exclusively from its neighbors. By doing so, one determines the optimal weight matrix *W*, where *W*_{ij} represents the contribution of the *j*th datapoint to the *i*th reconstruction. For our purposes, this corresponds to assuming that each point in a trajectory used to generate a PATHCV can be represented locally by *k* other points.

In the second step, the same weights *W*_{ij} that reconstruct the *i*th datapoint in *D* dimensions should also be able to reconstruct its embedded manifold coordinates *x*_{i} (of dimension *d*). This is accomplished by minimizing the embedding cost function,

$$\Phi(x) = \sum_i \left| x_i - \sum_j W_{ij} x_j \right|^2, \qquad (4)$$

where *W*_{ij} is now fixed.

Constructing a PATHCV by applying the standard LLE method still involves the application of the Euclidean metric on the feature space [compare Eqs. (1) and (4)]. The higher the dimension of the space, the more the computed Euclidean distances and the low-dimensional mapping suffer from degeneracy, i.e., a number of different feature space vectors produce analogous compressed values. Therefore, in both contexts, one faces the same challenge: the choice of an optimal metric. Furthermore, while one has to define the total number *m* of milestones for PATHCVs, in LLE, one also has to make an *a priori* choice of the number of neighbors *k* that locally represent each datapoint.
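For a single datapoint, the LLE reconstruction-weight step reduces to solving a small linear system on the local Gram matrix of its neighbors. The following is a minimal NumPy sketch; the function name and the regularization scheme are our assumptions, not the reference implementation.

```python
import numpy as np

def lle_weights(X, i, neighbors, reg=1e-3):
    """Solve for the weights reconstructing X[i] from its neighbors.

    X         -- (n, D) dataset
    i         -- index of the point to reconstruct
    neighbors -- indices of its k neighbors
    reg       -- regularization in case the Gram matrix is singular
    """
    Z = X[neighbors] - X[i]                 # neighbors shifted so X[i] is the origin
    G = Z @ Z.T                             # local (k x k) Gram matrix
    G = G + reg * np.trace(G) * np.eye(len(neighbors))
    w = np.linalg.solve(G, np.ones(len(neighbors)))
    return w / w.sum()                      # enforce the constraint sum_j W_ij = 1
```

For a point lying midway between two neighbors on a line, the solution is the expected symmetric weighting.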

### C. The deep-locally non-linear-embedding method

The DeepLNE CV that we develop here (see Fig. 1) aims at representing a high dimensional dataset with a single dimension, by incorporating concepts from PATHCV and LLE into a neural network architecture. Its starting point is the trajectory of a system going from state A to state B and a set of physical features **X** evaluated along it. Its objective is to build a directional CV, *s*, that can describe the progress along the transition through a non-linear combination of such a feature vector. The architecture of DeepLNE can be seen as a generalized autoencoder, with the first part being composed of an artificial neural network (ANN) that performs a preliminary dimensionality reduction, followed by a continuous k-nearest neighbor (k-NN) step on each datapoint, which is then passed to an encoder that compresses the neighborhood into a one-dimensional representation. This 1D CV, *s*, is then used to reconstruct the original input via a decoder, which is also used for computing an accessory perpendicular distance CV, *z*.

This strategy is built upon the satisfaction of a number of constraints. The first one requires that the initial *m* datapoints **X** can be optimally reconstructed from their one-dimensional representation *s*. We set this restriction in place with an asymmetric autoencoder architecture, where the reconstruction loss between the original data **X** and the decoded data $\hat{X}$ is minimized.

Second, in analogy with the PATHCVs and LLE, each datapoint *i* must be exclusively represented by its neighbors. In order to satisfy this constraint, we implement a continuous and differentiable relaxation of the k-NN selection rule (detailed in Subsection II D). Notably, the compressed latent space representation *s*_{i} is not directly obtained from a transformation of the input features of the datapoint *X*_{i} but from those of its *k* neighbors. Thus, the resulting CV is more robust to extrapolation, as it is anchored to the local description of the training datapoints at all times. As a third constraint, the neighbors of each datapoint are found in a different manifold of arbitrary dimension, in our case in a lower dimension *d* compared to the original feature space dimension *D*. This is achieved by transforming the initial inputs using an ANN. The PATHCV and LLE methods discriminate between datapoints by applying the Euclidean distance directly on the original feature space; therefore, their compressed representation may suffer from a degeneracy issue for large *D*. Instead, DeepLNE computes Euclidean distances exclusively in the reduced dimensional space *d* during the k-NN step, thus alleviating the degeneracy issue.
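The overall data flow can be sketched in PyTorch as follows. This is a minimal illustration, not the reference implementation: the layer sizes are placeholders, and the iterative differentiable k-NN step is collapsed here into a single temperature-scaled softmax over the stored training points.

```python
import torch
import torch.nn as nn

class DeepLNESketch(nn.Module):
    """Simplified sketch: features D -> reduced space d -> soft k-NN -> s -> decoded X_hat."""

    def __init__(self, D, d, X_train, t=0.1):
        super().__init__()
        self.ann = nn.Sequential(nn.Linear(D, d), nn.Tanh())     # D -> d reduction
        self.enc = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())  # neighborhood -> s in (0, 1)
        self.dec = nn.Linear(1, D)                               # s -> reconstruction X_hat
        self.register_buffer("X_train", X_train)                 # m training datapoints
        self.t = t                                               # softmax temperature

    def forward(self, X):
        x = self.ann(X)                           # query points in the reduced space
        x_train = self.ann(self.X_train)          # training points mapped the same way
        d2 = torch.cdist(x, x_train) ** 2         # squared distances computed in d, not D
        w = torch.softmax(-d2 / self.t, dim=-1)   # soft neighbor weights (stand-in for k-NN)
        x_knn = w @ x_train                       # each point described by its neighborhood
        s = self.enc(x_knn)                       # progress along the path
        X_hat = self.dec(s)                       # used for the reconstruction loss and z
        return s, X_hat
```

The sigmoid in the final encoder layer restricts *s* to (0, 1), matching the intended progress-along-the-path interpretation.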

The DeepLNE architecture thus provides both the progress along the path (*s*) and the distance from it (*z*), which can be used for driving the system along the process and controlling it. While *s* is computed as the output of the encoder part of the DeepLNE architecture described above, the *z* variable is defined from the decoded *s* variable. For a new datapoint **Y**, the perpendicular distance to the path *z* is computed as

$$z(\mathbf{Y}) = -\frac{1}{\lambda} \log \sum_{i=1}^{m} e^{-\lambda (\mathbf{Y} - \hat{X}_i)^2}, \qquad (5)$$

which is analogous to Eq. (2), where **X**(*θ*) is replaced by the decoded training data $\hat{X}$ (shown in the scheme of Fig. 1). By not comparing the new data directly with the training datapoints, but with their decoded versions (i.e., passed through neighbor selection and then reconstructed), we can more effectively measure the distance from the average learned path. Note that in this definition, the *z* variable might still suffer from the high dimensionality problem of $\hat{X}$. In this regard, we checked whether the variable *z* can be calculated with respect to the reduced dimension **x** instead of $\hat{X}$. However, in the numerical examples, we found that the *z* computed in this alternative way did not improve upon our recommended definition in Eq. (5), and in some cases, it even decreased the sampling efficiency along the transition of interest.

In summary, the DeepLNE algorithm makes it possible to construct a variable that approximates a transition path between states, while overcoming multiple issues of PATHCVs, as follows:

- A lower dimensional metric *d* is learned automatically and reduces the degeneracy when computing the Euclidean norm to identify the local neighborhood of a datapoint.
- Local descriptions of the high dimensional dataset do not have to be empirically chosen but are found in an automatic way by constructing the neighborhood of each datapoint using a differentiable k-NN step.
- Heterogeneous input features are permitted, e.g., we can combine distances, angles, and contact maps into **X**.

### D. Continuous and differentiable nearest neighbors selection

Following a continuous relaxation of the k-NN selection rule,^{46} we define the score based on the Euclidean distance *r* between a query point *x*_{i} and the transformed training datapoints *x*_{j} to be *a* = −*r*(*x*_{i}, *x*_{j}). The expectation $\bar{w}^1$ of the first index vector taking a value *i* ∈ *I* is given by

$$\bar{w}^1_i = P(i) = \frac{\exp(a_i / t)}{\sum_{i' \in I} \exp(a_{i'} / t)},$$

where *t* is the temperature scaling factor and *P* is the nearest neighbor probability for point *x*_{i}. To select each subsequent neighbor, the scores *a* are updated, such that

$$a^{j+1}_i = a^{j}_i + \log\left(1 - \bar{w}^{j}_i\right),$$

which suppresses the indices of the neighbors that have already been selected. The *k* nearest neighbors of *x*_{i} using the expectations $\bar{w}_{ij}$ follow as

$$x_i^{k\text{-NN}} = \sum_j \bar{w}_{ij} \, x_j.$$

To reduce the computational cost, we use the farthest point sampling (FPS) tool of Ref. 47, selecting a subset of the training datapoints *x* only for the k-NN selection and the construction of $x_i^{k\text{-NN}}$.
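The iterative soft selection can be sketched as follows. This is a minimal illustration of one common relaxation scheme (softmax selection followed by log-space suppression of already selected indices) and not necessarily the exact formulation of the cited reference; the function name and array layout are ours.

```python
import torch

def continuous_knn(d2, k, t):
    """Continuous, differentiable relaxation of k-NN selection.

    d2 -- (m,) squared distances from the query to the m training points
    k  -- number of neighbors to select
    t  -- temperature; t -> 0 approaches hard selection
    """
    a = -d2.clone()                        # scores: closer points score higher
    weights = []
    for _ in range(k):
        w = torch.softmax(a / t, dim=-1)   # expectation of the next index vector
        weights.append(w)
        a = a + torch.log1p(-w)            # suppress the index just selected
    return torch.stack(weights)            # (k, m) rows of soft one-hot weights
```

Each row is a differentiable, nearly one-hot weighting over the training points; the weighted sum of the corresponding *x*_{j} then gives the neighborhood representation of the query.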

### E. Interpretation of hyperparameters for nearest neighbors

An important ingredient for the deployment of the DeepLNE CV in the enhanced sampling context is the adoption of a continuous and differentiable approximation of the k-NN selection rule. Here, we focus on the choice and the interpretation of two hyperparameters: the number of neighbors *k* and the sparsity of the selection matrix *t*. Concerning the number of neighbors *k*, we have the following limiting cases:

- *k* → 1: leads to a sharp local description of the training data and high variability of the DeepLNE CV when extrapolating out of distribution. This is due to the fact that each datapoint is exclusively described by a single neighbor.
- *k* → ∞: in this limit, we have a loss of locality and, therefore, a loss of path-like behavior, especially in the regions that are sparsely described or under-sampled, such as the transition state region, which is crucial for capturing the dynamics of the process. Furthermore, the interpolation and extrapolation of the variables *s* and *z* become more degenerate, as a given datapoint is described by many training samples.

The second hyperparameter *t* determines the smoothness of the k-NN step by adjusting whether only the first *k* neighbors or a larger set of neighbors contribute (see Fig. SI 1). Its role is comparable to the one of *λ* in the PATHCV framework. We single out the following limiting cases:

- *t* → 0: imposes exact nearest neighbor selection, in which a datapoint is exclusively described by the features of the selected neighbors. The effect is analogous to the case where *λ* → ∞ for the PATHCV. In this limit, vanishing gradients occur as a result of the applied softmax function in the k-NN (compare Fig. SI 2).
- *t* → ∞: implies that a datapoint is described by a function of all the training datapoints, strongly lowering the resolving power of *s*, in analogy with *λ* → 0 for the PATHCV. As a consequence, each new point gets transformed into a point at the center of the training set.

Based on these limiting cases, we suggest that *k* should be rather small (*k* < 10) to retain the path-like behavior of the CV, while *t* should be small but finite (*t* ≈ 0.1) to ensure numerical stability during enhanced sampling simulations.
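The two limits of *t* can be illustrated directly on a toy score vector (illustrative distances, single softmax step only):

```python
import torch

# Soft neighbor weights for four training points at increasing squared distance.
d2 = torch.tensor([0.0, 1.0, 4.0, 9.0])
sharp = torch.softmax(-d2 / 0.1, dim=-1)     # small t: nearly hard selection of the nearest point
smooth = torch.softmax(-d2 / 100.0, dim=-1)  # large t: nearly uniform weights over all points
```

With a small temperature, essentially all the weight falls on the nearest point; with a large temperature, the weights flatten out and the query is described by the whole training set, illustrating the loss of resolving power discussed above.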

### F. Training the DeepLNE CV

For all the systems studied in this paper, we followed the same protocol: (1) sampling of the transition of interest, (2) training the DeepLNE CV on a set of input features, and (3) performing enhanced sampling MD simulation using the DeepLNE CV for bias deposition. In the following, we outline all the key steps of our recommended strategy for training the DeepLNE CV.

#### 1. Training data generation

Starting trajectories must describe the transition from state A to B fully and as densely as possible. They can be unbiased in the simplest of cases, where the energetic barrier between states is small, or they can be biased data. Among the biasing procedures to generate input trajectories, we would single out the ratchet-and-pawl restraint,^{48,49} steered MD,^{50} and OPES in exploration mode.^{44} Since the computational cost of DeepLNE increases with an increasing number of datapoints, one can rely on farthest point sampling (FPS) to select a maximally diverse subset of the original data, such that only *m* frames are retained.
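A greedy FPS selection can be sketched as follows. This is a minimal NumPy illustration of the idea; in practice, a dedicated FPS tool is used, and the starting-frame choice and distance metric here are our assumptions.

```python
import numpy as np

def farthest_point_sampling(X, m, seed=0):
    """Greedily pick m maximally diverse frames from the dataset X of shape (n, D)."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]           # random starting frame
    d = np.linalg.norm(X - X[selected[0]], axis=1)   # distance of each frame to the selected set
    for _ in range(m - 1):
        nxt = int(np.argmax(d))                      # farthest remaining frame
        selected.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)
```

Because each new frame maximizes the distance to the already selected set, the retained *m* frames cover the transition as uniformly as the data allow.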

#### 2. Neural network architecture

In the examples reported in this study, we use two feed-forward ANNs, both containing at most two hidden layers and the hyperbolic tangent activation function. The ANN architecture can be adjusted as desired by the user. We recommend choosing *d* such that the dimensional reduction per hidden layer and the total number of fitting parameters strike a balance. While the expected application scenario for complex biological systems with many degrees of freedom is *d* ≪ *D*, cases in which *d* > *D* can also be envisioned. Furthermore, in order for the *s* variable to describe the progress along the path, we scale it using a sigmoid activation in the final encoder layer, restricting its values to the range of 0–1.

#### 3. Neural network optimization

The network parameters are optimized with a learning rate of 10^{−3} over 5000 epochs, using the machine learning library PyTorch,^{51} with the help of the mlcolvar package.^{38}

#### 4. Inspecting the results

It can be instructive to plot the DeepLNE CV along a few important physical descriptors or against the first principal components of the input features **X**. A more detailed analysis can be obtained via a sensitivity analysis to identify the features that contribute the most^{22} or by using linear models to interpret the neural network-based CV.^{52}

#### 5. Exporting the model into PLUMED

After the training is finalized, the model is compiled using TorchScript. The DeepLNE CV is evaluated on the fly in MD simulations by loading the model into the PLUMED plugin^{53} as a PYTORCH CV.^{38} An implementation of the method and usage tutorials are available at https://github.com/ThorbenF/DeepLNE.
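The export step can be sketched as follows. The model below is a stand-in for a trained DeepLNE model and the file name is arbitrary; a real model would be compiled and saved the same way.

```python
import torch
import torch.nn as nn

# Stand-in CV model (architecture is an assumption, not the DeepLNE network).
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1), nn.Sigmoid())

scripted = torch.jit.script(model)        # compile the model to TorchScript
scripted.save("deeplne_model.pt")         # serialize it to disk
reloaded = torch.jit.load("deeplne_model.pt")

x = torch.randn(2, 4)
same = torch.allclose(model(x), reloaded(x))  # compiled model reproduces the original outputs
```

The saved file can then be referenced from a PLUMED input so that the CV is evaluated on the fly during the MD run (e.g., via the PYTORCH_MODEL action of PLUMED's pytorch module, with the chosen input features as arguments).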

### G. Simulation details

To test the introduced algorithm, we report simulation results for a particle in the Müller-Brown potential, alanine dipeptide, and an RNA tetraloop (Fig. 2).

#### 1. Müller-Brown

The simulations of a particle subjected to the Müller-Brown potential are performed using the simple Langevin dynamics code contained in the VES module of PLUMED,^{54} and the biased simulations are performed using the OPES^{42} method with a pace of 200 steps, the automatic bandwidth selection, and a barrier parameter equal to 15 *k*_{B}*T*. The particle is moving in two dimensions under the action of the three-state potential shown in Fig. 2, built out of a sum of Gaussians. During step (1), we choose to employ the ratchet and pawl ABMD method^{48} biasing the y-coordinate to sample the transition between the two states A and B. Then, in step (2), the DeepLNE CVs (*s* and *z*) are trained using the *x* and *y* positions of the particle as input features. For step (3), a new simulation is run where *s* is used as a CV to deposit bias via OPES. In addition, a half parabolic harmonic constraint with a cutoff value of 0.1 (UWALL) and *κ* = 1000 *k*_{B}*T* is applied on the *z* variable. We estimate statistical errors via block analysis using ten blocks.

#### 2. Alanine dipeptide

For the extensively studied FES of alanine dipeptide, three transition paths are investigated considering the dihedral angles *ϕ* and *ψ*. During step (1), we apply ABMD biasing both dihedrals so that five independent transitions across the FES barrier of interest are observed. To test the dimensionality reduction capability of DeepLNE, we also employ all inter-atomic distances in the molecule as initial features, which corresponds to *D* = 190. To save computational costs, we use the FPS tool of Ref. 47 to reduce the number of training datapoints (*m* = 3000). After training the DeepLNE CV, we use the *s* variable as a CV to bias in combination with OneOPES^{45} and eight replicas. We choose a pace of 500 steps, the automatic bandwidth selection, and barrier parameters equal to 60 and 2.5 kJ/mol for OPES-METAD and OPES-EXPLORE, respectively. In addition, a harmonic constraint (UWALL) with *κ* = 20 000 kJ/mol is applied on the *z* variable. Simulations are carried out with the DES-Amber ff.^{55} We analyze replica 0 and estimate the statistical errors using ten blocks of length 2 ns.

#### 3. RNA tetraloop

We select an RNA 8-mer as a model for a system with biological relevance and configurational complexity in order to perform a prediction task using DeepLNE. The system consists of eight nucleotides with the sequence “CCGAGAGG” containing the GAGA tetraloop motif (Fig. 2) and has previously been studied with extensive enhanced sampling MD^{56} using 24 replicas with a simulation length of 1 *µ*s per replica. As proposed by the authors of the previous study, we included recent corrections to the van der Waals parameters and a more accurate water model. Consequently, we used the standard OL3 RNA ff^{57–60} for our simulation with the van der Waals modification of phosphate oxygens developed in Ref. 61 without adjustment of dihedral parameters. As a water model, we chose OPC.^{62} This combination was originally proposed in Ref. 63 and has already been tested on RNA systems.^{64,65} The simulations were run at salt concentrations corresponding to a system with neutralized total charge using potassium with the Joung–Cheatham ion parameters^{66} optimized for TIP4P-Ew. To sample the transition from the natively folded state A to the unfolded state B, we use the ABMD method biasing 3 different CVs: (1) the eRMSD metric^{67} with respect to state A, (2) a contact map (CMAP) constructed using the heavy atom distances between the 4 nucleotides in stems 1 and 2 as well as 7 and 8 with a cutoff value of 5 Å, inspired by Ref. 68, and (3) a contact map (CMAP) constructed using the heavy atom distances between the 4 nucleotides in the loop 3 and 4 as well as 5 and 6 with a cutoff value of 3.5 Å. We start ABMD runs from the unfolded state at an eRMSD $>2.1$ and stop the simulation upon reaching the natively folded state at an eRMSD $<0.2$. We reduce the number of training datapoints (*m* = 2250) via FPS and choose the same distance metrics used for the ABMD simulations as input features in the DeepLNE training.
In the enhanced sampling simulation using DeepLNE, we use the *s* variable to deposit bias with OneOPES. For OPES-EXPLORE, we choose a pace of 20 000 steps, automatic bandwidth selection, and a barrier parameter equal to 50 kJ/mol. The temperatures for the 8 replicas are selected from a distribution ranging from 300 to 400 K, and each replica is simulated for 500 ns. As done in the previous study,^{56} the free energy difference was calculated from the probabilities of the folded and unfolded states as Δ*F* = −*k*_{B}*T*[log(*∑*_{i,folded} *p*(*x*_{i})) − log(*∑*_{i,unfolded} *p*(*x*_{i}))]. Structures with an eRMSD $>0.7$ from the natively folded structure are considered unfolded. Replica 0 is analyzed excluding the first 200 ns. The statistical errors were estimated using 10 blocks of 30 ns.
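The free energy difference above can be sketched in a few lines of NumPy. This is a minimal illustration of the formula, with hypothetical function and argument names; the frame weights would come from the reweighted OneOPES trajectory.

```python
import numpy as np

def folding_free_energy(erms, weights, kBT, cutoff=0.7):
    """Delta F = -kBT [log(sum_folded p) - log(sum_unfolded p)] from eRMSD values.

    erms    -- eRMSD of each frame from the native structure
    weights -- statistical weight p(x_i) of each frame (e.g., from reweighting)
    kBT     -- thermal energy in the same units as the returned Delta F
    """
    erms = np.asarray(erms)
    weights = np.asarray(weights, dtype=float)
    folded = erms <= cutoff                 # frames within the eRMSD cutoff count as folded
    p_fold = weights[folded].sum()
    p_unfold = weights[~folded].sum()
    return -kBT * (np.log(p_fold) - np.log(p_unfold))
```

A negative value indicates that the folded state is more populated, and hence more stable, than the unfolded one.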

Simulations of alanine dipeptide and GAGA tetraloop are performed using the GROMACS 2022.5 engine^{69} patched with the PLUMED 2.9 plugin^{53} with the Hamiltonian replica exchange algorithm.^{70}

After the simulations of the toy models (Müller-Brown and alanine dipeptide) were completed, we performed a committor analysis to investigate whether the identified DeepLNE variable *s* approximates the real reaction coordinate well. This analysis protocol is meant to estimate the committor values near the maximum of the FES. For the Müller-Brown potential, the transition region is split into 24 bins along the *s* CV in the region of the maximum in the FES, and for each bin, 10 sets of 50 particle positions are extracted via bootstrapping and independent simulations are performed. In the case of alanine dipeptide, we split the FES of replica 0 in the vicinity of the transition state region into 11 bins along the *s* variable and extract 100 structures with the lowest values of the *z* variable. Then, we start new independent simulations from these structures, bookkeeping which state they visit first (A or B). We repeat this 5 times with randomized starting velocities to obtain averages and error estimates.
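The per-bin committor estimate from such shooting runs can be sketched as follows. This is a minimal illustration; the function name and array layout are our assumptions.

```python
import numpy as np

def committor_per_bin(bin_ids, outcomes, n_bins):
    """Fraction of shooting trajectories reaching state B first, per s-bin.

    bin_ids  -- s-bin index of each starting structure
    outcomes -- 1 if the trajectory reached B first, 0 if it reached A first
    """
    bin_ids = np.asarray(bin_ids)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.array([outcomes[bin_ids == b].mean() if np.any(bin_ids == b) else np.nan
                     for b in range(n_bins)])
```

If *s* approximates the reaction coordinate well, the per-bin committor should rise monotonically from 0 to 1 and pass through 0.5 at the free energy maximum.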

## III. RESULTS

### A. Particle in Müller-Brown potential

We demonstrate how to train and apply the DeepLNE CV to enhanced sampling MD simulations. In Fig. 3, we show the collected results of the three steps along the proposed DeepLNE algorithm. We start by performing two simulations biasing the *y*-coordinate via ABMD using different spring constants *κ* = 0.5 *k*_{B}*T* and *κ* = 1.25 *k*_{B}*T*. As shown in Fig. 3(a), both runs start from the same state A and are stopped as soon as they reach state B. We show the portion of trajectory that describes the transition from state A to state B. We use this reduced set of datapoints to compute the input features for the DeepLNE CV training step. In Fig. 3(b), we show the results of biasing the particle simulation using the *s* variable of the DeepLNE model and constraining perpendicular movement via the *z* variable. One can appreciate that the desired exploration moving from state A to state B is achieved and that *s* describes the intended progression along the path.

In Fig. SI 3(a), we show the decoded training datapoints $\hat{X}$ of DeepLNE. Since all datapoints used for training the DeepLNE CV describe a transition across the same barrier, $\hat{X}$ approximates well the path along the minimum in the Müller-Brown FES. Consequently, applying a harmonic constraint on *z* during enhanced sampling MD allows for frequent transitions while sampling the movement between states A and B, as shown in Fig. SI 3(b). We can appreciate that the entire spectrum of the *s* variable is explored. Since we exclusively sample the transition and not the entire free energy landscape [Fig. 3(b)], we can confirm that the applied harmonic constraint on *z* is effective, limiting the exploration perpendicular to the path-like variable *s*. In Fig. 3(c), we show the results of the free energy estimation as well as the committor analysis based on the estimated FES, shown in more detail in panel (d). We compute the committor probability for state A in 24 different bins along *s* in the region of the highest free energy. Along increasing values of the *s* variable, one can appreciate that the committor value changes from 0 to 1 and takes a value of 0.47 ± 0.04 in proximity of the maximum barrier value in the FES, indicating that the DeepLNE CV approximates the ideal reaction coordinate well. In Fig. SI 3(c), we show the individual committor values for each of the 10 estimations, resulting in the average committors reported herein.

For the trained DeepLNE model, we benchmark the computation time for evaluating the variables *s* and *z* (a single forward pass) for an increasing number of training datapoints kept in memory and used during the k-NN step. The computation time scales sub-linearly for fewer than 400 training datapoints and linearly beyond (Fig. SI 4).

### B. Alanine dipeptide

Figure 4 shows the results of constructing the DeepLNE CVs for three different transitions of alanine dipeptide and estimating the FES via OneOPES simulations using the *s* variable to deposit the bias potential. We note that the DeepLNE CVs are a non-linear function of the 190 inter-atomic distances in alanine dipeptide. Figures 4(a)–4(c) show the datapoints initially sampled from short ABMD runs across three different barriers in the reference FES obtained using OPES-METAD, biasing *ϕ* and *ψ* over 200 ns. We color the FPS-reduced training datapoints based on the DeepLNE variable *s*. The variable describes the intended progression from one state A to another state B across the different barriers in all three cases.

Having constructed three path-like variables, we estimate the FES by performing OneOPES simulations along them and carry out committor analysis for each case [Figs. 4(d)–4(f)]. Before going into more detail, we note that in Fig. SI 5, we collected the expectation value of the *z* variable and the time evolution of the *s* variable. The analysis of ⟨*z*⟩_{ϕ,ψ} [see Figs. SI 5(a)–5(c)] confirms that path-like CVs are learned that exhibit low *z* values along the transition of interest. Importantly, frequent transitions occur, and the entire range of the *s* variables is explored during the sampling of the three different transition paths [Figs. SI 5(d)–5(f)]. Figure 4(a) superimposes 5 ABMD simulations starting from the same configuration in state *C*_{7eq} and progressing to state *C*_{7ax} over a barrier at *ϕ* = 2, *ψ* = −2.5. The reference FES shows a single transition state region, which we quantify with the OneOPES simulation shown in Fig. 4(d). The transition region is located between *s* values of 0.13 and 0.25, and accordingly, configurations from 10 bins in this range are extracted to perform committor analysis. We find that *s* values in the range 0.17–0.19 of the DeepLNE CV correspond to the transition state, with a committor value of 0.49 ± 0.05. In Fig. 4(b), we show another transition between states *C*_{7eq} and *C*_{7ax}, this time across a barrier located close to *ϕ* = 0, *ψ* = 0. Superimposing the training data obtained from 5 ABMD runs and coloring them with the learned *s* variable shows that the transition should be expected for *s* values in the range 0.53–0.65; accordingly, we analyze the corresponding OneOPES simulation shown in Fig. 4(e). In this case, *s* values between 0.59 and 0.61 lead to committor probabilities of 0.52 ± 0.01, again allowing us to identify the configurations corresponding to the transition state.
Finally, in Fig. 4(c), we train the DeepLNE model on a third possible transition path in the alanine dipeptide FES. In this case, the system starts from configurations of state *C*_{7eq} with *ϕ* = −1.5, *ψ* = 0.5. The transition, which proceeds through a transition region close to *ϕ* = −2, *ψ* = −1, is considered completed when *ϕ* = 3, *ψ* = −3 is approached. Here too, OneOPES simulations are performed using the DeepLNE *s* variable as the biased CV, leading to the FES shown in Fig. 4(f). The maximum of the free energy occurs for *s* values close to 0.65, and through committor analysis in the range 0.57–0.75, we find committor probabilities of 0.49 ± 0.04 for *s* values 0.64–0.67. Furthermore, in Figs. SI 6(a)–6(c), we take a closer look at the subsections of the variable *s* analyzed by committor analysis, showing the individual committor estimations for all three transition paths.
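The committor values quoted here follow the standard shooting recipe: from configurations extracted in a given *s* bin, many short unbiased trajectories are launched, and the fraction that commits to one basin first is counted. A minimal sketch of the estimator, with synthetic shot outcomes in place of actual MD runs (the probabilities used are illustrative):

```python
import numpy as np

def committor_estimate(outcomes):
    """Committor estimate from unbiased 'shots': fraction of trajectories that
    reach state B before state A, plus the binomial standard error."""
    outcomes = np.asarray(outcomes, dtype=float)   # 1 = reached B first, 0 = reached A
    p = outcomes.mean()
    err = np.sqrt(p * (1.0 - p) / len(outcomes))
    return p, err

rng = np.random.default_rng(1)
# Synthetic stand-ins for MD shots (in reality, each outcome is one short
# unbiased trajectory started from a configuration in the selected s bin).
for label, p_true in (("reactant-side bin", 0.1), ("barrier-top bin", 0.5)):
    shots = rng.random(100) < p_true
    p, err = committor_estimate(shots)
    print(f"{label}: p_B = {p:.2f} +/- {err:.2f}")
```

A bin whose estimate is statistically compatible with 0.5 marks the transition state ensemble along *s*.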

In addition, the relevance of the input features **X** can be estimated by accumulating the gradients of the DeepLNE *s* variable with respect to **X**. This is especially insightful in a scenario where the input feature dimension is very high. We report this analysis in Fig. SI 7, showing that for all three path-like DeepLNE CVs, there exist differences in relevance between input features [Figs. SI 7(a)–7(c)]. Importantly, the most relevant inputs identified are sensitive to the relevant dihedral angle changes required to perform the respective transitions [Figs. SI 7(d)–7(f)].
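Such a gradient-accumulation analysis can be sketched with automatic differentiation. The small network below merely stands in for a trained DeepLNE model, and the data are random placeholders; only the ranking procedure itself is the point.

```python
import torch

torch.manual_seed(0)
n_features = 190                     # e.g. a set of inter-atomic distances
model = torch.nn.Sequential(         # placeholder for a trained s(X) model
    torch.nn.Linear(n_features, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1), torch.nn.Sigmoid(),
)

X = torch.randn(500, n_features, requires_grad=True)   # placeholder datapoints
model(X).sum().backward()                              # accumulate ds/dX over the dataset

relevance = X.grad.abs().mean(dim=0)   # mean |gradient| per input feature
top5 = torch.topk(relevance, 5).indices
print("most relevant features:", top5.tolist())
```

Features with a large mean absolute gradient are those to which *s* is most sensitive across the training set.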

Finally, since the size of the features **X** is very high for this system, we analyzed the effect of the dimensionality reduction operated by the first ANN (*D* → *d*) for all three transition paths. To this end, we inspected the distribution of Euclidean distances between the training data in both the high and low dimensional representations **X** and **x** (Fig. SI 8). The latter leads to broader distributions and, consequently, reduced degeneracy when defining the local neighborhood, thus resulting in a better input for the nearest-neighbor selection algorithm.
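The degeneracy effect can be illustrated with a toy numerical experiment: pairwise Euclidean distances concentrate around a typical value in high dimension, while a low-dimensional representation spreads them out relative to their mean. Here, random Gaussian data and a trivial coordinate projection stand in for the actual features and the trained ANN encoder.

```python
import numpy as np

def pairwise_distances(data):
    """Upper-triangular pairwise Euclidean distances between rows of data."""
    diff = data[:, None, :] - data[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return d[np.triu_indices(len(data), k=1)]

rng = np.random.default_rng(2)
n = 200
X_high = rng.normal(size=(n, 190))   # placeholder for the features X (D = 190)
x_low = X_high[:, :3]                # trivial projection standing in for the ANN (d = 3)

spread = {}
for name, data in (("X (D=190)", X_high), ("x (d=3)", x_low)):
    dist = pairwise_distances(data)
    spread[name] = dist.std() / dist.mean()   # relative spread of the distribution
    print(f"{name}: mean distance = {dist.mean():.2f}, relative spread = {spread[name]:.3f}")
```

The larger relative spread in the reduced space means neighbors are better separated from non-neighbors, which is what the k-NN selection exploits.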

In conclusion, these results show that despite the challenge of being trained on high-dimensional input features, the DeepLNE algorithm can successfully guide a simulation from state A to B in free energy space and help identify the configurations belonging to the transition state.

### C. RNA GAGA-tetraloop

In Fig. 5(a), we compare the FES estimation of the RNA GAGA tetraloop after 500 ns of simulation using OneOPES in combination with the eRMSD or DeepLNE variable *s* as the CV for bias deposition. As in Ref. 56, the eRMSD is computed with respect to the natively folded conformation. The DeepLNE CV is trained on the eRMSD and two contact maps providing additional information about the contacts formed between the four nucleotides in the stem and the loop, respectively.

Our OneOPES simulation biasing the eRMSD for 500 ns results in an FES estimate that is in disagreement with previous studies in which extensive enhanced sampling simulations were carried out. The FES in regions of conformations corresponding to high eRMSD values exhibits high statistical errors and strongly underestimates the propensity of these conformations with respect to what has been previously observed (compare the FES in Ref. 56). In contrast, the OneOPES simulation using the trained DeepLNE variable *s* to deposit a bias potential over time results in an FES that shows the expected relative population of the natively folded and misfolded states. The statistical errors over the entire FES are consistently low. The source of the high statistical uncertainty in the simulations biased via eRMSD is shown in Fig. SI 9, where we compare the exploration of the biased CV (eRMSD, DeepLNE) over the course of the simulation. The analysis confirms that the eRMSD does not reliably explore its entire range of values, whereas for simulations with the DeepLNE *s* variable, the entire spectrum of *s* values is explored and frequent transitions occur.

In Fig. 5(b), we show the estimated free energy of folding over the last 300 ns of simulation time. In agreement with the discussion above, the OneOPES simulation biasing the eRMSD starts with a Δ*F* that overestimates the free energy of folding and drifts over the course of the simulation. In contrast, the folding free energy estimate for the OneOPES run employing the DeepLNE CV *s* for bias deposition approaches the value obtained after 500 ns of enhanced sampling within the first 200 ns. The Δ*F* estimate plateaus at 11.5 kJ/mol. Since our simulations are performed using a force field that includes a van der Waals modification of the phosphate oxygens and the more accurate OPC water model, neither of which was used in Ref. 56, our Δ*F* estimate is expected to be similar, but not identical, to the previous findings. We specifically note for the GAGA tetraloop system that even though the folded state is expected to be energetically favored experimentally, the opposite is observed in simulations because of force field inaccuracies. Interestingly, our simulation performed with the corrections envisioned in Ref. 56 indeed lowers the free energy of folding. Nevertheless, the misfolded state still represents the minimum of the FES.
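Such a folding Δ*F* is, in essence, the log-ratio of the Boltzmann populations of the two basins integrated from the FES. A schematic 1D version of this bookkeeping (the double-well profile, split point, and temperature below are illustrative placeholders, not the tetraloop data):

```python
import numpy as np

kT = 2.577  # kJ/mol at ~310 K (illustrative; use the simulation temperature)

def folding_free_energy(s, fes, s_split):
    """dF = -kT ln(P_folded / P_unfolded) from a 1D FES on a uniform s grid."""
    w = np.exp(-fes / kT)                 # unnormalized Boltzmann weights
    p_folded = w[s < s_split].sum()       # population of the folded basin
    p_unfolded = w[s >= s_split].sum()    # population of the misfolded basin
    return -kT * np.log(p_folded / p_unfolded)

# Synthetic double-well FES along s: folded minimum near s = 0.2,
# misfolded minimum near s = 0.8, the latter lower by ~11.5 kJ/mol.
s = np.linspace(0.0, 1.0, 500)
fes = 4000.0 * (s - 0.2) ** 2 * (s - 0.8) ** 2
fes[s > 0.5] -= 11.5
dF = folding_free_energy(s, fes, s_split=0.5)
print(f"dF_folding ~ {dF:.1f} kJ/mol")
```

A positive Δ*F* by this convention means the misfolded basin is more populated, matching the situation described above.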

## IV. DISCUSSION

In this work, we propose a strategy to construct a path-like CV using a data-driven approach. The DeepLNE method builds a 1D representation of reactive trajectories that can be used to capture and especially accelerate rare events, with a focus on biophysical systems. Through multiple examples, we have shown how this CV well approximates the ideal reaction coordinate and can be effectively used to improve sampling using enhanced sampling methods, such as OPES or its recently developed OneOPES variant.

Our methodology begins with obtaining a first reactive trajectory that captures the transition from state A to state B. Herein, these starting trajectories have been obtained with a ratchet-and-pawl restraint,^{48} which provided training data of sufficient quality for all the systems investigated. For more complex systems, an input reactive trajectory that samples the transition state regions more extensively might result in a DeepLNE CV that captures the region of the free energy barrier more precisely. To this end, an iterative biasing procedure, such as the one introduced in Ref. 71 or the generation of synthetic training data as in Ref. 72, can be considered in future applications.

The input trajectory is used to define a feature vector that distinguishes not only state A from state B but also the intermediate states encountered. From this featurized trajectory, one can automatically derive path-like variables by applying the DeepLNE method introduced in this study. Input features are initially subjected to dimensionality reduction. In this lower dimensional manifold, we compute Euclidean distances to search for a small number of k-nearest neighbors that describe the local neighborhood of each datapoint. Compressing this neighborhood representation even further finally yields the DeepLNE variable *s*, which, upon decoding, in turn generates the DeepLNE CV *z*. The interpretation of *s* and *z* is analogous to that of the PATHCVs, as they describe the progress along the path and the perpendicular distance from it, respectively.
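The encode/neighbor-search/compress pipeline can be sketched as a single differentiable forward pass. The following is a schematic reimplementation under our own simplifications (hard top-k selection with soft weights, an untrained encoder, and random placeholder data), not the published DeepLNE code:

```python
import torch

class SoftKNNPathCV(torch.nn.Module):
    """Schematic path-like CV: ANN encoder (D -> d), differentiable k-NN over
    stored training datapoints, weighted average of their progress labels."""

    def __init__(self, train_X, progress, dim_in, dim_low=2, k=10, beta=20.0):
        super().__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(dim_in, 32), torch.nn.Tanh(),
            torch.nn.Linear(32, dim_low),
        )
        self.register_buffer("train_X", train_X)    # training datapoints (features)
        self.register_buffer("progress", progress)  # progress label in [0, 1] per datapoint
        self.k, self.beta = k, beta

    def forward(self, X):
        x = self.encoder(X)                               # reduce D -> d
        d = torch.cdist(x, self.encoder(self.train_X))    # distances in the reduced space
        dk, idx = torch.topk(d, self.k, largest=False)    # k nearest training datapoints
        w = torch.softmax(-self.beta * dk, dim=1)         # smooth neighbor weights
        return (w * self.progress[idx]).sum(dim=1)        # progress along the path, s

torch.manual_seed(0)
n, D = 200, 190
cv = SoftKNNPathCV(torch.randn(n, D), torch.linspace(0, 1, n), dim_in=D)
s = cv(torch.randn(4, D))   # one s value per input configuration, each in [0, 1]
print(s)
```

Because the neighbor weights are a softmax over distances, gradients flow through the distance computation, which is what makes the variable usable for biasing in MD.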

The proposed method alleviates the issue of degeneracy that occurs when comparing datapoints in a high dimensional space via the Euclidean metric. It does so by identifying the local neighborhood of a datapoint in a lower dimensional representation instead. At the same time, it automates the process of constructing the CVs since the *s* and *z* CVs are generated without making the empirical choices of selecting a set of milestones and a metric that have a strong impact both on the quality of the CV and on the time required to fine-tune it. We show that non-linear combinations of hundreds of descriptors can easily be handled with DeepLNE, providing high flexibility and expressiveness. In addition, the input features can be selected by using data-driven methods.^{32} Unlike standard neural network-based CVs, the DeepLNE CV exclusively represents new datapoints through their neighbors in the training set. This feature is fundamental in preventing erroneous extrapolations when unknown data regions are visited.

Our method is not limited to two-state systems and can be applied to more complex systems characterized by multiple metastable states. A DeepLNE CV can be successfully trained as long as the training trajectories visit all the states of interest and their transition regions. In the systems studied here, we did not impose directionality on the CV in order to show that path-like CVs can be successfully derived even in a completely unsupervised manner. If needed, directionality can be ensured by adding a supervised term to the optimization (such as a cross-entropy or discriminative term), which can be done straightforwardly in a multi-task framework.^{38} We believe that DeepLNE provides a powerful and flexible method to automatically create efficient path-like CVs to accelerate the sampling of complex phenomena.

## SUPPLEMENTARY MATERIAL

The supplementary material contains additional information on the influence of hyperparameters on the neighborhood representation and on the phenomenon of vanishing gradients in continuous k-nearest neighbors; further analysis of the particle in the Müller-Brown potential and of alanine dipeptide, including committor analysis; a computational benchmark; an importance ranking analysis; and a comparison of the distributions of the Euclidean norm in the input dimensions X and the reduced dimension x. In addition, the exploration of the biased CVs in the enhanced sampling simulations of the RNA GAGA-tetraloop is provided.

## ACKNOWLEDGMENTS

We acknowledge Simone Aureli for providing several useful suggestions. F.L.G., V.R., and T.F. acknowledge the Swiss National Supercomputing Center (CSCS) for large supercomputer time allocations Project ID: s1228. They also acknowledge the Swiss National Science Foundation and BRIDGE for financial support (Project Nos. 200021_204795, CRSII5_216587, and 40B20_203628). L.B. acknowledges funding from the Federal Ministry of Education and Research, Germany, under the TransHyDE research network AmmoRef (Support Code: 03HY203A).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Thorben Fröhlking**: Conceptualization (equal); Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (equal); Software (lead); Validation (lead); Visualization (lead); Writing – original draft (equal); Writing – review & editing (equal). **Luigi Bonati**: Conceptualization (equal); Writing – original draft (equal); Writing – review & editing (equal). **Valerio Rizzi**: Methodology (equal); Writing – original draft (equal); Writing – review & editing (equal). **Francesco Luigi Gervasio**: Funding acquisition (lead); Project administration (lead); Resources (lead); Supervision (lead); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

At https://github.com/ThorbenF/DeepLNE, we provide the DeepLNE class with a corresponding tutorial, together with scripts that allow one to reproduce the figures and the results of this study.

## REFERENCES

*Understanding Molecular Simulation*

*Advances in Neural Information Processing Systems*

*α/γ*conformers

*et al.*nucleic acids force field based on reference quantum chemical calculations of glycosidic torsion profiles