Single-molecule experimental techniques track the real-time dynamics of molecules by recording a small number of experimental observables. Following these observables provides a coarse-grained, low-dimensional representation of the conformational dynamics but does not furnish an atomistic representation of the instantaneous molecular structure. Takens’s delay embedding theorem asserts that, under quite general conditions, these low-dimensional time series can contain sufficient information to reconstruct the full molecular configuration of the system up to an *a priori* unknown transformation. By combining Takens’s theorem with tools from statistical thermodynamics, manifold learning, artificial neural networks, and rigid graph theory, we establish an approach, Single-molecule TAkens Reconstruction, to learn this transformation and reconstruct molecular configurations from time series in experimentally measurable observables such as intramolecular distances accessible to single molecule Förster resonance energy transfer. We demonstrate the approach in applications to molecular dynamics simulations of a C_{24}H_{50} polymer chain and the artificial mini-protein chignolin. The trained models reconstruct molecular configurations from synthetic time series data in the head-to-tail molecular distances with atomistic root mean squared deviation accuracies better than 0.2 nm. This work demonstrates that it is possible to accurately reconstruct protein structures from time series in experimentally measurable observables and establishes the theoretical and algorithmic foundations to do so in applications to real experimental data.

## I. INTRODUCTION

The molecular structure of a protein is defined by a vector in $\mathbb{R}^{3N}$ specifying the Cartesian coordinates of the *N* constituent atoms. Molecular dynamics (MD) simulation is a computational algorithm that propagates the 3*N*-dimensional position vectors through time under the action of a force field specifying the interatomic interaction potential.^{1} At each step of the MD simulation, the full 3*N*-dimensional configurational state of the system is available. Sophisticated experimental techniques such as x-ray crystallography and cryo-electron microscopy can solve protein structures to near-atomic resolution in crystalline or vitrified samples.^{2–4} Proteins, however, are typically not functional under these conditions, and these structures cannot capture transitions between metastable states. For example, the *Mycobacterium tuberculosis* protein tyrosine phosphatase PtpB exhibits dynamical transitions between “closed” and “open” states in which a pair of *α*-helices transiently cover the active site and dynamically protect it from oxidative inactivation.^{5} Single-molecule experimental techniques such as single-molecule Förster resonance energy transfer (smFRET) can follow protein dynamics by tracking small numbers of experimentally observable intramolecular distances between fluorescent probes conjugated to the target molecule.^{3,4,6} There are currently no experimental techniques available to follow the real-time dynamical evolution in atomistic detail.

Takens’s delay embedding theorem is a result from dynamical systems theory that asserts that a sufficiently long and frequently sampled time series of a single system observable can contain sufficient information to reconstruct the full-dimensional state of the system up to an *a priori* unknown diffeomorphism (i.e., a smooth and invertible bijective transformation).^{7–16} This theoretical result opens the door to reconstructing the molecular coordinates of a protein from single-molecule experimental measurements such as smFRET. We have previously applied Takens’s theorem to synthetic single-molecule time series extracted from molecular dynamics simulations of polymers and proteins to estimate single-molecule free energy surfaces (smFESs).^{17,18} Since the simulations also furnish the full molecular configurations, we also estimated the “true” smFES from the atomistic molecular simulation trajectory to numerically verify the existence of the *a priori* unknown diffeomorphism and place empirical bounds on the degree of perturbation to the true smFES induced by this transformation. In this work, we build on these foundations using tools from rigid graph theory and artificial neural networks to learn this transformation from the data and also approximate the inverse transformation from the low-dimensional smFES back up to the molecular configuration space. This calibrates a functional approximation mapping a time series in a single system observable to the atomistic molecular configuration, thereby enabling the reconstruction of atomistic molecular structures from experimentally measurable observables. We term this approach Single-molecule TAkens Reconstruction (STAR).

The structure of this paper is as follows: In Sec. II, we describe the methodological details of STAR. We detail the mathematical formulation and numerical solution of each step of the learning problem combining principles and tools from statistical thermodynamics, manifold learning, and rigid graph theory. In Sec. III, we present the applications of STAR to MD simulations of a C_{24}H_{50} polymer chain and the 10-residue artificial mini-protein chignolin. We train STAR to reconstruct molecular configurations from univariate time series in the head-to-tail distance as a synthetic and idealized smFRET time trace. In Sec. IV, we present our conclusions and opportunities for future work.

## II. METHODS

### A. Principles of STAR

A cartoon schematic of STAR is presented in Fig. 1. The objective of this work is to train STAR as an accurate and generalizable model to predict molecular configurations from time series data in experimental observables through the four-step pathway *b* → *d* → *c* → *e* → *f*. Each panel in this figure corresponds to a different representation of the molecular system, and the arrows between them correspond to mathematical operations to convert one representation to another. Red arrows correspond to unsupervised learning problems, typically nonlinear dimensionality reduction; blue arrows correspond to supervised learning problems requiring learning of a nonlinear function linking the inputs and outputs, and a gray arrow corresponds to deterministic operations. In order to learn and approximate each of these functions, we require training data for which the molecular configurations [Fig. 1(a)] corresponding to each point in the time series in the experimental observable [Fig. 1(b)] are known. We obtain the former from atomistic MD simulations and the latter by computing the experimentally measurable observable corresponding to each frame in the trajectory. In this work, we adopt the head-to-tail (*h*2*t*) distance as an experimental observable that could, in principle, be measured by smFRET.^{4,6} We emphasize that the MD simulation data are only required to train the model, and once the model is trained, there is no requirement for any additional molecular calculations. The trained model predicts molecular configurations [Fig. 1(f)] from novel time series data [Fig. 1(b)] that were not present in the training data via the pathway *b* → *d* → *c* → *e* → *f*. In this work, we generate the novel time series data from additional MD simulations, but, in principle, the STAR model trained and calibrated over MD data could be applied to experimental measurements. We detail the additional sophistications that would be necessary to achieve that goal in Sec. IV. 
A separate STAR model must be trained for each molecular system.

What is the origin of the multi-step pathway in STAR illustrated in Fig. 1? Why not predict molecular configurations directly from the *h*2*t* time series in a single step *b* → *f*? In this work, we exploit the generic low effective dimensionality of molecular systems along with theoretical guarantees offered by Takens’s delay embedding theorem to formulate a succession of simpler and typically lower-dimensional learning problems with firm theoretical underpinnings that we solve using appropriate numerical tools. An additional benefit of this perspective is that the approach also explicitly learns the smFES. The trained STAR model therefore furnishes both a prediction of the molecular structure and its location and thermodynamic stability on the smFES. We now detail each step of the STAR approach presented in Fig. 1.

#### 1. Molecular dynamics training data [**r**(*t*), *v*(*t*)]

MD simulations of sufficient duration to comprehensively sample the thermally relevant configurational space of the molecular system are required to provide the initial training data for STAR. Data that do not sample all relevant conformational states and transitions will result in models overfitted to the training data and a poorly calibrated STAR model that is unable to generalize to new data. For the small macromolecular and biomolecular systems considered in this work, simulations of a few hundred ns to a few *μ*s were sufficient to provide good sampling and generalizability. The simulation trajectory provides a temporally ordered sequence of snapshots $\mathbf{r}(t)\in\mathbb{R}^{3N}$ [Fig. 1(a)]. For each snapshot, we compute the instantaneous value of an experimentally accessible observable to define a 1D time series $v(t)\in\mathbb{R}$ [Fig. 1(b)]. In the present work, we set *v*(*t*) to be the head-to-tail molecular distance *h*2*t*(*t*) that is, in principle, measurable by smFRET by conjugating the termini of the molecules with fluorescent probes. These time series can therefore be regarded as idealized synthetic smFRET time traces with no measurement noise and arbitrarily high time resolution. These training data are used to train all steps of the STAR pipeline.

#### 2. Learning the atomistic manifold $M$

Interactions between the constituent atoms of the molecular system generically constrain molecular trajectories $\mathbf{r}(t)\in\mathbb{R}^{3N}$ within the 3*N*-dimensional Cartesian coordinate space to a so-called intrinsic manifold $M\subset\mathbb{R}^{k}\subset\mathbb{R}^{3N}$ of effective dimensionality *k* ≪ 3*N*^{20–26} [Fig. 1(c)]. We learn this manifold from the simulation trajectory training data using the diffusion maps (dMaps) nonlinear manifold learning technique.^{20,25,27–32} Applying dMaps identifies both the dimensionality *k* of the latent manifold and the CVs {*ψ*_{1}, *ψ*_{2}, …, *ψ*_{k}} parameterizing it as nonlinear functions of the atomic coordinates.^{20,28,33,34} The dMap CVs are the leading eigenvectors of a discrete random walk constructed over the 3*N*-dimensional snapshots comprising the simulation trajectory. The bandwidth *ε* of the Gaussian kernel used to construct the random walk can be tuned automatically based on the structure of the data,^{20,35} and the dimensionality *k* of the embedding is defined by a gap in the eigenvalue spectrum.^{20,28–30,36}

Assuming that the distance metric used to measure pairwise similarities between molecular configurations is a good proxy for the kinetic proximity of the configurations over short time scales, these CVs furnish a dynamically meaningful low-dimensional embedding into the slowest relaxing modes of a diffusion process over the data. In particular, Euclidean distances in the embedding correspond to diffusion distances in configurational space that measure the kinetic proximity of any pair of configurations under the action of the random walk.^{29,32} This dynamic interpretability of dMap CVs makes them an excellent choice for parameterization of the intrinsic manifold $M$, but other nonlinear dimensionality reduction or manifold learning techniques such as Isomap,^{37–40} locally linear embeddings (LLE),^{41,42} and local tangent space alignment^{40} may also be employed.
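The core of the dMaps construction described above can be sketched in a few lines: evaluate a Gaussian kernel over all pairs of configurations, normalize it into a row-stochastic random walk, and take the leading non-trivial eigenvectors as the CVs. This is a minimal sketch under simplifying assumptions (it omits the density normalization and automated bandwidth tuning used in practice; the function name and defaults are ours):

```python
import numpy as np

def diffusion_maps(X, eps, k=2, metric=None):
    """Minimal diffusion-map sketch: Gaussian kernel over all pairs,
    row-stochastic (random walk) normalization, eigendecomposition."""
    if metric is None:
        metric = lambda a, b: np.linalg.norm(a - b)
    n = len(X)
    D = np.array([[metric(X[i], X[j]) for j in range(n)] for i in range(n)])
    A = np.exp(-D**2 / (2.0 * eps))           # Gaussian kernel matrix
    M = A / A.sum(axis=1, keepdims=True)      # row-stochastic random walk
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)           # sort eigenvalues descending
    evals, evecs = evals.real[order], evecs.real[:, order]
    # discard the trivial constant eigenvector with eigenvalue 1;
    # the next k eigenvectors are the CVs {psi_1, ..., psi_k}
    return evals[1:k + 1], evecs[:, 1:k + 1]
```

Passing a symmetrized RMSD as `metric` would yield the spatially symmetrized variant used in this work.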

The only input to the diffusion map algorithm is the definition of a distance metric *k*(**r**_{i}, **r**_{j}) with which to measure the dissimilarity of pairs of configurations $(\mathbf{r}_i, \mathbf{r}_j)$ in the trajectory. In order to guarantee the existence of the diffeomorphism, it is critical that this metric be appropriately symmetrized to eliminate any *spatial symmetries* that cannot be distinguished by the choice of experimentally measurable observable *v*(*t*). Our choice of observable *v*(*t*) = *h*2*t*(*t*) is invariant under translation, rotation, mirror inversion, and—for chemically symmetric molecules (e.g., simple polymer chains)—head–tail inversion of the molecular configuration $\mathbf{r}\in\mathbb{R}^{3N}$. Any representation of the system based on *v*(*t*) sacrifices the ability to distinguish changes in the system state under any of these transformations. It is possible to recover some or all of these symmetries under different choices for *v*(*t*) or by multiplexed simultaneous measurements **v**(*t*) = {*v*_{1}(*t*), *v*_{2}(*t*), …}. As is standard practice in measuring molecular similarity, we select *k*(**r**_{i}, **r**_{j}) to be the rotationally and translationally aligned root mean squared deviation (RMSD), which naturally mods out the rototranslational symmetry using the Kabsch algorithm,^{43} and we eliminate the discrete symmetries by minimizing under mirror inversion,

$$k(\mathbf{r}_i, \mathbf{r}_j) = \min\left[\mathrm{RMSD}(\mathbf{r}_i, \mathbf{r}_j),\ \mathrm{RMSD}(\mathbf{r}_i, \mathbb{M}\mathbf{r}_j)\right],$$

or both mirror and head–tail inversion for chemically symmetric molecules,^{18}

$$k(\mathbf{r}_i, \mathbf{r}_j) = \min\left[\mathrm{RMSD}(\mathbf{r}_i, \mathbf{r}_j),\ \mathrm{RMSD}(\mathbf{r}_i, \mathbb{M}\mathbf{r}_j),\ \mathrm{RMSD}(\mathbf{r}_i, \mathbb{H}\mathbf{r}_j),\ \mathrm{RMSD}(\mathbf{r}_i, \mathbb{M}\mathbb{H}\mathbf{r}_j)\right],$$

where $\mathbb{M}$ and $\mathbb{H}$ denote the mirror inversion and head–tail inversion operations, respectively.

Failure to eliminate all such spatial symmetries induced by the choice of observable *v*(*t*) violates the conditions of Takens’s theorem and can cause the manifolds $M$ and $M'$ to no longer be related by a well-defined and learnable diffeomorphism.
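As a concrete sketch of this symmetrized metric, the snippet below minimizes the Kabsch-aligned RMSD over mirror inversion and, optionally, head–tail inversion. The implementation details (mirror inversion as reflection of one Cartesian axis, head–tail inversion as reversal of the atom ordering, and the function names) are our own illustrative choices:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two N x 3 coordinate arrays after removing
    translation and optimally rotating Q onto P (Kabsch algorithm)."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    sign = np.sign(np.linalg.det(U @ Vt))     # restrict to proper rotations
    R = U @ np.diag([1.0, 1.0, sign]) @ Vt
    return np.sqrt(np.mean(np.sum((P @ R - Q)**2, axis=1)))

def symmetrized_rmsd(P, Q, head_tail=False):
    """RMSD minimized over mirror inversion (here: reflection of the
    x axis) and, optionally, head-tail reversal of the atom ordering."""
    mirror = Q * np.array([-1.0, 1.0, 1.0])
    candidates = [Q, mirror]
    if head_tail:
        candidates += [Q[::-1], mirror[::-1]]
    return min(kabsch_rmsd(P, c) for c in candidates)
```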

The dMap CVs $\{\psi_i\}_{i=1}^{k}$ define a data-driven parameterization of the *k*-dimensional manifold $M$. Projection of the full simulation trajectory into these CVs defines an empirical probability distribution *P*(*ψ*_{1}, *ψ*_{2}, …, *ψ*_{k}). An estimate of the smFES mapping out the free energy *F* over $M$ is obtained via the statistical mechanical relationship *βF*(*ψ*_{1}, *ψ*_{2}, …, *ψ*_{k}) = −ln *P*(*ψ*_{1}, *ψ*_{2}, …, *ψ*_{k}) + *C*, where *C* is an arbitrary additive constant and the free energy is de-dimensionalized by the reciprocal temperature *β* = 1/*k*_{B}*T*. New molecular configurations **r**_{new} not contained within the data used to construct $M$ may be projected onto the manifold using the Nyström extension.^{44–46}
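The histogram estimate of the smFES is direct to implement. The sketch below assumes a *k* = 2 embedding; the function name and binning choices are ours:

```python
import numpy as np

def free_energy_surface(psi1, psi2, bins=50):
    """Estimate the dimensionless free energy beta*F = -ln P + C over a
    2D CV embedding from a histogram of the projected trajectory."""
    H, xedges, yedges = np.histogram2d(psi1, psi2, bins=bins, density=True)
    with np.errstate(divide="ignore"):
        betaF = -np.log(H)                 # empty bins map to +inf
    # fix the arbitrary constant C by setting the global minimum to zero
    betaF -= betaF[np.isfinite(betaF)].min()
    return betaF, xedges, yedges
```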

#### 3. Learning Takens’s manifold $M'$

Section II A 2 detailed a procedure to estimate the *k*-dimensional intrinsic manifold $M$ and the smFES it supports by applying manifold learning to a simulation trajectory of the atomistic molecular configuration $\mathbf{r}(t)\in\mathbb{R}^{3N}$. Takens’s delay embedding theorem^{7–11,13–16} provides a means to reconstruct an image of the intrinsic manifold $M'$ [Fig. 1(d)] by applying similar operations to delay embeddings of a time series in a single coarse-grained observable $v(t)\in\mathbb{R}$ [Fig. 1(b)]. Takens’s theorem asserts that (i) a [*d* ≥ (2*k* + 1)]-dimensional delay embedding $\mathbf{y}(t)=[v(t), v(t-\tau), v(t-2\tau), \ldots, v(t-(d-1)\tau)]\in\mathbb{R}^{d}$ in a generic observable *v*(*t*) constructed at a delay time *τ* uniquely specifies the instantaneous state of the system, (ii) the dynamical evolution of **y**(*t*) is *C*^{1}-equivalent (i.e., identical under a smooth and continuous mapping) to that of **r**(*t*), and (iii) the evolution of **y**(*t*) lies on a manifold $M'$ that is related to $M$ by a diffeomorphism (i.e., a smooth, invertible, and bijective transformation) $\Theta: M' \to M$ with inverse $\Theta^{-1}: M \to M'$.^{6–11,13–15,18} The diffeomorphism is *a priori* unknown but is guaranteed only to stretch and squash the manifold and not tear it or stitch it together in new ways. As such, $M'$ is a topologically identical image of $M$ that preserves its continuity and connectivity and possesses the same dimensionality.^{7–10,14,15}

Takens’s theorem holds for any generic observable of the system *v* that does not contain any spurious symmetries that are not present in the system itself (i.e., the system is invariant in the observable under particular symmetries), for any delay time *τ* that does not introduce temporal aliasing (i.e., is a multiple of a period of the dynamical motion), and for any delay embedding dimensionality *d* greater than twice the intrinsic dimension of the system *k*.^{7,8,15,17,18,47–49} In practice, better results are obtained for observables *v* that are strong functions of all system degrees of freedom and which respond sensitively to the important dynamical motions of the system,^{18} for delay times *τ* estimated as the first minimum of the autocorrelation or mutual information of *v*(*t*),^{50} and for delay dimensionalities *d* estimated using the E_{1} method of Cao.^{51–53} Takens’s theorem still holds when applied to observables *v* that do contain symmetries not present in the system, but the system may only be reconstructed up to those symmetric operations (*vide supra*), and also to observables of subsystems and under stochastic or deterministic forcing.^{12,13} The latter two generalizations are relevant to the present work because we adopt the head-to-tail molecular distance as our observable *v*(*t*) = *h*2*t*(*t*), which is an observation of the solute subsystem within the full solute–solvent system and is subject to dynamical coupling with solvent and deterministic or stochastic forcing by any attached thermostats, barostats, or other external constraints. Finally, Takens’s theorem may also be applied to multiplexed measurements of several simultaneous observables **v**(*t*) = {*v*_{1}(*t*), *v*_{2}(*t*), …} such as multi-channel smFRET measuring multiple intramolecular distances.^{4,11}
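The delay embedding itself is a simple windowing operation over *v*(*t*). The sketch below constructs the delay vectors and estimates *τ* as the first minimum of the autocorrelation of *v*(*t*), a common stand-in for the mutual-information criterion cited above; the function names are ours:

```python
import numpy as np

def first_acf_minimum(v, max_lag=None):
    """Estimate the delay time tau as the first local minimum of the
    autocorrelation function of v(t)."""
    v = np.asarray(v, dtype=float) - np.mean(v)
    n = len(v)
    max_lag = max_lag or n // 2
    acf = np.array([np.dot(v[:n - l], v[l:]) / (n - l) for l in range(max_lag)])
    acf /= acf[0]
    for l in range(1, max_lag - 1):
        if acf[l] < acf[l - 1] and acf[l] < acf[l + 1]:
            return l
    return max_lag - 1

def delay_embed(v, d, tau):
    """Construct Takens delay vectors y(t) = [v(t), v(t-tau), ..., v(t-(d-1)tau)].
    Row i corresponds to time t = (d-1)*tau + i; returns shape (n, d)."""
    v = np.asarray(v)
    n = len(v) - (d - 1) * tau
    return np.stack([v[(d - 1 - j) * tau:(d - 1 - j) * tau + n]
                     for j in range(d)], axis=1)
```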

The manifold $M'\subset\mathbb{R}^{k}\subset\mathbb{R}^{d}$ is estimated by applying dMaps to the delay embedding **y**(*t*) = [*v*(*t*), *v*(*t* − *τ*), *v*(*t* − 2*τ*), …, *v*(*t* − (*d* − 1)*τ*)] parameterized by the *k* CVs {$\psi_1'$, $\psi_2'$, …, $\psi_k'$}. We apply dMaps under a distance metric *k*′(**y**_{i}, **y**_{j}) measuring the dissimilarity of pairs of delay vectors that we select to be a simple Euclidean distance metric. This metric requires modification due to the introduction of a spurious symmetry into the system by the temporal ordering of *v*(*t*) within the **y** vectors induced by Takens’s delay embedding. This spurious symmetry can be understood by considering a hypothetical microstate transition of the system **r**_{A}(*t* − *τ*) → **r**_{B}(*t*) → **r**_{C}(*t* + *τ*) and its reverse **r**_{C}(*t* − *τ*) → **r**_{B}(*t*) → **r**_{A}(*t* + *τ*), with associated *d* = 3-dimensional delay vectors **y**_{fwd} = [*v*_{A}, *v*_{B}, *v*_{C}] and **y**_{bkwd} = [*v*_{C}, *v*_{B}, *v*_{A}]. Adopting a convention that defines a mapping between system microstates and delay vectors based on the central microstate, both of these delay vectors are associated with microstate **r**_{B}, but the delay vectors themselves are not identical. By observing the past and future of **r**_{B}, we can distinguish whether it was occupied as part of a forward or reverse transition, and the “forward” **r**_{B} and “backward” **r**_{B} appear in Takens’s delay embedding as distinct identifiable states. For equilibrium systems obeying detailed balance, the forward and backward transitions between any pair of microstates have equal fluxes, and we should not be able to identify whether occupancy of a particular microstate **r**_{B} resulted from the forward or reverse transition in any such pair. To prevent the occurrence of this symmetry breaking in $M'$ that remains unbroken in $M$, we must symmetrize *k*′(**y**_{i}, **y**_{j}) to eliminate this *temporal symmetry*. 
This assures that Takens’s theorem is not violated and the manifolds $M$ and $M'$ remain diffeomorphic. We have previously demonstrated the importance of eliminating this symmetry in applications of Takens’s theorem to equilibrium molecular systems.^{17,18} As such, we minimize the distance metric under time reversal, which is equivalent to minimizing under inversion of one of the delay vectors,

$$k'(\mathbf{y}_i, \mathbf{y}_j) = \min\left[\lVert\mathbf{y}_i - \mathbf{y}_j\rVert_2,\ \lVert\mathbf{y}_i - \overline{\mathbf{y}}_j\rVert_2\right],$$

where $\overline{\mathbf{y}}_j$ denotes the delay vector $\mathbf{y}_j$ with the order of its elements reversed.
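Operationally, this temporal symmetrization amounts to comparing each pair of delay vectors both as-is and with one member reversed; a minimal sketch:

```python
import numpy as np

def time_symmetrized_distance(yi, yj):
    """Euclidean distance between two delay vectors minimized under
    time reversal (element-order reversal) of one of the pair."""
    yi, yj = np.asarray(yi), np.asarray(yj)
    return min(np.linalg.norm(yi - yj), np.linalg.norm(yi - yj[::-1]))
```

Under this metric, a delay vector and its time-reversed counterpart are treated as the same state, as required by detailed balance.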

The dMap CVs $\{\psi_i'\}_{i=1}^{k}$ define a data-driven parameterization of the *k*-dimensional manifold $M'$ over which we construct the smFES *βF*′($\psi_1'$, $\psi_2'$, …, $\psi_k'$) = −ln *P*′($\psi_1'$, $\psi_2'$, …, $\psi_k'$) + *C*′ by compiling the empirical probability distribution *P*′($\psi_1'$, $\psi_2'$, …, $\psi_k'$) of the delay vectors projected onto the manifold. New delay vectors **y**_{new} not contained within the data used in the construction of $M'$ may be projected onto the manifold using the Nyström extension.^{44–46} We exploit this out-of-sample extension when applying the trained STAR model to new data that were not present during training.

#### 4. Learning the diffeomorphism from $M'$ to $M$

Takens’s theorem guarantees that the smooth manifolds $M$ and $M'$ are related by a diffeomorphism $\Theta: M' \to M$.^{6–11,13–15,18} The theoretical guarantees on the existence of this mapping and its low-dimensional nature are a valuable advantage of the multi-step STAR pathway that would be lost by formulating a direct reconstruction of the molecular configurations from the univariate time series. We adopt a convention associating each delay vector **y**(*t*) = [*v*(*t*), *v*(*t* − *τ*), *v*(*t* − 2*τ*), …, *v*(*t* − (*d* − 1)*τ*)] with the configurational microstate corresponding to its central element **r**(*t* − ((*d* − 1)/2)*τ*). We require that *d* be odd in order to make this association unambiguous. This mapping means that the configurations in the leading and trailing (*d* − 1)*τ*/2 periods of the molecular simulation trajectory are not associated with any delay vector and are eliminated from all analyses.
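The bookkeeping implied by this convention can be sketched as follows, assuming delay vector *i* spans frames *i* through *i* + (*d* − 1)*τ* of the trajectory (an illustrative indexing convention of our own):

```python
import numpy as np

def central_frame_indices(n_frames, d, tau):
    """Indices of the trajectory frames associated with each delay
    vector via its central element; requires odd d so that the central
    element is unambiguous."""
    assert d % 2 == 1, "d must be odd for an unambiguous central element"
    n_vec = n_frames - (d - 1) * tau           # number of delay vectors
    half = (d - 1) // 2 * tau                  # frames dropped at each end
    # delay vector i spans frames [i, i + (d-1)*tau]; its center is i + half
    return np.arange(n_vec) + half
```

The first and last `half` frames of the trajectory are left unassociated with any delay vector, matching the leading and trailing periods eliminated above.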

Having defined the associations between the projections of the delay vectors **y**(*t*) on $M'$ represented as $\{\psi_i'(t)\}_{i=1}^{k}$ and the projections of the molecular configurations **r**(*t*) on $M$ represented as $\{\psi_i(t)\}_{i=1}^{k}$, we define a supervised learning problem between pairs of data points $(\{\psi_i'(t)\}_{i=1}^{k}, \{\psi_i(t)\}_{i=1}^{k})$ to perform data-driven estimation of the *k*-dimensional to *k*-dimensional diffeomorphism $\Theta: M' \to M$ [Fig. 1(d) → Fig. 1(c)]. There are many ways to learn and approximate this function, including k-nearest neighbors,^{54} kernel methods,^{55} or local Jacobians.^{17,18,56} In this work, we employ simple fully connected feedforward artificial neural networks (ANNs) as an easy-to-train and flexible function approximator.^{57} We typically find that networks comprising four to eight hidden layers each containing ∼10 × *k* neurons are adequate to furnish high accuracy mappings.
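As a stand-in for the deeper networks used in this work, a one-hidden-layer feedforward network trained by full-batch gradient descent already conveys the structure of this supervised regression problem. This is a minimal sketch with our own architecture and hyperparameter choices; in practice a deep-learning library would be used:

```python
import numpy as np

def train_mlp(X, Y, hidden=32, lr=0.01, epochs=800, seed=0):
    """One-hidden-layer tanh network trained by full-batch gradient
    descent on mean squared error; returns a prediction function."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, Y.shape[1])); b2 = np.zeros(Y.shape[1])
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)            # hidden activations
        P = H @ W2 + b2                     # linear output layer
        G = 2.0 * (P - Y) / len(X)          # gradient of MSE w.r.t. P
        gW2, gb2 = H.T @ G, G.sum(axis=0)
        GH = (G @ W2.T) * (1.0 - H**2)      # backpropagate through tanh
        gW1, gb1 = X.T @ GH, GH.sum(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return lambda Xn: np.tanh(Xn @ W1 + b1) @ W2 + b2
```

Here `X` would hold the delay-vector projections $\{\psi_i'(t)\}$ and `Y` the configuration projections $\{\psi_i(t)\}$.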

#### 5. Learning the reconstruction $\hat{\mathbf{r}}(t)$

The final step is to learn approximate reconstructions of the molecular configurations $\mathbf{r}(t)\in\mathbb{R}^{3N}$ from their projections $\{\psi_i(t)\}_{i=1}^{k}\in\mathbb{R}^{k}$ onto the intrinsic manifold $M$. If the effective dimensionality of the simulation trajectory is less than or equal to *k* and the dMap CVs have been properly learned from the simulation trajectory, then the *k*-dimensional subspace spanned by $\{\psi_i(t)\}_{i=1}^{k}$ is expected to preserve the important configurational variance in **r**(*t*).^{20,28} As such, the location on the intrinsic manifold $M$ should contain sufficient information to approximately reconstruct the configurational state of the system $\hat{\mathbf{r}}(t)\in\mathbb{R}^{3N}$ up to any symmetries that have been modded out in its construction (*vide supra*). In this sense, the process $\mathbf{r}(t) \to \{\psi_i(t)\}_{i=1}^{k} \to \hat{\mathbf{r}}(t)$ may be viewed as the concatenation of a low-dimensional encoder furnished by diffusion maps and a decoder to be learned from the data. Again, we have the one-to-one mapping of data pairs $(\{\psi_i(t)\}_{i=1}^{k}, \mathbf{r}(t))$ and can formulate and solve this as a supervised learning problem.

We split this learning problem into two steps, and instead of learning to predict the Cartesian coordinates of the atoms **r**(*t*) directly from $\{\psi_i(t)\}_{i=1}^{k}$, we first learn the ordered vector of pairwise distances between all atoms $\mathbf{d}(t)\in\mathbb{R}^{N(N-1)/2}$. Rigid graph theory asserts that specification of all $\binom{N}{2}$ pairwise distances defines the absolute coordinates of the *N* points up to translation, rotation, and mirror inversion.^{58,59} In practice, calculation of the coordinates from the pairwise distances is easily accomplished using classical multidimensional scaling (cMDS) to form the Gram matrix and compute its eigendecomposition.^{58,60} In the present application, the translational, rotational, and mirror inversions (and head–tail inversion, where applicable) have all been modded out of the construction of $M$, and so there is no (additional) information loss in formulating the supervised learning problem as first approximating the functional mapping $\{\psi_i(t)\}_{i=1}^{k} \to \mathbf{d}(t)$ [Fig. 1(c) → Fig. 1(e)] and then making the deterministic transformation $\mathbf{d}(t) \to \hat{\mathbf{r}}$ using cMDS [Fig. 1(e) → Fig. 1(f)]. Unlike the one-step formulation of the reconstruction $\{\psi_i(t)\}_{i=1}^{k} \to \hat{\mathbf{r}}$, the two-step formulation eliminates the need to perform any alignment of the molecular configurations with respect to rotation, translation, and mirror inversion since these symmetries are naturally modded out within the pairwise distance matrix. The omission of these alignment operations, either mutually or to some fixed reference structure, carries advantages in avoiding the introduction of noise and approximations into the fitting problem.
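The cMDS step is a deterministic linear-algebra operation: double-center the matrix of squared distances to form the Gram matrix and take its top three eigenpairs. The sketch below assumes the condensed vector **d**(*t*) has already been reshaped into the square *N* × *N* distance matrix:

```python
import numpy as np

def cmds(D, k=3):
    """Classical multidimensional scaling: recover N x k coordinates
    (up to rototranslation and mirror inversion) from an N x N matrix
    of pairwise Euclidean distances via the double-centered Gram matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D**2) @ J                  # Gram matrix of centered coords
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:k]          # top-k eigenpairs
    L = np.sqrt(np.clip(evals[idx], 0.0, None))
    return evecs[:, idx] * L
```

For a noise-free Euclidean distance matrix, the recovered coordinates reproduce the input distances exactly, up to translation, rotation, and mirror inversion.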

We use simple ANNs to learn $\{\psi_i(t)\}_{i=1}^{k} \to \mathbf{d}(t)$ from the training data. Typically, we find that networks comprising four to eight hidden layers each containing up to ∼1000 × *k* neurons are adequate to furnish high accuracy mappings. The molecular configurations $\hat{\mathbf{r}}(t)\in\mathbb{R}^{3N}$ are reconstructed only up to translational, rotational, and mirror inversion symmetries (and head–tail inversion for chemically symmetric molecules), and so comparisons of the reconstruction accuracy between $(\hat{\mathbf{r}}, \mathbf{r})$ pairs must be performed by mutual alignment under these transformations. The alignment problem corresponds to an orthogonal Procrustes problem^{61} that we efficiently solve using the Kabsch algorithm.^{43}

#### 6. Deploying the trained STAR model

The STAR pipeline trained for a particular molecular system may then be used to predict molecular reconstructions $\hat{\mathbf{r}}(t)\in\mathbb{R}^{3N}$ from new time series data $v(t)\in\mathbb{R}$ through the four-step pathway *b* → *d* → *c* → *e* → *f* [Figs. 1(b)–1(f)]: Takens’s delay vectors **y** are constructed from the time series *v*(*t*), projected onto Takens’s manifold $M'$ using the Nyström extension,^{44–46} mapped onto the atomistic manifold $M$ using the trained ANN, converted into the atomistic pairwise distances matrix **d** using another trained ANN, and then finally transformed into the reconstructed molecular configurations $\hat{\mathbf{r}}$ using cMDS. Provided that the training data were sufficiently rich to span the thermally relevant metastable states and transitions in the system and the supervised learning problems are not overfitted, the trained STAR model should be able to generalize to new time series data from the system that were not included in model training. Importantly, deployment of the trained model does not require any additional molecular simulation data [Fig. 1(a)].

We validate the trained model by extracting time traces of the molecular head-to-tail distance *v*(*t*) = *h*2*t*(*t*) from new MD simulation trajectories that were not present in the training data and testing the capability of the trained STAR model to reconstruct the molecular configurations **r**(*t*) in the trajectory. A well-trained model should be able to predict molecular configurations from novel test data with similar accuracies to that in the training data. In applications to real experimental smFRET data, the true molecular configurations would not be available to make this comparison and the model predictions would have to be validated by indirect means. In all cases, we seek only to reconstruct the molecular configuration of the solute and do not attempt to predict the coordinates of the water solvent or any counter ions. In principle, this information is—again subject to the relevant symmetries—present within Takens’s delay embedding but is much more challenging to recover due to the permutational fungibility of the water molecules and the relatively weaker influence of solvent coordinates on our choice of solute-centric observable.

### B. Molecular dynamics simulations

#### 1. C_{24}H_{50}

MD simulations of C_{24}H_{50} were conducted using the Gromacs 4.6 simulation suite^{62} with the TraPPE potential^{63} that models each CH_{3} and CH_{2} group as a united atom and the all-atom SPC model of water.^{64} Chain topologies were constructed using the PRODRG2 server.^{2} Lennard-Jones interactions were smoothly set to zero at a cutoff of 1.4 nm, and Lorentz–Berthelot combining rules were used to determine dispersion interactions between unlike atoms.^{65} Electrostatic interactions were treated using particle mesh Ewald^{66} with a real-space cutoff of 1.4 nm and a 0.12 nm reciprocal-space grid spacing that were optimized during runtime. Simulations were conducted in a 5 × 5 × 5 nm^{3} cubic box with periodic boundary conditions that were sufficiently large to prevent self-interactions of the chain through the periodic walls even in its fully extended configuration. Initial system configurations were generated by placing an initially elongated chain into an empty box and solvating with 4117 water molecules to a density of 1.0 g/cm^{3}. High energy overlaps were eliminated by steepest descent energy minimization to a force threshold of 2000 kJ mol^{−1} nm^{−1}. The initial particle velocities were sampled from a Maxwell–Boltzmann distribution at 298 K. Simulations were performed in the NPT ensemble at 298 K and 1 bar using a Nosé–Hoover thermostat^{67} and an isotropic Parrinello–Rahman barostat.^{68} Equations of motion were integrated using the leap-frog algorithm^{69} with a 2 fs time step. LINCS constraints were used to fix bond lengths to their equilibrium distances for computational efficiency and as required by the TraPPE and SPC potentials.^{32} Systems were equilibrated for 1 ns before conducting a 100 ns production run during which system configurations were saved every 0.2 ps. 
Head-to-tail distances were computed for each frame of the 500 000 frame simulation trajectory to furnish the univariate time series *v*(*t*) = *h*2*t*(*t*) in addition to the atomistic simulation trajectory **r**(*t*). The first 20 ns (100 000 frames) of these trajectories provided the {**r**(*t*), *v*(*t*) = *h*2*t*(*t*)} training data used to train the STAR model, and the remaining 80 ns (400 000 frames) were reserved for testing. Input files for these simulations are provided in the supplementary material.

#### 2. Chignolin

MD simulations of the 10-residue (166 atom) engineered mini-protein chignolin (GYDPETGTWG, PDB ID: 1UAO)^{70} were performed by D. E. Shaw Research using the Desmond simulation suite^{71} on the Anton supercomputer^{72} and reported in Ref. 73. The peptide was modeled using the CHARMM22 force field^{74} and solvated by ∼1900 water molecules^{75} in a 4 × 4 × 4 nm^{3} cubic box with periodic boundary conditions. Two Na^{+} ions were added to maintain charge neutrality. Lennard-Jones interactions were treated with a 0.95 nm cutoff and electrostatic interactions treated by Gaussian Split Ewald^{76} employing a 0.95 nm real-space cutoff and a 32 × 32 × 32 cubic grid. Equations of motion were integrated with a 2.5 fs time step. Equilibration runs were conducted in the NPT ensemble at 340 K and maintained by a Nosé–Hoover thermostat.^{67,77} The 106 *µ*s production run was conducted in the NVT ensemble employing a Nosé–Hoover thermostat^{67,77} and system configurations harvested every 200 ps. The first 5 *µ*s (25 000 frames) of these trajectories were used to train the STAR model, and the next 15 *µ*s (75 000 frames) employed for testing.

## III. RESULTS AND DISCUSSION

We demonstrate and validate STAR in applications to molecular dynamics simulations of a C_{24}H_{50} polyethylene chain and the *β*-hairpin engineered mini-protein chignolin. Simulation data constituting the training and testing data are collected, as described in Sec. II B. The STAR pipeline was trained and deployed as described in Sec. II A with details specific to each system provided below. In each case, we achieve RMSD reconstruction accuracies on the hold-out test data of better than 0.2 nm.

### A. C_{24}H_{50}

#### 1. STAR training

The training portion of the C_{24}H_{50} simulation trajectory comprised 100 000 frames saved at 0.2 ps intervals recording the Cartesian coordinates $\mathbf{r}(t) \in \mathbb{R}^{72}$ of the *N* = 24 united atoms of the polymer chain [cf. Fig. 1(a)]. In order to fit the dMaps computation into local memory, the simulation trajectory was subsampled with a stride of 20 and the excluded frames projected into the manifold *post hoc* using the Nyström extension.^{44–46} The stride length was empirically tuned to balance computational efficiency against comprehensive sampling of the configurational phase space to achieve robust embeddings and reconstruction accuracies. Application of spatially symmetrized dMaps to these data employing a kernel bandwidth of *ε* = exp(−3) nm exposed a *k* = 2-dimensional intrinsic manifold $\mathcal{M} \subset \mathbb{R}^{2}$ spanned by nonlinear collective variables of the united atom coordinates {*ψ*_{1}, *ψ*_{2}} and supporting the smFES *βF*(*ψ*_{1}, *ψ*_{2}) [cf. Fig. 1(c)]. The eigenvalue spectrum exhibiting a spectral gap separating the second non-trivial eigenvalue from the continuum of higher-order eigenvalues is presented in Fig. S1a of the supplementary material.
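The generic diffusion map construction and the Nyström out-of-sample extension can be sketched in NumPy as follows. This is a minimal dense-matrix illustration of the textbook algorithm, not the authors' spatially symmetrized implementation, and all function and variable names are our own.

```python
import numpy as np

def diffusion_maps(X, eps, k=2):
    """Minimal diffusion map: Gaussian kernel on pairwise distances,
    row-normalized Markov matrix, leading non-trivial eigenvectors.
    X is (n_frames, n_features); eps is the kernel bandwidth."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared distances
    K = np.exp(-D2 / (2.0 * eps ** 2))                          # Gaussian kernel
    P = K / K.sum(axis=1, keepdims=True)                        # Markov transition matrix
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)
    evals, evecs = evals.real[order], evecs.real[:, order]
    # discard the trivial constant eigenvector with eigenvalue 1
    return evals[1 : k + 1], evecs[:, 1 : k + 1]

def nystrom_extend(X_new, X, evals, evecs, eps):
    """Project out-of-sample points into the learned manifold:
    psi_j(x_new) = (1/lambda_j) * sum_i P(x_new, x_i) psi_j(x_i)."""
    D2 = ((X_new[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-D2 / (2.0 * eps ** 2))
    P = K / K.sum(axis=1, keepdims=True)
    return P @ evecs / evals[None, :]
```

In the pipeline above, `diffusion_maps` would be applied to the strided subset of frames and `nystrom_extend` to the frames excluded by the stride.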

A scalar time series in the head-to-tail distance between the two terminal united atoms $v(t) = h_{2t}(t) \in \mathbb{R}^{1}$ was computed from the MD training trajectory as a synthetic and idealized smFRET time series over the 100 000 frames separated by intervals of 0.2 ps [cf. Fig. 1(b)]. Takens’s delay vectors **y**(*t*) = [*h*_{2t}(*t*), *h*_{2t}(*t* − *τ*), *h*_{2t}(*t* − 2*τ*), …, *h*_{2t}(*t* − (*d* − 1)*τ*)] were constructed at a delay time of *τ* = 30 ps (150 time steps) computed as the first minimum in the mutual information of *h*_{2t}(*t*)^{50} and delay dimensionality *d* = 5 computed using the E_{1} method of Cao.^{51–53} This resulted in the construction of 99 400 delay vectors, each associated with a frame in the MD trajectory containing the molecular structure from which the central element in the delay vector was computed. The initial and terminal (*d* − 1)*τ*/2 = 60 ps (300 frames) of the MD trajectory that were unassociated with any delay vector were dropped from further analyses. Application of temporally symmetrized dMaps to the delay embedding trajectory, subsampled with a stride of 20 in order to fit into local memory and employing a kernel bandwidth of *ε* = 1 nm, defined a *k* = 2-dimensional manifold $\mathcal{M}' \subset \mathbb{R}^{2}$ spanned by {$\psi_1'$, $\psi_2'$}. The eigenvalue spectrum exhibiting a spectral gap after the second non-trivial eigenvalue is presented in Fig. S1b. The frames strided over in the application of dMaps were projected into $\mathcal{M}'$ using the Nyström extension and used to estimate the smFES *βF*′($\psi_1'$, $\psi_2'$) [cf. Fig. 1(d)].
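The delay vector construction, together with the histogram-estimated mutual information whose first minimum selects *τ*, can be sketched as follows. This is a generic NumPy illustration under our own naming conventions; the ascending-lag ordering of each delay vector is immaterial to the embedding, and Cao's E_{1} estimate of *d* is omitted for brevity.

```python
import numpy as np

def delay_embed(v, d, tau):
    """Takens delay vectors from scalar series v: each row stacks d samples
    spaced tau steps apart; returns a (len(v) - (d-1)*tau, d) array."""
    T = len(v) - (d - 1) * tau
    return np.stack([v[i * tau : i * tau + T] for i in range(d)], axis=1)

def mutual_information(v, lag, bins=32):
    """Histogram estimate of the mutual information between v(t) and v(t - lag)."""
    pxy, _, _ = np.histogram2d(v[:-lag], v[lag:], bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def first_minimum_lag(v, max_lag):
    """Smallest lag at which the mutual information has a local minimum."""
    mi = [mutual_information(v, lag) for lag in range(1, max_lag + 1)]
    for i in range(1, len(mi) - 1):
        if mi[i] < mi[i - 1] and mi[i] < mi[i + 1]:
            return i + 1
    return max_lag
```

For the C_{24}H_{50} data, `first_minimum_lag` applied to *h*_{2t}(*t*) would return the 150-step lag quoted above, and `delay_embed(h2t, d=5, tau=150)` would produce the 99 400 delay vectors.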

The diffeomorphism $\Theta: \mathcal{M}' \subset \mathbb{R}^{2} \to \mathcal{M} \subset \mathbb{R}^{2}$ linking the two manifolds was learned using a simple 2-25-25-25-25-2 fully connected feedforward ANN comprising four hidden layers of 25 neurons [cf. Fig. 1(d) → Fig. 1(c)]. The ANN employed tanh activations and was trained using Adam^{78} with a batch size of 500 and a learning rate of 1 × 10^{−4}. Training was terminated after 250 epochs upon plateauing of the validation error.
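A from-scratch sketch of such a small tanh feedforward network is given below. The layer widths follow the 2-25-25-25-25-2 architecture described in the text, but everything else is our own illustrative choice: we train with plain full-batch gradient descent on a toy rotation target rather than the mini-batch Adam optimizer and manifold data used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# 2-25-25-25-25-2: four tanh hidden layers of 25 neurons, linear output
sizes = [2, 25, 25, 25, 25, 2]
W = [rng.standard_normal((a, b)) / np.sqrt(a) for a, b in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Return the list of layer activations (input first, output last)."""
    acts = [x]
    for i in range(len(W) - 1):
        acts.append(np.tanh(acts[-1] @ W[i] + b[i]))
    acts.append(acts[-1] @ W[-1] + b[-1])
    return acts

def gradients(x, y):
    """Backpropagate the mean-squared-error loss through the network."""
    acts = forward(x)
    delta = 2.0 * (acts[-1] - y) / len(x)
    gW, gb = [None] * len(W), [None] * len(b)
    for i in reversed(range(len(W))):
        gW[i], gb[i] = acts[i].T @ delta, delta.sum(axis=0)
        if i > 0:
            delta = (delta @ W[i].T) * (1.0 - acts[i] ** 2)  # tanh derivative
    return gW, gb

# toy target: a fixed rotation of the plane standing in for the diffeomorphism
X = rng.standard_normal((500, 2))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
Y = X @ R.T

loss0 = float(((forward(X)[-1] - Y) ** 2).mean())
for _ in range(500):
    gW, gb = gradients(X, Y)
    for i in range(len(W)):
        W[i] -= 0.02 * gW[i]
        b[i] -= 0.02 * gb[i]
loss1 = float(((forward(X)[-1] - Y) ** 2).mean())
```

In practice, any standard deep learning framework reproduces this with far less code; the point here is only that the mapping Θ is a small, easily trained regression problem once both manifolds are in hand.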

The function linking the projection of each frame in the MD trajectory into the intrinsic manifold {*ψ*_{1}(*t*), *ψ*_{2}(*t*)} to the *N*(*N* − 1)/2 = 276-dimensional united atom pairwise distance vectors $\mathbf{d}(t) \in \mathbb{R}^{276}$ corresponding to that configuration was learned using a 2-4-187-370-552-276 fully connected feedforward ANN employing tanh activations and trained using Adam^{78} with a batch size of 500 and a learning rate of 1 × 10^{−3} over 150 epochs [cf. Fig. 1(c) → Fig. 1(e)]. Molecular reconstructions of the united atom coordinates $\hat{\mathbf{r}}(t) \in \mathbb{R}^{72}$ were computed deterministically from each pairwise distance vector using cMDS [cf. Fig. 1(e) → Fig. 1(f)]. The reconstruction accuracy of the trained STAR pipeline [cf. Fig. 1(b) → Fig. 1(d) → Fig. 1(c) → Fig. 1(e) → Fig. 1(f)] was assessed by computing the RMSD of the predicted $\hat{\mathbf{r}}(t) \in \mathbb{R}^{72}$ and true $\mathbf{r}(t) \in \mathbb{R}^{72}$ configurations under translational, rotational, mirror, and head–tail alignment by the Kabsch algorithm.^{43}
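Classical MDS and the alignment step admit compact NumPy sketches. These are generic textbook implementations rather than the authors' code: `cmds` recovers coordinates from an exact Euclidean distance matrix up to translation, rotation, and mirror symmetry, and `aligned_rmsd` deliberately omits the Kabsch determinant sign correction so that mirror images also align, matching the equivalence class described in the text.

```python
import numpy as np

def cmds(D, k=3):
    """Classical multidimensional scaling: embed n points in R^k from an
    (n, n) matrix of pairwise Euclidean distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:k]             # top-k eigenpairs
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))

def aligned_rmsd(X, Y):
    """RMSD of X onto Y after optimal translation and rotation/reflection
    (Kabsch via SVD, without the determinant sign correction)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    R = U @ Vt                                    # optimal orthogonal transform
    return float(np.sqrt(((Xc @ R - Yc) ** 2).sum() / len(X)))
```

Because the network predicts all 276 pairwise distances, the cMDS reconstruction is deterministic; any residual mirror and head–tail ambiguity is then absorbed by the alignment before the RMSD is reported.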

#### 2. STAR deployment

We illustrate the application of the trained C_{24}H_{50} STAR model in Fig. 2. The trained STAR pipeline enables us to associate each value of *h*_{2t}(*t*) in the synthetic smFRET time trace with a molecular reconstruction and also a location and stability on the smFES supported by the low-dimensional intrinsic manifold.^{17,20} The mean RMSD reconstruction accuracy of the trained pipeline applied to the 20 ns training trajectory is RMSD_{train} = 8.4 × 10^{−2} nm. In an application to the remaining 80 ns test trajectory that was not part of the training ensemble, the mean reconstruction accuracy degrades only slightly to RMSD_{test} = 8.6 × 10^{−2} nm and is at worst 0.19 nm, indicating that the STAR model is well trained and generalizes well to novel data. Histograms illustrating the complete RMSD distribution over the test trajectory are presented in Fig. S2. We illustrate for five selected points A–E in the *h*_{2t}(*t*) time trace [Fig. 2(a)], their projection onto the smFES *βF*(*ψ*_{1}, *ψ*_{2}) [Fig. 2(b)], RMSD reconstruction accuracy as a function of position on the intrinsic manifold [Fig. 2(c)], and reconstructed $\hat{\mathbf{r}}(t) \in \mathbb{R}^{72}$ and true $\mathbf{r}(t) \in \mathbb{R}^{72}$ molecular structures [Fig. 2(d)]. Movie S1 of the supplementary material presents an animation of molecular reconstructions and smFES projections for all test data in the time series in Fig. 2(a).

Configuration A corresponds to elongated chain configurations with the preponderance of the backbone dihedrals in the *trans* state. Configuration C corresponds to a similarly elongated conformation but with a small number of gauche defects that lead to a small degree of curvature in the contour of the chain. Configuration C resides within 0.3 *k*_{B}*T* of the global free energy minimum at *βF* = 0, while configuration A lies slightly higher at *βF* = 2.7. Configurations B and D correspond to partially collapsed twisted structures ∼3*k*_{B}*T* less stable than the global free energy minimum that lie along the transition pathway between the global minimum at (*ψ*_{1} ∼ 1.2, *ψ*_{2} ∼ 2.0) containing elongated chains and the weak local minimum at (*ψ*_{1} ∼ 3.2, *ψ*_{2} ∼ −6.0) containing hydrophobically collapsed coiled chains. Configuration E is a metastable hydrophobically collapsed coil that lies slightly outside the local minimum at *βF* = 4.8. We note that configurations B (*h*_{2t} = 1.8 nm) and D (*h*_{2t} = 2.0 nm) possess similar values of the *h*_{2t} distance but correspond to very distinct molecular configurations residing at different locations on the smFES that are both accurately reconstructed by STAR. This illustrates the value of using Takens’s theorem to reconstruct molecular configurations that cannot be distinguished from the instantaneous value of the observable alone.

### B. Chignolin

#### 1. STAR training

The training portion of the chignolin trajectory comprised 25 000 frames saved at 200 ps intervals recording the Cartesian coordinates $\mathbf{r}(t) \in \mathbb{R}^{279}$ of the *N* = 93 heavy atoms of the protein. A *k* = 2-dimensional intrinsic manifold $\mathcal{M} \subset \mathbb{R}^{2}$ spanned by {*ψ*_{1}, *ψ*_{2}} was constructed by applying dMaps with a kernel bandwidth of *ε* = exp(−3) nm to a subsampling of this trajectory with a stride of 2. The eigenvalue spectrum exhibiting a spectral gap separating the second non-trivial eigenvalue from the continuum of higher-order eigenvalues is presented in Fig. S3a. Again, the stride length was tuned to balance computational cost and robust training of the STAR pipeline. The excluded frames were projected into the manifold using the Nyström extension. Takens’s delay embeddings were constructed from the synthetic smFRET time series recording the distance between the terminal heavy atoms at a delay time *τ* = 200 ps (1 time step)^{50} and delay dimensionality *d* = 11.^{51–53} This resulted in the construction of 24 990 delay vectors. We applied temporally symmetrized dMaps to the delay embedding trajectory $\mathbf{y}(t) \in \mathbb{R}^{11}$, subsampled with a stride of 2 in order to fit into local memory, with a kernel bandwidth of *ε* = 1 nm to define a *k* = 2-dimensional manifold $\mathcal{M}' \subset \mathbb{R}^{2}$. The eigenvalue spectrum exhibiting a spectral gap after the second non-trivial eigenvalue is presented in Fig. S3b. Again, the excluded frames were projected into the manifold using the Nyström extension. The diffeomorphism Θ mapping $\mathcal{M}'$ to $\mathcal{M}$ was approximated by a 2-25-25-25-25-2 ANN trained using Adam^{78} with a batch size of 500 and a learning rate of 1 × 10^{−4} over 250 epochs. The function mapping locations on $\mathcal{M}$ to the *N*(*N* − 1)/2 = 4278-dimensional heavy atom pairwise distance vectors $\mathbf{d}(t) \in \mathbb{R}^{4278}$ was learned and approximated by a 2-4-2855-5706-8556-4278 ANN using Adam^{78} with a batch size of 500 and a learning rate of 1 × 10^{−5} over 150 epochs.
Predictions of heavy atom molecular configurations were computed deterministically from the pairwise distance vectors using cMDS.

#### 2. STAR deployment

Application of the trained chignolin STAR model is illustrated in Fig. 3. The mean heavy atom reconstruction accuracy over the 5 *µ*s training trajectory is RMSD_{train} = 0.12 nm. The mean accuracy is only slightly diminished to RMSD_{test} = 0.14 nm over the 15 *µ*s test trajectory, reaching a worst case maximum of 0.29 nm, again illustrating good generalizability of the trained model. Histograms illustrating the complete RMSD distribution over the test trajectory are presented in Fig. S4. We illustrate for five selected points A–E in the *h*_{2t}(*t*) time trace [Fig. 3(a)], their projection onto the smFES *βF*(*ψ*_{1}, *ψ*_{2}) [Fig. 3(b)], RMSD reconstruction accuracy as a function of position on the intrinsic manifold [Fig. 3(c)], and reconstructed $\hat{\mathbf{r}}(t) \in \mathbb{R}^{279}$ and true $\mathbf{r}(t) \in \mathbb{R}^{279}$ molecular structures [Figs. 3(d) and 3(e)]. Movie S2 presents an animation of molecular reconstructions and smFES projections for all test data in the time series in Fig. 3(a).

Configurations A and D correspond to the native hairpin state of the protein in which all native hydrogen bonds are intact and lie within the deep global free energy minimum. The reconstruction accuracy of the densely sampled native fold is extremely good as indicated by RMSDs of ∼0.05 nm. Configurations B, C, and E represent a sampling of the unfolded ensemble containing a diversity of random coiled states with some or all of the native hydrogen bonds broken. Despite the relatively sparser sampling and larger configurational diversity of the unfolded ensemble, the RMSD reconstruction accuracy is still better than 0.2 nm. This indicates that the training data, despite containing only four folding/unfolding transitions, provide a sufficiently dense and representative sampling of configurational space to enable accurate reconstruction of even transiently visited molecular conformations. It is unsurprising that configurations lying at high free energies (e.g., configuration E, *βF* = 10.5, and RMSD = 0.15 nm) that are very sparsely sampled in the training data have poorer reconstruction accuracies than the densely sampled native configurations (e.g., configuration A, *βF* = 0.0, and RMSD = 0.06 nm). We propose that adaptive sampling techniques to perform targeted sampling of infrequently visited regions of the manifold may be beneficial in providing more training data to the STAR pipeline and improving the reconstruction accuracy of higher free energy configurations. Configurations B (*h*_{2t} = 2.1 nm) and C (*h*_{2t} = 2.4 nm) possess similar values of the *h*_{2t} distance but constitute two configurationally and thermodynamically distinct members of the chignolin unfolded ensemble. Again, we observe that STAR accurately reconstructs configurations that share similar instantaneous values of the scalar observable but correspond to structurally different configurations lying in very different regions of the smFES.

## IV. CONCLUSIONS

This work presents the theoretical underpinnings and numerical implementation of an approach, Single-molecule TAkens Reconstruction (STAR), to reconstruct molecular configurations from time series in a single experimentally measurable observable. The basis for the approach rests upon the integration of Takens’s delay embedding theorem with tools from manifold learning, statistical thermodynamics, artificial neural networks, and rigid graph theory to extract a representation of the system state from the scalar time series and learn the *a priori* unknown mapping to the molecular configuration from molecular dynamics simulation training data. The trained STAR model can then be applied to novel time series data to predict both the corresponding molecular configurations and their location and stability on the single molecule free energy surface. We have demonstrated and validated the approach in applications to a C_{24}H_{50} polyethylene chain and the 10-residue engineered *β*-hairpin mini-protein chignolin. In both cases, we demonstrate that trained STAR models can robustly reconstruct the molecular configurations from time series data in the head-to-tail distance with mean RMSD reconstruction accuracies better than 0.2 nm. This is comparable to the approximately 0.4 nm resolution of cryo-electron microscopy^{79,80} or 0.1 nm resolution of x-ray crystallography.^{81}

In this work, we adopt the head-to-tail distance as an experimental observable that can, in principle, be measured by a single molecule experimental technique such as smFRET. The head-to-tail time traces in this work are extracted from MD simulation trajectories in order to test our approach in applications where the ground truth molecular configurations are explicitly available from the simulations. These time traces can therefore be considered to represent synthetic and idealized smFRET data at arbitrarily high temporal resolution and subject to no measurement error or noise. The present work reports a computational proof-of-principle demonstration of STAR in this idealized limit. Applying STAR to real experimental time series must engage a number of concerns surrounding the experimental realities of smFRET measurements including millisecond limits in sampling frequency; shot noise, uncertainties, and unpredictable trajectory lengths due to photobleaching; degraded measurement reliabilities for fluorophores outside of the 2 nm–8 nm range; and conformational perturbations induced by conjugation of the fluorescent probes.^{3,4} Furthermore, the STAR model must be trained and calibrated on MD training data, and so the accuracy of the molecular potential functions and degree of sampling of the thermally relevant configurational space will limit the quality of the trained model. A STAR model calibrated on MD simulation data employing a poor force field or which does not sample all of the experimentally accessible states and transitions will not perform well when deployed on real experimental time series. In future work, we propose to engage these practical issues empirically by adding noise to the observables extracted from our simulation trajectories, limiting the accessible time resolution, limiting the length of the time series, and exploring the transferability of the trained models between molecular force fields.

We would also like to explore technical innovations to determine optimal experimental observables (e.g., optimal FRET fluorophore placement) for high-accuracy molecular reconstruction,^{82} the extension of STAR to multiplexed measurements,^{11} and the potential to reconstruct not just the molecular configuration but also the location and orientation of proximate solvent molecules using permutationally invariant representations of the solvent coordinates.^{83,84} It would also be of interest to explore the transferability of reconstructions learned under one set of conditions (e.g., temperature, pressure, salt concentration, and mutations from wild type) to reconstruct those at another. Takens’s theorem does require time-ordered data, but this may come from many short trajectories rather than a single long one. As such, we would also like to develop adaptive sampling capabilities wherein the deployed model can identify regions of configurational space where it does not have sufficient training data to make accurate predictions and performs automated targeting of additional on-the-fly molecular simulations to supplement training in under-sampled regions. This is anticipated to be particularly important for applications of STAR to large proteins where it is challenging to comprehensively sample the important configurational space. The short time step required for stability of the numerical integration limits MD simulations to millisecond time scales on even the fastest computational hardware. This adaptive parameterization protocol would enable efficient training of robust STAR models using a discontinuous ensemble of short MD trajectories on the microsecond time scale and then deployment of the trained model to reconstruct atomistic trajectories from smFRET data on the time scale of seconds or longer.
We would also like to experiment by replacing the ANN and cMDS approach to molecular reconstruction with a conditional Wasserstein generative adversarial network (cWGAN)^{85,86} as a more powerful reconstruction tool that, due to its intrinsic driving by noise, has the potential to reintroduce the richness and diversity of molecular structures in the degrees of freedom that are not explicitly captured within the low-dimensional intrinsic manifold. Finally, we also see potential applications of STAR in other applications where it is of interest to reconstruct the state of a dynamical system and where it is challenging or impossible to obtain complete information on its state. As such, we foresee potential applications of the approach in diverse fields and areas of dynamical system modeling such as fluid mechanics, meteorology, climatology, epidemiology, and ecology.

## SUPPLEMENTARY MATERIAL

See the supplementary material for simulation input files for the C_{24}H_{50} molecular dynamics simulations, Jupyter notebooks implementing each stage of the STAR pipeline, Fig. S1 showing the C_{24}H_{50} diffusion map eigenvalue spectra, Fig. S2 showing the distribution of C_{24}H_{50} RMSD reconstruction accuracies, Fig. S3 showing the chignolin diffusion map eigenvalue spectra, Fig. S4 showing the distribution of chignolin RMSD reconstruction accuracies, Movie S1 showing an animation of molecular reconstructions and smFES projections for C_{24}H_{50}, and Movie S2 showing an animation of molecular reconstructions and smFES projections for chignolin.

## AUTHORS’ CONTRIBUTIONS

A.L.F. conceived and designed the research and provided funding acquisition; A.L.F. and M.T. performed the calculations, analyzed data, and wrote and edited this paper.

## ACKNOWLEDGMENTS

We thank Dr. Gül H. Zerze for fruitful discussions. This material is based on work supported by the National Science Foundation under Grant No. DMS-1841810. We are grateful to D. E. Shaw Research for sharing the chignolin simulation trajectories. A.L.F. is a consultant of Evozyne and a co-author of U.S. Provisional Patent Nos. 62/853,919 and 62/900,420 and International Patent Application Nos. PCT/US2020/035206 and PCT/US20/50466.

## DATA AVAILABILITY

Input files for the C_{24}H_{50} molecular dynamics simulations and Jupyter notebooks implementing each stage of the STAR pipeline are provided in the supplementary material. The chignolin simulation trajectories reported in Ref. 73 were obtained by request from D. E. Shaw Research.