The 2024 Nobel Prize in Chemistry was awarded in part for de novo protein structure prediction using AlphaFold2, an artificial intelligence/machine learning (AI/ML) model trained on vast amounts of sequence and three-dimensional structure data. AlphaFold2 and related models, including RoseTTAFold and ESMFold, employ specialized neural network architectures driven by attention mechanisms to infer relationships between sequence and structure. At a fundamental level, these AI/ML models operate on the long-standing hypothesis that the structure of a protein is determined by its amino acid sequence. More recently, AlphaFold2 has been adapted for the prediction of multiple protein conformations by subsampling multiple sequence alignments. Herein, we provide an overview of the deterministic relationship between sequence and structure, which was hypothesized over half a century ago and has had profound implications for the biological sciences ever since. We postulate that protein conformational dynamics are also determined, at least in part, by amino acid sequence and that this relationship may be leveraged for the construction of AI/ML models dedicated to predicting protein conformational ensembles. Accordingly, we describe a conceptual model architecture, which may be trained on sequence data in combination with conformationally sensitive structural information, coming primarily from nuclear magnetic resonance (NMR) spectroscopy. Notwithstanding certain limitations in this context, NMR offers abundant structural heterogeneity conducive to conformational ensemble prediction. As NMR and other data continue to accumulate, sequence-informed prediction of protein structural dynamics with AI/ML has the potential to emerge as a transformative capability across the biological sciences.
BIOLOGICAL SEQUENCE INFORMATION AND ITS RELATIONSHIP WITH PROTEIN STRUCTURE AND DYNAMICS
The exchange of sequence information between biological macromolecules is a fundamental process of life. The pathway commonly summarized as DNA → RNA → protein was put forward as the “Sequence Hypothesis” by Francis Crick.1 The discovery of the double helix structure of DNA served as the initial inspiration, as asserted in one of scientific literature's most famous understatements: “It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.”2 On the relationship between sequence and structure, Crick further speculated that “folding is simply a function of the order of the amino acids,”1 and a more comprehensive articulation was later provided by Christian Anfinsen through the “Thermodynamic Hypothesis,” which states that “the native conformation is determined by the totality of interatomic interactions and hence by the amino acid sequence.”3 Protein structure is not static, however, and function may depend on conformational dynamics. For example, early studies on the structure of myoglobin revealed that structural rearrangement is required for molecular oxygen binding.4–6 Building on these observations and hypotheses, we posit that amino acid sequence determines both the three-dimensional (3D) structure and the conformational dynamics of proteins [Fig. 1(a)]. Together, the above insights span from information storage to biological function.
From sequence to structure and conformational dynamics. (a) Biological sequence information (red) encodes biophysical properties (blue) across a spectrum that spans from information storage to biological function. For simplicity, not all possible directionalities of biological information transfer are shown, as reviewed in detail elsewhere.7–9 (b) The postulated deterministic relationship between sequence and conformational dynamics may be leveraged for development of AI/ML models, trained on metagenomic sequence data in combination with 3D conformational ensemble data, for the prediction of protein conformational ensembles from amino acid sequence alone.
Looking back, it is quite remarkable that, despite the paucity of experimental evidence available at the time,1,3 early hypotheses concerning the relationship between biological sequence and structure remain not only valid but central to major research advances.8,10 Notably, the flow of biological information is distinct from the flow of energy and matter.1 While obeying the laws of chemistry and physics, biological information transfer and its deterministic role in structure may be treated as distinct and self-contained. Considering biological sequence information as such enabled significant progress in protein structure prediction. Jumper et al. emphasized the limitation of conventional physics-based approaches for this purpose and relied instead on a sequence/structure-centric AI/ML approach for de novo protein structure prediction with AlphaFold2.11 This is not to say that the laws of chemistry and physics are not important when considering the relationship between sequence and structure, but explicit implementation of chemistry/physics-based algorithms alone for protein structure prediction proved difficult, to say the least.12 In contrast, utilization of AI/ML methods that combine Protein Data Bank (PDB) holdings with genome sequence information to infer structure proved both feasible and remarkably effective.11–14
Despite major progress with AI/ML approaches, application of sequence–structure relationships in this context is arguably in its infancy. AlphaFold2 and similar approaches make predictions of static protein structure, presumably an “idealized” structure of a protein in a low energy state. However, proteins are not frozen in space and time—they are dynamic and adopt various structural conformations across complex energy landscapes.6,15,16 Protein structural heterogeneity has been explored both experimentally and computationally. Nuclear magnetic resonance (NMR) spectroscopy has allowed for experimental determination of ensembles of conformational states,17 while computational molecular dynamics (MD) simulations have aimed to provide conventional physics-based insights.18–20 Both NMR and MD simulations depend on physical properties of individual atoms and each approach offers a methodologically independent basis for characterizing relationships between amino acid sequence and conformational variability. Furthermore, combined incorporation of sequence and structural conformation data may be used for training AI/ML models dedicated to sequence/structure-centric prediction of protein conformational ensembles [Fig. 1(b)].
PREDICTION OF PROTEIN STRUCTURE AND DYNAMICS WITH AI/ML
Following the success of DeepMind in the Critical Assessment of Structure Prediction (CASP) competition,21 the AI/ML model AlphaFold2 was recognized with the 2024 Nobel Prize in Chemistry for bridging the predictive gap between sequence and structure with unprecedented accuracy.11,22 As previously mentioned, AlphaFold2 and similar AI/ML approaches, including RoseTTAFold13 and ESMFold,14 rely on biological sequence data for prediction of protein structure via attention-based23 neural network architectures. 3D atomic coordinate information from the PDB24 was central for training these models. In addition, metagenomic sequence data obtained from UniProt25 and other sources were instrumental. AlphaFold2 and RoseTTAFold rely on multiple sequence alignments (MSAs), which carry information about co-evolution of pairs of amino acid residues, to inform 3D structure prediction. Remarkably, structure prediction accuracy arises in the early sequence-based stages of the AlphaFold2 model architecture.11 Similarly, ESMFold, trained initially on sequence data alone using an attention-based language model without MSAs, exhibited substantial degradation of prediction accuracy when the sequence-based language model was dispensed with.14 The sequence/structure-centric frameworks of these AI/ML approaches rely heavily on the long-standing hypothesis that sequence determines 3D structure.1,3 In the same vein, the predictive accuracy of these approaches bolsters support for this hypothesis: if sequence did not determine structure, AlphaFold2 and related approaches would not have been so successful.
More recently, AI/ML approaches have been introduced for prediction of multiple conformational states of proteins, rather than singular structures. AlphaFold2 was adapted to predict multiple protein conformations without retraining the model. More specifically, subsampling of sequences for assembly of MSAs has been demonstrated to result in predictions of multiple protein conformations that resemble conformations determined by experimental methods.26–28 Of particular note, AlphaFold2 was trained on structures determined by x-ray crystallography and cryogenic electron microscopy (cryoEM), but not more conformationally sensitive spectroscopic methods. An example of a protein conformational ensemble predicted using this approach is shown in Fig. 2(a). Another AI/ML model developed by Mansoor et al., though not trained on biological sequence information, was able to predict multiple protein conformations with a variational autoencoder (VAE) architecture.29,30 In this case, protein structures determined by x-ray crystallography accompanied by MD simulation snapshots were used to train the model to infer structural variation. 3D atomic coordinate-derived data are “encoded” into a latent space of reduced complexity from which novel conformations are “decoded” back into 3D structure data. While this model incorporates RoseTTAFold13 to process the decoded structural information into 3D coordinate structures, sequence data were not used for training the underlying model.29 Nevertheless, the approach of Mansoor et al. does provide a promising methodology for generating informative conformational ensembles from structural input.
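The MSA subsampling idea described above can be illustrated with a minimal sketch. It covers only the sampling step: the query sequence is kept, and a random subset of the remaining aligned sequences is drawn for each prediction run. The function name `subsample_msa`, the toy alignment, the subsample depth, and the number of subsamples are all hypothetical choices made for illustration; they are not taken from AlphaFold2, ColabFold, or the cited studies, and no structure prediction is performed here.

```python
import random


def subsample_msa(msa, max_seqs, seed=None):
    """Return the query sequence plus a random subset of the other aligned sequences.

    msa      -- list of aligned sequences; msa[0] is assumed to be the query
    max_seqs -- total number of sequences to keep, including the query
    seed     -- optional seed so that a given subsample can be reproduced
    """
    if not msa:
        raise ValueError("MSA must contain at least the query sequence")
    if max_seqs < 1:
        raise ValueError("max_seqs must be at least 1")
    rng = random.Random(seed)
    query, others = msa[0], msa[1:]
    # Never request more sequences than the alignment actually contains.
    k = min(max_seqs - 1, len(others))
    return [query] + rng.sample(others, k)


# Toy alignment (hypothetical sequences, for illustration only).
toy_msa = [
    "MKTAYIAK",  # query
    "MKTAYLAK",
    "MRTAYIAK",
    "MKSAYIAR",
    "MKTGYIAK",
    "-KTAYIAK",
]

# Each seed yields a different shallow MSA; in a real workflow each one
# would be passed to the structure predictor as a separate input.
subsamples = [subsample_msa(toy_msa, max_seqs=3, seed=s) for s in range(4)]
for i, sub in enumerate(subsamples):
    print(f"subsample {i}: {sub}")
```

Because each shallow alignment carries a different slice of the co-evolutionary signal, repeated predictions over such subsamples may yield distinct conformations, which is the behavior reported in the studies cited above.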
Toward sequence/structure-centric prediction of protein conformational dynamics with AI/ML. (a) NMR-determined (beige) (PDB ID 1GA3) and AlphaFold-predicted (teal) conformational ensembles of interleukin 13 along with a comparison of the root mean square fluctuation (RMSF) between them. The AlphaFold prediction was performed with stochastic MSA subsampling.26–28,31 (b) Distribution of NMR-determined single-chain protein entries currently deposited in the Protein Data Bank (PDB), showing the number of conformational structures per ensemble (per entry) vs protein sequence length. The scale bar represents the number of PDB entries. (c) Conceptual AI/ML model architecture for end-to-end prediction of protein conformational ensembles from amino acid sequence input. The model comprises attention-based and variational mechanisms, representing a refined integration of existing models.11,29 An NMR-determined conformational ensemble of the globular domain of human histone H1x (PDB ID 2LSO) is used for illustrative purposes in the predicted ensemble panel.
Collectively, these approaches offer encouragement for development of AI/ML models dedicated to prediction of protein conformational dynamics in a sequence/structure-centric manner. Moreover, the combined incorporation of biological sequence data with conformationally sensitive, experimentally determined NMR data for model training is yet to be explored. There are over 10 000 single-chain protein 3D structure ensembles freely available from the PDB, which may be leveraged as training data [Fig. 2(b)]. NMR data may be further enriched with carefully selected information from MD simulations. With the use of such data, a sequence/structure-centric model may be developed without entirely “re-inventing the wheel” by incorporating attributes from existing approaches including attention-based11 and VAE29 mechanisms [Fig. 2(c)]. The goal of such a model would be to input an amino acid sequence and perform end-to-end prediction of protein conformational ensembles, similar to prediction of static protein structures with AlphaFold2. Predicted 3D structure ensembles may be compared to NMR-determined ensembles for overall benchmarking of the method and individual accuracy assessments.
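One way the benchmarking step described above might be carried out is to compare per-residue RMSF profiles of a predicted ensemble and an NMR-determined ensemble. The sketch below is a simplified illustration under several assumptions that are not part of the original text: each model is reduced to one coordinate per residue (e.g., the Cα atom), all models are assumed to be already superposed onto a common reference frame, and the Pearson correlation between the two RMSF profiles is used as an example agreement score. The function names and the toy coordinates are hypothetical.

```python
import math


def rmsf_per_residue(ensemble):
    """Per-residue root mean square fluctuation across the models of an ensemble.

    ensemble -- list of models; each model is a list of (x, y, z) tuples,
                one per residue, assumed to be pre-superposed.
    """
    n_models = len(ensemble)
    n_res = len(ensemble[0])
    if any(len(model) != n_res for model in ensemble):
        raise ValueError("all models must have the same number of residues")
    rmsf = []
    for i in range(n_res):
        # Mean position of residue i over all models.
        mean = [sum(model[i][d] for model in ensemble) / n_models for d in range(3)]
        # Mean squared deviation from that mean position.
        msd = sum(
            sum((model[i][d] - mean[d]) ** 2 for d in range(3)) for model in ensemble
        ) / n_models
        rmsf.append(math.sqrt(msd))
    return rmsf


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)


# Toy ensembles: two models, three residues each (hypothetical coordinates).
nmr_ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 0.0, 4.0)],
]
predicted_ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 2.0)],
]

rmsf_nmr = rmsf_per_residue(nmr_ensemble)
rmsf_pred = rmsf_per_residue(predicted_ensemble)
agreement = pearson(rmsf_nmr, rmsf_pred)
print(rmsf_nmr, rmsf_pred, agreement)
```

In this toy case the predicted ensemble reproduces the pattern of flexibility along the chain but underestimates its magnitude, which a correlation score alone does not reveal; a practical benchmark would likely report an amplitude-sensitive measure alongside it.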
The proposed approach represents a simplified conceptualization, and alternative approaches with differing architectures may be pursued. Furthermore, it is important to consider that the spatial distribution of conformational states does not provide a complete characterization of protein dynamics, which also depend on associated transition rates. More sophisticated AI/ML approaches should therefore aim to incorporate transition rate information, which may also shed light on how sequence relates to rate. Such data may be derived from NMR relaxation measurements, for example from the Biological Magnetic Resonance Data Bank (BMRB),32 as well as from MD simulations. It should also be noted that NMR data bear intrinsic uncertainties, particularly from incomplete or ambiguous restraint assignment, making it difficult to differentiate between true and methodologically artifactual structural heterogeneity.33,34 Therefore, the ground-truth predictive accuracy of AI/ML models trained on NMR data is limited by the accuracy of the experimental methodology itself. Nevertheless, NMR data offer an abundance of structural heterogeneity conducive to model training for conformational ensemble prediction. Improvements in experimental methodology, continued data accumulation, and augmentation with MD simulation data are anticipated to facilitate implementation and reliability. The plausibility of accurately predicting protein dynamics from sequence is strengthened by the capabilities of current AI/ML-based sequence/structure-centric models and recent forays into prediction of multiple structural conformations, warranting further research and development.
CONCLUSIONS
Central to biology is the implicit relationship between amino acid sequence and 3D structure, as postulated nearly seven decades ago. This principle underpinned numerous discoveries and technical developments, including recent advances in AI/ML-based protein structure prediction. It appears highly likely that sequence encodes not just a single idealized 3D structure but also the conformational dynamics of a protein and, therefore, biochemical/biological function. If this hypothesis holds, amino acid sequence data—alongside conformationally sensitive structural data, i.e., NMR-determined structures—may be leveraged to train AI/ML models for prediction of conformational ensembles from amino acid sequence information alone. The functional implications of sequence–structure relationships across all of biology and biomedicine, literally spanning from agriculture to zoology,35 have been profound. This relationship is expected to continue to play a fundamental role at the nexus of AI/ML, data science, and biology.
ACKNOWLEDGMENTS
Molecular graphics were prepared using UCSF ChimeraX, developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, with support from the National Institutes of Health R01-GM129325 and the Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases.36 The conformational ensemble of interleukin 13 in Fig. 2(a) was predicted using the ColabFold implementation of AlphaFold2.31 RCSB PDB Core Operations are jointly funded by the U.S. National Science Foundation (NSF) (DBI-2321666, PI: S.K. Burley), the U.S. Department of Energy (DE-SC0019749, PI: S.K. Burley), and the National Cancer Institute, the National Institute of Allergy and Infectious Diseases, and the National Institute of General Medical Sciences of the National Institutes of Health (R01GM157729, PI: S.K. Burley).
AUTHOR DECLARATIONS
Conflict of Interest
Yes, Alexander M. Ille and Emily Anas are co-founders of North Horizon, which is engaged in the development of artificial intelligence-based software. Emily Anas is an employee of Microsoft. Michael B. Mathews and Stephen K. Burley have no conflicts to disclose.
Author Contributions
Alexander M. Ille: Conceptualization (equal); Data curation (lead); Software (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (equal). Emily Anas: Conceptualization (equal); Software (supporting); Writing – original draft (supporting). Michael B. Mathews: Conceptualization (supporting); Funding acquisition (supporting); Writing – review & editing (equal). Stephen K. Burley: Conceptualization (supporting); Funding acquisition (lead); Resources (lead); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.