Proteins, serving as the fundamental architects of biological processes, interact with ligands to perform a myriad of functions essential for life. Designing functional ligand-binding proteins is pivotal for advancing drug development and enhancing therapeutic efficacy. In this study, we introduce ProteinReDiff, a diffusion framework for the redesign of ligand-binding proteins. Built on equivariant diffusion-based generative models, ProteinReDiff enables the creation of high-affinity ligand-binding proteins without the need for detailed structural information, relying instead on initial protein sequences and ligand SMILES strings. Our evaluations across sequence diversity, structural preservation, and ligand binding affinity underscore ProteinReDiff's potential to advance computational drug discovery and protein engineering.
I. INTRODUCTION
Proteins, often referred to as the molecular architects of life, play a critical role in virtually all biological processes. A significant portion of these functions involves interactions between proteins and ligands, underpinning the complex network of cellular activities. These interactions are not only pivotal for basic physiological processes, such as signal transduction and enzymatic catalysis, but also have broad implications in the development of therapeutic agents, diagnostic tools, and various biotechnological applications.1–3 Despite the paramount importance of protein–ligand interactions, the majority of existing studies have primarily focused on protein-centric designs to optimize specific protein properties, such as stability, expression levels, and specificity.4–8 This prevalent approach, despite leading to numerous advancements, does not fully exploit the synergistic potential of optimizing both proteins and ligands for redesigning ligand-binding proteins. By embracing an integrated design approach, it becomes feasible to refine control over binding affinity and specificity, leading to applications such as tailored therapeutics with reduced side effects, highly sensitive diagnostic tools, efficient biocatalysis, targeted drug delivery systems, and sustainable bioremediation solutions,9–11 thus illustrating the transformative impact of redesigning ligand-binding proteins across various fields.
Traditional methods for designing ligand-binding proteins have relied heavily on experimental techniques, characterized by systematic but often inefficient trial-and-error processes.12–14 These methods, while foundational, are time-consuming, resource-intensive, and sometimes fall short in precision and efficiency. The emergence of computational design has marked a transformative shift, offering new pathways to accelerate the design process and gain deeper insights into the molecular basis of protein–ligand interactions. However, even with the advancements in computational approaches, significant challenges remain. Many existing models demand extensive structural information, such as protein crystal structures and specific binding pocket data, limiting their applicability, especially in urgent scenarios like the emergence of novel diseases.15–17 For instance, during the outbreak of a new disease like COVID-19, the spike proteins of the virus may not have well-characterized binding sites, delaying the development of effective drugs.18,19 Furthermore, the complexity of binding mechanisms, including allosteric effects and cryptic pockets, adds another layer of difficulty.20,21 Specifically, many proteins do not exhibit clear binding pockets until ligands are in close vicinity, necessitating extensive simulations to reveal potential binding interfaces.21,22 While molecular dynamics simulations offer detailed atomistic insights into binding mechanisms, they often prove inadequate for designing high-throughput sequences due to high computational cost.9,23 This complexity underscores the need for a drug design methodology that is agnostic to predefined binding pockets.
Our study addresses those identified challenges by introducing ProteinReDiff, a Protein Redesign framework based on Diffusion models. Originating from the foundational concepts of the Equivariant Diffusion-Based Generative Model for Protein–Ligand Complexes (DPL),24 ProteinReDiff incorporates key improvements inspired by the representation learning modules from the AlphaFold2 (AF2) architecture.25 Specifically, we integrate the Outer Product Update (adapted from outer product mean of AF2), single representation attention (SRA) [adapted from multiple sequence alignment (MSA) row attention module], and Triangle Multiplicative Update modules into our Residual Feature Update procedure. These modules collectively enhance the framework's ability to capture intricate protein–ligand interactions, improve the fidelity of binding affinity predictions, and enable more precise redesigns of ligand-binding proteins.
The framework integrates the generation of diverse protein sequences with blind docking capabilities. Starting from a selected protein–ligand pair, our approach stochastically masks amino acids and applies equivariant denoising within the diffusion model to capture the joint distribution of protein–ligand complex conformations (Fig. 1). Another key feature of our method is blind docking, which predicts how the redesigned protein interacts with its ligand without predefined binding site information, relying solely on initial protein sequences and ligand SMILES strings.26 This streamlined approach significantly reduces reliance on detailed structural data, thus expanding the scope for sequence-based exploration of protein–ligand interactions.
In summary, the contributions of our paper are outlined as follows:
-
We introduce ProteinReDiff, an efficient computational framework for ligand-binding protein redesign, rooted in equivariant diffusion-based generative models. Our innovation lies in integrating AF2's representational learning modules to enhance the framework's ability to capture intricate protein–ligand interactions.
-
Our framework enables the design of high-affinity ligand-binding proteins without detailed structural information, relying solely on initial protein sequences and ligand SMILES strings.
-
We comprehensively evaluate our model's outcomes across multiple design aspects, including sequence diversity, structure preservation, and ligand binding affinity, ensuring a holistic assessment of its effectiveness and applicability in various contexts.
II. RELATED WORK
A. Traditional approaches in protein design
Protein design has historically hinged on computational and experimental strategies that paved the way for modern advancements in the field. These foundational methodologies emphasized the balance between understanding protein structure and engineering novel functionalities, albeit with inherent limitations in scalability and precision. Key traditional approaches include the following:
-
Rational Design27–29 focused on introducing specific mutations into proteins based on known structural and functional insights. This method required an in-depth understanding of the target protein structures and how changes might impact its function.
-
Directed Evolution30–33 mimicked natural selection in the laboratory, evolving proteins toward desired traits through iterative rounds of mutation and selection. Despite its effectiveness in discovering functional proteins, the process was often labor-intensive and time-consuming.
These traditional methods have been instrumental in advancing our understanding and capability in protein design. However, their limitations in terms of efficiency, specificity, and the broad applicability of findings highlighted the need for more versatile and scalable approaches. As the field progressed, the integration of computational power and biological understanding opened new avenues for innovation in protein design, leading to the exploration and adoption of more advanced methodologies.
B. Deep generative models in protein design
Since their inception, deep generative models have significantly advanced fields like computer vision (CV)34 and natural language processing (NLP),35 sparking interest in their application to protein design. This enthusiasm has led to numerous studies that harness these models for innovating within the protein design area. Among these, certain types of deep generative models have distinguished themselves through their effectiveness and the promising results they have achieved, including:
-
Variational Autoencoders (VAEs) are utilized to explore diverse chemical spaces by learning rich latent representations of protein sequences, enabling the generation of novel sequences through latent space manipulation.36–38
-
Autoregressive models predict the probability of each amino acid in a sequential manner, facilitating the generation of coherent and functionally plausible protein sequences.39,40
-
Generative adversarial networks (GANs) employ two networks that work in tandem to produce protein sequences indistinguishable from real ones, enhancing the realism and diversity of generated designs.41,42
-
Diffusion models represent a step forward by gradually transforming noise into structured data, simulating the complex process of folding sequences into functional proteins.43–46
However, the majority of these studies have focused on protein-centric designs, with a noticeable gap in research that integrates both proteins and ligands for the purpose of redesigning ligand-binding proteins. Such integration is crucial for a holistic understanding of the intricate dynamics between protein structures and their ligands, a domain that remains underexplored.
C. Current approaches in ligand-binding protein redesign
1. Heavy reliance on detailed structural information
Contemporary computational methodologies for designing proteins that target specific surfaces predominantly rely on structural insights from native complexes, underscoring the critical role of fine-tuning side chain interactions and optimizing backbone configurations for optimal binding affinity.15–17,44,47,48 These strategies often begin with the generation of protein backbones, employing inverse folding techniques to identify sequences capable of folding into these pre-designed structures.6,7,48,49 This approach signifies a paradigm shift by prioritizing structural prediction ahead of sequence identification, aiming to produce proteins that not only fit the desired conformations for potential ligand interactions but also navigate around the challenge of undefined binding sites. Despite the advantages, including the potential of computational docking to create binders via manipulation of antibody scaffolds and varied loop geometries,36,50,51 a notable challenge persists in validating these binding modes with high-resolution structural evidence. Additionally, the traditional focus on a limited array of hotspot residues for guiding protein scaffold placement often restricts the exploration of possible interaction modes, particularly in cases where target proteins lack clear pockets or clefts for ligand accommodation.22,52
2. Limited training data and lack of diversity
Existing approaches often rely on a limited set of training data, which can restrict the diversity and generalizability of the resulting models. For instance, datasets like PDBBind provide detailed ligand information, but their scope is limited.53 This limitation is further compounded when protein datasets lack corresponding ligand data, reducing the effectiveness of the training process. Traditional methodologies also tend to focus on a narrow range of protein–ligand interactions, potentially overlooking the broader spectrum of possible interactions.
3. Single-domain denoising focus
Previous methodologies typically concentrate on denoising either in sequence space or structural space, but not both. Approaches like ProteinMPNN,6 LigandMPNN,17 and MIF48 primarily operate in sequence space, while others like DPL function in structural space.24 This single-domain focus can limit the ability to capture the full complexity of protein–ligand interactions, which inherently involve both sequence and structural dimensions. Consequently, these methodologies may fall short of accurately predicting the functional capabilities of redesigned proteins.
4. Challenges in generating diverse sequences with structural integrity
While some approaches prioritize sequence similarity to generate functional proteins, they often do so at the expense of structural integrity. For example, ProteinMPNN and CARP focus heavily on sequence similarity, which can result in a lack of diversity and flexibility in the generated sequences.6,7 This limitation can hinder the ability to explore a wider range of functional conformations, reducing the effectiveness of the protein design process.
5. Key improvements of ProteinReDiff
We address the weaknesses of available methodologies by integrating diverse datasets, employing a dual-domain denoising strategy, and ensuring the generation of diverse sequences while maintaining structural integrity. Our approach utilizes only protein sequences and ligand SMILES strings, eliminating the need for detailed structural information. By combining PDBBind53 and CATH54 datasets, we effectively double our training data, enhancing protein representations. Our equivariant and KL-divergence loss functions enable denoising across both sequence and structural dimensions, capturing the full complexity of protein–ligand interactions. This approach maintains structural fidelity and promotes sequence diversity, overcoming the limitations of methodologies prioritizing sequence similarity at the expense of diversity.
III. BACKGROUND
A. Protein language models (PLMs)
Protein language models (PLMs) harness the power of natural language processing (NLP) to unravel the intricate latent information embedded within protein sequences. By analogizing amino acid sequences to human language sentences, PLMs unlock profound insights into protein functions, interactions, and evolutionary trajectories.55 These models leverage advanced text processing techniques to predict structural, functional, and interactional properties of proteins based solely on their amino acid sequences.56–59 Their adoption in protein design has catalyzed significant progress, with studies leveraging PLMs to translate protein sequence data47,60–62 into actionable insights, thus guiding the precise engineering of proteins with targeted functional attributes.
We employ the ESM-2 model,59 a state-of-the-art protein language model with 650 × 10⁶ parameters, pre-trained on nearly 65 × 10⁶ unique protein sequences from the UniRef63 database, to featurize the initial masked protein sequences. ESM-2 enriches the latent representation of protein sequences, bypassing the need for conventional multiple sequence alignment (MSA) methods. By incorporating structural and evolutionary information from input sequences, ESM-2 enables us to unravel interaction patterns across protein families for effective ligand targeting. This understanding is crucial for designing and optimizing ligand-binding proteins.
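To make this step concrete, the snippet below sketches one way to extract per-residue ESM-2 embeddings with the fair-esm package; the example sequence is a placeholder, and the normalization and linear mapping applied downstream (Sec. IV A) are omitted.

```python
# A minimal sketch of per-residue ESM-2 feature extraction using the
# fair-esm package; the example sequence is a placeholder.
import torch
import esm

# Load the 650M-parameter ESM-2 model and its tokenization alphabet.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
# Final-layer embeddings; drop the BOS/EOS tokens to get (seq_len, 1280).
residue_embeddings = out["representations"][33][0, 1:-1]
```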
B. Equivariant diffusion-based generative models
1. The diffusion procedure
2. The generative denoising process
IV. METHOD
In this section, we detail the methodology employed in our noise prediction model, which is depicted in Fig. 1 and consists of three main procedures: (1) input featurization, (2) residual feature update, and (3) equivariant denoising. Through these steps, we transform raw protein and ligand data into structured representations, iteratively refine their features, and leverage denoising techniques inherent in the diffusion model to improve sampling quality.
A. Input featurization
We develop both single and pair representations from protein sequences and ligand SMILES strings (Fig. 2). For proteins, we first apply stochastic masking to segments of the amino acid sequences. The protein representation is obtained by normalizing and linearly mapping the output of the final layer of the ESM-2 model, which is then combined with the amino acid and masked token embeddings. For pair representations of proteins, we leverage pairwise relative positional encoding techniques, drawing from established methodologies.25 For ligand representations, we employ a comprehensive feature embedding approach, capturing atomic and bond properties: atomic number, chirality, connectivity, formal charge, hydrogen attachment count, radical electron count, hybridization status, aromaticity, and ring presence for atoms; and bond type, stereochemistry, and conjugation status for bonds. These representations are then merged, incorporating radial basis function (RBF) embeddings of atomic distances and sinusoidal embeddings of diffusion times. Together, these steps yield the preliminary complex representations that underpin our computational analyses.
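The sketch below illustrates three of these featurization steps under stated assumptions: a truncated per-atom ligand feature set via RDKit, RBF embeddings of pairwise distances, and sinusoidal embeddings of the diffusion time. All hyperparameters (num_rbf, d_max, dim) are illustrative.

```python
# A partial sketch of the featurization pipeline; the atom feature list is
# truncated and all hyperparameters are illustrative assumptions.
import torch
from rdkit import Chem

def ligand_atom_features(smiles: str) -> torch.Tensor:
    """Per-atom properties (a subset of those listed in the text)."""
    mol = Chem.MolFromSmiles(smiles)
    feats = [[atom.GetAtomicNum(), atom.GetFormalCharge(), atom.GetTotalNumHs(),
              int(atom.GetIsAromatic()), int(atom.IsInRing())]
             for atom in mol.GetAtoms()]
    return torch.tensor(feats, dtype=torch.float32)  # (n_atoms, 5)

def rbf_embed(dists: torch.Tensor, num_rbf: int = 16, d_max: float = 20.0) -> torch.Tensor:
    """Radial basis function expansion of pairwise atomic distances."""
    centers = torch.linspace(0.0, d_max, num_rbf)
    width = (d_max / num_rbf) ** 2
    return torch.exp(-((dists.unsqueeze(-1) - centers) ** 2) / width)

def time_embed(t: torch.Tensor, dim: int = 32) -> torch.Tensor:
    """Sinusoidal embedding of the diffusion time step."""
    freqs = torch.exp(torch.linspace(0.0, -8.0, dim // 2))
    angles = t.unsqueeze(-1) * freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)
```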
B. Residual feature update procedure
Our Residual Feature Update Procedure, illustrated in Fig. 3, deviates significantly from the approach employed in the original DPL model.24 While the DPL model relied on AlphaFold2's Triangular Multiplicative Update for updating single and pair representations, where these representations mutually influence each other, our objective is to make this procedure more efficient. Specifically, we incorporate enhancements such as the Outer Product Update and single representation attention to formulate sequence-based representational hypotheses of protein structures and to model motifs suited to binding target ligands. These modules, integral to Evoformer, the sequence-based module of AF2, play a crucial role in extracting essential connections among internal motifs that serve structural functions (i.e., ligand binding) when structural information is not explicitly provided during training. Importantly, we adapt and tailor these modules to fit our model architecture, ensuring their effectiveness in capturing the intricate interplay between proteins and ligands.
1. Single representation attention module
The single representation attention (SRA) module, derived from the AlphaFold2 model's MSA row attention with pair bias, accounts for long-range interactions among residues and ligand atoms within a single protein–ligand embedding vector. In essence, the attention mechanism assigns importance to those involved in complex-based folding to denoise the equivariant loss (Sec. IV C) in a self-supervised manner. While the original AlphaFold2 MSA row attention mechanism processes input for a single sequence, the SRA module is designed to incorporate representations from multiple protein–ligand complexes concurrently. Specifically, the pair bias component of the SRA attention module captures dependencies between proteins and ligands, which has been shown to fit the attention score better than regular self-attention without bias terms.67 By considering both the single representation vector (which encodes the protein/ligand sequential representation) and the pairwise representation vector (which encodes protein–protein and protein–ligand interactions), this cross-attention mechanism exchanges information between the pairwise and single representations to effectively preserve internal motifs, as evidenced by contact overlap metrics.55,68 As transformer architectures are widely used for predicting protein functions,69 we observed similar efficacy in our binding affinity prediction results (Sec. V B 5 and Appendixes B and C). For a detailed description of the computational steps implemented in this module, refer to Algorithm 1.
[Algorithm 1. Single representation attention (SRA). Input: single representation vector m_si and pair representation vector z_sij of the i-th sequence in the set of sequences s; C = 65, N_head = 4. Output: updated single representation vector with dimension C_m.]
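To make the mechanism concrete, here is a hedged sketch of attention with pair bias in the spirit of the SRA module, operating on a single representation m of shape (s, r, c_m) and a pair representation z of shape (s, r, r, c_z). The channel size and head count follow Algorithm 1 (C = 65, N_head = 4), but the gating and projection layout are our assumptions, not the published implementation.

```python
# A hedged sketch of single-representation attention with pair bias;
# the gating/projection layout is an assumption, not the published module.
import torch
import torch.nn as nn

class SingleRepAttention(nn.Module):
    def __init__(self, c_m: int, c_z: int, c: int = 65, n_head: int = 4):
        super().__init__()
        self.c, self.n_head = c, n_head
        self.norm = nn.LayerNorm(c_m)
        self.q = nn.Linear(c_m, c * n_head, bias=False)
        self.k = nn.Linear(c_m, c * n_head, bias=False)
        self.v = nn.Linear(c_m, c * n_head, bias=False)
        self.pair_bias = nn.Linear(c_z, n_head, bias=False)  # bias from pair rep
        self.gate = nn.Linear(c_m, c * n_head)
        self.out = nn.Linear(c * n_head, c_m)

    def forward(self, m: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # m: (s, r, c_m) single rep; z: (s, r, r, c_z) pair rep
        s, r, _ = m.shape
        m = self.norm(m)
        q = self.q(m).view(s, r, self.n_head, self.c)
        k = self.k(m).view(s, r, self.n_head, self.c)
        v = self.v(m).view(s, r, self.n_head, self.c)
        b = self.pair_bias(z).permute(0, 3, 1, 2)  # (s, n_head, r, r)
        att = torch.einsum("sihc,sjhc->shij", q, k) / self.c**0.5 + b
        att = att.softmax(dim=-1)
        o = torch.einsum("shij,sjhc->sihc", att, v)
        g = torch.sigmoid(self.gate(m)).view(s, r, self.n_head, self.c)
        return self.out((g * o).reshape(s, r, -1))
```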
2. Outer product update
Since the SRA encodings have shape (s, r, c_m) and the pair representation has shape (s, r, r, c_z), the outer product update (OPU) layer merges insights by reshaping SRA encodings into pair representations. This module leverages evolutionary cues from ESM to generate plausible structural hypotheses for pair representations.70 It first calculates the outer product of the SRA embeddings of protein–ligand pairs and then aggregates the outer products to yield a measure of co-evolution between every residue pair.55 Analogous to tensor product representations (TPRs) in NLP, the outer product is akin to the filler-and-role binding relationship, where each entity (i.e., amino acid residue) in a sequence is attached to a rich functional embedding based on its relationship to the others.71–73
This process integrates correlated information of residues i and j of a sequence s, resulting in intermediate Kronecker product tensors (i.e., role embeddings in NLP).67,74,75 Subsequently, an affine transformation projects those representations to hypotheses concerning the relative positions of residues i and j under biophysical constraints. Our implementation adapts the outer product without computing the mean, so as to maintain the pair representations of multiple protein–ligand complexes. For a detailed description of the computational steps implemented in this module, refer to Algorithm 2.
[Algorithm 2. Outer product update (OPU). Input: single representation vector m_si of the i-th sequence in the set of sequences s; C = 32. Output: pair representation vector z_sij.]
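Below is a minimal sketch of an outer product update without the mean reduction, matching the description above; the channel size C = 32 follows Algorithm 2, while the projection names are illustrative.

```python
# A minimal sketch of the outer product update (no mean over complexes);
# projection names and sizes other than C = 32 are illustrative assumptions.
import torch
import torch.nn as nn

class OuterProductUpdate(nn.Module):
    def __init__(self, c_m: int, c_z: int, c: int = 32):
        super().__init__()
        self.norm = nn.LayerNorm(c_m)
        self.proj_a = nn.Linear(c_m, c)
        self.proj_b = nn.Linear(c_m, c)
        self.out = nn.Linear(c * c, c_z)  # affine projection to pair channels

    def forward(self, m: torch.Tensor) -> torch.Tensor:
        # m: (s, r, c_m) single representation
        s, r, _ = m.shape
        m = self.norm(m)
        a, b = self.proj_a(m), self.proj_b(m)
        # Outer (Kronecker-like) product between every residue pair (i, j).
        op = torch.einsum("sic,sjd->sijcd", a, b).reshape(s, r, r, -1)
        return self.out(op)  # (s, r, r, c_z), added to the pair representation
```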
3. Triangle multiplicative updates
After refining the pair representation, our model interprets the primary protein–ligand structure using principles from graph theory, treating each residue as a distinct entity interconnected through the pairwise matrix. These connections are then refined through triangular multiplicative updates to account for physical and geometric constraints, such as the triangle inequality. While the SRA weights the importance of residues, the triangular multiplicative update acts as another stack of transformer-based layers in which any two edges of a triangle constrain the third.55,76 The starting and ending nodes propagate information in and out of their neighbors in a similar fashion to the message-passing framework.67 These mechanisms enable the model to generate more accurate representations of protein–ligand complexes, leading to improved performance in predicting binding affinities and structural characteristics.
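The sketch below gives a hedged rendering of the "outgoing" variant of the triangle multiplicative update adapted from AlphaFold2; the hidden width and gating details are assumptions.

```python
# A hedged sketch of a triangle multiplicative update (outgoing edges);
# hidden width and gating details are assumptions.
import torch
import torch.nn as nn

class TriangleMultiplicationOutgoing(nn.Module):
    def __init__(self, c_z: int, c_hidden: int = 128):
        super().__init__()
        self.norm_in = nn.LayerNorm(c_z)
        self.a, self.b = nn.Linear(c_z, c_hidden), nn.Linear(c_z, c_hidden)
        self.gate_a, self.gate_b = nn.Linear(c_z, c_hidden), nn.Linear(c_z, c_hidden)
        self.norm_out = nn.LayerNorm(c_hidden)
        self.out = nn.Linear(c_hidden, c_z)
        self.gate_out = nn.Linear(c_z, c_z)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (s, r, r, c_z) pair representation
        z_in = self.norm_in(z)
        a = torch.sigmoid(self.gate_a(z_in)) * self.a(z_in)
        b = torch.sigmoid(self.gate_b(z_in)) * self.b(z_in)
        # Edge (i, j) is updated from edges (i, k) and (j, k) closing a
        # triangle, so any two edges constrain the third.
        t = torch.einsum("sikc,sjkc->sijc", a, b)
        return torch.sigmoid(self.gate_out(z_in)) * self.out(self.norm_out(t))
```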
C. Equivariant denoising
V. EXPERIMENTS
A. Training process
1. Data curation
We curated a broad range of protein structures, including both ligand-bound (holo) and ligand-free (apo) forms, sourced from two key repositories: PDBBind v202053 and CATH 4.2.54 PDBBind v2020 offers a diverse collection of protein–ligand complexes, while CATH 4.2 provides a substantial repository of protein structures. This strategic selection of datasets ensures our model is exposed to a wide and varied spectrum of protein–ligand interactions and structural configurations, enabling comprehensive evaluation against diverse inverse folding benchmarks. By training on both holo and apo structures, our approach imbues the model with a robust understanding of protein–ligand dynamics to navigate the complexities of unseen protein–ligand interactions.
To ensure robust model training and evaluation, we partitioned the datasets using MMseqs2.77 The protein sets were clustered for training, validation, and testing to maintain sequence similarities between 40% and 50% and ensure unbiased training and predictions; similar protocols have been implemented in other protein models.25,48 For ligands, we clustered based on the Tanimoto similarity of Morgan fingerprints78 of the ligand structures. Incorporating CATH 4.2 data into PDBBind not only preserves the objectivity of the train/test/validation partitions but also substantially decreases the similarities within ligand sets, as shown in Table I.
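As a small illustration of the ligand clustering criterion, the snippet below computes the Tanimoto similarity of Morgan fingerprints with RDKit; the radius and bit count are assumptions, not values reported here.

```python
# A small sketch of Tanimoto similarity over Morgan fingerprints with RDKit;
# fingerprint radius and bit count are illustrative assumptions.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a: str, smiles_b: str, radius: int = 2, n_bits: int = 2048) -> float:
    fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_a), radius, nBits=n_bits)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_b), radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

# e.g., tanimoto("CCO", "CCN") returns a similarity in [0, 1]
```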
Table I. Average similarities between train/validation/test partitions for protein sequences (top) and ligands (bottom); values in parentheses are after incorporating CATH 4.2.

| Protein | Validation | Test |
|---|---|---|
| Train | 36.0% (36.2%) | 38.0% (42.2%) |
| Validation | ⋯ | 39.08% (43.5%) |

| Ligand | Validation | Test |
|---|---|---|
| Train | 72.2% (36.1%) | 9.41% (3.11%) |
| Validation | ⋯ | 9.37% (3.17%) |
Table II provides an overview of the partitioning details, facilitating a clear understanding of the distribution of samples across different subsets of the dataset.
Table II. Number of samples in each dataset partition.

| Dataset | Train | Validation | Test |
|---|---|---|---|
| PDBBind v2020 | 9430 | 552 | 207 |
| CATH 4.2 | 15261 | 939 | ⋯ |
-
PDBBind v2020: For consistency and comparability with previous studies, we first adhered to the test/training/validation split settings outlined in the established literature,79 specifically following the configurations defined in the respective sources for the PDBBind v2020 dataset.80 We then filtered out highly similar sequences (above 95%) to keep the average similarities between 40% and 50%.
-
CATH 4.2: We deliberately focused on proteins from the CATH 4.2 database with fewer than 400 amino acids and pairwise sequence similarity below 90%. This selective criterion prioritizes smaller proteins, which often represent more druggable targets of interest in drug discovery and development. During both the training and validation phases, SMILES strings of CATH 4.2 proteins were represented as asterisks (masked tokens) to denote unspecified ligands. Notably, CATH 4.2 was excluded from the test set due to the absence of the corresponding ligands required for evaluating protein–ligand interactions.
2. Loss functions
a. Weighted sum of relative differences
b. Kullback–Leibler divergence ($\mathcal{L}_{\mathrm{KL}}$)
This component quantifies the divergence between the model's predictions and actual sequence data at time step t − 1. Defined as $\mathcal{L}_{\mathrm{KL}} = \mathbb{E}_q\!\left[D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right)\right]$, it contrasts the predicted distribution, $p_\theta(x_{t-1} \mid x_t)$, against the true sequence distribution, $q(x_{t-1} \mid x_t, x_0)$, leveraging the diffusion process's β parameter for temporal adjustment. This loss is also applied in the Protein Generator5 model to ensure the model's predictions progressively align with actual data distributions, enhancing the accuracy of sequence and structure generation by minimizing the expected divergence.81
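Assuming the sequence-space distributions are categorical over the 20 amino acid types, a minimal sketch of this KL term follows; the shapes and the clamping constant are illustrative, not the paper's exact implementation.

```python
# A minimal sketch of the sequence-space KL term; shapes and the clamping
# constant are illustrative assumptions.
import torch
import torch.nn.functional as F

def kl_sequence_loss(pred_logits: torch.Tensor, true_probs: torch.Tensor) -> torch.Tensor:
    # pred_logits: (B, L, 20) model logits at step t-1
    # true_probs:  (B, L, 20) diffusion-posterior probabilities q(x_{t-1} | x_t, x_0)
    log_p = F.log_softmax(pred_logits, dim=-1)
    # KL(q || p_theta), summed over amino acid classes, averaged over positions.
    kl = (true_probs * (true_probs.clamp_min(1e-8).log() - log_p)).sum(dim=-1)
    return kl.mean()
```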
c. Cross-entropy loss ($\mathcal{L}_{\mathrm{CE}}$)
This loss function is crucial for the accurate prediction of protein sequences, aligning them with the ground truth through effective classification. It denoises each amino acid from masked latent embedding to a specific class, leveraging categorical cross-entropy to rigorously penalize discrepancies between the model's predicted probability distributions and the actual distributions for each amino acid type.
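A short sketch of this term follows, under the assumption that the loss is averaged over the masked positions only.

```python
# A sketch of the cross-entropy term over masked positions; the restriction
# to masked positions is an assumption about the training setup.
import torch
import torch.nn.functional as F

def masked_ce_loss(pred_logits: torch.Tensor, target_aa: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    # pred_logits: (B, L, 20); target_aa: (B, L) class indices; mask: (B, L) bool
    per_pos = F.cross_entropy(pred_logits.transpose(1, 2), target_aa,
                              reduction="none")  # (B, L)
    mask = mask.float()
    return (per_pos * mask).sum() / mask.sum().clamp_min(1.0)
```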
3. Training performance
Throughout the training phase, we monitored the model's training and validation losses, as shown in Fig. 4. While the training loss consistently diminished, indicating effective learning, the validation loss exhibited more variability. Despite these fluctuations, the validation loss showed an overall downward trend, suggesting that the model improves its generalization over time. The alignment between the downward trends of the training and validation losses indicates that the model learns effectively without significant overfitting.
B. Evaluation process
1. Ligand binding affinity (LBA)
Ligand binding affinity is a fundamental measure that quantifies the strength of the interaction between a protein and a ligand. This metric is crucial as it directly influences the effectiveness and specificity of potential therapeutic agents; higher affinity often translates to increased drug efficacy and lower chances of side effects.82 Within this context, ProteinReDiff is evaluated on its ability to generate protein sequences for significantly improved binding affinity with specific ligands. We utilize a docking score-based approach for this assessment, where the docking score serves as a quantitative indicator of affinity. Expressed in kcal/mol, these scores inversely relate to binding strength—lower scores denote stronger, more desirable binding interactions.
2. Sequence diversity
3. Structure preservation
Structural preservation is paramount in the redesign of proteins, ensuring that essential functional and structural characteristics are maintained post-modification. To measure structural preservation between the original and redesigned proteins, we use three key metrics: the template modeling score (TM score),85 the root mean square deviation (RMSD),86 and the contact overlap (CO).87 Together, these metrics provide a comprehensive assessment of structural integrity and similarity.
a. The root mean square deviation (RMSD)
b. TM score
c. Contact overlap (CO)
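For reference, the sketch below computes two of these metrics: Cα RMSD (assuming the two structures are already superposed and residue-aligned) and a simple contact overlap at an 8 Å cutoff. The cutoff and the normalization are assumptions rather than the exact definitions used here.

```python
# A hedged sketch of Cα RMSD and contact overlap; the 8 Å cutoff and the
# normalization by the reference contact count are assumptions.
import numpy as np

def rmsd(coords_a: np.ndarray, coords_b: np.ndarray) -> float:
    # coords_*: (n_residues, 3) superposed, residue-aligned Cα coordinates
    return float(np.sqrt(((coords_a - coords_b) ** 2).sum(axis=1).mean()))

def contact_overlap(coords_a: np.ndarray, coords_b: np.ndarray,
                    cutoff: float = 8.0) -> float:
    def contact_map(x: np.ndarray) -> np.ndarray:
        d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        return d < cutoff
    ca, cb = contact_map(coords_a), contact_map(coords_b)
    # Fraction of reference contacts preserved in the redesigned structure.
    return float((ca & cb).sum() / max(ca.sum(), 1))
```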
4. Experimental setup
To evaluate ProteinReDiff, we employed OmegaFold90 to predict the three-dimensional structures of all designed protein sequences. We chose OmegaFold over AF2 because it can more accurately fold proteins with low similarity to existing proteomes, making it suitable for proteins lacking available ligand-binding conformations. Next, we utilized AutoDock Vina91 to conduct docking simulations and evaluate the binding affinity between the redesigned proteins and their respective ligands based on the predicted 3D structures. To ensure fair comparisons and mitigate potential biases introduced by pre-docked structures, we aligned our redesigned protein structures with reference structures before docking. This step is crucial because the use of pre-docked structures may favor certain conformations, leading to inaccurate evaluations. Additionally, to provide context for our results, we compared the binding scores of our redesigned proteins not only with those of the original proteins but also with proteins generated by other protein design models. While these models may differ in sequence characteristics from those optimized for ligand binding, comparing their scores provides insights into the relationship between protein sequence, structure, and ligand interactions, deepening our understanding of protein–ligand dynamics.
a. Benchmark model selection
In selecting benchmark models for performance comparison, we focused on state-of-the-art approaches, particularly those relevant to protein design tasks. Traditionally, protein design has been primarily based on inverse folding, utilizing protein structure information. Our choices encompass a range of methodologies:
-
MIF,48 MIF-ST,48 and ProteinMPNN6 are notable for generating sequences with high identity and experimental significance, utilizing protein structure information.
-
The Protein Generator,5 a representative of RoseTTAFold models,44 employs diffusion-based methods, making it an intriguing comparative candidate. The model also shares a similar sequence-space loss function, $\mathcal{L}_{\mathrm{KL}}$, with our model but diverges in modules and training procedures (i.e., stochastic masking).
-
ESMIF,49 belonging to the ESM model family,59 stands as another competitive benchmark, emphasizing the generation of high-quality sequences.
-
CARP,7 while lacking ligand information, shares similar protein input and output characteristics with our models, warranting its inclusion for comparison.
-
DPL,24 originally geared toward protein–ligand complex generation, was adapted for our purposes by modifying loss functions and incorporating a sequence prediction module, given its alignment with our model architecture.
-
LigandMPNN,17 the model most similar to our task of designing ligand-binding proteins, requires binding pocket information, unlike our model, which emphasizes a simplified yet effective approach to ligand-binding protein tasks.
Our model's design prioritizes simplicity in input while achieving effectiveness in output for ligand-binding protein tasks. For a comprehensive comparison of input–output dynamics across each model, please consult Table III.
Table III. Input–output comparison across models.

| Model | Protein sequence (input) | Protein structure (input) | Ligand SMILES (input) | Binding pocket (input) | Protein sequence (output) | Protein structure (output) | Ligand structure (output) |
|---|---|---|---|---|---|---|---|
| CARP7 | ✓ | × | × | × | ✓ | × | × |
| ESMIF49 | × | ✓ | × | × | ✓ | × | × |
| MIF48 | ✓ | ✓ | × | × | ✓ | × | × |
| MIF-ST48 | ✓ | ✓ | × | × | ✓ | × | × |
| ProteinMPNN6 | × | ✓ | × | × | ✓ | × | × |
| LigandMPNN17 | × | ✓ | ✓ | ✓ | ✓ | × | × |
| Protein generator5 | × | ✓ | × | × | ✓ | × | × |
| DPL24 | ✓ | × | ✓ | × | × | ✓ | ✓ |
| ProteinReDiff (ours) | ✓ | × | ✓ | × | ✓ | ✓ | ✓ |
5. Results and discussion
We conducted a comprehensive evaluation of ProteinReDiff, detailed in Table IV and visualized in Fig. 6, across the metrics of ligand binding affinity, sequence diversity, and structure preservation. These evaluations provide a clear depiction of the model's performance relative to established baselines and within its variations.
Table IV. Evaluation across ligand binding affinity (LBA), sequence diversity, and structure preservation (TM score, RMSD, and contact overlap).

| Category | Method | LBA (kcal/mol) ↓ | Sequence diversity ↑ | TM score ↑ | RMSD (Å) ↓ | CO ↑ |
|---|---|---|---|---|---|---|
| Baseline | CARP7 | −5.658 ± 0.301 | 185.532 | 0.850 ± 0.023 | 3.768 ± 0.553 | 0.922 ± 0.003 |
| | MIF48 | −5.518 ± 0.381 | 185.600 | 0.877 ± 0.020 | 2.986 ± 0.468 | 0.938 ± 0.002 |
| | MIF-ST48 | −5.596 ± 0.330 | 185.584 | 0.872 ± 0.021 | 3.026 ± 0.451 | 0.937 ± 0.003 |
| | ESMIF49 | −5.555 ± 0.326 | 187.512 | 0.837 ± 0.021 | 4.000 ± 0.501 | 0.915 ± 0.003 |
| | ProteinMPNN6 | −5.423 ± 0.225 | 188.792 | 0.714 ± 0.026 | 6.806 ± 0.616 | 0.859 ± 0.004 |
| | LigandMPNN17 | −5.717 ± 0.287 | 191.384 | 0.782 ± 0.024 | 4.512 ± 0.668 | 0.915 ± 0.008 |
| | Protein generator5 | −5.674 ± 0.266 | 186.962 | 0.806 ± 0.022 | 4.431 ± 0.523 | 0.899 ± 0.003 |
| | DPL24 | −5.551 ± 0.459 | 188.139 | 0.788 ± 0.024 | 5.094 ± 0.537 | 0.896 ± 0.009 |
| Reference cases | | −5.847 ± 0.263 | ⋯ | ⋯ | ⋯ | ⋯ |
| ProteinReDiff (ours) | 5% masking | −5.805 ± 0.252 | 185.935 | 0.864 ± 0.022 | 3.197 ± 0.470 | 0.942 ± 0.007 |
| | 15% masking | −6.803 ± 0.329 | 186.627 | 0.845 ± 0.023 | 3.690 ± 0.508 | 0.935 ± 0.007 |
| | 30% masking | −5.769 ± 0.244 | 187.877 | 0.803 ± 0.024 | 4.467 ± 0.544 | 0.916 ± 0.008 |
| | 40% masking | −5.617 ± 0.366 | 188.600 | 0.756 ± 0.026 | 5.639 ± 0.625 | 0.896 ± 0.008 |
| | 60% masking | −5.467 ± 0.318 | 190.425 | 0.305 ± 0.024 | 18.056 ± 0.773 | 0.735 ± 0.010 |
| | 70% masking | −5.470 ± 0.199 | 187.291 | 0.147 ± 0.004 | 23.197 ± 0.497 | 0.689 ± 0.007 |
For ProteinReDiff, we aimed to capture the diverse conformations of ligand-binding proteins, recognizing that they can adopt multiple structural states. To assess these conformations, we employed alignment metrics such as TM score, RMSD, and contact overlap (CO). In Fig. 5, we present several instances where the contact overlap is maintained even though the RMSD is large and the TM score is low. This discrepancy suggests that while global alignment metrics like TM score and RMSD may not adequately capture the domain shifts within these complex ensembles, the preservation of local motifs, as indicated by contact overlap, remains crucial in our framework. This underscores the importance of capturing both global and local structural features for a comprehensive understanding of protein–ligand interactions.
A pivotal observation from our study is ProteinReDiff's unparalleled ability to enhance ligand binding affinity, particularly at a 15% masking ratio (Fig. 6). This configuration not only surpasses the performance of inverse folding (IF) models and the original DPL framework but also exceeds the binding efficiencies of the original protein designs. By incorporating attention modules from AlphaFold2, ProteinReDiff effectively captures the complex interplay between proteins and ligands, demonstrating its superiority over the original DPL model. While other masking ratios within ProteinReDiff show varying degrees of effectiveness, lower ratios, though on par with the reference, do not achieve the peak LBA performance observed at 15%. For instance, the 5% masked model emphasizes structural consistency with a high TM score and low RMSD, but does not exhibit the same level of binding capability as the 15% masking. These findings are also consistent with the ablation studies in Appendix C. Conversely, higher masking ratios fail to strike the necessary balance between introducing beneficial modifications and maintaining functional precision, underscoring the importance of optimizing the masking ratio.
Our analysis of sequence diversity and structure preservation metrics reveals a delicate balance essential in protein redesign. The 15% masking ratio, identified as optimal for enhancing ligand binding affinity in our model, also aligns closely with benchmark methods in both sequence diversity and structure preservation. For instance, LigandMPNN excels in sequence diversity but faces challenges in obtaining binding pocket inputs for various design tasks, unlike our approach. Moreover, our models (at 30% and 40% masking) significantly outperform others in contact overlap, which is crucial for diversifying structures while preserving functional motifs in protein redesign tasks. This equilibrium underscores ProteinReDiff's ability to optimize ligand interactions without compromising the exploration of sequence diversity or the integrity of the original protein structures.
In contrast, extreme values in either sequence diversity or structure preservation, which could be seen in other masking ratios, do not lead to optimal ligand binding affinities. This finding highlights an inverse relationship between pushing the limits of diversity and preservation and achieving the primary goal of binding enhancement. Thus, the 15% masking ratio not only stands out for its ability to significantly improve ligand binding affinity but also for maintaining a balanced approach, ensuring that enhancements in functionality do not detract from the protein's structural and functional viability.
In Fig. 7, we compare the ligand-binding affinity (LBA) of original and redesigned proteins by ProteinReDiff. The redesigned proteins maintain their original folds while significantly enhancing LBA. In ablation studies (Sec. V B 6), we can apply various masking strategies to adjust both sequence diversity and structural integrity. This approach has potential applications in different settings to control the affinity of ligand binders.
6. Ablation studies
Here, we conducted thorough ablation studies on ProteinReDiff's model architecture, featurization, and masking ratios. For the complete ablation setup, please refer to Table VII (Appendix C).
a. Interpreting model architecture
We trained ablated versions of ProteinReDiff without the SRA or OPU modules and compared them to the original DPL model. Initially designed for generating ensembles of complex structures, DPL was adapted for targeted protein redesign by adding sequence-based loss functions to generate new target sequences.
In Fig. 8, we computed the performance score by averaging the five evaluation metrics introduced in Secs. V B 1–V B 3. Since sequence diversity is not within the [0, 1] range, we applied min-max normalization. For LBA and RMSD, we used inverse normalization so that a score closer to 1.0 indicates better model performance. The average score is then compared with that of the baseline ProteinReDiff, which was trained without any ablations.
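A minimal sketch of this aggregation follows, assuming each metric is collected as a vector of per-model values; the epsilon guard is illustrative.

```python
# A minimal sketch of the aggregate performance score: min-max normalize each
# metric across models, invert LBA and RMSD (lower is better), then average.
import numpy as np

def aggregate_score(metrics: dict) -> np.ndarray:
    # metrics: name -> per-model values, e.g. keys
    # {"lba", "diversity", "tm_score", "rmsd", "co"}
    def minmax(x: np.ndarray) -> np.ndarray:
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    scores = []
    for name, values in metrics.items():
        s = minmax(np.asarray(values, dtype=float))
        if name in ("lba", "rmsd"):  # lower is better -> invert after scaling
            s = 1.0 - s
        scores.append(s)
    return np.mean(scores, axis=0)  # one composite score per model
```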
We observed that our model outperformed DPL by a large margin. Incorporating just the OPU module (without the SRA module) already yields better performance than DPL, indicating OPU's ability to exchange insights between single and pair representations. First, the equivariant loss function is parameterized on the structural space, making the pairwise representations from the OPU critical to that loss. Second, without OPU, the model performs poorly on TM score (the bottom brown line in Fig. 11, Appendix C), which measures global structural preservation. Additionally, introducing SRA alone without OPU hurts model performance, suggesting the model becomes over-parameterized because the SRA updates act primarily on the sequence representation. Therefore, combining the OPU and SRA modules provides an effective approach for enhancing the representational learning of ProteinReDiff. A complete comparative assessment is presented in Table IV and Appendix C.
b. Ablations on input featurization methods
We conducted ablation studies to evaluate different input featurization methods, including manual feature engineering for ligands and the use of ESM-2 as a pre-trained LLM (Large Language Model) for protein featurization.
We gradually reduced ligand features, starting with ligand distance and bond information (e.g., bond types and ring membership), and eventually omitted all bonds or the entire ligand. In Fig. 8, omitting bond features and distances caused a smaller reduction in model performance than omitting the entire ligand. Ligand bond information is crucial for the model to learn the relative positions of ligand atoms and adhere to the geometric constraints imposed by the triangular update module (Sec. IV B 3).
We observed a significant decrease in model performance when ESM embeddings were excluded (the red bar in Fig. 8). The ESM features alone (the brown bar) significantly boosted performance when training without ligand data, as these embeddings are enriched with protein evolutionary and biophysical information needed for both single and pair representations. Other protein features, such as position encodings and amino acid types, provided slight improvements, though they were minimal. However, excluding ligand information led to a reduction in model performance compared to the baseline, as the model relies on learning the overall structure of the complexes.
Therefore, using pre-trained featurization methods, such as ESM and other protein BERT-like models, in combination with ligand input, significantly enhances model training and performance.
c. Impact of masking ratios
We examined ProteinReDiff's performance with various percentages of masked amino acids, adjusting the masking ratio as a hyperparameter and retraining our model. In Fig. 9, we observed consistent top performance across the metrics with masking ratios between 5% and 15%. This range is crucial for the protein redesign strategy, enhancing binding affinity while preserving the structural and functional motifs of the target protein. The 15% masking ratio achieved the best ligand binding affinity, the most important metric for capturing protein function.
Interestingly, we noticed performance spikes for 50% masking in contact overlap and TM-score. This is because applying stochastic masks allows the model to learn representations with varied masking from 0 up to the set ratio. Although the 50% masking does not surpass the 15% masking's performance, the improvement in the high masking regime demonstrates the robustness of our training scheme.
Overall, this investigation highlights the optimal level of sequence masking needed to enhance ligand binding affinity, sequence diversity, and structural preservation. It also reinforces the training strategies for protein redesign discussed in Sec. V B 5.
VI. CONCLUSIONS
This study introduces ProteinReDiff, a computational framework developed to redesign ligand-binding proteins. By utilizing advanced techniques inspired by Equivariant Diffusion-Based Generative Models and the attention mechanism from AlphaFold2, ProteinReDiff demonstrates its ability to enhance complex protein–ligand interactions. Our model excels in optimizing ligand binding affinity based solely on initial protein sequences and ligand SMILES strings, bypassing the need for detailed structural data. Experimental validations highlight ProteinReDiff's capability to improve ligand binding affinity while preserving essential sequence diversity and structural integrity. These findings open new possibilities for protein–ligand complex modeling, indicating significant potential for ProteinReDiff in various biotechnological and pharmaceutical applications.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Viet Thanh Duy Nguyen: Data curation (equal); Formal analysis (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Nhan D. Nguyen: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Truong Son Hy: Conceptualization (lead); Funding acquisition (lead); Investigation (lead); Methodology (lead); Project administration (lead); Resources (lead); Software (equal); Supervision (lead); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are openly available in Ref. 106 and from the corresponding author upon reasonable request.
APPENDIX A: BENCHMARKING PROTEINREDIFF AGAINST RELATED MODELS
These plots, shown in Fig. 10, demonstrate the comparative performance of ProteinReDiff against other relevant models. The results indicate that our model consistently ranks among the high performers.
APPENDIX B: EVALUATING PROTEIN–LIGAND COMPLEX REPRESENTATION
Continuing our exploration of protein–ligand complex representations, we extended the use of the PDBBind v2020 dataset,53 previously detailed in our training process, to specifically evaluate the effectiveness of the Input Featurizer from ProteinReDiff. Using embeddings generated by the Input Featurizer as input features, we trained a Gaussian process (GP) model to predict ligand binding affinity. The choice of a GP model, recognized for its probabilistic nature and adaptability to the nuanced, uncertain dynamics of biological interactions, was pivotal in assessing how well the embeddings capture predictive information about protein–ligand interactions. The GP model employed a Gaussian likelihood, suitable for regression tasks, along with a radial basis function (RBF) kernel, chosen for its effectiveness in modeling the smooth, continuous variations characteristic of binding affinities. The GP model's parameters were optimized to ensure a robust fit to the training data.
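A hedged sketch of this probe using scikit-learn follows; the kernel hyperparameters and the added white-noise term are illustrative assumptions, and X/y stand in for the Input Featurizer embeddings and measured affinities.

```python
# A hedged sketch of the GP affinity probe with an RBF kernel; hyperparameters
# and the placeholder data are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))   # placeholder complex embeddings
y = rng.normal(size=100)         # placeholder binding affinities

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X, y)
mean, std = gp.predict(X[:5], return_std=True)  # predictive mean and uncertainty
```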
The evaluation results in Table V demonstrate the performance of embeddings generated by the Input Featurizer on the PDBBind v2020 dataset compared to baseline methods. Notably, these embeddings achieved the highest Pearson correlation (0.721) for predicting ligand binding affinity, highlighting the Input Featurizer's effectiveness in capturing meaningful protein–ligand interactions. This strong performance is further supported by competitive RMSE, MAE, and Spearman correlation metrics.
Table V. Ligand binding affinity prediction performance on the PDBBind v2020 dataset.

| Approach | RMSE ↓ | MAE ↓ | Pearson ↑ | Spearman ↑ |
|---|---|---|---|---|
| Pafnucy92 | 1.435 | 1.144 | 0.635 | 0.587 |
| OnionNet93 | 1.403 | 1.103 | 0.648 | 0.602 |
| IGN94 | 1.404 | 1.116 | 0.662 | 0.638 |
| SIGN95 | 1.373 | 1.086 | 0.685 | 0.656 |
| SMINA96 | 1.466 | 1.161 | 0.665 | 0.663 |
| GNINA97 | 1.740 | 1.413 | 0.495 | 0.494 |
| dMaSIF98 | 1.450 | 1.136 | 0.629 | 0.588 |
| TankBind99 | 1.345 | 1.060 | 0.718 | ⋯ |
| GraphDTA100 | 1.564 | 1.223 | 0.612 | 0.570 |
| TransCPI101 | 1.493 | 1.201 | 0.604 | 0.551 |
| MolTrans102 | 1.599 | 1.271 | 0.539 | 0.474 |
| DrugBAN103 | 1.480 | 1.159 | 0.657 | 0.612 |
| DGraphDTA104 | 1.493 | 1.201 | 0.604 | 0.551 |
| WGNN-DTA105 | 1.501 | 1.196 | 0.605 | 0.562 |
| STAMP-DPI104 | 1.503 | 1.176 | 0.653 | 0.601 |
| PSICHIC80 | ⋯ | ⋯ | 0.710 | 0.686 |
| ProteinReDiff (ours) | 1.443 | 1.168 | 0.721 | 0.639 |
APPENDIX C: ABLATION STUDIES
Here, we present additional results from the mask and feature ablation studies. Figure 11 illustrates the performance of ablated models across five key metrics. The impact of different mask ratios on validation and test set metrics is summarized in Table VI. For each model, Table VII specifies the features included or excluded, while Table VIII highlights the resulting effects of these feature ablations on performance.
Table VI. Impact of masking ratios on validation and test metrics.

| Mask ratio | LBA ↓ (valid) | Seq. div. ↑ (valid) | TM-score ↑ (valid) | RMSD ↓ (valid) | CO ↑ (valid) | LBA ↓ (test) | Seq. div. ↑ (test) | TM-score ↑ (test) | RMSD ↓ (test) | CO ↑ (test) |
|---|---|---|---|---|---|---|---|---|---|---|
| 5% | −4.602 ± 0.377 | 87.252 | 0.555 ± 0.023 | 8.225 ± 0.510 | 0.788 ± 0.008 | −6.058 ± 0.182 | 180.800 | 0.734 ± 0.025 | 6.685 ± 0.629 | 0.879 ± 0.010 |
| 10% | −4.410 ± 0.541 | 89.472 | 0.598 ± 0.022 | 7.808 ± 0.544 | 0.873 ± 0.008 | −6.101 ± 0.194 | 184.564 | 0.739 ± 0.027 | 7.108 ± 0.784 | 0.883 ± 0.010 |
| 15% | −4.890 ± 0.303 | 89.601 | 0.581 ± 0.022 | 8.252 ± 0.537 | 0.867 ± 0.008 | −6.202 ± 0.167 | 184.925 | 0.729 ± 0.025 | 7.257 ± 0.768 | 0.877 ± 0.010 |
| 30% | −4.596 ± 0.257 | 90.643 | 0.453 ± 0.022 | 10.707 ± 0.604 | 0.820 ± 0.008 | −5.553 ± 0.188 | 181.978 | 0.221 ± 0.015 | 21.166 ± 0.740 | 0.707 ± 0.009 |
| 40% | −4.668 ± 0.281 | 89.091 | 0.297 ± 0.016 | 14.309 ± 0.497 | 0.768 ± 0.008 | −5.794 ± 0.286 | 185.136 | 0.390 ± 0.024 | 15.014 ± 0.717 | 0.750 ± 0.011 |
| 50% | −4.052 ± 1.162 | 90.445 | 0.390 ± 0.020 | 10.886 ± 0.424 | 0.788 ± 0.009 | −6.034 ± 0.177 | 188.163 | 0.567 ± 0.029 | 10.239 ± 0.688 | 0.807 ± 0.012 |
| 60% | −4.678 ± 0.262 | 88.643 | 0.226 ± 0.011 | 14.142 ± 0.337 | 0.729 ± 0.007 | −5.981 ± 0.258 | 184.356 | 0.243 ± 0.017 | 18.092 ± 0.525 | 0.702 ± 0.009 |
| 70% | −4.214 ± 0.264 | 81.333 | 0.165 ± 0.004 | 18.226 ± 0.456 | 0.733 ± 0.007 | −5.360 ± 0.175 | 162.841 | 0.145 ± 0.004 | 24.944 ± 0.646 | 0.689 ± 0.008 |
Table VII. Features and modules included (✓) in each ablated model.

| Category | Feature | No bond distance | No bond feats | No bond | No ligand | No ligand, only ESM | No ESM | No SRA | No OPU | DPL (No SRA/OPU) |
|---|---|---|---|---|---|---|---|---|---|---|
| Ligand | Bond distance | | ✓ | | | | ✓ | ✓ | ✓ | ✓ |
| Ligand | Bond feats (type, ring, etc.) | ✓ | | | | | ✓ | ✓ | ✓ | ✓ |
| Ligand | Ligand atom feats (chirality, charge, degree, etc.) | ✓ | ✓ | ✓ | | | ✓ | ✓ | ✓ | ✓ |
| Protein | ESM embeddings | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ |
| Protein | Residue feats (pos. encodings, res. type) | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ |
| Model architecture | Single representation attention (SRA) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ | |
| Model architecture | Outer product update (OPU) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
Table VIII. Impact of feature and module ablations on the five evaluation metrics.

| Features | LBA ↓ | Sequence diversity ↑ | TM-score ↑ | RMSD ↓ | CO ↑ |
|---|---|---|---|---|---|
| Reference | −4.890 ± 0.303 | 89.601 | 0.581 ± 0.022 | 8.252 ± 0.537 | 0.877 ± 0.008 |
| No bond | −4.549 ± 0.272 | 84.837 | 0.287 ± 0.016 | 14.325 ± 0.491 | 0.761 ± 0.009 |
| No bond distance | −4.869 ± 0.277 | 90.186 | 0.447 ± 0.021 | 11.068 ± 0.579 | 0.821 ± 0.008 |
| No bond feats | −4.985 ± 0.289 | 85.974 | 0.475 ± 0.022 | 9.748 ± 0.476 | 0.811 ± 0.010 |
| No ESM | −2.723 ± 0.176 | 32.222 | 0.136 ± 0.007 | 37.322 ± 1.032 | 0.748 ± 0.008 |
| No ligand | −4.478 ± 0.252 | 87.723 | 0.324 ± 0.018 | 14.125 ± 0.571 | 0.780 ± 0.008 |
| No OPU | −3.197 ± 0.304 | 71.669 | 0.102 ± 0.006 | 40.969 ± 1.246 | 0.723 ± 0.004 |
| No SRA | −4.878 ± 0.282 | 87.054 | 0.424 ± 0.023 | 11.391 ± 0.556 | 0.810 ± 0.009 |
| DPL | −4.153 ± 0.631 | 86.379 | 0.311 ± 0.019 | 13.931 ± 0.527 | 0.744 ± 0.009 |
| No ligand, only ESM | −4.429 ± 0.270 | 88.481 | 0.390 ± 0.020 | 13.108 ± 0.702 | 0.813 ± 0.007 |