Molecular dynamics (MD) is an extremely powerful, highly effective, and widely used approach to understanding the nature of chemical processes in atomic details for proteins. The accuracy of results from MD simulations is highly dependent on force fields. Currently, molecular mechanical (MM) force fields are mainly utilized in MD simulations because of their low computational cost. Quantum mechanical (QM) calculation has high accuracy, but it is exceedingly time consuming for protein simulations. Machine learning (ML) provides the capability for generating accurate potential at the QM level without increasing much computational effort for specific systems that can be studied at the QM level. However, the construction of general machine learned force fields, needed for broad applications and large and complex systems, is still challenging. Here, general and transferable neural network (NN) force fields based on CHARMM force fields, named CHARMM-NN, are constructed for proteins by training NN models on 27 fragments partitioned from the residue-based systematic molecular fragmentation (rSMF) method. The NN for each fragment is based on atom types and uses new input features that are similar to MM inputs, including bonds, angles, dihedrals, and non-bonded terms, which enhance the compatibility of CHARMM-NN to MM MD and enable the implementation of CHARMM-NN force fields in different MD programs. While the main part of the energy of the protein is based on rSMF and NN, the nonbonded interactions between the fragments and with water are taken from the CHARMM force field through mechanical embedding. The validations of the method for dipeptides on geometric data, relative potential energies, and structural reorganization energies demonstrate that the CHARMM-NN local minima on the potential energy surface are very accurate approximations to QM, showing the success of CHARMM-NN for bonded interactions. However, the MD simulations on peptides and proteins indicate that more accurate methods to represent protein–water interactions in fragments and non-bonded interactions between fragments should be considered in the future improvement of CHARMM-NN, which can increase the accuracy of approximation beyond the current mechanical embedding QM/MM level.
I. INTRODUCTION
Molecular dynamics (MD) simulation is an effective and widely used technique to investigate proteins, such as protein folding and unfolding, protein–ligand binding, and protein conformational and structural analysis.1–5 One of the most important factors related to the accuracy of results from MD simulations is the force field.6–8 The traditional molecular mechanical (MM) force fields outperform the quantum mechanical (QM) or QM/MM force fields because they have much simpler energy representations, which are much more efficient on large protein systems. The potential energy described by MM force fields usually consists of bond, angle, dihedral, improper dihedral, and non-bonded energies such as electrostatic energy and van der Waals energy.9 The parameters of all energy terms are generally fitted to experimental data, such as macroscopic properties, spectroscopic and crystallographic data, and quantum mechanical (QM) calculations.10 Four major families of general force fields, AMBER,11–14 CHARMM,15–18 GROMOS,19–22 and OPLS,23–26 are widely used in MD simulations, and many improvements have been achieved to increase the applicability of these force fields over the years. In addition to the traditional fixed-charge MM force fields, polarizable force fields were developed to account for the polarization effect27 based on fluctuating charges,28 Drude oscillators,29 inducible dipoles,30 and methods including multipole electrostatics.31
Even though the current protein force fields are applicable to many problems, they are still not accurate enough and could fail in some situations.32,33 For example, protein folding pathways and thermodynamics cannot be identified very well,34,35 and the strength of protein–water interactions is not balanced well compared to protein–protein interactions.36,37 In addition, most force fields have deficiencies in simulating intrinsically disordered proteins (IDPs), which lack well-defined three-dimensional structures but have full functionalities.38,39 Since an IDP can only be characterized with an ensemble of flexible interconverting conformations, the MD simulation approach would be an ideal approach.40–42 This suggests the urgency of developing more accurate force fields. The current force fields generally generate a too compact ensemble of IDP,43–46 and two types of strategies were generally applied to improve force fields for IDP.47,48 The first one is adjusting backbone dihedral or energy correction map (CMAP) parameters to avoid the high propensity of secondary structures for α-helix or β-sheet, such as ff14IDPSFF,49 CHARMM36IDPSFF,50,51 OPLSIDPSFF,52 RSFF2C,53 and ESFF.54 The second strategy is refining protein–water interactions by adjusting non-bonded parameters to achieve a better description of protein–water interactions, such as a99SB-UCB,55 ff03ws,37 and KBFF.56–58 Moreover, specific coarse-grained force fields were also developed for IDP,59,60 such as AWSEM-IDP61 and MOFF-IDP.62 With the continuous development of force fields, the simulation results for IDP were improved, but there are still unsolved issues, such as the limitation of studying the post-translation modifications in IDP and the difficulty of obtaining balanced local and global structural features.41,48 This is caused by the simplicity and limitations of force field forms in MM representations. According to an investigation on the structural reorganization energy defined by the energy difference between the MM optimal structure and the QM optimal structure, the minimization of QM optimal structures with the major AMBER, CHARMM, GROMOS, and OPLS force fields all resulted in large reorganization energies, which suggests that the MM force field forms need to be improved.63
Recently, machine learning techniques were widely and successfully integrated into force field development because of their capability to describe potential energy surfaces (PESs) at a high level of theory at a low computational cost.64–68 Although machine learning methods can be utilized to adjust MM force field parameters,69–72 our focus is primarily on the construction of general force fields with machine learning models because of the potential improvement of functional form. The common machine learning models used to build force fields are divided into two categories: kernel-based methods and artificial neural networks (NNs). One example of kernel-based methods is the Gaussian approximation potential, which is approximated by the Gaussian process regression on different descriptors and kernels,73 such as local atomic density and the SOAP kernel.74 Another kernel-based method is gradient domain machine learning (GDML), which is trained to predict the force directly rather than the energy.75–78 For artificial NNs, high-dimensional neural network potential (HDNNP) was the first descriptor-based neural network potential (NNP) that uses atomic-centered symmetry functions as descriptors,79,80 and subsequent developments included the consideration of long-range interaction and non-local phenomena.81–83 Some different NNPs were also developed based on different descriptors such as ANI,84 TensorMol,85 deep potential,86 QM/MM-NN,87,88 water NN,89 and embedded atom NN.90,91 Another type of NN is end-to-end NN, which directly uses the Cartesian coordinates and nuclear charges as inputs instead of descriptors.92 End-to-end NN is usually based on graph NN,93 like message-passing NN,94 and some prominent examples are deep tensor NN,95,96 SchNet,97,98 and PhysNet.99,100 Machine learning provides a powerful and promising way to develop force fields with QM accuracy and excellent efficiency.
Despite the success of machine learning force fields in a plethora of chemical problems, including electronic effects, thermodynamics, reactions, and spectroscopies,101 the development of general, transferable, and scalable force fields for proteins is still very challenging for the following reasons. First, to obtain plenty of data for the training of a machine learning model, the calculation of QM reference energy is infeasible for proteins since they could have thousands of atoms. Second, the size of proteins for MD simulations could range from short peptides to large proteins, which increases the difficulty of training a general machine learning force field that is applicable to different systems. To address these issues, fragmentation methods can be utilized to systematically express the total energy of large systems with the individual energy of small fragments.102 Some common fragmentation methods include divide and conquer,103 many-body expansion,104–107 systematic molecular fragmentation (SMF),108–110 the effective fragment potential approach,111 electrostatically embedded generalized molecular fractionation with conjugate caps (EE-GMFCC),112–115 the X-pol method,116–119 and the energy-based fragmentation method.120,121 In 2019, Hao and Yang developed residue-based systematic molecular fragmentation (rSMF) to enable the construction of general fragments for any proteins, and the neural network force field (NNFF) was trained on glycine and alanine dipeptides to demonstrate the transferability to mixed alanine and glycine polypeptides.122 More machine learned force fields based on similar fragmentation methods were developed,123–125 but a comprehensive and transferable force field that can be applied to general protein simulations is still not available.
In this work, we apply the rSMF method to obtain fragments that are trained with NN and develop an NNFF based on the CHARMM force field, named CHARMM-NN. The CHARMM-NN is based on atom types, and the input features are similar to the variables in CHARMM energy terms, such as bonds, angles, dihedrals, and non-bonded features, which is highly compatible with the original MM simulations. The CHARMM-NN is first validated by the geometric data and structural reorganization energies, and more simulations of peptides and proteins are performed to test the quality of this force field.
II. METHODS
A. Residue-based systematic molecular fragmentation
We approach the construction of the protein force field by partitioning the protein into embedded fragments and developing the NN description of fragments with input data from QM calculations.122 Based on the original SMF method,108,126 residue-based systematic molecular fragmentation (rSMF) was developed by Wang and Yang in 2019,122 which generates fragments that can be applied to any protein. The fragments are based on amino acid residues, and different levels of rSMF can be applied. The rSMF at level 1 generates dipeptides, defined by one side chain group and two peptide bonds, as the basic fragments. A higher level of rSMF can be applied to increase the accuracy of fragmentation, but the sampling effort as well as the computation of QM energies and forces will increase significantly. Therefore, the rSMF used in this work is at level 1. Using the rSMF method, a protein with N residues capping with an acetyl (ACE) group on the N terminus and an N-methyl amide (NME) group on the C terminus can be divided into N dipeptides and N − 1 ACE-NME fragments. As an example, a tripeptide P3 is fragmented, as shown in Fig. 1, where A1 and A2 are the corresponding dipeptides for each amino acid in the tripeptide, and ACE-NME is the peptide bond between A1 and A2.
The test of the rSMF method on homogeneous and heterogeneous alanine and glycine polypeptides suggests that this method can provide stability of errors with increasing size, which is highly desirable and necessary in protein force field development since the protein size could be arbitrarily large. Based on this behavior, we construct NN models for the ACE-NME fragment and 24 dipeptide fragments, including three dipeptides for different protonation states of His (Hse, Hsd, and Hsp), two protonated states of negative charged amino acid dipeptides (AspH and GluH), and 19 other natural amino acid dipeptides (Gly, Ala, Val, Cys, Pro, Leu, Ile, Met, Trp, Phe, Ser, Thr, Tyr, Asn, Gln, Lys, Arg, Asp, and Glu). Even though these fragments are enough for some proteins, many proteins contain disulfide bonds formed by two cysteines, which cannot be described by the fragments. Therefore, to enable the applicability on more proteins, we complement the rSMF method with two more fragments: one is the dipeptide fragment of cysteine with an SMe group connected to the S atom, named Cys-SMe, and another is the dimethyl disulfide (DMDS) fragment that accounts for the overlap between two cysteines that form a disulfide bond. As shown in Fig. 2, the molecule of two cysteines connected by a disulfide bond can be fragmented into two Cys-SMe dipeptides and a DMDS fragment. The CHARMM-NN includes 27 fragments in total, and each fragment has its own NN model.
The NN correction energy is the sum of the energies of all dipeptide fragments minus the energies from ACE-NME and DMDS, and the total energy is the sum of the MM energy and the NN correction energy. This is an approximation because the non-bonded interactions between the capped hydrogen atoms and other atoms in the dipeptide fragment and ACE-NME fragment are not completely the same, even though the additional bond, angle, and dihedral energy terms that are related to the capped hydrogen atoms can be exactly canceled each other. Based on this approximation, we can run normal MD simulations with C36m force fields and apply the additional correction from NNFF, which is very convenient for integration into existing MD programs without the requirement to modify core codes.
B. Atom type based NN and MM based input features
C. Computational details
All MD simulations to test CHARMM-NN on dipeptides, peptides, and proteins were performed in the modified GROMACS 2021.6,144 in which the NN corrections were calculated with the TensorFlow C API, and all QM energies and forces were calculated by Gaussian 16.145 The basic MD parameters, such as cutoff scheme, temperature coupling, and pressure coupling, are the same as they were in the adaptive samplings. In the simulations to validate CHARMM-NN, the bonds involving hydrogens were constrained by the LINCS algorithm146 with a fourth order and one iteration. The integration time step was, therefore, set at 1 or 2 fs. All systems were first solvated by creating a cubic water box with the length of the longest distance in the protein plus 2 nm buffer and neutralizing it with enough sodium or chloride counter-ions. Next, the solvated systems were minimized with the steepest descent algorithm and equilibrium with a 500 ps NVT simulation using a v-rescale thermostat followed by a 1 ns NPT simulation using a Parrinello–Rahman barostat. Starting from the QM-optimized structures, the conformers of all dipeptides were minimized with the steepest descent algorithm for up to 50 000 steps for C36m and CHARMM-NN. The structural reorganization energies were calculated by subtracting the energy at the last step of minimization, which is the MM-optimized structure, from the energy at the first step of minimization, which is the QM-optimized structure. All production MD simulations were performed in solution with the TIP3P model, and a v-rescale thermostat and Parrinello–Rahman barostat were used. For the simulations of Gly3, Ala3, Val3, and Ala4, the total simulation time was set as 1 µs, in which the first 100 ns were discarded and the coordinates were saved every 10 ps. The J-coupling constants were calculated with Karplus equations, and the parameters were extracted following the previous work.147 Slightly different results would be obtained if different sets of Karplus parameters are applied.
All data related to this work can be found at https://figshare.com/projects/CHARMM-NN/163756. The energy, force, and coordinates of the sampled structures, the CHARMM-NN force field parameters, the codes to run MD simulations with CHARMM-NN, training data and scripts, MD simulation files, and analysis scripts are provided.
III. RESULTS AND DISCUSSION
. | Data . | E train . | F train . | E test . | F test . | R2 . |
---|---|---|---|---|---|---|
ACE-NME | 2 499 | 0.44 | 1.71 | 0.49 | 1.84 | 0.99 |
DMDS | 1 180 | 0.16 | 1.05 | 0.13 | 0.93 | 1.00 |
Gly | 6 381 | 0.57 | 2.08 | 0.78 | 2.16 | 0.99 |
Ala | 7 836 | 0.67 | 2.09 | 0.85 | 2.16 | 0.98 |
Val | 12 620 | 0.86 | 2.29 | 1.09 | 2.37 | 0.98 |
Cys | 10 629 | 1.04 | 2.53 | 1.29 | 2.64 | 0.98 |
Pro | 5 829 | 0.74 | 2.18 | 1.02 | 2.29 | 0.97 |
Leu | 12 493 | 0.84 | 2.16 | 1.12 | 2.27 | 0.97 |
Ile | 13 705 | 0.84 | 2.20 | 1.06 | 2.28 | 0.98 |
Met | 11 128 | 0.90 | 2.23 | 1.18 | 2.27 | 0.97 |
Trp | 14 275 | 0.80 | 2.30 | 1.15 | 2.32 | 0.98 |
Phe | 11 713 | 0.84 | 2.70 | 1.12 | 2.77 | 0.98 |
Hse | 10 882 | 0.91 | 2.28 | 1.23 | 2.35 | 0.97 |
Hsd | 10 428 | 0.83 | 2.25 | 1.13 | 2.28 | 0.98 |
Cys-SMe | 11 510 | 0.96 | 2.30 | 1.30 | 2.47 | 0.97 |
Ser | 9 167 | 0.81 | 2.29 | 1.00 | 2.35 | 0.98 |
Thr | 14 717 | 0.89 | 2.47 | 1.14 | 2.64 | 0.97 |
Tyr | 10 023 | 0.88 | 2.83 | 1.22 | 2.92 | 0.97 |
Asn | 11 005 | 0.90 | 2.53 | 1.19 | 2.64 | 0.98 |
Gln | 13 009 | 0.91 | 2.34 | 1.17 | 2.37 | 0.97 |
Lys | 20 008 | 1.34 | 3.00 | 2.08 | 3.79 | 0.96 |
Arg | 23 587 | 1.35 | 2.81 | 1.86 | 2.97 | 0.97 |
Hsp | 15 491 | 0.73 | 2.16 | 1.07 | 2.20 | 0.99 |
Asp | 11 278 | 1.04 | 2.92 | 1.80 | 3.35 | 0.98 |
Glu | 12 387 | 1.07 | 2.91 | 1.68 | 3.50 | 0.98 |
AspH | 11 926 | 0.89 | 2.42 | 1.13 | 2.47 | 0.98 |
GluH | 12 381 | 0.90 | 2.35 | 1.15 | 2.34 | 0.96 |
Average | 0.86 | 2.35 | 1.16 | 2.48 | 0.98 |
. | Data . | E train . | F train . | E test . | F test . | R2 . |
---|---|---|---|---|---|---|
ACE-NME | 2 499 | 0.44 | 1.71 | 0.49 | 1.84 | 0.99 |
DMDS | 1 180 | 0.16 | 1.05 | 0.13 | 0.93 | 1.00 |
Gly | 6 381 | 0.57 | 2.08 | 0.78 | 2.16 | 0.99 |
Ala | 7 836 | 0.67 | 2.09 | 0.85 | 2.16 | 0.98 |
Val | 12 620 | 0.86 | 2.29 | 1.09 | 2.37 | 0.98 |
Cys | 10 629 | 1.04 | 2.53 | 1.29 | 2.64 | 0.98 |
Pro | 5 829 | 0.74 | 2.18 | 1.02 | 2.29 | 0.97 |
Leu | 12 493 | 0.84 | 2.16 | 1.12 | 2.27 | 0.97 |
Ile | 13 705 | 0.84 | 2.20 | 1.06 | 2.28 | 0.98 |
Met | 11 128 | 0.90 | 2.23 | 1.18 | 2.27 | 0.97 |
Trp | 14 275 | 0.80 | 2.30 | 1.15 | 2.32 | 0.98 |
Phe | 11 713 | 0.84 | 2.70 | 1.12 | 2.77 | 0.98 |
Hse | 10 882 | 0.91 | 2.28 | 1.23 | 2.35 | 0.97 |
Hsd | 10 428 | 0.83 | 2.25 | 1.13 | 2.28 | 0.98 |
Cys-SMe | 11 510 | 0.96 | 2.30 | 1.30 | 2.47 | 0.97 |
Ser | 9 167 | 0.81 | 2.29 | 1.00 | 2.35 | 0.98 |
Thr | 14 717 | 0.89 | 2.47 | 1.14 | 2.64 | 0.97 |
Tyr | 10 023 | 0.88 | 2.83 | 1.22 | 2.92 | 0.97 |
Asn | 11 005 | 0.90 | 2.53 | 1.19 | 2.64 | 0.98 |
Gln | 13 009 | 0.91 | 2.34 | 1.17 | 2.37 | 0.97 |
Lys | 20 008 | 1.34 | 3.00 | 2.08 | 3.79 | 0.96 |
Arg | 23 587 | 1.35 | 2.81 | 1.86 | 2.97 | 0.97 |
Hsp | 15 491 | 0.73 | 2.16 | 1.07 | 2.20 | 0.99 |
Asp | 11 278 | 1.04 | 2.92 | 1.80 | 3.35 | 0.98 |
Glu | 12 387 | 1.07 | 2.91 | 1.68 | 3.50 | 0.98 |
AspH | 11 926 | 0.89 | 2.42 | 1.13 | 2.47 | 0.98 |
GluH | 12 381 | 0.90 | 2.35 | 1.15 | 2.34 | 0.96 |
Average | 0.86 | 2.35 | 1.16 | 2.48 | 0.98 |
Next, we examined the detailed performance of CHARMM-NN on dipeptides. One typical system is alanine dipeptide, which has many conformations and has received systematic studies.148 The graphical structure of Ala dipeptide is shown in Fig. 3. We first compare the geometric information optimized from C36m, CHARMM-NN, and B3LYP. As displayed in Table II, the CHARMM-NN outperforms C36m on all three Ala conformers, C7eq, C7ax, and C5. Most angles and dihedral angles from CHARMM-NN are closer to the B3LYP geometries compared to C36m, and almost all bonds optimized from CHARMM-NN are better than C36m. The large corrections occur on C4–C5 and C5–N7 bonds, for which the bond lengths from C36m are generally much smaller than B3LYP structures, and on the C9–C11 bond, which typically has a large bond length from C36m.
. | C7eq . | C7ax . | C5 . | ||||||
---|---|---|---|---|---|---|---|---|---|
C36m . | CHARMM-NN . | B3LYP . | C36m . | CHARMM-NN . | B3LYP . | C36m . | CHARMM-NN . | B3LYP . | |
Dihedral angles (deg) | |||||||||
ϕ | −77.1 | −78.9 | −82.3 | 78.0 | 75.9 | 73.4 | −158.1 | −159.5 | −158.1 |
ψ | 71.3 | 73.6 | 74.5 | −56.8 | −55.8 | −56.1 | 161.5 | 161.7 | 161.8 |
ω1 | 178.2 | 179.4 | −178.7 | 174.0 | 174.5 | 175.4 | 179.3 | 178.8 | 178.2 |
ω2 | −179.7 | −175.4 | −174.7 | −179.3 | −177.8 | −178.6 | 179.2 | 179.0 | 177.4 |
Bonds (Å) | |||||||||
C4–C5 | 1.481 | 1.517 | 1.515 | 1.480 | 1.519 | 1.516 | 1.482 | 1.519 | 1.518 |
C5–N7 | 1.341 | 1.361 | 1.360 | 1.343 | 1.360 | 1.358 | 1.337 | 1.360 | 1.361 |
C5–O6 | 1.224 | 1.236 | 1.236 | 1.225 | 1.237 | 1.237 | 1.223 | 1.231 | 1.232 |
N7–C9 | 1.449 | 1.467 | 1.467 | 1.457 | 1.473 | 1.475 | 1.441 | 1.450 | 1.449 |
C9–C11 | 1.543 | 1.525 | 1.523 | 1.547 | 1.537 | 1.535 | 1.545 | 1.542 | 1.540 |
C9–C12 | 1.530 | 1.551 | 1.548 | 1.524 | 1.543 | 1.544 | 1.516 | 1.531 | 1.533 |
C12–O13 | 1.229 | 1.231 | 1.231 | 1.228 | 1.232 | 1.233 | 1.230 | 1.231 | 1.232 |
C12–N17 | 1.346 | 1.356 | 1.356 | 1.346 | 1.352 | 1.351 | 1.348 | 1.356 | 1.355 |
N17–C19 | 1.444 | 1.456 | 1.454 | 1.444 | 1.452 | 1.453 | 1.445 | 1.456 | 1.457 |
N7–H8 | 0.992 | 1.012 | 1.011 | 0.993 | 1.010 | 1.010 | 0.996 | 1.014 | 1.015 |
N17–H18 | 1.002 | 1.018 | 1.017 | 1.004 | 1.017 | 1.019 | 0.994 | 1.011 | 1.010 |
Angles (deg) | |||||||||
C4–C5–N7 | 117.0 | 116.5 | 116.4 | 116.4 | 116.3 | 116.1 | 116.8 | 115.8 | 115.9 |
C5–N7–C9 | 123.6 | 123.2 | 123.1 | 126.1 | 125.9 | 126.8 | 122.6 | 122.0 | 122.2 |
N7–C9–C12 | 112.9 | 110.7 | 110.0 | 114.0 | 113.5 | 114.2 | 108.3 | 107.0 | 107.1 |
C9–C12–N17 | 116.9 | 114.4 | 114.0 | 117.5 | 116.2 | 116.6 | 117.5 | 115.4 | 115.7 |
C12–N17–C19 | 122.8 | 121.7 | 121.4 | 122.9 | 121.8 | 121.3 | 121.8 | 121.7 | 121.7 |
. | C7eq . | C7ax . | C5 . | ||||||
---|---|---|---|---|---|---|---|---|---|
C36m . | CHARMM-NN . | B3LYP . | C36m . | CHARMM-NN . | B3LYP . | C36m . | CHARMM-NN . | B3LYP . | |
Dihedral angles (deg) | |||||||||
ϕ | −77.1 | −78.9 | −82.3 | 78.0 | 75.9 | 73.4 | −158.1 | −159.5 | −158.1 |
ψ | 71.3 | 73.6 | 74.5 | −56.8 | −55.8 | −56.1 | 161.5 | 161.7 | 161.8 |
ω1 | 178.2 | 179.4 | −178.7 | 174.0 | 174.5 | 175.4 | 179.3 | 178.8 | 178.2 |
ω2 | −179.7 | −175.4 | −174.7 | −179.3 | −177.8 | −178.6 | 179.2 | 179.0 | 177.4 |
Bonds (Å) | |||||||||
C4–C5 | 1.481 | 1.517 | 1.515 | 1.480 | 1.519 | 1.516 | 1.482 | 1.519 | 1.518 |
C5–N7 | 1.341 | 1.361 | 1.360 | 1.343 | 1.360 | 1.358 | 1.337 | 1.360 | 1.361 |
C5–O6 | 1.224 | 1.236 | 1.236 | 1.225 | 1.237 | 1.237 | 1.223 | 1.231 | 1.232 |
N7–C9 | 1.449 | 1.467 | 1.467 | 1.457 | 1.473 | 1.475 | 1.441 | 1.450 | 1.449 |
C9–C11 | 1.543 | 1.525 | 1.523 | 1.547 | 1.537 | 1.535 | 1.545 | 1.542 | 1.540 |
C9–C12 | 1.530 | 1.551 | 1.548 | 1.524 | 1.543 | 1.544 | 1.516 | 1.531 | 1.533 |
C12–O13 | 1.229 | 1.231 | 1.231 | 1.228 | 1.232 | 1.233 | 1.230 | 1.231 | 1.232 |
C12–N17 | 1.346 | 1.356 | 1.356 | 1.346 | 1.352 | 1.351 | 1.348 | 1.356 | 1.355 |
N17–C19 | 1.444 | 1.456 | 1.454 | 1.444 | 1.452 | 1.453 | 1.445 | 1.456 | 1.457 |
N7–H8 | 0.992 | 1.012 | 1.011 | 0.993 | 1.010 | 1.010 | 0.996 | 1.014 | 1.015 |
N17–H18 | 1.002 | 1.018 | 1.017 | 1.004 | 1.017 | 1.019 | 0.994 | 1.011 | 1.010 |
Angles (deg) | |||||||||
C4–C5–N7 | 117.0 | 116.5 | 116.4 | 116.4 | 116.3 | 116.1 | 116.8 | 115.8 | 115.9 |
C5–N7–C9 | 123.6 | 123.2 | 123.1 | 126.1 | 125.9 | 126.8 | 122.6 | 122.0 | 122.2 |
N7–C9–C12 | 112.9 | 110.7 | 110.0 | 114.0 | 113.5 | 114.2 | 108.3 | 107.0 | 107.1 |
C9–C12–N17 | 116.9 | 114.4 | 114.0 | 117.5 | 116.2 | 116.6 | 117.5 | 115.4 | 115.7 |
C12–N17–C19 | 122.8 | 121.7 | 121.4 | 122.9 | 121.8 | 121.3 | 121.8 | 121.7 | 121.7 |
Besides the geometric evaluation, we also compared the relative potential energies and reorganization energies between the C36m and CHARMM-NN force fields. As shown in Table III, six Ala conformers, C7eq, C7ax, C5, αL, β2, and α′, were calculated because the optimization with B3LYP can be obtained for these conformers so that direct comparison can be executed. The optimized backbone dihedrals from QM calculations are shown in Table III for all conformers. According to B3LYP calculations, the C7eq conformer has the lowest energy, which is selected as the baseline for comparison, and the relative energies are listed in Table III, in which the C5 and α′ conformers have the lowest and highest relative energies, respectively. For the conformers minimized with C36m, the errors of relative energies are between 0.8 and 1.4 kcal/mol for C7ax, αL, and α′ conformers compared to B3LYP. For C5 and β2 conformers, the errors are up to 2.29 and 1.92 kcal/mol, respectively, and the C36m optimized C5 tends to be slightly more stable than C7eq. Using CHARMM-NN, the relative energies are very consistent with B3LYP results for all conformers, and the largest error is only 0.2 kcal/mol. The accurate relative energies from CHARMM-NN demonstrate that the local minima on the PES of Ala dipeptide are close to QM PES, which can be very important for the simulation samplings being similar to QM results. In addition to the relative potential energies, we also compared the reorganization energies between C36m and CHARMM-NN for these six conformers of Ala dipeptide. In Table III, we can observe that the reorganization energies with respect to B3LYP range from 12.63 to 31.55 kJ/mol for C36m but only between 0.93 and 2.39 kJ/mol for CHARMM-NN. The minimized structures from CHARMM-NN are much closer to B3LYP optimal structures than C36m.
Conformers (ϕ, ψ) . | Relative potential energies . | Reorganization energies . | |||
---|---|---|---|---|---|
C36m . | CHARMM-NN . | B3LYP . | C36m . | CHARMM-NN . | |
C7eq (−82.3, 74.5) | 0.00 | 0.00 | 0.00 | 16.99 | 1.71 |
C7ax (73.4, −56.1) | 1.27 | 2.13 | 2.10 | 14.03 | 2.10 |
C5 (−158.1, 161.8) | −0.67 | 1.75 | 1.62 | 12.63 | 1.37 |
αL (69.0, 22.3) | 4.18 | 5.10 | 5.30 | 22.49 | 1.57 |
β2 (−105.3, 9.8) | 1.21 | 3.12 | 3.13 | 20.35 | 2.39 |
α′ (−166.3, −43.1) | 5.41 | 6.80 | 6.80 | 31.55 | 0.93 |
Conformers (ϕ, ψ) . | Relative potential energies . | Reorganization energies . | |||
---|---|---|---|---|---|
C36m . | CHARMM-NN . | B3LYP . | C36m . | CHARMM-NN . | |
C7eq (−82.3, 74.5) | 0.00 | 0.00 | 0.00 | 16.99 | 1.71 |
C7ax (73.4, −56.1) | 1.27 | 2.13 | 2.10 | 14.03 | 2.10 |
C5 (−158.1, 161.8) | −0.67 | 1.75 | 1.62 | 12.63 | 1.37 |
αL (69.0, 22.3) | 4.18 | 5.10 | 5.30 | 22.49 | 1.57 |
β2 (−105.3, 9.8) | 1.21 | 3.12 | 3.13 | 20.35 | 2.39 |
α′ (−166.3, −43.1) | 5.41 | 6.80 | 6.80 | 31.55 | 0.93 |
Besides the few tests of reorganization energies for Ala dipeptides with respect to B3LYP, which is the reference QM method that we train the CHARMM-NN on, more thorough validations of the reorganization energies were performed on the YMPJ conformer database,149,150 which includes the conformers of dipeptides optimized with MP2/cc-PVTZ. As shown in Table IV, the 14 neutral dipeptides have different numbers of conformers at energy minima. For C36m, the reorganization energies for almost all dipeptides are larger than 30 kJ/mol, except for the Gly dipeptide (24.33 kJ/mol), which is the smallest dipeptide. The reorganization energies can be higher than 40 kJ/mol for Gln, Met, Thr, and Tyr dipeptides. For CHARMM-NN, the reorganization energies for most dipeptides are below 10 kJ/mol, except for Gln, Leu, Met, and Phe dipeptides. The Met dipeptide, which includes the most conformers, has the largest reorganization energy (12.07 kJ/mol), but it is still much better than the C36m results. The average reorganization energy for C36m is 38.82 kJ/mol, which is close to the work from König and Riniker.63 Even though the CHARMM-NN is trained with B3LYP, it outperforms C36m on the dataset of MP2/cc-PVTZ as well, and the average reorganization energy is only 9.01 kJ/mol. The low reorganization energy with respect to QM is critical to obtaining accurate thermodynamic properties like free energy that are close to QM since the expected variance is exponentially related to the reorganization energies for the free-energy estimation.63
. | No. minima . | C36m . | CHARMM-NN . |
---|---|---|---|
Ala | 10 | 32.72 | 7.07 |
Asn | 12 | 36.76 | 7.40 |
Cys | 23 | 36.83 | 7.36 |
Gln | 20 | 41.55 | 10.25 |
Gly | 8 | 24.33 | 6.87 |
Ile | 24 | 39.63 | 7.70 |
Leu | 26 | 37.83 | 10.02 |
Met | 56 | 41.37 | 12.07 |
Phe | 26 | 39.34 | 10.15 |
Pro | 5 | 33.10 | 7.07 |
Ser | 23 | 35.94 | 7.10 |
Thr | 17 | 42.96 | 8.13 |
Tyr | 16 | 42.04 | 6.72 |
Val | 14 | 39.88 | 7.38 |
Average | 38.82 | 9.01 |
. | No. minima . | C36m . | CHARMM-NN . |
---|---|---|---|
Ala | 10 | 32.72 | 7.07 |
Asn | 12 | 36.76 | 7.40 |
Cys | 23 | 36.83 | 7.36 |
Gln | 20 | 41.55 | 10.25 |
Gly | 8 | 24.33 | 6.87 |
Ile | 24 | 39.63 | 7.70 |
Leu | 26 | 37.83 | 10.02 |
Met | 56 | 41.37 | 12.07 |
Phe | 26 | 39.34 | 10.15 |
Pro | 5 | 33.10 | 7.07 |
Ser | 23 | 35.94 | 7.10 |
Thr | 17 | 42.96 | 8.13 |
Tyr | 16 | 42.04 | 6.72 |
Val | 14 | 39.88 | 7.38 |
Average | 38.82 | 9.01 |
To validate the performance of CHARMM-NN in MD simulations, the J-coupling constants were calculated from the MD simulations for Gly3, Ala3, Val3, and Ala4, and the results are shown in Table V. For Gly3, the J-coupling constants are close to the experimental values for residues 2 and 3. The largest error is on 2J(N, Cα), which has a deviation of 2.66 and 1.38 Hz from experiments for residue 2 and residue 3, respectively. However, the experimental values of 10.45 and 9.05 Hz can never be obtained because the maximum value is less than 9 Hz with the parameters for the Karplus equation. For Ala3, the deviations for 3J(HN, Hα), 3J(Hα, C′), and 3J(HN, Cβ) are around 0.82 to 1.01 Hz for residue 2, but they are much better for residue 3. Other J-coupling constants have similar errors for residue 2 and residue 3, and the largest deviation is 2.00 Hz on 2J(N, Cα) for residue 2. For Val3, the 3J(HN, Hα) is pretty good for residue 2 with a difference of 0.40 Hz, but it is terrible for residue 3, which has an error of 4.32 Hz. In contrast, the 1J(N, Cα) and 2J(N, Cα) have a better agreement with experiments on residue 3 compared to residue 2. For Ala4, the results are similar to Ala3 in that the 3J(HN, Hα), 3J(Hα, C′), and 3J(HN, Cβ) have less errors from residue 2 to residue 4. Overall, Gly3 has good results for all J-coupling constants except 2J(N, Cα). The Ala3 and Ala4 have similar results, suggesting that the simulations on systems with different sizes provide stable results for CHARMM-NN. The results of Val3 are good for some J-coupling constants, but they can also have a large error on others. In addition, MM force fields still have better agreement with experimental J-coupling values than the current CHARMM-NN, mainly because it only has corrections for intramolecular interactions of dipeptides, while the protein–protein and protein–solvent non-covalent interactions have not been addressed yet.
. | 3J(HN, Hα) . | 3J(HN, C′) . | 3J(Hα, C′) . | 3J(C′, C′) . | 3J(HN, Cβ) . | 1J(N, Cα) . | 2J(N, Cα) . | 3J(HN, Cα) . | ||
---|---|---|---|---|---|---|---|---|---|---|
Gly3 | Residue 2 | CHARMM-NN | 6.04 | 0.48 | 4.28 | 0.76 | 2.26 | 11.02 | 7.79 | 0.61 |
Exp. | 5.89 | 1.10 | 4.01 | 0.26 | ∖ | 12.17 | 10.45 | 0.78 | ||
Residue 3 | CHARMM-NN | 6.04 | 0.48 | 4.25 | 0.77 | 2.28 | 11.18 | 7.67 | 0.59 | |
Exp. | 5.87 | 0.99 | 3.90 | ∖ | ∖ | 12.77 | 9.05 | 0.61 | ||
Ala3 | Residue 2 | CHARMM-NN | 6.69 | 1.63 | 2.66 | 0.91 | 1.56 | 10.44 | 7.14 | 0.48 |
Exp. | 5.68 | 1.13 | 1.84 | 0.25 | 2.39 | 11.34 | 9.14 | 0.70 | ||
Residue 3 | CHARMM-NN | 6.88 | 1.66 | 2.60 | 0.96 | 1.53 | 10.48 | 7.05 | 0.49 | |
Exp. | 6.52 | 1.29 | 2.14 | ∖ | 2.02 | 11.47 | 8.45 | 0.65 | ||
Val3 | Residue 2 | CHARMM-NN | 8.34 | 3.04 | 2.45 | 1.86 | 0.56 | 9.62 | 6.88 | 0.58 |
Exp. | 7.94 | 0.58 | 2.42 | 0.34 | 1.38 | 10.80 | 8.35 | 0.77 | ||
Residue 3 | CHARMM-NN | 3.59 | 3.39 | 3.55 | 1.01 | 0.54 | 11.06 | 7.24 | 0.16 | |
Exp. | 7.91 | 1.01 | 2.45 | ∖ | 1.40 | 11.02 | 7.80 | 0.75 | ||
Ala4 | Residue 2 | CHARMM-NN | 6.45 | 1.57 | 2.73 | 0.87 | 1.63 | 10.47 | 7.11 | 0.46 |
Exp. | 5.62 | 1.15 | 1.87 | 0.19 | 2.36 | 11.39 | 9.17 | 0.68 | ||
Residue 3 | CHARMM-NN | 6.57 | 1.53 | 2.69 | 0.86 | 1.64 | 10.40 | 7.14 | 0.48 | |
Exp. | 5.89 | 1.11 | 1.95 | ∖ | 2.24 | 11.33 | 8.56 | 0.60 | ||
Residue 4 | CHARMM-NN | 6.90 | 1.62 | 2.72 | 0.95 | 1.52 | 10.44 | 6.97 | 0.48 | |
Exp. | 6.56 | 1.26 | 2.24 | ∖ | 1.99 | 11.53 | 8.37 | 0.60 |
. | 3J(HN, Hα) . | 3J(HN, C′) . | 3J(Hα, C′) . | 3J(C′, C′) . | 3J(HN, Cβ) . | 1J(N, Cα) . | 2J(N, Cα) . | 3J(HN, Cα) . | ||
---|---|---|---|---|---|---|---|---|---|---|
Gly3 | Residue 2 | CHARMM-NN | 6.04 | 0.48 | 4.28 | 0.76 | 2.26 | 11.02 | 7.79 | 0.61 |
Exp. | 5.89 | 1.10 | 4.01 | 0.26 | ∖ | 12.17 | 10.45 | 0.78 | ||
Residue 3 | CHARMM-NN | 6.04 | 0.48 | 4.25 | 0.77 | 2.28 | 11.18 | 7.67 | 0.59 | |
Exp. | 5.87 | 0.99 | 3.90 | ∖ | ∖ | 12.77 | 9.05 | 0.61 | ||
Ala3 | Residue 2 | CHARMM-NN | 6.69 | 1.63 | 2.66 | 0.91 | 1.56 | 10.44 | 7.14 | 0.48 |
Exp. | 5.68 | 1.13 | 1.84 | 0.25 | 2.39 | 11.34 | 9.14 | 0.70 | ||
Residue 3 | CHARMM-NN | 6.88 | 1.66 | 2.60 | 0.96 | 1.53 | 10.48 | 7.05 | 0.49 | |
Exp. | 6.52 | 1.29 | 2.14 | ∖ | 2.02 | 11.47 | 8.45 | 0.65 | ||
Val3 | Residue 2 | CHARMM-NN | 8.34 | 3.04 | 2.45 | 1.86 | 0.56 | 9.62 | 6.88 | 0.58 |
Exp. | 7.94 | 0.58 | 2.42 | 0.34 | 1.38 | 10.80 | 8.35 | 0.77 | ||
Residue 3 | CHARMM-NN | 3.59 | 3.39 | 3.55 | 1.01 | 0.54 | 11.06 | 7.24 | 0.16 | |
Exp. | 7.91 | 1.01 | 2.45 | ∖ | 1.40 | 11.02 | 7.80 | 0.75 | ||
Ala4 | Residue 2 | CHARMM-NN | 6.45 | 1.57 | 2.73 | 0.87 | 1.63 | 10.47 | 7.11 | 0.46 |
Exp. | 5.62 | 1.15 | 1.87 | 0.19 | 2.36 | 11.39 | 9.17 | 0.68 | ||
Residue 3 | CHARMM-NN | 6.57 | 1.53 | 2.69 | 0.86 | 1.64 | 10.40 | 7.14 | 0.48 | |
Exp. | 5.89 | 1.11 | 1.95 | ∖ | 2.24 | 11.33 | 8.56 | 0.60 | ||
Residue 4 | CHARMM-NN | 6.90 | 1.62 | 2.72 | 0.95 | 1.52 | 10.44 | 6.97 | 0.48 | |
Exp. | 6.56 | 1.26 | 2.24 | ∖ | 1.99 | 11.53 | 8.37 | 0.60 |
We also ran several simulations on several folded proteins with CHARMM-NN. The folded proteins we tested are ubiquitin (PDB: 1UBQ151), crambin (PDB: 1EJG152), GB3 (PDB: 1P7E153), and lysozyme (PDB: 135L154). As displayed in Fig. 4, the potential of the mean force plots for the folded proteins is similar to the normal Ramachandran plots. The dominating regions include α region (−160° < ϕ < −20° and −120° < ψ < 50°) in which the right-handed α helix αR (−100° < ϕ < −30° and −67° < ψ < −7°) are sampled most, β region (−180° < ϕ < −90° and 50° < ψ < 180° plus −180° < ϕ < −90° and −180° < ψ < −120° plus 160° < ϕ < 180° and 110° < ψ < 180°), ppII region (−90° < ϕ < −20° and 50° < ψ < 180° plus −90° < ϕ < −20° and −180° < ψ < −120°), and αL region (30° < ϕ < 100° and 7° < ψ < 67°).18 The crambin and GB3 have almost all their points located in these regions, showing the reliability of CHARMM-NN in sampling conformational space. However, results for ubiquitin and lysozyme show several unfavored regions, such as the regions with 100° < ϕ < 120° for ubiquitin and the regions with 0° < ϕ < 30° for lysozyme. The non-ideal results from the Ramachandran plots of some folded proteins and the J-coupling constants of some peptides indicate that the CHARMM-NN is still not ready to be used for general proteins. The difficulty of CHARMM-NN is because of the following reasons. First, it is only trained with all data in the vacuum state, and no background charges are included in QM calculations and NN inputs; therefore, the method can only approach the limit of accuracy at the mechanical embedding QM/MM level without considering the MM charge contribution to the QM region. We do include solvent in all MD simulations of peptides and proteins, while the protein–solvent interactions are based on the MM force fields and not improved with machine learning models. Protein–water interactions are crucial to MD simulations of proteins. However, the protein–water interactions cannot be easily included in the training of NN because the solvent environment of small fragments is largely different from the solvent environment of proteins, and there is yet no general sampling method that can be applied to sample enough small fragments surrounding with solvent to resemble the actual solvent environment in all proteins. Second, the non-bonded interactions between fragments are still calculated at the MM level, which may cause the CHARMM-NN corrections to be imbalanced on proteins.
IV. CONCLUSIONS
In summary, we constructed machine learned force fields based on C36m force fields, named CHARMM-NN, by using the rSMF method to generate the elementary fragments that can form any kind of protein. The dataset was generated and enlarged by combining NMS and adaptive sampling based on prediction uncertainty. The CHARMM-NN force fields use the atom-type based NN to calculate energies and forces, and the input features can be obtained or simply derived from the traditional MM variables. The high compatibility between CHARMM-NN and MM force fields enables the convenient implementation of CHARMM-NN in all MD programs without the need to modify core codes. The training error for CHARMM-NN is low for all 27 fragments, and the validations on dipeptides demonstrate that CHARMM-NN can result in good geometric data similar to QM calculations and much lower reorganization energies than traditional MM force fields. For the MD simulations with CHARMM-NN on several peptides and proteins, some results are acceptable, but some results are not sufficiently accurate because the current CHARMM-NN can only achieve corrections at the mechanical embedding QM/MM level, suggesting that the CHARMM-NN needs to be improved further. Future directions include considering the comprehensive solvent effect in data sampling and constructing machine learned force fields to describe non-bonded interactions between fragments beyond the mechanical embedding level.
ACKNOWLEDGMENTS
This work has been supported by the National Institutes of Health (Grant No. R01-GM061870).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Pan Zhang: Data curation (lead); Formal analysis (lead); Investigation (lead); Software (lead); Validation (lead); Visualization (lead); Writing – original draft (lead). Weitao Yang: Conceptualization (lead); Funding acquisition (lead); Methodology (lead); Project administration (lead); Resources (lead); Supervision (lead); Writing – review & editing (lead).
DATA AVAILABILITY
The data that support the findings of this study are available within the article.