We describe and apply a method to extend permutationally invariant polynomial (PIP) potential energy surface (PES) fitting to molecules with more than 10 atoms. The method creates a compact basis of PIPs as the union of PIPs obtained from fragments of the molecule. An application is reported for trans-N-methyl acetamide, where B3LYP/cc-pVDZ electronic energies and gradients are used to develop a full-dimensional potential for this prototype peptide molecule. The performance of several fragmented bases is verified against a benchmark PES using all (66) Morse variables. The method appears feasible for much larger molecules.
INTRODUCTION
Potential energy surfaces (PESs) play a central role in much of computational chemistry. There has been major progress in the past 15 years in developing the so-called non-parametric, machine learning approaches to fit large data sets of electronic energies. These include permutationally invariant polynomials (PIPs), Gaussian process regression, and neural networks as well as combinations of all three. A number of reviews have appeared that cover much of this field.1–11 These methods have steep scaling with respect to the number of atoms in the molecule or cluster and the size of the data set. Several comparative studies of these methods have recently appeared.12,13 Clearly, there is major motivation to extend these methods to large molecules of interest in chemistry and materials science and certainly progress is being made along these lines for materials.7,8
Our group has developed and used PIPs1,10,14 to represent many high-dimensional PESs of molecules and molecular clusters. The approach makes use of all n(n − 1)/2 Morse variables, yij = exp(−rij/λ), where rij is the internuclear distance between atoms i and j and n is the number of atoms. In the vast majority of our applications, the range parameter λ is between 2 and 3 bohr and this is adequate to span the range of chemical interactions as small as tens of wavenumbers (tenths of kcal/mol). An important point to make about these variables for the present report is that they go rapidly to zero as rij goes to infinity.
There are many advantages in using polynomials of Morse variables in fitting PESs, e.g., a second order polynomial is the Morse potential and hence the terminology.1 However, the number of Morse variables in a molecule scales like , where n is the number of atoms. This is clearly unfavorable for large molecules. A second issue is the non-linear growth in the number of terms in the PIP basis for a fixed total polynomial order. This number can be determined using the Molien series, as discussed elsewhere.1 For example, for the 12-atom N-methyl acetamide (NMA), there are 66 Morse variables and with a reduced permutational symmetry which accounts for the permutational symmetry of the two groups of three H atoms of the two distant methyl groups (see Fig. 1), the basis contains 8040 terms (and thus, unknown linear coefficients) at maximum polynomial order of 3, but grows to 95 965 at polynomial order of 4. (Increasing the permutational symmetry reduces the number of terms dramatically;1 however, with a much increased cost to obtain them and often to also evaluate each basis function.)
We have developed two software packages to obtain the PIP basis. One is a large library of primary and secondary invariant polynomials.1 These pre-calculated polynomials can be called for molecules with as many as ten atoms. The other package, Molecular Symmetrization Approach (MSA),14 obtains the so-called symmetrized monomials using an efficient algorithm. There is no theoretical limit to the number of atoms treated with this software; however, the computational effort does increase substantially with this number and the maximum polynomial order of the basis. The MSA software was recently modified by us to incorporate gradients, as well as energies (which were used exclusively in the original MSA software) in the PIP fits. The new software is available at GitHub.15
To address the bottlenecks with respect to both the number of Morse variables and the size of the PIP basis, we recall that the PIP basis can be generated with a “seed” monomial14 given by Eq. (1)
where, for simplicity, the Morse variables are indexed by an integer. Clearly, for large molecules, many internuclear distances are large, and thus, the corresponding Morse variables are approximately zero. Thus, basis functions containing these variables are also zero and can be discarded. This simple observation is central to the new approach we propose for large molecules. One way to take advantage of this is to start with the full basis and then remove those basis polynomials with very small Morse variables. However, this requires obtaining the full fitting basis first, and this could already be prohibitive.
A more efficient approach that we take here is to directly fragment the PIP basis. We do this by generating the PIP basis of fragments of the molecule and generating the entire PIP basis from the union of these fragmented bases. It is worth noting that this approach is close in spirit to fragmentation approaches to obtain the energies of large molecules, developed by several groups.16–18 However, we stress that we are not using a fragmentation method to obtain the energies and gradients.
The details of our approach are given below, followed by an application to a PES for NMA, which is a well-studied prototype peptide and has been extensively studied in theoretical chemistry using a large variety of approaches including some involving hydration.19–24 Certainly in some contexts, NMA would not be considered a large molecule; however, as we show below, it is large enough to apply and test the new methodology. Also, a full-dimensional ab initio-based PES of NMA, which to the best of our knowledge, has not been reported. In addition, this size is still manageable using the full PIP basis, so we can compare the new methodology with the full basis. The structure of this molecule at equilibrium is shown in Fig. 1, and all the atoms are numbered for convenience. As seen, the H atoms on the two methyl rotors (atoms 1–3 and 10–12) are the most distant sets of atoms and, indeed, these are the ones with very small Morse variables. A detailed discussion is given below.
COMPUTATIONAL DETAILS
In order to test various fragmentation schemes, a full PES is determined using all Morse variables and a suitable PIP basis. Ab initio molecular dynamics (AIMD) were performed to generate the configurations for the data set. These were done at the efficient Density Functional Theory (B3LYP/cc-pVDZ) level of theory, using Molpro.25 Microcanonical sampling (NVE) was used in the AIMD calculations at five different total internal energies (1000, 5000, 10 000, 20 000, and 30 000 cm−1), and two trajectories were run for each energy. Each trajectory was propagated for 3000 steps using a step size of 5.0 au (0.12 fs). From each trajectory, configurations were collected every 10 steps so that they are more scattered in the configuration space. In total, the data set consists of energies and gradients (36 components) at 3000 configurations. This results in a moderately large dataset size of 111 000. Figure 2 shows the distribution of the potential energies of these 3000 configurations. As seen, the distribution is very broad but with concentration below 1000 cm−1. This low energy concentration results from data from the lowest energy trajectory.
The full PIP basis was generated using MSA software with a permutational symmetry of 33111111 and a maximum polynomial order of 3. This “33111111” notation indicates only the three hydrogen atoms within a methyl rotor are treated identically. The full basis consists of 8040 PIPs, and all the 66 Morse variables are involved here.
As we noted above, some of the Morse variables are very small, so PIPs involving these variables would also be very small and can be removed from the fitting basis, thus, reducing its size. We consider two fragmentation schemes to do this. The first one uses two fragments: one consists of atoms 1–9 and the other consists of atoms 4–12. Therefore, the nine H–H Morse variables between the two methyl rotors are not included in the fitting basis. That this fragmentation makes sense can be verified from Fig. 3, which shows the distribution of all the internuclear distances from the AIMD calculations, where rHH represents the H–H distances between the two methyl rotors. One can see that these H–H distances are, indeed, large and most of them are greater than 8 bohr. The corresponding Morse variables are smaller than 0.018 if a range parameter (λ) of 2.0 bohr is used. Thus, these Morse variables are very small. The second fragmentation scheme is more computationally efficient but also less precise, as shown below. This one consists of three fragments: the first contains atoms 1–6 and 8, the second consists of atoms 4–9, and the third consists of atom 5 and atoms 7–12. These two fragmentation schemes are labeled “Frag-2” and “Frag-3,” respectively, where the “2” or “3” represents the number of fragments.
Distribution of all the internuclear distances in 3000 configurations. rHH refers to the H–H distances between the two methyl rotors.
Distribution of all the internuclear distances in 3000 configurations. rHH refers to the H–H distances between the two methyl rotors.
The PIP fitting bases of the fragments were also generated using the MSA software. In Frag-2 scheme, the 3111111 permutational symmetry was used for both fragments, while in Frag-3 scheme, 31 111 was used for the two 7-atom fragments and 111111 (no permutations) was used for the 6-atom fragment. Consider Frag-2 and denote the PIP basis for each fragment as {pi} and {qj}. Thus, in scheme Frag-2, the potential is given by
i.e., the fitting basis of the molecule consists all the p’s and q’s. We note that, as written, this fitting basis has some redundancy. That is, there are some PIPs that are the same in {pi} and {qj}. These redundant PIPs correspond to terms involving the set of atoms common in the two fragments. In the present case, these are the six atoms 4–9. This number of atoms is clearly less than the number of atoms in each fragment, and so the number of redundant PIPs is a small fraction of the total PIP fragmented basis given above. The redundant PIPs can, in principle, be removed although with some non-trivial programming effort. Practically, the redundancy is not an issue as the number of redundant PIPs is relatively small, as just noted, and also because the Linear Algebra Package (LAPACK) software we use (DGELSS) to solve the least-squares problem is based on singular value decomposition, which can deal numerically with this rank deficient least-squares problem. The removal of redundant basis will be done in the future.
RESULTS AND DISCUSSION
The first set of results is shown in Table I. Here, we give the number of Morse variables and size of the fitting basis, the RMS fitting errors of the energy and gradient magnitude, and the time to generate these bases using our MSA software. The maximum polynomial order is 3 in all cases. Note that the number of Morse variables shown in the table is the unique ones in the basis functions, while the number of PIPs does not account for the redundancy, so the number of unique PIPs is less than what we report. The number of Morse variables is reduced from 66 to 57 or 45 when using Frag-2 or Frag-3, respectively. The number of PIPs is reduced from 8040 to 6056, when Frag-2 is used instead of the full one; this seems only a small reduction, but actually the number of terms for each fragment is only 3028, which is a great reduction compared with the 8040 terms in the full basis. Therefore, the time to generate the basis reduces from 210 to just 10 s due to very nonlinear scaling of computing the basis as a function of the size of the molecule. Since for large molecules, generating the invariant polynomials is one significant bottleneck, the use of a fragmented basis can dramatically reduce the computational cost to generate the basis. This large reduction in the basis in the 2-fragment scheme only causes small increase in the fitting error. This is because the interaction between the two methyl rotors in NMA is small and so, indeed, the elimination of Morse variables between them is justified. Further reduction in the basis by using 3-fragment basis leads to larger fitting errors as expected. In this basis, the Morse variables y1,7, y2,7, y3,7, y6,10, y6,11, and y6,12 are missing, but they are actually not small; their corresponding internuclear distances are around 5–6 Å. However, the cost of generating PIP basis is greatly reduced and it is almost negligible. In addition, this fitting error is actually reasonably small if we consider the large energy range (roughly 0–30 000 cm−1) of the data set. As shown below, this “minimal” basis performs very well for several key properties of the PES.
Number of Morse variables and coefficients, RMS fitting errors of energies (RMSE, in cm−1) and gradients (RMSG, in cm−1/bohr), and time (in seconds) to generate the fitting basis.
Basis . | NMorse . | NPIP . | RMSE . | RMSG . | Time . |
---|---|---|---|---|---|
Full | 66 | 8040 | 26.8 | 54.7 | 210 |
Frag-2 | 57 | 6056 | 34.3 | 67.4 | 10 |
Frag-3 | 45 | 1974 | 148.9 | 171.9 | <1 |
Basis . | NMorse . | NPIP . | RMSE . | RMSG . | Time . |
---|---|---|---|---|---|
Full | 66 | 8040 | 26.8 | 54.7 | 210 |
Frag-2 | 57 | 6056 | 34.3 | 67.4 | 10 |
Frag-3 | 45 | 1974 | 148.9 | 171.9 | <1 |
A point worth making about smaller fragments is that a higher maximum polynomial order can be used because the number of coefficients is greatly reduced by the fragmentation. For example, if we increase the maximum polynomial order to 4, the fitting error of Frag-3 could be reduced to 97 cm−1 although it is still larger than that of the Frag-2 with polynomial order of 3 because some relatively large Morse variables are still missing in the fitting basis. A second point concerns the “missing” Morse variables. Even with fragmentation, these could be added back to the fitting basis. For example, we can add a fragment H3⋯H3 to Frag-2 and all the Morse variables are included in the basis. This indeed could reduce the fitting error for NMA, however, only marginally because the interaction between the two methyl rotors are very small and adding back those Morse variables does not make a big difference.
In addition to the reduction in the time of generating the fitting basis, the fragment basis also speeds up the evaluation of potential and gradients since there are fewer terms in the fragmented fitting basis. We tested the timing of potential and gradient evaluations on a batch of 10 000 geometries, and the times are 12.0, 4.4, and 0.7 s when using the full, Frag-2 and Frag-3 basis, respectively. All of these PESs are much faster than the direct B3LYP/cc-pVDZ calculations using Molpro,25 which takes 12 s for a single point calculation of the energy and gradient and so roughly 120 000 s for 10 000 evaluations.
Next, we examine the accuracy of the harmonic frequencies of the various PESs relative to the direct B3LYP results. These are given in Table II. As seen, the frequencies from the two fragmented-basis PESs are in good agreement with those from the full PES. Also, agreement with the direct B3LYP frequencies is overall very good. Note, the full and Frag-2 PESs produce one small imaginary frequency instead of the very small real frequency, corresponding to the torsion of the methyl rotor attached to the C=O group. As shown below, this is due to the very flat variation of the potential energy surface in this mode.
Comparison of harmonic frequencies (in cm−1) from the fits with the DFT(B3LYP)/vDZ ones.
Mode . | B3LYP . | Full . | Frag-2 . | Frag-3 . |
---|---|---|---|---|
1 | 46 | 18i | 19i | 34 |
2 | 109 | 96 | 96 | 79 |
3 | 160 | 158 | 159 | 160 |
4 | 294 | 288 | 288 | 266 |
5 | 428 | 419 | 415 | 400 |
6 | 442 | 431 | 433 | 422 |
7 | 625 | 624 | 625 | 625 |
8 | 634 | 632 | 631 | 628 |
9 | 871 | 870 | 869 | 868 |
10 | 990 | 986 | 982 | 987 |
11 | 1037 | 1037 | 1038 | 1043 |
12 | 1111 | 1110 | 1111 | 1115 |
13 | 1138 | 1137 | 1132 | 1135 |
14 | 1158 | 1162 | 1164 | 1159 |
15 | 1263 | 1261 | 1261 | 1264 |
16 | 1375 | 1386 | 1386 | 1378 |
17 | 1412 | 1413 | 1412 | 1429 |
18 | 1439 | 1448 | 1452 | 1449 |
19 | 1459 | 1464 | 1466 | 1459 |
20 | 1467 | 1470 | 1473 | 1462 |
21 | 1472 | 1476 | 1480 | 1497 |
22 | 1557 | 1554 | 1553 | 1563 |
23 | 1775 | 1779 | 1780 | 1784 |
24 | 3013 | 3015 | 3018 | 3030 |
25 | 3040 | 3040 | 3044 | 3045 |
26 | 3070 | 3068 | 3071 | 3075 |
27 | 3123 | 3124 | 3125 | 3124 |
28 | 3127 | 3126 | 3127 | 3131 |
29 | 3150 | 3149 | 3151 | 3146 |
30 | 3627 | 3628 | 3631 | 3633 |
Mode . | B3LYP . | Full . | Frag-2 . | Frag-3 . |
---|---|---|---|---|
1 | 46 | 18i | 19i | 34 |
2 | 109 | 96 | 96 | 79 |
3 | 160 | 158 | 159 | 160 |
4 | 294 | 288 | 288 | 266 |
5 | 428 | 419 | 415 | 400 |
6 | 442 | 431 | 433 | 422 |
7 | 625 | 624 | 625 | 625 |
8 | 634 | 632 | 631 | 628 |
9 | 871 | 870 | 869 | 868 |
10 | 990 | 986 | 982 | 987 |
11 | 1037 | 1037 | 1038 | 1043 |
12 | 1111 | 1110 | 1111 | 1115 |
13 | 1138 | 1137 | 1132 | 1135 |
14 | 1158 | 1162 | 1164 | 1159 |
15 | 1263 | 1261 | 1261 | 1264 |
16 | 1375 | 1386 | 1386 | 1378 |
17 | 1412 | 1413 | 1412 | 1429 |
18 | 1439 | 1448 | 1452 | 1449 |
19 | 1459 | 1464 | 1466 | 1459 |
20 | 1467 | 1470 | 1473 | 1462 |
21 | 1472 | 1476 | 1480 | 1497 |
22 | 1557 | 1554 | 1553 | 1563 |
23 | 1775 | 1779 | 1780 | 1784 |
24 | 3013 | 3015 | 3018 | 3030 |
25 | 3040 | 3040 | 3044 | 3045 |
26 | 3070 | 3068 | 3071 | 3075 |
27 | 3123 | 3124 | 3125 | 3124 |
28 | 3127 | 3126 | 3127 | 3131 |
29 | 3150 | 3149 | 3151 | 3146 |
30 | 3627 | 3628 | 3631 | 3633 |
Figure 4 shows the potential energy as a function of the torsion angles of the two methyl rotors, using Frag-2 PES. The torsion angles are both zero at the equilibrium structure shown in Fig. 1. CH3(NH) represents the methyl rotor attached to N–H, and CH3(CO) represents the one attached to the C=O group. The barrier heights using Frag-2 are 28.8 and 167.5 cm−1 for the two methyl rotors. The barriers from the Frag-3 PES are 24.2 and 149.6 cm−1, and those from the full PES are 20.2 and 169.2 cm−1. So the three PESs are in good agreement with each other. These results are gratifyingly close to the ones obtained directly from B3LYP/cc-pVDZ calculations, namely, 59.5 and 181.5 cm−1, obtained by single point calculations at the minimum and transition states. Almost certainly, closer agreement with the direct results would result from fits with more data around the barriers. For the present purposes, we are satisfied that all the PESs describe the 3-fold symmetry of the methyl rotor since the relevant permutational symmetry is used for all the PESs. And the major differences in the two barrier heights are also captured by all the PESs.
Torsional potential (not fully relaxed) of the two methyl rotors on the Frag-2 PES.
Torsional potential (not fully relaxed) of the two methyl rotors on the Frag-2 PES.
Overall, the above results for the NMA example are very promising. As noted already, NMA is not a very large molecule but it is large enough to justify applying at least the Frag-2 scheme and also small enough to apply a full PIP fitting basis. It is worth considering how the present fragmentation strategy could be extended to larger molecules. Let us consider applying this fragmented basis approach to a 19-atom molecule, CH3–NH–CO–CH2–NH–CO–CH3. The full PIP basis for this large molecule with a maximum polynomial order of 3 consists of 178 058 terms, and it takes roughly 1 day to generate these PIPs. If we fragment the molecule into five pieces: CH3–NH–CO–C, C–NH–CO–CH2–N, N–CO–CH2–NH–C, C–CH2–NH–CO–C, and C–NH–CO–CH3, and thus, each fragment is similar in size to the Frag-2 in NMA. The total number of coefficients would be 23 456 at a maximum polynomial order of 3 (which still has some redundancy). This number is still quite feasible for a linear least-squares fit. More importantly, the fitting basis can be generated within 1 minute. If we decompose the molecule into 6 fragments with each fragment similar in size to those in Frag-3, the number of coefficients would be 5238, and the basis can be generated within about 1 s. The number of unique non-negligible Morse variables would also be significantly reduced from the full set of 171.
So for large molecules, the cost and size of the PIP fitting basis are no longer a limitation and the bottleneck is obtaining the electronic energies and gradients for the training data set. However, efficient density functional theory (DFT) methods, localized methods, quantum mechanics/molecular mechanics (QM/MM) approaches, etc., are certainly viable approaches. In addition, perhaps more efficient sampling methods, based on the fragmentation scheme, can be investigated.
SUMMARY AND CONCLUSIONS
In summary, we presented a new fragmented permutationally invariant polynomial fitting basis for potential energy surfaces, which greatly extends the size of molecules beyond the current practical limit of 10 atoms. The approach was demonstrated for 12-atom N-methyl acetamide with two methyl rotors. The quality of the PESs using the fragmented basis was tested against the PES using a full basis as well as direct DFT calculations. Our results show that the fragmented basis significantly reduces the computational cost of generating the invariant polynomial fitting basis as well as the number of terms and unknown coefficients. The number of Morse variables also decreases in the new approach although this is not an essential aspect of it.
SUPPLEMENTARY MATERIAL
See supplementary material for the full PES for trans-N-methyl acetamide included as compressed file.
ACKNOWLEDGMENTS
We acknowledge financial support from NASA Grant No. NNX16AF09G.