The architecture of neural network potentials is typically optimized at the beginning of the training process and remains unchanged throughout. Here, we investigate the accuracy of Behler–Parrinello neural network potentials for varying training set sizes. Using the QM9 and 3BPA datasets, we show that adjusting the network architecture according to the training set size improves the accuracy significantly. We demonstrate that both an insufficient and an excessive number of fitting parameters can have a detrimental impact on the accuracy of the neural network potential. Furthermore, we investigate the influences of descriptor complexity, neural network depth, and activation function on the model’s performance. We find that for the neural network potentials studied here, two hidden layers yield the best accuracy and that unbounded activation functions outperform bounded ones.
I. INTRODUCTION
Machine learning has emerged as a powerful tool in the field of atomistic simulations and is currently revolutionizing the way we predict and understand the properties of matter at the atomic scale. Since the first machine learning potential was published in 1995,1 the field has advanced significantly, and a wide range of methods are now available, including neural network potentials (NNP) of various types,2–25 Gaussian approximation potentials (GAP),26 Gaussian processes,27 support vector machines (SVM),28 atomic permutationally invariant polynomials (aPIP),29 spectral neighbor analysis potentials (SNAP),30,31 moment tensor potentials (MTP),32 gradient domain machine learning (GDML),33 and potentials based on the atomic cluster expansion (ACE).34
In this work, we focus on neural network potentials, which are among the most established approaches,35 and address the question of how to choose their architecture for optimum accuracy. Utilizing artificial neural networks to represent atomistic potentials has several advantages. First, neural networks are universal approximators36 and, therefore, possess the ability to capture complex, non-linear relationships between atomic positions and the corresponding energies and forces. Another notable advantage of neural networks is their ability to effectively learn from large datasets and, at the same time, offer rapid evaluation once they are trained. The evaluation time of neural network potentials scales with the architecture of the neural network itself rather than growing with the size of the dataset, as is typically the case for most kernel-based methods.37
Neural network potentials, however, also have certain limitations. Their complexity makes them challenging to understand theoretically, resulting in various approaches and topologies based primarily on empirical knowledge. One common challenge in neural networks, regardless of the approach, involves finding good hyperparameters such as the number of features (descriptors) and the number of fit parameters (weights and biases).
The focus of this paper is to understand how the accuracy of Behler–Parrinello high-dimensional neural network potentials (HDNNP)3 depends on the number of descriptors used to characterize chemical environments and on the number of fitting parameters in the classical regime. We demonstrate that to achieve optimal accuracy, the number of parameters must be adjusted for each training set size and descriptor complexity. In the following, we will first briefly review the methodology of HDNNPs and discuss some computational details. Then we will devise a strategy to find the optimum number of parameters for a given training set size and discuss the impact of neural network depth and activation functions.
II. THEORY
A. High-dimensional neural network potentials
B. Atom-centered symmetry functions
C. Weighted symmetry functions
Atom-centered symmetry functions (ACSFs) have been proven to work well for many systems,41 but their poor scaling with the number of elements limits their applicability to more complex systems. While the number of required radial symmetry functions scales linearly with the number of elements N_elem in the system, the number of required angular symmetry functions is proportional to N_elem(N_elem + 1)/2 to include all triplet combinations.
Weighted atom-centered symmetry functions (wACSFs) address this problem by replacing the element-resolved symmetry functions with a single set in which the contributions of neighboring atoms are weighted by an element-dependent factor, such as the atomic number.42 With this modification, the number of symmetry functions per element remains the same as in a single-element system, and the total number of symmetry functions increases only linearly with the number of elements. However, in this approach, more information is compressed into each symmetry function, causing a loss of resolution in the description of local chemical environments, which can manifest itself in reduced accuracy for large training sets.
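To make the scaling argument concrete, the following Python sketch compares how the per-atom descriptor count grows with the number of elements for unweighted and weighted ACSFs; the per-set counts n_rad and n_ang are hypothetical placeholders, not the values used in this work.

```python
# Sketch: scaling of the number of atom-centered symmetry functions (SFs)
# with the number of chemical elements. Per-set counts are placeholders.

def n_sfs_unweighted(n_elem, n_rad=6, n_ang=9):
    """Unweighted ACSFs: one radial set per element and one angular set
    per unordered element pair, i.e. n_elem*(n_elem + 1)/2 combinations."""
    return n_elem * n_rad + (n_elem * (n_elem + 1) // 2) * n_ang

def n_sfs_weighted(n_rad=6, n_ang=9):
    """Weighted ACSFs: element information enters through the weights,
    so the count is independent of the number of elements."""
    return n_rad + n_ang

for n_elem in (1, 2, 5):  # QM9, e.g., contains 5 elements (H, C, N, O, F)
    print(n_elem, n_sfs_unweighted(n_elem), n_sfs_weighted())
```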
III. COMPUTATIONAL DETAILS
A. Dataset
We optimized the architectures of HDNNPs for the QM9 and 3-(benzyloxy)pyridin-2-amine (3BPA) datasets.43,44 The QM9 dataset consists only of structures that correspond to energy minima, so only energies need to be learned. In contrast, the 3BPA dataset consists of out-of-equilibrium structures, requiring both energies and forces to be learned.
The QM9 dataset is widely used as a benchmark for machine learning potentials.45 The dataset is composed of 134k small organic molecules made up of H, C, O, N, and F atoms. The energies in the QM9 dataset were computed at the B3LYP/6-31G(2df,p) level of theory. Since 3054 structures failed a consistency test,43 it is common practice to remove these structures from the dataset.8,9,12 From this pruned dataset containing 130 831 structures, we randomly selected 10 000 structures to serve as a validation set that controls early stopping.8 Structures that are not in the training or validation sets are assigned to the test set.
The 3BPA dataset comprises 500 training structures obtained at room temperature. Out of these 500 structures, 50 were used as the validation set during the training process. The energies and forces for all structures were computed using density functional theory with the ωB97X exchange-correlation functional and the 6-31G(d) basis set.
B. Optimizer
The n2p2 package46 was used to train the neural network potentials for the QM9 dataset. The training was carried out with the multi-stream extended Kalman filter (MS-EKF) due to its higher performance for this type of task compared to gradient-based optimization algorithms.47,48 However, its nonlinear scaling with the number of parameters (weights and biases) and limited parallelization make this algorithm suitable only for rather small neural networks. We note that all results of this work were computed using only 12, 16, or 24 CPU threads.
Since the mean absolute error (MAE) is a common metric for the QM9 benchmark, we applied the Kalman filter to this loss function. For the 3BPA dataset, the root mean square error (RMSE) was minimized instead, with the relative importance of the forces with respect to the energies in the parameter updates set to 10.
Reported errors in this paper are sample averages over five models for the cross-validation, 10 models for the QM9 dataset, and 20 models for the 3BPA dataset. The models differ in the random seed, which affects the initialization of the weights and the shuffling of training structures. A different approach, averaging over different train/test splits, has also been reported.9 However, we have found that the standard deviation estimated from different initial parameters is similar to that of the train/test split only for small training sets. As the number of structures in the training set increases, the standard deviation of the model MAEs due to different NN initializations outweighs the effect of the train/test split, as summarized in Table I.
C. Descriptors
A previous study demonstrated that the accuracy of a neural network potential can be affected by the number of descriptors employed.42 In particular, it was shown that transitioning from an unweighted model with 220 symmetry functions to a full-weighted model with 32 symmetry functions reduced the MAE by 25% for the QM9 dataset. To study the influence of descriptor complexity on model accuracy, we introduce three types of models differing in complexity: full-weighted, angular-weighted, and unweighted models utilizing polynomial symmetry functions (refer to the supplementary material for detailed information). After pruning unrealized bonds, the complexity of our models can be described as follows: the full-weighted model contains 34 symmetry functions (7 weighted radial and 27 weighted angular), the angular-weighted model comprises 61 symmetry functions (34 unweighted radial and 27 weighted angular), and the unweighted model consists of 225 symmetry functions (18 unweighted radial and 207 unweighted angular) on average per element. The element-wise numbers of SFs are provided in Table S1.
To describe the chemical environment for the 3BPA dataset, we employed only the angular-weighted model. Since the 3BPA molecules do not contain fluorine, the average number of SFs is reduced to 54 (27 unweighted radial and 27 weighted angular).
IV. RESULTS AND DISCUSSION
A. QM9 data set
First, we reproduced the 5-fold cross-validation with 8000 training structures and 2000 validation structures, utilizing the original Gaussian descriptors from previous work.42 The same 5-fold cross-validation was also carried out for our polynomial set with full-weighted SFs. Table II compares the 5-fold averaged MAE for both models. Significantly better accuracy is achieved with Gaussian symmetry functions than in the original work.42 This improvement can be attributed mainly to the adoption of the multi-stream Kalman filter, which offers superior results compared to the previously used single-stream Kalman filter.47 The polynomial wACSFs slightly outperform the Gaussian wACSFs; thus, we provide another example showing that polynomial symmetry functions are an advantageous alternative to the original Gaussian SFs.40
TABLE II. MAE (kcal/mol) of the 5-fold cross-validation for Gaussian and polynomial symmetry functions.

| Symmetry functions | Training | Validation | Testing |
|---|---|---|---|
| Gaussian: other work42 | 1.18 | 1.84 | 1.83 |
| Gaussian: this work | 0.55 | 1.23 | 1.28 |
| Polynomial | 0.53 | 1.19 | 1.24 |
Previous work, described in Ref. 42, mainly focused on the introduction of weighted ACSFs. For this purpose, only 10 000 structures of the whole QM9 dataset were employed. Here, we have determined the QM9 learning curves for our three models differing in complexity (see the Descriptors section in Computational Details), with training sets ranging from 1000 to 75 000 structures. The neural network architecture is the same as in Ref. 42, i.e., two hidden layers with 10 and 50 nodes, respectively, and the softplus activation function. The resulting BPNN learning curves are shown in Fig. 1.
The comparison of our three models shows that the model combining unweighted radial and weighted angular symmetry functions performs best. For a training set size of 10 000, the full-weighted model underperforms it by 25%, while the unweighted model has twice the error. This indicates that the simplest (full-weighted) model underfits, as its descriptors do not have sufficient resolution, while the most complex (unweighted) model is affected by the curse of dimensionality.49 Due to its significantly larger error, the model with 225 descriptors on average (the unweighted model) is excluded from further analysis. Although the full-weighted and angular-weighted models perform well, a loss of linearity in the log–log plot is observed for these two models beyond 10 000 training structures.
The total error of a machine learning model originates from approximation, generalization, and optimization errors. An optimization origin of the error of our networks seems unlikely, since the deviation from power-law behavior (linearity in the log–log plot) was not observed for the unweighted model. It is also unlikely to stem from a non-unique or biased sampling of configuration space, as other methods do not show this deviation.8,9,12,50,51 Hence, it seems likely that the errors are mostly approximation-based, i.e., the descriptors fail to sufficiently distinguish different chemical environments, and/or the small number of parameters does not allow the networks to capture the finer features of the potential energy surface. Increasing the number of descriptors from the full-weighted to the angular-weighted model slightly reduces the underfitting but does not eliminate the flattening of the learning curve. The learning curves shown in Fig. 1 were obtained using a neural network comprising two hidden layers connected with the softplus activation function. The hidden layers consisted of 10 and 50 nodes, respectively, as this particular architecture has been previously reported in the literature.42 However, this architecture was optimized for only 8000 training structures. As the non-linearity in the log–log plot sets in at 10 000 training structures, an insufficient number of parameters might be the cause of this behavior.
For training sets of sizes 1000, 3170, 4900, 6699, 8321, and 10 000, we have investigated the dependence of the validation MAE on the number of parameters. The investigated neural network architectures consist of two hidden layers with an identical number of nodes in each. For each training set size, we trained 10 models starting from different random seeds for the parameters. The errors were then averaged and fitted with the exponential function f(x) = A·exp(−bx) + c (see Figs. S1 and S2). We define the optimal number of parameters, n_opt, via the fitted parameter b as n_opt = 4.5/b. With this definition, the deviation of the error from its asymptotic value c has decayed to about 1% of its initial amplitude A at n_opt. For both models, a linear dependence of n_opt on the training set size is observed, as shown in Fig. 2.
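As an illustration of this procedure, the sketch below fits the exponential decay and extracts n_opt with SciPy; the data arrays are placeholders, not values from this work.

```python
# Sketch: extracting the optimal parameter count n_opt from the decay of
# the validation error with network size. Placeholder data, for illustration.
import numpy as np
from scipy.optimize import curve_fit

n_params = np.array([200, 400, 800, 1600, 3200])  # hypothetical sizes
val_mae = np.array([2.1, 1.5, 1.2, 1.05, 1.02])   # hypothetical errors

def decay(x, A, b, c):
    """Exponential decay of the validation error, f(x) = A*exp(-b*x) + c."""
    return A * np.exp(-b * x) + c

(A, b, c), _ = curve_fit(decay, n_params, val_mae, p0=(2.0, 1e-3, 1.0))

# At n_opt = 4.5/b the deviation from the asymptote c has decayed to
# exp(-4.5), i.e. about 1% of its initial amplitude A.
n_opt = 4.5 / b
print(f"n_opt = {n_opt:.0f} parameters")
```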
Next, we investigated whether extrapolating n_opt to larger training set sizes improves the learning curves for training sets of over 10 000 structures. The original 62–10–50–1 architecture corresponds to 1231 parameters for a hydrogen atom. For a training set size of 75 000 structures, our linear extrapolation predicts that ∼4000 parameters are needed. With this knowledge, we retrained our models with the predicted optimal architectures for each training set size. The resulting QM9 learning curves are depicted in Fig. 3. Linearity in the log–log plot is now also observed beyond a training set size of 10 000, indicating that the original neural network architecture does not provide sufficient complexity for larger training set sizes.
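For reference, the parameter counts quoted above follow directly from the layer widths; the minimal helper below (our illustration, not code from n2p2) reproduces the 1231 parameters of the 62–10–50–1 hydrogen network.

```python
def n_parameters(layers):
    """Weights plus biases of a fully connected feed-forward network
    with the given layer widths (input, hidden layers, output)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

# The 62-10-50-1 hydrogen network: 62*10+10 + 10*50+50 + 50*1+1 = 1231
assert n_parameters([62, 10, 50, 1]) == 1231
```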
In Fig. 4, we present a comparison between our optimized angular-weighted BPNN and other machine learning models, i.e., SchNet,8 PhysNet,12 HIP-NN,9 SOAP,52 Allegro,25 MACE,54 and Wigner kernels.53 The BPNN model demonstrates accuracy comparable to SchNet8 and HIP-NN.9 However, recent parametric methods, such as MACE54 or Allegro,25 achieve higher accuracy, albeit at the cost of complex architectures with significantly more parameters. For instance, the most accurate Allegro model encompasses 17.9 × 10⁶ parameters, exceeding the number of parameters of our architecture-optimized BPNN by more than three orders of magnitude.
As we have shown, it is beneficial to vary the number of fitting parameters with the size of the training set. However, if computational cost were of no concern, could this approach be replaced by a fixed architecture in which the optimal number of parameters for the largest training set is used for all training set sizes? Our largest training set, with 75 000 structures, requires 4000 parameters (weights and biases). To test this, we gradually increased the number of parameters, with a corresponding increase in training epochs, for the smallest training dataset with 1000 structures. Figure 5 shows that the validation MAE at first decreases with an increasing number of parameters. Beyond roughly 1000 parameters, however, the validation MAE slowly starts to increase again. Thus, if a model with 4000 parameters is employed instead of 600 parameters for a training set of size 1000, the resulting model is 26% slower on the CPU and also ∼10% less accurate. Hence, both insufficient and excessive complexity degrade the accuracy of the potential energy prediction.
The depth of an HDNNP is another crucial hyperparameter. We analyzed learning curves for neural networks with one to five hidden layers. For all models and training set sizes, we set the number of parameters to the optimal value determined for the two-hidden-layer model. We found that neither increasing nor decreasing the depth of the network improves the accuracy, as listed in Table III. The model with only one hidden layer exhibited an average MAE that was 11% higher than that of the model with two hidden layers. Similarly, the models with 3, 4, and 5 hidden layers displayed MAEs that were larger by 7%, 15%, and 26%, respectively. This observation indicates a degradation of accuracy with increasing depth at a fixed number of parameters. Nevertheless, it is important to acknowledge the inherent bias in this comparison, as the optimization of the number of parameters was specifically tailored to the two-hidden-layer model. Consequently, it is conceivable that the optimal number of parameters is not exclusively dictated by the number of descriptors and the size of the training set, as discussed above, but is also influenced by the depth of the neural network. Furthermore, the arrangement of parameters within the hidden layers may also impact the model’s performance. In this study, a uniform node distribution was employed (see the sketch following Table III), but it is plausible that alternative distributions could yield more favorable outcomes.
TABLE III. MAE (kcal/mol) for networks with one to five hidden layers (1H–5H) as a function of training set size.

| Training set size | 1H | 2H | 3H | 4H | 5H |
|---|---|---|---|---|---|
| 1 000 | 3.36 | 3.07 | 3.45 | 3.59 | 4.44 |
| 3 162 | 1.91 | 1.67 | 1.78 | 1.88 | 1.98 |
| 10 000 | 1.03 | 0.90 | 0.93 | 1.01 | 1.01 |
| 25 000 | 0.65 | 0.60 | 0.61 | 0.64 | 0.67 |
| 50 000 | 0.45 | 0.41 | 0.44 | 0.47 | 0.49 |
| 75 000 | 0.37 | 0.34 | 0.36 | 0.40 | 0.43 |
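To keep the parameter count fixed while varying the depth, the width of the equal-sized hidden layers can be obtained from the quadratic parameter-count formula. The sketch below illustrates this for the uniform node distribution used in this study; the function itself is our illustration, and the 4000-parameter/61-descriptor example is taken from the angular-weighted model discussed above.

```python
from math import sqrt

def width_for_depth(n_target, n_desc, n_layers):
    """Width n of n_layers equal hidden layers (n_desc inputs, 1 output)
    such that the network holds ~n_target weights and biases:
    (L-1)*n^2 + (n_desc + L + 1)*n + 1 = n_target, with L = n_layers."""
    L = n_layers
    if L == 1:  # linear case: (n_desc + 2)*n + 1 = n_target
        return (n_target - 1) / (n_desc + 2)
    a, b, c = L - 1, n_desc + L + 1, 1 - n_target
    return (-b + sqrt(b * b - 4 * a * c)) / (2 * a)

# Equal hidden-layer widths for ~4000 parameters and 61 descriptors:
for L in range(1, 6):
    print(L, round(width_for_depth(4000, 61, L)))  # 63, 39, 31, 27, 24
```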
For our model with two hidden layers, we also evaluated the performance of four activation functions: tanh, sigmoid, rectified linear unit (ReLU), and softplus. In Fig. 6, we observe that the unbounded activation functions (ReLU and softplus) exhibit significantly better generalization, particularly for smaller training sets. As the training set size increases, the differences between the activation functions diminish, and for a training set of 75 000 structures, the sigmoid function achieves the same level of accuracy as the ReLU function. Nevertheless, its MAE is still 16% higher than that of the softplus function, which prevails across all training set sizes.
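For reference, the four activation functions compared here are summarized in the short NumPy sketch below; the numerically stable form of the softplus is our implementation choice.

```python
import numpy as np

def tanh(x):      # bounded to (-1, 1)
    return np.tanh(x)

def sigmoid(x):   # bounded to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):      # unbounded above, not smooth at x = 0
    return np.maximum(0.0, x)

def softplus(x):  # smooth, unbounded approximation of ReLU
    return np.logaddexp(0.0, x)  # log(1 + exp(x)), numerically stable
```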
B. 3BPA data set
Since the 3BPA molecule consists of 27 atoms, each reference energy is accompanied by 81 corresponding force components. We investigated the dependence of the validation RMSE on the number of parameters for training sets consisting of 100, 200, and 300 structures. The RMSE values, averaged over 20 models and fitted with an exponential function, are depicted in Fig. S3. For each training set size, we determined the optimal number of parameters n_opt for energies and forces separately, using the same approach as for the QM9 dataset. The resulting dependence is illustrated in Fig. 7.
Similar to the findings for the QM9 dataset, we again observe a positive correlation between the optimal number of parameters and the size of the training set. However, the optimal number of parameters differs for energies and forces. As can be inferred from Fig. 7, for a given training set size, about 150 more parameters are required for the optimal prediction of the energies, even though 81 times more forces than energies are available. This suggests that a more complex architecture is required for energies than for forces.
It is noteworthy that for the dataset consisting of a total of 24 600 reference data points (300 structures × (1 energy + 81 force components)), reaching the asymptotic accuracy to within 1% in the classical regime requires only a modest 600 parameters.
V. SUMMARY
In this paper, we have analyzed the learning curves of HDNNPs for the QM9 dataset using three models that employ varying numbers of descriptors, with average values of 34, 61, and 225 SFs. We find that among these models, the one with a combination of unweighted radial symmetry functions and weighted angular symmetry functions exhibits the highest accuracy. The full-weighted model demonstrates underfitting, likely due to insufficient information captured by the descriptors. On the other hand, the unweighted model tends to overfit within the investigated range of training set sizes.
Our calculations show that in order to achieve power-law behavior over the entire training set size range (from 1000 to 75 000), the neural network architecture has to be adapted to the training set size. For the angular-weighted model, we have empirically found a linear dependence of the optimal number of parameters n_opt on the training set size D: n_opt = 0.045D + 526.224. The correlation between the number of parameters and the size of the training set is also evident in the 3BPA dataset. Interestingly, two distinct dependencies emerge for energies and forces, with the energies necessitating a larger number of parameters than the forces. Our results indicate that the evaluation time of optimized neural networks is not fully independent of the training set size. Hence, with each significant change in the training set size, e.g., during active learning, the optimal neural network architecture should be re-determined in order to make full use of the available training data. Furthermore, we have demonstrated that an excess of parameters can be detrimental.
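As a quick consistency check, the fitted relation reproduces the parameter budgets quoted above (a trivial sketch using the coefficients from the text):

```python
def n_opt(D):
    """Fitted optimal parameter count vs training set size D
    (angular-weighted QM9 model, coefficients from the text)."""
    return 0.045 * D + 526.224

print(round(n_opt(1_000)))   # ~571, consistent with the ~600 found above
print(round(n_opt(75_000)))  # ~3901, i.e. the ~4000 parameters used above
```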
A comparison of learning curves for the QM9 dataset shows that the accuracy achieved with our optimized angular-weighted BPNN is comparable to that of other, more complex machine learning models such as SchNet8 and HIP-NN.9 More recent methods such as MACE,54 Allegro,25 or Wigner kernels,53 however, achieve substantially higher accuracy, but at the cost of greater complexity.
We also investigated the impact of varying the number of hidden layers for the QM9 dataset while maintaining a constant number of parameters. Our results revealed that utilizing two hidden layers yielded the lowest MAE, with the accuracy gradually decreasing as the number of layers increased. However, it is important to note that the number of parameters utilized for each training set size was optimized specifically for the two hidden layer model. This raises the question of whether the optimal number of parameters depends not only on the number of descriptors and training set size, as demonstrated, but also on how the parameters are distributed over the layers. Finally, our findings indicate that unbounded activation functions, such as ReLU and softplus, outperform bounded activation functions (tanh and sigmoid) for small datasets, but the difference in performance gradually decreases as the dataset increases. Among the tested activation functions, the softplus function yielded the highest accuracy.
SUPPLEMENTARY MATERIAL
The supplementary material contains detailed information about the polynomial symmetry functions used in this work. In addition, the supplementary material includes validation errors as a function of the number of parameters for the QM9 and the 3BPA datasets.
ACKNOWLEDGMENTS
We acknowledge financial support from the Doctoral College Advanced Functional Materials—Hierarchical Design of Hybrid Systems DOC 85 doc.funds funded by the Austrian Science Fund (FWF) and the Vienna Doctoral School in Physics (VDSP). The computational results presented have been partly achieved using the Vienna Scientific Cluster (VSC).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Lukáš Kývala: Conceptualization (lead); Data curation (lead); Investigation (lead); Methodology (lead); Software (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (equal). Christoph Dellago: Conceptualization (supporting); Funding acquisition (lead); Investigation (supporting); Project administration (lead); Resources (lead); Supervision (lead); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.