One hidden yet important issue for developing neural network potentials (NNPs) is the choice of training algorithm. In this article, we compare the performance of two popular training algorithms, the adaptive moment estimation algorithm (Adam) and the extended Kalman filter algorithm (EKF), using the Behler–Parrinello neural network and two publicly accessible datasets of liquid water [Morawietz *et al.*, Proc. Natl. Acad. Sci. U. S. A. **113**, 8368–8373, (2016) and Cheng *et al.*, Proc. Natl. Acad. Sci. U. S. A. **116**, 1110–1115, (2019)]. This is achieved by implementing EKF in TensorFlow. It is found that NNPs trained with EKF are more transferable and less sensitive to the value of the learning rate, as compared to Adam. In both cases, error metrics of the validation set do not always serve as a good indicator for the actual performance of NNPs. Instead, we show that their performance correlates well with a Fisher information based similarity measure.

## I. INTRODUCTION

Neural network potentials (NNPs) are one category of machine learning potentials,^{1–4} which approximate potential energy surfaces (PESs) and allow for large-scale simulations with the accuracy of reference electronic structure calculations but only at a fraction of the computational cost.^{5}

One prominent architecture of NNPs is the Behler–Parrinello neural network (BPNN),^{6} which introduced the idea of partitioning the total potential energy of the system into effective atomic contributions. BPNNs have been applied to a wide range of molecules and materials.^{7–12} Despite these successes, the BPNN architecture relies on the selection of a set of symmetry functions before the training in order to describe the local chemical environments. In contrast, features in deep learning^{13} are learned automatically via hierarchical filters rather than handcrafted. In particular, graph convolutional neural networks (GCNNs), which treat the atoms as nodes and the pairwise interactions as weighted edges of a graph, have emerged as a new type of architecture for constructing NNPs for both molecules and materials.^{14–16}

Innovations regarding new architectures certainly drive the improvement of the performance of NNPs. However, one hidden and less discussed issue is the impact of training algorithms on the construction of NNPs. As a matter of fact, earlier implementations of BPNNs^{17–21} used either the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS)^{22} or the extended Kalman filter algorithm (EKF)^{23} as the default optimizer, while recent implementations of GCNN architectures^{14,16,24} almost exclusively chose the adaptive moment estimation algorithm (Adam)^{25} instead. This contrast is partly due to the convenience of efficient Adam implementations being available in popular frameworks such as TensorFlow^{26} and PyTorch.^{27} Nevertheless, the practical equivalence among optimization algorithms for training NNPs is assumed rather than verified.

In principle, comparing training algorithms in terms of the performance of NNPs is a highly non-trivial task. This is largely because the final performance of NNPs is affected by many different factors that can confound the comparison: the construction of the dataset; the setup of the loss function and batch schedule; the choice of neural network architecture; and the hyperparameters of the training algorithms, not to mention human errors in implementing these training algorithms and the corresponding neural network architecture. Even if everything is properly controlled, the ground truth needed to judge the effect of the training algorithm on the performance of NNPs in real-application scenarios may simply not be available.

To carry out this daunting task, our strategy is to use published datasets of liquid water where credible reference values in real-application scenarios are available.^{8,9} This allows us to just focus on the effect of the training algorithm, while the rest of hyperparameters, i.e., the NNP architecture (setups of the BPNN and symmetry functions), the loss function, and the batch schedule, in addition to the dataset, can be inherited and kept fixed. To minimize the impact due to differences in implementation, we implement EKF in the TensorFlow-based PiNN code,^{16} where the BPNN architecture was previously implemented and tested. The technical soundness of our implementation of EKF is verified against the EKF implementation in the RuNNer code,^{19} which was used to generate NNPs in the reference works with the same datasets.^{8,9} In this way, we can compare EKF and Adam on an equal footing, where the native implementation of Adam in TensorFlow is used.

Before showing these results, we will outline the details of our comparative study, including the training algorithms, datasets, the network architecture, and molecular dynamics simulations used for the density prediction, as described in Sec. II.

## II. METHODS

### A. Loss function and batch schedule

As the aim in this study is to elucidate the effect of the training algorithm on the performance of NNPs, the definition of the loss function and the choice of batch schedule need to be held consistent between EKF and Adam algorithms.

We denote $\hat{y}(\mathbf{w}, \mathbf{x})$ as the neural network prediction given the weight vector **w** and input vector **x** and **y** as the training-data labels. Here, we define $\boldsymbol{\xi} = \frac{c}{n}(\hat{y} - \mathbf{y})$ as the scaled error vector, where *n* is the size of vector **y** and *c* is the weighting factor (for balancing the number of energy and force labels). This leads to the *L*_{1} loss function *L*_{1} = ∑_{i}|*ξ*_{i}|. The corresponding *L*_{2} loss function $L_2 = \sum_i \xi_i^2$ is then related to the mean squared error scaled by *c*. Since EKF minimizes the error vector **ξ** and Adam minimizes the *L*_{2} loss, this setup ensures that the two algorithms optimize the same loss function as long as the training data are the same.
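To make these definitions concrete, the scaled error vector and the two losses can be sketched in plain Python; the labels, predictions, and weighting factor below are hypothetical numbers for illustration, not values from this work.

```python
# Sketch of the scaled error vector xi = c/n * (y_hat - y) and the
# L1/L2 losses defined in the text.  All numbers are hypothetical.

def scaled_error(y_hat, y, c):
    """Return xi = c/n * (y_hat - y), with n the number of labels."""
    n = len(y)
    return [c / n * (p, t)[0] - c / n * t for p, t in zip(y_hat, y)]

def scaled_error(y_hat, y, c):
    n = len(y)
    return [c / n * (p - t) for p, t in zip(y_hat, y)]

def l1_loss(xi):
    return sum(abs(e) for e in xi)

def l2_loss(xi):
    return sum(e * e for e in xi)

xi = scaled_error([1.0, 2.5, -0.5], [1.2, 2.0, 0.0], c=3.0)
print([round(e, 3) for e in xi])   # → [-0.2, 0.5, -0.5]
print(round(l1_loss(xi), 3))       # → 1.2
print(round(l2_loss(xi), 3))       # → 0.54
```

Note that with identical training data, minimizing the error vector (EKF) and minimizing the summed squared errors (Adam) target the same objective, as the text states.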

Another distinction between the two algorithms (without diving into the details, which are elaborated in Sec. II B) concerns the batch schedule. EKF, which originated in the fields of control theory and signal processing,^{28} is designed for on-line training (i.e., one training data-point at a time), in contrast to the mini-batch (a group of randomly selected training data) often used in Adam. This difference becomes blurred when the multi-stream variant of EKF is employed.^{20} Nevertheless, we followed the practice of previous studies and based each weight-update step on either energy or force labels, to avoid the complication of combining energy and force data in one step.^{20}

In the RuNNer setup of EKF, the weights are updated once a given number of errors and the corresponding Jacobians have been computed. As the number of force labels overwhelms that of energy labels, only a small fraction of the force labels is typically used.^{20} In addition, the labels can be filtered according to the magnitude of the error, which potentially improves the efficiency. An illustration of the batch schedule used in RuNNer is shown in Fig. 1(a).

Here, we choose to apply a simplified batch schedule in which each weight update is based on randomly selected force or energy labels without filtering, which is computationally more efficient in TensorFlow. In our particular instance, each iteration uses either one energy label or ten force labels, as illustrated in Fig. 1(b). As shown in Sec. III, this simplification in the batch schedule does not incur an inferior performance of the NNPs optimized with EKF. The same batch schedule is subsequently used with Adam for the sake of consistency.
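The simplified schedule can be sketched as below. The 50/50 choice between an energy step and a force step is an illustrative assumption (the actual energy/force update ratio is given in the supplementary material), and the label pools are placeholders.

```python
import random

# Sketch of the simplified batch schedule: each weight update draws
# either one random energy label or ten random force labels, with no
# error-based filtering.  The label pools and the 50/50 split between
# energy and force steps are illustrative assumptions.
energy_labels = list(range(100))     # placeholder indices of energy labels
force_labels = list(range(3000))     # placeholder indices of force components

def next_batch(rng):
    """Return ('energy', [i]) or ('force', [i1, ..., i10]) for one step."""
    if rng.random() < 0.5:
        return "energy", [rng.choice(energy_labels)]
    return "force", rng.sample(force_labels, 10)

rng = random.Random(0)
kind, batch = next_batch(rng)
print(kind, len(batch))
```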

Further details regarding the loss function and batch schedule, including the estimated number of weight updates based on energy and forces in each scheme and the weighting factor *c* thereof, are listed in the supplementary material (Sec. B).

### B. Algorithms

In the following, we first introduce the common notations used for the optimization algorithm and then briefly state the Adam and EKF algorithms compared within this work.

Continuing from Sec. II A, we denote **g** as the gradient of the *L*_{2} loss function with respect to the weight vector **w** and **J** as the Jacobian matrix of **ξ** with respect to **w**.

The first optimization algorithm used in this work is Adam, which is a popular algorithm for the training of neural networks.^{25} The algorithm can be considered as an extension of the stochastic gradient descent in which the first moment **m**(*t*) and second moment **v**(*t*) are estimated at each step and used as a preconditioner to the gradients. The algorithm is shown in Algorithm 1.
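For reference, a single-parameter sketch of the Adam update (Algorithm 1) on a toy quadratic loss is given below; the step size and moment coefficients are the commonly used defaults and may differ from the training setup of this work.

```python
import math

# Sketch of Adam (Algorithm 1 in the text) applied to the toy quadratic
# loss L(w) = (w - 2)^2, whose gradient is 2*(w - 2).  Hyperparameters
# are the common defaults, not necessarily those used in this work.
def adam_minimize(grad, w, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8,
                  steps=1000):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate m(t)
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate v(t)
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (math.sqrt(v_hat) + eps)  # preconditioned step
    return w

w_opt = adam_minimize(lambda w: 2 * (w - 2.0), w=0.0)
print(round(w_opt, 2))
```

The second-moment estimate acts as the diagonal preconditioner discussed later in Sec. III C.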

The other optimization algorithm used in this work is EKF, which estimates the internal state of a system given a series of observations over time. In the context of neural network training, it can be interpreted as updating the weights of the neural network according to the gradient of the error with respect to the weights in past samples.^{23} The algorithm is summarized in Algorithm 2. Note that the notation used here follows Ref. 20, where the learning rate *η* is controlled by the observation-noise covariance matrix **R**. This formulation is more transparent as compared to earlier studies.^{29}
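A minimal EKF weight update for the simplest case of one scalar label per step (matching the on-line character described above) can be sketched as follows. The linear toy model, the initial covariance, and the choice **R** = 1/*η* are illustrative assumptions, not the BPNN setup; for a real NNP, the Jacobian would come from backpropagation.

```python
# Sketch of one EKF weight update with a single scalar label per step.
# P is the error covariance matrix; the observation noise R = 1/eta,
# so eta plays the role of a learning rate.  The linear toy model
# y_hat = w . x is for illustration only.

def ekf_step(w, P, x, y, eta=1.0):
    n = len(w)
    J = x[:]                                   # Jacobian dy_hat/dw (linear model)
    y_hat = sum(wi * xi for wi, xi in zip(w, x))
    err = y - y_hat                            # innovation (the error xi)
    PJ = [sum(P[i][j] * J[j] for j in range(n)) for i in range(n)]
    S = sum(J[i] * PJ[i] for i in range(n)) + 1.0 / eta   # J P J^T + R
    K = [pj / S for pj in PJ]                  # Kalman gain
    w = [wi + ki * err for wi, ki in zip(w, K)]
    P = [[P[i][j] - K[i] * PJ[j] for j in range(n)] for i in range(n)]
    return w, P

# Fit y = 2*x1 + 3*x2 from repeated noise-free observations.
w = [0.0, 0.0]
P = [[100.0, 0.0], [0.0, 100.0]]               # large initial covariance
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0),
        ([1.0, 1.0], 5.0), ([2.0, 1.0], 7.0)]
for x, y in data * 5:
    w, P = ekf_step(w, P, x, y)
print([round(wi, 2) for wi in w])   # → [2.0, 3.0]
```

For a linear model this reduces to recursive least squares, which is why the fit converges so quickly; the fast convergence of EKF relative to Adam is discussed in Sec. III A.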

### C. Dataset description

Two datasets containing structures, energies, and forces of liquid water (and ice phases) were used to train the NNPs in this work. The BLYP dataset consists of 6324 structures of both the liquid phase and ice phases, taken from Ref. 8 and re-computed with the CP2K suite of programs^{30} and the BLYP functional.^{31,32} The revPBE0-D3 dataset of 1593 structures of liquid water computed with CP2K was taken directly from Ref. 9.

### D. Neural network potential

#### 1. Network architecture

In this work, the NNPs were constructed using the BPNN architecture,^{6} with the symmetry function taken from Ref. 8. Specifically, 30 symmetry functions for oxygen and 27 for hydrogen were used for the local environment description of atoms. The exact set of symmetry functions used is given in the supplementary material (Sec. A). This local description was fed to an element-specific feed-forward neural network containing two hidden layers each comprised of 25 nodes with the tanh activation function and a linear output node to give atomic energy predictions, from which the total energy and force predictions are derived. The weight parameters were initialized using the Nguyen–Widrow method in RuNNer^{33} and the Xavier method in PiNN.^{34}

#### 2. Algorithm hyperparameters

The PiNN–Adam models were trained for 5 × 10^{6} steps, with a learning rate *η* that decays by a factor of 0.994 every 10^{4} steps. Notably, the gradient clipping technique was used with the Adam optimizer to alleviate the exploding gradient problem.^{35} Specifically, gradient vectors **g** with ‖**g**‖ > 0.01 will be scaled by a factor of 0.01/‖**g**‖ during training.
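The clipping rule described above amounts to rescaling by the global norm, e.g.:

```python
import math

# Sketch of gradient clipping by global norm, as described in the text:
# vectors with ||g|| > threshold are rescaled so that ||g|| == threshold.
def clip_by_norm(g, threshold=0.01):
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm > threshold:
        scale = threshold / norm
        return [gi * scale for gi in g]
    return g

g = clip_by_norm([3.0, 4.0])               # norm 5.0 exceeds 0.01
print(round(math.sqrt(sum(x * x for x in g)), 6))   # → 0.01
```

In TensorFlow, the equivalent built-in operation is `tf.clip_by_norm`.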

The PiNN–EKF models were trained with a learning rate *η* for 5 × 10^{5} steps. The RuNNer models were trained for 20 epochs (passes of the entire dataset), which correspond to a total of around 3 × 10^{5} steps. As shown in Algorithm 2, the learning rate *η* of EKF is embedded in the covariance matrix **R** and implicitly decays according to the inverse of the iteration [see Ref. 36 or Eq. (2) below]. *η* is set to a constant in the PiNN setup; the counterpart of *η* asymptotically approaches unity in the RuNNer setup, following the earlier studies.^{19,29}
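The two decay behaviors can be contrasted in a short sketch. The Adam staircase follows the settings stated above; the 1/*t* form for EKF is the asymptotic behavior implied by Eq. (2), not a literal line of either implementation.

```python
# Sketch of the two learning-rate schedules discussed above.  The Adam
# schedule decays by 0.994 every 1e4 steps, as stated in the text; the
# 1/t form for EKF is the implicit asymptotic decay, shown here only
# for illustration.

def eta_adam(t, eta0, decay=0.994, decay_steps=10_000):
    return eta0 * decay ** (t // decay_steps)

def eta_ekf_effective(t, eta0):
    return eta0 / max(t, 1)      # implicit inverse-iteration decay

print(eta_adam(5_000_000, 1e-4))       # after the full Adam training run
print(eta_ekf_effective(500_000, 1e-3))
```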

In both cases, 80% of the dataset was used for training and the remaining 20% was left out as a validation set. For each setup, ten models were trained with different random seeds to split the dataset or initialize the weights of the neural network. Whenever applicable, the standard deviations of the predictions across the models are used as error estimates.

### E. Molecular dynamics simulation

The molecular dynamics (MD) simulations for all the models were carried out with the ASE code.^{37} The PiNN code supports calculation with ASE, while the RuNNer code^{19} was interfaced to ASE through the LAMMPS^{38} calculator implemented in ASE. The timestep for all simulations was chosen to be 0.5 fs. For constant particle number, constant pressure, and constant temperature (NPT) simulations, the Berendsen barostat and thermostat^{39} were used to keep the pressure at 1 bar and the temperature at 330 or 320 K. For constant particle number, constant volume, and constant temperature (NVT) simulations, the Berendsen thermostat was used to keep the temperature at 300 K, and the density was fixed at 1 g/ml. Each MD simulation was run for 100 ps, and the densities or radial distribution functions were computed from the latter 50 ps.

## III. RESULTS AND DISCUSSION

### A. The extrapolation regime: The case of the BLYP dataset

The energy-density distribution of the BLYP dataset is shown in Fig. 2. This dataset contains structures from both the liquid phase and different ice phases, with a peak centered at 1.0 g/ml.^{8} Given the fact that the equilibrium density of BLYP water at ambient conditions is below 0.8 g/ml,^{8} the isobaric–isothermal density at 1 bar and 330 K serves as an instructive example in which the NNP is stretched into the extrapolation regime. This motivates us to discuss the results for the BLYP dataset first, where Adam and EKF show qualitative differences. The results shown in this section were generated using a learning rate of 10^{−4} for PiNN-Adam and a learning rate of 10^{−3} for PiNN-EKF, without loss of generality (see Sec. C in the supplementary material for details).

Before presenting those results, it is necessary to first discuss the convergence speed of the NNP training. The EKF optimizer is known to converge much faster than Adam (by approximately one order of magnitude in terms of the number of weight updates needed to achieve the desired accuracy).^{20,40} This phenomenon is clear in our training of NNPs, as shown in Fig. 3. However, the actual speed-up is compromised by the higher computational cost of EKF. In practice, the training takes about 2 h with the EKF optimizer and about 5 h with the Adam optimizer on a 28-core computer to achieve similar levels of accuracy in terms of force and energy. It should be noted that Adam's relative per-iteration speed advantage increases drastically as the total number of weights grows.

When it comes to the performance of NNPs, one would expect that all the trained potentials with comparable error metrics should yield a consistent water structure at room temperature and experimental density, given the dataset shown in Fig. 2. Indeed, this is borne out, as demonstrated with the O–O radial distribution function in Fig. 4(a). However, this agreement breaks down quickly in isobaric–isothermal simulations using the same NNPs, as shown in Fig. 4(b), which suggests that the corresponding densities should also differ from each other.

We then proceed to see how well Adam and EKF estimate the isobaric–isothermal density at ambient conditions for the BLYP dataset. As shown in Fig. 5, at the pressure of 1 bar and the temperature of 330 K, only 2 out of 10 models trained with Adam manage to predict a stable density, while most of the models trained using EKF lead to an excellent agreement with the previously reported density of around 0.76 g/ml for BLYP water.^{8} In addition, the EKF implementation in PiNN can reproduce the results of RuNNer well for the same dataset.

The qualitative difference between NNPs trained with Adam and EKF seen in Fig. 5 is striking, in light of comparable force and energy errors shown in Fig. 3 and radial distribution functions shown in Fig. 4(a). One may argue that the extrapolation is a difficult task for NNPs because it requires stepping out from their comfort zone. Thus, it is also relevant to consider how the two training algorithms would fare within an interpolation regime. This leads to our second example, the revPBE0-D3 dataset.

### B. The interpolation regime: The case of the revPBE0-D3 dataset

In the following experiment, we have used a dataset that was constructed to cover a wide range in the configuration space uniformly,^{9} as seen in the energy-density distribution in Fig. 6.

With this dataset, both Adam and EKF yield physical densities at the given temperature. Almost all of the ten models trained with Adam, regardless of the learning rate, lead to stable NPT simulations, in stark contrast to the case of the BLYP dataset (see Fig. S1 in the supplementary material for comparison). However, as shown in Fig. 7, the density prediction in the case of Adam strongly depends on the learning rate used to train the model.

It is tempting to conclude that the models trained with larger learning rates are the better performing ones because of their smaller force and energy errors. However, similar to the observation made for the BLYP dataset, the error metrics do not seem to correspond to the actual prediction performance of the trained NNPs. Indeed, we notice that the density does not converge even when the error metrics (Fig. 7) have reached lower values. Given that the reference density value for this dataset at 1 bar and 320 K is 0.92 g/ml,^{9} this suggests that one has to rethink the common wisdom of performing model selection based on error metrics, such as the RMSE.^{41,42}

Regardless of this implication, the present case study demonstrates that the performance of NNPs can be sensitive to the learning rate, in particular for Adam. As a consequence, this can lead to a tangible difference in terms of the density prediction even within the interpolation regime where the dataset already has a good coverage.

### C. Differences in models trained using Adam and EKF

In Secs. III A and III B, we found that NNPs trained with EKF seem to be less sensitive to the learning rate and more generalizable. Meanwhile, it is also clear that model selection based on the error metrics of energy and force may not always lead to a sound choice. While unintuitive, the observation is not completely unexpected. The error metrics are typically evaluated on randomly left-out validation sets, which underestimate the generalization error due to their correlation with the training set. Since the training set has to cover a sufficient region of the configuration space, it is hard to construct a test set that is independent of the training set.

Then, the questions are as follows: (i) can we distinguish “good” and “bad” NNPs by just looking into the weight space (rather than doing the actual MD simulations for the density prediction)? (ii) Why does EKF do better for training NNPs than Adam?

To answer the first question, we have analyzed the models presented in Secs. III A and III B to shed some light on the characteristics of the models trained with Adam and EKF.

One possible measure to distinguish different models is the Euclidean distance of the optimized weights to those in the initialization, which characterizes the implicit regularization effect of stochastic gradient descent.^{43} Denoting the vector of all weight variables in the neural network as **w**, we compared the evolution of the weight distance from initialization ‖Δ**w**‖ = ‖**w** − **w**_{init}‖ and also the initialization-independent distribution of the weights *w*_{i} of the final models for the trained networks.
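This norm-based measure is straightforward to compute from the flattened weight vectors, e.g. (the weight values below are hypothetical):

```python
import math

# Sketch of the norm-based measure described above: the Euclidean
# distance of the flattened weight vector from its initialization.
def weight_distance(w, w_init):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_init)))

w_init = [0.1, -0.2, 0.3]    # hypothetical initial weights
w_final = [0.4, 0.2, 0.3]    # hypothetical trained weights
print(weight_distance(w_final, w_init))   # ≈ 0.5
```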

As shown in Fig. 8(a), different NNPs trained on the BLYP dataset display similar weight distances from initialization. Moreover, the weight distributions of NNPs trained with Adam and EKF are almost indistinguishable, as shown in Fig. 8(b). This suggests that the norm-based measure also fails to distinguish models obtained from Adam and EKF. It further hints that an *L*_{2} norm-based regularization may not improve the performance of NNPs.

Another class of similarity measures is related to the local information geometry of the neural network.^{44} Here, we characterized it with the Fisher information matrix $\mathcal{I}$ as follows:

$$\mathcal{I} = \mathbb{E}\left[\left(\nabla_{\mathbf{w}} L_1\right)^{\otimes 2}\right], \quad (1)$$

where ^{⊗2} denotes the second tensor power of the gradient vector. Here, we used the *L*_{1} loss instead of the *L*_{2} loss to construct the Fisher information matrix for reasons that will become clear later. Figure 9 shows the distribution of eigenvalues $\lambda_{\mathcal{I}}$ of the Fisher information. For the BLYP dataset, EKF found local minima with much larger eigenvalues as compared to Adam, where the peak of the distribution differs by about one order of magnitude (note the logarithmic scale). Therefore, the Fisher information is a similarity measure that can effectively distinguish the performance of NNPs.
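The empirical Fisher matrix defined above is the sample average of the outer product of per-sample loss gradients; a minimal sketch follows, where the gradient vectors are hypothetical stand-ins for backpropagated *L*_{1}-loss gradients (the eigenvalue analysis of Fig. 9 would then be applied to the resulting matrix).

```python
# Sketch of the empirical Fisher information matrix: the average over
# samples of the outer product g g^T, where g is the gradient of the
# per-sample L1 loss with respect to the weights.  The gradient
# vectors here are hypothetical stand-ins.

def fisher_matrix(grads):
    n = len(grads[0])
    I = [[0.0] * n for _ in range(n)]
    for g in grads:
        for i in range(n):
            for j in range(n):
                I[i][j] += g[i] * g[j] / len(grads)
    return I

grads = [[0.1, -0.2], [0.3, 0.1], [-0.1, 0.2]]
I = fisher_matrix(grads)
print(I[0][1] == I[1][0])   # True: symmetric by construction
```

The resulting matrix is symmetric and positive semi-definite by construction, which guarantees real, non-negative eigenvalues.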

A similar trend was observed for the revPBE0-D3 dataset. The eigenvalue distribution of the NNPs trained with Adam and a small learning rate (1 × 10^{−5}) is closer to those trained with EKF, and the deviation becomes larger as the learning rate increases (1 × 10^{−4} and 1 × 10^{−3}). In contrast, the eigenvalue distributions of NNPs trained with EKF are almost identical. These results correlate well with the density predictions from the MD simulations shown in Fig. 7.

The usefulness of the Fisher information for model selection is also linked to the second question posed at the beginning of this section. Despite different motivations, both Adam and EKF can be understood in terms of natural gradient descent (NGD), where the inverse of the Fisher information matrix $\mathcal{I}$ is used as a preconditioner.

In Adam, $\mathcal{I}$ is approximated as a diagonal matrix, with the diagonal terms thereof being the second-moment vector $\hat{\mathbf{v}}(t)$. The inverse square root of the diagonal elements is used as a conservative preconditioner.

On the other hand, the connection between EKF and NGD has been shown recently,^{36} where the inverse of $\mathcal{I}$ is effectively estimated by the error covariance matrix **P** in EKF. In fact, one can show that

$$\mathbf{P}(t) \approx \frac{1}{\eta t}\,\mathcal{I}^{-1} \quad (2)$$

when the covariance matrix of observation noise **R** in Algorithm 2 is chosen to be a scaled identity matrix,^{20,36,40} i.e., $\mathbf{R}(t) = \frac{1}{\eta}\mathbf{1}$.

This means that EKF can be viewed as an online estimate of the full Fisher information $\mathcal{I}$ with respect to the *L*_{1} loss, with a 1/*t* decay of the learning rate. We think that this feature makes the Fisher information based on Eq. (1) a good similarity measure of models and leads to the superior performance of EKF for training NNPs, as demonstrated in the two examples shown in this work.

## IV. CONCLUSION AND OUTLOOK

To sum up, we have compared the performance of two optimization algorithms, Adam and EKF, for training NNPs of liquid water based on the BLYP and revPBE0-D3 datasets. It is found that NNPs trained with EKF are more transferable and less sensitive to the choice of the learning rate when compared to Adam. Furthermore, we show that the Fisher information is a useful similarity measure for the model selection of NNPs when the error metrics and the norm-based measure become ineffective.

Established practice in other neural network applications using Adam^{45} indicates that practical training performance can be improved if an architecture is reparameterized to promote weights in a common numerical range without strong correlations. Another future avenue would thus be to preserve the expressive power of existing NNPs while expressing the weights in such a way that the off-diagonal elements of the resulting Fisher information matrix are minimized. This can be understood in terms of the limitations of the Adam preconditioner strategy.

On the other hand, EKF scales quadratically with the number of trainable parameters in the network, while Adam scales linearly. This makes EKF a less favorable choice when more heavily parameterized models such as GCNNs^{14–16} are of interest. An approximation to the estimation of $\mathcal{I}$, such as one based on Kronecker factorization,^{46} is thus essential for transferring the present observations to those models.

Before closing, it is worthwhile to note that the present issue of training algorithms may be overcome with more data (as shown in Fig. 9 with an augmented dataset in our previous work^{16}), for which a number of active-learning approaches tailored for NNPs have been proposed.^{11,47–50} However, we would argue that a better training algorithm not only improves the performance of NNPs but also improves the data efficiency in active learning.

This work focuses on the practical role of the training algorithm in the performance of NNPs, where different approximations to the Fisher information (in Adam and EKF) and the choice of learning rate are the factors primarily considered. Nevertheless, as mentioned at the very beginning, other factors, such as the loss function and batch schedule, the neural network architecture, and the decay factor and decay steps in the training algorithm, will bring new dimensions into the discussion. In addition, the examples shown here are based on published datasets of liquid water, and the real-application scenario is density prediction. Therefore, the conclusions drawn in this work remain to be confirmed for more complex systems and other physico-chemical properties, where credible reference values in real-application scenarios can be obtained. Overall, this calls for more systematic investigations to shed light on this topic and for the practice of open science to facilitate similar studies in the community. With these efforts, the development of NNPs will become more transparent and reproducible, which, in turn, will provide reliable insights into interesting problems in chemical physics and physical chemistry.

## SUPPLEMENTARY MATERIAL

See the supplementary material for the symmetry function parameters, weighting factor in the batch schedule and loss function, and learning rate dependency of error metrics and density for the BLYP dataset.

## ACKNOWLEDGMENTS

C.Z. acknowledges the Swedish Research Council (VR) for a starting grant (No. 2019-05012). We also acknowledge the Swedish National Strategic e-Science program eSSENCE for funding. Part of the simulations were performed on the resources provided by the Swedish National Infrastructure for Computing (SNIC) at UPPMAX and NSC. C.N. acknowledges funding through the Uppsala University AI4Research project. We thank J. Behler for providing the access to the RuNNer code.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

## DATA AVAILABILITY

Datasets used in this work are publicly accessible from https://doi.org/10.5281/zenodo.2634098 and https://archive.materialscloud.org/record/2018.0020/v1. The PiNN code is available from http://github.com/Teoroo-CMC/PiNN, and its EKF implementation will be released in the next version and provided upon request in the interim.

## REFERENCES
