In this work, we present ænet-PyTorch, a PyTorch-based implementation for training artificial neural network-based machine learning interatomic potentials. Developed as an extension of the atomic energy network (ænet), ænet-PyTorch provides access to all the tools included in ænet for the application and usage of the potentials. The package has been designed as an alternative to the internal training capabilities of ænet, leveraging the power of graphic processing units to facilitate direct training on forces in addition to energies. This leads to a substantial reduction of the training time by one to two orders of magnitude compared to the central processing unit implementation, enabling direct training on forces for systems beyond small molecules. Here, we demonstrate the main features of ænet-PyTorch and show its performance on open databases. Our results show that training on all the force information within a dataset is not necessary, and including between 10% and 20% of the force information is sufficient to achieve optimally accurate interatomic potentials with the least computational resources.

## I. INTRODUCTION

In recent years, machine learning methods have gained popularity as methods to simulate large complex systems due to their ability to predict properties of materials with high accuracy and low computational cost. In particular, machine learning interatomic potentials^{1} (MLPs) are data-driven methods that allow the prediction of energies and forces of atomic structures with precision similar to that of the scheme used to generate the reference data but several orders of magnitude faster. Reference data are usually obtained by first-principles calculations, such as density functional theory^{2,3} (DFT) for bulk systems or post-Hartree–Fock methods for molecular ones.^{4,5} Once trained, these potentials can be employed in conjunction with other advanced simulation techniques, such as molecular dynamics^{6} (MD), Monte Carlo^{7–9} (MC), and evolutionary algorithms.^{10} MLPs have been used with great success in a wide variety of fields including physics, chemistry, and industry, in applications such as computing phonon properties,^{11,12} studying phase diagrams,^{13,14} or predicting properties and structures of crystals and molecules.^{15–18}

Multiple MLP approaches have been proposed in the literature; some examples include artificial neural network-based potentials (ANN-based MLPs),^{19–21} Gaussian approximation potentials,^{22–24} kernel-based methods,^{25–27} message-passing networks,^{28–30} or spectral neighbor analysis potentials^{31,32} among many others. In this case, our focus lies on the first group, ANN-based MLPs, which is based on one of the oldest and most studied methods in machine learning. The main drawback of MLPs is that the calculations to compile the reference dataset and training the potential is time- and resource-intensive. Several fits are usually needed, i.e., the training needs to be repeated several times, first to select the optimal hyperparameters and then to refine the MLP with techniques such as active or on-the-fly learning.^{33–36} The simplest of the approaches relies on training only on the energies of the reference structures. However, this approach usually leads to poor predictions of forces, which are the negative gradient of the predicted potential energy with respect to the atomic coordinates and are of foremost importance to achieve long stable MD simulations.^{37} Therefore, efficient techniques are required to include force information in the training in addition to the reference potential energies, which is far more computationally demanding.

As in most fields concerning machine learning, Graphics Processing Units (GPUs) provide the best solution to this problem. Some work has already been done in recent years in this direction, leading to updates of codes for training MLPs to include GPU support. For instance, that is the case of ANI,^{38} AMP,^{39} or deepMD,^{40} to name but a few. Here, we present an extensive update of the atomic energy network (ænet)^{41} code to allow MLP training on GPUs. ænet has proved to be efficient, especially when handling systems with many species. Our approach is simple: by using a well-known ML framework, PyTorch,^{42} we replace the training process of ænet while keeping full compatibility with all the other ænet resources. In this paper, we describe the main characteristics of our code called ænet-PyTorch and show its potential to train on both system energy and atomic forces. We also show that training on all the forces of a database is redundant by testing the code on several open databases. The code will be available on GitHub as free, open-source software.

### A. Machine learning potentials

*E*

_{i}) is then evaluated using the network specific to its element. It is assumed that the contribution of each atom

*i*depends only on its local environment (denoted as

*σ*

^{(i)}in the previous equation). By numerically describing those environments, via the so-called atomic fingerprints or descriptors and training on the total energy of the system, the resulting potential can be generalized independently of the number of atoms. Several ways of representing atomic environments can be found in the literature,

^{43–47}all of them satisfying a set of conditions regarding symmetry with respect to the exchange of equivalent atoms, rotations and translations of the structures, and smoothness of the descriptor functions.

*α*is a free weight parameter. RMSEs for energy $(LE)$ and forces $(LF)$ are defined as

*N*

_{struc}is the number of structures in the database and

*N*

_{atom,i}is the amount of atoms in the

*i*th structure. Here, the superscripts ANN and REF denote the output of the MLP and the reference value, respectively. For the sake of simplicity, the force error is normalized per atom in the whole set, instead of having the average of force error per structure. That is to say, we consider each force vector as an independent training example.

^{48}The outputs of the MLP ANNs are the energy contributions of each atom in Eq. (1), and the forces acting on each atom can be computed by taking the gradient of the total energy with respect to the coordinates of the atoms ({

**R**

_{k}}),

^{49}Here, we consider the L2 regularization,

^{50}introduced in the training as an extra term in the loss function,

*w*

_{i}} is the set of all parameters to be fitted during training and

*λ*is a weighting parameter. In the following, that

*λ*parameter will be referred to as the regularization or weight decay term.

In practice, in each training epoch, the loss function in Eq. (6) is not evaluated for the whole set of training examples but rather for a subgroup of them at a time. This group, or batch, is used to approximate the gradient of the loss function and to update the parameters in the network in each epoch.

## II. ænet-PyTorch

### A. ænet

ænet^{41} provides a tool for generating, testing, and applying machine learning interatomic potentials based on the Behler–Parrinello method,^{1} entirely written in Fortran 2003. Currently, it includes two different descriptors: the original Behler atom-centered symmetry functions^{1,51} and the Chebyshev descriptors^{52} by Artrith *et al.* Nonetheless, it is worth noting that this update of the code is independent of the descriptor, so any future addition of new descriptors in the original ænet code would be compatible with ænet-PyTorch.

ænet also includes the utilities to employ the MLPs in real applications, for example, in molecular dynamics simulations^{6} [using the interface with Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)^{53} or TINKER^{54}], in path integral molecular dynamics,^{55} or in any of the capabilities provided in the Atomic Simulation Environment^{56} Python package. Our new program has been designed as an extension to the original ænet code, and all of these utilities that ænet provides can be readily used with the potentials generated by ænet-PyTorch.

Including force information in the training increases the transferability of the potentials, improving the force prediction and finally enhancing the stability of MD simulations. Nevertheless, despite ænet’s parallel implementation’s efficient scaling on central processing units (CPUs), training on both energies and forces has been limited to simple systems with a small number of atoms due to its significant requirements for computational time and memory. That is the main reason why we consider this GPU implementation through PyTorch necessary so that the force training becomes feasible for complex systems within our ænet framework.

### B. ænet-PyTorch implementation

The principal novelties included in the training scheme of ænet-PyTorch are the capability to train MLPs using GPU in addition to CPU cores and training both on reference energies and atomic forces. ænet-PyTorch also provides users with easy access to various optimization algorithms and overfitting prevention techniques, such as dropout and batch-normalization layers, all available within the PyTorch framework. All of these are easily implementable and customizable for users.

PyTorch is a free and open-source machine learning framework based on the TorchLib library with a Python interface. It contains routines to enable efficient training of deep neural networks with CPU and GPU support, via the C++ and CUDA-based code, respectively. In contrast to other PyTorch-based implementations of MLPs, which usually rely on custom CUDA or C++ modules, this one is written completely in Python using only PyTorch’s built-in functions. More specifically, GPUs are best suited for tensor operations, so most operations must be expressed as such to achieve optimal performance.

The optimization strategy (depicted in Fig. 1) involves grouping all atoms of the same species together before training. This allows a single network call to predict the energies of all atoms of that type. PyTorch handles the parallel computation of atomic energies and forces internally. The process of grouping all the atoms and later reordering the resulting atomic energies and forces is also performed using only PyTorch’s built-in routines, so it is also a parallel process. As we will see in Sec. III, this idea leads to great scaling of the code, particularly on GPU.

For a detailed explanation of the capabilities and options of the code, the user is referred to the documentation of the code, which can be found online in the GitHub repository (https://github.com/atomisticnet/aenet-pytorch) together with the code and an example of its usage. However, some of them are worth noting, and they will be tested in the following. One of the main concerns when developing MLPs is the resources needed to train forces, both in terms of computational time and memory. It must be noted that with our implementation, we decided to sacrifice memory resources to speed up the training several orders of magnitude. However, some options have been included to reduce those memory requirements for large datasets, both CPU and GPU memory (called RAM and VRAM, respectively):

**GPU (gpu)**: This mode stores all the data needed on the GPU memory. Despite being the fastest way, storing all the data in the GPU device might be problematic for large datasets. The aim of this storage mode is to minimize GPU VRAM usage while optimizing the execution time, assuming that the VRAM is generally the bottleneck. Therefore, all information is loaded into the CPU RAM first and then moved to the VRAM after preprocessing the data.**GPU (cpu)**: If enough RAM is available but that of the GPU is limited, the data can be all stored in the CPU RAM. When each batch is processed, only the required data for that batch is moved to the GPU. This one is a slightly slower approach, as it requires constant communication between GPU and CPU.**CPU**: All calculations are done using CPU cores, without GPU support.

## III. ænet-PyTorch SCALING

In order to test the performance of ænet-PyTorch, we will replicate already published calculations using this new code and compare the results with those published by their original authors. Let us first consider the database used in the first ænet code release,^{41} formed by 7815 structures belonging to different phases of bulk titanium dioxides (TiO_{2}). In this section, we will use this set of calculations to check the performance of the code for energy and force training. All MLP training would usually require the selection of an architecture for the ANN, but in our case, we will be using the one that the authors proposed and proved to be optimal: 48–15–15–1 architecture using the Chebyshev polynomials as descriptors.

### A. Energy training

As we have already stated, this implementation allows the usage of all the tools in PyTorch, starting with all the optimization algorithms. Therefore, we first aim to select the most appropriate set of hyperparameters for the training. This includes the optimization method, batch size, learning rate, and weight decay. To do so, we consider all the variants of the Adam optimizer^{57–59} available in PyTorch, batch sizes ranging from 64 to 1024, learning rates from 10^{−6} to 10^{−1}, and weight decays from 10^{−5} to 10^{−2}. This results in a total of 600 training processes.

The models are evaluated based on their accuracy on the independent validation set, and the best results are displayed in Table I. All calculations are performed for 10^{4} training epochs to ensure that those with lower learning rates have also converged. In the following, a training epoch refers to a single iteration of the learning algorithm over the entire training dataset, i.e., processing the entire set once, independently of the number of batches in which the data are divided.

Method . | BS . | LR . | WD . | RMSE . |
---|---|---|---|---|

Adagrad | 64 | 10^{−1} | 10^{−4} | 4.221 |

Adamw | 128 | 10^{−4} | 10^{−5} | 4.434 |

Adamw | 256 | 10^{−4} | 10^{−5} | 4.387 |

Adamax | 512 | 10^{−3} | 10^{−5} | 4.310 |

Adamax | 1024 | 10^{−3} | 10^{−4} | 4.877 |

Method . | BS . | LR . | WD . | RMSE . |
---|---|---|---|---|

Adagrad | 64 | 10^{−1} | 10^{−4} | 4.221 |

Adamw | 128 | 10^{−4} | 10^{−5} | 4.434 |

Adamw | 256 | 10^{−4} | 10^{−5} | 4.387 |

Adamax | 512 | 10^{−3} | 10^{−5} | 4.310 |

Adamax | 1024 | 10^{−3} | 10^{−4} | 4.877 |

It has already been discussed that the feature that makes ænet-PyTorch efficient is the grouping of atoms of the same species before training. This has some implications: the time needed per epoch considerably decreases with the increase of the batch size, but it comes at the cost of more memory. Figure 2 shows the time needed per epoch for different batch sizes. The number of epochs needed to achieve convergence depends on each specific case; notwithstanding, the time per epoch is a good indicator of the scaling of the code. For each value, the best combination of method, learning rate, and regularization have been selected based on the results shown in Table I. The time needed for training considerably decreases when using GPU. The improvement ranges from one to two orders of magnitude, increasing with the batch size. On the other hand, the memory used also increases with the batch size, so in some cases, it would be better to save the dataset information in the CPU RAM so as to find a trade-off between time- and memory-efficiency. In that case, the speedup is still considerable while keeping the GPU memory (which is usually the limiting computational feature) more moderate. However, these problems rarely arise from calculations involving only energies.

These results show that energy training with ænet-PyTorch is efficient even on CPUs, considering that the CPU computations here are performed using only 2 CPU cores. As a reference, this training time is similar to that needed in the original ænet to obtain a similar error training with 16 processors. This comparison is not completely fair, since ænet uses an optimization algorithm much slower, the limited memory BGFS algorithm.

### B. Force training

Let us now consider the new feature introduced with this code, i.e., training on atomic forces in addition to energies. In this case, we will use the optimal training hyperparameters selected for energy training, and we will focus on two other parameters influencing force training: the *α* parameter from Eq. (2) weighting the energy and force RMSEs in the loss function and the fraction of structures with force information. Regarding the former, the *α* parameter is bounded from 0 to 1, i.e., from pure energy training to only training on forces. As for the latter, some works in the field already suggest that including the forces of all the atoms of every structure in the reference set is not necessary to reach accurate predictions: Singraber *et al.* used from 0.41% to 4.1%,^{48} and Artrith *et al.* stated that 10% was enough.^{60,61} The results of our tests are depicted in Fig. 3, where both the energy and force RMSEs are displayed as a function of the fraction of structures whose forces have been used in the training. These structures are randomly selected from the whole training set.

First, the *α* parameter determines the balance between the accuracy of energies and forces. Lower values result in lower error of energies but higher error of forces, while higher values result in the opposite. For MD simulations, getting accurate forces is more important to ensure the stability of the simulations, whereas for Monte Carlo calculations, the energy is the main issue. Therefore, we need to find a balance between accuracy in energies and forces depending on the task at hand. It is important to note that the introduction of forces helps to generalize the prediction of energies in regions that are not included in the training examples,^{38,43} so in most cases, the introduction of forces will be beneficial to some extent.

In our current example, the prediction error for forces decreases with increasing alpha, but for values above 0.1, the performance improves very little. In the case of the energies, all the models with *α* below 0.3 predict energies with an accuracy on the order of 1 meV/atom. Thus, we can conclude that *α* ∈ (0.1, 0.3) is the optimal range of values for a general-purpose potential for this system.

As shown in Fig. 3(b), the error of the forces does indeed decrease when adding more force information to the training set. However, this improvement diminishes when the percentage of forces included is high enough. This means that including from 10% to 20% is sufficient to achieve results close to the best performance that the model can provide.

Moreover, the number of forces included in the training examples heavily influences the computational requirements. This is shown in Fig. 4, where the training time, CPU RAM memory, and GPU VRAM memory for all three training strategies are displayed, this time as a function of the percentage of forces included. First, we note that including forces, even the smallest amount, increases at least 1 order of magnitude the time needed for training, which increases with the fraction of forces included. However, it is the memory that suffers the most with the addition of forces, rapidly increasing with it. This is somewhat mitigated by storing all the batch information in the CPU and only moving to the GPU the information one at a time. In any case, as Sec. IV will show, in most applications, including more than a small subset of the forces in the training set is not worth the computational cost.

It must be noted that in these calculations, all of the structures in the dataset are used for fitting the energy, whereas only a fraction of them are considered for training forces. Thus, let us investigate the importance of including the energy information of the structures that are not considered for force training. To do so, we have trained MLPs with only small portions of the structures of the dataset but training with all their forces in each case, i.e., excluding the energy-only structures from the training. Our results in Fig. 5 show that the removal of the remaining data for energy training heavily impacts the prediction of energies, especially when the percentage of data used is small. The prediction of forces is also affected, but not nearly as much. Thus, we conclude that adding force information helps generalize the energy prediction around the data points included in the training set, but for structures far from the reference ones, the prediction is not as accurate. Therefore, structures from all over the potential energy surface should be included in the training set to achieve accurate models.

## IV. BENCHMARK ON OPEN DATA SETS

In this section, we will benchmark our code using several open databases and compare them with the results obtained by their authors. Unless otherwise stated, the descriptor used has been the Chebyshev polynomials with the same parameters as in the reference article, and so is the chosen architecture. Here, we will use *α* = 0.1 to weigh the error for energy and force. Based on the analysis performed in Sec. III B, the prediction of energies is expected to be slightly less accurate than the one that would be achieved by fitting energies alone, but that will also result in more accurate force predictions.

### A. Titanium dioxide

First, we summarize the results that have already been shown during this paper for the database of 7815 structures of several titanium dioxide phases developed by Artrith and Urban and which was used to test the original ænet code in its first release.^{41} Reference energy and forces were obtained from DFT calculations using the Perdew–Burke–Ernzerhof (PBE) exchange–correlation functional.^{62} In that work, the model was fitted only to the reference energies, and the descriptor used was the Behler–Parrinello symmetry functions.^{51} The best fit resulted in an error of around 4 meV/atom in the energies, while the error on forces was not quantified in that work.

Our results are shown in Fig. 6, where we include the errors in energies and forces for both the training and testing sets. As we have already mentioned before, the error in energies slightly increases with the number of forces, while the one for forces drastically diminishes. The red dotted line shows the error obtained in the original work, which is slightly better than the one we get here for the energy. This is due to the choice of the *α* parameter; a similar accuracy to that of the original article could be obtained by reducing the value of that parameter, but this would come at the cost of increasing the error of the force prediction (for example, with *α* = 0.01 in Fig. 4).

On the other hand, the error of the predicted absolute value of force converges to around 0.3 eV/Å with only 20% of forces included in the training. This error corresponds to around 2% of the mean force of the structures on the dataset ($\u223c20$ eV/Å). Figure 6(c) shows the error in the direction of the predicted forces for different percentages. The addition of forces greatly improves its prediction, which is of great importance for MD simulations to be stable and accurate. Note that in cases where the absolute value of the force is smaller, the error in direction is more likely to be larger.^{63}

### B. Liquid water

Second, we consider a database composed of 9189 liquid water structures, each with 192 atoms, whose energies and forces were evaluated with the revised PBE functional and with the addition of the Grimme D3 van-der-Waals correction.^{64} This dataset was used by Chen *et al.* to test the performance of the LAMMPS and TINKER interfaces of ænet.^{6} Water systems have been widely used as a benchmarking system for new developments in the field of MLPs^{65–68} due to their complexity, so many open databases can be found. This one here is more of a challenge than the one for TiO_{2}, since the number of atoms in each structure is much higher and, therefore, so is the number of reference forces to be fitted. In this case, the authors also fitted the model to the reference energies.

This time, our results for energy training [Fig. 7(a)] are in excellent agreement with the fitting of Chen *et al.*, about 1 meV/atom. The error in forces is again lower in our fit, reducing about 50% with the smallest fraction of forces included. The absolute error of the forces cannot be directly compared to the results for the previous dataset, but the relative error is around 1% of the mean forces of the set, which is much lower in this case (around 2 eV/Å) than that of TiO_{2}. The most remarkable improvement comes from the angle deviation of forces, which becomes narrow around 0°.

### C. Li–Mo–Ni–Ti oxide

Our next example is a much more complex quaternary oxide, LMNTO, a database consisting of 2616 bulk structures, with 56 atoms each. Here, the SCAN semilocal functional^{69} was used to evaluate energies and forces. This reference dataset was used by Cooper *et al.* to test their method for including force information via Taylor series expansions,^{63} so we have the reference value of the errors for fitting to forces in addition to energies in this case. Moreover, being a system composed of five elements, it is a great example of the case in which the Chebyshev polynomials excel as atomic fingerprints, since the resulting descriptor size does not depend on the total number of elements.^{52}

Figure 8 demonstrates that our models perform equally well in predicting energies compared to the reference models but outperform them in terms of force prediction. With no force information, our models have a higher error than that of Cooper *et al.*; however, direct training on forces leads to improved performance over the Taylor series expansion. As in the two previous examples, the inclusion of forces initially increases energy error, but when sufficient force information is included, our models perform better than with only energy training. This highlights the benefit of incorporating forces in enhancing the generalization and transferability of the potentials.

### D. Amorphous Li_{x}Si materials

The last database that we will consider consists of about 45 000 structures of amorphous Li_{x}Si, developed using a combination of density functional theory calculations, using the PBE functional, and evolutionary algorithms.^{10} This set includes many phases with different stoichiometry, both bulk surfaces and nanoparticles. In this case, only energies are included in the database, so force training is not possible, but it will still be useful to see the performance of our code in a large database with many atoms per structure.

The best fit using ænet-PyTorch yields an error of 6.5 meV/atom in the training set and 7.6 meV/atom in the testing set, which is as good as the reference fit (6.3 meV/atom in training and 7.7 eV/atom in testing). As for the resources, 10 GB of memory was needed to fit the energies of the whole set with 128 structures per batch.

## V. CONCLUSIONS

The last decade has shown that MLPs will play an important role in the study of new and complex materials at the atomic scale, and this has created a huge demand for tools to efficiently train such potentials. Training on GPUs is the standard practice in most fields of machine learning, and here, we presented an upgrade of the original ænet software to provide this capability. The ænet-PyTorch extension ensures that training the neural networks is no longer a bottleneck in the development of MLPs, even when accurate forces are required.

We demonstrated with different materials examples that ænet-PyTorch is efficient, particularly when training only on energies. The CPU version of the code is as fast as the original ænet implementation, and the GPU implementation reduces the training time by one to two orders of magnitude. Due to its compatibility with the ænet package, we expect this extension to have great synergy with the other features available in ænet, such as including force information in the training via a Taylor series expansion.^{63}

If, on the other hand, direct training using atomic forces is desired, this is now feasible with the ænet-PyTorch code. We demonstrated that directly including force information in the training process is possible with ænet-PyTorch, thanks to the computationally efficient GPU implementation. Nonetheless, we strongly recommend users to include only a small fraction of forces in the training, since our benchmarks on systems with different numbers of atomic species, numbers of atoms, and dataset sizes demonstrated that accurate models can be obtained by including between 10% and 20% of the force information.

## ACKNOWLEDGMENTS

This work was supported by the “Departamento de Educación, Política Lingüística y Cultura del Gobierno Vasco” (IT1458-22), the “Ministerio de Ciencia e Innovación” (Grant No. PID2019-106644GB-I00), and the Project HPC-EUROPA3 (Grant No. INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. The authors acknowledge technical and human support provided by SGIker (UPV/EHU/ERDF, EU) and the Duch National e-Infrastructure and the SURF Cooperative for computational resources (National Supercomputer Snellius). J.L.-Z. acknowledges financial support from the Basque Country Government (PRE_2019_1_0025). N.A. acknowledges funding from the Bayer AG Life Science Collaboration (“!AIQU”).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Jon López-Zorrilla**: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). **Xabier M. Aretxabaleta**: Data curation (equal); Formal analysis (equal); Investigation (equal); Writing – review & editing (equal). **Inwon Yeu**: Validation (equal); Writing – review & editing (equal). **Iñigo Etxebarria**: Conceptualization (equal); Supervision (equal); Writing – review & editing (equal). **Hegoi Manzano**: Conceptualization (equal); Funding acquisition (equal); Supervision (equal); Writing – review & editing (equal). **Nongnuch Artrith**: Conceptualization (equal); Funding acquisition (equal); Methodology (equal); Resources (equal); Software (equal); Supervision (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

ænet-PyTorch is open source and free for everyone to use and customize, as is the ænet package. The ænet-PyTorch code can be obtained from https://github.com/atomisticnet/aenet-PyTorch. Being written purely in Python and PyTorch, we believe that this code can be easily used for prototyping new techniques based on PyTorch features, such as custom loss functions, learning rate schedulers, and dropout layers to reduce overfitting.

## REFERENCES

_{x}Si using machine-learning-assisted sampling with an evolutionary algorithm

*ab initio*calculations and machine learning

*ab initio*molecular dynamics

*Amp*: A modular approach to machine learning in atomistic simulations

_{2}

*Advances in Neural Information Processing Systems 32*

*Advances in Neural Information Processing Systems 4*

*ab initio*parametrization of density functional dispersion correction (DFT-D) for the 94 elements H-Pu