Machine learning (ML) approaches enable large-scale atomistic simulations with near-quantum-mechanical accuracy. With the growing availability of these methods, there arises a need for careful validation, particularly for physically agnostic models—that is, for potentials that extract the nature of atomic interactions from reference data. Here, we review the basic principles behind ML potentials and their validation for atomic-scale material modeling. We discuss the best practice in defining error metrics based on numerical performance, as well as physically guided validation. We give specific recommendations that we hope will be useful for the wider community, including those researchers who intend to use ML potentials for materials “off the shelf.”
I. INTRODUCTION
Machine learning (ML)-based interatomic potentials are becoming increasingly popular for mainstream material modeling.1–5 ML potentials are “trained” using quantum-mechanical reference data (energies and forces on atoms) and, once developed and properly validated, enable large-scale atomistic simulations at a similar level of quality while requiring only a small fraction of the computational cost. In recent years, ML potentials have been used to address fundamental research questions that would otherwise have been inaccessible for quantum-accurate studies: the complex high-pressure phase behavior of seemingly simple elements,6–8 the microscopic growth mechanism of amorphous carbon films,9 or the photodynamics of a biologically relevant molecule.10 At the same time, ML potentials are being developed for diverse functional materials, with applications including phase-change chalcogenides,11,12 battery electrodes and solid-state electrolytes,13–16 and multicomponent alloys.17,18 The field is thriving, without any doubt.
With ML potentials becoming increasingly available, it is important to ensure careful validation of both the overall methods and specific models. This requirement is particularly important for machine-learned data-driven models, which do not have (much) physical information “built in” by construction. A key step in validation was reported by Zuo et al., who performed a detailed study of numerical energy and force prediction errors across different types of ML potentials.19 The focus of that work was on the fair comparison of different fitting frameworks, and on the identification of a “Pareto front” of efficiency, i.e., of the respective most accurate methods and settings at a given level of computational cost. The validation tests used by Zuo et al. included liquid and crystalline structures for a number of elements,19 and an advantage of this protocol is that it could, in principle, be applied to any elemental system. In contrast, a recent publication on a carbon ML potential includes various more subject-specific (or “domain-specific”) tests, such as the formation energies of specific topological defects in graphene,20 and extended benchmarking and more complex tests were later carried out by other groups.21–23 Earlier, de Tomas et al. had stressed the importance of careful property-based validation for carbon potentials, including an ML-based example, as well as established empirical potentials.24,25 Indeed, many questions on validation are relevant for any type of interatomic potential, machine-learned or otherwise. Finally, the importance of validating ML potentials based not only on numerical error values but also on the predicted physical behavior is increasingly being pointed out in the literature.26–29
The aim of the present work is to review and discuss the validation criteria for ML potentials in a “tutorial” style. It covers both numerical and physically guided validation, while being clearly focused on ML potential models for atomistic simulations. We aim to complement related works by others: more generally on error and uncertainty measures for ML models of various types,30–32 and recently published general best-practice guidelines for the use of ML in chemistry.33,34 Our work focuses strongly on inorganic materials, and we refer the reader to an excellent tutorial on neural network (NN) potentials, which places stronger emphasis on molecular force fields—for example, for water.35 In what follows, we provide a brief overview of relevant methods and concepts, illustrative examples, and best-practice recommendations.
II. WHAT ARE ML POTENTIALS?
To set the scene, we start by briefly reviewing what ML potentials are in the first place. We will focus on the major types of choices to be made in developing ML potentials and how these choices affect the quality of the model. A reader who is familiar with the methodology may wish to skip to Sec. III.
Machine learning means extracting information from large datasets—in this case, from quantum-mechanical energies and forces. An ML potential is, therefore, a complex mathematical model for a given potential energy surface (PES) whose parameters have been learned from reference data (the ground truth). There are three main ingredients for doing so,2 as shown in Fig. 1(a).
The first step in constructing an ML potential is building the reference database to which the fit is made: devising and selecting small-scale structural models that contain enough “relevant” chemical environments to give the model high accuracy where needed, as well as sufficient constraints for it to be valid. Once representative structures (in ML terms, data locations) have been chosen, they are given reference values (data labels): energies, forces, and stresses as computed with the ground-truth method. For inorganic materials, the latter is typically some flavor of density-functional theory (DFT). Higher-level, beyond-DFT computations are beginning to be used as well.37
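As a concrete illustration of the “data locations plus data labels” idea, the following minimal sketch stores a single labeled structure in the extended-XYZ format that many ML potential fitting codes can read. It assumes that the ASE package is installed, and the numerical labels are placeholders standing in for actual DFT results; the file name is arbitrary.

```python
import numpy as np
from ase.build import bulk
from ase.calculators.singlepoint import SinglePointCalculator
from ase.io import write

# A "data location": a small, randomly perturbed ("rattled") silicon cell.
atoms = bulk("Si", "diamond", a=5.43, cubic=True)
atoms.rattle(stdev=0.05, seed=0)

# "Data labels": in practice, these come from a DFT (or beyond-DFT) single-point
# calculation; the values below are placeholders only.
energy = -5.42 * len(atoms)            # total energy in eV (placeholder)
forces = np.zeros((len(atoms), 3))     # forces in eV/Å (placeholder)

# Attach the labels to the structure and store both together.
atoms.calc = SinglePointCalculator(atoms, energy=energy, forces=forces)
write("reference_database.extxyz", [atoms])
```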
In terms of how reference data determine the quality of an ML potential [Fig. 1(b)], two aspects are relevant—corresponding to the horizontal and vertical axes in the sketches of Fig. 1, respectively. On one hand, the judicious choice of data locations (for which configurations exactly do we want to fit?) is important: this choice can be guided by the practitioner’s physical and chemical knowledge, or by automated active learning approaches for steering the database building,38–40 or both. However, even with carefully crafted datasets, there remains a risk of potentials not having “seen” what they need, resulting in poor extrapolation outside the training domain and, consequently, in incorrect behavior. On the other hand, the nature and quality of the data labels are important. Any ML potential will reproduce at best the level of data on which it has been trained: for example, if the reference data labels have been computed with a simple DFT functional that does not correctly capture van der Waals interactions, neither will the resulting ML potential. The role of the numerical quality of the data (e.g., the fact that strict convergence with k-point sampling is required) has been pointed out in Ref. 41, and has been systematically analyzed recently.42
The second step is to represent the local environments of the atoms in the database in a mathematical form that is suitable for learning. The tool for this task is most commonly called a descriptor in the community and is analogous to a set of features in ML research. A good descriptor is invariant (unchanged) with respect to translations, rotations, and permutations of atoms, and is as complete as possible45 while remaining numerically efficient. Many currently established ML potentials rely on hand-crafted descriptors constructed from radial and angular basis functions,46–48 and other recent approaches “learn” a structural representation as part of the potential fit.49,50 The construction of structural descriptors for atomistic ML has been reviewed in Ref. 51.
Again, the user’s choices here will affect the quality of the resulting potential [central panels in Fig. 1(b)]. For example, models based purely on pair-wise, “2-body” descriptors may work well for bulk metals but not for covalent systems.52 Then, there is the choice of hyperparameters, i.e., of those parameters that are not directly optimized when training a single instance of a model. As one of many examples, the cartoon in Fig. 1(b) indicates the varied atomic density broadening that can be chosen in the Smooth Overlap of Atomic Positions (SOAP) descriptor:47 larger values can make potentials robust enough for structure searching, while smaller values are needed for highly accurate predictions.53
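To make the notion of an invariant descriptor more tangible, the following sketch (not taken from any particular ML potential framework; all parameter values are illustrative) builds a purely 2-body descriptor as a Gaussian-smeared histogram of neighbor distances. It is invariant to translations, rotations, and permutations by construction, and its smearing width plays a role loosely analogous to the atomic density broadening mentioned above.

```python
import numpy as np
from ase.build import bulk
from ase.neighborlist import neighbor_list

def radial_descriptor(atoms, cutoff=4.0, n_centers=20, sigma=0.2):
    """One fixed-length, purely 2-body descriptor vector per atom: a Gaussian-
    smeared histogram of neighbor distances within the cutoff (all values in Å)."""
    centers = np.linspace(0.5, cutoff, n_centers)
    first_atom, distances = neighbor_list("id", atoms, cutoff)
    descriptors = np.zeros((len(atoms), n_centers))
    for i, d in zip(first_atom, distances):
        descriptors[i] += np.exp(-((d - centers) ** 2) / (2 * sigma**2))
    return descriptors

atoms = bulk("Si", "diamond", a=5.43, cubic=True)
atoms.rattle(stdev=0.05, seed=0)
print(radial_descriptor(atoms).shape)   # -> (8, 20): one 20-dimensional vector per atom
```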
The third step is to fit a flexible (highly parameterized) function to the reference data. Among the main classes of fitting approaches currently used for ML potentials, there are (i) artificial neural network (NN) models such as Behler–Parrinello-type NNs54–56 or the DeepMD,57 SchNet,49 and NequIP50 schemes; (ii) kernel-based methods such as the Gaussian Approximation Potentials (GAP) framework;58 and (iii) linear models including the Spectral Neighbor Analysis Potential (SNAP),48 Moment Tensor Potential (MTP),59 and Atomic Cluster Expansion (ACE)60 techniques. There are interesting connections between the different methodologies—for example, a recently proposed “multi-layer” ACE approach61 combines the ACE descriptors with message-passing NN architectures. We leave details of the available fitting methodologies to recent review articles.4,41,62
The third set of user choices, therefore, concerns the regression framework itself.63 Does one want a highly flexible deep learning model requiring lots of data, or a “tailored” kernel-based approach that can make do with fewer examples? What are the hyperparameters of the fit: in GAPs, the regularization (“expected error”) is important;41 NNs depend on the learning rate and batch size; other methods will be affected by other aspects. For the purpose of the present paper, the following discussion applies equally to any ML potential fitting framework.
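The three ingredients can be combined in a few lines for a toy problem. In the sketch below, which is a deliberately simplified stand-in for real fitting frameworks, a Lennard-Jones calculator replaces the quantum-mechanical ground truth, the simple radial descriptor above replaces SOAP/ACE-type features, and all settings are illustrative; the total energy is written as a sum of per-atom contributions that are linear in the descriptor and fitted by ridge regression.

```python
import numpy as np
from ase.build import bulk
from ase.calculators.lj import LennardJones
from ase.neighborlist import neighbor_list
from sklearn.linear_model import Ridge

def per_atom_descriptors(atoms, cutoff=6.0, n_centers=20, sigma=0.2):
    centers = np.linspace(0.5, cutoff, n_centers)
    first_atom, distances = neighbor_list("id", atoms, cutoff)
    descriptors = np.zeros((len(atoms), n_centers))
    for i, d in zip(first_atom, distances):
        descriptors[i] += np.exp(-((d - centers) ** 2) / (2 * sigma**2))
    return descriptors

rng = np.random.default_rng(0)
X, y = [], []
for _ in range(50):                                     # 50 small "reference" structures
    atoms = bulk("Ar", "fcc", a=5.3, cubic=True)
    atoms.rattle(stdev=0.1, seed=int(rng.integers(1_000_000)))
    atoms.calc = LennardJones(sigma=3.4, epsilon=0.0104, rc=6.0)
    X.append(per_atom_descriptors(atoms).sum(axis=0))   # total energy is a sum over atoms,
    y.append(atoms.get_potential_energy())              # so per-atom descriptors are summed too

X, y = np.array(X), np.array(y)
model = Ridge(alpha=1e-6).fit(X, y)
rmse = np.sqrt(np.mean((model.predict(X) - y) ** 2))
print(f"training RMSE: {rmse:.4f} eV for {len(y)} structures")
```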
III. WHAT MAKES A POTENTIAL “VALID”?
Overall, the answer to this question seems to be easy. A new ML potential, and, indeed, any interatomic potential model, has to pass two tests. Does it qualitatively predict what it should (to the extent that is comparable with experiment)? Does it quantitatively do so—say, in predicting measurable properties?
In practice, it is often very difficult to determine the quality of a potential, and to connect the numerical performance to its physical meaning. Therefore, validation tests become important. Figure 2 summarizes a way to think about the qualitative “correctness” of a potential in terms of false maxima and minima and their respective effect on predictions. This issue was discussed in recent reviews and perspective articles,41,43 and we illustrate it here using representative examples (throughout this work, we show data from our own studies, which deliberately include examples of “failed” potentials).
Qualitatively correct potentials are expected to show the correct behavior for a given system—with respect to either experiment, or a high-quality quantum-mechanical prediction, or ideally both. We illustrate this expectation on the right-hand side of Fig. 2(d), which shows snapshots from the simulated compression of amorphous silicon, as taken from a recent work.7 The key observation in that study—and thus, the benchmark that a “correct” simulation will need to reproduce—was the pressure-induced crystallization to form simple hexagonal (sh) silicon. The atomic coordination number in that phase is 8, and so the final polycrystalline structure is rendered in orange in Fig. 2(d).
The figure then compares the “correct” prediction [Fig. 2(d)] to the same simulation with a more limited, yet much faster, empirical interatomic potential [Fig. 2(e)], and also to the result of a candidate ML potential that is deliberately chosen as an example of one that is not correct [Fig. 2(f)]. In the former case, which we take to illustrate a false maximum, the structure remains relatively similar to the low-coordinated low-density form of amorphous silicon (purple)—no structural collapse or crystallization happens.7 We presume that this is because the functional form of the empirical potential used is not as flexible as that of a typical ML potential, instead favoring the tetrahedral local geometry inherent to low-density silicon. For the latter case [Fig. 2(f)], we show a candidate ML potential that predicts an unphysical structure: “unphysical” is here taken to mean that the system does not crystallize, rather getting stuck in a fully disordered configuration with incorrect coordination numbers (Ref. 44).
IV. NUMERICAL ERRORS
Numerical error metrics are of central importance in many areas of ML—for example, in quantifying the performance of a new model compared to the existing state-of-the-art (SOTA). We briefly review relevant techniques for numerical validation and error definitions that are commonly used in the ML literature (and described in introductory textbooks in more detail).64 We then discuss the application of these error metrics to interatomic potential models specifically.
A. Error measures
When testing any ML regression model, a predicted label, ŷi, is generated for each entry (each data location) in a test set of data that have not been included in training the model. Each predicted label corresponds to a correct ground-truth value, yi, and numerical error metrics are designed to summarize, in a single measure, the disparity between prediction and ground truth as taken over all data points in the test set. Two widely used metrics are the mean absolute error, MAE = (1/n) Σi |ŷi − yi|, and the root-mean-square error, RMSE = [(1/n) Σi (ŷi − yi)²]^1/2, the latter penalizing large individual errors more strongly. Different error metrics capture different qualities of this disparity.
In many settings, one prefers robust statistics, i.e., those that are not skewed by extreme values. However, we argue that robust statistics should not be used on their own in validating ML potentials, based on the following scenario [Fig. 3(a)]: consider a potential (A) that performs reasonably well over all relevant configurational space, and compare it to another potential (B) that performs extremely well on 90% of the data points but poorly on the remaining 10%. A robust error metric such as the median would identify B as being “better” than A. This is shown in Fig. 3(b), where the middle horizontal line in the box plots indicates the median value. In computational practice, however, a potential that fails dramatically to capture the true nature of the PES in a physically relevant region (B) is not valid, and is, therefore, significantly worse than the one that performs reasonably over all of configurational space (A).
The above-mentioned metrics summarize the distribution of residual errors using a single value. While offering a practical way to compare different models, they necessarily lose information about the overall distribution of errors. Using a kernel density estimation or “violin” plot is a graphically succinct way to more completely describe the error distribution [Fig. 3(c)], and in the context of ML potentials, seems preferable to a box plot [Fig. 3(b)] that incorporates robust statistics. Note how in this numerical example, both the violin plot [Fig. 3(c)] and individual RMSE and MAE values [Fig. 3(d)] provide an indication of unfavorable errors in B compared to A.
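The scenario of Fig. 3 can be reproduced numerically in a few lines; the synthetic residuals below are not the data used for the figure, but they show the same qualitative effect: the robust median prefers potential B, whereas the MAE and, especially, the RMSE expose its failures.

```python
import numpy as np

rng = np.random.default_rng(0)
errors_a = rng.normal(0.0, 0.05, 1000)                   # potential A: moderate errors everywhere
errors_b = np.concatenate([rng.normal(0.0, 0.01, 900),   # potential B: excellent on 90% ...
                           rng.normal(0.0, 0.50, 100)])  # ... but very poor on the remaining 10%

def mae(e):
    return np.mean(np.abs(e))

def rmse(e):
    return np.sqrt(np.mean(e**2))

def median_abs(e):                                       # a "robust" metric
    return np.median(np.abs(e))

for name, e in [("A", errors_a), ("B", errors_b)]:
    print(f"{name}: median |err| = {median_abs(e):.3f},"
          f" MAE = {mae(e):.3f}, RMSE = {rmse(e):.3f}")
# The robust median ranks B as "better", while the MAE and, in particular, the
# RMSE reveal the catastrophic 10% of failures.
```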
B. Subtleties (I): Errors for forces
When dealing with collections of errors for scalar properties (such as energies), the individual errors, ŷi − yi, are scalar values, and, therefore, summary metrics such as the above-mentioned MAE and RMSE are easy to apply. When dealing with errors in vector properties—in the context of ML potentials, forces on atoms—there are two main ways to apply summary metrics: one can either treat each Cartesian component of the error vector as a separate scalar (“component-wise” metrics), or take the Euclidean norm (magnitude) of each per-atom error vector. Both approaches thus convert the individual error vectors into a collection of scalar values, before calculating metrics for those. While both approaches are used in the literature, they give different absolute results.
Crucially, however, the L1 norm is not rotationally invariant, and hence, neither is the component-wise MAE: for an N-dimensional vector of Euclidean length 1, the L1 norm takes values between 1 and √N depending on its orientation relative to the Cartesian axes. This can have a strong effect even for N = 3. For example, the component-wise MAE varies by 0.15 eV Å−1 (a 43% relative difference!) when applied to variously rotated predictions for a distorted graphite structure (429th structure in the test set of the C-GAP-17 ML potential for carbon; Ref. 65). In more isotropic structures, the effect is less pronounced, but in principle remains. We, therefore, recommend that component-wise MAE values are not used when determining force errors for ML potentials, instead using component-wise RMSE values or metrics for the magnitude.
Finally, we stress that each of these metrics, applied to the same set of force errors, results in a different absolute value. For the complete C-GAP-17 test set, we obtain values of MAEmag = 1.67 eV Å−1 and RMSEmag = 1.89 eV Å−1 for the magnitude of per atom force errors, and a component-wise RMSEcomp = 1.09 eV Å−1. (A Python implementation of this analysis is provided alongside the paper; see the Data Availability statement.) It is, therefore, critical to include the exact type of metric used when reporting force errors and to compare like with like.
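The following sketch (using randomly generated arrays as placeholders for DFT and ML-predicted forces) implements the metrics discussed above and checks their behavior under a common rotation of both force sets:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def force_error_metrics(f_pred, f_true):
    """Summary metrics for force errors; inputs have shape (n_atoms, 3)."""
    diff = f_pred - f_true
    mag = np.linalg.norm(diff, axis=1)            # per-atom error magnitudes
    return {"MAE_mag": np.mean(mag),
            "RMSE_mag": np.sqrt(np.mean(mag**2)),
            "MAE_comp": np.mean(np.abs(diff)),    # component-wise MAE: NOT rotationally invariant
            "RMSE_comp": np.sqrt(np.mean(diff**2))}

rng = np.random.default_rng(0)
f_true = rng.normal(size=(64, 3))                       # placeholder "DFT" forces
f_pred = f_true + rng.normal(scale=0.3, size=(64, 3))   # placeholder "ML" forces

rotation = Rotation.random(random_state=0).as_matrix()
before = force_error_metrics(f_pred, f_true)
after = force_error_metrics(f_pred @ rotation.T, f_true @ rotation.T)
for key in before:
    print(f"{key:9s}: {before[key]:.4f} -> {after[key]:.4f}")
# Only MAE_comp changes upon rotation (only slightly so here, because the
# synthetic errors are isotropic; cf. the graphite example discussed above).
```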
C. Subtleties (II): Scaling with system size
ML potential models typically make atom-wise predictions: atomic energies (that are then summed up to give the total energy54,58) and forces on atoms. When predicting per atom properties, such as the magnitude or Cartesian components of forces, where there is a one-to-one correspondence between DFT ground truth and ML-predicted values, no further considerations regarding the number of atoms in the cell(s) are, therefore, required when reporting errors. However, for energies, DFT does not normally provide per atom values, and the ground truth in this case is the DFT total energy for a given simulation cell. When predicting system-wide properties, such as the total energy, N individual ML model predictions are, therefore, summed and compared to a single true label. In these cases, the system size does have an effect on the behavior of the numerical errors.
Figure 4 illustrates this point using again a practical example: a set of structures from the reference database of the C-GAP-17 potential for carbon (Ref. 65). We evaluated the model errors for different system sizes and investigated if, and how, they change with N. The prediction errors for the entire cells trivially increase with N [Fig. 4(a)]. In contrast, when the energy error is reported as a per atom value (by dividing the predicted total energy by N), the result systematically decreases with N [Fig. 4(c)]. Figure 4(b) illustrates that scaling the predictions by 1/√N would alleviate this problem.
We, therefore, recommend that, when comparing model performance for per-cell properties such as the total energy, datasets containing structures of the same size are used. If this is not possible, it would seem desirable to use the statistically justified normalization procedure of dividing by √N. We do note, in practical terms, that per-cell energies normalized in this way have different absolute values from the commonly quoted error values in eV per atom (which one would obtain by dividing by N), and are, therefore, not directly comparable—as seen from the different y-axis values in the panels of Fig. 4.
The most important message of this figure, rather than specific error values for a specific potential, is that numerical errors do need to be evaluated with care. Errors for a given ML potential, if quoted in isolation, will likely be of limited use; however, the evolution of well-defined error measures across different types of fitting methods, database compositions, hyperparameters, etc., and the use of such measures in systematic benchmarks will be (and will continue to be) highly informative.
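The size scaling can be illustrated with a purely statistical toy model, without reference to any particular potential: if per-atom errors are independent with a fixed spread, the per-cell error grows as √N, the per-atom error (dividing by N) shrinks, and dividing by √N gives an approximately size-independent value.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.05  # spread of the (assumed independent) per-atom energy errors, in eV
for n_atoms in (8, 64, 512, 4096):
    # 1000 hypothetical cells of n_atoms atoms each; the total-energy error of a
    # cell is taken as the sum of its per-atom errors
    cell_errors = rng.normal(0.0, sigma, size=(1000, n_atoms)).sum(axis=1)
    rmse_cell = np.sqrt(np.mean(cell_errors**2))
    print(f"N = {n_atoms:5d}:  per cell {rmse_cell:6.3f} eV,"
          f"  per atom {rmse_cell / n_atoms:9.6f} eV,"
          f"  per sqrt(N) {rmse_cell / np.sqrt(n_atoms):6.3f} eV")
```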
D. Cross-validation and external test sets
Many ML model classes can nearly perfectly fit to the data on which they have been trained. To measure a model’s true capabilities and, in particular, its ability to generalize, it is, therefore, important to test on data points that have not been included in the training.
A popular way to numerically validate an ML model is to use k-fold cross-validation [Fig. 5(a)]. This procedure involves separating the complete dataset into k non-overlapping sets (or “folds”) and training k separate models, each on the data that remain when one of the folds is held out for testing. Averaging an error metric over all k folds yields a measure that is less affected by random noise induced by the exact choice of train-/test-set splitting. In Fig. 5(a), we sketch this process schematically for a database that consists of three clusters of data. For an ML potential, these could correspond, say, to bulk crystalline, liquid, and surface structural models, respectively—see Ref. 20 for an example of how such a dataset might look in practice. The sketches in dashed boxes indicate the distribution of the data points in a 2D projection of structural similarity (similar structures being close together and vice versa), with the three types of data forming three clusters, and different points being used as the test data (green) in each of the k folds. The prediction of the ML model is then plotted against the ground-truth value for every point, as measured when it was used for testing, and error metrics such as the RMSE (Sec. IV A) can be calculated.
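The mechanics of k-fold cross-validation are sketched below with synthetic data standing in for per-structure descriptors and reference energies; any regressor following the scikit-learn interface could take the place of the ridge model used here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                            # placeholder per-structure descriptors
y = X @ rng.normal(size=20) + rng.normal(0.0, 0.1, 300)   # placeholder reference energies

fold_rmses = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1e-3).fit(X[train_idx], y[train_idx])   # fit on k-1 folds
    residuals = model.predict(X[test_idx]) - y[test_idx]        # test on the held-out fold
    fold_rmses.append(np.sqrt(np.mean(residuals**2)))

print("per-fold RMSE:", np.round(fold_rmses, 4),
      "  mean:", round(float(np.mean(fold_rmses)), 4))
```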
An alternative and more time-consuming procedure for numerical validation involves generating one or more “external” test sets. These test sets could be generated in similar ways to the training data, that is, be independent yet structurally related—or they could extend to different types of data [Fig. 5(b)].
In atomistic material modeling, an example of a general system-agnostic approach is a standardized random structure search (RSS) protocol;66–68 these searches explore a wide range of configurations, thus generating an unbiased sample of diverse atomic structures, including local minima in the PES. We discuss RSS-based test sets in more detail in Sec. V C. The creation of more specific test sets can be guided using domain knowledge about the chemical system—for instance, surfaces, defects, transition paths, and polymorphs not seen during training. To more fully understand the behavior in a particular domain, these sets can be highly specialized, for instance, containing solely a set of manually distorted crystal structures or snapshots from an MD simulation at high temperature and pressure. The error values obtained using this technique will depend strongly on the nature of the test set, and baselining using several different test sets offers a way to more comprehensively understand model behavior.
We note that our wording “external” in Fig. 5(b) refers to the construction process of the test set, rather than to its location relative to the training set. Comparing the two test sets shown in Fig. 5(b), data located in regions close to training points (upper panel) will typically be modeled more accurately than those further away (lower panel). Such “training example data leakage” can lead to overly optimistic error estimates, unless the test set accurately reflects the data to which the model will be applied.
E. Uncertainty quantification
While the present work is focused on the validation of ML potentials before they are used in production simulations, we also briefly mention numerical techniques that can be used to test the quality of a model during a simulation itself. These techniques fall in the wider remit of uncertainty quantification (UQ) and are increasingly relevant to atomistic ML.69 Some of them are specific to certain classes of models—for example, the use of the Gaussian process variance in “on-the-fly” fitting of ML potentials,70 or of dropout neural networks.71 Some are more general, such as the use of committee models (comparing predictions from multiple separately fitted models at a given data location): an early application of this idea to NN potentials was described by Artrith and Behler in 2012,72 but committee models, in general, can be built for other ML potential fitting frameworks as well.
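A minimal committee-type uncertainty estimate can be sketched as follows, again with synthetic data and simple ridge regressors standing in for real reference data and ML potentials: several models are fitted to different random subsets, and their disagreement at a new data location serves as an (uncalibrated) error estimate.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                            # placeholder descriptors
y = X @ rng.normal(size=10) + rng.normal(0.0, 0.05, 200)  # placeholder energies

# Fit a committee of models, each on a different random subset of the data.
committee = []
for seed in range(8):
    subset = np.random.default_rng(seed).choice(len(X), size=150, replace=False)
    committee.append(Ridge(alpha=1e-3).fit(X[subset], y[subset]))

# At a new data location, the spread of the committee predictions quantifies
# the model disagreement.
x_new = rng.normal(size=(1, 10))
predictions = np.array([model.predict(x_new)[0] for model in committee])
print(f"committee mean = {predictions.mean():.3f}, spread (std) = {predictions.std():.3f}")
```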
For an introduction to UQ methodology for atomistic modeling, we refer the interested reader to a comprehensive survey in Ref. 30. Using adsorption energies on surfaces as the application case, the authors provide an introduction, discussion, and evaluation of different types of relevant UQ methods.
V. PHYSICALLY GUIDED VALIDATION
In this section, we discuss a series of physically guided tests for ML potentials for materials. This area is one where the validation becomes very different from techniques used in “standard” ML research and relies on the practitioner’s domain knowledge. We argue that such physically guided validation plays an important role in making an ML potential applicable in practice.
A. Domain-specific error analysis
The first of our proposed physically guided tests is actually still to do with energy and force errors. The key difference is that we generate external testing data entirely separately, in a setting that is informed by physical and chemical knowledge as far as possible.
In Fig. 6, we show the construction and use of such a physically motivated test set for ML potentials, taken from the original work in Ref. 26. The idea is to carry out a small-scale molecular dynamics (MD) simulation that mimics the “real thing”—small enough so that structural snapshots from that MD trajectory can be evaluated (labeled) in subsequent single-point computations with the ground-truth method. In this case, the reference MD simulation was run with an ML potential (referred to as the “original model” in Fig. 6), and an important prerequisite for doing so was that that potential had itself been validated in earlier work.73,74 In the study of Ref. 26, the aim was to test the behavior of modified GAP models for elemental silicon, which were based on the original GAP-18 potential73 and included additional data and custom regularization to more accurately describe diverse crystalline allotropes and their vibrational properties.26
In this example, two new candidate ML potentials are tested, which only differ in one aspect of the fit, namely, in the choice of the regularization hyperparameters corresponding to the “expected error” in the input data [in Fig. 1(b), all aspects would, therefore, be the same except for the last one in the series]. Candidate potential 1 is a model where new structures have been added to the existing GAP-18 database, but without placing too much weight on them. Candidate 2 is a model where the regularization is “tighter” by a factor of 10, and so the potential is very good indeed at describing phonons, but at the expense of physically reasonable behavior outside that scope. In particular, an MD melt–quench simulation using candidate 1 reproduced the behavior of the original GAP-18 model, whereas the same simulation using candidate 2 failed entirely.26
Force errors on a physically motivated test set can be predictive of this behavior, as shown in Fig. 6(c). By contrast, we emphasize that the energy errors on their own do not reveal any trouble [Fig. 6(b)], other than a slightly larger scatter than the original GAP-18 model. In fact, candidate 2 was found to have a lower numerical energy error for the liquid configurations in this test set (8.4 meV/atom) than candidate 1 (11.5 meV/atom).26
Although the present work is focused on inorganic materials modeling, the general message about low-error-yet-unstable MD is fully in line with recent studies on molecular systems: see, for example, Ref. 27 (linear ACE models) or Ref. 29 (graph neural network potentials). Both in materials and molecular simulation, the question of whether there can be a “better,” more predictive external test set for numerical evaluation will likely be of continuing and, indeed, growing interest.
B. Domain-specific structural benchmarks
The second type of physically motivated tests concerns structural similarity analysis, which we carry out using the SOAP kernel.47 As in Sec. V A, we suggest validating a candidate potential by comparing its behavior to an accurate reference simulation, which could be based on DFT (small-scale) or driven by an existing and previously validated ML potential (large-scale).44
As an illustrative example, similar to Fig. 2, we use as a benchmark the results of large-scale (100 000-atom) MD simulations of the pressure-induced crystallization of amorphous silicon.7 In Fig. 7(a), selected structural snapshots are colored by the atomistic SOAP kernel similarity to the crystalline sh phase, which ranges from 0 to 1. This color-coding clearly highlights the two significant structural changes: collapse of the 4-fold-coordinated amorphous phase, followed by nucleation of sh crystallites. Figure 7(a) also shows the average SOAP similarity, taken over the whole simulation cell at each time step. The interpretation of this similarity measure is intuitive: the atomic environments in the low-density amorphous phase are mainly tetrahedral-like, and similar to the diamond-type crystalline form (high SOAP similarity value), and then become dramatically less diamond-like upon structural collapse. The similarity to sh silicon follows the opposite trend, with very high similarity at the end of the simulation positively identifying the sh crystallites compared to other possible competing phases (see Ref. 44 for more details).
There is a subtle detail in constructing these similarity plots. For the solid lines in Fig. 7(a), we relax the reference crystal structure under an external pressure that matches that of the corresponding frame in the MD simulation. This approach accounts for the change in bond lengths with pressure, which is most evident in the SOAP similarity of the low-density amorphous phase to diamond at 0–10 GPa. Even though no significant structural rearrangement occurs, the similarity to the fixed reference crystal at ambient pressure decreases linearly (dashed line)—in contrast, if a pressure-adjusted crystal is used as the reference, almost no change in the SOAP similarity is seen up to about 10 GPa (solid line).
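A minimal sketch of the average-similarity analysis is given below, assuming that per-atom SOAP vectors have already been computed for a snapshot and for the (pressure-adjusted) reference crystal, for example, with the dscribe or quippy packages; random non-negative vectors are used here as placeholders, and the kernel exponent is an illustrative choice.

```python
import numpy as np

def average_soap_similarity(snapshot_soap, reference_soap, zeta=4):
    """Average per-atom SOAP kernel between a snapshot and one reference environment.

    snapshot_soap: array of shape (n_atoms, n_features); reference_soap: (n_features,).
    The per-atom kernel is the normalized dot product raised to the power zeta.
    """
    snap = snapshot_soap / np.linalg.norm(snapshot_soap, axis=1, keepdims=True)
    ref = reference_soap / np.linalg.norm(reference_soap)
    return float(np.mean((snap @ ref) ** zeta))

# Placeholder descriptors (in practice: SOAP vectors of an MD snapshot and of the
# pressure-adjusted reference crystal, computed with identical settings).
rng = np.random.default_rng(0)
soap_snapshot = np.abs(rng.normal(size=(1000, 300)))
soap_reference = np.abs(rng.normal(size=300))
print(average_soap_similarity(soap_snapshot, soap_reference))
```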
Once a structural similarity benchmark has been developed, it can be used to assess new candidate ML potentials. In the case we review here, originally reported in Ref. 44, the aim was to train computationally much cheaper potentials that still show the same physical behavior as their “teacher” model. In Fig. 7(b), we use the quantitative structural metric provided by SOAP to compare predictions of candidate “student” potentials to the previously validated reference simulation of Ref. 7. Specifically, we generated two sets of MTP models that are controlled by their maximum level, L, fitted separately to a large database of structures (candidate 1) and to a comparatively small one (candidate 2). Unreasonable, highly coordinated false minima (of differing kinds) are detected by the SOAP analysis for candidate 2. In contrast, candidate 1 passed the test and could, therefore, be used for large-scale MD simulations with confidence.44
More generally, we think that compression MD simulations starting from some highly disordered structures can provide an insightful test for the physical behavior of ML potentials—particularly if a large system size is used, allowing for the frequent sampling of a range of configurations involving the close approach of atoms. The increasing comprehensiveness of easily accessible structural databases, such as the Materials Project,75 means that reference crystal structure data are readily available for many chemical systems, and structural similarity analyses such as those exemplified in Fig. 7 can be set up easily.
To illustrate this type of analysis in practice, we provide a Python (Jupyter) notebook that implements it for a 10 000-atom system—the smaller system size means that the code can be quickly run on standard computer hardware, and we invite the reader to use and adapt this code as they wish. A link to this notebook, as well as others supporting the present work, is provided in the Data Availability statement below.
C. Random search and exploration
Random searching was first introduced as an approach to first-principles crystal structure prediction, in the Ab Initio Random Structure Searching (AIRSS) framework by Pickard and Needs.66,67 AIRSS aims to discover previously unknown crystal structures by generating and relaxing (with DFT) large numbers of random structures that satisfy some simple constraints, such as a minimum separation between atoms,67 and a similar approach can be taken with ML potentials.53,77 The task of relaxing random structures into their local, often high-energy, minima provides a stringent test for an interatomic potential. This type of test has been introduced for silicon, where a range of widely used empirical potentials does not reproduce the energy distribution of the RSS minima as closely as an ML potential does.73
To illustrate the use of RSS in assessing and validating interatomic potentials, we show in Fig. 8 the energies and volumes of 10 000 random silicon structures that have been relaxed into local minima separately with (a) the general-purpose GAP-18 model taken from Ref. 73, which had previously been shown to reproduce AIRSS results well;73 (b) a candidate indirectly learned potential (GAP → MTP) that has been trained on GAP-18 data;44 and (c) the empirically fitted Stillinger–Weber (SW) potential,76 which is widely used for modeling silicon. For both ML potentials, a distinct basin at low volumes can be observed (arrows in Fig. 8), corresponding to structures similar to simple hexagonal-like phases at high pressures. The empirical SW potential, by contrast, fails to relax the corresponding random structures into this chemically sensible local minimum because it strongly favors diamond-like, lower-density structures [this observation is consistent with the fact that SW predicts no structural collapse under pressure; Fig. 2(e) and Ref. 7]. All three potentials do find diamond-like minima, which we define by the relaxed structure having a SOAP similarity above a set threshold to the ideal crystalline form; however, the empirical potential finds considerably fewer: 112 (GAP-18), 135 (MTP), and 55 (SW).
In a different vein, we mention that the structures produced by an RSS run—both the relaxed ones and the points along the minimization trajectory—also constitute an unbiased set that can be useful for out-of-sample testing. To this end, the RSS structures can be “labeled” with the ground-truth method, and any candidate potential can be compared against this set. The numerical errors in this case will likely be relatively high for these rather unusual structures, underscoring the need for viewing the absolute error values in context. We have shown an example of this in a recent work.44
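A highly simplified RSS-style loop is sketched below: random periodic cells are generated subject to a minimum-separation constraint, relaxed (positions only, for simplicity; full RSS also relaxes the cell), and their energies and volumes recorded. A Lennard-Jones calculator stands in for the potential under test, and all settings are illustrative rather than those of AIRSS or of Refs. 66 and 67.

```python
import numpy as np
from ase import Atoms
from ase.calculators.lj import LennardJones
from ase.optimize import FIRE

rng = np.random.default_rng(0)
n_atoms, min_sep = 8, 2.0                       # illustrative settings only
results, attempts = [], 0
while len(results) < 10 and attempts < 2000:
    attempts += 1
    box = rng.uniform(8.0, 11.0)                # random cell size -> random density
    atoms = Atoms(f"Ar{n_atoms}",
                  positions=rng.uniform(0.0, box, size=(n_atoms, 3)),
                  cell=[box, box, box], pbc=True)
    distances = atoms.get_all_distances(mic=True)[np.triu_indices(n_atoms, 1)]
    if distances.min() < min_sep:               # reject structures with atoms too close
        continue
    atoms.calc = LennardJones(sigma=3.4, epsilon=0.0104, rc=4.0)  # stand-in for an ML potential
    FIRE(atoms, logfile=None).run(fmax=0.01, steps=500)           # relax into a local minimum
    results.append((atoms.get_potential_energy() / n_atoms, atoms.get_volume() / n_atoms))

for energy, volume in sorted(results):
    print(f"E = {energy:8.4f} eV/atom,   V = {volume:6.2f} Å^3/atom")
```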
D. Experimental data
Ultimately, the test for a simulation is whether it agrees with (and explains) experimental observations. Therefore, the direct validation against experimental data is perhaps the most important and relevant test for an ML potential, especially when combined with thorough validation against the reference (ground-truth) method. Often, the validation of ML potentials for materials includes some comparisons with previously published experimental data. There is, however, a wide range of ways in which exactly this comparison might be made.
We summarize relevant techniques in Table I. Crystalline materials are widely characterized by x-ray and neutron diffraction experiments, yielding lattice parameters that may be compared to those of a computationally optimized structure. The comparison is straightforward, and yet it is important to choose the most appropriate reference data: low-temperature measurements are closer to the “zero-Kelvin” simulation than those at room temperature; high-resolution data from synchrotron experiments are better (but much more scarce) than data from in-house diffractometers; powder diffraction data are preferable for lattice parameters, while single-crystal diffraction yields the most accurate atomic positions, all else being equal. The Inorganic Crystal Structure Database (ICSD; Ref. 78 and references therein) can be helpful in locating experimental data.
TABLE I.

| Quantity | Experimental technique | Computational counterpart |
|---|---|---|
| Lattice parameters | X-ray (•) or neutron diffraction (•••) | Structural relaxation (•) |
| Thermodynamic stability | Calorimetry (••) | Energy (ΔE, •) or enthalpy (ΔH, ••) |
| Vibrational spectroscopy | Inelastic neutron scattering (•••) | Vibrational density of states from MD or phonon computations (••) |
| | Infrared or Raman spectroscopy (•) | As above, but with IR/Raman intensities predicted (•••) or ignored (••) |
| Thermal properties | Thermal conductivity measurements (••) | Phonon computations and post-processing (••) |
| Atomic spectroscopy | NMR (••), x-ray spectroscopy (•••) | Not normally directly, but can be computed using DFT on small ML-generated structural models (••) |
| Disordered structure | Pair distribution function (PDF) analysis (•••) | Radial distribution function from MD simulation (•) |
| | X-ray or neutron structure factor, S(q), for liquid and amorphous phases (•••) | Fourier transform of radial distribution function (•) |
Calorimetric measurements give information about enthalpy (≈energy) differences between different phases, say, two crystalline polymorphs of a material or an amorphous phase compared to its crystalline counterpart. In Ref. 79, for example, when validating an ML potential for SiO2, it was shown that a particular DFT level gives energetics compatible with the experiment, and this good performance is inherited by the ML model. In this case, the DFT level in question was the Strongly Constrained and Appropriately Normed (SCAN) functional.80 We note that with regard to validation, there are two separate effects here, viz., the error of the ground-truth computation compared to the experiment, and the error of the ML fit to the reference database. It is, therefore, important to disentangle the two.
The vibrational properties are another useful quantity, and the phonon dispersions for crystalline phases are now easy to predict computationally—for example, using the phonopy software.81 The experimental data are somewhat more difficult to come by if one is interested in the full phonon spectrum. While measuring the entirety of the vibrational density of states (VDOS) requires inelastic neutron scattering techniques, the much more common infrared and Raman spectroscopy are routine characterization tools in the laboratory. In turn, the latter types of spectra are more difficult to predict computationally: they require the (rather intricate) computation of absorption intensities associated with given phonon modes. ML models have begun to be developed for this purpose,82 and it is likely that “integrated” ML models that combine electronic predictions with potentials will become useful in this context.83,84
With vibrational properties available, the thermal conductivity can be predicted—and ML potentials are an increasingly popular approach for this task.85 In 2012, Sosso et al. demonstrated the use of a neural network potential for studying thermal transport in the phase-change memory material GeTe,86 and more recent studies revealed good agreement for diverse crystalline materials such as gallium sesquioxide, Ga2O3,87 zirconia, ZrO2,88 or the skutterudite-type compound YxCo4Sb12 (Ref. 89). The latter, a thermoelectric material, is an example of a practical application of ML potentials—and we believe that even in the absence of a direct application, carefully performed thermal conductivity measurements can provide challenging benchmarks for the development and validation of new ML potential models. Thermal property evaluations for complex structures can be prohibitively expensive with DFT, but they can be carried out, say, for a range of candidate ML potentials with different settings.
Spectroscopic techniques more generally are widely used to characterize materials. Emerging ML models are being built that, for example, enable comparison with x-ray photoelectron spectroscopy data,90–92 and even in the absence of those models, DFT-based predictions are often readily available (as long as a relatively small simulation cell is deemed to be sufficient). The application of such computational spectroscopy techniques to validating potentials against experiments is beginning to be explored. Shapeev et al. have shown how to validate ML potentials, in this case, built using the MTP framework, against experimental extended x-ray absorption fine structure (EXAFS) data, which provide a fingerprint of local structure.93 The authors found that a major source of possible discrepancies was the nature of the DFT reference data, which corresponds to the “quality of data labels” point in Fig. 1(b).
For liquid and amorphous phases, structural information is difficult to obtain and typically only accessible through indirect observations. A primary type of analysis involves inspecting the structure factor from x-ray or neutron diffraction, and its Fourier transform, which yields the pair distribution function (PDF). One key example is the study of battery materials, such as nanoporous carbons, which can be experimentally characterized by PDF analysis (among other techniques).94 A recent ML-driven study compared computational predictions against those quantities.95 We emphasize that these types of studies are typically focused on validating the structural model itself, not nuances of the potential. Nevertheless, the potential itself is influential in generating the structural model, so that the quality of the latter can act as a metric for the performance of a potential. We also emphasize that for amorphous materials, it will be particularly challenging to create accurate and reproducible experimental benchmarks, and it would be interesting to see whether more benchmarks of this type can be created in the future.36,96
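As a sketch of the last comparison route in Table I, the following snippet converts a radial distribution function g(r), here a smooth synthetic placeholder rather than MD output, into the static structure factor via the standard isotropic Fourier transform, S(q) = 1 + (4πρ/q) ∫ r [g(r) − 1] sin(qr) dr; the grid settings and the assumed number density are illustrative only.

```python
import numpy as np

def structure_factor(r, g_r, rho, q_values):
    """S(q) from a sampled g(r) via the isotropic Fourier transform.

    r, g_r: radial grid (Å) and RDF values; rho: number density (atoms/Å^3);
    q_values: scattering vectors (1/Å)."""
    dr = r[1] - r[0]
    integrand = r[None, :] * (g_r[None, :] - 1.0) * np.sin(q_values[:, None] * r[None, :])
    return 1.0 + 4.0 * np.pi * rho / q_values * integrand.sum(axis=1) * dr

# Smooth synthetic g(r): an excluded-volume region plus one coordination peak.
r = np.linspace(0.01, 12.0, 1200)
g_r = 1.0 - np.exp(-((r / 2.0) ** 8)) + 2.5 * np.exp(-((r - 2.35) ** 2) / (2 * 0.1**2))
q_values = np.linspace(0.5, 15.0, 300)
s_q = structure_factor(r, g_r, rho=0.05, q_values=q_values)
print(s_q[:5].round(3))
```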
The computational speed of ML potentials makes it possible to generate structural models on the length scale of several nanometers and more, including nanoscale structural features that would not be possible to describe with DFT-based simulations. For example, such ML-based structural models can include grain boundaries and inhomogeneity arising from the coexistence of different phases. The accessibility of accurate large-scale simulations allows for the convergence of structural metrics, such as the predicted structure factor, with system size. It is important to test this convergence, which is often not feasible with DFT, to ensure that any conclusions on the quality of the potential are reliable. It is also important to ensure that those tests that can be carried out with DFT are done beforehand: a fully polycrystalline sample cannot be described at that level, but small and representative structural models can. This way, the user will have energetic validation from small-scale computations, and physics-guided validation on larger-scale structures—together giving high confidence in the quality of a prediction.
VI. BEST-PRACTICE RECOMMENDATIONS
We suggest that the following should be included in publications that introduce an ML potential:
Energy and force errors from internal (k-fold) cross-validation, including a definition of how these errors have been obtained (RMSE or MAE, absolute force or force component error, etc.).
Energy and force errors for one or more external (out-of-sample) test sets, for example, from separate MD simulations or random search.
Comparison with experimental data wherever these are available from previous literature (Table I), including a brief discussion of the errors or uncertainty of the literature values.
We suggest that the following should be included in publications that use an existing ML potential:
A mention of the above-mentioned error metrics, if defined in the original publication.
A brief comment on how, and to what extent, these previously given errors are applicable to the problem studied in the new work.
If possible, benchmarks using a small-scale DFT simulation that is representative of the problem at hand (see Fig. 6 for an example).
Wherever available, comparison with experimental data sourced from previous literature (Table I).
We hope that these points will not only increase the confidence of the computational practitioners themselves (knowing that they are operating slightly away from the security of quantum mechanics) but also that of the experimental colleagues who will read the work.
We also emphasize the importance, and the expected long-term advantages, of openly sharing training and benchmark data, as well as testing workflows, especially as they become more complex. We refer again to a recently published set of best-practice guidelines for ML models more generally, in Refs. 33 and 34, beyond the case of interatomic potentials discussed herein.
VII. CONCLUSIONS AND OUTLOOK
We have provided a tutorial-style overview of the multi-faceted problem of validating ML potential models for material simulations. There has always been a need to ensure the validity and accuracy of interatomic potentials, and this need is becoming more acute as ML potentials are becoming increasingly widely used outside their specialized community of developers.
Looking forward, it would seem desirable to create openly available “packaged” tests that can be run directly from an openly available and easily accessible code, say, a Python script or automated workflow. The testing framework for elemental silicon described in Ref. 73 is an excellent example of this, and we have benefited from it ourselves during the work described in Ref. 44.
We expect that, with properly chosen reference configurations, numerical errors will remain important and, indeed, become more important. The question of which data exactly to use for numerical validation, and whether there can be an optimized set of out-of-sample testing data for one material (or for many materials), constitutes a key challenge. We envision the increased use of random (RSS) configurations in this context, as well as the creation and sharing of dedicated benchmark simulations, such as that in Fig. 6. Once suitable structural snapshots are found, which are representative of a given physical problem, anyone can download these structures, re-label them with the specific reference method used in their new potential, and evaluate the error on those (the re-labeling step will be required in most cases because, usually, ML potentials are fitted to data at different computational levels). We believe that errors on a specified, physically guided benchmark will be useful in evaluating future generations of ML potentials, and that they will convey information that simple cross-validation cannot.
We hope that the ideas and approaches discussed in this Tutorial will help establish ML potentials as everyday tools in material modeling, in the same way that DFT-based simulation methods are abundantly and very successfully used today. We look forward to seeing how, in the years ahead, carefully crafted and validated ML potentials will accelerate scientific discovery in physics, chemistry, and related fields.
ACKNOWLEDGMENTS
We thank J. George, A. V. Shapeev, and Y. Zhou for helpful comments on the manuscript. J.D.M. acknowledges funding from the EPSRC Centre for Doctoral Training in Inorganic Chemistry for Future Manufacturing (OxICFM), Grant No. EP/S023828/1. J.L.A.G. acknowledges a UKRI Linacre—The EPA Cephalosporin Scholarship, support from an EPSRC DTP (Award No. EP/T517811/1), and from the Department of Chemistry, University of Oxford. V.L.D. acknowledges a UK Research and Innovation Frontier Research grant (Grant No. EP/X016188/1). Structural drawings were created with the help of OVITO.97
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Joe D. Morrow: Investigation (lead); Writing – original draft (equal); Writing – review & editing (equal). John L. A. Gardner: Investigation (supporting); Writing – original draft (equal); Writing – review & editing (equal). Volker L. Deringer: Supervision (lead); Writing – original draft (lead); Writing – review & editing (equal).
DATA AVAILABILITY
Data generated in this work, as well as Python code to reproduce relevant figures, are provided in openly available repositories. Specifically, Jupyter notebooks implementing numerical error measures and methods for physical validation are provided at https://github.com/MorrowChem/how-to-validate-potentials. A copy has been deposited in Zenodo and is available at https://doi.org/10.5281/zenodo.7675642.