The MACE architecture represents the state of the art in the field of machine learning force fields for a variety of in-domain, extrapolation, and low-data regime tasks. In this paper, we further evaluate MACE by fitting models for published benchmark datasets. We show that MACE generally outperforms alternatives for a wide range of systems, from amorphous carbon, universal materials modeling, and general small molecule organic chemistry to large molecules and liquid water. We demonstrate the capabilities of the model on tasks ranging from constrained geometry optimization to molecular dynamics simulations and find excellent performance across all tested domains. We show that MACE is very data efficient and can reproduce experimental molecular vibrational spectra when trained on as few as 50 randomly selected reference configurations. We further demonstrate that the strictly local atom-centered model is sufficient for such tasks even in the case of large molecules and weakly interacting molecular assemblies.

Machine learning force fields are becoming part of the standard toolbox of computational chemists, as demonstrated by the increasing number of successful applications leading to new scientific discoveries in fields including amorphous materials,1 high-pressure systems,2 phase diagrams,3 and reaction dynamics of molecules.4 These applications were enabled by significant effort in developing a wide range of novel machine learning force field architectures. Recently, many of these were incorporated into a single, unifying design space,5,6 which helped uncover the relationship between seemingly dissimilar approaches such as descriptor based machine learning force fields7–12 and graph neural network based models.13–17 This new understanding directly led to the MACE architecture,18 which combines equivariant message passing with a high body-order description of the local atomic environment via spherical harmonic polynomials, leading to excellent performance in terms of accuracy, extrapolation, data, and computational efficiency.

In this paper, we evaluate the MACE architecture by training it on previously published datasets and benchmark its performance on a wide variety of tasks to demonstrate its out-of-the-box applicability across the chemical and physical sciences. The paper is organized as follows. Sec. II briefly reviews the MACE architecture, and we also detail a training strategy that results in highly accurate models on all tested systems. In Sec. III, we provide evidence that MACE can accurately describe the dynamics of large supra-molecular systems, including those critically controlled by intermolecular interactions. Section IV shows that MACE can fit a large, chemically diverse dataset of compounds made up of the H, C, N, and O chemical elements, improving on the previous state of the art by over a factor of three. In Sec. V, we demonstrate that MACE can accurately fit the potential energy surface of small molecules with as few as 50 conformers. The resulting model is not only accurate on an independent test set but is also able to run long, stable molecular dynamics simulations without the need for any parameter tuning or any further iterative training. In Secs. VI, VII, and IX, we test MACE on condensed phase systems (carbon, disordered materials, and liquid water), showing considerable improvements in accuracy compared to the models that were previously tested on these datasets. In the case of water, we also show that the resulting MACE model can accurately describe the thermodynamic and kinetic properties using NVT and NPT molecular dynamics simulations. Finally, in Sec. X, we evaluate the MACE architecture on the QM9 machine learning benchmark, demonstrating that it also improves on the state of the art for several targets that are not force field fitting tasks.

In this section, we briefly review the MACE many-body equivariant message passing force field architecture and then discuss a general training strategy relying on a loss scheduler. We also highlight the body-order of MACE models and discuss its implications for smoothness and extrapolation.

The MACE model parametrizes the mapping from the positions and chemical elements of the atoms to their potential energy. This is achieved by decomposing the total potential energy of the system into site energies (atomic contributions). The site energy of each atom depends on symmetric features that describe its chemical environment. MACE parametrizes these features using a many-body expansion. To construct the features on the atoms, first, the local environment N(i) of atom i has to be defined. N(i) is the set of all atoms j in the system for which |rij| ≤ rcut, where rij denotes the vector from atom i to atom j and rcut is a predefined cutoff (we discuss the effect of choosing different cutoff radii in Sec. III). Defining a local neighborhood allows the construction of a graph from the geometry, where the nodes are the atoms and the edges connect atoms in each other’s local environment. We denote the array of features of node (atom) i by hi and express them in the spherical harmonic basis, and hence its elements are always indexed by l and m. The superscript on h(s) indicates the iteration steps (corresponding to “layers” of message passing in the parlance of graph neural networks), and we denote the number of layers in the model by S. Equivariance of the model is achieved by utilizing the transformation properties of spherical harmonics Ylm under 3D rotations, which are inherited by the node features with corresponding indices.
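As an illustration of this neighborhood construction, the following minimal sketch builds the cutoff graph for a small molecule using ASE; the molecule, cutoff value, and use of `ase.neighborlist` are illustrative choices, not part of the MACE implementation itself.

```python
import numpy as np
from ase.build import molecule
from ase.neighborlist import neighbor_list

atoms = molecule("CH3CH2OH")   # toy ethanol geometry
r_cut = 5.0                    # cutoff radius in Angstrom

# For every pair with |r_ij| <= r_cut: i, j are the node indices and D the
# displacement vectors r_ij; together they define the edges of the graph.
i, j, D = neighbor_list("ijD", atoms, r_cut)
edges = np.stack([i, j])                           # shape (2, n_edges)
assert np.all(np.linalg.norm(D, axis=1) <= r_cut)  # all edges respect the cutoff
```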

We now describe the complete MACE model as it was introduced in Ref. 18. In our discussion, we focus on showing both the key equivariant operations and also on identifying all free parameters of the model that are optimized during training. The equations are grouped together in a block for ease of reference. All free (“learnable”) parameters are denoted by the letter W, and all occurrences below correspond to different blocks of free parameters—the shapes of these parameter arrays are indicated by the use of different indices. Note that in addition to the explicitly shown free parameters W, the fully connected multi-layer perceptrons (MLPs) also contain internally further learnable parameters.

We start by initializing the node features hi(0) as a (learnable) embedding of the chemical elements with atomic number zi into k learnable channels, cf. Eq. (1). This kind of mapping has been extensively used for graph neural networks13–16 and elsewhere19,20 and has been shown to lead to some transferability between molecules with different elements.5 The zeros for the lm indices correspond to these initial features being scalars, i.e., rotationally invariant. The higher order elements of hi(0) with nonzero lm indices are implicitly initialized to zero. At the beginning of each subsequent iteration, the node features (both scalars and higher order) are linearly mixed together, resulting in h̄j [cf. Eq. (2)].

Next, we combine the features of each of the neighboring atoms (edges in the graph) with the interatomic displacement vectors pointing to them from the central atom expressed using radial and spherical harmonics basis. This is analogous to the construction of the one-particle basis of neighbor density representations, such as SOAP8 and ACE,5,10 and we construct it in a similar way to Cormorant21 and NequIP.16 The relationships between these different approaches are discussed in Ref. 5. We construct the radial basis set using the first spherical Bessel function j0n for different wavenumbers, n, up to some small maximum (typically 8), in Eq. (3) as proposed in Ref. 14. The Bessel functions are multiplied by a polynomial cutoff function fcut(rij), which goes to zero smoothly at rij = rcut. The radial information is then passed through an MLP, Eq. (4), which takes as inputs the Bessel functions with different frequencies and produces many outputs, indexed by (η1, l1, l2, l3), whose purpose is explained next. When combining positional information (itself an equivariant that transforms under rotation like a vector) with equivariant node features, we use the spherical tensor product formalism of angular momentum addition.22 All possible combinations of equivariants are constructed using the appropriate Clebsch–Gordan coefficients, Eq. (5). This operation is currently implemented using e3nn.23 There are multiple ways of constructing an equivariant combination with a given symmetry corresponding to (l3, m3), and these multiplicities are enumerated by the index η1. The large number of outputs (and their associated learnable weights) from the radial MLP are used as separate degrees of freedom,
$$h_{i,k00}^{(0)} = \sum_{z} W_{kz}\, \delta_{z z_i},$$
(1)
$$\bar{h}_{i,k l_2 m_2}^{(s)} = \sum_{\tilde{k}} W_{k \tilde{k} l_2}^{(s)}\, h_{i, \tilde{k} l_2 m_2}^{(s)},$$
(2)
$$j_0^{n}(r_{ij}) = \sqrt{\frac{2}{r_{\mathrm{cut}}}}\, \frac{\sin\!\left( \frac{n \pi r_{ij}}{r_{\mathrm{cut}}} \right)}{r_{ij}}\, f_{\mathrm{cut}}(r_{ij}),$$
(3)
$$R_{k \eta_1 l_1 l_2 l_3}^{(s)}(r_{ij}) = \mathrm{MLP}\!\left( \left\{ j_0^{n}(r_{ij}) \right\}_{n} \right),$$
(4)
$$\phi_{ij, k \eta_1 l_3 m_3}^{(s)} = \sum_{l_1 l_2 m_1 m_2} C_{\eta_1, l_1 m_1 l_2 m_2}^{l_3 m_3}\, R_{k \eta_1 l_1 l_2 l_3}^{(s)}(r_{ij})\, Y_{l_1}^{m_1}(\hat{r}_{ij})\, \bar{h}_{j, k l_2 m_2}^{(s)},$$
(5)
$$A_{i, k l_3 m_3}^{(s)} = \sum_{\tilde{k}, \eta_1} W_{k \tilde{k} \eta_1 l_3}^{(s)} \sum_{j \in N(i)} \phi_{ij, \tilde{k} \eta_1 l_3 m_3}^{(s)},$$
(6)
$$A_{i, k \boldsymbol{l m}}^{(s), \nu} = \prod_{\xi = 1}^{\nu} A_{i, k l_\xi m_\xi}^{(s)},$$
(7)
$$B_{i, \eta_\nu k L M}^{(s), \nu} = \sum_{\boldsymbol{l m}} C_{\eta_\nu \boldsymbol{l m}}^{L M}\, A_{i, k \boldsymbol{l m}}^{(s), \nu},$$
(8)
$$m_{i, k L M}^{(s)} = \sum_{\nu} \sum_{\eta_\nu} W_{z_i \eta_\nu k L}^{(s), \nu}\, B_{i, \eta_\nu k L M}^{(s), \nu},$$
(9)
$$h_{i, k L M}^{(s+1)} = \sum_{\tilde{k}} W_{k L, \tilde{k}}^{(s)}\, m_{i, \tilde{k} L M}^{(s)} + \sum_{\tilde{k}} W_{k z_i L, \tilde{k}}^{(s)}\, h_{i, \tilde{k} L M}^{(s)}.$$
(10)
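As a concrete illustration of Eq. (3), the sketch below evaluates the Bessel radial basis multiplied by a smooth polynomial envelope. The envelope polynomial shown here is the common p = 5 form and may differ in detail from the one used in the MACE code.

```python
import numpy as np

def bessel_basis(r, r_cut, n_max=8):
    # j_0^n of Eq. (3): sqrt(2/r_cut) * sin(n*pi*r/r_cut) / r for n = 1..n_max
    n = np.arange(1, n_max + 1)
    return np.sqrt(2.0 / r_cut) * np.sin(n * np.pi * r / r_cut) / r

def f_cut(r, r_cut, p=5):
    # Polynomial envelope going smoothly to zero at r = r_cut (illustrative form)
    x = r / r_cut
    env = (1.0 - 0.5 * (p + 1) * (p + 2) * x**p
           + p * (p + 2) * x**(p + 1)
           - 0.5 * p * (p + 1) * x**(p + 2))
    return np.where(x < 1.0, env, 0.0)

r = 2.7  # an example interatomic distance in Angstrom
radial_features = bessel_basis(r, r_cut=5.0) * f_cut(r, r_cut=5.0)
print(radial_features.shape)  # (8,): the input of the radial MLP of Eq. (4)
```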

The one-particle basis ϕ is summed over the neighborhood in Eq. (6); this is where permutation invariance of the MACE descriptors over the atoms in the neighborhood is achieved—note that the identity of chemical elements has already been embedded, and hence this sum is over all atoms, regardless of their atomic number. A linear mixing of k channels with learnable weights yields the atomic basis, Ai (cf. ACE5,10).

In an analogous fashion to the ACE construction, many-body symmetric features are formed on each atom by taking the tensor product of the atomic basis, A, with itself ν times, yielding the “product basis,” Ai [cf. Eq. (7)]. Note that in forming the tensor product, each k channel is treated independently. This method of tensor sketching has been widely used in signal processing24 and was formally shown to not degrade the expressibility of many-body models while substantially reducing computational cost compared to a full tensor product.25 The tensor product of Eq. (7) is implemented using an efficient loop tensor contraction algorithm.18 
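To make the per-channel tensor product of Eq. (7) concrete, the sketch below shows the invariant (L = 0) special case, where the Clebsch–Gordan contractions reduce to plain products and the product basis is simply element-wise powers of the atomic basis; shapes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
k_channels, nu_max = 8, 3

# Atomic basis A_i of Eq. (6) for one atom, invariant (l = 0) part only.
A_i = rng.normal(size=k_channels)

# Product basis of Eq. (7): each channel k is raised to the powers 1..nu_max
# independently; no mixing between channels takes place at this stage.
product_basis = np.stack([A_i**nu for nu in range(1, nu_max + 1)])  # (nu_max, k)

# For purely invariant features the symmetrization of Eq. (8) is essentially
# trivial, and the message of Eq. (9) becomes a learnable linear combination
# over the body orders (the weights are random placeholders here).
W = rng.normal(size=nu_max)
m_i = np.einsum("n,nk->k", W, product_basis)  # shape (k_channels,)
```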

In Eq. (8), the product basis is contracted to yield the fully symmetric basis, Bi, using the generalized Clebsch–Gordan coefficients, C, where again, there are multiple ways to arrive at a given output symmetry, and these are enumerated by ην. The bold lm signify that these are multi-indices (an array of indices), and the bold styles of A and B are a reminder that these are many-body features. The maximum body-order is controlled by limiting ν.

Finally, a “message” mi is formed on each atom as a learnable linear combination of the symmetrized many-body features of the neighbors, as shown in Eq. (9). To form the node features of the next layer, we are adding this message to the atoms’ (nodes’) features from the previous iteration using weights that depend explicitly on the atoms’ chemical element (zi), as shown in Eq. (10). Note that because the initial node features h(0) are solely functions of the chemical element corresponding to the node, in the first layer the second term of Eq. (10) is omitted. This allows setting and fixing the energy of isolated atoms (i.e., those with no neighbors in their environment),5 which is often desirable.26 

Equations (2)–(10) comprise a MACE layer, and multiple layers are built by iteration. This means that the effective receptive field of the model (the region around an atom from which information is used to determine the site energy of the atom) is approximately a sphere of radius S × rcut. More precisely, a neighboring atom contributes to the site energy if it can be reached in S hops on the graph defined earlier. The clusters selected this way form a sparse subset of all possible clusters within the S × rcut sphere. Therefore, message passing methods can be viewed as sparsifications of larger atom-centered methods.5 In practice, we almost always use two layers, so S = 2.

The output of the model, the site energy, is a learnable combination of the rotationally invariant parts of the node features,
$$E_i = \sum_{s=1}^{S} E_i^{(s)} = \sum_{s=1}^{S} \mathcal{R}^{(s)}\!\left( h_i^{(s)} \right),$$
(11)
where R is a linear map for the first layer features and a shallow one hidden layer MLP for the second layer features,
$$\mathcal{R}^{(s)}\!\left( h_i^{(s)} \right) = \begin{cases} \sum_{k} W_{k}^{(s)}\, h_{i,k00}^{(s)} & \text{if } s < S, \\ \mathrm{MLP}\!\left( \left\{ h_{i,k00}^{(s)} \right\}_{k} \right) & \text{if } s = S. \end{cases}$$
(12)
Keeping the readout function linear for all but the last layer helps preserve the body-ordered nature of the model, see Sec. II B and Ref. 5. The forces on the atoms are determined as usual by taking analytical derivatives of the total potential energy using auto-differentiation tools,
$$\mathbf{F} = -\nabla \sum_{i} E_i.$$
(13)
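Equation (13) is evaluated by automatic differentiation. A minimal PyTorch sketch of this step is shown below, where `site_energy_model` is a placeholder for any callable returning per-atom site energies.

```python
import torch

def compute_forces(site_energy_model, positions):
    """Forces as negative gradients of the total energy, cf. Eq. (13)."""
    positions = positions.clone().detach().requires_grad_(True)
    total_energy = site_energy_model(positions).sum()   # sum of site energies E_i
    (grad,) = torch.autograd.grad(total_energy, positions)
    return -grad                                         # F = -dE/dr
```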

Neural network models such as MACE have a very large number of free parameters, especially compared with linear or kernel based models typically applied to the same tasks. It is thought that this overparameterization helps the training when using stochastic gradient descent and brings accuracy and regularity.27,28 The trade-off between the size (and, therefore, computational cost) of the model and its accuracy needs to be controllable. The key model size controls in MACE are the number of embedding channels k and the highest order Lmax of the symmetric features Bi,ηνkLM(s),ν. In this paper, we will refer to the different sized models using these two numbers; for example, the invariant MACE model (corresponding to Lmax = 0) with 64 channels is denoted MACE 64-0.

The computational cost of running a MACE model depends on the details of the systems studied (e.g., the average number of neighbors) and the hyperparameters that control the size of the model. The latency of the models (i.e., the shortest time to calculate forces on small systems, excluding calculation of the neighbor list and other book-keeping necessary for running MD) ranges from 0.7 to 20 ms, corresponding to small (MACE 64-0) or larger (MACE 256-2) models as measured on an Nvidia A100 GPU. A comprehensive assessment of MD performance, which we leave to future work, needs to consider the relationship between model size, accuracy, and execution speed. It is also useful to report the fastest possible MD performance, which we take to be that of the smallest model that can run stable MD for an arbitrary (in practice, very large) number of steps. At present, the performance of MACE in MD simulations is between 2 and 50 M steps/day (a million steps corresponds to a nanosecond with a 1 fs time step), depending on model size and the details of the full software stack, with the highest performance achieved using the JAX version of MACE together with JAX-MD29 (see, e.g., the salicylic acid example run using the 64-0 model). The evaluation speeds in LAMMPS and OpenMM are currently somewhat slower, but new custom CUDA kernels and improved MD interfaces are actively being developed that will likely enable even higher performance. The training times of the MACE models also vary vastly depending on the dataset size and the model size and typically range from 1 h to 1 day on a single A100 GPU for the models shown here.

The computational performance of MACE compares very favorably to other alternative equivariant neural network potentials. This is primarily the result of MACE forming the high body-ordered features very efficiently via the symmetric tensor products carried out on the nodes. For example, the Allegro model, which is one of the leading architectures that uses tensor products between the features on the edges of the graph rather than nodes, has a reported speed of 18 M steps per day.30 It should be noted though, that the Allegro model has already been demonstrated to scale well to many millions of atoms on many GPUs.31 This is, in principle, possible for MACE, which is also a strictly local model with a similarly moderate size receptive field, 5–10 Å, though the necessary LAMMPS interface implementation is ongoing work and will be demonstrated in future publications. Allegro typically uses 4–6 Å for production runs and larger for accuracy benchmarks.

It is interesting to consider the node features hi of MACE as body-ordered descriptors that, in the limit of the complete basis of radial functions and spherical harmonics, fully linearly span the space of symmetric functions over chain like clusters with hops of size rcut up to the maximum body-order of the features.5,25,32 In the MACE models studied in this paper, each layer has a body-order of 4 [corresponding to ν = 3 in Eq. (7)]. Therefore, the node features of the first layer are four-body functions. They are then expanded in the second layer in the second one-particle basis, which is 4 + 1 = 5-body functions, since each application of the one-particle basis adds one to the body-order on account of the central atom. Then, taking the tensor products of these features with themselves three times, we obtain 3 × 4 + 1 = 13-body features (note that the five-body terms of the first layer all share the same central atom). This mechanism of efficiently forming very high body-order linearly complete features is unique to the MACE architecture, and it might be one of the reasons underpinning its excellent performance in the low-data regime and extrapolation.
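One way to summarize this counting (our own shorthand, not notation taken from Ref. 18) is the recursion below, where $n_s$ is the body order of the node features after layer $s$ and $n_0 = 1$ corresponds to the element embedding of the central atom:

$$n_s = \nu\, n_{s-1} + 1, \qquad n_1 = 3 \times 1 + 1 = 4, \qquad n_2 = 3 \times 4 + 1 = 13 .$$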

We found that to get the best performance from the MACE model, particular care has to be taken with the weight factors in the loss function. The training loss, L, is typically computed as the weighted sum of the mean squared errors of the total energy, the force components (and virials or stresses if available for periodic systems), e.g.,
$$\mathcal{L} = \frac{\lambda_E}{B} \sum_{b=1}^{B} \left( \frac{E_b - \hat{E}_b}{N_b} \right)^{2} + \frac{\lambda_F}{3B} \sum_{b=1}^{B} \sum_{i_b=1}^{N_b} \sum_{\alpha=1}^{3} \left( -\frac{\partial E_b}{\partial r_{i_b,\alpha}} - \hat{F}_{i_b,\alpha} \right)^{2},$$
(14)
where the sum is taken over the atomic configurations b in the current batch with batch size B, Eb is the model’s prediction of the total energy, and Êb and F̂b are the training data corresponding to total energy and force, respectively. The number of atoms in the configuration is Nb; atoms are indexed by ib with elements of their position vector denoted by rib,α in the Cartesian direction α. The weights of energies and forces are controlled by λE and λF. The choice of these weights is crucial, and we observed that the most accurate models are obtained when we train with a high weight on the forces compared to the other properties. A possible reason for this could be that the forces are local quantities and contain information about the dependence of the total energy on each atom. However, having very accurate force predictions does not necessarily result in accurate energy predictions. This is especially the case in systems where the training set is heterogeneous, by which we mean that it is composed of a wide variety of different systems well separated in atomic configuration space, e.g., different molecules or phases of a solid with very different structures. In this case, the model can accurately learn the local potential energy surface of each system individually, but it might not learn their relative energies correctly, resulting in large absolute energy errors.
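A minimal PyTorch sketch of the energy–force loss of Eq. (14) for a single batch is given below; the tensors and the default weights are placeholders, and, for clarity, the force term is written over precomputed force predictions rather than explicit derivatives.

```python
import torch

def energy_force_loss(E_pred, E_ref, F_pred, F_ref, n_atoms,
                      lambda_E=1.0, lambda_F=10.0):
    # E_pred, E_ref: (B,) total energies; n_atoms: (B,) atoms per configuration
    # F_pred, F_ref: (N_total, 3) force components of all atoms in the batch
    B = E_pred.shape[0]
    energy_term = (lambda_E / B) * torch.sum(((E_pred - E_ref) / n_atoms) ** 2)
    force_term = (lambda_F / (3 * B)) * torch.sum((F_pred - F_ref) ** 2)
    return energy_term + force_term
```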

To reduce absolute energy errors, we use a loss scheduler. For about 60% of the total training time, λF > λE is used. For the second part of the training, the weights are switched so that λE > λF and the learning rate is decreased by a factor of 10. In this way, the absolute energy errors can be reduced while maintaining the high accuracy of the force predictions. An example of the energy validation error improving dramatically as the loss weights are changed is given in Appendix A, Fig. 7. The figure also compares the two-phase training with training that uses the second-phase weights from the beginning, showing that the two-phase protocol not only accelerates convergence but also results in an overall decrease in errors. Note that many of the experiments in the paper were carried out using an earlier version of the loss function, which also used the average over the force components rather than the sum, as shown in Appendix B.
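The two-phase schedule can be expressed as a simple rule of the following kind; the specific weight values and the 60% switch point shown here are illustrative placeholders (the weights actually used for the MD22 models are listed in the Appendix).

```python
def loss_weights_and_lr(epoch, total_epochs, base_lr=0.01, switch_frac=0.6):
    """Two-phase loss schedule: force-dominated first, energy-dominated afterwards."""
    if epoch < switch_frac * total_epochs:
        return {"lambda_E": 1.0, "lambda_F": 100.0, "lr": base_lr}
    # Second phase: swap the emphasis and reduce the learning rate by 10x.
    return {"lambda_E": 100.0, "lambda_F": 1.0, "lr": base_lr / 10.0}
```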

In this section, we examine the locality assumption used by MACE and compare its performance against a global machine learning force field, sGDML, which does not use a cutoff to decompose the energy into local site energy terms.33 In particular, we compare MACE to the recently improved version of sGDML, which uses a reduced set of global descriptors to allow for the fitting of systems with hundreds of atoms.34 Furthermore, we also compare MACE to the VisNet-LSRM model,35 which employs a mixed short-range and long-range description of the potential energy surface by combining local message passing with message passing between larger fragments.

As explained in Sec. II A, the effective cutoff in MACE is the number of layers times the cutoff distance in each layer. We also examine to what extent the information transfer between layers is effective and how property predictions of large molecular systems differ between a two-layer and a single-layer MACE model with the same receptive fields.

For this study, we use the recently published MD22 dataset, which was designed to be challenging for short-range models.36 The dataset includes large molecules and molecular assemblies containing hundreds of atoms with complex intermolecular interactions. The dataset was created by running ab initio molecular dynamics simulations at elevated temperatures to better sample the configuration space of the systems. The training set size was determined such that the total energy error of the sGDML model is below the chemical accuracy of 1 kcal/mol. Our study uses the same training set sizes for a fair comparison (Table I).

TABLE I.

Global vs long-range vs local models on MD22 dataset of large molecules. Energy (E, meV/atom) and force (F, meV/Å) mean absolute errors (MAE) of models. The approximate diameter of the system is denoted by d. For each system the model with the lowest error is highlighted in boldface.

| System (approximate diameter d) | MAE | MACE 256-2 (2 × 3 Å) | MACE 256-0 (1 × 6 Å) | MACE 256-2 (2 × 5 Å) | VisNet-LSRM (long-range) | sGDML (global) |
|---|---|---|---|---|---|---|
| Tetrapeptide (d ∼ 12 Å) | E | 0.608 | 0.345 | 0.064 | 0.080 | 0.40 |
| | F | 7.6 | 17.0 | 3.8 | 5.7 | 34 |
| Fatty acid (d ∼ 16 Å) | E | 0.446 | 0.399 | 0.102 | 0.058 | 1.0 |
| | F | 6.2 | 23.5 | 2.8 | 3.6 | 33 |
| Tetrasaccharide (d ∼ 14 Å) | E | 0.252 | 0.357 | 0.062 | 0.044 | 2.0 |
| | F | 6.8 | 27.0 | 3.8 | 5.0 | 29 |
| Nucleic acid (AT-AT) (d ∼ 22 Å) | E | 0.902 | 0.155 | 0.079 | 0.055 | 0.52 |
| | F | 13.3 | 14.9 | 4.3 | 5.2 | 30 |
| Nucleic acid (AT-AT-CG-CG) (d ∼ 24 Å) | E | 0.603 | 0.166 | 0.058 | 0.049 | 0.52 |
| | F | 16.3 | 20.1 | 5.0 | 8.3 | 31 |
| Buckyball catcher (d ∼ 15 Å) | E | 0.476 | 0.171 | 0.141 | 0.124 | 0.34 |
| | F | 13.1 | 22.2 | 3.7 | 11.6 | 29 |
| Double-walled nanotube (d ∼ 33 Å) | E | 0.207 | 0.231 | 0.194 | 0.117 | 0.47 |
| | F | 17.9 | 39.6 | 12.0 | 28.7 | 23 |

In order to test the influence of the receptive field of local models on their accuracy, we trained a series of three MACE models with different combinations of cutoff distances and the number of layers. The typical MACE model with two layers and a 5 Å cutoff at each layer is compared to a one layer MACE with a 6 Å cutoff and a two layer MACE with both layers having 3 Å cutoffs.

The best performing MACE model, which employs a 2 × 5 Å cutoff, improves the errors of the sGDML model by up to a factor of 10. Crucially, even this 10 Å receptive field is considerably smaller than the diameter of the systems in the dataset, so the MACE model is effectively local. Even the 2 × 3 and 1 × 6 Å models outperform sGDML for all systems. This is likely because the strength of local intramolecular (covalent) interactions is much higher than that of intermolecular interactions, so better overall accuracy can be achieved by only improving the short-range description, which MACE evidently does. In Appendix D, we show the vibrational spectrum of the tetrapeptide, showing excellent agreement between all three MACE models. When comparing MACE to VisNet-LSRM, the best model reported to date, we find that the strictly local MACE model has significantly lower force errors, but in most cases, the long-range model has lower energy errors, though it should be noted that the energy errors of both models are on the order of 0.1 meV/atom or lower in most cases. This small but consistent improvement in the energies might result from a more accurate description of the remaining small long-range interactions beyond the 10 Å cutoff of our longest range MACE model or from different relative force and energy weights used during training. It is not trivial to devise benchmarks that specifically test the description of long-range interactions and would therefore pose a difficulty for short-range models.

Typical intermolecular interactions appear to be well captured in the 5–6 Å range. The evidence for this is in the energy errors for the nucleic acid and the Bucky-ball catcher systems. In these systems, the intermolecular interactions contribute considerably to the total energy, and the MACE model with a 3 Å cutoff in each layer cannot describe them well. In contrast, the two longer-ranged MACE models have significantly lower energy errors. It is not a coincidence that most short range ML force field models in the literature have cutoffs between 5 and 10 Å (Table I).

TABLE II.

Mean absolute errors on the COMP6 benchmark dataset. Total energies are given in kcal/mol, forces in kcal/mol/Å. Note that the ANI-1x model was trained on 10× more data than the other models. *For NewtonNet, the decomposition of errors for the subsets was not published, and conformations of molecules whose energies were outside a 100 kcal/mol energy range were omitted from the testing. For each test set the model with the lowest error is highlighted in boldface.

| Test set | MAE | ANI-1x | GM-NN | NewtonNet | TensorNet | MACE 64-0 | MACE 96-1 | MACE 192-2 |
|---|---|---|---|---|---|---|---|---|
| ANI-MD | E | 3.40 | 3.83 | ⋯ | 1.61 | 10.3 | 2.81 | 3.25 |
| | F | 2.68 | 1.43 | ⋯ | 0.82 | 1.92 | 0.89 | 0.62 |
| DrugBank | E | 2.65 | 2.78 | ⋯ | 0.98 | 1.81 | 1.04 | 0.73 |
| | F | 2.86 | 1.69 | ⋯ | 0.75 | 1.20 | 0.70 | 0.47 |
| GDB 7–9 | E | 1.04 | 1.22 | ⋯ | 0.32 | 0.77 | 0.40 | 0.21 |
| | F | 2.43 | 1.41 | ⋯ | 0.53 | 0.96 | 0.54 | 0.34 |
| GDB 10–13 | E | 2.30 | 2.29 | ⋯ | 0.83 | 1.54 | 0.88 | 0.53 |
| | F | 3.67 | 2.25 | ⋯ | 0.97 | 1.52 | 0.92 | 0.62 |
| S66x8 | E | 2.06 | 2.95 | ⋯ | 0.62 | 1.17 | 0.69 | 0.39 |
| | F | 1.60 | 0.93 | ⋯ | 0.33 | 0.65 | 0.33 | 0.22 |
| Tripeptides | E | 2.92 | 3.06 | ⋯ | 0.92 | 2.10 | 1.18 | 0.79 |
| | F | 2.49 | 1.48 | ⋯ | 0.62 | 1.09 | 0.66 | 0.44 |
| COMP6 total | E | 1.93 | 2.03 | 1.45* | ⋯ | 1.47 | 0.76 | 0.48 |
| | F | 3.09 | 1.85 | 1.79* | ⋯ | 1.31 | 0.77 | 0.52 |

In the following, we investigate the Bucky-ball catcher system in more detail because intermolecular interactions play a crucial role in the dynamics. Following the experiment in Ref. 36, we ran 200 ps of NVT molecular dynamics simulation with each of the three MACE models introduced in Sec. III A. As shown on the left panel of Fig. 1 for the 2 × 3 Å MACE model, the system dissociated into the Bucky-ball and the catcher within 10 ps. The model, which on the whole has excellent accuracy (just 0.5 meV/atom for energies and 13 meV/Å for forces), is unable to fit the attractive dispersion between the two sub-parts of the system because they are typically further than 3 Å apart. On the other hand, the two slightly longer ranged MACE models provide qualitatively correct dynamics. We have also computed the molecular vibrational spectrum obtained as the Fourier transform of the velocity-velocity auto-correlation function to analyze the dynamics quantitatively. On the right panel of Fig. 1, we compare the result of the one-layer 6 Å MACE model to the spectrum published in Ref. 36 computed using sGDML and find excellent agreement, including for low frequencies, demonstrating that local models can simulate the dynamics of systems even when intermolecular interactions are important.
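The vibrational spectra discussed here and in later sections can be obtained from saved MD velocities with a few lines of NumPy. The sketch below computes the power spectrum of the velocity auto-correlation function; the window choice and normalization are illustrative, not necessarily those used for the figures.

```python
import numpy as np

def vibrational_spectrum(velocities, dt_fs):
    """velocities: (n_frames, n_atoms, 3) array; dt_fs: frame spacing in fs."""
    v = velocities - velocities.mean(axis=0)
    n = v.shape[0]
    # FFT-based (Wiener-Khinchin) velocity auto-correlation, summed over atoms/xyz
    fv = np.fft.rfft(v, n=2 * n, axis=0)
    vacf = np.fft.irfft(fv * np.conj(fv), axis=0)[:n].sum(axis=(1, 2))
    vacf /= vacf[0]
    spectrum = np.abs(np.fft.rfft(vacf * np.hanning(n)))
    freq_cm = np.fft.rfftfreq(n, d=dt_fs * 1e-15) / 2.99792458e10  # Hz -> cm^-1
    return freq_cm, spectrum
```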

FIG. 1.

Bucky-ball catcher MD. The left panel illustrates that the Bucky-ball catcher dissociates for the short cutoff MACE model but stays together with the longer cutoff. The right panel compares the molecular vibrational spectrum computed using local MACE and the global sGDML model.


In this section, we demonstrate that the MACE architecture can be used to train a highly accurate transferable general organic force field for the H, C, N, and O chemical elements. To train the model, we used a subset of the ANI-1x dataset, which contains coupled cluster calculations.37 We trained a series of MACE models, going from a small (64-0) MACE model to a medium (96-1) and a large (192-2) model. First, we trained the models using density functional theory (DFT) energies and forces and tested them on the COMP6 benchmark suite.38 The results are summarized in Table II. We can see that even the smallest MACE model outperforms most previously published models.38–41 The large MACE model improves on the previous state of the art by about a factor of 5, achieving an overall error well below 0.5 kcal/mol. The ANI-MD subset is one where the MACE total energy errors are relatively high compared with the other subsets. This subset is comprised of configurations of 14 different molecules sampled from ANI-MD trajectories. The MACE errors are relatively low on 12 of the 14 molecules. However, on the two largest ones (Chignolin, 149 atoms, and TrpCage, 312 atoms), the MACE energy is shifted by a constant value in comparison to the DFT reference energy. The energy–energy correlation plot for all molecules in this subset is shown in  Appendix F, Fig. 9. The task is particularly challenging because the training set contains very few molecules with more than 50 atoms; hence, when testing on larger systems, there can be an accumulation of small errors that are not controlled in the training. A simple test of this hypothesis could be the addition of a small number of larger molecules to the training set.

TABLE III.

Energy and Force Mean Absolute Errors on the HME21 dataset.54 The boldface indicates the model with the lowest error.

TeaNet56 NequIP16 MACE 256-2
Energy (meV/atom) 19.6 47.8 16.5 (±1.2) 
Forces (meV/Å) 174 199 140.2 (±3.8) 

We also performed transfer learning to the coupled cluster level of theory using energies only, following Ref. 39. The resulting six pre-trained MACE models are published and available to download from GitHub and can be evaluated using the Atomic Simulation Environment (ASE), OpenMM, and LAMMPS.
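As a usage illustration, a pre-trained model can be attached to an ASE `Atoms` object roughly as follows; the calculator keyword names and the model file name are assumptions, and the exact invocation is documented in the MACE repository.

```python
from ase.build import molecule
from mace.calculators import MACECalculator

atoms = molecule("CH3CH2OH")
# "mace_ani_cc.model" is a placeholder file name for one of the published models
atoms.calc = MACECalculator(model_paths="mace_ani_cc.model", device="cpu")
print(atoms.get_potential_energy())   # eV
print(atoms.get_forces().shape)       # (n_atoms, 3), eV/Angstrom
```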

We now evaluate the coupled cluster transfer learned versions of the above-mentioned MACE organic force fields on the challenging dihedral torsion dataset introduced in Ref. 42. This dataset consists of 88 small drug-like molecules with different biaryl dihedral torsional profiles. Such a benchmark is of particular interest in connection with small molecule drug discovery. Accurately describing the torsional barriers is a typical task where classical empirical force fields cannot provide sufficiently accurate descriptions of the potential energy surface.43,44

We have evaluated the MACE dihedral torsional scans on a subset of the full dataset that contains only H, C, N, and O chemical elements (78 molecules). In the following, we evaluate the largest MACE model (192-2) transfer learned to the CC level of theory and compare it to ANI-1ccx. We find that the MACE model achieves a mean absolute barrier height error compared to coupled cluster singles doubles and perturbative triples [CCSD(T)] of 0.36 kcal/mol, improving significantly on the ANI-1ccx model, which was identified as the best to date and has an error of 0.78 kcal/mol. We have also compared the cumulative error distribution of MACE and ANI, shown in the bottom panel of Fig. 2. It shows that MACE has very few molecules for which it makes an error close to or larger than 1 kcal/mol. Notably, even the medium and small MACE models perform comparably to or better than the ANI model, with average barrier height errors of 0.56 and 0.83 kcal/mol, respectively.

FIG. 2.

Dihedral scans. The top three panels illustrate a selection of challenging dihedral torsional scans computed using MACE 192-2 from Ref. 42 where ANI fails. The bottom panel compares MACE 192-2 and ANI-1ccx torsional barrier height errors.


All 78 torsional profiles computed with all three CC transfer learned MACE models can be found as part of the supplementary material. In Fig. 2, we show three examples that were identified as being particularly challenging in the original paper and compare MACE 192-2 to the ANI-1ccx model. The MACE model produces a smooth surface with the correct position of the minima and maxima in all three cases, similarly to the rest of the test molecules that can be found in the supplementary material.

A final test of MACE on small molecular systems is to evaluate its performance in the low-data regime. Data efficiency is crucial in atomistic machine learning as it can save time and computational costs or enable the models to be trained with more accurate reference data. Recent work has found that the majority of machine learning potentials are not able to run stable molecular dynamics simulations without iterative fitting, even when they are trained on thousands of configurations of a system.45 Furthermore, the authors found that very low energy and force RMSE did not correlate with the ability to perform stable simulations.

To demonstrate the excellent data efficiency and extrapolation capabilities of MACE, we use the 256-2 MACE models published in Ref. 18 trained using rMD1746 and compare the models trained on 50 and 1000 small molecule geometries sampled randomly from the dataset. It was previously shown that 50 configurations are sufficient to train a model with a sub-kcal/mol total energy error. For three selected systems, ethanol, paracetamol, and salicylic acid, we ran 50 ps long NVT molecular dynamics simulations that were stable for both training set sizes, obviating the need for any further active learning or iterative fitting. To validate the dynamics quantitatively, we compared the molecular vibrational spectrum. The top panel of Fig. 3 shows the spectra for ethanol. The two MACE models are in excellent agreement and are able to reproduce the peaks from the experimental gas-phase IR spectrum of ethanol.47 In  Appendix G, we also show that the same result holds for paracetamol and salicylic acid.
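A short NVT run of the kind used for these spectra can be set up in ASE as sketched below; the thermostat choice, time step, friction value, and model file name are illustrative assumptions rather than the exact settings used in the paper.

```python
from ase import units
from ase.build import molecule
from ase.md.langevin import Langevin
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from mace.calculators import MACECalculator

atoms = molecule("CH3CH2OH")
# placeholder file name for a model fitted to the 50 rMD17 ethanol configurations
atoms.calc = MACECalculator(model_paths="mace_ethanol_50.model", device="cpu")

MaxwellBoltzmannDistribution(atoms, temperature_K=300)
dyn = Langevin(atoms, timestep=0.5 * units.fs, temperature_K=300,
               friction=0.01 / units.fs)
dyn.attach(lambda: print(atoms.get_temperature()), interval=1000)
dyn.run(100_000)   # 50 ps with a 0.5 fs time step
```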

FIG. 3.

Vibrational spectrum of ethanol. The figure illustrates the molecular vibrational spectrum computed from the velocity-autocorrelation function. The top panel compares the MACE model fitted to 50 (red) and 1000 (blue) random revMD17 ethanol geometries. The bottom panel compares the spectrum of the small MACE-COMP6-CC model from Sec. IV (brown) and a MACE model fitted to the same 50 random revMD17 geometries (green) recomputed using the CCSD(T) level of theory.


With model architectures able to learn an accurate representation of the potential energy surface from so few configurations, we can create better models by training them on reference data obtained using much more expensive correlated wave-function based quantum mechanical calculations. To demonstrate this, we recomputed the 50 training configurations from rMD17 using the CCSD(T) level of theory, including the forces.48 We fitted a MACE model to these data directly, without transfer learning. We also compared the model to the smallest (64-0) general organic MACE model that was transfer learned to coupled-cluster data from Sec. IV. We find that the custom trained and general MACE models predict very similar spectra, as shown in the bottom panel of Fig. 3. Interestingly, the peaks corresponding to stretching modes involving hydrogen atoms are blue-shifted compared to the experimental positions. This is a result of neglecting nuclear quantum effects, and the experimental spectrum can be recovered by including these effects via techniques treating the nuclei as quantum particles.49 Note how these shifts are smaller for the original rMD17-derived models, an example of cancellation of errors due to missing correlation in DFT and quantum nuclear effects.

To demonstrate the capabilities of MACE on complex solid state systems, we train a model on the carbon dataset published in Ref. 50. This dataset comprises a wide variety of carbon phases, from crystalline diamond and graphite structures to various amorphous phases and small clusters. The complexity comes from the unique bonding behavior of carbon atoms, which can accommodate 1–4 neighbors depending on their oxidation and hybridization states. While the crystalline phases typically only contain one type of hybridization of carbon, sp2 in graphite and sp3 in diamond, in amorphous carbon, these occur simultaneously and can even vary dynamically. Amorphous carbon can be regarded as one of the most challenging systems for machine learning force fields due to its highly complex short- and medium-range electronic structure, especially in the low density, sp2 dominated phases.51,52

Figure 4 compares the test set errors of various subsets of the carbon dataset from Ref. 50. It shows that MACE 256-2 significantly improves the state-of-the-art ACE potential across all phases.

FIG. 4.

Carbon force errors. The figure compares the force errors of the PACE and MACE models for a test dataset consisting of a diverse set of 1640 carbon structures.50 The PACE errors correspond to the model with equal loss weights for all configurations similarly to the MACE model.


The simultaneous description of a large number of different chemical elements is a long standing challenge for machine learning force fields.19,25,53 The HME21 dataset54,55 is made up of a diverse set of disordered and regular crystals from 37 different chemical elements, covering a broad spectrum of chemical space. We train a 256-2 MACE model on the dataset and compare our results with other previously published models in Table III.

TABLE IV.

Energy and Force Mean Absolute Errors on the M3GNet dataset. For each property, the model with the lowest error is highlighted in boldface.

M3GNet  MACE 128-2
Energy (meV/atom) 34.7 34.1 
Forces (meV/Å) 71.7 60.1 
Stress (GPa) 0.41 0.62 

The low errors demonstrate that MACE can effectively handle a wide range of chemical elements and achieves excellent accuracy for both forces and energies. It outperforms the NequIP16 and TeaNet models56 by more than 30% in both metrics.

Another challenging condensed phase materials science dataset is the training set of the M3GNet model.57 This dataset contains an extensive range of materials extracted from the materials project.58 The aim of the dataset is to provide a training set for the universal condensed matter interatomic potential of all 89 elements ranging from hydrogen to thorium.

This dataset is distinct from the previously studied ones due to its greater range of interaction energies and forces. To be able to train the MACE model on such a diverse dataset, we used the Huber loss,59 as proposed in the M3GNet paper.57 This loss function switches between an L1 and L2 loss, making the learning more tolerant toward outliers. We also found it beneficial to slightly generalize the radial basis by making it dependent on the atoms' elements and local environment (see Appendix C). We observe that an out-of-the-box MACE model with the modified loss and radial function roughly matches the accuracy of M3GNet, improving on forces. There are several possible modifications to MACE that could further improve its performance on training sets with a large chemical diversity. These include making the cutoffs element-dependent, which will be investigated as part of future work (Table IV).
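For reference, the Huber loss mentioned here is available directly in PyTorch; a minimal sketch with a placeholder transition point `delta` (not the value used in the paper) and dummy force tensors is shown below.

```python
import torch

huber = torch.nn.HuberLoss(delta=0.01)  # delta: where the loss switches from L2 to L1 behavior
F_pred = torch.randn(32, 3)             # dummy force predictions
F_ref = torch.randn(32, 3)              # dummy reference forces
force_loss = huber(F_pred, F_ref)       # robust toward outlier force components
```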

To assess the ability of MACE to describe complex molecular liquids, we fitted a dataset of 1593 liquid water configurations made up of 64 molecules each.60 The dataset was computed using the CP2K software61 at the revPBE0-D3 level of density functional theory, which is known to give a reasonably good description of the structure and dynamics of water at a variety of pressures and temperatures.62 

Table V compares the energy and force errors of the MACE model with other machine learning force fields trained on the same dataset but using different train/test splits. The three-body atom-centered symmetry function-based feed-forward neural network model (BPNN) has the highest errors. The three-body invariant message passing model REANN63 and the two-body equivariant message passing model NequIP16 significantly improve on the errors of the BPNN model. A further improvement is achieved by the many-body equivariant MACE model. Interestingly, a relatively small MACE model using invariant messages with an overall body-order of 13 already achieves lower errors than the other best models, and the larger MACE model only slightly improves on the force errors.

TABLE V.

Energy and Force errors on liquid water dataset (Ref. 60). The model with the lowest error is highlighted in boldface.

BP-NN60  REANN63  NequIP16 (L = 2)  MACE 64-0  MACE 192-2
E RMSE (meV/H2O) 7.0 2.4 ⋯ 1.9 1.9 
F RMSE (meV/Å) 120 53.2 ⋯ 37.1 36.2 
E MAE (meV/H2O) ⋯ ⋯ 2.5 1.2 1.2 
F MAE (meV/Å) ⋯ ⋯ 21 20.7 18.5 

To characterize the smaller and faster MACE water model, we ran 200 ps NPT simulations at a range of temperatures from 250 to 315 K using a combined Nose-Hoover and Parrinello-Rahman barostat.65,66 As shown in Fig. 5, the average density was computed after an initial equilibration period at each temperature. The top left panel shows the water density isobar, showing the characteristic density maximum at around 270 K. This is somewhat lower than the experimental value but is consistent with previous DFT based studies of water.60 At 300 K, we also compare the density of the MACE model with the available ab initio value from Ref. 64 and find a very good agreement.

FIG. 5.

Thermodynamic properties of liquid water. The top left panel shows the liquid water density isobar at p = 1.0 bar pressure and compares the isobar from Cheng et al.60 and the ab initio density64 to MACE 64-0. The bottom panel shows an example of the density distribution in the NPT simulations. The right panel shows the mean squared displacement of the water molecules in three independent NVT simulations at the equilibrium density (0.91 g/cm3) of MACE. The corresponding DFT value was obtained from a simulation at the DFT equilibrium density of 0.92 g/cm3.


Finally, we also investigated a dynamic property, the diffusivity of water, which is notoriously difficult to compute accurately. To obtain the diffusivity, we took the water configuration from the NPT simulation, which had an equilibrium density of 0.91 g/cm3, and ran three independent 200 ps long NVT simulations. The mean squared displacements from these simulations are shown on the right panel of Fig. 5. By fitting a linear function on the diffusive part of the MSD, we obtained a value of the diffusion coefficient of 2.14 ± 0.14 × 10−9 m2/s, which is in reasonably good agreement with the ab initio value estimated from much smaller simulations in Ref. 62.
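The diffusion coefficient is extracted from the MSD through the Einstein relation, D = slope/6. A sketch of the fit is given below; the fit window is an illustrative choice.

```python
import numpy as np

def diffusion_coefficient(t_ps, msd_A2, fit_start_ps=20.0):
    """Einstein relation: D = slope/6. Returns D in m^2/s.

    t_ps: times in ps; msd_A2: mean squared displacement in Angstrom^2."""
    mask = t_ps >= fit_start_ps                           # use only the diffusive regime
    slope = np.polyfit(t_ps[mask], msd_A2[mask], 1)[0]    # Angstrom^2 / ps
    return slope / 6.0 * 1e-20 / 1e-12                    # convert to m^2 / s
```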

We now test the MACE architecture on tasks that are not directly related to force field fitting and use the well established QM9 dataset.67 This benchmark is one of the oldest used to validate machine learning models for chemistry in general, and most of the published machine learning architectures have reported their results on it. Table VI collects the results of these models on 12 different tasks. Notably, MACE achieves state-of-the-art results on 3 of the four energy related tasks (G, H, U, and U0) and also improves the state-of-the-art on one further non-energy related task.

TABLE VI.

Performance of MACE on QM9. The mean absolute error (MAE) of various models on the QM9 dataset demonstrates that MACE improves on the state of the art in several tasks. The bold model has the lowest error, while the italic indicates models with errors within 10% of the best model for each task.

Gap (meV)  HOMO (meV)  LUMO (meV)  CV (cal/mol K)  μ (D)  ZPVE (meV)  R² (a₀²)  α (a₀³)  G (meV)  H (meV)  U (meV)  U0 (meV)
NMP68  69 43 38 0.040 0.030 1.50 0.180 0.092 19 17 20 20 
SchNet13  63 41 34 0.033 0.033 1.70 0.073 0.235 14 14 19 14 
Cormorant21  61 34 38 0.026 0.038 2.03 0.961 0.085 20 21 21 22 
LieConv69  49 30 25 0.038 0.032 2.28 0.800 0.084 22 24 19 19 
DimeNet++14  33 25 20 0.023 0.030 1.21 0.331 0.044 7.6 6.5 6.3 6.3 
EGNN70  48 29 25 0.031 0.029 1.55 0.106 0.071 12 12 12 11 
PaiNN15  46 28 20 0.024 0.012 1.28 0.066 0.045 7.4 6.0 5.8 5.9 
TorchMD-NET17  36 20 18 0.026 0.011 1.84 0.033 0.059 7.6 6.2 6.4 6.2 
SphereNet71  32 23 18 0.022 0.026 1.12 0.292 0.046 7.8 6.3 6.4 6.3 
SEGNN72  42 24 21 0.031 0.023 1.62 0.660 0.060 15 16 13 15 
EQGAT73  32 20 16 0.024 0.011 2.00 0.382 0.053 23 24 25 25 
Equiformer74  30 15 14 0.023 0.011 1.26 0.251 0.046 7.6 6.6 6.7 6.6 
MGCN75  64 42 57 0.038 0.056 1.12 0.110 0.030 15 16 14 13 
Allegro30  ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ 5.7 4.4 4.4 4.7 
NoisyNodes76  29 20 19 0.025 0.025 1.16 0.700 0.052 8.3 7.4 7.6 7.3 
GNS-TAT+NN77  26 17 17 0.022 0.021 1.08 0.65 0.047 7.4 6.4 6.4 6.4 
Wigner Kernels78  ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ 4.3 
TensorNet41  ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ ⋯ 6.0 4.3 4.3 4.3 
MACE 42 22 19 0.021 0.015 1.23 0.210 0.038 5.5 4.7 4.1 4.1 

To achieve the best result on intensive properties, such as CV, we modified the readout function of the MACE model. This was necessary because the simple sum pooling operation of the node outputs that are used in the MACE force fields results in a size-extensive overall prediction. The precise form of the intensive global non-linear readout function used is described in the supplementary material.

Finally, it is interesting to examine the potential energy (U0) models in more detail. In the left panel of Fig. 6, the learning curves of four different MACE models are compared. The blue models use hyperparameters of MACE that are similar to the ones used throughout the paper (256-2). In comparison, the models denoted in red are significantly larger versions of MACE, which use a much larger MLP in the learnable radial basis [see Eq. (4)], which has been shown to help on QM9 for other models such as Allegro.30 This extra flexibility results in higher accuracy, especially in the very high data regime. A further interesting point is the comparison of the models trained using energies only (the usual practice for QM9, denoted with the solid line) with the models whose loss function also included forces, exploiting the knowledge that QM9 is made up of equilibrium geometries with zero forces on all the atoms. This extra information (without the need for new QM calculations) increases the accuracy of both the small and large MACE models.

FIG. 6.

QM9 learning curves. The left panel compares the standard large MACE model (256-2) to a modified version for QM9, where the size of the MLP in the radial basis was increased. The figure also shows that including the 0 force information (labeled -wF) in the training leads to lower errors. The right panel compares the best MACE with a number of other machine learning models.


In the right panel of Fig. 6, we compare the learning curve of the best MACE model to the learning curve of several other very good models. It shows that kernel models such as FCHL,79 SOAP,8 the linear NICE model,11 the feed forward neural network GMsNN,80 and the message passing SchNet13 and PhysNet81 models all achieve comparable errors. The models that can surpass this significantly are Allegro,30 Wigner kernels,78 and MACE, all of which use higher body-order features. This strongly hints that high body-order is a crucial property of the most successful atomistic ML models.

In this paper, we have demonstrated the power of the MACE architecture for a wide variety of chemical tasks and systems. We have investigated the effect of the locality assumption on large molecular systems and demonstrated that a two layer MACE model with a 5 Å cutoff on each layer is able to improve on the state-of-the-art global model in accuracy by up to a factor of 10. We also showed that the molecular vibrational spectrum can be accurately reproduced using the local MACE model. We have also demonstrated the transferability of the MACE model by fitting it to just 10% of the entire ANI-1x dataset. MACE improved on the state of the art on the COMP6 comprehensive organic molecule benchmark by a factor of 5 and 7 for energies and forces, respectively, compared to the ANI model and a factor of 3–4 compared to the equivariant message passing NewtonNet model. We showed that the models can be transfer learned to the coupled-cluster level of theory and used a CC-transfer learned MACE model to compute the vibrational spectrum of ethanol. Furthermore, tests demonstrated excellent data efficiency and extrapolation by computing the coupled cluster level vibrational spectrum of ethanol using a MACE model trained on just 50 reference calculations without the need for iterative training. We have also shown excellent accuracy in condensed phase systems such as diverse phases of carbon and liquid water. Our results on the M3GNet dataset show that MACE reaches excellent accuracy and transferability on large and chemically diverse datasets, paving the way for new state-of-the-art universal force fields for materials. Finally, we also demonstrated that MACE improves on the state-of-the-art for many properties of the standard QM9 quantum chemistry benchmark.

Overall, these results set a new standard for machine learning force fields in chemistry and demonstrate that the MACE architecture is capable of describing a wide variety of systems with little to no modification to the model training and hyperparameters. This could ultimately lead to force fields with unprecedented accuracy created at ease without significant user input or requirements for expertise.

The supplementary material contains all torsion drives for the biaryl set molecules.

D.P.K. acknowledges support from AstraZeneca and the EPSRC. We used the computational resources of the UK HPC service ARCHER2 via the UKCP consortium and were funded by EPSRC under Grant No. EP/P022596/1. This work was also performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3). We would like to acknowledge Michele Ceriotti for providing the raw data for the QM9 learning curves. We would also like to acknowledge Joshua Horton for helping with the implementation of MACE into the QCEngine driver for the torsional scans. We would like to acknowledge Mario Geiger for the implementation of MACE in JAX. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any author accepted manuscript version arising from this submission.

The authors have no conflicts to disclose.

Dávid Péter Kovács: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). Ilyes Batatia: Investigation (equal); Methodology (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). Eszter Sára Arany: Investigation (equal); Visualization (equal); Writing – review & editing (equal). Gábor Csányi: Conceptualization (equal); Funding acquisition (equal); Writing – review & editing (equal).

The data that support the findings of this study are available within the article and its supplementary material. The MACE code is freely available on https://github.com/ACEsuit/mace and example input scripts and pre-trained models to reproduce the results are provided in the MACE code documentation website: https://mace-docs.readthedocs.io/en/latest/examples/training_examples.html. The datasets used are all published elsewhere and referenced in the text. The JAX version of MACE is freely available at https://github.com/ACEsuit/mace-jax.

Figure 7 demonstrates how the changing loss weights result in a dramatic improvement of the energy errors for the medium 96-1 MACE model from Sec. IV without significantly affecting the force accuracy.

FIG. 7. Two phases of learning. The energy and force validation errors of the 64-1 MACE model trained on a subset of the Sec. IV training data. The second phase of the loss schedule decreases the energy errors, and the two-phase learning results in decreased errors compared to using the second phase of the loss schedule alone.

Most of the models in the paper have been trained using an alternative loss function,
$$\mathcal{L} = \frac{\lambda_E}{B}\sum_{b=1}^{B}\left(\frac{E_b-\hat{E}_b}{N_b}\right)^{2} + \frac{\lambda_F}{3B}\sum_{b=1}^{B}\frac{1}{N_b}\sum_{i_b,\alpha=1}^{N_b,\,3}\left(-\frac{\partial E_b}{\partial r_{i_b,\alpha}}-\hat{F}_{i_b,\alpha}\right)^{2}. \tag{B1}$$
This loss is less intuitive: the energy error is per atom, but the force error is effectively per atom squared because of the additional averaging. To counterbalance this effect, much larger force weights, of up to 1000:1 relative to the energy weights, are required, whereas with the loss function in the main text the typical weights have a ratio of 10:1. All models in this paper, except where otherwise stated, were trained using the loss function in Eq. (B1).
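For concreteness, the following is a minimal PyTorch sketch of how the loss of Eq. (B1) could be evaluated for a batch; the function and tensor names are assumptions made for illustration and do not correspond to the actual MACE implementation.

```python
import torch

def loss_b1(E_pred, E_ref, F_pred, F_ref, n_atoms, lambda_E=10.0, lambda_F=1000.0):
    """Sketch of the weighted energy/force loss of Eq. (B1).

    E_pred, E_ref : (B,) total energies per configuration
    F_pred, F_ref : (sum_b N_b, 3) predicted (i.e. -dE/dr) and reference forces
    n_atoms       : (B,) long tensor with the number of atoms N_b per configuration
    """
    B = E_ref.shape[0]

    # Energy term: squared per-atom energy error, averaged over the batch.
    energy_term = ((E_pred - E_ref) / n_atoms).pow(2).sum() / B

    # Force term: per-configuration mean of the squared force-component errors,
    # averaged over the batch and the three Cartesian components.
    batch_index = torch.repeat_interleave(torch.arange(B), n_atoms)  # atom -> config
    sq_err = (F_pred - F_ref).pow(2).sum(dim=-1)                     # (sum_b N_b,)
    per_config = torch.zeros(B).index_add_(0, batch_index, sq_err) / n_atoms
    force_term = per_config.sum() / (3 * B)

    return lambda_E * energy_term + lambda_F * force_term
```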
The radial basis of Eq. (5) is agnostic to the chemical elements of both the sender and receiver atoms. While we observe that this choice performs very well for datasets with few elements (up to 10), for datasets that include many elements it can be beneficial to expand this radial basis as a function of the elements of the atoms. Therefore, we pass to the radial MLP both the radial features and the scalar part of the previous node features of the sender and receiver atoms,
$$R^{(s)}_{k\eta_1 l_1 l_2 l_3}\!\left(r_{ij},\{h_{i,k00}\}_k,\{h_{j,k00}\}_k\right) = \mathrm{MLP}\!\left(\{j_0^{\,n}(r_{ij})\}_n,\{h_{i,k00}\}_k,\{h_{j,k00}\}_k\right), \tag{C1}$$
$$\phi^{(s)}_{ij,k\eta_1 l_3 m_3} = \sum_{l_1 l_2 m_1 m_2} C^{l_3 m_3}_{\eta_1,l_1 m_1 l_2 m_2}\, R^{(s)}_{k\eta_1 l_1 l_2 l_3}\!\left(r_{ij},\{h_{i,k00}\}_k,\{h_{j,k00}\}_k\right) Y_{l_1}^{m_1}(\hat{r}_{ij})\, \bar{h}^{(s)}_{j,k l_2 m_2}, \tag{C2}$$
where $h_{i,k00}$ and $h_{j,k00}$ correspond to the scalar part of the node features of atoms i and j, respectively.
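A schematic sketch of such an element-dependent radial function is given below; the class name, feature dimensions, and tensor shapes are hypothetical, chosen only to illustrate how the Bessel features of r_ij could be concatenated with the scalar node features of the sender and receiver atoms before the radial MLP.

```python
import torch
import torch.nn as nn

class ElementDependentRadial(nn.Module):
    """Sketch of Eq. (C1): the radial MLP also receives the scalar node
    features of the sender and receiver atoms (hypothetical shapes)."""

    def __init__(self, n_bessel=8, n_scalar=64, n_out=96, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_bessel + 2 * n_scalar, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, n_out),
        )

    def forward(self, bessel_ij, h_scalar_i, h_scalar_j):
        # bessel_ij : (n_edges, n_bessel) radial Bessel features of r_ij
        # h_scalar_i, h_scalar_j : (n_edges, n_scalar) scalar node features of
        # the receiver i and sender j, gathered onto the edges
        return self.mlp(torch.cat([bessel_ij, h_scalar_i, h_scalar_j], dim=-1))
```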

All models were trained using NVIDIA A100 GPUs. The training time varied between the datasets; for small molecules, it was between 1 and 5 h; for the larger datasets, it depended on the model size; and in the case of the ANI-1x and QM9 datasets, the largest models took 3–4 days. We stopped the training once the validation loss stopped improving.

All experiments were carried out using the MACE code available at https://github.com/ACEsuit/mace. Example input files for the experiments can be found on the Examples page of the MACE documentation at: https://mace-docs.readthedocs.io.

1. Settings common to all MACE force field models

All MACE force field models fitted in Secs. III–IX shared most of their core training and model settings. They all used two MACE layers (except where explicitly stated, such as the single-layer model used for comparison on MD22), lmax = 3 in the spherical harmonic expansion, and eight radial Bessel features with a polynomial envelope for the cutoff with p = 5. The radial features are then fed into an MLP of size [64, 64, 64] using SiLU non-linearities. All models used correlation order N = 3 in the MACE layers (four-body messages). The readout function of the first layer was a simple linear transform, whereas for the second layer it was a single-layer MLP with 16 hidden dimensions. Models were trained with the AMSGrad variant of Adam with default parameters β1 = 0.9, β2 = 0.999, and ϵ = 1 × 10−8. We used a learning rate of 0.01 and an on-plateau scheduler that reduced the learning rate when the validation loss did not improve for a certain number of epochs. We used an exponential moving average of the weights with a decay of 0.99, both for the validation set evaluations and for the final models. The loss function used is the one given in Eq. (B1).
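The optimizer and averaging settings above could be reproduced with standard PyTorch components roughly as follows; the placeholder model, the scheduler patience, and the reduction factor are illustrative assumptions rather than the exact values used.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(8, 1)  # placeholder for the MACE network being trained

# AMSGrad variant of Adam with the default beta and epsilon values.
optimizer = Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999),
                 eps=1e-8, amsgrad=True)

# On-plateau schedule: reduce the learning rate when the validation loss
# stagnates (patience and factor here are illustrative choices).
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.8, patience=50)

# Exponential moving average of the weights with decay 0.99, used both for
# validation-set evaluation and for the final model.
ema_model = AveragedModel(
    model, avg_fn=lambda avg, new, n: 0.99 * avg + 0.01 * new)

# Inside the training loop (illustrative):
#   optimizer.step(); ema_model.update_parameters(model)
#   scheduler.step(validation_loss)  # once per epoch
```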

2. MD22 models

The three MACE models fitted to the MD22 dataset all used 256 uncoupled feature channels. The two-layer models used Lmax = 2 for the messages. 95% of the training sample was used for training, and a random 5% subset was used for validation. The energy was normalized by shifting the mean of the potential energies. The initial weights in the loss function were λE = 10 and λF = 1000, followed by a switch to λE = 1000 and λF = 10, with the learning rate reduced to 0.001.

3. COMP6 models

The three models all used a cutoff radius of rcut = 5 Å. The number of uncoupled feature channels was 64, 96, and 192, with the message equivariance order Lmax increasing from 0 (invariant) to 1 and 2 for the three models, respectively. The number of parameters for the three models was 196 944, 517 872, and 2 229 456, respectively. We used loss weights λE = 40 and λF = 1000, followed by λE = 1000 and λF = 10 in the second phase of the training. For normalization of the energy, we used the atomization energy, meaning that we applied a fixed shift to each molecule using the isolated atom energies computed with DFT, which were published as part of the original dataset.
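As an illustration, the fixed per-molecule shift could be applied as in the following sketch; the isolated-atom energy values shown are placeholders, not the published DFT values.

```python
# Hypothetical isolated-atom energies (eV); the actual DFT values are
# published with the original dataset and are not reproduced here.
E_ISOLATED = {"H": -13.6, "C": -1029.0, "N": -1485.0, "O": -2042.0}

def atomization_energy(total_energy, symbols):
    """Subtract the sum of isolated-atom energies from the total energy,
    i.e. apply the fixed per-molecule shift used for energy normalization."""
    return total_energy - sum(E_ISOLATED[s] for s in symbols)

# Example: atomization_energy(E_methane, ["C", "H", "H", "H", "H"])
```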

4. revMD17 models

The models used here were identical to the ones published in Ref. 18.

5. Amorphous carbon model

The carbon MACE model was trained on the carbon dataset published in Ref. 50. We created a training, validation, and test split by selecting 10% of each configuration type for the test set and randomly selecting a further 10% of the training set for validation. We used a MACE model with two layers, 256 feature channels, lmax = 3 spherical harmonics, and Lmax = 2 messages. We used a cutoff of rcut = 6.0 Å. The standard weighted energy–force loss was used, with weights of λE = 80 and λF = 1000 for the first 225 epochs and λE = 1000 and λF = 100 for the last 75 epochs. The batch size was 5. A weight decay of 5 × 10−7 was applied to selected weights, as in Ref. 18.

6. HME21

The MACE model was trained using the published train-test split. The training data were reshuffled after each epoch. We used 256 uncoupled feature channels, lmax = 3, and invariant messages, Lmax = 2. We used a cutoff of 6 Å. The standard weighted energy and force loss was used, with a weight of 1 on energies and a weight of 10 on forces.

7. M3GNet

The MACE model was trained using the published train-test split. The training data were reshuffled after each epoch. We used 128 uncoupled feature channels, lmax = 3, and equivariant messages with Lmax = 2. We used a cutoff of 5 Å. We used the Huber loss with δ = 0.01 and loss weights of 1 on forces, 1 on energies, and 0.1 on stresses.
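A minimal sketch of such a weighted Huber loss is given below, assuming predicted and reference energies, forces, and stresses are available as tensors; any per-atom normalization of the energies is omitted for brevity, and the function name is illustrative.

```python
import torch

huber = torch.nn.HuberLoss(delta=0.01)

def weighted_huber_loss(E_pred, E_ref, F_pred, F_ref, S_pred, S_ref,
                        w_E=1.0, w_F=1.0, w_S=0.1):
    """Weighted Huber loss over energies, forces, and stresses (a sketch)."""
    return (w_E * huber(E_pred, E_ref)
            + w_F * huber(F_pred, F_ref)
            + w_S * huber(S_pred, S_ref))
```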

8. Liquid water model

The two models both used a cutoff of rcut = 6 Å. The small model used 64 uncoupled channels and invariant messages (Lmax = 0), whereas the larger model used 192 channels and Lmax = 2. The weights used for the training were λE = 10, λF = 1000, followed by λE = 1000, λF = 10 in the second phase of the training.

9. QM9 models

For the U0 property, we ran experiments with a standard MACE model identical to the one used for the MD22 tests, i.e., 256 uncoupled channels and Lmax = 2. We found that a slightly modified version of MACE with a much larger radial MLP achieves lower errors than the standard one. This model used a radial MLP of size [128, 256, 512, 1024]. We also reduced the learning rate to 0.001, increased the width of the readout MLP from 16 to 128, and increased the exponential moving average decay to 0.999. Furthermore, for the intensive properties (gap, homo, lumo, μ, CV, and zpve), we changed the readout function from sum pooling, which results in a size-extensive output as desired for energy-like quantities, to a non-linear pooling operation that first applies sum, mean, and standard deviation pooling, followed by an attention mechanism to form the final output. We found that such a readout function led to up to a twofold improvement compared to the simple linear extensive pooling operation.
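The sketch below illustrates one possible form of such a non-extensive readout, with the attention step reduced to a learned softmax weighting over the three pooled vectors; it is an illustration under these assumptions, not the exact readout used in this work.

```python
import torch
import torch.nn as nn

class IntensiveReadout(nn.Module):
    """Sketch of a non-extensive readout: sum, mean, and standard-deviation
    pooling over atoms, combined by a learned softmax (attention-like)
    weighting. Illustrative only; the actual MACE readout may differ."""

    def __init__(self, n_feat=256):
        super().__init__()
        self.score = nn.Linear(n_feat, 1)   # one score per pooled vector
        self.out = nn.Linear(n_feat, 1)     # final scalar prediction

    def forward(self, node_feats, batch_index, n_mols):
        # node_feats : (n_atoms, n_feat); batch_index : (n_atoms,) molecule id
        n_atoms = torch.zeros(n_mols).index_add_(
            0, batch_index, torch.ones(len(batch_index))).clamp(min=1)
        summed = torch.zeros(n_mols, node_feats.shape[1]).index_add_(
            0, batch_index, node_feats)
        mean = summed / n_atoms[:, None]
        sq = torch.zeros(n_mols, node_feats.shape[1]).index_add_(
            0, batch_index, node_feats ** 2)
        std = (sq / n_atoms[:, None] - mean ** 2).clamp(min=0).sqrt()

        pooled = torch.stack([summed, mean, std], dim=1)     # (n_mols, 3, n_feat)
        weights = torch.softmax(self.score(pooled), dim=1)   # (n_mols, 3, 1)
        return self.out((weights * pooled).sum(dim=1))       # (n_mols, 1)
```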

1. Tetrapeptide

In Fig. 8, we show that the vibrational spectra of the short-range and longer-range MACE models agree remarkably well for the tetrapeptide system of MD22. This demonstrates that the locality assumption can be valid and that short-range models can be used to compute accurate vibrational spectra even of larger systems.

FIG. 8. Vibrational spectrum of the tetrapeptide from the MD22 dataset.

Figure 9 shows that 12 of the 14 molecules have very low errors scattered randomly around the DFT value. The two larger peptides have a systematic shift compared to the ground truth value.

FIG. 9. ANI-MD subset of COMP6 showing the MACE vs DFT energy correlation plot.

In Figs. 10 and 11, we show two further examples of accurate MACE molecular vibrational spectra computed using a model fitted to just 50 QM calculations. The salicylic acid and paracetamol molecules represent more challenging test cases than the ethanol example in the main text.

FIG. 10. Vibrational spectrum of salicylic acid.

FIG. 11. Vibrational spectrum of paracetamol.

1. V. L. Deringer, N. Bernstein, G. Csányi, C. Ben Mahmoud, M. Ceriotti, M. Wilson, D. A. Drabold, and S. R. Elliott, "Origins of structural and electronic transitions in disordered silicon," Nature 589, 59 (2021).
2. B. Cheng, G. Mazzola, C. J. Pickard, and M. Ceriotti, "Evidence for supercritical behaviour of high-pressure liquid hydrogen," Nature 585, 217 (2020).
3. V. Kapil, C. Schran, A. Zen, J. Chen, C. J. Pickard, and A. Michaelides, "The first-principles phase diagram of monolayer nanoconfined water," Nature 609, 512 (2022).
4. T. A. Young, T. Johnston-Wood, H. Zhang, and F. Duarte, "Reaction dynamics of Diels–Alder reactions from machine learned potentials," Phys. Chem. Chem. Phys. 24, 20820 (2022).
5. I. Batatia, S. Batzner, D. P. Kovács, A. Musaelian, G. N. Simm, R. Drautz, C. Ortner, B. Kozinsky, and G. Csányi, "The design space of E(3)-equivariant atom-centered interatomic potentials," arXiv:2205.06643 (2022).
6. J. Nigam, S. Pozdnyakov, G. Fraux, and M. Ceriotti, "Unified theory of atom-centered representations and message-passing machine-learning schemes," J. Chem. Phys. 156, 204115 (2022).
7. J. Behler and M. Parrinello, "Generalized neural-network representation of high-dimensional potential-energy surfaces," Phys. Rev. Lett. 98, 146401 (2007).
8. A. P. Bartók, R. Kondor, and G. Csányi, "On representing chemical environments," Phys. Rev. B 87, 184115 (2013).
9. A. V. Shapeev, "Moment tensor potentials: A class of systematically improvable interatomic potentials," Multiscale Model. Simul. 14, 1153 (2016).
10. R. Drautz, "Atomic cluster expansion for accurate and transferable interatomic potentials," Phys. Rev. B 99, 014104 (2019).
11. J. Nigam, S. Pozdnyakov, and M. Ceriotti, "Recursive evaluation and iterative contraction of N-body equivariant features," J. Chem. Phys. 153, 121101 (2020).
12. D. P. Kovács, C. van der Oord, J. Kucera, A. E. Allen, D. J. Cole, C. Ortner, and G. Csányi, "Linear atomic cluster expansion force fields for organic molecules: Beyond RMSE," J. Chem. Theory Comput. 17, 7696 (2021).
13. K. Schütt, P.-J. Kindermans, H. E. Sauceda Felix, S. Chmiela, A. Tkatchenko, and K.-R. Müller, "SchNet: A continuous-filter convolutional neural network for modeling quantum interactions," Adv. Neural Inf. Process. Syst. 30, 992–1002 (2017).
14. J. Gasteiger, S. Giri, J. T. Margraf, and S. Günnemann, "Fast and uncertainty-aware directional message passing for non-equilibrium molecules," arXiv:2011.14115 (2020).
15. K. Schütt, O. Unke, and M. Gastegger, "Equivariant message passing for the prediction of tensorial properties and molecular spectra," in International Conference on Machine Learning (PMLR, 2021), pp. 9377–9388.
16. S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky, "E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials," Nat. Commun. 13, 2453 (2022).
17. P. Thölke and G. De Fabritiis, "Equivariant transformers for neural network based molecular potentials," in International Conference on Learning Representations, 2022.
18. I. Batatia, D. P. Kovacs, G. Simm, C. Ortner, and G. Csányi, "MACE: Higher order equivariant message passing neural networks for fast and accurate force fields," Adv. Neural Inf. Process. Syst. 35, 11423 (2022).
19. M. J. Willatt, F. Musil, and M. Ceriotti, "Feature optimization for atomistic machine learning yields a data-driven construction of the periodic table of the elements," Phys. Chem. Chem. Phys. 20, 29661 (2018).
20. K. Gubaev, E. V. Podryabinkin, G. L. W. Hart, and A. V. Shapeev, "Accelerating high-throughput searches for new alloys with active learning of interatomic potentials," Comput. Mater. Sci. 156, 148 (2019).
21. B. Anderson, T.-S. Hy, and R. Kondor, "Cormorant: Covariant molecular neural networks," arXiv:1906.04015.
22. E. Wigner, Group Theory: And Its Application to the Quantum Mechanics of Atomic Spectra (Elsevier, 2012), Vol. 5.
23. M. Geiger and T. Smidt, "e3nn: Euclidean neural networks," arXiv:2207.09453 [cs.LG] (2022).
24. N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, "Tensor decomposition for signal processing and machine learning," IEEE Trans. Signal Process. 65, 3551 (2017).
25. J. P. Darby, D. P. Kovács, I. Batatia, M. A. Caro, G. L. Hart, C. Ortner, and G. Csányi, "Tensor-reduced atomic density representations," arXiv:2210.01705 (2022).
26. V. L. Deringer, A. P. Bartók, N. Bernstein, D. M. Wilkins, M. Ceriotti, and G. Csányi, "Gaussian process regression for materials and molecules," Chem. Rev. 121, 10073 (2021).
27. M. Geiger, A. Jacot, S. Spigler, F. Gabriel, L. Sagun, S. d'Ascoli, G. Biroli, C. Hongler, and M. Wyart, "Scaling description of generalization with number of parameters in deep learning," J. Stat. Mech.: Theory Exp. 2020, 023401.
28. Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio, "Fantastic generalization measures and where to find them," in International Conference on Learning Representations, 2020.
29. S. S. Schoenholz and E. D. Cubuk, "JAX, M.D.: A framework for differentiable physics," J. Stat. Mech. 2021, 124016.
30. A. Musaelian, S. Batzner, A. Johansson, L. Sun, C. J. Owen, M. Kornbluth, and B. Kozinsky, "Learning local equivariant representations for large-scale atomistic dynamics," Nat. Commun. 14, 579 (2023).
31. A. Musaelian, A. Johansson, S. Batzner, and B. Kozinsky, "Scaling the leading accuracy of deep equivariant models to biomolecular simulations of realistic size," arXiv:2304.10061 (2023).
32. G. Dusson, M. Bachmayr, G. Csányi, R. Drautz, S. Etter, C. van der Oord, and C. Ortner, "Atomic cluster expansion: Completeness, efficiency and stability," J. Comput. Phys. 454, 110946 (2022).
33. S. Chmiela, H. E. Sauceda, K.-R. Müller, and A. Tkatchenko, "Towards exact molecular dynamics simulations with machine-learned force fields," Nat. Commun. 9, 3887 (2018).
34. A. Kabylda, V. Vassilev-Galindo, S. Chmiela, I. Poltavsky, and A. Tkatchenko, "Efficient interatomic descriptors for accurate machine learning force fields of extended molecules," Nat. Commun. 14, 3562 (2023).
35. Y. Li, Y. Wang, L. Huang, H. Yang, X. Wei, J. Zhang, T. Wang, Z. Wang, B. Shao, and T.-Y. Liu, "Long-short-range message-passing: A physics-informed framework to capture non-local interaction for scalable molecular dynamics simulation," arXiv:2304.13542 (2023).
36. S. Chmiela, V. Vassilev-Galindo, O. T. Unke, A. Kabylda, H. E. Sauceda, A. Tkatchenko, and K.-R. Müller, "Accurate global machine learning force fields for molecules with hundreds of atoms," Sci. Adv. 9, eadf0873 (2023).
37. J. S. Smith, R. Zubatyuk, B. Nebgen, N. Lubbers, K. Barros, A. E. Roitberg, O. Isayev, and S. Tretiak, "The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules," Sci. Data 7, 134 (2020).
38. J. S. Smith, B. Nebgen, N. Lubbers, O. Isayev, and A. E. Roitberg, "Less is more: Sampling chemical space with active learning," J. Chem. Phys. 148, 241733 (2018).
39. V. Zaverkin, D. Holzmüller, L. Bonfirraro, and J. Kästner, "Transfer learning for chemically accurate interatomic neural network potentials," Phys. Chem. Chem. Phys. 25, 5383 (2023).
40. M. Haghighatlari, J. Li, X. Guan, O. Zhang, A. Das, C. J. Stein, F. Heidar-Zadeh, M. Liu, M. Head-Gordon, L. Bertels et al., "NewtonNet: A Newtonian message passing network for deep learning of interatomic potentials and forces," Digital Discovery 1, 333 (2022).
41. G. Simeon and G. De Fabritiis, "TensorNet: Cartesian tensor representations for efficient learning of molecular potentials," arXiv:2306.06482 (2023).
42. S.-L. J. Lahey, T. N. Thien Phuc, and C. N. Rowley, "Benchmarking force field and the ANI neural network potentials for the torsional potential energy surface of biaryl drug fragments," J. Chem. Inf. Model. 60, 6258 (2020).
43. S. Riniker, "Fixed-charge atomistic force fields for molecular dynamics simulations in the condensed phase: An overview," J. Chem. Inf. Model. 58, 565 (2018).
44. J. T. Horton, A. E. A. Allen, L. S. Dodda, and D. J. Cole, "QUBEKit: Automating the derivation of force field parameters from quantum mechanics," J. Chem. Inf. Model. 59, 1366 (2019).
45. X. Fu, Z. Wu, W. Wang, T. Xie, S. Keten, R. Gomez-Bombarelli, and T. S. Jaakkola, "Forces are not enough: Benchmark and critical evaluation for machine learning force fields with molecular simulations," in Transactions on Machine Learning Research, 2023.
46. A. S. Christensen and O. A. Von Lilienfeld, "On the role of gradients for machine learning of molecular energies and forces," Mach. Learn.: Sci. Technol. 1, 045018 (2020).
48. F. Neese, F. Wennmohs, U. Becker, and C. Riplinger, "The ORCA quantum chemistry program package," J. Chem. Phys. 152, 224108 (2020).
49. Z. Chen and Y. Yang, "Incorporating nuclear quantum effects in molecular dynamics with a constrained minimized energy surface," J. Phys. Chem. Lett. 14, 279 (2023).
50. M. Qamar, M. Mrovec, Y. Lysogorskiy, A. Bochkarev, and R. Drautz, "Atomic cluster expansion for quantum-accurate large-scale simulations of carbon," arXiv:2210.09161 (2022).
51. V. L. Deringer and G. Csányi, "Machine learning based interatomic potential for amorphous carbon," Phys. Rev. B 95, 094203 (2017).
52. P. Rowe, V. L. Deringer, P. Gasparotto, G. Csányi, and A. Michaelides, "An accurate and transferable machine learning potential for carbon," J. Chem. Phys. 153, 034702 (2020).
53. N. Artrith, A. Urban, and G. Ceder, "Efficient and accurate machine-learning interpolation of atomic energies in compositions with many species," Phys. Rev. B 96, 014112 (2017).
54. S. Takamoto, C. Shinagawa, D. Motoki, K. Nakago, W. Li, I. Kurata, T. Watanabe, Y. Yayama, H. Iriguchi, Y. Asano, T. Onodera, T. Ishii, T. Kudo, H. Ono, R. Sawada, R. Ishitani, M. Ong, T. Yamaguchi, T. Kataoka, A. Hayashi, N. Charoenphakdee, and T. Ibuka, "High-temperature multi-element 2021 (HME21) dataset," https://figshare.com/articles/dataset/High-temperature_multi-element_2021_HME21_dataset/19658538.
55. S. Takamoto, C. Shinagawa, D. Motoki, K. Nakago, W. Li, I. Kurata, T. Watanabe, Y. Yayama, H. Iriguchi, Y. Asano et al., "Towards universal neural network potential for material discovery applicable to arbitrary combination of 45 elements," Nat. Commun. 13, 2991 (2022).
56. S. Takamoto, S. Izumi, and J. Li, "TeaNet: Universal neural network interatomic potential inspired by iterative electronic relaxations," Comput. Mater. Sci. 207, 111280 (2022).
57. C. Chen and S. P. Ong, "A universal graph deep learning interatomic potential for the periodic table," Nat. Comput. Sci. 2, 718 (2022).
58. A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder et al., "Commentary: The materials project: A materials genome approach to accelerating materials innovation," APL Mater. 1, 011002 (2013).
59. P. J. Huber, "Robust estimation of a location parameter," in Breakthroughs in Statistics: Methodology and Distribution (Institute of Mathematical Statistics, 1992), p. 492.
60. B. Cheng, E. A. Engel, J. Behler, C. Dellago, and M. Ceriotti, "Ab initio thermodynamics of liquid and solid water," Proc. Natl. Acad. Sci. U. S. A. 116, 1110 (2019).
61. T. D. Kühne, M. Iannuzzi, M. Del Ben, V. V. Rybkin, P. Seewald, F. Stein, T. Laino, R. Z. Khaliullin, O. Schütt, F. Schiffmann et al., "CP2K: An electronic structure and molecular dynamics software package-quickstep: Efficient and accurate electronic structure calculations," J. Chem. Phys. 152, 194103 (2020).
62. O. Marsalek and T. E. Markland, "Quantum dynamics and spectroscopy of ab initio liquid water: The interplay of nuclear and electronic quantum effects," J. Phys. Chem. Lett. 8, 1545 (2017).
63. Y. Zhang, J. Xia, and B. Jiang, "Physically motivated recursively embedded atom neural networks: Incorporating local completeness and nonlocality," Phys. Rev. Lett. 127, 156002 (2021).
64. T. Ohto, M. Dodia, J. Xu, S. Imoto, F. Tang, F. Zysk, T. D. Kühne, Y. Shigeta, M. Bonn, X. Wu, and Y. Nagata, "Accessing the accuracy of density functional theory through structure and dynamics of the water–air interface," J. Phys. Chem. Lett. 10, 4914 (2019).
65. S. Melchionna, G. Ciccotti, and B. Lee Holian, "Hoover NPT dynamics for systems varying in shape and size," Mol. Phys. 78, 533 (1993).
66. S. Melchionna, "Constrained systems and statistical distribution," Phys. Rev. E 61, 6165 (2000).
67. R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules," Sci. Data 1, 140022 (2014).
68. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in International Conference on Machine Learning (PMLR, 2017), pp. 1263–1272.
69. M. Finzi, S. Stanton, P. Izmailov, and A. G. Wilson, "Generalizing convolutional neural networks for equivariance to Lie groups on arbitrary continuous data," in International Conference on Machine Learning (PMLR, 2020), pp. 3165–3176.
70. V. G. Satorras, E. Hoogeboom, and M. Welling, "E(n) equivariant graph neural networks," in International Conference on Machine Learning (PMLR, 2021), pp. 9323–9332.
71. Y. Liu, L. Wang, M. Liu, Y. Lin, X. Zhang, B. Oztekin, and S. Ji, "Spherical message passing for 3D molecular graphs," in International Conference on Learning Representations (ICLR), 2022.
72. J. Brandstetter, R. Hesselink, E. van der Pol, E. J. Bekkers, and M. Welling, "Geometric and physical quantities improve E(3) equivariant message passing," in International Conference on Learning Representations, 2022.
73. T. Le, F. Noé, and D.-A. Clevert, "Equivariant graph attention networks for molecular property prediction," arXiv:2202.09891 (2022).
74. Y.-L. Liao and T. Smidt, "Equiformer: Equivariant graph attention transformer for 3D atomistic graphs," in The Eleventh International Conference on Learning Representations, 2023.
75. C. Lu, Q. Liu, C. Wang, Z. Huang, P. Lin, and L. He, "Molecular property prediction: A multilevel quantum interactions modeling perspective," Proc. AAAI Conf. Artif. Intell. 33, 1052–1060 (2019).
76. J. Godwin, M. Schaarschmidt, A. L. Gaunt, A. Sanchez-Gonzalez, Y. Rubanova, P. Veličković, J. Kirkpatrick, and P. Battaglia, "Simple GNN regularisation for 3D molecular property prediction and beyond," in International Conference on Learning Representations, 2022, https://openreview.net/forum?id=1wVvweK3oIb.
77. S. Zaidi, M. Schaarschmidt, J. Martens, H. Kim, Y. W. Teh, A. Sanchez-Gonzalez, P. Battaglia, R. Pascanu, and J. Godwin, "Pre-training via denoising for molecular property prediction," in The Eleventh International Conference on Learning Representations, 2023.
78. F. Bigi, S. N. Pozdnyakov, and M. Ceriotti, "Wigner kernels: Body-ordered equivariant machine learning without a basis," arXiv:2303.04124 (2023).
79. A. S. Christensen, L. A. Bratholm, F. A. Faber, and O. Anatole von Lilienfeld, "FCHL revisited: Faster and more accurate quantum machine learning," J. Chem. Phys. 152, 044107 (2020).
80. V. Zaverkin, D. Holzmüller, I. Steinwart, and J. Kästner, "Fast and sample-efficient interatomic neural network potentials for molecules and materials based on Gaussian moments," J. Chem. Theory Comput. 17, 6658 (2021).
81. O. T. Unke and M. Meuwly, "PhysNet: A neural network for predicting energies, forces, dipole moments, and partial charges," J. Chem. Theory Comput. 15, 3678 (2019).