The MACE architecture represents the state of the art in the field of machine learning force fields for a variety of in-domain, extrapolation, and low-data regime tasks. In this paper, we further evaluate MACE by fitting models for published benchmark datasets. We show that MACE generally outperforms alternatives for a wide range of systems, from amorphous carbon, universal materials modeling, and general small molecule organic chemistry to large molecules and liquid water. We demonstrate the capabilities of the model on tasks ranging from constrained geometry optimization to molecular dynamics simulations and find excellent performance across all tested domains. We show that MACE is very data efficient and can reproduce experimental molecular vibrational spectra when trained on as few as 50 randomly selected reference configurations. We further demonstrate that the strictly local atom-centered model is sufficient for such tasks even in the case of large molecules and weakly interacting molecular assemblies.

## I. INTRODUCTION

Machine learning force fields are becoming part of the standard toolbox of computational chemists, as demonstrated by the increasing number of successful applications leading to new scientific discoveries in fields including that of amorphous materials,^{1} high-pressure systems,^{2} phase diagrams,^{3} and reaction dynamics of molecules.^{4} These applications were enabled by significant effort in developing a wide range of novel machine learning force field architectures. Recently, many of these were incorporated into a single, unifying design space,^{5,6} which helped uncover the relationship between seemingly dissimilar approaches such as descriptor based machine learning force fields^{7–12} and graph neural network based models.^{13–17} This new understanding directly led to the MACE architecture,^{18} which combines equivariant message passing with a high body-order description of the local atomic environment via spherical harmonic polynomials, leading to excellent performance in terms of accuracy, extrapolation, data, and computational efficiency.

In this paper, we evaluate the MACE architecture by training it on previously published datasets and benchmarking its performance on a wide variety of tasks to demonstrate its out-of-the-box applicability across the chemical and physical sciences. The paper is organized as follows. Sec. II briefly reviews the MACE architecture and details a training strategy that results in highly accurate models on all tested systems. In Sec. III, we provide evidence that MACE can accurately describe the dynamics of large supra-molecular systems, including those critically controlled by intermolecular interactions. Section IV shows that MACE can fit a large, chemically diverse dataset of compounds made up of the H, C, N, and O chemical elements, improving on the previous state of the art by over a factor of three. In Sec. V, we demonstrate that MACE can accurately fit the potential energy surface of small molecules with as few as 50 conformers. The resulting model is not only accurate on an independent test set but is also able to run long, stable molecular dynamics simulations without the need for any parameter tuning or further iterative training. In Secs. VI, VII, and IX, we test MACE on condensed phase systems (carbon, disordered materials, and liquid water), showing considerable improvements in accuracy compared to the models that were previously tested on these datasets. In the case of water, we also show that the resulting MACE model can accurately describe thermodynamic and kinetic properties using NVT and NPT molecular dynamics simulations. Finally, in Sec. X, we evaluate the MACE architecture on the QM9 machine learning benchmark, demonstrating that it also improves on the state of the art for several targets that are not force field fitting tasks.

## II. THE MACE ARCHITECTURE

In this section, we briefly review the MACE many-body equivariant message passing force field architecture and then discuss a general training strategy relying on a loss scheduler. We also highlight the body-order of MACE models and discuss its implications for smoothness and extrapolation.

### A. Many-body equivariant message passing

The MACE model parametrizes the mapping from the positions and chemical elements of the atoms to their potential energy. This is achieved by decomposing the total potential energy of the system into site energies (atomic contributions). The site energy of each atom depends on symmetric features that describe its chemical environment. MACE parametrizes these features using a many-body expansion. To construct the features on the atoms, first, the local environment $N(i)$ of atom *i* has to be defined. $N(i)$ is the set of all atoms *j* in the system for which |*r*_{ij}| ≤ *r*_{cut}, where *r*_{ij} denotes the vector from atom *i* to atom *j* and *r*_{cut} is a predefined cutoff (we discuss the effect of choosing different cutoff radii in Sec. III). Defining a local neighborhood allows the construction of a graph from the geometry, where the nodes are the atoms and the edges connect atoms in each other’s local environment. We denote the array of features of node (atom) *i* by *h*_{i} and express them in the spherical harmonic basis; hence, their elements are always indexed by *l* and *m*. The superscript on *h*^{(s)} indicates the iteration step (corresponding to “layers” of message passing in the parlance of graph neural networks), and we denote the number of layers in the model by *S*. Equivariance of the model is achieved by utilizing the transformation properties of the spherical harmonics $Y_l^m$ under 3D rotations, which are inherited by the node features with corresponding indices.
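The neighborhood definition above can be made concrete with a short sketch. The `build_graph` helper below is ours, purely for illustration (a brute-force double loop over an open, non-periodic cluster; the actual MACE implementation also handles periodic boundary conditions and uses efficient neighbor lists):

```python
import numpy as np

def build_graph(positions, r_cut):
    """Edge list of the neighborhood graph: a directed edge (i, j) exists
    whenever 0 < |r_ij| <= r_cut. Illustrative helper, not MACE code."""
    positions = np.asarray(positions, dtype=float)
    edges = []
    for i in range(len(positions)):
        for j in range(len(positions)):
            if i == j:
                continue
            r_ij = positions[j] - positions[i]   # vector from atom i to atom j
            if np.linalg.norm(r_ij) <= r_cut:
                edges.append((i, j))
    return edges

# Three collinear atoms 1 Å apart: with r_cut = 1.5 Å, only nearest
# neighbors end up in each other's environment N(i).
pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
print(build_graph(pos, r_cut=1.5))  # → [(0, 1), (1, 0), (1, 2), (2, 1)]
```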

We now describe the complete MACE model as it was introduced in Ref. 18. In our discussion, we focus on showing both the key equivariant operations and also on identifying all free parameters of the model that are optimized during training. The equations are grouped together in a block for ease of reference. All free (“learnable”) parameters are denoted by the letter *W*, and all occurrences below correspond to *different* blocks of free parameters—the shapes of these parameter arrays are indicated by the use of different indices. Note that in addition to the explicitly shown free parameters *W*, the fully connected multi-layer perceptrons (MLPs) also contain internally further learnable parameters.

We start by initializing the node features $h_i^{(0)}$ as a (learnable) embedding of the chemical elements with atomic number *z*_{i} into *k* learnable channels, cf. Eq. (1). This kind of mapping has been extensively used for graph neural networks^{13–16} and elsewhere^{19,20} and has been shown to lead to some transferability between molecules with different elements.^{5} The zeros for the *lm* indices correspond to these initial features being scalars, i.e., rotationally invariant. The higher order elements of $h_i^{(0)}$ with nonzero *lm* indices are implicitly initialized to zero. At the beginning of each subsequent iteration, the node features (both scalars and higher order) are linearly mixed together, resulting in $\bar{h}_j$ [cf. Eq. (2)].
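The element embedding of Eq. (1) amounts to a one-hot encoding of the atomic number multiplied by a learnable weight matrix. A minimal sketch (the function name, element list, and shapes are illustrative assumptions, not the MACE source):

```python
import numpy as np

rng = np.random.default_rng(0)

k = 8                       # number of learnable channels
elements = [1, 6, 7, 8]     # atomic numbers covered by the model (H, C, N, O)
W_embed = rng.standard_normal((len(elements), k))  # learnable weights

def initial_node_features(z):
    """h_i^{(0)}: one-hot element encoding times the embedding weights,
    giving k scalar (l = 0, m = 0) features per atom."""
    one_hot = np.array([float(z == e) for e in elements])
    return one_hot @ W_embed

h0 = initial_node_features(6)   # features of a carbon atom
print(h0.shape)                 # → (8,)
```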

The one-particle basis *ϕ* combines the radial, angular, and node-feature information of the neighboring atoms; it plays a role analogous to the local descriptors of descriptor-based models^{8} and ACE,^{5,10} and we construct it in a similar way to Cormorant^{21} and NequIP.^{16} The relationships between these different approaches are discussed in Ref. 5. We construct the radial basis set using the first spherical Bessel function $j_0^n$ for different wavenumbers *n*, up to some small maximum (typically 8), in Eq. (3), as proposed in Ref. 14. The Bessel functions are multiplied by a polynomial cutoff function *f*_{cut}(*r*_{ij}), which goes to zero smoothly at *r*_{ij} = *r*_{cut}. The radial information is then passed through an MLP, Eq. (4), whose inputs are the Bessel functions with different frequencies and which has many outputs, indexed by (*η*_{1}, *l*_{1}, *l*_{2}, *l*_{3}), whose purpose is explained next. When combining positional information (itself an equivariant that transforms under rotation like a vector) with equivariant node features, we use the spherical tensor product formalism of angular momentum addition.^{22} All possible combinations of equivariants are constructed using the appropriate Clebsch–Gordan coefficients, Eq. (5). This operation is currently implemented using e3nn.^{23} There are multiple ways of constructing an equivariant combination with a given symmetry corresponding to (*l*_{3}, *m*_{3}), and these multiplicities are enumerated by the index *η*_{1}. The large number of outputs (and their associated learnable weights) from the radial MLP are used as separate degrees of freedom for each of these combinations.

The one-particle basis *ϕ* is summed over the neighborhood in Eq. (6); this is where permutation invariance of the MACE descriptors over the atoms in the neighborhood is achieved—note that the identity of chemical elements has already been embedded, and hence this sum is over all atoms, regardless of their atomic number. A linear mixing of *k* channels with learnable weights yields the atomic basis, *A*_{i} (cf. ACE^{5,10}).

In an analogous fashion to the ACE construction, many-body symmetric features are formed on each atom by taking the tensor product of the atomic basis, *A*, with itself *ν* times, yielding the “product basis,” *A*_{i} [cf. Eq. (7)]. Note that in forming the tensor product, each *k* channel is treated independently. This method of *tensor sketching* has been widely used in signal processing^{24} and was formally shown to not degrade the expressibility of many-body models while substantially reducing computational cost compared to a full tensor product.^{25} The tensor product of Eq. (7) is implemented using an efficient loop tensor contraction algorithm.^{18}
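The channel-wise tensor product of Eq. (7) can be illustrated in the simplest case of purely invariant (*l* = 0) features, where it reduces to an elementwise power. The toy example below (ours, not MACE code) contrasts its cost with a full tensor product:

```python
import numpy as np

# Each of the k channels of the atomic basis A_i is multiplied with itself
# independently, so the nu-fold "tensor sketch" product produces O(k) numbers
# instead of the O(k**nu) of a full tensor product across channels.
k, nu = 4, 3
A_i = np.array([1.0, 2.0, 0.5, -1.0])          # toy atomic basis, k channels

channelwise = A_i ** nu                        # k numbers
full = np.einsum('a,b,c->abc', A_i, A_i, A_i)  # k**nu = 64 numbers

print(channelwise.tolist())  # → [1.0, 8.0, 0.125, -1.0]
print(full.size)             # → 64
```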

In Eq. (8), the product basis is contracted to yield the fully symmetric basis, **B**_{i}, using the generalized Clebsch–Gordan coefficients, $\mathcal{C}$, where again, there are multiple ways to arrive at a given output symmetry, and these are enumerated by *η*_{ν}. The bold **lm** signify that these are multi-indices (an array of indices), and the bold styles of **A** and **B** are a reminder that these are many-body features. The maximum body-order is controlled by limiting *ν*.

Finally, a “message” *m*_{i} is formed on each atom as a learnable linear combination of the symmetrized many-body features of the neighbors, as shown in Eq. (9). To form the node features of the next layer, we add this message to the atom’s (node’s) features from the previous iteration using weights that depend explicitly on the atom’s chemical element (*z*_{i}), as shown in Eq. (10). Note that because the initial node features *h*^{(0)} are solely functions of the chemical element corresponding to the node, in the first layer the second term of Eq. (10) is omitted. This allows setting and fixing the energy of isolated atoms (i.e., those with no neighbors in their environment),^{5} which is often desirable.^{26}

Equations (2)–(10) comprise a MACE layer, and multiple layers are built by iteration. This means that the effective receptive field of the model (the region around an atom from which information is used to determine the site energy of the atom) is approximately a sphere of radius *S* × *r*_{cut}. More precisely, a neighboring atom contributes to the site energy if it can be reached in *S* hops on the graph defined earlier. The selected clusters form a sparse set of all possible clusters in *S* × *r*_{cut}. Therefore, message passing methods can be viewed as sparsifications of larger atom-centered methods.^{5} In practice, we almost always use two layers, so *S* = 2.
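The hop-counting picture of the receptive field can be made concrete with a short breadth-first search over the neighborhood graph (`hop_distances` is our illustrative helper, not part of the MACE code):

```python
from collections import deque

def hop_distances(edges, source, n_atoms):
    """Number of message-passing hops from `source` to every reachable atom;
    an atom contributes to the site energy of `source` iff its hop count <= S."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j in edges:
        adj[i].append(j)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Chain 0-1-2-3: with S = 2 layers, atom 0 "sees" atoms up to 2 hops away,
# so atom 3 is outside its receptive field.
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
d = hop_distances(edges, 0, 4)
print([i for i in range(4) if d.get(i, 99) <= 2])  # → [0, 1, 2]
```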

Neural network models such as MACE have a very large number of free parameters, especially compared with linear or kernel based models typically applied to the same tasks. It is thought that this overparameterization helps the training when using stochastic gradient descent and brings accuracy and regularity.^{27,28} The trade-off between the size (and, therefore, computational cost) of the model and its accuracy needs to be controllable. The key model size controls in MACE are the number of embedding channels *k* and the highest order *L*_{max} of the symmetric features $B^{(s)}_{i,\eta_\nu kLM}$. In this paper, we will refer to the different sized models using these two numbers; for example, the invariant MACE model (corresponding to *L*_{max} = 0) with 64 channels is denoted MACE 64-0.

The computational cost of running a MACE model depends on the details of the systems studied (e.g., the average number of neighbors) and the hyperparameters that control the size of the model. The latency of the models (i.e., the shortest time to calculate forces on small systems, excluding calculation of the neighbor list and other book-keeping necessary for running MD) ranges from 0.7 to 20 ms, corresponding to small (MACE 64-0) or larger (MACE 256-2) models as measured on an Nvidia A100 GPU. A comprehensive assessment of MD performance, which we leave to future work, needs to consider the relationship between model size, accuracy, and execution speed. It is also useful to report the *fastest possible* MD performance, which we take to be that of the smallest model that can run stable MD for an arbitrary (in practice, very large) number of steps. At present, the performance of MACE in MD simulations is between 2 and 50 M steps/day (a million steps corresponds to a nanosecond with a 1 fs time step), depending on model size and the details of the full software stack, with the highest performance achieved using the JAX version of MACE with JAX-MD^{29} (see, e.g., the salicylic acid example run using the 64-0 model). The evaluation speeds in LAMMPS and OpenMM are currently somewhat slower, but new custom CUDA kernels and improved MD interfaces are actively being developed that will likely enable even higher performance. The training times of the MACE models also vary vastly depending on the dataset size and the model size and typically range from 1 h to 1 day on a single A100 GPU for the models shown here.

The computational performance of MACE compares very favorably to other alternative equivariant neural network potentials. This is primarily the result of MACE forming the high body-order features very efficiently via the symmetric tensor products carried out on the nodes. For example, the Allegro model, one of the leading architectures, which uses tensor products between the features on the edges of the graph rather than on the nodes, has a reported speed of 18 M steps per day.^{30} It should be noted, though, that the Allegro model has already been demonstrated to scale well to many millions of atoms on many GPUs.^{31} This is, in principle, possible for MACE, which is also a strictly local model with a similarly moderate size receptive field, 5–10 Å (Allegro typically uses 4–6 Å for production runs and larger cutoffs for accuracy benchmarks), though the necessary LAMMPS interface implementation is ongoing work and will be demonstrated in future publications.

### B. The body-order of MACE models

It is interesting to consider the node features *h*_{i} of MACE as body-ordered descriptors that, in the limit of the complete basis of radial functions and spherical harmonics, fully linearly span the space of symmetric functions over chain like clusters with hops of size *r*_{cut} up to the maximum body-order of the features.^{5,25,32} In the MACE models studied in this paper, each layer has a body-order of 4 [corresponding to *ν* = 3 in Eq. (7)]. Therefore, the node features of the first layer are four-body functions. They are then expanded in the second layer into the next one-particle basis, which consists of 4 + 1 = 5-body functions, since each application of the one-particle basis adds one to the body-order on account of the central atom. Then, taking the tensor products of these features with themselves three times, we obtain 3 × 4 + 1 = 13-body features (note that the five-body terms of the first layer all share the same central atom). This mechanism of efficiently forming very high body-order linearly complete features is unique to the MACE architecture, and it might be one of the reasons underpinning its excellent performance in the low-data regime and extrapolation.
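The counting argument above follows the recursion *b*_{s} = *ν* *b*_{s−1} + 1 (ν neighbor features plus the shared central atom), starting from one-body element embeddings; a few lines of code make it explicit (a sketch of the counting only):

```python
def feature_body_order(nu, num_layers):
    """Body-order of the node features after each layer, following
    b_s = nu * b_{s-1} + 1 with b_0 = 1 (h^(0) depends on the central atom only)."""
    b = 1
    orders = []
    for _ in range(num_layers):
        b = nu * b + 1
        orders.append(b)
    return orders

# nu = 3: 4-body features after layer one, 3 * 4 + 1 = 13-body after layer two.
print(feature_body_order(nu=3, num_layers=2))  # → [4, 13]
```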

### C. Loss scheduler

We train the model parameters by minimizing a weighted loss over energies and forces,

$$\mathcal{L} = \frac{\lambda_E}{B} \sum_{b=1}^{B} \left( \frac{E_b - \hat{E}_b}{N_b} \right)^2 + \frac{\lambda_F}{B} \sum_{b=1}^{B} \sum_{i_b=1}^{N_b} \sum_{\alpha=1}^{3} \left( -\frac{\partial E_b}{\partial r_{i_b,\alpha}} - \hat{F}_{i_b,\alpha} \right)^2,$$

where the sum runs over the configurations *b* in the current batch with batch size *B*, *E*_{b} is the model’s prediction of the total energy, and $\hat{E}_b$ and $\hat{F}_b$ are the training data corresponding to total energy and force, respectively. The number of atoms in the configuration is *N*_{b}; atoms are indexed by *i*_{b} with elements of their position vector denoted by $r_{i_b,\alpha}$ in the Cartesian direction *α*. The weights of energies and forces are controlled by *λ*_{E} and *λ*_{F}. The choice of these weights is crucial, and we observed that the most accurate models are obtained when we train with a high weight on the forces compared to the other properties. A possible reason for this could be that the forces are local quantities and contain information about the dependence of the total energy on each atom. However, having very accurate force predictions does not necessarily result in accurate energy predictions. This is especially the case in systems where the training set is heterogeneous, by which we mean that it is composed of a wide variety of different systems well separated in atomic configuration space, e.g., different molecules or phases of a solid with very different structures. In this case, the model can accurately learn the local potential energy surface of each system individually, but it might not learn their relative energies correctly, resulting in large absolute energy errors.

To reduce absolute energy errors, we use a loss scheduler. For about 60% of the total training time, *λ*_{F} > *λ*_{E} is used. For the second part of the training, the weights are switched so that *λ*_{E} > *λ*_{F}, and the learning rate is decreased by a factor of 10. In this way, the absolute energy errors can be reduced while preserving the high accuracy of the force predictions. An example of the energy validation error dramatically improving as the loss is changed is given in Appendix A, Fig. 7. The figure also compares two-phase training with training that uses the weights of the second phase from the beginning and shows that the two-phase protocol not only accelerates convergence but also results in an overall decrease in errors. Note that many of the experiments in the paper were carried out using an earlier version of the loss function, which also used the average over the force components rather than the sum, as shown in Appendix B.
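A minimal sketch of such a scheduler (the 60% switch point and factor-of-10 learning rate drop follow the text; the specific *λ* values below are illustrative assumptions, not the settings used in the paper):

```python
def loss_weights(step, total_steps, lr0=0.01):
    """Return (lambda_E, lambda_F, learning_rate) for a given training step."""
    if step < 0.6 * total_steps:
        return 1.0, 1000.0, lr0          # phase 1: high weight on forces
    return 1000.0, 10.0, lr0 / 10.0      # phase 2: high weight on energies, lower lr

lam_E, lam_F, lr = loss_weights(step=100, total_steps=1000)
print(lam_F > lam_E)   # → True (phase 1 emphasizes forces)
lam_E, lam_F, lr = loss_weights(step=900, total_steps=1000)
print(lam_E > lam_F)   # → True (phase 2 emphasizes energies)
```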

## III. LOCALITY OF LARGE MOLECULAR SYSTEMS

In this section, we examine the locality assumption used by MACE and compare its performance against a global machine learning force field, sGDML, which does not use a cutoff to decompose the energy into local site energy terms.^{33} In particular, we compare MACE to the recently improved version of sGDML, which uses a reduced set of global descriptors to allow for the fitting of systems with hundreds of atoms.^{34} Furthermore, we also compare MACE to the VisNet-LSRM model,^{35} which employs a mixed short-range and long-range description of the potential energy surface by combining local message passing with message passing between larger fragments.

As explained in Sec. II A, the effective cutoff in MACE is the number of layers times the cutoff distance in each layer. We also examine to what extent the information transfer between layers is effective and how property predictions of large molecular systems differ between a two-layer and a single-layer MACE model with the same receptive fields.

For this study, we use the recently published MD22 dataset, which was designed to be challenging for short-range models.^{36} The datasets include large molecules and molecular assemblies containing hundreds of atoms with complex intermolecular interactions. The dataset was created by running *ab initio* molecular dynamics simulations at elevated temperatures to better sample the configuration space of the systems. The training set size was determined such that the total energy error of the sGDML model is below the chemical accuracy of 1 kcal/mol. Our study uses the same training set sizes for a fair comparison (Table I).

TABLE I. Energy (E, meV/atom) and force (F, meV/Å) errors on the MD22 dataset; *d* denotes the approximate diameter of each system.

| System | d (Å) | | MACE 256-2 (2 × 3 Å) | MACE 256-0 (1 × 6 Å) | MACE 256-2 (2 × 5 Å) | VisNet-LSRM (long-range) | sGDML (global) |
|---|---|---|---|---|---|---|---|
| Tetrapeptide | ∼12 | E | 0.608 | 0.345 | 0.064 | 0.080 | 0.40 |
| | | F | 7.6 | 17.0 | 3.8 | 5.7 | 34 |
| Fatty acid | ∼16 | E | 0.446 | 0.399 | 0.102 | 0.058 | 1.0 |
| | | F | 6.2 | 23.5 | 2.8 | 3.6 | 33 |
| Tetrasaccharide | ∼14 | E | 0.252 | 0.357 | 0.062 | 0.044 | 2.0 |
| | | F | 6.8 | 27.0 | 3.8 | 5.0 | 29 |
| Nucleic acid (AT-AT) | ∼22 | E | 0.902 | 0.155 | 0.079 | 0.055 | 0.52 |
| | | F | 13.3 | 14.9 | 4.3 | 5.2 | 30 |
| Nucleic acid (AT-AT-CG-CG) | ∼24 | E | 0.603 | 0.166 | 0.058 | 0.049 | 0.52 |
| | | F | 16.3 | 20.1 | 5.0 | 8.3 | 31 |
| Buckyball catcher | ∼15 | E | 0.476 | 0.171 | 0.141 | 0.124 | 0.34 |
| | | F | 13.1 | 22.2 | 3.7 | 11.6 | 29 |
| Double-walled nanotube | ∼33 | E | 0.207 | 0.231 | 0.194 | 0.117 | 0.47 |
| | | F | 17.9 | 39.6 | 12.0 | 28.7 | 23 |


### A. Effect of locality on energy and force errors

In order to test the influence of the receptive field of local models on their accuracy, we trained a series of three MACE models with different combinations of cutoff distances and the number of layers. The typical MACE model with two layers and a 5 Å cutoff at each layer is compared to a one layer MACE with a 6 Å cutoff and a two layer MACE with both layers having 3 Å cutoffs.

The best performing MACE model, which employs a 2 × 5 Å cutoff, improves the errors of the sGDML model by up to a factor of 10. Crucially, even this 10 Å receptive field is considerably smaller than the diameter of the systems in the dataset, so the MACE model is effectively local. Even the 2 × 3 and 1 × 6 Å models outperform sGDML for all systems. This is likely because the strength of local intramolecular (covalent) interactions is much higher than that of intermolecular interactions, so better overall accuracy can be achieved by improving only the short-range description, which MACE evidently does. In Appendix D, we show the vibrational spectrum of the tetrapeptide, demonstrating excellent agreement between all three MACE models. When comparing MACE to VisNet-LSRM, the best model reported to date, we find that the strictly local MACE model has significantly lower force errors, but in most cases, the long-range model has lower energy errors, though it should be noted that the energy errors of both models are on the order of 0.1 meV/atom or lower in most cases. This small but consistent improvement in the energies might result from a more accurate description of the remaining small long-range interactions beyond the 10 Å cutoff of our longest range MACE model or from different relative force and energy weights used during training. It is not trivial to devise benchmarks that specifically test the description of long-range interactions and would thereby pose a difficulty for short-range models.

Typical intermolecular interactions appear to be well captured in the 5–6 Å range. The evidence for this is in the energy errors for the nucleic acid and the Bucky-ball catcher systems. In these systems, the intermolecular interactions contribute considerably to the total energy, and the MACE model with a 3 Å cutoff in each layer cannot describe them well. In contrast, the two longer-ranged MACE models have significantly lower energy errors. It is not a coincidence that most short-range ML force field models in the literature have cutoffs between 5 and 10 Å (Table I).

TABLE II. Energy (E) and force (F) errors on the COMP6 benchmark suite in kcal/mol and kcal/mol/Å, respectively.

| Subset | | ANI-1x | GM-NN | NewtonNet | TensorNet | MACE 64-0 | MACE 96-1 | MACE 192-2 |
|---|---|---|---|---|---|---|---|---|
| ANI-MD | E | 3.40 | 3.83 | ⋯ | 1.61 | 10.3 | 2.81 | 3.25 |
| | F | 2.68 | 1.43 | ⋯ | 0.82 | 1.92 | 0.89 | 0.62 |
| DrugBank | E | 2.65 | 2.78 | ⋯ | 0.98 | 1.81 | 1.04 | 0.73 |
| | F | 2.86 | 1.69 | ⋯ | 0.75 | 1.20 | 0.70 | 0.47 |
| GDB 7–9 | E | 1.04 | 1.22 | ⋯ | 0.32 | 0.77 | 0.40 | 0.21 |
| | F | 2.43 | 1.41 | ⋯ | 0.53 | 0.96 | 0.54 | 0.34 |
| GDB 10–13 | E | 2.30 | 2.29 | ⋯ | 0.83 | 1.54 | 0.88 | 0.53 |
| | F | 3.67 | 2.25 | ⋯ | 0.97 | 1.52 | 0.92 | 0.62 |
| S66x8 | E | 2.06 | 2.95 | ⋯ | 0.62 | 1.17 | 0.69 | 0.39 |
| | F | 1.60 | 0.93 | ⋯ | 0.33 | 0.65 | 0.33 | 0.22 |
| Tripeptides | E | 2.92 | 3.06 | ⋯ | 0.92 | 2.10 | 1.18 | 0.79 |
| | F | 2.49 | 1.48 | ⋯ | 0.62 | 1.09 | 0.66 | 0.44 |
| COMP6 total | E | 1.93 | 2.03 | 1.45* | ⋯ | 1.47 | 0.76 | 0.48 |
| | F | 3.09 | 1.85 | 1.79* | ⋯ | 1.31 | 0.77 | 0.52 |


### B. Effect of locality beyond RMSE—The dynamics of the bucky-ball catcher

In the following, we investigate the Bucky-ball catcher system in more detail because intermolecular interactions play a crucial role in its dynamics. Following the experiment in Ref. 36, we ran 200 ps of NVT molecular dynamics simulation with each of the three MACE models introduced in Sec. III A. As shown in the left panel of Fig. 1 for the 2 × 3 Å MACE model, the system dissociated into the Bucky-ball and the catcher within 10 ps. The model, which on the whole has excellent accuracy (just 0.5 meV/atom for energies and 13 meV/Å for forces), is unable to fit the attractive dispersion between the two sub-parts of the system because they are typically further than 3 Å apart. On the other hand, the two slightly longer ranged MACE models provide qualitatively correct dynamics. We have also computed the molecular vibrational spectrum, obtained as the Fourier transform of the velocity–velocity autocorrelation function, to analyze the dynamics quantitatively. In the right panel of Fig. 1, we compare the result of the one-layer 6 Å MACE model to the spectrum published in Ref. 36 computed using sGDML and find excellent agreement, including at low frequencies, demonstrating that local models can simulate the dynamics of systems even when intermolecular interactions are important.
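The spectrum-from-autocorrelation analysis used here can be sketched in a few lines of NumPy (a minimal illustration of the technique; the windowing and smoothing typically applied in practice are omitted):

```python
import numpy as np

def vibrational_spectrum(velocities, dt):
    """Vibrational spectrum as the Fourier transform of the velocity-velocity
    autocorrelation function.

    velocities: array of shape (n_steps, n_atoms, 3) from an MD trajectory
    dt: time step in seconds
    Returns (frequencies in Hz, spectral intensity in arbitrary units).
    """
    v = velocities - velocities.mean(axis=0)
    n = v.shape[0]
    # Autocorrelation per atom and Cartesian component via zero-padded FFT
    # (Wiener-Khinchin), then summed over atoms and components.
    fv = np.fft.rfft(v, n=2 * n, axis=0)
    acf = np.fft.irfft(fv * np.conj(fv), axis=0)[:n].sum(axis=(1, 2))
    acf /= acf[0]
    spectrum = np.abs(np.fft.rfft(acf))
    freqs = np.fft.rfftfreq(n, d=dt)
    return freqs, spectrum

# A single atom oscillating at 10 THz produces a peak at 1e13 Hz.
dt = 1e-15
t = np.arange(4096) * dt
vel = np.cos(2 * np.pi * 1e13 * t)[:, None, None] * np.ones((1, 1, 3))
freqs, spec = vibrational_spectrum(vel, dt)
print(abs(freqs[np.argmax(spec)] - 1e13) < 3e11)  # → True
```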

## IV. COMP6—H, C, N, O ORGANIC FORCE FIELD

In this section, we demonstrate that the MACE architecture can be used to train a highly accurate, transferable general organic force field for the H, C, N, and O chemical elements. To train the model, we used a subset of the ANI-1x dataset, which contains coupled cluster calculations.^{37} We trained a series of MACE models, going from a small (64-0) MACE model to a medium (96-1) and a large (192-2) model. First, we trained the models using density functional theory (DFT) energies and forces and tested them on the COMP6 benchmark suite.^{38} The results are summarized in Table II. We can see that even the smallest MACE model outperforms most previously published models.^{38–41} The large MACE model improves on the previous state of the art by about a factor of 5, achieving an overall error well below 0.5 kcal/mol. The ANI-MD subset is one where the MACE total energy errors are relatively high compared with the other subsets. This subset comprises configurations of 14 different molecules sampled from ANI-MD trajectories. The MACE errors are relatively low on 12 of the 14 molecules. However, on the two largest ones (Chignolin, 149 atoms, and TrpCage, 312 atoms), the MACE energy is shifted by a constant value in comparison to the DFT reference energy. The energy–energy correlation plot for all molecules in this subset is shown in Appendix F, Fig. 9. The task is particularly challenging because the training set contains very few molecules with more than 50 atoms; hence, when testing on larger systems, there can be an accumulation of small errors that are not controlled in the training. A simple test of this hypothesis would be to add a small number of larger molecules to the training set.

We also performed transfer learning to the coupled cluster level of theory using energies only, following Ref. 39. The resulting six pre-trained MACE models are published and available to download from GitHub and can be evaluated using the Atomic Simulation Environment (ASE), OpenMM, and LAMMPS.

### A. Biaryl torsion benchmark

We now evaluate the coupled cluster transfer learned versions of the above-mentioned MACE organic force fields on the challenging dihedral torsion dataset introduced in Ref. 42. This dataset consists of 88 small drug-like molecules with different biaryl dihedral torsional profiles. Such a benchmark is of particular interest in connection with small molecule drug discovery. Accurately describing the torsional barriers is a typical task where classical empirical force fields cannot provide sufficiently accurate descriptions of the potential energy surface.^{43,44}

We have evaluated the MACE dihedral torsional scans on a subset of the full dataset that contains only H, C, N, and O chemical elements (78 molecules). In the following, we evaluate the largest MACE model (192-2) transfer learned to the CC level of theory and compare it to ANI-1ccx. We find that the MACE model achieves a mean absolute barrier height error, compared to coupled cluster singles doubles and perturbative triples [CCSD(T)], of 0.36 kcal/mol, improving significantly on the ANI-1ccx model, which was identified as the best to date and has an error of 0.78 kcal/mol. We have also compared the cumulative error distribution of MACE and ANI, shown in the bottom panel of Fig. 2. It shows that MACE has very few molecules for which it makes an error close to or larger than 1 kcal/mol. Notably, even the medium and small MACE models perform better than or comparably to the ANI model, with average barrier height errors of 0.56 and 0.83 kcal/mol, respectively.

All 78 torsional profiles computed with all three CC transfer learned MACE models can be found in the supplementary material. In Fig. 2, we show three examples that were identified as particularly challenging in the original paper and compare MACE 192-2 to the ANI-1ccx model. The MACE model produces a smooth surface with the correct positions of the minima and maxima in all three cases, as it does for the rest of the test molecules shown in the supplementary material.

## V. VIBRATIONAL SPECTRUM FROM 50 COUPLED CLUSTER CALCULATIONS

A final test of MACE on small molecular systems is to evaluate its performance in the low-data regime. Data efficiency is crucial in atomistic machine learning as it can save time and computational costs or enable the models to be trained with more accurate reference data. Recent work has found that the majority of machine learning potentials are not able to run stable molecular dynamics simulations without iterative fitting, even when they are trained on thousands of configurations of a system.^{45} Furthermore, the authors found that very low energy and force RMSE did not correlate with the ability to perform stable simulations.

To demonstrate the excellent data efficiency and extrapolation capabilities of MACE, we use the 256-2 MACE models published in Ref. 18 trained using rMD17^{46} and compare the models trained on 50 and 1000 small molecule geometries sampled randomly from the dataset. It was previously shown that 50 configurations are sufficient to train a model with a sub-kcal/mol total energy error. For three selected systems, ethanol, paracetamol, and salicylic acid, we ran 50 ps long NVT molecular dynamics simulations that were stable for both training set sizes, obviating the need for any further active learning or iterative fitting. To validate the dynamics quantitatively, we compared the molecular vibrational spectrum. The top panel of Fig. 3 shows the spectra for ethanol. The two MACE models are in excellent agreement and are able to reproduce the peaks from the experimental gas-phase IR spectrum of ethanol.^{47} In Appendix G, we also show that the same result holds for paracetamol and salicylic acid.
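A vibrational spectrum such as the one in Fig. 3 is commonly obtained as the power spectrum of the velocity autocorrelation function of an MD trajectory. The sketch below is a minimal numpy implementation of this standard procedure; it assumes a hypothetical velocity array and is not the exact post-processing script used for the figure:

```python
import numpy as np

def power_spectrum(velocities, dt_fs):
    """Vibrational density of states from the velocity autocorrelation function (VACF).

    velocities: (n_steps, n_atoms, 3) array from an MD trajectory (hypothetical input).
    dt_fs: timestep between frames in femtoseconds.
    Returns wavenumbers in cm^-1 and intensities in arbitrary units.
    """
    n_steps = velocities.shape[0]
    v = velocities.reshape(n_steps, -1)
    # VACF of every Cartesian component via FFT (Wiener-Khinchin, zero-padded)
    vhat = np.fft.rfft(v, n=2 * n_steps, axis=0)
    acf = np.fft.irfft(vhat * np.conj(vhat), axis=0)[:n_steps].sum(axis=1)
    acf /= acf[0]  # normalize so the VACF starts at 1
    # Power spectrum of the normalized VACF
    freq_hz = np.fft.rfftfreq(n_steps, d=dt_fs * 1e-15)
    wavenumber = freq_hz / 2.99792458e10  # Hz -> cm^-1
    intensity = np.abs(np.fft.rfft(acf))
    return wavenumber, intensity
```

A single harmonic mode produces a single peak at the corresponding frequency, which makes the routine easy to sanity-check on synthetic data.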

With model architectures able to learn an accurate representation of the potential energy surface from so few configurations, we can create better models by training them on reference data obtained using much more expensive correlated wave-function based quantum mechanical calculations. To demonstrate this, we recomputed the 50 training configurations from rMD17 at the CCSD(T) level of theory, including the forces.^{48} We fitted a MACE model to these data directly, without transfer learning. We also compared this model to the smallest (64-0) general organic MACE model that was transfer learned to coupled-cluster data in Sec. IV. We find that the custom trained and general MACE models predict very similar spectra, as shown in the bottom panel of Fig. 3. Interestingly, the peaks corresponding to stretching modes involving hydrogen atoms are blue-shifted compared to the experimental positions. This is a result of nuclear quantum effects, and the experimental spectrum can be recovered by including these effects via techniques that treat the nuclei as quantum particles.^{49} Note how these shifts are smaller for the original rMD17-derived models, an example of cancellation of errors between the missing correlation in DFT and the neglected quantum nuclear effects.

## VI. AMORPHOUS CARBON

To demonstrate the capabilities of MACE on complex solid state systems, we train a model on the carbon dataset published in Ref. 50. This dataset comprises a wide variety of carbon phases, from crystalline diamond and graphite structures to various amorphous phases and small clusters. The complexity comes from the unique bonding behavior of carbon atoms, which can accommodate one to four neighbors depending on their oxidation and hybridization states. While the crystalline phases typically contain only one type of hybridization, sp^{2} in graphite and sp^{3} in diamond, in amorphous carbon both occur simultaneously and can even vary dynamically. Amorphous carbon can be regarded as one of the most challenging systems for machine learning force fields due to its highly complex short- and medium-range electronic structure, especially in the low density, sp^{2} dominated phases.^{51,52}
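A simple geometric proxy for the sp^{2}/sp^{3} character discussed above is the coordination number within a bond-length cutoff: three neighbors suggest sp^{2}-like, four sp^{3}-like carbon. The sketch below is non-periodic and uses a hypothetical 1.85 Å cutoff; a production analysis would account for periodic boundary conditions:

```python
import numpy as np

def hybridization_counts(positions, cutoff=1.85):
    """Count atoms by coordination number within a bond cutoff (Å).

    positions: (n_atoms, 3) Cartesian coordinates (non-periodic, for simplicity).
    Returns a dict mapping coordination number -> number of atoms.
    """
    d = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)       # exclude self-distances
    coord = (d < cutoff).sum(axis=1)  # coordination number per atom
    return {int(c): int((coord == c).sum()) for c in np.unique(coord)}

# Toy example: an sp3-like center with four neighbors at ~1.54 Å (tetrahedral)
center = np.zeros((1, 3))
shell = 1.54 * np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / np.sqrt(3)
counts = hybridization_counts(np.vstack([center, shell]))
print(counts)
```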

## VII. HME21: 37-ELEMENT DISORDERED CRYSTALS

The simultaneous description of a large number of different chemical elements is a long-standing challenge for machine learning force fields.^{19,25,53} The HME21 dataset^{54,55} is made up of a diverse set of disordered and regular crystals containing 37 different chemical elements, covering a broad spectrum of chemical space. We train a 256-2 MACE model on the dataset and compare our results with other previously published models in Table III.

## VIII. M3GNet: 89-ELEMENT CONDENSED MATERIALS

Another challenging condensed phase materials science dataset is the training set of the M3GNet model.^{57} This dataset contains an extensive range of materials extracted from the Materials Project.^{58} Its aim is to provide a training set for a universal condensed matter interatomic potential covering all 89 elements from hydrogen to thorium.

This dataset is distinct from the previously studied ones due to its greater range of interaction energies and forces. To be able to train the MACE model on such a diverse dataset, we used the Huber loss,^{59} as proposed in the M3GNet paper.^{57} This loss function switches between an L2 and an L1 loss, making the learning more tolerant toward outliers. We also found it beneficial to slightly generalize the radial basis by making it dependent on the atom's element and local environment (see Appendix C). We observe that an out-of-the-box MACE model with the modified loss and radial function roughly matches the accuracy of M3GNet, improving on the force errors (Table IV). Several possible modifications to MACE could further improve its performance on training sets with large chemical diversity, including making the cutoffs element-dependent; these will be investigated in future work.
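For reference, the Huber loss has the following standard form: quadratic below a threshold δ and linear above it, so outliers contribute linearly rather than quadratically. δ = 0.01 matches the value quoted in Appendix D, but this sketch is illustrative rather than the exact training loss:

```python
import numpy as np

def huber(residuals, delta=0.01):
    """Huber loss: 0.5*r^2 for |r| <= delta, delta*(|r| - 0.5*delta) beyond."""
    r = np.abs(np.asarray(residuals, dtype=float))
    quadratic = 0.5 * r**2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear).mean()

# Small residuals fall in the quadratic (L2-like) regime;
# a single large-force outlier is damped to ~delta*|r| instead of 0.5*r^2.
print(huber([0.005, 0.005]))
print(huber([0.005, 10.0]))
```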

## IX. LIQUID WATER

To assess the ability of MACE to describe complex molecular liquids, we fitted a model to a dataset of 1593 liquid water configurations of 64 molecules each.^{60} The dataset was computed using the CP2K software^{61} at the revPBE0-D3 level of density functional theory, which is known to give a reasonably good description of the structure and dynamics of water at a variety of pressures and temperatures.^{62}

Table V compares the energy and force errors of the MACE model with those of other machine learning force fields trained on the same dataset but with different train-test splits. The three-body atom-centered symmetry function-based feed-forward neural network model (BPNN) has the highest errors. The three-body invariant message passing model REANN^{63} and the two-body equivariant message passing model NequIP^{16} significantly improve on the errors of the BPNN model. A further improvement is achieved by the many-body equivariant MACE model. Interestingly, a relatively small MACE model using invariant messages with an overall body-order of 13 already achieves lower errors than the best of the other models, and the larger MACE model only slightly improves on the force errors.

|  | BP-NN^{60} | REANN^{63} | NequIP^{16} (L = 2) | MACE 64-0 | MACE 192-2 |
|---|---|---|---|---|---|
| E RMSE (meV/H_{2}O) | 7.0 | 2.4 | ⋯ | 1.9 | 1.9 |
| F RMSE (meV/Å) | 120 | 53.2 | ⋯ | 37.1 | 36.2 |
| E MAE (meV/H_{2}O) | ⋯ | ⋯ | 2.5 | 1.2 | 1.2 |
| F MAE (meV/Å) | ⋯ | ⋯ | 21 | 20.7 | 18.5 |

### A. Thermodynamics and kinetics of liquid water

To characterize the smaller and faster MACE water model, we ran 200 ps NPT simulations at temperatures from 250 to 315 K using a combined Nose-Hoover thermostat and Parrinello-Rahman barostat.^{65,66} At each temperature, the average density was computed after an initial equilibration period. The top left panel of Fig. 5 shows the resulting density isobar, with the characteristic density maximum at around 270 K. This is somewhat lower than the experimental value but is consistent with previous DFT based studies of water.^{60} At 300 K, we also compare the density of the MACE model with the available *ab initio* value from Ref. 64 and find very good agreement.

Finally, we also investigated a dynamical property, the diffusivity of water, which is notoriously difficult to compute accurately. To obtain the diffusivity, we took the water configuration from the NPT simulation that had the equilibrium density of 0.91 g/cm^{3} and ran three independent 200 ps long NVT simulations. The mean squared displacements (MSDs) from these simulations are shown in the right panel of Fig. 5. By fitting a linear function to the diffusive part of the MSD, we obtained a diffusion coefficient of (2.14 ± 0.14) × 10^{−9} m^{2}/s, which is in reasonably good agreement with the *ab initio* value estimated from much smaller simulations in Ref. 62.
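The Einstein-relation procedure described above amounts to fitting the slope of the MSD versus time and dividing by 6 (in three dimensions). The sketch below illustrates this on a synthetic random walk rather than the actual water trajectories:

```python
import numpy as np

def diffusion_coefficient(positions, dt_ps, fit_start=0.5):
    """Estimate D from the Einstein relation, D = slope(MSD vs t) / 6 in 3D.

    positions: (n_steps, n_atoms, 3) unwrapped trajectory; dt_ps: frame spacing in ps.
    Only the tail of the MSD (after the `fit_start` fraction) enters the linear fit,
    skipping the short-time non-diffusive regime. Returns D in the squared length
    unit of `positions` per ps (for Å, multiply by 1e-8 to get m^2/s).
    """
    disp = positions - positions[0]
    msd = (disp ** 2).sum(axis=-1).mean(axis=-1)  # average over atoms
    t = np.arange(len(msd)) * dt_ps
    i0 = int(fit_start * len(msd))
    slope, _ = np.polyfit(t[i0:], msd[i0:], 1)
    return slope / 6.0

# Sanity check on a synthetic Brownian trajectory with known step variance:
# MSD grows as 3*sigma^2 per step, so D should come out near 0.03/6 = 0.005.
rng = np.random.default_rng(0)
steps = rng.normal(scale=0.1, size=(5000, 200, 3))  # hypothetical random-walk steps
D = diffusion_coefficient(np.cumsum(steps, axis=0), dt_ps=1.0)
print(D)
```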

## X. THE QM9 BENCHMARK

We now test the MACE architecture on tasks that are not directly related to force field fitting, using the well established QM9 dataset.^{67} This benchmark is one of the oldest used to validate machine learning models for chemistry, and most published machine learning architectures have reported results on it. Table VI collects the results of these models on 12 different tasks. Notably, MACE achieves state-of-the-art results on three of the four energy related tasks (*G*, *H*, *U*, and *U*_{0}) and also improves the state of the art on one further non-energy related task.

| Model | Gap (meV) | HOMO (meV) | LUMO (meV) | C_{V} (cal/mol K) | μ (D) | ZPVE (meV) | R^{2} (a_{0}^{2}) | α (a_{0}^{3}) | G (meV) | H (meV) | U (meV) | U_{0} (meV) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NMP^{68} | 69 | 43 | 38 | 0.040 | 0.030 | 1.50 | 0.180 | 0.092 | 19 | 17 | 20 | 20 |
| SchNet^{13} | 63 | 41 | 34 | 0.033 | 0.033 | 1.70 | 0.073 | 0.235 | 14 | 14 | 19 | 14 |
| Cormorant^{21} | 61 | 34 | 38 | 0.026 | 0.038 | 2.03 | 0.961 | 0.085 | 20 | 21 | 21 | 22 |
| LieConv^{69} | 49 | 30 | 25 | 0.038 | 0.032 | 2.28 | 0.800 | 0.084 | 22 | 24 | 19 | 19 |
| DimeNet++^{14} | 33 | 25 | 20 | 0.023 | 0.030 | 1.21 | 0.331 | 0.044 | 7.6 | 6.5 | 6.3 | 6.3 |
| EGNN^{70} | 48 | 29 | 25 | 0.031 | 0.029 | 1.55 | 0.106 | 0.071 | 12 | 12 | 12 | 11 |
| PaiNN^{15} | 46 | 28 | 20 | 0.024 | 0.012 | 1.28 | 0.066 | 0.045 | 7.4 | 6.0 | 5.8 | 5.9 |
| TorchMD-NET^{17} | 36 | 20 | 18 | 0.026 | 0.011 | 1.84 | 0.033 | 0.059 | 7.6 | 6.2 | 6.4 | 6.2 |
| SphereNet^{71} | 32 | 23 | 18 | 0.022 | 0.026 | 1.12 | 0.292 | 0.046 | 7.8 | 6.3 | 6.4 | 6.3 |
| SEGNN^{72} | 42 | 24 | 21 | 0.031 | 0.023 | 1.62 | 0.660 | 0.060 | 15 | 16 | 13 | 15 |
| EQGAT^{73} | 32 | 20 | 16 | 0.024 | 0.011 | 2.00 | 0.382 | 0.053 | 23 | 24 | 25 | 25 |
| Equiformer^{74} | 30 | 15 | 14 | 0.023 | 0.011 | 1.26 | 0.251 | 0.046 | 7.6 | 6.6 | 6.7 | 6.6 |
| MGCN^{75} | 64 | 42 | 57 | 0.038 | 0.056 | 1.12 | 0.110 | 0.030 | 15 | 16 | 14 | 13 |
| Allegro^{30} | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | 5.7 | 4.4 | 4.4 | 4.7 |
| NoisyNodes^{76} | 29 | 20 | 19 | 0.025 | 0.025 | 1.16 | 0.700 | 0.052 | 8.3 | 7.4 | 7.6 | 7.3 |
| GNS-TAT+NN^{77} | 26 | 17 | 17 | 0.022 | 0.021 | 1.08 | 0.65 | 0.047 | 7.4 | 6.4 | 6.4 | 6.4 |
| Wigner Kernels^{78} | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | 4.3 |
| TensorNet^{41} | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | 6.0 | 4.3 | 4.3 | 4.3 |
| MACE | 42 | 22 | 19 | 0.021 | 0.015 | 1.23 | 0.210 | 0.038 | 5.5 | 4.7 | 4.1 | 4.1 |


To achieve the best results on intensive properties, such as *C*_{V}, we modified the readout function of the MACE model. This was necessary because the simple sum pooling of the node outputs used in the MACE force fields yields a size-extensive overall prediction, which is appropriate for energies but not for intensive properties. The precise form of the intensive global non-linear readout function used is described in the supplementary material.
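To see why sum pooling is unsuitable for intensive targets, note that a sum over atoms doubles when a molecule is duplicated, while mean and standard deviation pooling do not. The following simplified sketch (which omits the attention step described in the supplementary material) illustrates this invariance:

```python
import numpy as np

def intensive_pooling(node_outputs):
    """Pool per-atom outputs into size-intensive global features.

    Sum pooling alone is size-extensive (it doubles when the system is duplicated);
    mean and standard deviation pooling give intensive descriptors instead.
    """
    x = np.asarray(node_outputs, dtype=float)
    return np.array([x.mean(), x.std()])

mol = [0.2, 0.5, 0.3]
doubled = mol * 2  # the same molecule duplicated
# Intensive pooling is unchanged under duplication, unlike a raw sum
print(intensive_pooling(mol), intensive_pooling(doubled))
print(sum(mol), sum(doubled))
```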

### A. Learning curve on potential energies

Finally, it is interesting to examine the potential energy (U_{0}) models in more detail. In the left panel of Fig. 6, the learning curves of four different MACE models are compared. The models shown in blue use MACE hyperparameters similar to the ones used throughout the paper (256-2). The models shown in red are significantly larger versions of MACE that use a much larger MLP in the learnable radial basis [see Eq. (4)], which has been shown to help on QM9 for other models such as Allegro.^{30} This extra flexibility results in higher accuracy, especially in the very high data regime. A further interesting point is the comparison of the models trained using energies only (the usual practice for QM9, denoted by solid lines) with the models whose loss function also included forces, exploiting the knowledge that QM9 is made up of equilibrium geometries with zero forces on all the atoms. This extra information (available without any new QM calculations) increases the accuracy of both the small and large MACE models.

In the right panel of Fig. 6, we compare the learning curve of the best MACE model to those of several other strong models. The kernel models FCHL^{79} and SOAP,^{8} the linear NICE model,^{11} the feed-forward neural network GMsNN,^{80} and the message passing SchNet^{13} and PhysNet^{81} models all achieve comparable errors. The models that surpass these significantly are Allegro,^{30} Wigner kernels,^{78} and MACE, all of which use higher body-order features. This strongly suggests that high body-order is a crucial property of the most successful atomistic ML models.

## XI. CONCLUSION

In this paper, we have demonstrated the power of the MACE architecture for a wide variety of chemical tasks and systems. We have investigated the effect of the locality assumption on large molecular systems and demonstrated that a two-layer MACE model with a 5 Å cutoff on each layer is able to improve on the state-of-the-art global model in accuracy by up to a factor of 10. We also showed that the molecular vibrational spectrum can be accurately reproduced using the local MACE model. We have also demonstrated the transferability of the MACE model by fitting it to just 10% of the entire ANI-1x dataset. MACE improved on the state of the art on the COMP6 comprehensive organic molecule benchmark by factors of 5 and 7 for energies and forces, respectively, compared to the ANI model, and by a factor of 3–4 compared to the equivariant message passing NewtonNet model. We showed that the models can be transfer learned to the coupled-cluster level of theory and used a CC-transfer learned MACE model to compute the vibrational spectrum of ethanol. Furthermore, we demonstrated excellent data efficiency and extrapolation by computing the coupled cluster level vibrational spectrum of ethanol using a MACE model trained on just 50 reference calculations, without the need for iterative training. We have also shown excellent accuracy in condensed phase systems such as diverse phases of carbon and liquid water. Our results on the M3GNet dataset show that MACE reaches excellent accuracy and transferability on large and chemically diverse datasets, paving the way for new state-of-the-art universal force fields for materials. Finally, we also demonstrated that MACE improves on the state of the art for many properties of the standard QM9 quantum chemistry benchmark.

Overall, these results set a new standard for machine learning force fields in chemistry and demonstrate that the MACE architecture is capable of describing a wide variety of systems with little to no modification of the model or training hyperparameters. This could ultimately lead to force fields with unprecedented accuracy, created with ease, without significant user input or requirements for expertise.

## SUPPLEMENTARY MATERIAL

The supplementary material contains all torsion drives for the biaryl set molecules.

## ACKNOWLEDGMENTS

D.P.K. acknowledges support from AstraZeneca and the EPSRC. We used the computational resources of the UK HPC service ARCHER2 via the UKCP consortium and were funded by EPSRC under Grant No. EP/P022596/1. This work was also performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3). We would like to acknowledge Michele Ceriotti for providing the raw data for the QM9 learning curves. We would also like to acknowledge Joshua Horton for helping with the implementation of MACE into the QCEngine driver for the torsional scans. We would like to acknowledge Mario Geiger for the implementation of MACE in JAX. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any author accepted manuscript version arising from this submission.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Dávid Péter Kovács**: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). **Ilyes Batatia**: Investigation (equal); Methodology (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). **Eszter Sára Arany**: Investigation (equal); Visualization (equal); Writing – review & editing (equal). **Gábor Csányi**: Conceptualization (equal); Funding acquisition (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available within the article and its supplementary material. The MACE code is freely available on https://github.com/ACEsuit/mace and example input scripts and pre-trained models to reproduce the results are provided in the MACE code documentation website: https://mace-docs.readthedocs.io/en/latest/examples/training_examples.html. The datasets used are all published elsewhere and referenced in the text. The JAX version of MACE is freely available at https://github.com/ACEsuit/mace-jax.

### APPENDIX A: EFFECT OF THE LOSS SCHEDULER

### APPENDIX B: OLD LOSS FUNCTION

### APPENDIX C: ELEMENT DEPENDENT RADIAL

Here, *h*_{i,k00} and *h*_{j,k00} correspond to the scalar part of the node features of atoms *i* and *j*.

### APPENDIX D: COMPUTATIONAL DETAILS

All models were trained using NVIDIA A100 GPUs. The training time varied between the datasets; for small molecules, it was between 1 and 5 h; for the larger datasets, it depended on the model size; and in the case of the ANI-1x and QM9 datasets, the largest models took 3–4 days. We stopped the training once the validation loss stopped improving.

All experiments were carried out using the MACE code available at https://github.com/ACEsuit/mace. Example input files for the experiments can be found on the Examples page of the MACE documentation at: https://mace-docs.readthedocs.io.

#### 1. Common settings for all MACE force field models

All MACE force field models fitted in Secs. III–IX shared most of their core training and model settings. They all used two MACE layers (except where explicitly stated for MD22, where we compared to a single layer model), *l*_{max} = 3 in the spherical harmonic expansion, and eight radial Bessel features with a polynomial envelope for the cutoff with *p* = 5. The radial features are then fed into an MLP of size [64, 64, 64] with SiLU non-linearities. All models used correlation order *N* = 3 in the MACE layer (four-body messages). The readout function of the first layer was a simple linear transform, whereas for the second layer, it was a single-layer MLP with 16 hidden dimensions. Models were trained with the AMSGrad variant of Adam with default parameters *β*_{1} = 0.9, *β*_{2} = 0.999, and *ϵ* = 1 × 10^{−8}. We used a learning rate of 0.01 and an on-plateau scheduler that reduced the learning rate when the validation loss did not improve for a certain number of epochs. We used an exponential moving average with a weight of 0.99 when updating the weights, both for the validation set evaluations and for the final models. The loss function used is the one given in Eq. (B1).
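The radial features described here follow the standard Bessel-plus-polynomial-envelope construction. The sketch below reproduces that functional form with *p* = 5 and eight Bessel functions; the exact MACE implementation details may differ:

```python
import numpy as np

def bessel_basis(r, r_cut=5.0, n_max=8, p=5):
    """Radial Bessel features with a polynomial cutoff envelope (p = 5).

    e_n(r) = sqrt(2/r_cut) * sin(n*pi*r/r_cut) / r, smoothly zeroed at r_cut by a
    polynomial envelope whose value and first derivatives vanish at the cutoff.
    r: array of interatomic distances (must be > 0 and is zeroed beyond r_cut).
    """
    r = np.asarray(r, dtype=float)
    x = r / r_cut
    envelope = (1.0
                - (p + 1) * (p + 2) / 2.0 * x**p
                + p * (p + 2) * x**(p + 1)
                - p * (p + 1) / 2.0 * x**(p + 2))
    envelope = np.where(x < 1.0, envelope, 0.0)
    n = np.arange(1, n_max + 1)
    bessel = np.sqrt(2.0 / r_cut) * np.sin(np.outer(r, n) * np.pi / r_cut) / r[:, None]
    return bessel * envelope[:, None]

# Eight features per distance; they vanish smoothly as r approaches the cutoff
feats = bessel_basis(np.array([1.0, 2.5, 4.999]))
print(feats.shape)
```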

#### 2. MD22 models

The three MACE models fitted to the MD22 dataset all used 256 uncoupled feature channels. The two layer models used *L*_{max} = 2 for the messages. 95% of the training sample was used for training, and a random 5% subset was used for validation. The energy was normalized by shifting the mean of the potential energies. The initial weights in the loss function were: *λ*_{E} = 10, *λ*_{F} = 1000, followed by a switch to *λ*_{E} = 1000, *λ*_{F} = 10, with the learning rate being reduced to 0.001.

#### 3. COMP6 models

The three models all used a cutoff radius of *r*_{cut} = 5 Å. The number of uncoupled feature channels was 64, 96, and 192, with the message equivariance order *L*_{max} also increasing from 0 (invariant) to 1 and 2 for the three models, respectively. The number of parameters for the three models was 196 944, 517 872, and 2 229 456, respectively. We used loss weights *λ*_{E} = 40 and *λ*_{F} = 1000, followed by *λ*_{E} = 1000 and *λ*_{F} = 10 in the second phase of the training. For normalization of the energy, we used the atomization energy, meaning that we applied a fixed shift to each molecule using the isolated atom energies computed with DFT, which were published as part of the original dataset.
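The atomization energy normalization amounts to subtracting the sum of isolated-atom reference energies from each molecular energy. A minimal sketch, with hypothetical placeholder reference values rather than the published ones:

```python
def atomization_energy(total_energy, symbols, isolated_atom_energies):
    """Shift a molecular energy by the sum of isolated-atom reference energies.

    isolated_atom_energies maps an element symbol to its isolated-atom energy;
    the values used below are hypothetical placeholders, not the published ones.
    """
    return total_energy - sum(isolated_atom_energies[s] for s in symbols)

# Hypothetical example: a water-like molecule with made-up reference energies (eV)
e0 = {"H": -13.6, "O": -432.1}
print(atomization_energy(-470.0, ["H", "H", "O"], e0))
```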

#### 4. revMD17 models

The models used here were identical to the ones published in Ref. 18.

#### 5. Amorphous carbon model

The carbon MACE model was trained on the carbon dataset published in Ref. 50. We created a training, validation, and test split by selecting 10% of each configuration type for the test set and then randomly selecting 10% of the remaining training set for validation. We used a MACE model with two layers, 256 feature channels, *l*_{max} = 3 spherical harmonics, and *L*_{max} = 2 messages, with a cutoff of *r*_{cut} = 6.0 Å. The standard weighted energy-force loss was used, with weights of *λ*_{E} = 80 and *λ*_{F} = 1000 for the first 225 epochs and *λ*_{E} = 1000 and *λ*_{F} = 100 for the last 75 epochs. The batch size was 5. A weight decay of 5 × 10^{−7} was applied to selected weights, as in Ref. 18.

#### 6. HME21

The MACE model was trained using the published train-test split. The training data were reshuffled after each epoch. We used 256 uncoupled feature channels, *l*_{max} = 3, and equivariant messages with *L*_{max} = 2. We used a cutoff of 6 Å. The standard weighted energy and force loss was used, with a weight of 1 on energies and a weight of 10 on forces.

#### 7. M3GNet

The MACE model was trained using the published train-test split. The training data was reshuffled after each epoch. We used 128 uncoupled feature channels, *l*_{max} = 3, and equivariant messages, *L*_{max} = 2. We used a cutoff of 5 Å. We use the Huber loss with *δ* = 0.01 and loss weights of 1 on forces, 1 on energies, and 0.1 on stresses.

#### 8. Liquid water model

The two models both used a cutoff of *r*_{cut} = 6 Å. The small model used 64 uncoupled channels and invariant messages (*L*_{max} = 0), whereas the larger model used 192 channels and *L*_{max} = 2. The weights used for the training were *λ*_{E} = 10, *λ*_{F} = 1000, followed by *λ*_{E} = 1000, *λ*_{F} = 10 in the second phase of the training.

#### 9. QM9 models

For the *U*_{0} property, we ran experiments with a standard MACE model identical to the one used for the MD22 tests, meaning 256 uncoupled channels and *L*_{max} = 2. We found that a slightly modified version of MACE with a much larger radial MLP achieves superior errors compared to the standard one. This model used a radial MLP of size [128, 256, 512, 1024]. We also reduced the learning rate to 0.001, increased the width of the readout MLP from 16 to 128, and increased the exponential moving average exponent to 0.999. Furthermore, for the intensive properties (gap, HOMO, LUMO, *μ*, *C*_{V}, and ZPVE), we changed the readout function from sum pooling, which yields a size-extensive output as desired for energy-like quantities, to a non-linear pooling operation that applies sum, mean, and standard deviation pooling, followed by an attention mechanism to form the final output. We found that such a readout function led to up to a twofold improvement compared to the simple linear extensive pooling operation.

### APPENDIX E: EFFECT OF LOCALITY ON MOLECULAR DYNAMICS SIMULATIONS

#### 1. Tetrapeptide

In Fig. 8, we show that the vibrational spectra of the short-range and longer-range MACE models agree remarkably well for the tetrapeptide system of MD22. This demonstrates that the locality assumption can be valid and that short-range models can be used to compute accurate vibrational spectra even for larger systems.

### APPENDIX F: COMP6: ANI-MD MOLECULES

Figure 9 shows that 12 of the 14 molecules have very low errors scattered randomly around the DFT value. The two larger peptides have a systematic shift compared to the ground truth value.

### APPENDIX G: VIBRATIONAL SPECTRUM FROM 50 QM CALCULATIONS

## REFERENCES
