We assess Gaussian process (GP) regression as a technique to model interatomic forces in metal nanoclusters by analyzing the performance of 2-body, 3-body, and many-body kernel functions on a set of 19-atom Ni cluster structures. We find that 2-body GP kernels fail to provide faithful force estimates, despite succeeding in bulk Ni systems. However, both 3- and many-body kernels predict forces within an ∼0.1 eV/Å average error even for small training datasets and achieve high accuracy even on out-of-sample, high temperature structures. While training and testing on the same structure always provide satisfactory accuracy, cross-testing on dissimilar structures leads to higher prediction errors, posing an extrapolation problem. This can be cured using heterogeneous training on databases that contain more than one structure, which results in a good trade-off between versatility and overall accuracy. Starting from a 3-body kernel trained this way, we build an efficient non-parametric 3-body force field that allows accurate prediction of structural properties at finite temperatures, following a newly developed scheme [A. Glielmo *et al.*, Phys. Rev. B **95**, 214302 (2017)]. We use this to assess the thermal stability of Ni_{19} nanoclusters at a fractional cost of full *ab initio* calculations.

## I. INTRODUCTION

Metallic nanoparticles have fascinating chemo-physical properties, different from those of their individual atomic constituents and their bulk counterparts.^{1–5} Because of the variety of isomers accessible at finite temperatures and the lack of translational symmetry, implying a non-trivial interplay between their geometric and electronic structure, a comprehensive understanding of metallic nanoclusters remains challenging, despite their potential use in many advanced applications.^{6–16} Density Functional Theory (DFT) is the most common framework to investigate the static and dynamical properties of nanoclusters of few tens of atoms, for which the standard classical force fields cannot systematically be relied upon to provide sufficient accuracy. However, DFT-based calculations are very expensive, and probing limited time scales in first principles dynamical simulations may lead to poor sampling of the nanoclusters’ configuration space.

Machine Learning Force Fields (ML-FFs) may provide a solution to this problem, and the needed access to the dynamical properties of nanoclusters, by extending by several orders of magnitude the accessible time scale, while still describing sufficiently accurately the interactions between the cluster atoms. The ML-FFs of most widespread use in cluster science are based on artificial neural networks.^{17} In many studies, aiming at investigating nanoclusters’ properties, the training databases were constructed from clusters of several sizes, involving structures based on different Bravais lattices and surfaces with different crystallographic orientations.^{18–23} The *ex novo* production of such databases requires many expensive quantum calculations: while some redundancy is hard to avoid, the neural network architecture, by means of multiple layers and a high number of fitted parameters, is usually able to extract the necessary information and correctly predict energies and forces in specific scenarios. The resulting trained force-field, although versatile, can be, however, difficult to interpret because of the inherent complexity of the algorithm.

Another commonly used class of ML-FFs is based on Gaussian Process (GP) regression^{24,25} and has recently been used to predict the properties of both bulk^{26,27} and molecular^{28–30} systems. While GPs have been also applied to predict adsorption energies of small molecules on NiGa and RhAu nanoclusters,^{31,32} they have never been used so far to estimate the finite-temperature structural properties of a nanoparticle. GPs are usually easier to interpret as they contain a small number of physically meaningful hyperparameters. Moreover, including symmetries such as the translational invariance and rotational covariance of forces^{25} or choosing simplifying approximations such as selecting the *n*-body order of interaction between atoms^{27,33–35} in the algorithm is straightforward in the case of GPs, where these properties can be encoded the kernel function, enabling fast training and high prediction accuracy.

In this work, we systematically asses the GP force estimates for a set of Ni_{19} nanocluster structures, as obtained from 2-, 3-, and many-body kernels, and for a number of training databases of different nature and size. Consistent with the significant change of physical properties occurring during the atom-to-bulk transition, we find that a 2-body kernel generally fails to correctly predict forces in small-sized Ni nanoclusters, despite doing so for bulk Ni systems. However, a 3-body kernel performs well, hinting to an increased significance of angular terms in the bonding.

The choice of training dataset is key to the performance of any ML-FF. The search for an optimal training dataset which may encompass all the relevant structures while avoiding redundancy is therefore of interest to guarantee accuracy, while limiting the need to produce *ad hoc ab initio* databases. Consistent with intuition, the GP accuracy decreases as the structural dissimilarity increases between the training and testing morphologies, and the best accuracy is found when using homogeneous training databases highly similar to the target testing structure. However, heterogeneous training databases provide a just slightly less good overall prediction performance, while the trained kernels display a much higher degree of versatility, predicting accurate forces also in out-of-sample tests. Furthermore, 3-body ML-FFs trained on an heterogeneous database accurately reproduce structural fingerprints such as pair distance distribution functions at finite temperatures. Using this ML-FF to derive a non-parametric machine-learning mapped force field (M-FF) via the procedure discussed in Ref. 33 makes it possible to execute tens of ns long simulations with a minimal computational effort, allowing the assessment of the thermal stability of Ni_{19} nanoclusters.

In Sec. II, we introduce the necessary GP formalism (Sec. II A), provide expressions for the kernels used throughout the work (Sec. II B), and briefly explain the concept of “mapped” force field^{33} (Sec. II C). We then describe a protocol for the validation of force predictions (Sec. III A) and discuss the performance achieved by the three kernels when tested on structures either very similar or morphologically different from the ones present in the training database for single-structure (Sec. III B) and multi-structure (Sec. III C) training. The construction of the 3-body M-FF and its validation are described (Sec. III D), while its application in MD simulations investigating the behaviour of Ni_{19} clusters in the 300-1200 K temperature range is described in Sec. III E.

## II. MACHINE LEARNING FORCE FIELDS

### A. Gaussian process regression

A GP regression^{36} is a Bayesian method to learn a function from a finite database $D$ of input-output pairs. As we are interested in learning the local force acting on any given atom, we construct such a training database by extracting (from a DFT simulation) a set of local configurations *ρ*_{i} relative to each atom and the corresponding forces **f**_{i} on that atom. This database $D={(\rho i,fi)}i=1N$ is then partitioned into a training set $Dtr$ (with *N*_{tr} entries) used for learning and a test set $Dtest$ (with *N*_{test} entries) used for validation. As the space of forces is three-dimensional, we here use the multi-output (vectorial) version of GP regression,^{25,37,38} for which the learned function **f**(*ρ*) takes the form

where **K** is a matrix-valued kernel function encoding the correlation of the forces relative to any two atomic environments.

The coefficients *α*_{i} in Eq. (1) can be written in closed form as

where $K$ is the covariance matrix containing *N*_{tr} × *N*_{tr} block entries $Kdd\u2032=K(\rho d,\rho d\u2032)$, $I$ is the identity matrix, and *λ* is a regularization hyperparameter that formally governs the uncertainty associated with the training dataset outputs, which has been kept fixed at a value of 10^{−5}. The performance of a GP is determined by the choice of the kernel function **K** and its hyperparameters and by the choice of training set $Dtr$.

### B. Kernel functions

The kernel function should be invariant with respect to translation and permutation of identical atoms and covariant (when predicting forces) with respect to rotation of the configurations. Furthermore, the function **K** should have a spatial resolution compatible with the features of the energy landscape encoded in the training dataset; this is taken care of by optimizing the kernel hyperparameters. A useful property of a kernel function for force or energy prediction is its *order*, that is, the maximum number of simultaneously interacting particles it can describe (see Ref. 33 for a formal definition). Here we will use 2-body, 3-body, and many-body force kernels.^{25,27}

#### 1. 2-body

We assume that the force **f** acting on an atom located at position **r**_{a} is a sum of independent contributions associated with every other atom in its local environment *ρ*_{a}. Each configuration is expressed as a sum of Gaussian functions of width *σ* representing individual atoms (“SOAP representation”^{39}). A natural way to obtain a rotation invariant scalar energy kernel would be via integration over the group of rotations of the space-integrated overlap of each couple of configurations.^{39} The covariance of predicted forces, a general property not requiring the existence of an underlying invariant total energy function, can also be obtained as an integral over rotations from a suitable matrix-valued covariant integration expression, as detailed in Ref. 25. For the 2-body kernel case, the integration can be carried out analytically, yielding the following matrix-valued energy-conserving kernel:^{25}

where **r**_{ai} expresses the position relative to the central atom of its *i*th neighbour.

#### 2. 3-body

A 3-body kernel allows us to represent an angular dependence on the force components. As described in detail in Refs. 33 and 40, a 3-body force kernel can be built as a double derivative of a 3-body energy kernel with respect to the positions of the central atoms of the configurations *ρ*_{a} and *ρ*_{b},

The 3-body energy kernel $k3s$ compares triplets of atoms that include the central atom across the two configurations. This kernel is intrinsically invariant under permutation, rotation, and translation of the atoms in *ρ*_{a} and *ρ*_{b}, avoiding the need of any integration over SO(3). Each triplet is associated with a vector **q**_{aij} containing the three atomic distances, i.e., $qaij=(rai,raj,rij)T$. Apart from a normalisation factor, the 3-body kernel reads

where $Pc$ ($|Pc|=3$) is the set of cyclic permutations of three elements and *σ* is the single required length scale hyperparameter. Summing over the permutation group is needed to guarantee permutation symmetry of the energy. No *i* ≠ *j* or *k* ≠ *l* restriction is, however, imposed in the external sum, making the overall expression not limited to the case of three distinct atoms so that the kernel also includes the 2-body case as a subset. We note that permutation invariance in 3-body kernels could also be obtained using permutation invariant descriptors, as done in Ref. 27.

#### 3. Many-body

Describing arbitrarily complex interactions requires a many-body kernel function such that force prediction becomes dependent on the full local atomic environment *ρ*_{a} and is no more the result of summing independent pairwise (or triplet) contributions. A way to obtain a many-body kernel *k*_{MB} (see Ref. 33) is to take the exponential of a scalar 2-body kernel *k*_{2},

where

To impose rotational force covariance, we should perform an integration over the *SO*(3) manifold of rotations.^{25} Unfortunately the integration over SO(3) of the many-body kernel in Eq. (6) cannot be done analytically, while numeric integration is computationally heavy. We hence resort to restricting the summation to a discrete symmetry group *R* of rotations (and reflections) whose elements $R$ are represented by orthogonal matrices **R**,

The optimal choice of the rotation group is system-dependent: in FCC and BCC bulk environments, a natural choice is to sum over all elements of the O_{h} point group. The resulting many-body kernel can be made arbitrarily accurate if given a large enough training set,^{33,41} while the predicted force field will not be conservative (make zero work on closed loops to numerical accuracy) by construction. However, to the extent that force errors are small, the energy-conserving character of the reference Hamiltonian forces will be approximately reproduced.

### C. Mapped force field (M-FF)

Once the 3-body GP has been trained, the “mapping” technique described in Ref. 33 can be used to build a non-parametric 3-body force field (a M-FF) which retains the accuracy of the original GP, while typically increasing its computational speed by a factor 10^{3}–10^{4}. This procedure is effectively equivalent to storing the energies predicted by the kernel (5) for a three-dimensional grid of values of the triplet of distances (*r*_{ai}, *r*_{aj}, *r*_{ij}) occurring in a three atom system. In a more complex structure, the contributions from every triplet and atom pair are calculated by spline interpolation over the stored GP predictions of the energy values. Analytic differentiation of the spline expression produces an energy conserving force field practically indistinguishable from the predictions of the 3-body GP used to build it, while independent of the number of configurations *N*_{tr} used for GP training. This M-FF could be seen as a classical *n*-body force field, as simple to physically interpret and fast to compute as a standard parametrised 3-body force field, whose systematic non-parametric construction requires no *ad hoc* parameter choice and fine-tuning. This enables simulation times which would not be achievable by the standard direct GP force prediction or by first-principles molecular dynamics based on the reference DFT Hamiltonian.

## III. RESULTS

We consider Ni nanoclusters of 19 atoms and gather data from Born-Oppenheimer molecular dynamics (BOMD) simulations at 300 K, 600 K, and 900 K, from five different initial structures, represented in Fig. 1: a hcp motif of three layers (3HCP), a double icosahedron (DIH), a bipyramid (BIP), a four stacked hcp layer (4HCP) structure, and a structure obtained by displacing two five-fold arranged vertexes of a double icosahedron to form two six-fold rings, also known as a “rosette” defect^{42} (dDIH). The first three motifs are the most energetically favourable at the PBE (Perdew-Burke-Ernzerhof) exchange correlation DFT level,^{43} and the other two were found with a metadynamics sampling procedure^{13,14} and are included here to introduce low symmetry morphologies in the database set. The BOMD simulation details are provided in the supplementary material.

### A. Validation methodology

It is evident from Eqs. (1) and (2) that the predictions of a GP will depend on how the training dataset $Dtr$ is chosen. To assess the errors incurred by the three kernels while used in interpolation and (putative) extrapolation regimes, we next systematically analyze the GP predictions on test databases containing contributions from all five structures, while training is carried out on different combinations of structures. This procedure allows us to introduce a novel strategy to measure the similarity between cluster geometries, based on the relative GP errors made, while training and testing on two different structures. In our practical implementation, for all the GP trainings, we choose *N*_{test} = 400 for each of our five cluster structures, yielding a total pool of 2000 testing points.

Every test is repeated five times to estimate a standard deviation for the Mean Absolute Error on Forces (MAEFs), defined as the average error done by the GP on the force vector,

where **f** and **f**′ are the reference and predicted forces acting on an atom, respectively, and *c* indicates the Cartesian component. Our tests can be separated into three categories: *self-training*, *cross-training*, and *mixed-training*, depending on which databases were used to build the training sets and which subset of the testing pool is used. In the *self-training*, the configurations used to build the $Dtr$ and $Dtest$ are associated with the same cluster structure. In the *cross-training*, the database from a structure morphology is used for training, and testing occurs on configurations associated with the other four structures. In both of the above, the database is homogeneous; that is, it contains configurations relative to a single morphology. For the *mixed-training*, we build and test all possible heterogeneous training sets $Dtr$ that contain inputs from two, three, and all five morphologies (here as in *self-training*, no data point present in $Dtr$ is allowed to be in $Dtest$).

### B. Self- and cross-training

We first discuss the results for *self-training*. Figure 2 reports a learning curve (MAEF against *N*_{tr}) for the case of a 3HCP cluster structure. The 2-body kernel achieves its maximum accuracy for *N*_{tr} > 50; similarly, the 3-body MAEF decreases with *N*_{tr} until *N*_{tr} > 100 and an accuracy plateau is reached. The accuracy of the many-body kernel, on the other hand, keeps increasing with the number of training set points, as expected for a universal approximator kernel.^{33,41} The learning curves for the other structures show the same qualitative trends (see the supplementary material). Figure 3 shows the converged MAEF achieved by *self-training* GPs for each of the five morphologies when using 2-, 3-, and many-body kernels. The left-hand histogram reveals that modeling the atomic interactions between Ni atoms in terms of a 2-body potential yields a MAEF larger than the target accuracy of 0.15 eV/Å for all morphologies, with higher values for the low symmetry ones (4HCP and dDIH). We note that this is not the case for FCC bulk Ni systems, where 2-body kernels were found to be surprisingly accurate.^{25} The central and right-hand histograms in Fig. 3 reveal that both 3- and many-body kernels achieve a suitably accurate force prediction for all cluster structures if the training dataset used contains 200+ points. For comparison, the calculated MAEF of a state-of-the-art classical parametric potential for Ni^{44} is 0.59 ± 0.39 eV/Å. The relative importance of *n*-body contributions to the forces in the five Ni_{19} cluster morphologies can be appreciated by looking at the accuracy of the *n*-body kernels. For instance, the accuracy of 2-body and 3-body forces is very similar for the 3HCP morphology, indicating that the angular dependence of forces is not crucial in this motif, while it is more significant for the other structures. We note that comparing *n*-body kernel predictions could be more generally used as a way to characterize the nature of the bonding occurring in complex systems such as metal nanoclusters or grain boundaries and to reveal and quantify (dis)similarities between these systems or relative to reference bulk structures.

The MAEFs obtained for *cross-testing* are reported in Fig. 4. In this case, the 2-body kernels MAEFs are consistently larger than 0.15 eV/Å and often twice as large. Comparing the 3- and many-body kernels reveals the accuracy achieved by the 3-body kernel strongly depends on the training database, while the many-body kernel displays more consistent errors over different structures. This could be rationalised by considering that a many-body kernel is capable of learning high-*n* interaction terms whose contributions are effectively sampled in any morphology, even, e.g., in structures in which they are quantitatively less important. These terms help maintaining a good prediction accuracy even on “partially unknown” new morphologies where higher order interactions come more into effect. The 3-body kernel is instead intrinsically restricted to 3-body interactions, and if the reference forces include (say) a 4-body interaction contribution, incorporating this by machine learning based on a lower-dimensional feature space may achieve some success only in *self-training* (interpolation) mode, but will not correctly extrapolate to new structures. This suggests that the accuracy that a 3-body kernel achieves on a target structure is to a significant extent conditional to the presence of database entries representative of that structure in the training database used. Consistently, for this kernel, the training databases comprising the low-symmetry morphologies (4HCP and dDIH) that have the most varied repertoires of atomic environments are those which work best in *cross-testing*. We also note how the *cross-testing* error incurred by training over 4HCP and testing on its higher-symmetry counterpart 3HCP is 0.18 eV/Å, while the reversed test yields a significantly larger 0.26 eV/Å MAEF. The same effect becomes even more apparent by examining the dDIH (low symmetry) and DIH (high symmetry) pair of structures, yielding 0.14 eV/Å and 0.31 eV/Å errors in the direct and reversed tests, respectively.

Further analysis of the GP predictions allows some qualitative understanding of why using different training databases leads to stronger or weaker performances over the available testing sets. We first examine the case of training on each of the five structures and testing on the dDIH structure. Figure 5 is a visual representation of the MAEFs committed at the testing stage by our three kernels after they were trained on the five single-structure training databases. As expected from Fig. 4, the 2-body kernel is associated with large errors for all training sets but the dDIH one—the only one here in the *self-training* regime. In the case of 3-body kernel, training on a 4HCP database yields the best overall *cross-training* performance (while as expected, *self-training* on a dDIH database offers better results). This provides good accuracy on most atoms, falling short only around the rosette defect, a peculiar distortion absent in the 4HCP structure. The MAEFs incurred by training on the DIH database are also very low for the lowermost 5-fold cap of the dDIH cluster. This is expected since these local environments are very similar in these two morphologies. On the other hand, *cross-training* on the BIP and 3HCP datasets fails to predict forces around the icosahedral centres and the rosette defect of the dDIH. These results hold true even in the case of the many-body kernel, for which the DIH is the best performing *cross-training* morphology.

We next compare the pair-distance function (PDF) and the bond angular distribution function (BADF) for the five morphologies as obtained from BOMD simulations at 300 K, reported in Fig. 6. These reveal structural differences between the morphologies. For instance, the PDF peak close to 3.3 Å in the 3HCP, 4HCP, and BIP morphologies is absent in the DIH and dDIH structures. Also, the BADF in the bottom panel displays a broadened distribution for 4HCP and dDIH and much more peaked ones for 3HCP, DIH, and BIP.

As a possible quantitative indicator of how well a PDF “samples” another one, we calculate their Kullback-Leibler (KL) divergence. For a discrete probability distribution, this is calculated as

This (asymmetric) quantity measures the information “lost” when a function *Q* is used to approximate another function *P*, returning a 0 value when *P* = *Q* and increasing positive values as *P* grows dissimilar from *Q*. In the present context, *Q* and *P* are the PDFs associated with the training set and testing set, respectively. Figure 7 contains a normalized scatter plot comparing the KL divergence relative to each ordered pair of PDFs taken from Fig. 6 with the corresponding *cross-training* MAEF incurred by the 2-body kernel (see the supplementary material for details on how these two quantities were normalized). The graph reveals a striking correlation between the two dissimilarity measures, generally highlighting the importance of the presence of database entries which contain pairs of atoms at distance values relevant for the testing dataset. Moreover, since the PDF can be assumed to be an unique structural descriptor in the case of monometallic nanoparticles,^{45–48} the correlation indicates that the 2-body cross-training error is non-trivially linked to properties that go beyond 2-body descriptors. Thus, the KL divergence between PDFs could be used as an *a priori* indicator of extrapolating performance of the 2-body kernel in *cross-training*. Similar tests for the 3-body kernel also display a positive correlation, although of smaller statistical significance (see the supplementary material). The results above suggest that evaluating the KL divergence for other functions than the PDFs could provide more dissimilarity estimators. This could be used to guide the extraction of informed, minimally sized training databases from a “general” database too large to be used in full for GP regression.

### C. Heterogeneous training and training set optimisation

We next examine the *mixed-training* strategy, to learn how a small training database $Dtr$ which still encompasses a sufficiently varied repertoire of atomic distances and bond angle values could be built. For this, we test the accuracy of training on mixed datasets containing entries from two, three, and all five different cluster morphologies. Our results indicate that the MAEF incurred by the 3-body kernel is fairly homogeneous in the various *mixed-training* and testing scenarios, staying the same within a 0.03 eV/Å standard deviation, contrary to what was generally found for *self-training*. This suggests that introducing even a modest amount of variety in the training configuration pool is sufficient to achieve a reasonably complete training of a 3-body kernel, avoiding “local” overfitting causing extrapolation errors. Consistently, the MAEF incurred by a kernel not restricted to just a 3-dimensional feature space and thus much harder to completely train, such as the many-body kernel, is found to have a higher standard deviation (0.05 eV/Å) when trained and tested in the same scenarios. For all kernels, we find that *mixed-training* yields errors comprising between those incurred by *self-training* and *cross-training*, with MAEFs slightly higher than those produced by *self-training* but appreciably smaller than those associated with *cross-training*.

Figure 8 illustrates the performance of the 3-body kernel trained on our “best” single-structure database (4HCP, see Sec. III B), the best choice of two- and three-structure mixed databases, the full 5-structure database (“penta”), and a (“mixed T”) training database containing 1000 DFT configurations extracted from simulations at 300, 600, and 900 K. The results are good for all training scenarios, and notably, the 5-structure “penta” databases achieve the same performance of all the other database choices which needed to be identified as the best restricted ones. To investigate the performance stability of a given, simple database construction recipe, we generated 100 independent “penta” $Dtr$ training sets, each containing 100 randomly chosen configurations for each cluster structure. Training 3-body kernels on the low temperature $Dtr$ of Fig. 8 and testing on a fixed database also comprising configurations from all five structures yield an average MAEF of 0.14 ± 0.07 eV/Å. The small 0.004 eV/Å difference we find between the MAEF incurred by the best-performing low-temperature GP and the average MAEF suggests that the accuracy gain which might be obtained by a “best training set choice” procedure is practically negligible. To further analyze this issue, we performed Metropolis Monte Carlo simulations to optimize the dataset training points and again found no significant accuracy gain (see the supplementary material). An overall better performance was instead achieved when using the “mixed T” database which included higher T configurations for all cluster morphologies.

### D. Building and validating a 3-body M-FF

To perform computationally inexpensive MD simulations with near-*ab initio* accuracy, we map the ML-FF corresponding to the two best performing 3-body kernels (“penta” and “mixed T”) onto two non-parametric M-FFs, following the procedure described in Sec. III C.

We first assess their accuracy on configurations extracted from 600 K to 900 K BOMD simulations started from 3HCP and DIH initial structures, respectively. Both systems undergo several structural changes along their trajectory. The computed MAEFs for the “penta” M-FF are 0.26 ± 0.24 eV/Å for 3HCP at 600 K and 0.25 ± 0.17 eV/Å for DIH at 900 K, indicating that the M-FF retains an acceptable accuracy level, while visiting configurations not represented in the “penta” training database. (Also note that higher errors should be expected for high-temperature samples, where forces have a larger modulus.) The MAEFs associated with the “mixed T” M-FF is 0.25 ± 0.46 eV/Å for 3HCP at 600 K and 0.17 ± 0.09 eV/Å for DIH at 900 K.

The M-FFs described so far are appropriate for simulating dynamical runs as they contain data gathered from BOMD DFT simulations only. Testing their accuracy on minimized 0 K structures reveals a 0.10 ± 0.02 eV/Å MAEF for the “penta” training set and a 0.06 ± 0.02 eV/Å MAEF for the “mixed T” training set. The inclusion of configurations collected during structural relaxation in the training set reduces the MAEF to 0.04 ± 0.02 eV/Å, while using a many-body kernel with 4000 training points yields a further reduced 0.02 ± 0.01 eV/Å MAEF.

To further test the accuracy of our 3-body “penta” M-FF in a dynamical setting, we run three 200 ps-long 300 K MD simulations (see the supplementary material) for each of the five geometries and compare the PDFs and BADFs with the reference ones extracted from equally long BOMD simulations of the same structures kept at the same temperatures. The excellent overlap obtained for both PDFs and BADFs, visible in Fig. 6, provides some further validation of the ability of the M-FF to track the reference DFT forces. The errors averaged over our five 300 K structures incurred by the M-FFs, while predicting energies are 16 ± 10 meV/atom and 9 ± 7 meV/atom for the “penta” and the more comprehensive“mixed T” training sets, respectively. A scatter plot of the energy error incurred at 900 K by the M-FF trained on the “mixed T” database is provided in the supplementary material.

### E. Assessing the thermal behaviour of Ni_{19}

To probe whether results potentially yielding novel physical insights into the nanocluster behaviour can be obtained by using a M-FF, we finally investigate the kinetic behaviour of Ni_{19} to explore the extent of shape fluctuations occurring in the nanocluster.^{48–52} To do so, we carry out MD runs at 50 K-spaced temperatures comprising between 300 K and 1200 K. We perform four 480-ps long simulations for each temperature for each of the five morphologies considered in this work (380 simulations in total). The computational speed-up factor associated with carrying out a M-FF rather than a BOMD simulation in these systems is ∼10^{5}. The root mean bond fluctuation (RMBF) is a quantity describing the average bond length oscillation at a given temperature, often used to characterize phase changes in nanoscale systems,^{6–11} defined as

where *M* is the number of atoms and the averages are taken over the simulation (excluding the first 5 ps to allow for thermal equilibration) for each atom pair. Figure 9 shows the RMBF value averaged over 20 simulations for each temperature value (four repeated simulations for each of the five structures) as a function of temperature for both the “penta” and the “mixed T” M-FFs. The first M-FF was trained on the low temperature “penta” database and was thus expected to be operating in a largely extrapolatory regime. This force field predicted a “slush” transition region rather than abrupt melting (cf. Fig. 9, green symbols). The second M-FF was trained on the “mixed T” database including configurations at 600 K and 900 K not available in the previous “penta” database and more directly relevant to the morphologies visited by the system along the temperature ramp. MD simulations using this M-FF also predicted a “slush” transition region (cf. Fig. 9, orange symbols), consistent with the earlier result.

In more detail, for temperatures below 700 K, all clusters remain solid for both M-FFs, as indicated by the small (<0.1) RMBF. In this region, the “mixed T” M-FF alone displays a non-zero RMBF, hinting at small geometrical changes for some of the starting morphologies. A RMBF > 0.3, characteristic of nanoliquids,^{53} is observed above 900 K in simulations using the “penta” M-FF and similarly above ∼975 K in simulations using the“mixed T” M-FF. In the intermediate, approximately 700-900 K range, our Ni_{19} system is predicted to be associated with a RMBF intermediate between the nanosolid and nanoliquid regimes. The agreement between the RMBFs of the two M-FFs in Fig. 9 implies that training on low temperature structures is sufficient to predict this qualitative feature of the dynamical behaviour of the present system. The prediction is also consistent with the high probability of geometrical rearrangements, corresponding to slush structures, which have been discussed in detail for Al systems.^{53} Several detailed geometrical interconversion processes are observed during our M-FF simulations, whose in-depth characterization is ongoing and will be provided in future work.

## IV. CONCLUSIONS

We investigated the accuracy of a Gaussian process-based machine learning approach to the prediction of interatomic forces in metallic nanoclusters. In particular, we assessed the ability of different *n*-body kernels to correctly model the interactions between atoms in the Ni_{19} system and probed how the prediction accuracy is affected by ML training carried out on single-structure and multi-structure (heterogeneous) databases. We find that, at variance with the case of bulk Ni, a 2-body kernel is not sufficiently accurate, while a 3-body kernel is able to accurately reproduce the reference DFT forces. Restricting the training databases to configurations derived from a single structure yields excellent interpolation accuracy so that a 3-body kernel can be safely used in a “*self-training*” regime. However, we find that “*cross-training*” the kernel is not equally successful. Using training databases comprising configurations derived from different cluster structures is therefore necessary to enable extrapolation, whenever this is deemed to occur, e.g., by evaluation of a Kullback-Leibler asymmetric indicator.

Our results suggest that mixing configurations from as few as two different structures is sufficient to train a 3-body kernel capable of robust extrapolation and thus accurate force prediction for all the structures considered. This in turn suggests that a 3-body kernel can achieve a very good compromise between representation power (which increases with the *n*-body kernel order) and speed of convergence with respect to database size (which instead decreases with *n*), provided that heterogeneous training databases are used. This result could be particularly useful for practical applications, since the force field predicted by a 3-body GP kernel can be mapped onto an exactly equivalent non-parametric “M-FF” force field.^{33} Such mapping yields a very significant efficiency gain, for all practical purposes aligning the M-FF speed of execution with that of any equivalent (e.g., 3-body) parametrised classical force field, while retaining the accuracy and ease of training of the underlying machine learning scheme. As a simple feasibility test, we investigated the thermal behaviour of Ni_{19} between 300 K and 1200 K. To address the thermal behaviour of clusters, we carried out MD simulations, using a M-FF trained on 300 K structures and a second M-FF trained on 300, 600, and 900 K structures, adding up to an ∼400 ns total simulated time. Both M-FFs predict the occurrence of three distinct physical regimes, with similar estimates for the temperature boundaries separating them so that, in particular, dynamical states of the cluster, intermediate between the solid and liquid phases, are predicted to occur between 700 K and 900 K.

## SUPPLEMENTARY MATERIAL

See supplementary material for details on the BOMD and M-FF simulations, the learning curve graphs for DIH, BIP, 4HCP, and dDIH cluster structures, more information about the Kullback-Liebler divergence method, the complete set of graphs for the MAEFS incurred by the 3- and many-body kernels while *mixed-training*, a description of the Monte Carlo Metropolis simulations performed, and the scatter plot of energy predictions for the “mixed T” M-FF on 1500 configurations extracted from DFT BOMD simulations at 900 K.

## ACKNOWLEDGMENTS

C.Z. and A.G. acknowledge funding by the Engineering and Physical Sciences Research Council (EPSRC) through the Centre for Doctoral Training Cross Disciplinary Approaches to Non-Equilibrium Systems (CANES, Grant No. EP/L015854/1) and by the Office of Naval Research Global (ONRG Award No. N62909-15-1-N079). A.D.V. acknowledges further support by the EPSRC HEmS Grant No. EP/L014742/1 and, together with A.F., by the European Union’s Horizon 2020 research and innovation program (Grant No. 676580, The NOMAD Laboratory, a European Centre of Excellence). We are grateful to the UK Materials and Molecular Modelling Hub for computational resources, which is partially funded by EPSRC (No. EP/P020194/1). F.B. acknowledges financial support from the UK Engineering and Physical Sciences Research Council (EPSRC), under Grant Nos. EP/GO03146/1 and EP/J010812/1. K.R. acknowledges financial support from EPSRC, Grant No. ER/M506357/1, the Thomas Young Centre, the MacDiarmid Institute, and Auckland University. K.R. and F.B. acknowledge the financial support offered by the Royal Society under Project No. RG120207. K.R. also thanks Krista Grace Steenbergen for useful discussions. The authors wish to acknowledge the contribution of NeSI high-performance computing facilities to the results of this research. NZ’s national facilities are provided by the NZ eScience Infrastructure and funded jointly by NeSI’s collaborator institutions and through the Ministry of Business, Innovation and Employment’s Research Infrastructure programme, URL https://www.nesi.org.nz.