The accurate representation of multidimensional potential energy surfaces is a necessary requirement for realistic computer simulations of molecular systems. The continued increase in computer power accompanied by advances in correlated electronic structure methods nowadays enables routine calculations of accurate interaction energies for small systems, which can then be used as references for the development of analytical potential energy functions (PEFs) rigorously derived from many-body (MB) expansions. Building on the accuracy of the MB-pol many-body PEF, we investigate here the performance of permutationally invariant polynomials (PIPs), neural networks, and Gaussian approximation potentials (GAPs) in representing water two-body and three-body interaction energies, denoting the resulting potentials PIP-MB-pol, Behler-Parrinello neural network-MB-pol, and GAP-MB-pol, respectively. Our analysis shows that all three analytical representations exhibit similar levels of accuracy in reproducing both two-body and three-body reference data as well as interaction energies of small water clusters obtained from calculations carried out at the coupled cluster level of theory, the current gold standard for chemical accuracy. These results demonstrate the synergy between interatomic potentials formulated in terms of a many-body expansion, such as MB-pol, that are physically sound and transferable, and machine-learning techniques that provide a flexible framework to approximate the short-range interaction energy terms.

## I. INTRODUCTION

Since the first Monte Carlo (MC)^{1,2} and molecular dynamics (MD)^{3,4} simulations of molecular systems, computer simulations have become a powerful tool for molecular sciences, complementing experimental measurements and often providing insights that are difficult to obtain by other means. Although the first simulations were performed for idealized molecular systems, it was recognized since the beginning that both realism and predictive power of a computer simulation are directly correlated with the accuracy with which the underlying molecular interactions are described.

In this context, computer modeling of water is perhaps the most classic example. Given its role as life’s matrix,^{5} it is not surprising that numerous molecular models of water have been developed (see Refs. 6–9 for recent reviews) since the first simulations performed by Barker and Watts,^{10} and Rahman and Stillinger.^{11} However, despite almost 50 years have passed since these pioneering studies, the development of a molecular model that correctly reproduces the behavior of water from the gas to the condensed phase still represents a formidable challenge.

From a theoretical standpoint, the energy of a system containing *N* water molecules can be formally expressed through the many-body expansion (MBE) of the interaction energy as^{12}

where *V*^{1B}(*i*) = *E*(*i*) − *E*_{eq}(*i*) corresponds to the one-body (1B) energy required to deform an individual water molecule (monomer) from its equilibrium geometry. All higher-order terms *V*^{nB} in Eq. (1) describe *n*-body (nB) interactions defined recursively as

Most popular molecular models of water are pairwise additive [i.e., they truncate Eq. (1) at the two-body (2B) term] and use an effective *V*^{2B} to account for many-body contributions in an empirical fashion.^{13–24} Although in the early times of computer simulations this simplification was a necessity dictated by computational efficiency, the importance of many-body effects in water was already recognized in the 1950s by Frank and Wen who introduced a molecular model of liquid water consisting of “flickering clusters of hydrogen-bonded molecules,” emphasizing the “co-operative nature” of hydrogen bonding.^{25} It also became soon apparent that “pair potentials do not realistically reproduce both gas and condensed phase water properties.”^{26} The first attempts to derive potential energy functions (PEFs) for aqueous systems which could rigorously represent the individual terms of the MBE were made in the late 1970s and 1980s.^{27–32} In particular, Clementi and co-workers developed a series of analytical PEFs for water which were fitted to *ab initio* reference data obtained at the fourth-order Møller-Plesset (MP4) and Hartree-Fock levels of theory for the 2B and three-body (3B) terms, respectively, and represented many-body effects through a classical polarization term.^{31,32} Stillinger and David developed a polarizable model for water in which H^{+} and O^{2−} moieties were considered as the basic dynamical and structural elements.^{33} Building upon these pioneering studies, several polarizable models have been proposed over the years, most notably the Dang-Chang model,^{34} the TTM models,^{35–40} and AMOEBA.^{41,42} The interested reader is referred to Refs. 8 and 9 for recent reviews. Finally, in recent years, also machine learning (ML) potentials have been applied to water,^{43,44} which are able to include high order many body terms in the PEFs in the form of structural descriptions of the atomic environments.

The development of efficient algorithms for correlated electronic structure methods along with continued improvements in computer performance has recently made it possible to evaluate the individual terms of Eq. (1), with chemical accuracy. In parallel, tremendous progress has been made in constructing multidimensional mathematical functions that are capable to reproduce interaction energies in generic N-molecule systems, with high fidelity.^{43,45,46}

By combining these three approaches, it has been realized that the MBE provides a rigorous and efficient framework for the development of full-dimensional PEFs entirely from first principles, in which low-order terms are accurately determined from correlated electronic structure data, e.g., using coupled cluster theory with single, double, and perturbative triple excitations, CCSD(T), in the complete basis set, CBS, limit, the current “gold standard” for chemical accuracy, and higher-order terms are represented by classical many-body induction. Along these lines, several many-body PEFs for water have been proposed in the last decade, the most notable of which are CC-pol,^{47} WHBB,^{48} HBB2-pol,^{49} and MB-pol.^{50–52} When employed in computer simulations that allow for explicit treatment of nuclear quantum effects, these many-body PEFs have been shown to correctly predict structural, thermodynamic, dynamical, and spectroscopic properties of water, from the dimer in the gas phase to liquid water and ice (see Ref. 9 for a recent review).

Among the existing many-body PEFs, MB-pol [permutationally invariant polynomial (PIP)-MB-pol in the present nomenclature] has been shown to correctly predict the properties of water across different phases,^{53} reproducing the vibration-rotation tunneling spectrum of the water dimer,^{50} the energetics, quantum equilibria, and infrared spectra of small clusters,^{51,54–56} the structural, thermodynamic, and dynamical properties of liquid water,^{57,58} including subtle quantum effects such as equilibrium isotope fractionation,^{59} the energetics of the ice phases,^{60} the infrared and Raman spectra of liquid water,^{61,62} the sum-frequency generation spectrum of the air/water interface at ambient conditions,^{63} and the infrared and Raman spectra of ice I_{h}.^{64} It has been shown that the accuracy of PIP-MB-pol in reproducing the properties of water depends primarily on its ability to correctly represent each individual term of the MBE at both short- and long-range.

Briefly, within MB-pol, *V*^{1B} in Eq. (1) is represented by the 1B PEF developed by Partridge and Schwenke,^{65} which reproduces intramolecular distortion with spectroscopic accuracy. *V*^{2B} includes a term describing 2B dispersion, which is derived from the asymptotic expansion of the interaction energy, as well as a term describing electrostatic interactions associated with both permanent and induced molecular moments. At short-range, within the original PIP-MB-pol, *V*^{2B} is supplemented by a 4th-degree permutationally invariant polynomial (PIP)^{45} that smoothly switches to zero as the distance between the two oxygen atoms in the dimer approaches 6.5 Å.^{50} Similarly, *V*^{3B} includes a 3B induction term that is supplemented by a short-range 4th-degree PIP that smoothly switches to zero once the oxygen-oxygen distance between two pairs of water molecules within the trimer approaches 4.5 Å.^{51} All higher-body terms are implicitly represented by classical many-body induction according to a modified Thole-type scheme originally adopted by the TTM4-F water model.^{40} The PIP 2B and 3B terms, which were derived from CCSD(T) calculations carried out in the complete basis set limit for large sets of water dimers and trimers, correct for deficiencies associated with a purely classical description of intermolecular interactions by effectively representing quantum-mechanical interactions that arise from the overlap of the monomer electron densities (e.g., charge transfer and penetration, and Pauli repulsion).

In this study, we investigate the application of Behler-Parrinello neural networks^{66,67} (BPNNs) and Gaussian approximation potentials (GAPs) as alternatives for the original PIP representations of MB-pol short-range 2B and 3B terms. Using the same training, validation, and test sets, two additional (BPNN- and GAP-based) analytical expressions of MB-pol are derived, which effectively exhibit the same accuracy as the original, PIP-based expression. This study provides further evidence for the ability of the MBE in combination with machine learning techniques to serve as a rigorous and efficient route for the development of accurate potential energy functions such as MB-pol in the case of water. The article is organized as follows: In Sec. II, we provide an overview of the computational framework associated with the many-body formalism adopted by MB-pol, while in Sec. III we describe the three different models (PIP-MB-pol, BPNN-MB-pol, and GAP-MB-pol) used to represent water two-body and three-body interactions. The results are presented in Sec. IV, and the conclusions along with an outlook are given in Sec. V.

## II. MB-POL FUNCTIONAL FORM AND COMPUTATIONAL DETAILS

We are employing the MB-pol framework for water, which is based on the MBE of Eq. (1) and contains explicit terms for the 1B, 2B, and 3B terms, in combination with classical N-body polarization that accounts for all higher-body contributions to the interaction energy.^{50,51} In MB-pol, the 2B term is divided into long-range interactions that are well described using classical expressions for electrostatics, induction, and dispersion, and short-range interactions that include complex quantum-mechanical effects due to the overlap of the monomer electron densities,

with

where $VTTM,elec2B$ and $VTTM,ind2B$ are electrostatic and induction energies, respectively, represented by a slightly modified version of the Thole-type TTM4-F model^{40,50,51} and the dispersion energy $Vdisp2B$ is modeled by a *C*_{6} term that is dampened at short range.^{50} Similarly, the 3B term in MB-pol is decomposed into classical 3B induction that captures essentially all of the 3B interaction energy at long range and an expression for the highly complex interactions at short range,

Because corrections to the underlying classical baseline potentials $Vlong2B$ and $Vlong3B=VTTM,ind3B$ are only required at short range, and in order to obtain a smooth, differentiable potential energy surface, MB-pol employs switching functions that smoothly turn off the short-range potentials $Vshort2B$ and $Vshort3B$ once the separation between the oxygen atoms of the water molecules exceeds a preset cutoff.

The MB-pol short-range 2B and 3B potentials^{50,51} are written as

and

where

The switching function was chosen as

where

is a scaled and shifted oxygen-oxygen distance for water molecules *i* and *j*. The MB-pol 2B and 3B cutoff values are $Rlow2B=4.5$ Å, $Rhigh2B=6.5$ Å, $Rlow3B=0.0$ Å, and $Rhigh3B=4.5$ Å.

An accurate description of both the 2B and the 3B short-range interactions requires flexible multi-dimensional functions, for which the original PIP-MB-pol model employs permutationally invariant polynomials^{45,68} (PIPs). In this work, we investigate the performance of alternative machine learning (ML) frameworks to represent these 2B and 3B short-range interactions in water, by comparing PIPs to Behler-Parrinello neural networks (BPNNs) and Gaussian approximation potentials (GAPs) for $VML2B$ and $VML3B$. We employ the original MB-pol switching functions and cutoff values with the PIP and BPNN potentials, while GAP uses slightly different cutoff values and switching functions.^{69} In the context of MBE and neural networks, it should be noted that a neural network representation of the many-body expansion of the interaction energy, truncated at the 3B term, has been reported for methanol.^{70} Neural networks and Gaussian process regression (GPR) have very recently been compared for a highly accurate representation of the potential energy surface of formaldehyde.^{71}

### A. Training sets and reference energies

We employ the original MB-pol 2B and 3B data sets,^{50,51} which sample regions of the 2B and 3B water PES, respectively, that are most relevant for simulations of water at normal to moderate temperature and pressure. As described in Ref. 50, the 2B training set consists of 42508 water dimer structures with center-of-mass separations ranging from 1.6 to 8 Å that include the global dimer minimum geometry, several saddle points, compressed geometries with positive interaction energies, and dimers extracted from path-integral molecular dynamics (PIMD) simulations of liquid water carried out at ambient temperature and pressure with the HBB2-pol PEF.^{49} Similarly, the 3B training set contains 12347 water trimer structures that include the global minimum and trimers extracted from a range of MD and PIMD simulations of small water clusters, liquid water, and water ice phases at varying temperatures and pressures carried out with the HBB2-pol PEF.^{49} Both the 2B and the 3B quantum mechanical (QM) reference energies of these data sets were obtained at the complete basis set (CBS) limit of coupled cluster theory with single, double, and iterative triple excitations, CCSD(T). For details see the original publications.^{50,51} The short-range training set energies $Vshortref$ employed in this work were obtained from the QM reference data by subtracting the MB-pol baseline long-range 2B and 3B potentials $Vlong2B$ and $VTTM,ind3B$, respectively.

The original 2B dataset includes a few dimer structures with extremely high binding energy. Those high energy structures are not only physically unimportant but also sparsely distributed, which can lead to difficulties for machine learning techniques to make effective predictions for structures in this regime because of insufficient information. Therefore, we have retained only configurations with binding energies below 60 kcal/mol in this work. In addition, we have removed all configurations with oxygen-oxygen separations larger than the MB-pol 2B short-range cutoff of 6.5 Å, leading to a total of 42069 configurations in the final 2B training set. By contrast, the trimer dataset with 12347 configurations is fully employed. Each dataset is then randomly divided into three separate sets, training, validation, and test sets with a ratio of 0.81:0.09:0.1. The first two are used during training and for model selection, while the last one is kept completely isolated from the training procedure and is employed for the final evaluation only.

### B. Water cluster test sets

Reference interaction energies of $(H2O)n$ clusters with *n* = 4–6 (see Fig. 4) are based on geometries optimized with MP2 and RI-MP2^{72,73} and were taken from Ref. 58. The energies were obtained using the MBE of the interaction energy^{74} with both 2B and 3B interaction energies computed at the same level as the MB-pol 2B and 3B training sets,^{50,51} that is, effectively at the CBS limit of CCSD(T). All higher order contributions to the interaction energy (>3B) were obtained from explicitly correlated CCSD(T)-F12b^{75} calculations with the VTZ-F12 basis set,^{76} which yields results close to the CBS.

## III. MANY-BODY MODELS

### A. Permutationally invariant polynomials

The permutationally invariant polynomials are functions of the distances between pairs involving both the physical atoms (H and O) and two additional sites L_{1} and L_{2} that are located symmetrically along the directions of the oxygen lone pairs of a water molecule,

where *γ*_{||} and *γ*_{⊥} are fitting parameters and $rOH1,2$ are the O-H bond vectors. Exponential functions of the type $\xi i=e\u2212kdi$ or $\xi i=e\u2212k(di\u2212d(0))$ and Coulomb-type functions $\xi iCoul=e\u2212kdi/di$ are built for the set {*d*_{i}} of these distances and are used as basis for the PIPs. The PIP *V*_{ML,PIP} = *∑*_{l}*c*_{l}*η*_{l} is then constructed from the set {*ξ*_{i}} of these functions, where {*η*_{l}} are symmetrized monomials up to a given degree. The symmetrization is carried out such that the monomials, and hence the PIP, are invariant with respect to the permutations of the water molecules as well as to the permutations of equivalent H and L sites within each molecule. The polynomial coefficients *c*_{l} and the exponential coefficients *k* and distances *d*^{(0)} are linear and non-linear fitting parameters, respectively. For details, see the original publications.^{50,51}

For the 2B PIP, we are using 31 basis functions: 6 exponential functions for all intra-molecular HH and OH pairs with exponential coefficients $kHHintra$ and $kOHintra$; 9 Coulomb-type functions for all inter-molecular HH, OH, and OO pairs with exponential coefficients $kHHinter$, $kOHinter$, and $kOOinter$; 16 exponential functions for all inter-molecular LH, LO, and LL pairs with exponential coefficients $kLHinter$, $kLOinter$, and $kLLinter$. A total of 1153 symmetrized monomials form $VML,PIP2B$: 6 first-degree monomials using only intermolecular *ξ*_{i} variables, 63 second-degree monomials with at most a linear dependence on intramolecular variables, 491 third-degree ones containing at most quadratic intramolecular variables, 593 fourth-degree terms involving only quadratic intramolecular variables, as in the original paper.^{50}

For the 3B PIP, we are using 36 exponential functions for each of the intra- and inter-molecular distances between all real (O and H) atoms with exponential coefficients and distances $kHHintra$, $kOHintra$, $kHHinter$, $kOHinter$, $kOOinter$, $dHHintra,(0)$, $dOHintra,(0)$, $dHHinter,(0)$, $dOHinter,(0)$, and $dOOinter,(0)$. A total of 1163 symmetrized monomials form $VML,PIP3B$: 13 second-degree monomials with only intermolecular exponential variables, 202 third-degree monomials with at most a linear dependence on intramolecular variables, and 948 fourth-degree monomials containing at most a linear dependence on intramolecular variables or intermolecular ones involving oxygen-oxygen and hydrogen-hydrogen distances, as in the original paper.^{51} By construction the PIP method provides analytical forces.

The linear and nonlinear parameters were optimized using a singular value decomposition and the simplex algorithm, respectively, by minimizing the regularized sum of squared errors *χ*^{2} for the corresponding training set *S*, commonly referred to as Tikhonov regularization or ridge regression,^{77}

The regularization parameter Γ was set to 5 × 10^{−4} for 2B and 1 × 10^{−4} for 3B in order to reduce the variation of the linear parameters without spoiling the overall accuracy of the fits. The training time required approximately 48 central processing unit (CPU) core hours for the 2B fit and 24 CPU core hours for the 3B fit using a 24-core Intel Xeon Platinum 8160 (Skylake) CPU.

### B. Behler-Parrinello neural networks

Based on the assumption that the total energy of a system can be written as a sum of atomic energy contributions, a BPNN consists of a set of fully connected feed-forward neural networks, each of which provides an atomic energy.^{66,67} Each atomic network takes as its input a set of atom-centered symmetry functions^{78} that encode the atomic positions and at the same time are invariant with respect to overall rotation and translation, as well as to permutations of like atoms. The invariance of the total energy is assured by enforcing that all atomic networks of the same species are identical, thus having the same structure and weights. As a result, for the water systems considered here, there are two sub-networks, one for all H atoms and the other for all O atoms, which need to be trained simultaneously. Aiming at smoothly disabling the short-range interaction energy contribution at long distances, described in Eqs. (6) and (7), the sum of all atomic energies from the last layer of the sub-networks is multiplied by the switching function to produce a final output for a BPNN. The network weights are determined with respect to the values of the reference short-range interaction energies. From the analytical expression of the energy, the forces can be derived accordingly.

The following modified radial and angular symmetry functions, which lack the cutoff functions of the original BPNN approach, have been chosen for each atom *i*:

resulting in an input vector $Gi={Girad/ang}$ for the atomic network. *θ*_{ijk} denotes the angle enclosed by two interatomic distances *R*_{ij} and *R*_{ik}. Each summation above takes into account only same combination of atomic species and the set of parameters, {(*η*, *R*_{s})} and {(*ζ*, *λ*, *η*′)}, is the same for each type of species grouping. We have removed the cutoff function from the original forms of the symmetry functions used in Refs. 66 and 67 since we apply the MB-pol 2B and 3B switching functions, thus never feeding any structures to the 2B and 3B BPNNs that are beyond the cutoff region.

The dimension of the input vector should reflect a balance between giving an effective resolution of the local environment and the computational cost of training and inference with a large input vector neural network. After carefully examining different parameter sets, we have come up with the final set as follows. For the 2B term, there are 24 radial Gaussian-shape filters, Eq. (13), whose centers *R*_{s} are placed evenly between 0.8 Å and 8 Å, which are relatively close to the smallest and the largest interatomic distances in the training set. For O–O distances, the two smallest centers are excluded because the O–O separation is well beyond the space covered by these two filters. The width of those filters is proportional to their centers’ position, $1/2\eta =0.2Rs$. The angular probe in Eq. (14) takes *ζ* = [1, 4, 16] for different filter widths, *λ* = ±1 for switching the filter’s center between 0 and *π*, and $\eta \u2032=0.001,0.01,0.05$ (Å^{−2}) for various levels of the separation dependence. As for 3B BPNNs, a similar scheme is applied with few adjustments, which include 16 radial filters with centers arranged in the same range, between 0.8 Å and 8 Å, and two levels of separation dependence attached to the angular filter, $\eta \u2032=0.001,0.03$ (Å^{−2}). Moreover, to reduce the redundancy and computational cost, for the angular probe for hydrogen atoms, we consider only two types of triplet of atoms, a hydrogen atom with other two hydrogen atoms or with an oxygen atom and another hydrogen atom. In total, a set of 82 and 84 symmetry functions for O and H is formed for the 2B BPNN while another set of 66 and 56 functions for O and H is used for the 3B BPNN. The complete set of the symmetry function parameters can be found in the supplementary material.

The neural network training encounters various hyperparameters and different techniques for initialization of these parameters, which are mostly found by trial and error. Following is our final network architecture and setup for the network training. The atomic network consists of one input layer, three hidden layers, and one single output layer. The input layer takes as its input the preprocessed symmetry functions, each of which is obtained by rescaling the symmetry function with its corresponding maximum value in the training and validation sets. Furthermore, the numbers of units in each hidden layer are chosen to be the same for both atomic networks for O and H. Overall, with 34 and 22 units per hidden layer, the final 2B and 3B BPNN models contain 10542 and 4798 weight and bias parameters, respectively. For the continuity of the energy functional, the activation function for each unit is chosen to be a hyperbolic tangent for the hidden layers and a linear function for the output layer. Besides, the reference energies for the 2B training are converted to energy per atom in eV unit so that the network targets a similar range of values as given by the activation functions.

We build the network models using Keras^{79} with Theano^{80} backend and choose the Adam optimizer with a batch size of 64 for training. The Nguyen-Widrow method^{81} is employed to initialize the network weights and biases. For a stable and effective training, the optimization process is continuously carried out five times with descending starting learning rates $10\u22123,2\u22c510\u22124,6\u22c510\u22125,9\u22c510\u22126,10\u22126$ and corresponding numbers of iterations, or epochs, $1500,1500,1000,1000,1000$. Furthermore, we apply an additional decay rate *α* = 10^{−5} to each learning rate such that at a given epoch *k* the leaning rate is *lr*_{k−1}/(1 + *α*⋅*k*) based on the value at the previous epoch *lr*_{k−1}. The training is to optimize the mean squared error of the modeled energies compared to the reference data in the training set. To avoid overfitting, on each epoch, the quality of the model is monitored on the validation set such that only the model that gives the highest accuracy over this set is ultimately kept. Finally, the trained model is then evaluated on the test set to quantify its capability of generalization to unseen data. For the systems considered here, the training processes generally take 3 hours and 1 hour on a Tesla K40 GPU with the GPU-accelerated cuDNN library for 2B and 3B sets, respectively.

### C. Gaussian approximation potentials

The Gaussian Approximation Potential (GAP)^{82,83} framework, available in the QUIP program package,^{84} is an implementation of Gaussian process regression (GPR) interpolation for the atomic energy as a function of the geometry of the neighbouring atoms. The functional form representing a function *f* that is to be interpolated is identical to that of kernel ridge regression,

where the high dimensional vector ** R** represents the complete geometry of neighbouring atoms,

*k*indexes a set of representative data points {

*R*_{k}},

*K*is the kernel function, and {

*b*

_{k}} are fitting coefficients. In the GPR formalism,

*K*corresponds to an estimate of the covariance of the unknown function, and the linear system is solved in the least-squares sense using Tikhonov regularisation, but the regularisation parameters are now interpreted as estimates of data and model error. In the present case, the regularisation was chosen to be 0.00115 kcal/mol for the 2B term and 0.0231 kcal/mol for the 3B term after manual exploration of the data.

The success of the GAP fit depends on choosing an appropriate kernel, one that captures the structure of the input data and as much as possible about the function to be fitted. Here we use the “Smooth Overlap of Atomic Positions” (SOAP), a kernel that is the rotationally integrated overlap of the neighbour densities, which was shown to be equivalent to the scalar product of the spherical Fourier spectrum.^{83} The atomic environment of atom *i* is described by a set of neighbour densities, one for each atomic species, which are represented as the sum of Gaussians each centred on one of the neighbouring atoms *j*,^{85}

where *j* ranges over neighbours with atomic species *α*, *r*_{ij} are the positions relative to *i*, and $\sigma at2$ is a smoothing parameter. We included the switching function $fcut$ which smoothly goes to zero beyond a specified radial value. This local atomic neighbour density can be expanded in terms of spherical harmonics, $Ylm(r^)$, and orthogonal radial functions, *g*_{n}(|** r**|),

The expansion coefficients are then combined to form the rotationally invariant power spectrum,

where we have emphasized the functional dependence on the complete neighbour geometry. The complete SOAP kernel can be written as

where we have allowed for a small integer exponent *ζ* (here set to 2). The kernel is also normalised so that the kernel of each environment with itself is unity. Separate fits are made to the atomic energy function corresponding to each atomic species taken as the center of an atomic environment. The key free parameters are the radial cutoff in *f*_{cut} and the smoothing parameter *σ*_{at}. In the present cases here, atomic energy functions are represented by the sum of two kernels,^{86} one with a smaller radial cutoff (4.5 Å) and smaller smoothing (0.4 Å) and one with a larger cutoff (6.5 Å for the 2B and 7.0 Å for the 3B fit) and larger smoothing (1.0 Å). The root mean squared error (RMSE) is only weakly sensitive to these, and some manual optimisation was carried out. Each fit uses 10 radial basis functions and a spherical harmonics basis band limit of 10. The representative environments for the fit are chosen using CUR matrix decomposition.^{87} The number of representative points are 9000 in the 2B fit and 10000 in the 3B fit. The full command lines of the fits are given in the supplementary material. Training of the GAP models required 150 CPU core hours for the 2B model and 64 CPU core hours for the 3B model on 16-core Intel Xeon E7-4820 (Westmere) and Intel Xeon E5-2670 (Sandy Bridge) CPUs, respectively. Note that although formally the GAP construction corresponds to a decomposition of the total energy into atomic energies, similar to BPNN above, the cutoffs are sufficiently large to encompass all atoms in the water dimer and trimers in the dataset, and therefore the decomposition does not represent an approximation. Similar to PIP and BPNN, GAP provides analytical forces.

## IV. RESULTS

### A. 2B and 3B interactions and the structure of the training data

The root mean squared errors (RMSEs) obtained with PIPs, BPNNs, and GAPs for the 2B and 3B datasets are reported in Table I. For the 2B term, all three methods achieve similar accuracy: the error on the training set is less than 0.050 kcal/mol per dimer, while the errors on validation and test sets are less than 0.080 kcal/mol per dimer. These errors demonstrate a high level of accuracy since the average value of the target energies in the dataset is 3 kcal/mol. Among the three, the 2B PIP model appears to perform better on the validation and test sets and suffers less from overfitting. The difference in RMSEs for the training set and the test set is below 0.02 kcal/mol with PIP, but around 0.03 kcal/mol with BPNN and 0.04 kcal/mol with GAP. The GAP model gets a slightly lower error for the training set, but overfitting prevents to achieve a similar accuracy for the test set.

. | 2B . | 3B . | ||||
---|---|---|---|---|---|---|

. | Training . | Validation . | Test . | Training . | Validation . | Test . |

PIP | 0.0349 | 0.0449 | 0.0494 | 0.0262 | 0.0463 | 0.0465 |

BPNN | 0.0493 | 0.0784 | 0.0792 | 0.0318 | 0.0658 | 0.0634 |

GAP | 0.0176 | 0.0441 | 0.0539 | 0.0052 | 0.0514 | 0.0517 |

. | 2B . | 3B . | ||||
---|---|---|---|---|---|---|

. | Training . | Validation . | Test . | Training . | Validation . | Test . |

PIP | 0.0349 | 0.0449 | 0.0494 | 0.0262 | 0.0463 | 0.0465 |

BPNN | 0.0493 | 0.0784 | 0.0792 | 0.0318 | 0.0658 | 0.0634 |

GAP | 0.0176 | 0.0441 | 0.0539 | 0.0052 | 0.0514 | 0.0517 |

In order to investigate in more detail the performance of the different regression schemes for predicting the 2B and 3B energies over the MB-pol dimer and trimer data sets, we used a dimensionality reduction scheme to obtain a 2D representation of the structure of the train set. We followed a procedure similar to that used in Ref. 88 to map a database of oligopeptide conformers. A metric based on SOAP descriptors^{85} was used to assess the similarity between reference conformations of dimers or trimers. A 2D map that best preserved the similarity between 1000 reference configurations selected by farthest point sampling^{89} was obtained using the sketch-map algorithm.^{90,91} All other configurations (training and testing) were then assigned 2D coordinates (*x*_{i}, *y*_{i}) by projecting them on the same reference sketch-map. We could then compute the histogram of configurations *h*(*x*, *y*) and the averages of the properties of the different configurations and of the test RMSE for the various methods, conditional on the position on the 2D map, e.g.,

Figure 1 demonstrates the application of this analysis to the dimer dataset. One of the sketch-map coordinates correlates primarily with the O–O distance, while different relative orientations and internal monomer deformations are mixed in the other direction. Conformational space is very non-uniformly sampled [Fig. 1(b)], with a large number of configurations at a large O–O distance—which correspond to $Vshort2B$ of less than 0.01 kcal/mol—and at intermediate distances, with sparser sampling in the high-energy, repulsive region [Fig. 1(c)]. It is interesting to see that the three regression schemes we considered exhibit very similar performance in the various regions, with tiny errors <0.01 kcal/mol for far-away molecules, and much larger errors, as large as 1 kcal/mol, for configurations in the repulsive region. These large errors are not only due to the high energy scale of $Vshort2B$ in this region: the largest errors appear in the portion of the map which is characterized by both large $Vshort2B$ and low density of sample points.

The non-uniform sampling of the dimer space configuration means that there is room to improve it. Figure 2 compares the test RMSE obtained by BPNN fits constructed on subsets of the overall training set. The error can be reduced by up to a factor of five by choosing the subset with a FPS strategy, rather than at random. This observation is consistent with recent observations made using SOAP-GAP in a variety of systems.^{86,92} Selecting training configurations from a larger database of potential candidates using FPS gives a viable strategy to reduce the number of high-end calculations that have to be performed to describe accurately interactions in the construction of a MB potential.

Figure 3 shows a similar analysis for the case of the trimer data and $Vshort3B$. 3B energies span a smaller range than the 2B component, which includes most of the core repulsion. The higher dimensionality of the problem, however, makes this a harder regression problem, as is apparent from the irregular correlations between energy and position on the map, which reveals an alternation of regions of positive and negative contributions.

As a result, the absolute RMSE accuracy of the regression models is comparable to that for the 2B terms, with PIP and GAP yielding comparable accuracy (RMSE ≈ 0.05 kcal/mol), followed closely by BPNN (RMSE ≈ 0.06 kcal/mol). As in the case of 2B energy contributions, an analysis of the error distribution shows that improving the sampling density and uniformity for the train set is likely to be the most effective strategy to further improve the model. Errors are concentrated at the periphery of the data set. The good performance of the GAP model can be traced to the fact that it provides a very good description of the short RMS *d*_{OO} region, even if only a few reference structures are available, even though it performs less well than PIP or NN for configurations that involve far away molecules.

### B. Water clusters

Isomers of water clusters $(H2O)n$ with *n* = 4, 5, 6 (see Fig. 4 for the structures) serve as larger test systems to investigate the performance of MB-pol with PIP, BPNN, and GAP representations of the short-range 2B and 3B energies and the corresponding effect on the total interaction energies of the clusters.

An analysis of the 2B and 3B contributions to the total interaction energy of the water clusters is shown in Fig. 5. MB-pol errors with respect to the CCSD(T) reference values are smaller than 0.3 kcal/mol in all cases, independent of the cluster size and geometry and independent of the approach that is used to represent the short-range 2B and 3B energies. The errors increase somewhat with cluster size as the individual errors for the larger number of 2B and 3B terms can start to add up for cluster configurations that contain repeating dimer and trimer units. This is mostly pronounced for 2B interaction energies. While similar errors in 2B interaction energies are seen with the three potentials, GAP-MB-pol exhibits smaller errors in 3B interaction energies than PIP-MB-pol and BPNN-MB-pol.

Figure 6 compares the total interaction energies of all water cluster isomers as obtained with MB-pol using PIP, BPNN, and GAP representations of short-range 2B and 3B energies in comparison to the CCSD(T)/CCSD(T)-F12b reference values. In correspondence with the 2B and 3B contributions, the error in the total interaction energy increases with cluster size. Due to extended hydrogen bonding and symmetry, the ring-type isomers also have relatively large higher-body contributions that can be non-negligible and that can exhibit errors of similar magnitude as the 2B and 3B terms as has been shown in previous work.^{58,93} The error for this type of isomers is thus particularly large. However, the deviation in the computed interaction energies never exceeds 0.8 kcal/mol and, most importantly, the relative order of the total interaction energies for the different isomers of each cluster is retained in all cases. Overall we conclude that any of the investigated approaches to represent short-range 2B and 3B interaction energies within the MB-pol model is suitable to predict accurate interaction energies of water clusters.

## V. CONCLUSIONS

We have explored different representations of MB-pol short-range two-body (2B) and three-body (3B) interaction energies using permutationally invariant polynomials (PIP), Behler-Parrinello neural networks (BPNNs), and Gaussian approximation potentials (GAPs). Any of these models can be trained within a few hours on a single state-of-the-art compute node. The accuracy of the three models has been assessed by comparing their ability to reproduce large datasets of CCSD(T)/CBS 2B and 3B interaction energies as well as in predicting the energetics of small water clusters, which are always found to be within chemical accuracy (1 kcal/mol). These results demonstrate that the three models are effectively equivalent, consistently exhibiting similar performance in representing many-body interactions in water within the MB-pol framework. The most promising approach to further increase the accuracy for both the 2B and 3B terms involves increasing the number of reference calculations and optimizing the training set to cover more uniformly the relevant configuration space. Our analysis of the 2B and 3B contributions to the MB-pol interaction energies can be taken as a case study for the general problem of the systematic construction of potentials derived from the many-body expansion. The combination between an accurate machine-learning representation of the short-range terms in combination with a physically sound form of long-range contributions provides a promising route to the development of accurate, efficient, and transferable potential energy surfaces.

## SUPPLEMENTARY MATERIAL

See supplementary material for the following: the list of parameters for the symmetry functions used in BPNN-MB-pol, an example of the command line necessary to generate the GAP fit in QUIP, interaction energies for the water hexamer isomers analyzed in Figs. 5 and 6 whose Cartesian coordinates are given in Ref. 58, and a code to compute the short-range 2B and 3B interaction energies of BPNN-MB-pol.

## ACKNOWLEDGMENTS

This work was supported by the National Science Foundation through Grant No. ACI-1642336 (to F.P. and A.W.G.). This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation Grant No. ACI-1548562. J.B. is grateful for a Heisenberg professorship funded by the DFG (No. Be3264/11-2). E.Sz. would like to acknowledge the support of the Peterhouse Research Studentship and the support of BP International Centre for Advanced Materials (ICAM). M.C. was supported by the European Research Council under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 677013-HBMAP). G.I. acknowledges funding from the Fondazione Zegna.