A strategy is outlined to reduce the number of training points required to model intermolecular potentials using Gaussian processes, without reducing accuracy. An asymptotic function is used at long range, and the crossover distance between this model and the Gaussian process is learnt from the training data. The results are presented for different implementations of this procedure, known as boundary optimization, across the following dimer systems: CO–Ne, HF–Ne, HF–Na^{+}, CO_{2}–Ne, and (CO_{2})_{2}. The technique reduces the number of training points, at fixed accuracy, by up to ∼49%, compared to our previous work based on a sequential learning technique. The approach is readily transferable to other statistical methods of prediction or modeling problems.

## I. INTRODUCTION

In any molecular simulation, approximations of the potential energy surfaces (PESs) that describe the relevant interactions are a prerequisite. Traditionally, these approximations have been made using empirical potentials (force fields).^{1,2} However, such force fields have closed functional forms, which limit their capacity to capture the complicated topography of the PES. Furthermore, they are laborious to produce and may fail to reproduce accurately even the data to which they were fitted. As the accuracy of the simulation depends on the potential employed, much work has been devoted to developing force fields that approximate PESs with quantum-mechanical accuracy.

Such work includes methods to generate *ab initio* force fields^{3} and attempts to “learn” the potential via a machine learning algorithm.^{4,5} This article concerns the latter approach, which has also been applied in other fields in chemistry and materials science.^{6–8} This approach proceeds by training a statistical technique on a relatively small set of data from *ab initio* calculations on the PES of interest, known as the training set. Many such techniques have been employed to predict the energy in these algorithms, including neural networks,^{9–13} moment tensors,^{14–16} and Gaussian processes^{17–34} (GPs).

Herein, GPs are employed as the statistical technique. An existing example of a force field that uses GPs is FFLUX,^{35} which has been used to approximate the energies of weakly bound complexes^{36} and water clusters,^{37} among other applications.^{38–40} Furthermore, in the field of materials science, GP models that invoke a smooth overlap of atomic positions (SOAP) kernel^{41} have seen successful applications to many systems.^{29–31,42–45}

GP models have also produced promising results in applications to intermolecular interactions.^{17–19} Initially,^{17} the training sets for these models were constructed with Latin hypercube (LHC) sampling,^{46–48} though sequential design strategies,^{49,50} which achieve a prescribed accuracy with a smaller training set, have been shown to outperform such methods.^{19} Regardless of the training set design, a fixed crossover distance, *R*_{cross}, was imposed prior to training. This parameter defines a boundary beyond which a simple, long-range asymptotic function takes over prediction from the GP.^{17,19} A similar approach is used in materials science, in which the contribution by one atom to the neighbor density of another is assumed to be zero if the two are separated by a distance in excess of *R*_{cross}.^{31,44} Crossover distances have been applied alongside other statistical methods of prediction too,^{5,51} with neural network^{9–12} and moment tensor^{15} models of PESs employing fixed, pre-determined crossover distances.

The use of a crossover distance limits the portion of the PES approximated by the statistical method to the region around the potential well. Consequently, fewer training points are required to develop an accurate model. Previously, these distances were all fixed and specified *a priori*; however, they can be learnt from the training data. Herein, the work of Uteva *et al.*^{17,19} is extended into a sequential design strategy in which *R*_{cross} varies as a function of the number of training points and the species of the interacting atoms. This produces models of the same accuracy with fewer training points than the same design strategy with a fixed *R*_{cross}. The process by which *R*_{cross} is optimized is referred to here as boundary optimization.

### A. GP regression

All GP models herein make predictions via GP regression, with detailed descriptions of GP regression theory available elsewhere.^{52–54} This section briefly introduces the key concepts in the context of intermolecular potentials. For any intermolecular interaction, the PES can be thought of as a multivariate function *f*(**x**), $\mathbf{x} \in \mathbb{R}^{Z}$, where **x** is a vector of inputs and *Z* is the number of elements in **x**. Here, the inputs are inverse interatomic separations, although promising results have also been obtained with Morse variables.^{24,55} Only pair intermolecular interactions are considered, although GP models have been applied successfully to non-additive interactions^{17,19,40} and the methodology outlined here can be extended to such cases straightforwardly.

When the outputs *f*(**x**_{i}) = *Y*_{i} are available at *N* values of *i*, the set $\{\mathbf{x}_i, Y_i\}_{i=1}^{N}$ forms the training set. Letting **Y** be a vector of the observed energies from this set, GP regression can be used to approximate the value of *f*(**x**) at a new point **x**_{*} as

$$\hat{f}(\mathbf{x}_*) = \mathbf{K}_*^{\mathsf{T}} \mathbf{K}^{-1} \mathbf{Y}, \tag{1}$$

where **K**_{*} is a vector of the covariances between **x**_{*} and all **x**_{i},

$$(\mathbf{K}_*)_i = k(\mathbf{x}_*, \mathbf{x}_i),$$

and **K** is the positive-definite covariance matrix,

$$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) + \sigma_n^2 \delta_{ij}. \tag{2}$$

Here, $\sigma_n^2$ is the Gaussian noise variance, which accounts for the noise in the training data.
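As a concrete illustration, the regression equations above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation, not the production code: the function names are ours, and a generic squared exponential kernel stands in for the symmetric kernel introduced below.

```python
import numpy as np

def sq_exp(x, xp, sigma_f2=1.0, ls=1.0):
    """Squared exponential covariance between two input vectors."""
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    return sigma_f2 * np.exp(-0.5 * np.sum(((x - xp) / ls) ** 2))

def gp_predict(X_train, Y_train, x_star, sigma_n2=1e-8):
    """Posterior mean K_*^T K^{-1} Y at a new configuration x_star."""
    N = len(X_train)
    # Covariance matrix K, with the noise variance on the diagonal
    K = np.array([[sq_exp(X_train[i], X_train[j]) for j in range(N)]
                  for i in range(N)]) + sigma_n2 * np.eye(N)
    # Vector of covariances between x_star and the training inputs
    K_star = np.array([sq_exp(x_star, xi) for xi in X_train])
    alpha = np.linalg.solve(K, Y_train)  # K^{-1} Y; reusable across predictions
    return float(K_star @ alpha)
```

Because `alpha` depends only on the training set, it can be computed once and reused for every new configuration, which is why the prediction cost scales linearly with *N*.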

All entries in **K** are found by evaluating a covariance function, *k*(**x**, **x**′), where **x** and **x**′ are configurations from the training set. The same holds for **K**_{*}, only with one training configuration replaced by the new point **x**_{*}. A common example is the squared exponential covariance function, for which

$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left[-\sum_{d=1}^{D} \frac{(x_d - x_d')^2}{2 l_d^2}\right]. \tag{3}$$

Here, *D* is the total number of interatomic distances, which comprise unique atomic pairs; $\sigma_f^2$ is the signal variance; *x*_{d} and $x_d'$ are the inverse separations in **x** and **x**′ that involve the *d*th atomic pair; and *l*_{d} is the lengthscale for the interaction between this pair of atoms. The hyperparameters of the covariance function are $\sigma_n^2$, $\sigma_f^2$, and the set of lengthscales, *l*_{d}.

This covariance function can be modified to account for the fact that two configurations can have equivalent energies due to symmetry. The result is the symmetric squared exponential covariance function, *k*_{sym}(**x**, **x**′), investigated by Uteva *et al.*,^{17} which is used here. The set of all permutations of the interatomic distances in **x** under which the energy surface is unchanged is denoted as *P*, while *p* is a single permutation within this set. Assuming that *l*_{d} is invariant under interchange of $x_d$ and $x_d'$,

$$k_{\mathrm{sym}}(\mathbf{x}, \mathbf{x}') = \sum_{p \in P} k(\mathbf{x}, p(\mathbf{x}')), \tag{4}$$

where *k*(**x**, **x**′) is the squared exponential covariance function from Eq. (3). GPs also employ a mean function, which is taken here to be zero everywhere.
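For instance, with two symmetry-equivalent interatomic distances, *P* contains the identity and the swap, and the symmetrized kernel can be sketched as follows. The permutation set here is a toy assumption for illustration, not one of the dimer systems studied:

```python
import numpy as np

def sq_exp(x, xp, sigma_f2=1.0, ls=1.0):
    """Squared exponential covariance with a shared lengthscale."""
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    return sigma_f2 * np.exp(-0.5 * np.sum(((x - xp) / ls) ** 2))

def k_sym(x, xp, P):
    """Symmetric kernel: sum of k(x, p(x')) over every permutation p in P.
    Each p is a tuple of indices that reorders the distances in x'
    without changing the energy."""
    xp = np.asarray(xp, float)
    return sum(sq_exp(x, xp[list(p)]) for p in P)

# Toy symmetry group: identity and the swap of two equivalent distances
P = [(0, 1), (1, 0)]
```

By construction, `k_sym` is unchanged when **x**′ is replaced by any of its symmetry images, so symmetry-equivalent configurations receive identical covariances.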

To specify the hyperparameters of *k*_{sym}(**x**, **x**′) that produce the model that best fits the training data, the log of the marginal likelihood function, log($\mathcal{L}$), is maximized,^{52,54}

$$\log(\mathcal{L}) = -\frac{1}{2} \mathbf{Y}^{\mathsf{T}} \mathbf{K}^{-1} \mathbf{Y} - \frac{1}{2} \log|\mathbf{K}| - \frac{N}{2} \log(2\pi). \tag{5}$$

Equation (5) shows that this optimization entails inversion of the *N* × *N* covariance matrix. This incurs a sizable computational cost, which translates to scaling of order $\mathcal{O}(N^{3})$ for hyperparameter optimization.^{56} Consequently, although they achieve higher predictive accuracies than other statistical methods when modeling PESs,^{51,57,58} GPs require greater computational effort to train.^{51} For prediction, meanwhile, **K**^{−1}**Y** in Eq. (1) needs to be calculated only once, and as a result, the cost scales linearly with *N*. Even so, molecular simulations that employ a GP model for which *N* is large are more computationally intensive than those that use a traditional force field.

These issues have led to attempts to minimize the training set size required by a GP model to achieve a given error. For two training strategies that achieve the same error, that which does so with fewer training points is more computationally efficient. Attempts to develop more computationally efficient training strategies have involved active learning or sequential design methods,^{19,29,59} composite kernels,^{24} and new sampling methods.^{27,60}

However, no attempt was made in these previous works to increase training efficiency by learning the optimal crossover distance from the data used in training. Our boundary optimization approach has the potential to increase the gains in efficiency from these prior methods further still, with little associated increase in computational time, provided a viable long-range approximation is available. Such an improvement is shown here, building upon the sequential design method of Uteva *et al.*^{19}

### B. Training set design

It was found by Uteva *et al.*^{17} that the predictive performance of a GP for a PES is enhanced by using inverse interatomic separations, rather than the separations themselves, to describe intermolecular configurations. This is because *k*_{sym}(**x**, **x**′) is a stationary kernel (i.e., it depends only on the distance between the two configurations being compared), meaning it assumes that the rate of change of the output (the interaction energy) with the input (the interatomic distances) is constant. This is not the case for a PES, where the energy varies rapidly with separation in the short-range repulsive wall but barely at all in the long-range asymptotic region. The *r* → *r*^{−1} conversion addresses this issue: for low *r*, the rate of change in *r*^{−1} is faster than in *r*, so the change in the input more closely matches the change in the output; for large *r*, the rate of change in *r*^{−1} is far smaller, as is that of the energy.

Previously, Latin hypercube (LHC) sampling, which is a space-filling design, has been employed to build datasets when modeling PESs with GPs.^{17–19,61} The approach of Uteva *et al.*^{17} entailed designing LHCs in inverse separations over a range of angles that specified the symmetry-distinct region.^{17} The algorithm generates a large number of candidate LHCs and finds the minimum separation in each. The LHC with the largest such separation is selected under what is referred to here as the “maximin” criterion. Quantum chemical calculations are undertaken on the selected LHC only. This LHC is then subject to a high-energy cutoff, *E*_{cut}, and any configuration whose interaction energy exceeds *E*_{cut} is discarded from the LHC.^{17} In addition, a geometric constraint of 8.5 Å was placed on all LHCs to ensure that no configurations with minimum separations above this threshold were included.^{17} It was observed that at separations greater than the geometric constraint, the GP predictions tended toward a small, non-zero constant.^{17} As these predictions should tend toward zero, a long-range function derived from the multipole expansion^{62} was introduced to approximate energies at large separations.
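The candidate-generation and maximin-selection step can be sketched as follows. This is a simplified version on the unit hypercube; the published work samples in the inverse-separation and angular coordinates described above, so the generator and names below are illustrative only.

```python
import numpy as np

def latin_hypercube(n_points, n_dims, rng):
    """One random LHC on the unit cube: each dimension's range is split
    into n_points strata, each containing exactly one sample."""
    samples = (rng.random((n_points, n_dims))
               + np.arange(n_points)[:, None]) / n_points
    for d in range(n_dims):
        rng.shuffle(samples[:, d])  # decouple the strata across dimensions
    return samples

def min_separation(lhc):
    """Smallest pairwise Euclidean distance within one design."""
    diffs = lhc[:, None, :] - lhc[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)
    return dists.min()

def maximin_lhc(n_points, n_dims, n_candidates=200, seed=0):
    """Generate candidate LHCs and keep the one whose minimum pairwise
    separation is largest (the 'maximin' criterion)."""
    rng = np.random.default_rng(seed)
    candidates = [latin_hypercube(n_points, n_dims, rng)
                  for _ in range(n_candidates)]
    return max(candidates, key=min_separation)
```

Only the winning design is then carried forward to the quantum chemical calculations, so the cost of discarding the other candidates is negligible.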

In a later work, Uteva *et al.*^{19} presented active and sequential learning approaches, in which the model of the PES is progressively refined by adding new points to the GP. In this approach, training and validation of the GP models involve three datasets: a training set, a reference set, and a test set. The training set is used in the GP regression. The reference set provides a pool of configurations from which new training points can be selected, while the test set determines the GP’s accuracy against an independent dataset.

One sequential design strategy presented by Uteva *et al.*^{19} is the highest error search. This approach selects new training points based on the configuration in the reference set with the largest GP prediction error. We adopt this approach herein. However, Uteva *et al.*^{19} applied this strategy only at separations below the fixed, pre-determined crossover distance,^{19} meaning no new training data could be added at separations above *R*_{cross}.

The highest error search is best described as a sequential design rather than an active learning^{63,64} method. This is because active learning methods are a subset of sequential design methods that compute the output corresponding to an input only at the point it is added to the training set. This means, in the context of modeling PESs, an active learning method would use a reference set comprising configurations for which the energies were not calculated. Active learning strategies have, however, been successfully used to develop GP,^{19,29,59} moment tensor,^{15} and neural network^{13} models of PESs, and the methodology used here for boundary optimization could be altered for use with the two-set method of Uteva *et al.*,^{19} which is an active learning method.

When using a sequential design method, meanwhile, the energies in the reference and test sets are pre-computed using a relatively computationally inexpensive *ab initio* technique. Once a model with the requisite predictive accuracy is obtained, the energies in the minimal training set can be re-calculated using a more accurate, and costly, *ab initio* method before use in applications. This process is known as transfer learning,^{65,66} and it allows a GP model of a PES to be produced with the accuracy of a high-level *ab initio* technique with relatively few computationally expensive calculations. For example, MP2^{67} energies have been upgraded to coupled cluster with single, double, and perturbative triple excitations [CCSD(T)]^{68} energies for calculation of the CO_{2}–CO second virial coefficient.^{17}

## II. METHODOLOGY

### A. Overview

In previous applications of both LHC learning^{17} and sequential design methods,^{19} a fixed value of the crossover distance, *R*_{cross}, was chosen *a priori*; however, it is possible to determine its optimal value from the reference data. This is boundary optimization and is possible because a sequential design method permits *R*_{cross} to be varied each time the GP is updated. Consequently, the size of the region over which GP regression is used for prediction will grow with the predictive accuracy of the GP.

Boundary optimization may lead to increased accuracy for the following reason. When the number of training points, *N*_{TP}, is low, the predictive accuracy of the long-range function at the outer edge of the potential well will exceed that of the GP. Consequently, it is anticipated that allowing the long-range function to approximate the energies of configurations in this region at low *N*_{TP} will increase the predictive accuracy of the overall model and facilitate more efficient model development.

### B. Data set generation

Both the reference and test sets for each chemical system were generated under the LHC design strategy discussed in Sec. I B. All energies were calculated in Molpro^{69} using MP2 with an aug-cc-pVTZ basis set and the counterpoise correction, with the only exception being HF–Na^{+}, which used an aug-cc-pwCVTZ basis set instead. The specifications of the LHCs for each system are given in Table I. Therein, *r* is the distance between the bond centers (not the centers of mass), *θ* is the angle between *r* and the bond axis of the molecule, *θ*_{1} and *θ*_{2} are the angles between *r* and the bond axes of the first and second CO_{2} molecules, respectively, and *ϕ* is the torsional angle between the two CO_{2} molecules. The molecules were kept rigid, with *r*_{CO} = 1.1283 Å for CO, *r*_{CO} = 1.1632 Å for CO_{2}, and *r*_{HF} = 0.9170 Å. However, boundary optimization could be applied to non-rigid systems straightforwardly, with the only additional requirement being generalization of the long-range function to non-rigid molecules. A larger geometric constraint of 100 Å (instead of 8.5 Å^{17}) was employed to probe the long-range behavior of the system. The maximum value of *N*_{TP} was 100 for all systems apart from (CO_{2})_{2}, which used 300 training points at most.

Table I.

| System | Coordinate | Range | *N*_{ref} | *N*_{test} | *E*_{min} |
|---|---|---|---|---|---|
| CO–Ne | *r*^{−1} | 0.01–0.67 Å^{−1} | 1914 | 5718 | −1.502 × 10^{−4} |
| | cos(*θ*) | −1 to 1 | | | |
| HF–Ne | *r*^{−1} | 0.01–0.67 Å^{−1} | 2148 | 6468 | −2.633 × 10^{−4} |
| | cos(*θ*) | −1 to 1 | | | |
| HF–Na^{+} | *r*^{−1} | 0.01–0.67 Å^{−1} | 2760 | 8416 | −2.518 × 10^{−2} |
| | cos(*θ*) | −1 to 1 | | | |
| CO_{2}–Ne | *r*^{−1} | 0.01–0.67 Å^{−1} | 5057 | 5072 | −2.895 × 10^{−3} |
| | cos(*θ*) | 0–1 | | | |
| (CO_{2})_{2} | *r*^{−1} | 0.01–0.67 Å^{−1} | 5810 | 5837 | −1.975 × 10^{−3} |
| | cos(*θ*_{1}) | 0–1 | | | |
| | cos(*θ*_{2}) | 0–1 | | | |
| | *ϕ* | 0°–180° | | | |


A high-energy cutoff, *E*_{cut}, was applied to the reference and test sets to remove configurations whose interaction energies exceeded its value. *E*_{cut} = 0.005 *E*_{h} for all systems apart from HF–Na^{+}, for which *E*_{cut} = 0.05 *E*_{h} because the charge–dipole interaction increases the well depth (1 *E*_{h} ≈ 27.211 eV ≈ 2625.5 kJ mol^{−1}).

For systems with *N*_{test} ≫ *N*_{ref}, the independence of the test set is self-evident. However, the test set is also independent for systems where *N*_{test} ≈ *N*_{ref}. This follows because, although the reference and test sets for each system were designed using the same “maximin” strategy, the stochastic nature of the LHC algorithm means that separate LHCs contain completely independent sets of configurations. Furthermore, the “maximin” criterion is based on just one separation in the whole dataset, meaning that two LHCs with similar maximin separations will still be dissimilar. This is demonstrated in Fig. 1, which shows the energies against the inverse C–Ne separation for the reference and test sets used in training models of the CO_{2}–Ne potential.

### C. Long-range asymptotic functions

For the long-range energy model, multipole series were employed for all systems apart from HF–Na^{+}. The contributions included in the multipolar long-range functions are shown in Table II for each system apart from CO_{2}–Ne, for which the function is already described in previous work,^{17} and (CO_{2})_{2}. The latter was developed prior to the other multipolar long-range functions and uses atomistic charge, dipole, quadrupole, and polarizability contributions from Hartree–Fock^{70} theory, which were scaled to give the known total molecular properties. For HF–Na^{+}, a fitted long-range function was used, which was derived by fitting a sum of two power laws between two points where the energy was predicted by GP regression. More information on the motivation for and derivation of this function is found in the Appendix.

Table II.

| System | Dipole | Quadrupole | Polarizability | Dispersion |
|---|---|---|---|---|
| CO–Ne | ✓ | × | ✓ | ✓ |
| HF–Ne | ✓ | × | ✓ | ✓ |
| HF–Na^{+} | ✓ | ✓ | ✓ | ✓ |
| Level of theory | MRCI^{71,72} | MRCI | MP2 | CCSD |


### D. Classification of phase space using a boundary

When modeling PESs using GPs, it is often necessary to classify configurations as suitable for prediction via the GP or via a long-range asymptotic function. In the work of Uteva *et al.*,^{17,19} the classifier formed a boundary from the superposition of atom-centered spheres defined by a single constant, *R*_{cross}. Specifically, if any interatomic distance was no greater than *R*_{cross}, the GP was used. As *R*_{cross} was fixed, this classifier is referred to here as C_{fixed}. Denoting the region in which GP regression was used for prediction as A_{GP} and the region which employed the long-range function as A_{LR}, under C_{fixed}, these regions were

$$A_{\mathrm{GP}} = \{\mathbf{r} : \min(\mathbf{r}) \leq R_{\mathrm{cross}}\}$$

and

$$A_{\mathrm{LR}} = \{\mathbf{r} : \min(\mathbf{r}) > R_{\mathrm{cross}}\},$$

where **r** is a set of intermolecular atom–atom distances and min(**r**) is the smallest separation in **r**.

In boundary optimization, *R*_{cross} varies according to the GP accuracy. Thus, *R*_{cross} is not a constant but a model parameter that is learnt from the reference data. As this classifier is still parameterized by a single value, *R*_{cross}, it is referred to here as C_{single}.

More elaborate classifiers are possible by using more detailed parametric forms to define the boundary region. A simple way of defining a more complex classifier is for the value of *R*_{cross} to depend on the atom types that comprise the interatomic distance. The resulting classifier is referred to here as C_{multi}. For a system of molecules with *D* interatomic pairs of chemically different atoms, using C_{multi} requires the vector of crossover distances **R**_{cross} = (*R*_{1}, …, *R*_{d}, …, *R*_{D}). This defines a multiple-parameter boundary region as follows:

$$A_{\mathrm{GP}} = \{\mathbf{r} : \min_d(\mathbf{r}) \leq R_d \ \text{for at least one}\ d\},$$

where min_{d}(**r**) is the minimum separation in **r** that involves an atomic pair of type *d*.
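Both classifiers amount to a handful of comparisons; a minimal sketch, with function and argument names that are ours, not the original code's:

```python
import numpy as np

def classify_fixed(r, R_cross):
    """C_fixed / C_single: use the GP if the smallest intermolecular
    distance is within R_cross, else the long-range function."""
    return "GP" if np.min(r) <= R_cross else "LR"

def classify_multi(r, pair_types, R_cross_vec):
    """C_multi: r[i] is the i-th intermolecular distance and pair_types[i]
    indexes its atomic pair type d. The configuration is in A_GP if
    min_d(r) <= R_d for at least one pair type d."""
    r = np.asarray(r, float)
    pair_types = np.asarray(pair_types)
    for d, R_d in enumerate(R_cross_vec):
        mask = pair_types == d
        if mask.any() and r[mask].min() <= R_d:
            return "GP"
    return "LR"
```

C_{single} is recovered from C_{multi} by giving every pair type the same crossover distance.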

The optimum values of the classifier parameters are determined by minimizing the error between the model and the reference set (i.e., by minimizing the training error), meaning that the sizes of *A*_{GP} and *A*_{LR} vary with the GP. The sum of squared errors, SSE_{tot}, over the two regions is

$$\mathrm{SSE}_{\mathrm{tot}} = \mathrm{SSE}_{\mathrm{GP}} + \mathrm{SSE}_{\mathrm{LR}},$$

where

$$\mathrm{SSE}_{\mathrm{method}} = \sum_{i=1}^{N_{\mathrm{method}}} \left(\hat{Y}_i - Y_i\right)^2.$$

Here, “method” denotes either GP or LR, *N*_{method} is the number of points in A_{method}, $\hat{Y}_i$ is the prediction of the energy for the *i*th configuration from the relevant method, and *Y*_{i} is the calculated energy of the same configuration. The root-mean-square error (RMSE) against the test set, RMSE_{test}, is given by

$$\mathrm{RMSE}_{\mathrm{test}} = \sqrt{\frac{\mathrm{SSE}_{\mathrm{test}}}{N_{\mathrm{test}}}},$$

where SSE_{test} is SSE_{tot} over the test set and *N*_{test} is the number of configurations in the test set. RMSE_{test} is therefore a function of *R*_{cross} with discrete steps, as the RMSE changes only when a change in *R*_{cross} causes a configuration in the test set to be re-classified. Both C_{single} and C_{multi} are simple parametric classifiers that pre-impose a mathematical form on the classification. Hence, neither is expected to be optimal with respect to the RMSE against the reference or test sets. That is, a more complicated boundary than that described by these classifiers will likely produce a lower RMSE against a given dataset. However, it is shown later that an artificial “ideal” classifier produces only very marginal improvements over C_{multi}, suggesting this classifier provides a very good balance of simplicity and accuracy.
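The test-set error of the composite model follows directly from these definitions. In this sketch, the prediction functions and classifier are placeholders passed in as arguments:

```python
import numpy as np

def rmse_test(configs, energies, predict_gp, predict_lr, classify):
    """RMSE of the composite model over a test set.

    Each configuration is routed to the GP or the long-range function by
    the classifier; the accumulated sum of squared errors is SSE_tot
    evaluated over the test set."""
    sse = 0.0
    for x, y in zip(configs, energies):
        y_hat = predict_gp(x) if classify(x) == "GP" else predict_lr(x)
        sse += (y_hat - y) ** 2
    return float(np.sqrt(sse / len(energies)))
```

Re-running this function as *R*_{cross} is scanned reproduces the discrete steps described above: the value changes only when a test configuration crosses the boundary and is re-classified.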

### E. Overall algorithm

Using the data and calculation methods above, the algorithm generates a GP model sequentially as follows: Train the GP to the current training set, select the classifier parameters by minimizing the RMSE against the reference set, and move a new point from the reference to the training set based on the largest error. Below are further details of each step, along with how the choice of classifier and placement strategy affects the algorithm.

#### 1. GP training

All GPs described herein were trained using the GPy^{73} package in Python 2.7. Optimization of the hyperparameters was carried out by maximizing log($L$) using 20 independent restarts whenever a configuration was added to the training set. Moreover, a gamma distribution with an expectation of one and a variance of two was used as a prior on all hyperparameters for all systems. This was to weakly penalize large hyperparameter values, given that the expected values are typically of order 0.1 or below.

#### 2. Direct search algorithm

When using C_{single}, *R*_{cross} can be optimized via a direct search that exploits how the RMSE varies as a piecewise constant function of *R*_{cross}. Because of this feature, all possible values of SSE_{tot} (and hence the RMSE) can be computed readily against the reference set. A direct search of these values is guaranteed to find the global minimum. Full details of the direct search are given in Algorithm 1.

Algorithm 1.

1. Compute the squared errors of the GP and long-range function, and min(**r**), at each configuration in the reference set.
2. Order the configurations from smallest to largest min(**r**).
3. Initialize SSE_{tot} from the squared errors of the long-range function alone.
4. Iterate through the min(**r**) values from lowest to highest:
    a. Set *R*_{cross} = min(**r**), which moves a single configuration from A_{LR} to A_{GP}.
    b. Update SSE_{tot} by deducting the squared long-range error of the moved point from SSE_{LR} and adding its squared GP error to SSE_{GP}.
    c. Store the new SSE_{tot}.
5. Select the value of *R*_{cross} that corresponds to the smallest value of SSE_{tot}.


The direct search algorithm is computationally cheap for the following reasons. Each instance of the direct search requires only a single set of predictions from each of the long-range function and GP regression. As the long-range function is fixed throughout the sequential design process, these predictions need only be calculated once at the start. The GP predictions need to be re-calculated whenever the GP is updated, which occurs once per step of the sequential design. However, these predictions are already required by a sequential design step when choosing the new training point. Furthermore, the calculation of all possible squared error values (step 4) is cheap because calculating each value in order requires only a simple update of SSE_{LR} and SSE_{GP}. Consequently, the direct search is fast compared to the other steps of the sequential design algorithm and can therefore be undertaken at each design step with negligible additional computational effort.
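Algorithm 1 can be sketched as follows. The array names are ours, and the squared errors are assumed to be pre-computed, as described above:

```python
import numpy as np

def direct_search(min_r, se_gp, se_lr):
    """Direct search for the optimal R_cross (Algorithm 1).

    min_r[i] is min(r) for reference configuration i; se_gp[i] and
    se_lr[i] are its pre-computed squared GP and long-range errors.
    Every possible boundary is checked, so the returned R_cross is the
    global minimizer of SSE_tot."""
    min_r = np.asarray(min_r, float)
    order = np.argsort(min_r)           # smallest min(r) first
    sse = float(np.sum(se_lr))          # start with everything long-range
    best_sse, best_R = sse, 0.0         # R_cross = 0 puts all points in A_LR
    for i in order:
        # Raising R_cross to min_r[i] moves configuration i from A_LR to A_GP
        sse += se_gp[i] - se_lr[i]
        if sse < best_sse:
            best_sse, best_R = sse, float(min_r[i])
    return best_R, best_sse
```

Each step of the loop is a constant-time update, so the whole search costs one sort plus a single pass over the reference set.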

#### 3. Orthogonal direct search algorithm

An adaptation of the direct search algorithm is required for C_{multi}, as this requires a multidimensional optimization of multiple classifier parameters. This is called the orthogonal direct search, as it optimizes one element of **R**_{cross} while keeping the others fixed. The single element that varies is optimized using the direct search algorithm, as this is guaranteed to return the best minimum along that 1D slice of **R**_{cross}. Although orthogonal optimizations can be time consuming, the speed of the one-dimensional direct search means that repeating it multiple times for all crossover distances is feasible.

The orthogonal direct search proceeds via Algorithm 2. As in the one-dimensional direct search, the squared errors of the long-range and GP predictions at each configuration are pre-computed for this algorithm. The orthogonal search algorithm is not guaranteed to find the global minimum, as local minima may exist in the RMSE landscape. Hence, *NR*_{max} restarts are performed from randomly selected starting points. Each restart involves up to *NI*_{max} optimizations of each *R*_{d}. Here, *NI*_{max} = 15 and *NR*_{max} = 5 for all systems. If the values in **R**_{cross} converge such that further one-dimensional searches in any orthogonal direction leave its elements unchanged before *NI* = *NI*_{max}, **R**_{cross} is saved and the next restart is undertaken. In fact, it was rare that *NI* reached *NI*_{max} for any of the systems explored here. Moreover, despite the low *NR*_{max} employed, the same minimum was usually found across multiple restarts.

Algorithm 2.

1. Choose limits *NI*_{max} and *NR*_{max} on the numbers of iterations and restarts, respectively (see the text for details).
2. Assemble the array of minimum distances, **M**, as follows:
    a. For every configuration, collect all interatomic distances that comprise the same atom types into group *d* and find min_{d}(**r**) for each *d*.
    b. Arrange the lists of min_{d}(**r**) values in an *N* × *D* array, **M**, where *N* is the number of configurations in the dataset and *D* is the number of unique atomic pairs in **r**.
    c. Sort each column of **M** from smallest to largest to produce an ordered list in each column.
3. For each restart, select a row of **M** at random as the initial guess at **R**_{cross}.
4. For each *d* in turn, fix all crossover distances apart from *R*_{d} and find the optimal value of *R*_{d} using a direct search.
5. Repeat step 4 until *NI* = *NI*_{max} or the elements of **R**_{cross} remain unchanged; save this **R**_{cross} and its corresponding SSE_{tot}.
6. Repeat steps 3–5 until *NR* = *NR*_{max}.
7. Select the **R**_{cross} that corresponds to the lowest value of SSE_{tot}.


The direct search offers a method that is at once very fast, suited to the discrete-stepped error surface, and exact in a single dimension. This latter property means it skips over any local minima in a given orthogonal direction. As such, optimal crossover distances are obtained quickly and reproducibly under this method. Typically, about 80% of random restarts return the same set of crossover distances for a given system and a given training set.
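A compact sketch of Algorithm 2 follows. For clarity, the 1D search below simply evaluates SSE_{tot} at every candidate value rather than using the incremental update of Algorithm 1; this changes the cost, not the result.

```python
import numpy as np

def sse_multi(M, R, se_gp, se_lr):
    """SSE_tot under C_multi: row i of M holds min_d(r) for configuration
    i, which lies in A_GP if min_d(r) <= R_d for at least one type d."""
    in_gp = (M <= R).any(axis=1)
    return float(np.sum(np.where(in_gp, se_gp, se_lr)))

def orthogonal_direct_search(M, se_gp, se_lr, NI_max=15, NR_max=5, seed=0):
    """Sketch of Algorithm 2: optimize one element of R_cross at a time,
    with random restarts drawn from the rows of M."""
    rng = np.random.default_rng(seed)
    N, D = M.shape
    best_R, best_sse = None, np.inf
    for _ in range(NR_max):
        R = M[rng.integers(N)].copy()          # step 3: random row of M
        for _ in range(NI_max):
            R_old = R.copy()
            for d in range(D):                 # step 4: 1D search along R_d
                candidates = np.concatenate(([0.0], M[:, d]))
                trials = []
                for c in candidates:
                    trial = R.copy()
                    trial[d] = c
                    trials.append(sse_multi(M, trial, se_gp, se_lr))
                R[d] = candidates[int(np.argmin(trials))]
            if np.array_equal(R, R_old):       # step 5: converged early
                break
        sse = sse_multi(M, R, se_gp, se_lr)
        if sse < best_sse:
            best_R, best_sse = R.copy(), sse
    return best_R, best_sse
```

Only values appearing in **M** (plus zero) can change the classification, so restricting the candidates to the column entries loses nothing.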

#### 4. Training point placement methods

Once the classifier parameters are optimized, the PES model is fully specified and the next training point is selected by the highest-error method. The point with the greatest error can be chosen either from A_{GP} alone or from the union of A_{GP} and A_{LR}. Using A_{GP} alone is referred to as the constrained placement method, which proceeds via Algorithm 3. Steps 2–5 of this algorithm comprise one stage of training, with RMSE_{test} also calculated at each such stage.

1: Select the configuration in the reference set with the highest energy; add this configuration to the training set and remove it from the reference set.

2: Retrain the GP to the updated training set.

3: Determine the boundary that minimizes the RMSE against the reference set (see Subsection II E 2).

4: Find the configuration in A_{GP} for which the GP error is highest, where A_{GP} is defined by the boundary from the previous step.

5: Add this configuration to the training set and remove it from the reference set.

6: Repeat steps 2–5 until the desired RMSE or number of training points is reached.

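The constrained placement loop can be sketched schematically in Python. The `train_gp`, `optimise_boundary`, `abs_error`, and `in_gp_region` callables are hypothetical stand-ins for the GP training, boundary optimization, and classification components described in the text.

```python
def constrained_placement(reference, energy, n_target, train_gp,
                          optimise_boundary, abs_error, in_gp_region):
    """Sketch of Algorithm 3: grow the training set one highest-error
    point at a time, re-optimizing the boundary at every stage.

    reference : iterable of configurations.
    energy    : callable giving the ab initio energy of a configuration.
    The remaining arguments are hypothetical stand-ins (see lead-in).
    """
    reference = list(reference)
    # Step 1: seed the training set with the highest-energy configuration.
    seed = max(reference, key=energy)
    reference.remove(seed)
    training = [seed]
    while len(training) < n_target and reference:
        model = train_gp(training)                       # Step 2
        boundary = optimise_boundary(model, reference)   # Step 3
        # Step 4: restrict attention to A_GP, as defined by the boundary,
        # then take the point with the highest GP error.
        a_gp = [c for c in reference if in_gp_region(c, boundary)]
        if not a_gp:
            a_gp = reference
        worst = max(a_gp, key=lambda c: abs_error(model, c))
        training.append(worst)                           # Step 5
        reference.remove(worst)
    return training
```

The open placement method differs only in step 4, where the candidate pool is the whole reference set and each region uses its own error measure.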

Choosing new points from the union of A_{GP} and A_{LR} is named the open placement method. This proceeds identically to the constrained placement method (Algorithm 3), except that the highest-error point may come from either A_{GP} or A_{LR}. In each region, this point is found using the corresponding method of prediction (i.e., the error of the long-range function is used in A_{LR}).

Models were trained under both point placement methods because each holds potential advantages over the other. The constrained placement method ensures that all training points are added in the GP region and so are of immediate use in prediction. Meanwhile, the open placement method is capable of immediately placing points in regions of the PES where the long-range function performs poorly, potentially transferring these regions to A_{GP} more rapidly than the constrained placement approach.

#### 5. Closest model training strategy

In the present work, all training strategies used are combinations of a classifier and a point placement method. A further strategy, which is intended for comparison only, is the closest model training strategy. This employs the open placement method to select training points and classifies a configuration using C_{optimal}, where

C_{optimal}(**r**) = A_{GP} if SE_{GP}(**r**) < SE_{LR}(**r**), and A_{LR} otherwise. (12)

Here, SE_{method} is the squared error in the prediction from “method” at **r**. C_{optimal} is so named because it classifies a configuration based on whether GP regression or the long-range function best approximates its energy.

Equation (12) shows that C_{optimal} employs no boundary and so requires prior knowledge of the energy of a configuration in order to classify it. Consequently, models obtained from the closest model strategy are unsuitable for prediction. However, this method represents an “ideal” classifier that is guaranteed to find the optimal RMSE_{test} for a given GP model. Hence, C_{optimal} is useful for estimating how inaccuracies in the parametric classifiers affect the training efficiency. If models from C_{optimal} significantly outperform the other classifiers, this suggests that the other classifiers are too simple to properly approximate the true boundary.
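The classification rule of Eq. (12) amounts to a single comparison of squared errors. A minimal sketch, with hypothetical `gp_predict` and `lr_predict` callables standing in for the two methods of prediction:

```python
def c_optimal(config, e_true, gp_predict, lr_predict):
    """Closest-model classifier: assign a configuration to whichever
    method of prediction has the smaller squared error.  Because it
    needs the true energy e_true, it cannot be used for prediction."""
    se_gp = (gp_predict(config) - e_true) ** 2
    se_lr = (lr_predict(config) - e_true) ** 2
    return "A_GP" if se_gp < se_lr else "A_LR"
```

The need for `e_true` makes explicit why this classifier serves only as an "ideal" benchmark.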

There are circumstances where C_{optimal} may not give an optimal RMSE against the test set. There may be short-range hypersurfaces where the interaction energy is predicted near-exactly by the long-range function due to chance. Points in the reference set that are near these hypersurfaces will be classified by C_{optimal} as part of A_{LR} instead of A_{GP}. If the configuration density of the reference set around such a hypersurface is insufficiently high, no training points will be added in its vicinity. Consequently, points in the test set near the hypersurface will be inadequately approximated by either method of prediction; the GP will perform poorly due to a lack of nearby training points, and the long-range function will be inaccurate for short-range points that are not extremely close to the hypersurface. This problem was avoided by using dense reference sets and by not using the constrained placement strategy with C_{optimal}.

#### 6. Combining classifiers and placement strategies

The sequential design method, or training strategy, requires a choice of classifier and placement strategy. The combinations of these examined in this work are given in Table III. Methods that do not involve C_{optimal} can make predictions and so are suitable for applications. The method involving C_{optimal} can only classify points if the true energy is already known and so is only useful to estimate the loss in performance due to inaccuracies in the parametric classifiers. The fixed boundary method corresponds to the method of Uteva *et al.*^{19} and is included to allow comparison with this prior method, on which the new methods build. The method of Uteva *et al.*^{19} was previously shown to significantly reduce the number of training points compared to the LHC design. It is demonstrated below that these new boundary optimization methods improve further the already efficient methods of Uteva *et al.*^{19}

Training strategy | Classifier | Point placement method
---|---|---
Single-constrained | C_{single} | Constrained placement
Multi-constrained | C_{multi} | Constrained placement
Single-open | C_{single} | Open placement
Multi-open | C_{multi} | Open placement
Closest model | C_{optimal} | Open placement
Fixed boundary | C_{fixed} | Constrained placement


## III. RESULTS AND DISCUSSION

Comparisons of the performances of the different training strategies are made using the HF–Ne, HF–Na^{+}, CO–Ne, CO_{2}–Ne, and (CO_{2})_{2} potentials. These were selected as they provide a range of interaction types and well depths to test the robustness of the new training strategies.

The number of training points is *N*_{TP}, which in each system was less than 10% of the number of configurations in the corresponding reference set (see Table I and the related text). Consequently, the training sets for the models discussed are candidates for transfer learning because their small size makes, for example, CCSD(T) calculations possible for the whole training set. Boundary optimization is anticipated to improve training efficiency as, for small *N*_{TP}, the long-range function will outperform the GP at the outer edge of the potential well. This is illustrated in Fig. 2 for the (CO_{2})_{2} system using a GP model trained under the single-constrained strategy up to *N*_{TP} = 21.

Figure 2 shows that the predictive accuracy of the long-range function exceeds that of GP regression for configurations with C–C separations above the crossover distance. Despite this, in the fixed boundary method, these configurations would be predicted with the GP. This not only reduces the overall accuracy but also means that new training points must be placed at long range to compensate, when simply allowing the long-range function to predict these energies would give adequate accuracy.
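Composite prediction under an optimized boundary can be sketched as follows. The sketch assumes, as one plausible reading of the classifier, that a configuration belongs to A_{LR} only when every pairwise minimum distance exceeds its crossover value; the `gp_predict` and `lr_predict` callables are hypothetical.

```python
def predict(min_dists, r_cross, gp_predict, lr_predict):
    """Composite prediction under a C_multi-style classifier (sketch).

    min_dists : dict mapping each atomic pair type to the minimum
                interatomic distance of that type in the configuration.
    r_cross   : dict of crossover distances with the same keys.
    """
    # Assign to A_LR only when every minimum distance lies beyond its
    # crossover value (an assumption of this sketch); otherwise use the GP.
    in_long_range = all(min_dists[d] > r_cross[d] for d in r_cross)
    return lr_predict(min_dists) if in_long_range else gp_predict(min_dists)
```

With a single crossover distance (C_{single}), `r_cross` would contain one entry for the overall minimum separation.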

The impact of the above on training efficiency compared to fixed boundary training is shown for HF–Ne in Fig. 3. For C_{multi}, the RMSE falls faster with *N*_{TP} compared to C_{fixed}, indicating improved training efficiency. This improvement is most pronounced when *N*_{TP} is low and so RMSE_{test} is high. This is expected because increasing the size of the training set increases the size of the GP region, meaning *R*_{cross} in the boundary-optimized strategies will approach the fixed value of 8.5 Å. Consequently, the difference between the fixed boundary and boundary-optimized models will close as *N*_{TP} increases. Equivalent plots to Fig. 3 for all training strategies for all systems are found in the supplementary material. The RMSE data are somewhat noisy. Possible causes of this noise are minor variations in the hyperparameters upon retraining, the discrete nature and stochastic design of the reference set, and the low values of *N*_{TP} meaning that the addition of a single point has a non-smooth effect on the RMSE. The new methods herein have considerably lower noise than the prior fixed boundary method.

To compare the efficiency gain over fixed boundary training across all new strategies from Subsections II E 5 and II E 6, a metric, *E*, is used. This compares the *N*_{TP} required by two different strategies to achieve a given RMSE_{test}. When comparing a new training strategy with fixed boundary training,

*E* = 100% × *N*_{TP}/*N*_{TP,fixed}, (13)

where *N*_{TP} and *N*_{TP,fixed} are the number of training points required by the new strategy and fixed boundary training, respectively. Both quantities, and hence *E*, are functions of RMSE_{test}.

For a given RMSE_{test}, the values of *N*_{TP} and *N*_{TP,fixed} are determined from least squares fits of log_{10}(RMSE_{test}) vs log_{10}(*N*_{TP}), with examples of such fits shown in Fig. 3 for the HF–Ne potential. Fits are made in the region where the RMSE decays as a power law of *N*_{TP}. For all systems, this corresponds to 1 × 10^{−6} *E*_{h} ≤ RMSE_{test} ≤ 1 × 10^{−4} *E*_{h}, apart from HF–Na^{+}, where the region is 1 × 10^{−5} *E*_{h} ≤ RMSE_{test} ≤ 1 × 10^{−3} *E*_{h} due to the larger high-energy cutoff for this potential. The fits provide continuous lines from which *N*_{TP} can be interpolated at any RMSE_{test} within the above range. This enables a comparison of *N*_{TP} between training strategies at fixed RMSE_{test} that accounts for the somewhat noisy data. Moreover, this is the range of errors in which the models are useful for applications and in which the decrease in log(RMSE_{test}) with log(*N*_{TP}) is linear.
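The fitting and comparison procedure can be sketched in Python: a least squares fit in log–log space yields power-law coefficients, which are then inverted to give *N*_{TP} at a chosen RMSE_{test} and hence *E*. The function names are illustrative, and the form of *E* (a percentage ratio of training-set sizes) follows our reading of Eq. (13).

```python
import numpy as np

def fit_power_law(n_tp, rmse):
    """Least squares fit of log10(RMSE) vs log10(N_TP).  Returns (a, b)
    such that RMSE ~ a * N_TP**b, the form used in Table IV."""
    slope, intercept = np.polyfit(np.log10(n_tp), np.log10(rmse), 1)
    return 10.0 ** intercept, slope

def efficiency(rmse_target, fit_new, fit_fixed):
    """E = 100% * N_TP(new) / N_TP(fixed) at a fixed RMSE_test.
    Each fit is an (a, b) pair from fit_power_law."""
    def n_tp(fit):
        a, b = fit
        return (rmse_target / a) ** (1.0 / b)  # invert RMSE = a * N**b
    return 100.0 * n_tp(fit_new) / n_tp(fit_fixed)
```

Interpolating through the fitted lines, rather than comparing raw data points, smooths out the scatter noted above.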

The fitted equations for each training strategy for the CO–Ne system are given in Table IV as power laws in *N*_{TP}. Equivalent tables for all other systems are found in the supplementary material. As the data are noisy, the tables in the supplementary material also show the R^{2} value of each fit. In the case of CO–Ne, these confirm the high quality of the fits. In fact, no fit from any training strategy for any system has an R^{2} value lower than 0.8907 (from the fixed boundary training strategy on the HF–Na^{+} system). Furthermore, Fig. 3 and the corresponding plots in the supplementary material show that the RMSE data follow a straight line (in a log–log plot). This implies that the deviation of R^{2} from unity arises from scatter in the data rather than unsuitability of the fitting function.

Training strategy | Line of best fit to RMSE_{test} | R^{2}
---|---|---
Single-constrained placement | 4.694 *N*_{TP}^{−3.842} | 0.929
Multi-constrained placement | 1.551 *N*_{TP}^{−3.647} | 0.944
Single-open placement | 2.901 *N*_{TP}^{−3.718} | 0.921
Multi-open placement | 3.113 *N*_{TP}^{−3.830} | 0.958
Closest model | 0.865 *N*_{TP}^{−3.490} | 0.944
Fixed boundary | 546.0 *N*_{TP}^{−5.069} | 0.914


Re-arranging the equations in Table IV provides expressions for *N*_{TP} in terms of RMSE_{test} for the CO–Ne potential. These give *E* as a function of RMSE_{test}, via Eq. (13), for all new training strategies herein. This was done for all other potentials as well, with plots of *E* against RMSE_{test} for all training strategies for each system given in Fig. 4.

As observed earlier, the training efficiency gains in Fig. 4 are more pronounced at high RMSE_{test} for all training strategies across all systems. This suggests that boundary optimization is most effective when the training set is small, making this technique ideal for applications where a computationally cheap but less accurate potential is required. Nevertheless, significant reductions in the number of training points are also obtained in the RMSE range where PESs become useful for first-principles predictions. For example, a GP potential with an RMSE of 3 × 10^{−4} eV per atom (1.1 × 10^{−5} *E*_{h} per atom) was employed in a recent simulation of the thermal properties of *β*-Ga_{2}O_{3}.^{31} Furthermore, Uteva *et al.* successfully determined the CO_{2}–CO second virial coefficient using a GP PES with an RMSE of 2.4 × 10^{−5} *E*_{h}.^{17} In this RMSE range, the boundary optimization methods typically reduce the required number of training points by 15%–33% (see Fig. 4).

Generally, the closest model strategy generates the largest efficiency gain, while the smallest improvement comes from strategies involving C_{single}. Efficiency gains only slightly below those from the closest model strategy are achieved by the strategies that use C_{multi}. This hierarchy of improvement implies that the choice of classifier, rather than the point placement strategy, is most important because strategies with the same classifier perform more similarly than those with the same point placement method. The closest model strategy is included only to illustrate the total training efficiency gain possible from an “ideal” classifier, and it is encouraging that the best boundary optimization methods are close to this ideal case. Indeed, the difference in *E* between this ideal method and the closest-performing C_{multi} strategy never exceeds ∼3% for any system. This implies that C_{multi} sufficiently captures the true shape of the boundary region for the systems explored. Thus, introducing a more detailed classifier would not be worth the increased cost of evaluating the PES.

C_{multi} always outperforms C_{single}. However, for the CO_{2}–Ne potential [Fig. 4(d)], C_{multi} also outperforms the closest model strategy. This is because the long-range function is nearly exact for a group of short-range configurations in the reference set. This is shown in Fig. 5 by the thin “peninsula” of points that are best estimated by the long-range function (in red), which encroaches deep into the GP region (in blue). This “peninsula” exists because the long-range function is of higher energy than the MP2 data in some regions of the PES and of lower energy in others, meaning there must be some hypersurface in between where the two are equal. Prediction for test configurations close to the “peninsula” is problematic under C_{optimal} unless the reference set is very dense in this region. This is because the exact predictions of the long-range function on the “peninsula” mean no training points are added there, leading the closest model strategy to perform relatively poorly for the CO_{2}–Ne potential.

The total gain in training efficiency achieved by boundary optimization varies somewhat between systems. While the best-performing training strategy for HF–Ne improved training efficiency by between 25% and 39% (i.e., *E* = 61%–75%), for (CO_{2})_{2}, the gain was only 12%–18% (*E* = 82%–88%). The more limited gains for (CO_{2})_{2} may be because the *a priori* choice of R_{cross} = 8.5 Å, required for the fixed boundary method, is reasonable for this system. Nevertheless, even for the (CO_{2})_{2} potential, the use of multi-constrained training confers an efficiency gain of ∼18% over fixed boundary training. This means that for all of the potentials explored, the use of a training strategy that employs C_{multi} confers a useful improvement over fixed boundary training.

Boundary optimization improves the training efficiency due to more effective placement of training points. This is illustrated in Fig. 6, which shows, for HF–Ne, the differences in training point placement for three training methods: fixed boundary, single-constrained, and closest model. While all place most points at separations below 3 Å, indicating that the repulsive wall is the hardest region to model, the placement of points at larger separations diverges between methods. For the first 20 training points, few configurations are placed beyond 3 Å for the single-constrained and closest model strategies. Thereafter, the distance at which training points are added slowly increases, even for the closest model strategy, which does not employ a boundary directly. In fact, single-constrained training adds training points beyond 8 Å only after ∼90 training points have been placed and the closest model strategy does not add any training points above 7 Å at all.

In contrast, the fixed boundary training method adds its fifth training point at a minimum separation of above 5 Å and its eighth at 8.5 Å. This demonstrates that a fixed boundary strategy switches between placing training points in the repulsive wall and at the boundary from the outset of training. This difference is because fixed boundary training requires the GP to predict energies at separations up to 8.5 Å from the start of training. Consequently, training points must be added at separations near the 8.5 Å boundary from the outset, even though the energies at these separations are very small and generally well-approximated by the long-range function. This contrasts with the boundary-optimized and closest model strategies, which allow the long-range function to approximate configurations at separations around 8.5 Å when the number of training points is low. This produces more efficient model development under boundary-optimized and closest model training because point placement can be focused on the short-range region of the PES, where the energy varies more rapidly with configuration.

Training point plots for the CO–Ne potential, given in Fig. 7, show that the single-open strategy extends A_{GP} much further than single-constrained training. This suggests that the capacity to place points in A_{LR} facilitates the faster expansion of the boundary for this system. However, these plots also indicate that this discrepancy is most noticeable when the number of training points is large; specifically, the rapid expansion of the boundary under single-open training was instigated by the placement of a single training point at long range, which facilitated the increase in *R*_{cross} to 63.0 Å. Prior to this, the crossover value from single-open training was quite similar to that from single-constrained training (both were ∼17 Å). Hence, the two training strategies differ significantly only when the predictions from GP regression are already highly accurate, as only at this stage are the predictions of the long-range function poor enough by comparison to merit the placement of a training point at long range. The other systems also show a close similarity between the crossover distances for the single-constrained and single-open strategies, with equivalent plots to Fig. 8 found in the supplementary material. In addition, Fig. 8 shows that for the CO_{2}–Ne potential, the values of *R*_{cross} from the two strategies are similar throughout training. Such a trend suggests that the choice of point placement strategy does not significantly change the value of *R*_{cross}, implying once more that the choice of classifier is of greater importance.

Similarities in the evolution of the boundary are seen when the crossover distances achieved under C_{multi} and C_{single} are compared for a given point placement method. Figure 9 shows the results of such a comparison for the HF–Na^{+} and CO_{2}–Ne systems. The crossover distances generally grow at a similar rate, but the single distance *R*_{cross} from C_{single} is consistently larger than every component of **R**_{cross} from C_{multi}. This suggests that there are differences between C_{single} and C_{multi} that manifest in both the training efficiency and boundary placement. The larger GP region under C_{single} compared with C_{multi} is explained by noting that both are approximations of an “ideal” classifier. When *N*_{TP} is low (i.e., one or two), such a classifier will attempt to make A_{GP} as small as possible because the training set comprises configurations from the repulsive wall only. Consequently, given the same training set, C_{single} and C_{multi} also try to minimize the size of A_{GP}. As C_{multi} is more flexible, its approximation of the “ideal” classifier will be closer than that of C_{single}, meaning A_{GP} under C_{single} will initially be larger. Other than the repulsive wall, the largest errors tend to occur at separations around the boundary. Thus, due to its larger A_{GP}, a C_{single} strategy will place its first long-range training point at a larger separation than a C_{multi} strategy. This facilitates faster expansion of the boundary under C_{single} than C_{multi}, regardless of which point placement method is used.

Figure 9(a) shows that *R*_{H–Na^{+}} < *R*_{F–Na^{+}} throughout most of the training for the HF–Na^{+} potential, meaning that the interaction involving the larger atom (F) obeys a larger crossover distance. Such ordering of the crossover distances also applies to the HF–Ne system for most of the training. Moreover, Fig. 9(b) shows that the ordering of the crossover distances for the CO_{2}–Ne potential is consistent, with *R*_{O–Ne} < *R*_{C–Ne} throughout the training. In fact, for the (CO_{2})_{2} and CO–Ne potentials, it holds generally throughout training that *R*_{O–O} < *R*_{O–C} < *R*_{C–C} and *R*_{O–Ne} < *R*_{C–Ne}, respectively, for both point placement methods. This implies that the ordering of the crossover distances under C_{multi} is consistent between systems as well as being physically reasonable.

Additionally, Figs. 6–9 show that the crossover distance increases with the size of the training set. This is expected because a larger training set means the GP has more information with which to infer the function describing the PES. The resultant more accurate GP enables larger crossover distance(s). This is the case for all training strategies across all systems.

Boundary optimization methods decrease the cost of using a PES by reducing both the number of training points and the size of the GP region. Fewer training points mean GP evaluations are cheaper, as the GP prediction cost is proportional to *N*_{TP}. Additionally, the lower *R*_{cross} or **R**_{cross} of the boundary-optimized model means that in any application the GP would be used less often, in favor of the much cheaper asymptotic function. Hence, boundary optimization produces models that are more efficient to implement in applications than fixed boundary training, with no reduction in overall accuracy.

Boundary optimization also increases the reproducibility of training. That is, for two separate models from the same training strategy and reference set, the results will be more similar for a boundary-optimized strategy compared to using a fixed boundary. For example, when modeling the CO_{2}–Ne potential with single-open training, the results from two separate training runs were identical. This is shown in Fig. 10. This is noteworthy because the values of the hyperparameters selected when maximizing log(*L*) can vary slightly, even for the same training set. Such variations can alter the predictions of the GP, leading to different values for *R*_{cross} or **R**_{cross} and the selection of different training points. Thus, the fact that two separate runs of the same training strategy are totally identical is encouraging.

Furthermore, Fig. 10 shows that for fixed boundary training, separate runs were identical only up to *N*_{TP} = 27. While the exact reproducibility in Fig. 10(a) is not present for the other potentials, there is generally significantly less difference between independent runs of the boundary-optimized training methods than for fixed boundary training. For example, models of the HF–Ne potential from single-constrained and single-open training are reproducible up to *N*_{TP} = 30 and *N*_{TP} = 41, respectively, compared with *N*_{TP} = 16 for fixed boundary training. Plots equivalent to Fig. 10 for this system under these strategies are given in the supplementary material. This reproducibility increase compared to fixed boundary training does not transfer to the use of the C_{multi} strategies, likely because a direct search is not as reproducible in multiple dimensions as in a single dimension. However, from the R^{2} values in Table IV, it can be seen that the use of either C_{multi} or C_{single} reduces the scatter in the RMSE_{test} data relative to the use of C_{fixed}, as evidenced by the larger R^{2} values achieved when training with the former two. This trend is repeated across all other potentials examined here, which suggests that the use of boundary optimization leads to more stability in the selection of the hyperparameters and hence more consistent predictions.

For the (CO_{2})_{2} system, reproducibility is seen not just for repeat runs of identical training strategies but also between the training strategies that use C_{single}. This is illustrated in Fig. 11, which shows that the single-constrained and single-open training strategies choose identical training points until *N*_{TP} = 210. Such an observation explains why the two methods have identical *E* in Fig. 4(e). This is because when *N*_{TP} = 210, the RMSE_{test} values for single-open and single-constrained placement were 2.817 × 10^{−7} *E*_{h} and 2.784 × 10^{−7} *E*_{h}, respectively. Consequently, the error was too low at this *N*_{TP} to be included in the fit, meaning the two strategies were identical over the RMSE_{test} values used in fitting.

## IV. CONCLUSION

It has been shown that boundary optimization produces GP PESs of the same accuracy using fewer training points than fixing *R*_{cross} *a priori*. This improvement in efficiency is hierarchical, with a boundary defined by a single, variable crossover distance offering a modest improvement and a boundary defined by multiple such distances facilitating a further gain.

The results presented imply that the classifier is more important to the training strategy than the point placement method. In the RMSE range that is suitable for first-principles calculations (∼2 × 10^{−5} *E*_{h}), the boundary optimization methods typically reduce the required number of training points by 15%–33% relative to a training strategy that is already established as efficient.^{19} Because of their reduced training set size, the resulting boundary-optimized PESs are strong candidates for transfer learning, in which the existing *ab initio* calculations are upgraded to a higher level of theory. Furthermore, as the size of the GP region grows with the training set only as needed, the resulting GPs are also less computationally intensive in applications than those from fixed boundary methods, as they employ the GP over a smaller region of phase space.

The classifier C_{multi}, which uses different crossover distances for different atomic pairs, performed almost as well as an “ideal” classifier. Across all systems, the largest difference in performance between the closest model strategy, which uses an “ideal” classifier, and the nearest boundary-optimized strategy was ∼3%. In all cases, the best-performing boundary-optimized strategy employed C_{multi}, implying that a classifier comprising a spherical boundary with a unique radius on each unique atomic pair captures the true boundary effectively. Further refinement of the classifier would not result in a sufficient reduction in training points to justify the extra classifier expense.

The crossover distances are learned from the reference data under boundary optimization using a direct search. This is sufficiently fast to be used at every stage of training whether one or many crossover distances are employed and in the multi-dimensional case returns crossover distances in a physically reasonable order. Moreover, both direct search algorithms can be used easily in conjunction with another machine learning technique or on another chemical interaction. In fact, a direct search can be applied to any problem whereby a boundary is sought between a good method of approximation in one region of phase space (the long-range function here) and a machine learning technique in another. This means that the methodology that underpins boundary optimization is both fast and flexible, in addition to being effective in solving the problem of reducing the computational expense associated with training a GP model of an intermolecular PES.

Physical systems in which the behavior crosses over from a simple asymptotic function to more complicated behavior are common in many fields. A prominent example is the transition from ideal to non-ideal gas behavior. As the boundary optimization techniques herein exploit this crossover in behavior, there are many potential applications of this technique to physical problems beyond intermolecular potentials.

## SUPPLEMENTARY MATERIAL

See the supplementary material for more plots of results from the explored systems.

## ACKNOWLEDGMENTS

The computations of all reference and test sets described in this paper were performed using the University of Nottingham’s Augusta HPC service, which provides a High Performance Computing service to the University’s research community. See https://www.nottingham.ac.uk/it-services/research/hpc/ for more details. The authors also acknowledge funding provided by the Leverhulme Trust Doctoral Scholarship through Modelling and Analytics for a Sustainable Society (MASS) at the University of Nottingham.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

## DATA AVAILABILITY

The data that support the findings of this study are openly available in the Boundary Optimization Data repository at https://doi.org/10.6084/m9.figshare.16610713.v1.^{74}

## APPENDIX: DERIVATION OF AN EMPIRICAL LONG-RANGE FUNCTION

HF–Na^{+} was the only intermolecular potential examined above for which a multipolar long-range function was not used. Instead, a fitted empirical long-range function was employed. Although determining an optimal long-range function was not the goal of this work, one that was accurate enough that all configurations were not rapidly transferred to the GP region was needed for proof of concept. The multipolar long-range function was unsuitable for this because of the discrepancy between its predictions and the MP2 energies.

The evidence of this is given in Fig. 12, which also highlights the superiority of the predictions of the empirical long-range function. The RMSEs of each method of prediction were separated by two orders of magnitude, with the empirical function achieving an RMSE of 3.07 × 10^{−7} *E*_{h} vs 4.15 × 10^{−5} *E*_{h} for the multipolar function against the MP2 data shown in Fig. 12.

Figure 12 evidences that the multipolar long-range function captured the power law of the interaction energy for this system in this configuration, with the source of its error being that it was offset from the calculated energies. This offset was a product of the energies being calculated using MP2, while many of the properties used to derive the multipolar long-range function were calculated at higher levels of theory (see Table II). In the other systems, the small magnitude of the long-range energies meant that this disagreement was negligible and the multipolar long-range function was usable. However, in the case of HF–Na^{+}, long-range energies with larger magnitudes were commonplace due to the strong repulsive and attractive interactions between the H–F dipole and the Na^{+} cation. This exacerbated the difference between the predictions of the multipolar long-range function and the MP2 energies.

To approximate the long-range data accurately without upgrading the reference and test data to a higher level of theory, a long-range function was produced by fitting directly to these data. This was the empirical long-range function. Taking *r* to be the distance between the center of the H–F bond (not the center of mass) and the Na^{+}, this function estimated the energy, *E*, as a sum of power laws,

*E*(*r*) = *c*_{2}*r*^{−2} + *c*_{3}*r*^{−3}. (A1)

In doing so, the empirical long-range function exploits the fact that the dominant powers of *r* in the HF–Na^{+} interaction are known to be −2 and −3, while treating the coefficients of these terms as unknown.

As the energy varies differently with *r* when *θ* changes, a sum such as in Eq. (A1) must be fitted for every configuration in a given dataset for which *θ* is unique. For a given *θ*, the coefficients in Eq. (A1) can be found using simultaneous equations, which are set up by following Algorithm 4. In all cases, *r*_{min} = 8.5 Å and *r*_{max} = 10.5 Å, although the power laws that resulted were accurate up to 100 Å. Furthermore, both GPs in Algorithm 4 were trained on ∼300 training points from an LHC design.

ALGORITHM 4. Fitting the empirical long-range function at a given *θ*.

1. Train a GP, GP_{min}, on a range of *θ* values at separation *r* = *r*_{min}; do the same for another GP, GP_{max}, at *r* = *r*_{max}.
2. For a given *θ*, predict *E*_{min} and *E*_{max} using GP_{min} and GP_{max}, respectively.
3. Set up simultaneous equations of the form shown in Eq. (A1): one with *E* = *E*_{min} and *r* = *r*_{min} and another with *E* = *E*_{max} and *r* = *r*_{max}.
4. Solve the equations from step 3 for the coefficients at the current *θ*.

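As a concrete sketch of steps 3 and 4, the two simultaneous equations reduce to a 2 × 2 linear system in the unknown coefficients. The snippet below assumes Eq. (A1) takes the form *E*(*r*) = *c*_{2}*r*^{−2} + *c*_{3}*r*^{−3}; the numerical energies stand in for the GP predictions *E*_{min} and *E*_{max} and are illustrative values only, not data from this work.

```python
import numpy as np

def fit_power_law_coefficients(r_min, r_max, E_min, E_max):
    """Solve the 2x2 linear system of steps 3-4 of Algorithm 4:
        E_min = c2 * r_min**-2 + c3 * r_min**-3
        E_max = c2 * r_max**-2 + c3 * r_max**-3
    for the coefficients (c2, c3) at a single value of theta."""
    A = np.array([[r_min**-2, r_min**-3],
                  [r_max**-2, r_max**-3]])
    b = np.array([E_min, E_max])
    c2, c3 = np.linalg.solve(A, b)
    return c2, c3

def empirical_long_range_energy(r, c2, c3):
    """Evaluate the fitted sum of power laws, Eq. (A1)."""
    return c2 * r**-2 + c3 * r**-3

# Illustrative energies only; in practice these are the predictions
# of GP_min and GP_max at the chosen theta.
c2, c3 = fit_power_law_coefficients(8.5, 10.5, -1.0e-3, -5.0e-4)
```

By construction, the fitted function reproduces the two anchoring energies at *r*_{min} and *r*_{max} exactly, and extrapolation to larger *r* follows the assumed power laws.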

By repeating steps 2–4 in Algorithm 4, a sum of power laws was determined for every *θ* value in the reference set and test set. Fitting to the latter was possible because knowledge of the energies in the set being fitted was not required: the GP predictions were based on their respective training sets, and *r* and *θ* were found from the inverse interatomic separations at each configuration alone.

Deriving a long-range function from the predictions of two GPs could be problematic if a transfer learning approach were invoked, as the data used to train these GPs would themselves need to be upgraded. For the HF–Na^{+} potential, this would not be an issue because increasing the quality of the training data would also improve the fit of the multipolar long-range function, which could then be used instead.

However, when an empirical function is the only option for modeling the long-range data, upgrading the data used in training GP_{min} and GP_{max} would be of considerable computational expense, as each training set currently comprises ∼300 configurations. Such an issue could be circumvented by using a sequential design strategy to build minimal training sets for these GPs, which could then be upgraded instead. Furthermore, the training sets used for GP_{min} and GP_{max} are independent of that used to train a GP on the wider PES. Letting the number of training points in the training sets of GP_{min} and GP_{max} be Y and that in the training set of the other GP be Z, the cost of short-range predictions in any simulation would scale linearly with Z and the cost of any long-range predictions would scale linearly with 2Y. Given that the training of GP_{min} and GP_{max} takes place at a fixed separation, it is likely that 2Y < Z if all GPs were trained under a sequential design strategy. This means that the predictions of the empirical long-range function would not be the computational bottleneck and that such a function is suitable for use in simulations. Finally, the method is sufficiently flexible that the GP predictions can easily be replaced by those of another statistical method.
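The scaling argument above can be illustrated with a toy predictor that dispatches between a short-range GP and the empirical long-range function at a boundary separation. The function name, the fixed boundary, and the stand-in GP predictor below are illustrative assumptions for a sketch, not the implementation used in this work; the per-*θ* coefficients are taken as already fitted via Algorithm 4.

```python
# Hypothetical dispatcher. A GP prediction costs O(Z) in its number of
# training points, whereas the long-range branch only evaluates the
# precomputed sum of power laws, whose fit cost O(2Y) GP predictions
# (one each from GP_min and GP_max per theta, done once in advance).

def predict_energy(r, theta, boundary, gp_predict, coeffs_by_theta):
    """Return the energy at (r, theta): GP below the boundary,
    empirical sum of power laws, Eq. (A1), beyond it."""
    if r <= boundary:
        return gp_predict(r, theta)      # short range: cost ~ Z
    c2, c3 = coeffs_by_theta[theta]      # fitted once per theta
    return c2 * r**-2 + c3 * r**-3       # long range: fit cost ~ 2Y
```

Because the long-range branch is a closed-form evaluation, its per-call cost is negligible next to the GP's, consistent with the claim that the empirical function is not the computational bottleneck.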

## REFERENCES
