A new force field called FFLUX uses the machine learning technique *kriging* to capture the link between the properties (energies and multipole moments) of topological atoms (i.e., output) and the coordinates of the surrounding atoms (i.e., input). Here we present a novel, general method of applying kriging to chemical systems that do not possess a fixed number of (geometrical) inputs. Unlike traditional kriging methods, which require an input system to be of fixed dimensionality, the method presented here can be readily applied to molecular simulation, where an interaction cutoff radius is commonly used and the number of atoms or molecules within the cutoff radius is not constant. The method described here is general and can be applied to any machine learning technique that normally operates under a fixed number of inputs. In particular, the method described here is also useful for interpolating methods other than kriging, which may suffer from difficulties stemming from identical sets of inputs corresponding to different outputs or input biasing. As a demonstration, the new method is used to predict 54 energetic and electrostatic properties of the central water molecule of a set of 5000, 4 Å radius water clusters, with a variable number of water molecules. The results are validated against equivalent models from a set of clusters composed of a fixed number of water molecules (set to ten, i.e., decamers) and against models created by using a naïve method of treating the variable number of inputs problem presented. Results show that the 4 Å water cluster models, utilising the method presented here, return similar or better kriging models than the decamer clusters for all properties considered and perform much better than the truncated models.

## I. INTRODUCTION

Machine learning algorithms are rapidly gaining popularity within computational chemistry, owing to their ability to accurately predict chemical properties of interest in significantly less time than would be required to calculate or measure said properties using conventional means.^{1–7} As the computational expense of predicting atomic and molecular properties through use of machine learning models is substantially less than calculating such properties via quantum chemical calculations, use of machine learning models in computational chemistry opens the door to the development of quantum-accuracy molecular dynamics simulations in a fraction of the computational time. Work towards the development of a machine-learning-based force field (known as FFLUX^{8} and previously known as the QCTFF) is currently underway in the authors’ group. FFLUX views the individual atoms of a system as malleable boxes of electron density, known as topological atoms and defined by the Quantum Theory of Atoms in Molecules (QTAIM).^{9,10} These topological atoms can then interact at long range through multipolar electrostatics; at short range through the combination of inter-atomic Coulomb, exchange, and correlation energies; and also through the influence of any other energy terms relevant to the chosen energy partitioning scheme (such as the atomic self-energy term defined through the Interacting Quantum Atoms (IQA)^{11} energy partitioning scheme). As the properties of these topological atoms are a function of the system around them, they lend themselves well to prediction through machine learning. In order to drive the dynamics of a system, FFLUX uses this efficient prediction approach to obtain the necessary properties in a fraction of the time required to calculate them from scratch.

At the core of FFLUX is the machine learning method “kriging” or “Gaussian Process Regression,”^{12} which has repeatedly demonstrated its versatility, accurately predicting atomic and molecular properties for amino acids,^{13} carbohydrates,^{14} oligopeptides,^{15} and water clusters.^{16} Developed from a geostatistical background,^{17} and applied to computer experiments by Sacks *et al.*,^{18} then Jones,^{19,20} kriging is an interpolating predictor that relates a set of outputs to a corresponding set of inputs in order to predict a given property of interest. For a chemical system where the value of a property of interest is a function of the system’s atomic configuration, the required set of kriging inputs can be as straightforward and intuitive as a full description of the system’s atomic coordinates. However, as a kriging model is only defined for a constant number of inputs, a single model cannot be readily applied to similar systems of different sizes (i.e., with a differently sized description of the system’s configuration).

In an attempt to reduce the complexity of high-dimensional kriging problems, recent work by Fletcher and Popelier^{15} has demonstrated that, for large systems, accurate kriging predictions can be obtained by reducing the number of inputs to only those describing the immediate surroundings of an atom of interest. As the strength of interaction between two atoms usually decreases with increasing distance, their results^{15} are intuitive, and that method can be viewed as similar to the application of the cutoff radius commonly applied in molecular dynamics simulations, beyond which interactions are considered negligible. Thus, the reliance of machine learning models in general, and kriging in particular, on a fixed number of inputs (i.e., surrounding atoms), instead of a flexible number of inputs determined by interaction distance, must be seen as a limitation of their application to chemical systems. This limitation will be overcome in the current article.

This work aims to address the problems associated with applying machine learning models, and in particular kriging, to chemical systems that may possess a variable number of inputs, by introducing a novel, general method to account for such systems. Although methods of addressing these problems have previously been proposed for particular input schemes (e.g., the works of Rupp *et al.*,^{4} Hansen *et al.*,^{5} Bartok *et al.*,^{21} and Behler^{22}), the method proposed here is the first for inputs defined directly by a molecular system’s geometry. In addition, the method proposed here is general, and is equivalent to the methods above when applied to their input definition schemes.

The rest of this paper is structured as follows: after a brief introduction to kriging, a description is presented outlining how the inputs of the kriging models used here are defined. After this, an overview of the difficulties faced when attempting to apply kriging to systems with a variable number of inputs is discussed, before a novel method addressing these problems is presented. The proposed method is then applied to the prediction of 54 different properties of liquid water clusters of radius 4 Å, sampled from a molecular dynamics simulation. Prediction results are obtained and compared to similar systems, via a number of validation metrics. Finally, major conclusions of the current work are summarized and presented in Sec. V.

## II. METHOD

### A. Kriging

Kriging is an interpolating machine learning technique, capable of mapping an output’s response to a given set of inputs (also known as *features*).^{18–20} In simple kriging, the value of a given output, $y(\mathbf{x})$, dependent upon a particular atomic configuration with a fixed number of descriptors ($\mathbf{x}$), is expressed as the sum of the output’s mean, $\mu$, and an error term $\epsilon(\mathbf{x})$,

$$y(\mathbf{x}) = \mu + \epsilon(\mathbf{x}) \qquad (1)$$

When kriging chemical systems, $y(\mathbf{x})$ in Equation (1) represents some property of interest (e.g., the net charge on an atom of choice), while $\epsilon(\mathbf{x})$ represents some deviation from the mean value of the property due to the particular atomic configuration the system is currently in. Intuitively, kriging assumes the smoothness of physical phenomena in space by considering that the value of a property at a given location is more likely to be similar in value to that at nearby points than at points further away. Thus, a kriging model must be *trained* using a set of data from which the correlation relation may be optimized, and from which correlations can be used to predict outputs at new points. A predicted output, $\hat{y}$, for a given set of inputs, $\mathbf{x}^*$, is given through the following equation:

$$\hat{y}(\mathbf{x}^*) = \hat{\mu} + \sum_{i=1}^{n} a_i \, \varphi(\mathbf{x}^* - \mathbf{x}^i) \qquad (2)$$

where $\hat{\mu}$ is a prediction of the global mean estimated from the training data consisting of $n$ training points,^{19} $\mathbf{a}$ is a vector of constant weights, and $\varphi(\mathbf{x}^* - \mathbf{x}^i)$ is a basis function relating the input $\mathbf{x}^*$ to the $i$-th training point $\mathbf{x}^i$. In this work, the correlation between points is described through the basis function

$$\varphi(\mathbf{x}^* - \mathbf{x}^i) = \exp\!\left(-\sum_{h=1}^{N_f} \theta_h \left| x_h^* - x_h^i \right|^{p_h}\right) \qquad (3)$$

where $N_f$ is the number of features describing the system (i.e., the dimensionality of the problem), and $\theta_h$ and $p_h$ are hyper-parameters corresponding to feature $h$, which are obtained through maximizing the log-likelihood function,

$$\ln L = -\frac{N_t}{2}\ln(2\pi) - \frac{N_t}{2}\ln(\sigma^2) - \frac{1}{2}\ln|\mathbf{R}| - \frac{(\mathbf{y} - \mathbf{1}\mu)^{T}\,\mathbf{R}^{-1}\,(\mathbf{y} - \mathbf{1}\mu)}{2\sigma^2} \qquad (4)$$

where $\mathbf{y}$ is the column vector of the modelled property evaluated at each of the $N_t$ training points; $\mathbf{1}$ is a column vector of ones; $(\cdot)^{T}$ denotes the transpose of its argument; $\sigma^2$ is the variance of the process estimated from the data;^{19} and $\mathbf{R}$ is the correlation matrix, with element $R_{ij} = \varphi(\mathbf{x}^i - \mathbf{x}^j)$ and determinant $|\mathbf{R}|$.

As the point to be predicted (i.e., a test point) approaches a training point (i.e., $\mathbf{x}^* \to \mathbf{x}^i$), it can be seen from Equation (3) that the correlation between the test point and the training point will increase to one. This means that *simple* kriging is an interpolative predictor, able to perfectly predict any test point that has exactly the same input coordinates as one of the model’s training points. From Equation (3), it can also be seen where the requirement for a fixed number of inputs arises: as the correlation between two points is a function of the *kriging distance* between them, both points must exist within the $N_f$-dimensional *feature space* for which the basis function is defined.
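To make Equations (2) and (3) concrete, the following sketch (hypothetical code, not the FEREBUS implementation; the toy data and all names are illustrative) builds a simple kriging predictor for fixed hyper-parameters $\theta_h$ and $p_h$:

```python
import numpy as np

def basis(x_star, x_i, theta, p):
    """Equation (3): exp(-sum_h theta_h * |x*_h - x^i_h| ** p_h)."""
    return np.exp(-np.sum(theta * np.abs(x_star - x_i) ** p))

def krige_predict(x_star, X_train, y_train, theta, p):
    """Equation (2): y_hat = mu_hat + sum_i a_i * phi(x* - x^i)."""
    n = len(X_train)
    # Correlation matrix R, with element ij = phi(x^i - x^j).
    R = np.array([[basis(X_train[i], X_train[j], theta, p)
                   for j in range(n)] for i in range(n)])
    ones = np.ones(n)
    R_inv = np.linalg.inv(R)
    # Global mean estimate and constant weight vector a = R^-1 (y - 1*mu).
    mu_hat = (ones @ R_inv @ y_train) / (ones @ R_inv @ ones)
    a = R_inv @ (y_train - mu_hat * ones)
    r = np.array([basis(x_star, X_train[i], theta, p) for i in range(n)])
    return mu_hat + r @ a

# Toy 2-feature data: kriging exactly interpolates its training points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 2.0])
theta = np.array([1.0, 1.0])
p = np.array([2.0, 2.0])
print(krige_predict(X[2], X, y, theta, p))  # ~1.0, the training value y[2]
```

Evaluating the predictor at one of its own training points recovers the stored output, illustrating the interpolative character of simple kriging noted above.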

### B. Defining features for molecular systems

In a molecular system, many properties of interest are a function of the atomic configuration of the system. Thus, by using the configurational information of a molecular system as its input, a machine learning model is well suited to predicting such molecular properties. Multiple methods have been proposed to represent the configurational information of a system as features. A shortlist of these includes the molecular local frame (MLF) of Handley *et al.*,^{1} an expansion of a local environment into spherical harmonics with atomic neighbourhood densities expressed using Gaussians (known as SOAP) by Bartok *et al.*,^{3,21} the eigenvalues of the Coulomb matrix of Rupp *et al.*,^{4} the symmetry functions of Behler,^{22} and the atomic local frame (ALF, defined just below) of Kandathil *et al.*^{23} In this work, an ALF is used, as it is general for both rigid and non-rigid systems alike; it captures relative geometrical changes rather than absolute ones, such that the ALF produces kriging models that are transferable; and it does not reduce the dimensionality of the geometric inputs, so that no configurational information is lost.

The ALF is defined as a spherical polar coordinate frame centred on an atom of interest, with the *x*-axis of the system, and the *xy*-plane, defined by selected, surrounding atoms. An example of a spherical polar ALF is displayed in Figure 1. Here, the system is centred on the oxygen of a water molecule, the *x*-axis of the system is defined along one of the molecule’s OH bonds, and the *xy*-plane is defined to include the remaining hydrogen. According to previous work using an atom-centred, spherical polar coordinate frame,^{13,15,24} the first three features of the kriging training set, *R*_{OH1}, *R*_{OH2}, and θ_{HOH}, correspond to the two central OH bond lengths and the central water molecule’s HOH angle. External atoms were each described by a set of three features, *R*_{N}, θ_{N}, and *ϕ*_{N} for atom *N*, where *R*_{N} is the distance from the central oxygen to atom *N*, θ_{N} is the polar angle of atom *N* (measured from the *z*-axis), and *ϕ*_{N} is the azimuthal angle of atom *N* (measured from the *x*-axis). Such a scheme results in a total of 3*N* − 6 features (e.g., 12 for the water dimer displayed in Figure 1).
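The ALF feature construction just described can be sketched as follows (a hypothetical implementation; the axis-construction details and all names are assumptions for illustration, not the exact FFLUX convention):

```python
import numpy as np

def alf_features(O, H1, H2, externals):
    """Features for a water-centred ALF: R_OH1, R_OH2, theta_HOH,
    then (R_N, theta_N, phi_N) for each external atom N."""
    # Local axes: x along O->H1; z perpendicular to the HOH plane.
    x_ax = (H1 - O) / np.linalg.norm(H1 - O)
    v2 = H2 - O
    z_ax = np.cross(x_ax, v2)
    z_ax /= np.linalg.norm(z_ax)
    y_ax = np.cross(z_ax, x_ax)          # completes a right-handed frame

    feats = [np.linalg.norm(H1 - O),
             np.linalg.norm(H2 - O),
             np.arccos(np.dot(x_ax, v2 / np.linalg.norm(v2)))]

    for atom in externals:
        d = atom - O
        # Express d in the local frame, then convert to spherical polars.
        loc = np.array([np.dot(d, x_ax), np.dot(d, y_ax), np.dot(d, z_ax)])
        r = np.linalg.norm(loc)
        feats += [r,
                  np.arccos(loc[2] / r),       # polar angle from z-axis
                  np.arctan2(loc[1], loc[0])]  # azimuth from x-axis
    return np.array(feats)

# A water dimer (6 atoms) gives 3*6 - 6 = 12 features.
central = [np.zeros(3), np.array([0.96, 0.0, 0.0]),
           np.array([-0.24, 0.93, 0.0])]
other = [np.array([2.8, 0.5, 0.3]), np.array([3.4, 1.0, 0.4]),
         np.array([2.9, -0.2, 1.1])]
f = alf_features(*central, other)
print(len(f))  # 12
```

Because the frame is built from the molecule's own atoms, rigidly translating or rotating the whole cluster leaves the feature vector unchanged, which is what makes the resulting models transferable.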

### C. Kriging with a variable number of features (VNFs)

As the basis function in Equation (3) is only defined over a constant $N_f$ dimensions, all inputs must be defined over $N_f$ dimensions accordingly. There are two straightforward options to construct kriging models for a set of training points where the number of dimensions required to describe each training point is not constant across the whole set, a situation which shall be referred to as a *Variable Number of Features* (*VNF*). The first option reduces the number of dimensions of each training point to equal that of the training point with the *lowest* dimensionality. The second option increases the number of dimensions of each training point to equal that of the training point with the *highest* dimensionality.

The first option can be readily applied by considering the number of features required to describe the training point of lowest dimension and by *truncating* the number of features of all other training points to match. However, by truncating features, configurational information about the system is removed, leading to two problems:

1. Fundamental to the use of a system’s configurational information as the input to a machine learning model is the assumption that the output of interest is a function of the system’s configuration. By discarding some of this information, a machine learning model will no longer be modelling a property in its true configurational space, and inaccuracies may be introduced.

2. It will be possible for two identical sets of inputs, post-truncation, to possess different corresponding outputs. An example of this situation is displayed in Figure 2. Consider a mixed set of water trimers and tetramers, sampled from a molecular dynamics simulation, where a “central” water molecule acts as a hydrogen bond donor to two other water molecules in the trimer systems, and where the tetramer systems are sampled similarly except for the presence of a fourth water molecule (see Figure 2). If the first three water molecules of a trimer training point are configurationally identical to the first three molecules of a tetramer training point, then truncating the features corresponding to the fourth water molecule of the tetramer configuration will result in two identical sets of inputs. However, the existence of a fourth molecule in the tetramer configurations will have a quantitative effect on the atomic properties within the system. Thus, by simply truncating the features of higher dimensional training points, it is possible to obtain two identical sets of inputs with different output values. For an interpolating predictor such as simple kriging, this situation will lead to a singular correlation matrix **R**, which will not be invertible, and thus to a situation where the log-likelihood (Equation (4)) is not defined.^{25}
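The singularity problem can be seen directly in a small numerical example (toy data, illustrative names): if two truncated training points share identical inputs, the corresponding rows of **R** coincide, its determinant vanishes, and the $\ln|\mathbf{R}|$ term of Equation (4) cannot be evaluated.

```python
import numpy as np

def corr_matrix(X, theta=1.0, p=2.0):
    """R_ij = exp(-theta * sum_h |x_h^i - x_h^j| ** p), as in Equation (3)."""
    n = len(X)
    return np.array([[np.exp(-theta * np.sum(np.abs(X[i] - X[j]) ** p))
                      for j in range(n)] for i in range(n)])

# After truncation, the last two training points have identical inputs.
X = np.array([[0.0, 1.0], [0.5, 0.2], [0.5, 0.2]])
R = corr_matrix(X)
print(np.linalg.det(R))  # ~0: R is singular, so ln|R| cannot be evaluated
```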

The second option mentioned above is to extend the number of features possessed by training points of lower dimensionality until it equals that of the training point of largest dimensionality. While the basis function in Equation (3) is only defined over $N_f$ dimensions, note that $N_f$ can be set to any arbitrary value equal to or higher than the actual number of features required to describe a system of interest, which we will term *true-features*. In fact, for a system described by $N_{fT}$ true-features, the addition of $N_{fD}$ dummy-features will not affect model predictions if the additional features contribute zero to the sum in Equation (3). Such a result can be achieved in two ways: (1) if $\theta_{N_{fT}+1} = \theta_{N_{fT}+2} = \cdots = \theta_{N_{fT}+N_{fD}} = 0$ or (2) if $| x^*_{N_{fT}+j} - x^i_{N_{fT}+j} | = 0$ for all $i$ and $j$, where $j$ runs over all integers between 1 and $N_{fD}$. Accordingly, a kriging model may be designed for a system described by a VNF by ensuring that the dimensionality of the model is not less than the number of true-features required to describe the training point (TP) of highest dimensionality, while adding *dummy-features* to training points of lower dimensionality (see Figure 3).

As the values assigned to dummy-features will contribute to the sum in Equation (3), care must be given when choosing their values. In particular, there are three possible scenarios that must be considered:

1. Choosing values for dummy-features that are within the range of their equivalent true-features may inadvertently result in two training points having equivalent descriptions, despite different property values. In other words, two identical sets of inputs may correspond to different outputs. Again, such a scenario will result in multiple, identical rows of the correlation matrix **R** in Equation (4), making **R** singular and preventing its inversion.

2. Assigning dummy-feature values that are outside the sampled domain of the equivalent true-features might excessively reduce the correlation between two training points of different dimensionality, adversely affecting prediction accuracy.

3. Setting the value of dummy-features to a constant across all dimensions, regardless of whether the constant is inside or outside the sampled domain of the true-features, will introduce an avoidable bias in the distance terms in the sum of Equation (3). This bias stems from the fact that the mean value of the true-features of one dimension may be closer to the constant that the dummy-features are set to than the mean value of the true-features of another dimension. In other words, when considering dummy-features in Equation (3) (i.e., where $x_h^i = \Delta x_h^i$), if $\Delta x_h^i = c$ for all $h$, where $c$ is some constant, the distance terms in the sum in Equation (3) will be biased unless $| x_1^* - c | = \cdots = | x_{N_f}^* - c |$.

Furthermore, if the constant value is not carefully selected, the introduced bias may be non-physical. For example, consider a set of water trimers, sampled from a molecular dynamics simulation, where a “central” water molecule (Water O) acts as a hydrogen bond donor to the two other water molecules in the system (Water A and Water B), according to Figure 4(a). Here, the features of the system are described through an ALF centred on the oxygen atom of Water O, with an *x*-axis defined through the OH bond of Water O on the left of Figure 4(a), and an *xy*-plane defined to include the remaining hydrogen of Water O. By defining the hydrogen bond accepting molecule on the left of Figure 4(a) as Water A, and the hydrogen bond accepting molecule on the right of Figure 4(a) as Water B, the distribution of the $\phi_{O_N}$ features describing the oxygens in Water A and Water B will be as displayed in Figure 4(b). By setting the $\phi_{O_N}$ dummy-features equal to the arbitrary constant $\Delta_{\phi_{O_{WaterA}}} = \Delta_{\phi_{O_{WaterB}}} = 0$, dummy-features representing $\phi_{O_{WaterA}}$ will, on average, contribute significantly less to the sum in Equation (3) than the dummy-features representing $\phi_{O_{WaterB}}$ (i.e., $| \phi^*_{O_{WaterA}} - 0 | \approx 0 \ll | \phi^*_{O_{WaterB}} - 0 | \approx 3\pi/5$; see the green peak in Figure 4(b)). Accordingly, the correlation between Water A and Water O will be artificially increased, despite Water A and Water B being physically equivalent when viewed from the central oxygen, due to the water monomer’s C_{2v} symmetry.

### D. VNF method

The method presented here is designed as a general method of applying kriging to VNF systems through the use of dummy-features. Although the work here will focus on the application of the presented VNF method to chemical systems, using the configurational information of the system as the machine learning inputs, the method is general and equally valid for non-chemical applications (e.g., aeronautical and automotive). Note that other groups have developed their own ways to apply machine learning techniques to systems with a variable number of inputs, specific to the machine learning method they have chosen to use, and specific to how they have defined their machine learning inputs. For example, Rupp *et al.*^{4} apply zero-valued dummy-features to “pad out” their training sets as required, when using the ordered eigenvalues of their Coulomb matrix as machine learning inputs. While such a method is valid when using ordered eigenvalues as machine learning inputs, it is not general to other forms of inputs mentioned in Section II B.

The proposed method consists of two rules. For each set of features representing a single atom or molecule, the rules are as follows:

1. A single feature in the set is selected, and the value of all corresponding dummy-features is taken to be a value outside the sampled domain of the selected true-feature. The feature that is selected, and the assigned dummy value, should be motivated by the nature of the problem (as explained just below).

2. The values of all other dummy-features are taken to be the mean value of their corresponding true-features.

Rule 1 guarantees that each configuration is uniquely described, and that there is no risk of identical sets of inputs corresponding to different outputs. The feature that is selected, and the assigned dummy value, should be motivated by consideration of the features being used. For example, where a water molecule *N* is described by a set of nine features according to the ALF described in Section II B, the *R*_{ON} feature is an intuitive feature to select to satisfy Rule 1, as there is only one oxygen atom per water molecule, and for many properties, the distance between molecules is a convenient dimension (i.e., feature) in which to consider property changes. For this case, which will be discussed in more detail in Section III B, the dummy value can be set to a value outside the sampled range of *R*_{ON}. Alternatively, when using the eigenvalues of a Coulomb matrix as features, such as in Ref. 4, all eigenvalues would be considered separately for Rule 1, as the eigenvalues cannot be conveniently grouped into clearly defined sets. As the eigenvalues of a well-conditioned matrix cannot equal zero, zero is a convenient value to assign to the dummy-features. In fact, this is how the dummy-features used in Ref. 4 were actually valued. Beyond the features selected for Rule 1, Rule 2 minimizes the mean distance between all other dummy-features and their corresponding true-features, and removes the bias introduced by arbitrary feature valuing. An example of the application of this method is provided for 4 Å water clusters in Section III B.

## III. SYSTEM ANALYSED

### A. Clusters sampled

To demonstrate the proposed method for applying kriging to VNF systems, a cluster of water molecules, centred on an oxygen atom and with an oxygen-oxygen cutoff radius of 4 Å, was selected. In other words, all water molecules possessing an oxygen atom within 4 Å of the oxygen atom of the central molecule were selected. Water, despite being the subject of extensive research, still has peculiar characteristics that are not yet fully understood. The failure to replicate water’s structure and properties in simulation has led to the development of multiple water-specific force fields over the last thirty years,^{26,27} making bulk water an ideal system to which to apply advances in quantum chemical machine learning. Accordingly, 5000 water clusters were sampled from a previous molecular dynamics (MD) simulation, using a high-rank multipolar electrostatic potential.^{28}

Figure 5 shows the size of the sampled clusters as a function of sampling frequency. The QTAIM^{9,10} multipole moments of the individual atoms of the central molecule, as well as their IQA^{11} obtained self-energies and interaction energies, were taken as the properties of interest, for a total of 54 different properties (i.e., 25 multipole moments +2 IQA energies for both oxygen and hydrogen, so 54 = 2 × (25 + 2)) to test the VNF technique on. These properties were calculated using the program AIMAll,^{29} using the default settings and integration error control. The wavefunction files required as input for AIMAll were obtained using the package GAUSSIAN09^{30} at the B3LYP/6-311++G(d,p) level of theory. Kriging models were created and their predictions made using the in-house program FEREBUS, with Particle Swarm Optimization used to optimise the model hyperparameters.^{25,31}

As inputs for kriging models, each cluster was described by its atomic coordinates, using a spherical polar ALF as described in Section II B. Kriging models used to predict the properties of the central oxygen atom were centred on the central oxygen atom, while kriging models used to predict the properties of the central hydrogen atoms were centred on the central hydrogen atoms, as recommended by Davie *et al.*^{16} To define atom *N* of a cluster, the spatial distribution of the water molecules across the set of clusters was considered.^{16} In other words, consider an atom in a cluster of water molecules that is designated O_{4}. Whereas the ALF described in Section II B defines how O_{4} is described by spherical polar coordinates, the spatial distribution of the water molecules across the set of sampled clusters is considered when determining which specific oxygen within the cluster is designated O_{4}. To do this, for each cluster configuration, four nodes were placed on the vertices of a tetrahedron centred on the oxygen of the central water molecule, at a distance from the centre that approximately corresponds to the first peak in the water radial distribution function (as obtained from the MD simulation). The coordinates of the nodes are displayed in Table I, and a visual depiction is shown in Figure 6. From there, the non-central oxygen closest to node 1 was designated *N* = 4 in the above scheme, and its corresponding hydrogens were designated *N* = 5 and *N* = 6, with the hydrogen closest to the centre of the cluster designated *N* = 5. This is because the first three atoms of the system (i.e., *N* = 1, *N* = 2, and *N* = 3) comprise the central molecule. The node-based, atom-defining process was repeated for the closest undesignated molecule to node 2, and so on.
In clusters where more than four non-central water molecules exist, the atoms of the additional molecules were defined as a set, with the lower index going to the atoms of molecules with a lower O_{central}–O_{N} distance. Again, within each molecule, hydrogens were defined based on ordered distance to the central oxygen. Readers may refer to the work of Davie *et al.*^{16} for more details.

**Table I.** Spherical polar coordinates of the four nodes, relative to the central oxygen.

| Node | Oxygen | r (Å) | θ (deg) | ϕ (deg) |
|---|---|---|---|---|
| 1 | O_{4} | 2.65 | 90 | 0 |
| 2 | O_{7} | 2.65 | 90 | 105 |
| 3 | O_{10} | 2.65 | 17 | −128 |
| 4 | O_{13} | 2.65 | 163 | −128 |
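The node-based designation procedure described above can be sketched as follows (a hypothetical helper, with coordinates assumed to be expressed in the central molecule's frame, central oxygen at the origin; node placement follows Table I):

```python
import numpy as np

# Tetrahedral nodes at r = 2.65 Angstrom (Table I), in spherical polars.
def node_xyz(r, theta_deg, phi_deg):
    t, p = np.radians(theta_deg), np.radians(phi_deg)
    return r * np.array([np.sin(t) * np.cos(p),
                         np.sin(t) * np.sin(p),
                         np.cos(t)])

NODES = [node_xyz(2.65, 90, 0), node_xyz(2.65, 90, 105),
         node_xyz(2.65, 17, -128), node_xyz(2.65, 163, -128)]

def order_molecules(oxygens):
    """Assign each of the first four non-central oxygens to the nodes in
    node order, taking the closest undesignated oxygen for each node;
    any remaining molecules are ordered by O_central-O_N distance."""
    remaining = list(range(len(oxygens)))
    order = []
    for node in NODES:
        if not remaining:
            break
        nearest = min(remaining,
                      key=lambda i: np.linalg.norm(oxygens[i] - node))
        order.append(nearest)
        remaining.remove(nearest)
    # Leftover molecules: ascending distance from the central oxygen.
    order += sorted(remaining, key=lambda i: np.linalg.norm(oxygens[i]))
    return order

# Two non-central oxygens: the one near node 1 is designated first,
# even though it appears second in the input list.
ox = [np.array([0.0, 0.0, 2.7]), np.array([2.7, 0.0, 0.0])]
print(order_molecules(ox))  # [1, 0]
```

The same ordered indices would then be used to lay out each molecule's nine ALF features, so that feature columns refer to comparable molecules across the training set.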


One concern regarding the use of some form of atomic ordering protocol to define the features of the kriging model (such as the node-based scheme outlined above) is that discontinuities can be introduced into the surface of the output property whenever atoms in the list swap position. In other words, a small perturbation to the positions of the molecules of the system might result in the molecules being ordered differently, which will result in a jump in feature space to the configurational description that corresponds to the new ordering. Such discontinuities are not a property of SOAP,^{21} symmetry functions,^{22} or the ordered eigenvalues of a Coulomb matrix.^{4} On the other hand, those configurational representations are either inherently under-complete descriptions of the chemical environment or required to be larger than the number of degrees of freedom a system possesses in order to capture all of the system’s configurational information. Accordingly, they do not offer a convenient means of converting features back into global coordinates, which can hamper some forms of iterative training and sampling algorithms, and limits the possibilities of calculating system properties directly from feature space.

Discontinuities in the potential energy surface of a system can cause problems when trying to accurately simulate various states of the system, but the strength of influence of a given discontinuity is highly dependent on the size of the discontinuity, how smooth the potential energy surface is on either side of the discontinuity, and the details of the system of interest. For a machine-learned potential energy surface with zero error, there would be no discontinuity caused by jumping between two physically identical but feature-input-different configurational descriptions of a system, as the predicted potential at both points would be identical. Likewise, the forces calculated should also be identical. Thus, the influence of the discontinuity can be made arbitrarily small by increasing model accuracy (although practical improvement of model accuracy is still a field of ongoing research). Finally, it should be noted that not every system property of interest will necessarily be influenced by such a discontinuity, and sometimes a model might be better served by the potentially increased prediction accuracy an alternative coordinate frame may offer. For example, a robust search algorithm should be capable of finding energetically optimal geometries despite jumps between what should be property-equivalent points in phase space.

### B. Application of VNF to the system

Each non-central water molecule contributes a set of nine additional features to the system, for example, {*R*_{ON}, θ_{ON}, *ϕ*_{ON}, *R*_{H1N}, θ_{H1N}, *ϕ*_{H1N}, *R*_{H2N}, θ_{H2N}, *ϕ*_{H2N}} for Water *N*. According to Rule 1 of Section II D, one of these features must be selected such that dummy-features representing the selected feature can be assigned a value outside the sampled domain of the corresponding true-features. As the O_{central}–O_{N} distance is already used for defining atomic indices, it is intuitive to also choose this feature for Rule 1. Thus, for the molecule Water *N*, *R*_{ON} has been selected as the feature required to satisfy Rule 1. Figure 7(a) displays the *R*_{ON} distributions of the non-central water molecules, sampled across all 5000 clusters, while Figure 7(b) displays their radial ranges. To satisfy Rule 1, dummy *R*_{ON} features, $\Delta_{R_{ON}}$, must be set to a value below 2.45 Å or above 4 Å. For the energetic and electrostatic properties of interest here, the influence of one atom on another decreases as a function of the distance between the two. For this reason, the value assigned to all $\Delta_{R_{ON}}$ was set slightly greater than the maximum value of the corresponding true-features. Such a choice of $\Delta_{R_{ON}}$ can also be considered to approximately represent the existence of dummy molecules slightly beyond the cutoff radius of 4 Å (had $\Delta_{R_{ON}}$ been set to a value below 2.45 Å, the choice could be considered to approximately represent the existence of dummy molecules closer than 2.45 Å to the centre). As all true *R*_{ON} features for the system here have a maximum of approximately 4.00 Å, all corresponding dummy-features were set to 4.05 Å. According to Rule 2, dummy-features representing the eight remaining features used to describe a given non-central water molecule were set to the mean value of their corresponding true-features.
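Under these assumptions (*R*_{ON} chosen for Rule 1 with dummy value 4.05 Å; per-feature means for Rule 2), the padding step might look like the following sketch (function and variable names are illustrative, and the random feature vectors stand in for real ALF features):

```python
import numpy as np

DUMMY_R_ON = 4.05     # just beyond the 4.00 Angstrom cutoff (Rule 1)
FEATS_PER_MOL = 9     # {R_ON, theta_ON, phi_ON, ..., phi_H2N}
R_ON_SLOT = 0         # position of R_ON within each 9-feature set

def pad_training_set(points, n_central_feats=3):
    """Pad variable-length feature vectors to a common dimensionality.

    Rule 1: dummy R_ON features take a value outside the sampled domain.
    Rule 2: all other dummy-features take the mean of their true-features.
    """
    max_len = max(len(p) for p in points)
    # Mean of each feature over the points that actually possess it (Rule 2).
    col_means = [np.mean([p[h] for p in points if len(p) > h])
                 for h in range(max_len)]
    padded = []
    for p in points:
        q = list(p)
        for h in range(len(p), max_len):
            in_mol = (h - n_central_feats) % FEATS_PER_MOL
            q.append(DUMMY_R_ON if in_mol == R_ON_SLOT else col_means[h])
        padded.append(np.array(q))
    return np.array(padded)

# Two clusters: one with one external water (3 + 9 features), one with two.
small = np.random.rand(12).tolist()
large = np.random.rand(21).tolist()
X = pad_training_set([small, large])
print(X.shape)   # (2, 21)
print(X[0][12])  # 4.05: the dummy R_ON of the missing molecule
```

The padded set now has a single, fixed dimensionality and can be fed to the kriging engine unchanged, with the Rule 1 value guaranteeing that no two clusters of different size can collapse onto identical inputs.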

## IV. RESULTS

The performance statistics of the IQA kriging models for the 4 Å clusters are displayed in Table II and are labelled “VNF.” For comparison, the performance statistics of two additional models are also displayed: (1) a set created by truncating the features of higher-dimension training points such that all training points are described by the number of features describing the lowest-dimensionality training point (which corresponds to 5 molecules for the set of clusters considered here) and (2) a set of ten-molecule clusters (decamers), where all configurations contain ten molecules regardless of cluster radius, as published by Davie *et al.*^{16} These additional models are labelled “5 mol” and “10 mol,” respectively. Each kriging model was trained on a data set of 5000 training points with 5-fold cross-validation and a 1:4 partitioning of training set to test set. In other words, the 5000 training points were evenly split into five 1000-training-point models, with each model being tested on the 4000 points not used to train it. The kriging models are compared both through the mean absolute errors (MAEs), obtained by comparing the predicted property values to the corresponding true property values, and through the q^{2} correlation coefficient,

$$ q^2 = 1 - \frac{\sum_{i=1}^{N_{\text{test}}} \left(P_i - T_i\right)^2}{\sum_{i=1}^{N_{\text{test}}} \left(M - T_i\right)^2}, $$

where *P*_{i} is the predicted value of test point *i*; *T*_{i} is the true value of test point *i*; *M* is the mean of the entire test set; and *N*_{test} is the number of points (i.e., 4000) the model is tested on. Thus, the *q*^{2} metric has the intuitive property of being equal to 1 when predictions are equal to true values (*P*_{i} − *T*_{i} = 0), and equal to 0 when predictions are no better than the predictions obtained by using the simplest, unbiased estimator, which is the mean (i.e., when *P*_{i} − *T*_{i} = *M* − *T*_{i}). Thus, five MAE (see the 5-fold cross validation above) and *q*^{2} values were obtained per model, for which the average and standard deviation were calculated and reported in Table II.
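As a quick illustration (not the authors' code; the function name is ours), the q^{2} metric defined above reduces to a few lines of NumPy:

```python
import numpy as np

def q_squared(predicted, true):
    """q^2 = 1 - sum_i (P_i - T_i)^2 / sum_i (M - T_i)^2,
    where M is the mean of the true test-set values."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    m = true.mean()
    return 1.0 - np.sum((predicted - true) ** 2) / np.sum((m - true) ** 2)
```

By construction, predicting every test point exactly gives q^{2} = 1, while always predicting the test-set mean gives q^{2} = 0.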

| Property | Model | Mean value | Range | MAE | q^{2} |
|---|---|---|---|---|---|
| Oxygen *E*_{Self} | VNF | −196 714.1 | 286.9 | 10.3 ± 0.1 | 0.90 ± 0.00 |
| | 5 mol | | | 23.0 ± 0.4 | 0.52 ± 0.02 |
| | 10 mol | −196 708.8 | 243.9 | 10.6 ± 0.2 | 0.90 ± 0.00 |
| Oxygen *E*_{Inter} | VNF | −1 531.6 | 397.9 | 16.4 ± 0.7 | 0.88 ± 0.01 |
| | 5 mol | | | 39.6 ± 0.4 | 0.32 ± 0.01 |
| | 10 mol | −1 540.6 | 407.7 | 15.3 ± 0.6 | 0.91 ± 0.01 |
| Hydrogen *E*_{Self} | VNF | −738.9 | 136.4 | 9.6 ± 0.3 | 0.69 ± 0.02 |
| | 5 mol | | | 11.4 ± 0.2 | 0.58 ± 0.01 |
| | 10 mol | −736.1 | 121.6 | 10.3 ± 0.1 | 0.66 ± 0.01 |
| Hydrogen *E*_{Inter} | VNF | −523.6 | 139.3 | 6.8 ± 0.3 | 0.84 ± 0.01 |
| | 5 mol | | | 13.2 ± 0.1 | 0.43 ± 0.02 |
| | 10 mol | −525.3 | 120.8 | 6.7 ± 0.0 | 0.85 ± 0.00 |


For the energetic properties considered, the VNF kriging models either outperformed or were within statistical error of the decamer models with respect to the MAE. In contrast, the feature-truncated 5 mol models performed significantly worse: compared to the 5 mol models, the VNF models returned oxygen *E*_{Self}, oxygen *E*_{Inter}, hydrogen *E*_{Self}, and hydrogen *E*_{Inter} MAE values 55%, 59%, 16%, and 48% lower, respectively. Although the MAE and q^{2} values returned by the 10 mol models were similar to those obtained from the VNF models, the range of the energetic properties of the 4 Å clusters was on average 11% higher than the range from the decamer clusters. Thus, when the MAE is viewed as a percentage of the property range (MAE%), most VNF models perform better than their corresponding 10 mol models (Figure 8). As both the VNF and 10 mol models are trained with complete descriptions of their respective cluster configurations, the difference in overall performance between the two sampling methods implies that IQA energies are modelled better as a function of the atomic positions of a cluster of fixed radius than as a function of the atomic positions of a cluster with a fixed number of molecules. This is particularly true for the hydrogen *E*_{Self} energy, a property known to be difficult to predict.^{16}

As a guide for molecular comparison, Table III shows approximated prediction metrics for the complete central molecule of a cluster. True full-molecule energies were not obtained, owing to the expense of calculating the true properties of both hydrogen atoms when only one hydrogen atom is required for the atomistic properties this paper has focused on. Instead, random combinations of one oxygen and two hydrogens were used to approximate the energetics of a complete molecule. Thus, each “molecule” possessed one set of oxygen energies (i.e., the *E*_{Self} and *E*_{Inter} of a single, random oxygen atom) and two sets of hydrogen energies (i.e., the *E*_{Self} and *E*_{Inter} of two respective, random hydrogen atoms). The relatively low MAE value is in part due to the decreased range of the total energy relative to certain atom-specific energies in Table II, and in part due to the compensatory nature of the errors for each individual atom: when the self-energy of an atom was overpredicted, the atom’s interaction energy was usually underpredicted. The root mean squared error (RMSE) has been included to aid comparison with the literature. The RMSE value obtained here is significantly higher than that obtained by Morawietz and Behler,^{32} who used neural networks to predict the energies of clusters up to ten molecules in size with a mean RMSE of 0.22 kJ/mol. In that work, however, clusters were sampled at much lower temperatures (linearly distributed between 50 and 250 K), representing a different phase (i.e., ice vs. liquid), and a more realistic potential was used in the simulation that provided the configurations (which has been shown to reduce errors). Furthermore, each cluster size above 5 molecules was individually sampled and trained for, with between 1700 and 4700 structures per cluster. In comparison, in the current work, only a single model per energy component has been used for all cluster sizes (from 5 to 15 molecules), each trained with only 1000 training points.
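The random-combination procedure described above can be sketched as follows. This is a hypothetical reconstruction (the function name and the seeded generator are ours); the per-atom arrays are assumed to hold each atom's total predicted and true energies, i.e., its *E*_{Self} + *E*_{Inter}, so that the compensating errors between the two components cancel within each atom before molecules are assembled.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

def molecular_errors(o_pred, o_true, h_pred, h_true, n_molecules):
    """Assemble pseudo-molecules from one random oxygen and two random
    hydrogens each; return the signed total-energy prediction errors."""
    o_idx = rng.integers(len(o_true), size=n_molecules)
    h_idx = rng.integers(len(h_true), size=(n_molecules, 2))
    pred = o_pred[o_idx] + h_pred[h_idx].sum(axis=1)
    true = o_true[o_idx] + h_true[h_idx].sum(axis=1)
    return pred - true

# The molecular MAE and RMSE then follow directly from the error array:
#   mae  = np.abs(errs).mean()
#   rmse = np.sqrt((errs ** 2).mean())
```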

| | Mean value | Range | MAE | RMSE | q^{2} |
|---|---|---|---|---|---|
| VNF | −200 770.7 ± 0.5 | 242.9 ± 16.1 | 16.9 ± 0.7 | 21.3 ± 0.8 | 0.61 ± 0.04 |


Figure 9 displays the full distribution of energy property errors in S-curve format. The S-curve displays the error distribution of a set of predictions in a cumulative (i.e., summed or integrated) way, showing a percentile (%) on the ordinate. This format differs from the common error-distribution histogram, which can be seen as the differential of the S-curve. Whereas a distribution histogram generally draws the eye to the mode (i.e., the value corresponding to the maximum of the distribution), the S-curve presents the data in a way that makes the various percentile values (such as the 50th-, 90th-, and 95th-percentile values) convenient to read off. This makes the S-curve an intuitive means of quickly determining the characteristics of an error distribution that matter from a computational chemistry perspective. Figures 9(a), 9(b), and 9(d) display very similar distributions of prediction errors for the VNF and 10 mol models, and noticeably worse prediction errors for the 5 mol models. A similar trend is also seen in Figure 9(c), but there the 10 mol model performs slightly worse relative to the VNF model than for the previous properties. Again, this result implies that the energetic properties of clusters selected using cutoff radii are easier to model with kriging than those of clusters selected to have a fixed number of molecules.
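In code, constructing an S-curve reduces to sorting the absolute prediction errors and pairing each with the percentage of points at or below it (a sketch; the function name is ours):

```python
import numpy as np

def s_curve(predicted, true):
    """Return (sorted absolute errors, percentage of points at or below each),
    i.e., the abscissa and ordinate of an S-curve plot."""
    abs_err = np.sort(np.abs(np.asarray(predicted) - np.asarray(true)))
    percentile = 100.0 * np.arange(1, abs_err.size + 1) / abs_err.size
    return abs_err, percentile
```

The 95th-percentile error, for example, is then simply the abscissa value where the curve first reaches 95% on the ordinate.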

In addition to the IQA energies, kriging models were also obtained for the QTAIM-obtained electrostatic multipole moments: the charge (Q[0,0]), the components of the dipole moment (Q[1,0], Q[1,1,c], Q[1,1,s]), and the components of the quadrupole moment (Q[2,0], Q[2,1,c], Q[2,1,s], Q[2,2,c], Q[2,2,s]). As with the previous energetic models, each kriging model was trained on the data set of 5000 training points with 5-fold cross-validation and a 1:4 partitioning of training set to test set. Figure 10 displays the MAE, q^{2}, and MAE% of both the oxygen and hydrogen multipole moment kriging models. Across all moments, the VNF and 10 mol models once again perform similarly, and together they significantly outperform the 5 mol models. Once again, the q^{2} and MAE% validation metrics indicate that the VNF models generally outperform their 10 mol equivalents. It is worth noting that, relative to the corresponding VNF and 10 mol models, the 5 mol models predicted the multipole moments much better than they predicted the IQA properties: the average MAE% over all IQA energies for the 5 mol models was 94% higher than that for the equivalent VNF models, while the average MAE% over all multipole moments for the 5 mol models was only 36% higher than for the equivalent VNF models. As the 5 mol model approximates a water molecule and its first solvation shell, the relative MAE% results imply that the IQA energetic description of a water molecule at the centre of a cluster is more sensitive to the positions of water molecules beyond the first solvation shell than is the QTAIM multipolar electrostatic description of the molecule.

## V. CONCLUSION

As molecular interaction strength generally decreases with distance, the properties of a selected atom or molecule in a molecular simulation have traditionally been modelled as a function of the surrounding atoms within a given cutoff radius. However, many machine learning methods, including kriging, have traditionally required a fixed number of inputs to model a given property. Fixed-number-of-input kriging techniques offer no flexibility to handle the movement of atoms or molecules into or out of a radius of physical influence. As a result, direct application of such techniques to molecular simulation risks decreased efficiency if the selected number of inputs is too high, or decreased accuracy if it is too low.

In this work, a new method of applying kriging to systems with a *variable number of features* (VNFs) has been developed and applied to the prediction of 54 properties of a water molecule at the centre of a 4 Å radius cluster. The presented method is general, is theoretically justified by the fact that it cannot result in a singular correlation matrix, and does not discard any input information. Models created using the VNF method have been shown to perform as well as, or better than, equivalent models for a ten-molecule cluster with a fixed number of inputs. Similarly, the developed VNF protocol was shown to significantly outperform models created using an alternative, naïve feature-truncating approach. The VNF protocol presented here is general and, as it does not suffer from difficulties stemming from identical sets of inputs corresponding to different outputs, or from input biasing, is particularly suitable for interpolating machine learning methods.

## Acknowledgments

We are grateful to the EPSRC for the award of an Established Career Fellowship (No. EP/K005472), which funds all three authors.