A new force field called FFLUX uses the machine learning technique kriging to capture the link between the properties (energies and multipole moments) of topological atoms (i.e., the output) and the coordinates of the surrounding atoms (i.e., the input). Here we present a novel, general method of applying kriging to chemical systems that do not possess a fixed number of (geometrical) inputs. Unlike traditional kriging methods, which require an input system of fixed dimensionality, the method presented here can be readily applied to molecular simulation, where an interaction cutoff radius is commonly used and the number of atoms or molecules within the cutoff radius is not constant. The method described here is general and can be applied to any machine learning technique that normally operates with a fixed number of inputs. In particular, the method is also useful for interpolating methods other than kriging, which may suffer from difficulties stemming from identical sets of inputs corresponding to different outputs, or from input biasing. As a demonstration, the new method is used to predict 54 energetic and electrostatic properties of the central water molecule in a set of 5000 water clusters of 4 Å radius, each containing a variable number of water molecules. The results are validated against equivalent models trained on clusters composed of a fixed number of water molecules (set to ten, i.e., decamers) and against models created using a naïve method of treating the variable number of inputs problem. Results show that the 4 Å water cluster models, utilising the method presented here, return kriging models similar to or better than those of the decamer clusters for all properties considered, and perform much better than the truncated models.
I. INTRODUCTION
Machine learning algorithms are rapidly gaining popularity within computational chemistry, owing to their ability to accurately predict chemical properties of interest in significantly less time than is required to calculate or measure those properties by conventional means.1–7 As the computational expense of predicting atomic and molecular properties with machine learning models is substantially lower than that of calculating such properties via quantum chemical calculations, their use in computational chemistry opens the door to molecular dynamics simulations of quantum accuracy in a fraction of the computational time. Work towards the development of a machine learning-based force field (known as FFLUX8 and previously as QCTFF) is currently underway in the authors’ group. FFLUX views the individual atoms of a system as malleable boxes of electron density, known as topological atoms and defined by the Quantum Theory of Atoms in Molecules (QTAIM).9,10 These topological atoms can then interact at long range through multipolar electrostatics; at short range through the combination of inter-atomic Coulomb, exchange, and correlation energies; and also through any other energy terms relevant to the chosen energy partitioning scheme (such as the atomic self-energy term defined through the Interacting Quantum Atoms (IQA)11 energy partitioning scheme). As the properties of these topological atoms are a function of the system around them, they lend themselves well to prediction through machine learning. In order to drive the dynamics of a system, FFLUX uses this efficient prediction approach to obtain the necessary properties in a fraction of the time required to calculate them from scratch.
At the core of FFLUX is the machine learning method “kriging” or “Gaussian Process Regression,”12 which has repeatedly demonstrated its versatility, accurately predicting atomic and molecular properties for amino acids,13 carbohydrates,14 oligopeptides,15 and water clusters.16 Developed from a geostatistical background,17 and applied to computer experiments by Sacks et al.,18 then Jones,19,20 kriging is an interpolating predictor that relates a set of outputs to a corresponding set of inputs in order to predict a given property of interest. For a chemical system where the value of a property of interest is a function of the system’s atomic configuration, the required set of kriging inputs can be as straightforward and intuitive as a full description of the system’s atomic coordinates. However, as a kriging model is only defined for a constant number of inputs, a single model cannot be readily applied to similar systems of different sizes (i.e., with a differently sized description of the system’s configuration).
In an attempt to reduce the complexity of high-dimensional kriging problems, recent work by Fletcher and Popelier15 has demonstrated that, for large systems, accurate kriging predictions can be obtained by reducing the inputs to only those describing the immediate surroundings of an atom of interest. As the strength of interaction between two atoms usually decreases with increasing distance, their results15 are intuitive, and that method can be viewed as analogous to the cutoff radius commonly applied in molecular dynamics simulations, beyond which interactions are considered negligible. Thus, the reliance of machine learning models in general, and kriging in particular, on a fixed number of inputs (i.e., surrounding atoms), instead of a flexible number of inputs determined by interaction distance, must be seen as a limitation of their application to chemical systems. This limitation is overcome in the current article.
This work aims to address the problems associated with applying machine learning models, and in particular kriging, to chemical systems that may possess a variable number of inputs, by introducing a novel, general method to account for such systems. Although methods of addressing these problems have previously been proposed for particular input schemes (e.g., the works of Rupp et al.,4 Hansen et al.,5 Bartok et al.,21 and Behler22), the method proposed here is the first for inputs defined directly by a molecular system’s geometry. In addition, the method proposed here is general and is equivalent to the methods above when applied to those input definition schemes.
The rest of this paper is structured as follows: after a brief introduction to kriging, a description is presented of how the inputs of the kriging models used here are defined. Next, an overview of the difficulties faced when applying kriging to systems with a variable number of inputs is given, before a novel method addressing these problems is presented. The proposed method is then applied to the prediction of 54 different properties of liquid water clusters of radius 4 Å, sampled from a molecular dynamics simulation. Prediction results are obtained and compared to those of similar systems via a number of validation metrics. Finally, the major conclusions of the current work are summarized in Sec. V.
II. METHOD
A. Kriging
Kriging is an interpolating machine learning technique, capable of mapping an output’s response to a given set of inputs (also known as features).18–20 In simple kriging, the value of a given output, y(x), is expressed as the sum of the output’s mean, μ, and an error term, ε(x), dependent upon the particular atomic configuration described by a fixed number of descriptors (x),

y(x) = μ + ε(x). (1)
When kriging chemical systems, y(x) in Equation (1) represents some property of interest (e.g., the net charge on an atom of choice), while ε(x) represents some deviation from the mean value of the property due to the particular atomic configuration the system is currently in. Intuitively, kriging assumes the smoothness of physical phenomena in space by considering that the value of a property at a given location is more likely to be similar to values at nearby points than to values at points further away. Thus, a kriging model must be trained using a set of data from which the correlation relation may be optimized, and from which correlations can be used to predict outputs at new points. A predicted output, ŷ(x*), for a given set of inputs, x*, is given through the following equation:

ŷ(x*) = μ̂ + Σ_{i=1}^{n} a_i φ_i(x*), (2)
where μ̂ is a prediction of the global mean estimated from the training data consisting of n training points,19 a is a vector of constant weights, and φ_i is a basis function relating the input x* to the i-th training point x_i. In this work, the correlation between points is described through the basis function

φ(x^(i), x^(j)) = exp(−Σ_{h=1}^{N_f} θ_h |x_h^(i) − x_h^(j)|^{p_h}), (3)
where N_f is the number of features describing the system (i.e., the dimensionality of the problem), and θ_h and p_h are hyper-parameters corresponding to feature h, which are obtained through maximizing the log-likelihood function,

ln(L) = −(N_t/2) ln(σ̂²) − (1/2) ln(|R|), (4)
where y is the column vector of the modelled property evaluated at each of the N_t training points; 1 is a column vector of ones; the superscript T denotes the transpose of its argument; σ̂² = (y − 1μ̂)^T R^{−1} (y − 1μ̂)/N_t is the variance of the process estimated from the data;19 and R is the correlation matrix, with elements R_ij = φ(x^(i), x^(j)) and determinant |R|.
As the point to be predicted (i.e., a test point) approaches a training point (i.e., x* → x_i), it can be seen from Equation (3) that the correlation between the test point and the training point increases to one. This means that simple kriging is an interpolating predictor, able to perfectly predict any test point that has exactly the same input coordinates as one of the model’s training points. From Equation (3), it can also be seen where the requirement for a fixed number of inputs arises: as the correlation between two points is a function of the kriging distance between them, both points must exist within the N_f-dimensional feature space for which the basis function is defined.
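As an illustration, the prediction machinery of Equations (2)-(4) can be sketched in a few lines of Python. This is a minimal sketch of simple kriging with fixed, pre-chosen hyperparameters θ and p; it is not the FEREBUS implementation, and all function and variable names are ours.

```python
import numpy as np

def basis(xa, xb, theta, p):
    # Correlation basis function, Eq. (3): exp(-sum_h theta_h * |x_h^a - x_h^b|^p_h)
    return np.exp(-np.sum(theta * np.abs(xa - xb) ** p))

def krige_predict(x_star, X_train, y_train, theta, p):
    # Simple kriging prediction, Eq. (2): y_hat = mu_hat + sum_i a_i * phi_i(x*)
    n = len(X_train)
    R = np.array([[basis(X_train[i], X_train[j], theta, p)
                   for j in range(n)] for i in range(n)])
    ones = np.ones(n)
    R_inv = np.linalg.inv(R)
    # Generalized least-squares estimate of the global mean mu_hat
    mu_hat = (ones @ R_inv @ y_train) / (ones @ R_inv @ ones)
    a = R_inv @ (y_train - mu_hat * ones)   # constant weight vector a
    r = np.array([basis(x_star, X_train[i], theta, p) for i in range(n)])
    return mu_hat + r @ a
```

Because the basis function equals one when x* coincides with a training point, this sketch reproduces the training outputs exactly, illustrating the interpolating property discussed above.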
B. Defining features for molecular systems
In a molecular system, many properties of interest are a function of the atomic configuration of the system. Thus, by using the configurational information of a molecular system as its input, a machine learning model is well suited to predicting such molecular properties. Multiple methods have been proposed to represent the configurational information of a system as features. A shortlist of these includes the molecular local frame (MLF) of Handley et al.,1 an expansion of a local environment into spherical harmonics with atomic neighbourhood densities expressed using Gaussians (known as SOAP) by Bartok et al.,3,21 the eigenvalues of the Coulomb matrix of Rupp et al.,4 the symmetry functions of Behler,22 and the atomic local frame (ALF, defined just below) of Kandathil et al.23 In this work, an ALF is used because it is general for rigid and non-rigid systems alike; it captures relative geometrical changes rather than absolute ones, such that the ALF produces kriging models that are transferable; and it does not reduce the dimensionality of the geometric inputs, so no configurational information is lost.
The ALF is defined as a spherical polar coordinate frame centred on an atom of interest, with the x-axis of the system, and the xy-plane, defined by selected surrounding atoms. An example of a spherical polar ALF is displayed in Figure 1. Here, the system is centred on the oxygen of a water molecule, the x-axis is defined along one of the molecule’s OH bonds, and the xy-plane is defined to include the remaining hydrogen. Following previous work using an atom-centred, spherical polar coordinate frame,13,15,24 the first three features of the kriging training set, ROH1, ROH2, and θHOH, correspond to the two central OH bond lengths and the central water molecule’s HOH angle. Each external atom N is described by a set of three features, RN, θN, and ϕN, where RN is the distance from the central oxygen to atom N, θN is the polar angle of atom N (measured from the z-axis), and ϕN is the azimuthal angle of atom N (measured from the x-axis). Such a scheme results in a total of 3N − 6 features (e.g., 12 for the water dimer displayed in Figure 1).
Coordinate-frame and first six kriging features of an oxygen atom in a water dimer. The x-axis of the system is defined by the O1–H2 bond, and the xy-plane is defined by the H2–O1–H3 angle. The first three features of the system correspond to RO1H2, RO1H3, and θH2O1H3, respectively. The next atom by index (i.e., O4 in this example) is the next atom to be described by features and is described by the three features, RO4, θO4, and φO4. This convention is continued until every atom in the system is described.
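The ALF feature construction described above can be sketched as follows. This is an illustrative reading of the scheme, not the authors' code: the atom ordering in `coords` and all names are ours, and the frame convention (x along O–H1, z normal to the HOH plane, y completing a right-handed frame) is one plausible realization of the text.

```python
import numpy as np

def alf_features(coords):
    """Spherical polar ALF features for an oxygen-centred frame.
    coords[0] = central O, coords[1:3] = its hydrogens,
    coords[3:] = external atoms in their assigned order (illustrative)."""
    O, H1, H2 = coords[0], coords[1], coords[2]
    ex = (H1 - O) / np.linalg.norm(H1 - O)      # x-axis along the O-H1 bond
    v2 = H2 - O
    ez = np.cross(ex, v2)                       # z-axis normal to the HOH plane
    ez /= np.linalg.norm(ez)
    ey = np.cross(ez, ex)                       # y-axis completes the frame
    r1, r2 = np.linalg.norm(H1 - O), np.linalg.norm(v2)
    angle = np.arccos(np.dot(ex, v2) / r2)      # HOH angle
    feats = [r1, r2, angle]
    for atom in coords[3:]:
        d = atom - O
        x, y, z = d @ ex, d @ ey, d @ ez        # express atom in the local frame
        R = np.linalg.norm(d)
        feats += [R, np.arccos(z / R), np.arctan2(y, x)]   # R_N, theta_N, phi_N
    return np.array(feats)
```

Each external atom thus adds the triplet (R_N, θ_N, ϕ_N), giving the 3N − 6 features quoted above.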
C. Kriging with a variable number of features (VNFs)
As the basis function in Equation (3) is only defined over a constant N_f dimensions, all inputs must accordingly be defined over N_f dimensions. There are two straightforward options for constructing kriging models for a set of training points in which the number of dimensions required to describe each training point is not constant across the whole set, a situation which shall be referred to as a variable number of features (VNF). The first option reduces the number of dimensions of each training point to equal that of the training point with the lowest dimensionality. The second option increases the number of dimensions of each training point to equal that of the training point with the highest dimensionality.
The first option can be readily applied by considering the number of features required to describe the training point of lowest dimension and by truncating the number of features of all other training points to match. However, by truncating features, configurational information about the system is removed, leading to two problems:
Fundamental to the use of a system’s configurational information as the input to a machine learning model is the assumption that the output of interest is a function of the system’s configuration. By discarding some of this information, a machine learning model will no longer be modelling a property in its true configurational space, and inaccuracies may be introduced.
It will be possible for two identical sets of inputs, post-truncation, to possess different corresponding outputs. An example of this situation is displayed in Figure 2. Consider a mixed set of water trimers and tetramers, sampled from a molecular dynamics simulation, where a “central” water molecule acts as a hydrogen bond donor to two other water molecules in the trimer systems, and where the tetramer systems are sampled similarly except for the presence of a fourth water molecule (see Figure 2). If the first three water molecules of a trimer training point are configurationally identical to the first three molecules of a tetramer training point, then truncating the features corresponding to the fourth water molecule of the tetramer configuration will result in two identical sets of inputs. However, the existence of a fourth molecule in the tetramer configurations will have a quantitative effect on the atomic properties within the system. Thus, by simply truncating the features of higher dimensional training points, it is possible to obtain two identical sets of inputs with different output values. For an interpolating predictor such as simple kriging, this situation will lead to a singular correlation matrix R, which will not be invertible, and thus leads to a situation where the log-likelihood (Equation (4)) is not defined.25
The second option mentioned above is to extend the number of features possessed by training points of lower dimensionality until they equal that of the training point of largest dimensionality. While the basis function in Equation (3) is only defined over N_f dimensions, notice that N_f can be set to any arbitrary value that is equal to or higher than the actual number of features required to describe a system of interest, which we will term true-features. In fact, for a system described by N_fT true-features, the addition of N_fD dummy-features will not affect model predictions if the additional features contribute zero to the sum in Equation (3). Such a result can be achieved in two ways: (1) if θ_{N_fT+1} = θ_{N_fT+2} = ⋯ = θ_{N_fT+N_fD} = 0 or (2) if x^(i)_{N_fT+j} = x^(k)_{N_fT+j} for all training points i and k, where j runs over all integers between 1 and N_fD (i.e., each dummy-feature takes the same value in every training point, so its distance terms in Equation (3) vanish). Accordingly, a kriging model may be designed for a system that is described by a VNF by ensuring that the dimensionality of the model is not less than the number of true-features required to describe the training point (TP) of highest dimensionality, while adding dummy-features to training points of lower dimensionality (see Figure 3).
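The mechanics of this second option, padding every point up to the highest dimensionality, can be sketched as below. Choosing the actual dummy values is the subject of Section II D; the constant used here is purely illustrative, and the function name is ours.

```python
import numpy as np

def pad_training_set(points, dummy_value):
    """Extend every training point with dummy-features until all points
    share the dimensionality of the highest-dimensional point."""
    nf = max(len(p) for p in points)   # N_f of the largest training point
    return np.array([list(p) + [dummy_value] * (nf - len(p)) for p in points])
```

For example, a mixed set of 2-, 3-, and 4-feature points becomes a rectangular (3 × 4) training matrix suitable for the fixed-dimension basis function of Equation (3).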
Diagram of trimer and tetramer systems where the “first three” water molecules in each (i.e., those within the blue region) are configurationally identical. The addition of the fourth molecule in the tetramer systems will result in atomic properties that are different between the trimer and tetramer. However, by attempting to describe both systems with only the configurational information within the blue region, these differences cannot be uniquely described.
The first training point of the system (TP1) is described by five true-features (f1 to f5). However, the second and third training points are described by only three and four true-features, respectively. Accordingly, dummy-features (Δn) have been added to the second and third training points as required, in order to increase their dimensionality to be equal to that of training point one (TP1).
As the values assigned to dummy-features will contribute to the sum in Equation (3), care must be given when choosing their values. In particular, there are three possible scenarios that must be considered:
Choosing values for dummy-features that are within the range of their equivalent true-features may inadvertently result in two training points having equivalent descriptions, despite different property values. In other words, two identical sets of inputs may correspond to different outputs. Again, such a scenario will result in multiple, identical rows of the correlation matrix R in Equation (4), making R singular and preventing its inversion.
Assigning values to dummy-features that lie outside the sampled domain of the equivalent true-features might excessively reduce the correlation between two training points of different dimensionality, adversely affecting prediction accuracy.
Setting the value of dummy-features to a constant across all dimensions, regardless of whether the constant is inside or outside the sampled domain of the true-features, will introduce an avoidable bias in the distance terms in the sum of Equation (3). This bias stems from the fact that the mean value of the true-features of one dimension may be closer to the constant that the dummy-features are set to than the mean value of the true-features of another dimension. In other words, when considering dummy-features in Equation (3), if Δ_h = c for all h, where c is some constant, the distance terms |x_h − Δ_h| in the sum in Equation (3) will be biased unless the mean value of the true-features, x̄_h, is equidistant from c for every dimension h.
Furthermore, if the constant value is not carefully selected, the introduced bias may be non-physical. For example, consider a set of water trimers, sampled from a molecular dynamics simulation, where a “central” water molecule (Water O) acts as a hydrogen bond donor to the two other water molecules in the system (Water A and Water B), according to Figure 4(a). Here, the features of the system are described through an ALF centred on the oxygen atom of Water O, with an x-axis defined through the OH bond of Water O on the left of Figure 4(a), and an xy-plane defined to include the remaining hydrogen of Water O. By defining the hydrogen bond accepting molecule on the left of Figure 4(a) as Water A, and the hydrogen bond accepting molecule on the right of Figure 4(a) as Water B, the distributions of the ϕON features describing the oxygens of Water A and Water B will be as displayed in Figure 4(b). By setting the ϕON dummy-features equal to the arbitrary constant ΔϕOWaterA = ΔϕOWaterB = 0, the dummy-features representing ϕOWaterA will, on average, contribute significantly less to the sum in Equation (3) than the dummy-features representing ϕOWaterB (i.e., on average, |ϕOWaterA − ΔϕOWaterA| ≪ |ϕOWaterB − ΔϕOWaterB|; see green peak in Figure 4(b)). Accordingly, the correlation between Water A and Water O will be artificially increased, despite Water A and Water B being physically equivalent when viewed from the central oxygen, due to the water monomer’s C2v symmetry.
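The bias in scenario 3 can be checked numerically. The two distributions below are illustrative stand-ins for the ϕ distributions of Figure 4(b), not data from this work: with a constant dummy value of zero, the mean distance term for the Water A feature is much smaller than for the Water B feature, whereas centring each dummy value on its own feature's sampled mean treats the two physically equivalent molecules symmetrically.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_A = rng.normal(0.0, 0.2, 5000)   # phi of Water A's oxygen, peaked near 0 rad
phi_B = rng.normal(2.0, 0.2, 5000)   # phi of Water B's oxygen, peaked elsewhere

const = 0.0                          # scenario 3: one arbitrary constant for both
dist_A = np.mean(np.abs(phi_A - const))   # mean |x_h - Delta_h| for Water A
dist_B = np.mean(np.abs(phi_B - const))   # mean |x_h - Delta_h| for Water B

# Centring each dummy value on the sampled mean of its own true-features
# removes the asymmetry between the two distributions.
mean_A = np.mean(np.abs(phi_A - phi_A.mean()))
mean_B = np.mean(np.abs(phi_B - phi_B.mean()))
```

Here `dist_A` is far smaller than `dist_B`, reproducing the artificial increase in correlation described above, while `mean_A` and `mean_B` are nearly equal.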
(a) Hypothetical water trimer system where “Water A” is defined to be the molecule to the left of “Water O,” and “Water B” is defined to be the water to the right of “Water O.” The x-axis is marked as defined by the ALF. (b) Sampling distribution of ϕ values for oxygens of Water A and Water B, as defined by ALF centred on Water O.
D. VNF method
The method presented here is designed as a general method of applying kriging to VNF systems through the use of dummy-features. Although the work here will focus on the application of the presented VNF method to chemical systems, using the configurational information of the system as the machine learning inputs, the method is general and equally valid for non-chemical applications (e.g., aeronautical and automotive). Note that other groups have developed their own ways to apply machine learning techniques to systems with a variable number of inputs, specific to the machine learning method they have chosen to use, and specific to how they have defined their machine learning inputs. For example, Rupp et al.4 apply zero-valued dummy-features to “pad out” their training sets as required, when using the ordered eigenvalues of their Coulomb matrix as machine learning inputs. While such a method is valid when using ordered eigenvalues as machine learning inputs, it is not general to other forms of inputs mentioned in Section II B.
The proposed method consists of two rules. For each set of features representing a single atom or molecule, the rules are as follows:
A single feature in the set is selected, and the value of all corresponding dummy-features is taken to be a value outside the sampled domain of the selected true-feature. The feature that is selected, and the assigned dummy value, should be motivated by the nature of the problem (as explained just below).
The values of all other dummy-features are taken to be the mean value of their corresponding true-features.
Rule 1 guarantees that each configuration is uniquely described, and that there is no risk of identical sets of inputs corresponding to different outputs. The feature that is selected, and the assigned dummy value, should be motivated by consideration of the features being used. For example, where a water molecule N is described by a set of nine features according to the ALF described in Section II B, the RON feature is an intuitive choice to satisfy Rule 1, as there is only one oxygen atom per water molecule and, for many properties, the distance between molecules is a convenient dimension (i.e., feature) in which to consider property changes. For this case, which will be discussed in more detail in Section III B, the dummy value can be set to a value outside the sampled range of RON. Alternatively, when using the eigenvalues of a Coulomb matrix as features, as in Ref. 4, all eigenvalues would be considered separately for Rule 1, as the eigenvalues cannot be conveniently grouped into clearly defined sets. As the eigenvalues of a well-conditioned matrix cannot equal zero, zero is a convenient value to assign to the dummy-features; indeed, this is the value assigned to the dummy-features of Ref. 4. Beyond the features selected for Rule 1, Rule 2 minimizes the mean distance between all other dummy-features and their corresponding true-features, and removes the bias introduced by arbitrary feature valuing. An example of the application of this method is provided for 4 Å water clusters in Section III B.
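The two rules can be sketched as follows, for the case where every training point is a concatenation of equally sized per-molecule feature groups (as in the nine-feature water groups above). The function and parameter names are ours: `r_index` marks the within-group position of the Rule 1 feature (e.g., RON), and `margin` is an illustrative offset placing its dummy value just outside the sampled domain.

```python
import numpy as np

def vnf_pad(points, group_size, r_index=0, margin=0.05):
    """Pad variable-length training points to a common dimensionality.
    Within each per-molecule group, the feature at position r_index gets a
    Rule 1 dummy value (just above the sampled maximum); every other
    feature gets a Rule 2 dummy value (the mean of its true-features)."""
    nf = max(len(p) for p in points)
    padded = np.full((len(points), nf), np.nan)
    for i, p in enumerate(points):
        padded[i, :len(p)] = p            # true-features; NaN marks missing ones
    for col in range(nf):
        mask = np.isnan(padded[:, col])
        true_vals = padded[~mask, col]
        if col % group_size == r_index:
            fill = true_vals.max() + margin   # Rule 1: outside sampled domain
        else:
            fill = true_vals.mean()           # Rule 2: mean of true-features
        padded[mask, col] = fill
    return padded
```

Rule 1 thus keeps every padded configuration uniquely described, while Rule 2 keeps the remaining distance terms in Equation (3) unbiased.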
III. SYSTEM ANALYSED
A. Clusters sampled
To demonstrate the proposed method for applying kriging to VNF systems, clusters of water molecules, centred on an oxygen atom and with an oxygen-oxygen cutoff radius of 4 Å, were selected. In other words, all water molecules possessing an oxygen atom within 4 Å of the oxygen atom of the central molecule were selected. Water, despite being the subject of extensive research, still has peculiar characteristics that are not yet fully understood. The failure of simulations to reproduce water’s structure and properties has led to the development of multiple water-specific force fields over the last thirty years,26,27 making bulk water an ideal system in which to apply advances in quantum chemical machine learning. Accordingly, 5000 water clusters were sampled from a previous molecular dynamics (MD) simulation, using a high-rank multipolar electrostatic potential.28
Figure 5 shows the frequency distribution of the sampled cluster sizes. The QTAIM9,10 multipole moments of the individual atoms of the central molecule, as well as their IQA11 self-energies and interaction energies, were taken as the properties of interest, for a total of 54 different properties on which to test the VNF technique (i.e., 25 multipole moments and 2 IQA energies for each of oxygen and hydrogen, so 54 = 2 × (25 + 2)). These properties were calculated using the program AIMAll,29 with the default settings and integration error control. The wavefunction files required as input for AIMAll were obtained using the package GAUSSIAN0930 at the B3LYP/6-311++G(d,p) level of theory. Kriging models were created and their predictions made using the in-house program FEREBUS, with particle swarm optimization used to optimise the model hyperparameters.25,31
Frequency distribution of cluster size, obtained from set of 5000 water clusters of 4 Å radius. Note that cluster sizes of 1, 2, 3, 4, and 16 are exactly 0.
As inputs for the kriging models, each cluster was described by its atomic coordinates, using a spherical polar ALF as described in Section II B. Kriging models used to predict the properties of the central oxygen atom were centred on the central oxygen atom, while kriging models used to predict the properties of the central hydrogen atoms were centred on the central hydrogen atoms, as recommended by Davie et al.16 To define atom N of a cluster, the spatial distribution of the water molecules across the set of clusters was considered.16 In other words, consider an atom in a cluster of water molecules that is designated O4 under the ALF scheme described in Section II B. Whereas the ALF defines how O4 is described by spherical polar coordinates, the spatial distribution of the water molecules across the set of sampled clusters determines which specific oxygen within the cluster is designated O4. To this end, for each cluster configuration, four nodes were placed on the vertices of a tetrahedron centred on the oxygen of the central water molecule, at a distance from the centre that approximately corresponds to the first peak in the water radial distribution function (as obtained from the MD simulation). The coordinates of the nodes are displayed in Table I, and a visual depiction is shown in Figure 6. From there, the non-central oxygen closest to node 1 was designated N = 4 in the above scheme, and its corresponding hydrogens were designated N = 5 and N = 6, with the hydrogen closest to the centre of the cluster designated N = 5. This is because the first three atoms of the system (i.e., N = 1, N = 2, and N = 3) comprise the central molecule. The node-based, atom-defining process was repeated for the closest undesignated molecule to node 2, and so on. In clusters where more than four non-central water molecules exist, the atoms of the additional molecules were defined as a set, with lower indices going to the atoms of molecules with a lower Ocentral–ON distance.
Again, within each molecule, hydrogens were defined based on ordered distance to the central oxygen. Readers may refer to the work of Davie et al.16 for more details.
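The node-based ordering can be sketched as a greedy assignment. This is our reading of the scheme, with illustrative names; tie-breaking and the subsequent hydrogen ordering within each molecule are omitted.

```python
import numpy as np

def order_by_nodes(nodes, oxygens):
    """Return an ordering of non-central oxygens: for each node in turn,
    pick the closest still-unassigned oxygen; any remaining oxygens are
    then ordered by their distance to the cluster centre (the origin here)."""
    order, unassigned = [], list(range(len(oxygens)))
    for node in nodes:
        if not unassigned:
            break
        k = min(unassigned, key=lambda i: np.linalg.norm(oxygens[i] - node))
        order.append(k)
        unassigned.remove(k)
    # Molecules beyond the four nodes: ordered by O_central-O_N distance
    order += sorted(unassigned, key=lambda i: np.linalg.norm(oxygens[i]))
    return order
```

The returned list gives, for each node in sequence, which oxygen receives the next ALF index, so that spatially similar clusters receive consistent feature orderings.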
Position of nodes used to define atom indices. Nodes 1-4 were placed on the vertices of a tetrahedron centred on the oxygen of the central water molecule, at a distance from the centre that approximately corresponds to the first peak in the water radial distribution function (see Figure 6 for a spatial depiction of how nodes 1-4 are positioned).
Node | Oxygen | r (Å) | θ (deg) | ϕ (deg)
---|---|---|---|---
1 | O4 | 2.65 | 90 | 0
2 | O7 | 2.65 | 90 | 105
3 | O10 | 2.65 | 17 | −128
4 | O13 | 2.65 | 163 | −128
Spatial distribution of atoms O4 (blue), O7 (green), O10 (purple), and O13 (orange) sampled from 5000 water clusters of 4 Å radius. Black spheres represent position of nodes used to define atom indices, and the orientation of the central water molecule is marked by a red-white stick diagram.
One concern regarding the use of an atomic ordering protocol to define the features of a kriging model (such as the node-based scheme outlined above) is that discontinuities can be introduced into the surface of the output property whenever atoms in the list swap positions. In other words, a small perturbation of the positions of the molecules of the system might result in the molecules being ordered differently, which will result in a jump in feature space to the configurational description that corresponds to the new ordering. Such discontinuities are not a property of SOAP,21 symmetry functions,22 or the ordered eigenvalues of a Coulomb matrix.4 On the other hand, these configurational representations are either inherently under-complete descriptions of the chemical environment or are required to be larger than the number of degrees of freedom a system possesses in order to capture all of the system’s configurational information. Accordingly, these representations do not present a convenient means of converting features back into global coordinates, which can hamper some forms of iterative training and sampling algorithms, and limits the possibility of calculating system properties directly from feature space.
Discontinuities in the potential energy surface of a system can cause problems when trying to accurately simulate various states of the system, but the strength of influence of a given discontinuity depends strongly on its size, on how smooth the potential energy surface is on either side of it, and on the details of the system of interest. For a machine-learned potential energy surface with zero error, there would be no discontinuity caused by jumping between two physically identical but feature-input-different configurational descriptions of a system, as the predicted potential at both points would be identical. Likewise, the calculated forces would also be identical. Thus, the influence of the discontinuity can be made arbitrarily small by increasing model accuracy (although practical improvement of model accuracy remains a field of ongoing research). Finally, it should be noted that not every system property of interest will necessarily be influenced by such a discontinuity, and sometimes a model might be better served by the potentially increased prediction accuracy an alternative coordinate frame may offer. For example, a robust search algorithm should be capable of finding energetically optimal geometries despite jumps between what should be property-equivalent points in phase space.
B. Application of VNF to the system
Each non-central water molecule contributes a set of nine additional features to the system, for example, {RON, θON, ϕON, RH1N, θH1N, ϕH1N, RH2N, θH2N, ϕH2N} for Water N. According to Rule 1 of Section II D, one of these features must be selected such that dummy-features representing the selected feature can be assigned a value outside the sampled domain of the corresponding true-features. As the Ocentral–ON distance is already used for defining atomic indices, it is intuitive to also choose this feature for Rule 1. Thus, for the molecule Water N, RON has been selected as the feature required to satisfy Rule 1. Figure 7(a) displays the RON distributions of the non-central water molecules, sampled across all 5000 clusters, while Figure 7(b) displays their radial ranges. To satisfy Rule 1, dummy RON features must be set to a value below 2.45 Å or above 4 Å. For the energetic and electrostatic properties of interest here, the influence of one atom on another decreases as a function of the distance between the two. For this reason, each dummy ΔRON was set to a value slightly greater than the maximum value of its corresponding true-features. Such a choice of ΔRON can also be considered to approximately represent the existence of dummy molecules slightly beyond the cutoff radius of 4 Å (had ΔRON been set to a value below 2.45 Å, the choice could be considered to approximately represent the existence of dummy molecules closer than 2.45 Å to the centre). As all true RON features for the system here have a maximum of approximately 4.00 Å, all corresponding dummy-features were set to 4.05 Å. According to Rule 2, dummy-features representing the eight remaining features used to describe a given non-central water molecule were set to the mean value of their corresponding true-features.
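The two rules above can be sketched in code. This is a minimal illustration, assuming each training point is stored as an (n_molecules × 9) array with RON in the first column; the array layout and helper names are our own, not part of the published method.

```python
import numpy as np

def pad_features(clusters, n_mols_max):
    """Pad variable-length feature sets with dummy values (a sketch of the
    VNF rules; all names here are illustrative).

    `clusters` is a list of 2-D arrays, one per training point, of shape
    (n_mols, 9), molecules already ordered by the indexing scheme, with
    the R_ON distance in column 0.
    """
    # Rule 1: dummy R_ON set slightly beyond the largest true R_ON
    # (the paper uses a fixed 4.05 angstroms for a ~4.00 angstrom maximum).
    r_dummy = max(c[:, 0].max() for c in clusters) + 0.05
    padded = []
    for c in clusters:
        rows = [c]
        for slot in range(c.shape[0], n_mols_max):
            # Rule 2: the other eight dummy features take the mean of the
            # true-features for this molecule slot across the training set.
            have = [d[slot] for d in clusters if d.shape[0] > slot]
            dummy = np.mean(have, axis=0)
            dummy[0] = r_dummy
            rows.append(dummy[None, :])
        padded.append(np.concatenate(rows).ravel())
    return np.array(padded)
```

After padding, every training point has the same dimensionality (9 × n_mols_max), so a standard fixed-input kriging engine can be used unchanged.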
Distance properties of all non-central oxygens sampled from 5000 water clusters of 4 Å radius. The labels in the legend denote the colour of the Nth non-central water molecule, as defined in Section III A. Thus, the oxygen labelled 14th corresponds to the 14th non-central oxygen, which is the 15th oxygen of the system according to Figure 5. (a) Distance distributions. (b) Range of RON features.
IV. RESULTS
The performance statistics of the IQA kriging models for the 4 Å clusters are displayed in Table II and are labelled “VNF.” For comparison, the performance statistics of two additional models are also displayed: (1) a set created by truncating the features of higher-dimension training points such that all training points are described by the number of features describing the lowest-dimensionality training point (which corresponds to 5 molecules for the set of clusters considered here) and (2) a set of ten-molecule clusters (decamers), where all configurations contain ten molecules, regardless of cluster radius, as published by Davie et al.16 These additional models are labelled “5 mol” and “10 mol,” respectively. Each kriging model was trained on a data-set of 5000 training points using 5-fold cross validation, with a 1:4 partitioning of training set to test set. In other words, the 5000 training points were evenly split into five 1000-training-point models, with each model being tested on the set of points not used for that specific model. The kriging models are compared through both the mean absolute error (MAE), obtained by comparing the predicted property values to the corresponding true property values, and the q2 correlation coefficient,
$$ q^2 = 1 - \frac{\sum_{i=1}^{N_\text{test}} (P_i - T_i)^2}{\sum_{i=1}^{N_\text{test}} (M - T_i)^2}, $$

where Pi is the predicted value of test point i; Ti is the true value of test point i; M is the mean of the entire test set; and Ntest is the number of points (i.e., 4000) the model is tested on. Thus, the q2 metric has the intuitive property of being equal to 1 when predictions are equal to the true values (Pi − Ti = 0), and equal to 0 when predictions are no better than those obtained by using the simplest unbiased estimator, the mean (i.e., when Pi − Ti = M − Ti). In this way, five MAE (see the 5-fold cross validation above) and q2 values were obtained per model, for which the average and standard deviation were calculated and reported in Table II.
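The two validation metrics can be computed directly from predicted and true values; the sketch below assumes NumPy arrays and follows the definitions given above.

```python
import numpy as np

def mae(pred, true):
    """Mean absolute error between predicted and true property values."""
    return np.mean(np.abs(pred - true))

def q2(pred, true):
    """q2 correlation coefficient: 1 for perfect predictions, 0 when the
    predictions are no better than the test-set mean."""
    m = true.mean()                      # M, mean of the entire test set
    ss_res = np.sum((pred - true) ** 2)  # sum of (P_i - T_i)^2
    ss_tot = np.sum((m - true) ** 2)     # sum of (M - T_i)^2
    return 1.0 - ss_res / ss_tot
```

Predicting exactly the test-set mean for every point gives q2 = 0, the stated baseline behaviour.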
Prediction statistics for the ESelf and EInter of the oxygen and hydrogen atoms of the central water molecule of a 4 Å water cluster. All energies are in kJ/mol, and uncertainties represent ±1 standard deviation. Averages obtained from five 1000-training-point models, tested on 4000 test points each.
| Property | Model | Mean value | Range | MAE | q2 |
|---|---|---|---|---|---|
| Oxygen ESelf | VNF | −196 714.1 | 286.9 | 10.3 ± 0.1 | 0.90 ± 0.00 |
| | 5 mol | | | 23.0 ± 0.4 | 0.52 ± 0.02 |
| | 10 mol | −196 708.8 | 243.9 | 10.6 ± 0.2 | 0.90 ± 0.00 |
| Oxygen EInter | VNF | −1 531.6 | 397.9 | 16.4 ± 0.7 | 0.88 ± 0.01 |
| | 5 mol | | | 39.6 ± 0.4 | 0.32 ± 0.01 |
| | 10 mol | −1 540.6 | 407.7 | 15.3 ± 0.6 | 0.91 ± 0.01 |
| Hydrogen ESelf | VNF | −738.9 | 136.4 | 9.6 ± 0.3 | 0.69 ± 0.02 |
| | 5 mol | | | 11.4 ± 0.2 | 0.58 ± 0.01 |
| | 10 mol | −736.1 | 121.6 | 10.3 ± 0.1 | 0.66 ± 0.01 |
| Hydrogen EInter | VNF | −523.6 | 139.3 | 6.8 ± 0.3 | 0.84 ± 0.01 |
| | 5 mol | | | 13.2 ± 0.1 | 0.43 ± 0.02 |
| | 10 mol | −525.3 | 120.8 | 6.7 ± 0.0 | 0.85 ± 0.00 |
For the energetic properties considered, the VNF kriging models either outperformed or were within statistical error of the decamer models with respect to the MAE. In contrast, the feature-truncated 5 mol models performed significantly worse. When compared to the 5 mol models, the VNF models returned oxygen ESelf, oxygen EInter, hydrogen ESelf, and hydrogen EInter MAE values 55%, 59%, 16%, and 48% lower, respectively. Although the returned MAE and q2 values of the 10 mol models were similar to the MAE and q2 values obtained from the VNF models, the range of the energetic properties of the 4 Å clusters was on average 11% higher than the range from the decamer clusters. Thus, when the MAE is viewed as a percentage of the property range (MAE%), most VNF models perform better than their corresponding 10 mol models (Figure 8). As both the VNF and 10 mol models are trained with complete descriptions of their respective cluster configurations, the difference in overall performance between the two sampling methods implies that IQA energies are modelled better as a function of the atomic positions of a cluster of fixed radius than as a function of the atomic positions of a cluster with a fixed number of molecules. This is particularly true for the hydrogen ESelf energy, a property known to be difficult to predict.16
As a guide for molecular comparison, Table III shows approximated prediction metrics for the complete central molecule of a cluster. True full-molecule energies were not obtained due to the expense of calculating the true properties of both hydrogen atoms, when only one hydrogen atom is required for the atomistic properties this paper has focused on. Instead, the energies of one random oxygen and two random hydrogens were combined to approximate the energetics of a complete molecule. Thus, each “molecule” possessed one set of oxygen energies (i.e., the ESelf and EInter of a single, random oxygen atom) and two sets of hydrogen energies (i.e., the ESelf and EInter of two respective, random hydrogen atoms). The relatively low MAE value is in part due to the decreased range of the total energy relative to certain atom-specific energies in Table II, and in part due to the compensatory nature of the errors for each individual atom: when the self-energy of an atom was overpredicted, the atom’s interaction energy was usually underpredicted. The root mean squared error (RMSE) has been included to aid in comparison to the literature. The RMSE value obtained here is significantly higher than that obtained by Morawietz and Behler,32 who used Neural Networks to predict the energies of clusters up to ten molecules in size with a mean RMSE of 0.22 kJ/mol. In that work, however, clusters were sampled at much lower temperatures (linearly distributed between 50 and 250 K) representing a different phase (i.e., ice vs liquid), and a more realistic potential was used in the simulation that provided the configurations (which has been shown to reduce errors). Furthermore, each cluster above 5 molecules in size was individually sampled and trained with between 1700 and 4700 structures per cluster. In comparison, in the current work, only a single model per energy component has been used for all cluster sizes (from 5 to 15 molecules) and trained with only 1000 training points.
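The random-pairing approximation described above can be sketched as follows. This is an illustration under our own assumptions: each argument holds per-atom prediction errors (predicted minus true), and the pairing is a simple random permutation.

```python
import numpy as np

rng = np.random.default_rng(0)

def molecular_errors(o_self, o_inter, h_self, h_inter):
    """Approximate per-molecule prediction errors by randomly pairing one
    oxygen with two hydrogens. Each hydrogen keeps its own ESelf/EInter
    error pair (same atom index for both components)."""
    n = len(o_self)
    o = rng.permutation(n)
    h1, h2 = rng.permutation(n), rng.permutation(n)
    return (o_self[o] + o_inter[o]
            + h_self[h1] + h_inter[h1]
            + h_self[h2] + h_inter[h2])

def rmse(err):
    """Root mean squared error of a set of prediction errors."""
    return np.sqrt(np.mean(err ** 2))
```

Summing signed per-atom errors in this way lets the compensatory over/under-prediction noted above partially cancel in the molecular total.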
Approximated prediction statistics for the total energy of the central water molecule of a 4 Å water cluster. All energies are in kJ/mol, and uncertainties represent ±1 standard deviation. Averages obtained from five 1000-training-point models, tested on 4000 test points each.
| Model | Mean value | Range | MAE | RMSE | q2 |
|---|---|---|---|---|---|
| VNF | −200 770.7 ± 0.5 | 242.9 ± 16.1 | 16.9 ± 0.7 | 21.3 ± 0.8 | 0.61 ± 0.04 |
Figure 9 displays the full distribution of energy property errors in S-curve format. The S-curve is a method of displaying the error distribution of a set of predictions in a cumulative (i.e., summed or integrated) way, showing a percentile (%) on the ordinate. This format differs from the common error distribution histogram, which can be seen as the differential of the S-curve. Whereas a distribution histogram generally draws the eye to the mode (i.e., the value corresponding to the maximum of the distribution), the S-curve presents the data in a way that makes ascertaining the various percentile values (such as the 50th-, 90th-, and 95th-percentile values) convenient. This makes the S-curve an intuitive means of quickly determining the characteristics of a distribution of errors that are important from a computational chemistry perspective. Figures 9(a), 9(b), and 9(d) display very similar distributions of prediction errors between the VNF and 10 mol models, and noticeably worse prediction errors for the 5 mol models. A similar trend is also seen in Figure 9(c), but there the 10 mol model performs slightly worse relative to the VNF model than for the previous properties. Again, this result implies that the energetic properties of clusters selected using cutoff radii are easier to model with kriging than those of clusters selected to have a fixed number of molecules.
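A minimal sketch of constructing such an S-curve from a set of prediction errors (array and function names are ours):

```python
import numpy as np

def s_curve(errors):
    """Cumulative error distribution: sorted absolute errors on the
    abscissa, percentile (0-100) on the ordinate."""
    abs_err = np.sort(np.abs(errors))
    pct = 100.0 * np.arange(1, len(abs_err) + 1) / len(abs_err)
    return abs_err, pct

def percentile_value(errors, p):
    """Absolute error below which p percent of predictions fall."""
    return float(np.percentile(np.abs(errors), p))
```

Reading the curve at a given ordinate directly gives the corresponding percentile error, e.g. the 95th-percentile absolute error.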
Error distributions of IQA kriging models: (a) oxygen ESelf, (b) oxygen EInter, (c) hydrogen ESelf and (d) hydrogen EInter.
In addition to the IQA energies, kriging models were also built for the QTAIM-obtained electrostatic multipole moments: the charge (Q[0,0]), the components of the dipole moment (Q[1,0], Q[1,1,c], Q[1,1,s]), and the components of the quadrupole moment (Q[2,0], Q[2,1,c], Q[2,1,s], Q[2,2,c], Q[2,2,s]). As with the previous energetic models, each kriging model was trained on the data-set of 5000 training points using 5-fold cross validation, with a 1:4 partitioning of training set to test set. Figure 10 displays the MAE, q2, and MAE% of both the oxygen and hydrogen multipole moment kriging models. Across all moments, the VNF and 10 mol models once again perform similarly, and together they both significantly outperform the 5 mol models. Once again, the q2 and MAE% validation metrics indicate that the VNF models generally outperform their 10 mol equivalents. It is worth noting that, compared to the models used to predict the IQA properties, the 5 mol models used to predict the multipole moments performed much better relative to the corresponding VNF and 10 mol models. In fact, the average MAE% over all IQA energies for the 5 mol models was 94% higher than that for the equivalent VNF models, while the average MAE% over all multipole moments for the 5 mol models was only 36% higher than for the equivalent VNF models. As the 5 mol model approximates a water molecule and its first solvation shell, the relative MAE% results imply that the IQA energetic description of a water molecule at the centre of a cluster is more sensitive to the positions of water molecules beyond the first solvation shell than the QTAIM multipolar electrostatic description of the molecule.
Validation results of the oxygen and hydrogen multipole moment kriging models showing (a) oxygen MAE, (b) oxygen MAE%, (c) oxygen q2, (d) hydrogen MAE, (e) hydrogen MAE%, and (f) hydrogen q2.
V. CONCLUSION
As molecular interaction strength generally decreases with distance, the properties of a selected atom or molecule in a molecular simulation have traditionally been modelled as a function of the surrounding atoms within a given cutoff radius. However, many machine learning methods, including kriging, have traditionally required a fixed number of inputs to model a given property. Fixed-number-of-input kriging techniques offer no flexibility to handle the movement of atoms or molecules into or out of a radius of physical influence. As a result, direct application of such techniques to molecular simulation risks decreased efficiency if the selected number of inputs is too high, or decreased accuracy if it is too low.
In this work, a new method of applying kriging to systems with a variable number of features (VNFs) has been developed and applied to the prediction of 54 properties of a water molecule at the centre of a 4 Å radius cluster. The presented method is general, is theoretically justified by the fact that it cannot result in a singular correlation matrix, and does not discard any input information. Models created using the VNF method have been shown to perform as well as, or better than, the equivalent models for a ten-molecule cluster with a fixed number of inputs. Similarly, the developed VNF protocol was shown to significantly outperform models created using an alternative, naïve feature-truncating approach. Finally, as the VNF protocol does not suffer from difficulties stemming from identical sets of inputs corresponding to different outputs or input biasing, it is particularly suitable for interpolating machine learning methods.
Acknowledgments
We are grateful to the EPSRC for the award of an Established Career Fellowship (No. EP/K005472), which funds all three authors.