We rely on a total of 23 (cluster size, 8 structural, and 14 connectivity) descriptors to investigate structural patterns and connectivity motifs associated with water cluster aggregation. In addition to the cluster size n (number of molecules), the 8 structural descriptors can be further categorized into (i) one-body (intramolecular): covalent OH bond length (rOH) and HOH bond angle (θHOH), (ii) two-body: OO distance (rOO), OHO angle (θOHO), and HOOX dihedral angle (ϕHOOX), where X lies on the bisector of the HOH angle, (iii) three-body: OOO angle (θOOO), and (iv) many-body: modified tetrahedral order parameter (q) to account for two-, three-, four-, five-coordinated molecules (qm, m = 2, 3, 4, 5) and radius of gyration (Rg). The 14 connectivity descriptors are all many-body in nature and consist of the AD, AAD, ADD, AADD, AAAD, AAADD adjacencies [number of hydrogen bonds accepted (A) and donated (D) by each water molecule], Wiener index, Average Shortest Path Length, hydrogen bond saturation (% HB), and number of non-short-circuited three-membered cycles, four-membered cycles, five-membered cycles, six-membered cycles, and seven-membered cycles. We mined a previously reported database of 4 948 959 water cluster minima for (H2O)n, n = 3–25 to analyze the evolution and correlation of these descriptors for the clusters within 5 kcal/mol of the putative minima. It was found that rOH and % HB correlated strongly with cluster size n, which was identified as the strongest predictor of energetic stability. Marked changes in the adjacencies and cycle count were observed, lending insight into changes in the hydrogen bond network upon aggregation. A Principal Component Analysis (PCA) was employed to identify descriptor dependencies and group clusters into specific structural patterns across different cluster sizes. The results of this study inform our understanding of how water clusters evolve in size and what appropriate descriptors of their structural and connectivity patterns are with respect to system size, stability, and similarity. The approach described in this study is general and can be easily extended to other hydrogen-bonded systems.

Water clusters are useful models for understanding hydrogen bonding and chemical transformations in aqueous environments via microsolvation approaches.1–5 These systems have been used, among others, to study proton-transfer mechanisms in protonated water clusters,6 to identify the origin of spectral signatures in IR spectra7–9 and to quantify strong cooperative effects in hydrogen bonding networks.10–13 While water clusters have definitely claimed their space and significance in the chemical physics and simulation communities, the question of how the information gained from small water clusters translates to the liquid and/or ice phases still remains an active field of research. For example, water clusters typically have far more two- and three-coordinated water molecules, whereas the condensed phase has mostly four-coordinated molecules. For the smaller clusters, the need to maximize hydrogen bonding results in all atoms lying on the surface of the cluster,14 whereas the smallest water cluster with a “fully solvated” water molecule appears at n ∼ 17.15,16 In addition to the different connectivity, small water clusters have a smaller extended hydrogen bonding network than liquid water and ice (in which the long-range structure is ordered). While it is known that some features of water clusters evolve with increasing cluster size,17 it is not fully understood at what point the clusters begin to exhibit substructures and/or structural parameters akin to those found in liquid water and ice. Improving our knowledge of this transition via appropriate descriptors could aid in our fundamental understanding of water aggregation while at the same time assisting in the development of model systems to effectively probe the physical phenomena in condensed phase aqueous environments.

Previous studies have reported and used descriptors of the local and extended hydrogen bond environment in bulk systems to distinguish ice-like from liquid-like structures18–20 and also to identify unique types of solvation environments during liquid water simulations.21–23 These descriptors of bulk environments typically take into account the relative orientations of numerous surrounding molecules, such as the orientational tetrahedrality order parameter (q),24,25 the translational tetrahedrality order parameter (sk),24 the Steinhardt order parameters,26–29 and the local structure index (LSI).30 Other studies have noted the importance of extended structural descriptors,31 such as incorporating a measure of translational order between the first and second solvation shells (ζ)32 and the node total communicability (NTC).22 The authors of the work of Manserrat et al.33 used the smooth overlap of atomic positions (SOAP) descriptor to demonstrate the presence of ice-like substructures in liquid simulations, based on similarities in their local environment. However, a subsequent study of Donkor et al.34 demonstrated the dissimilarity of the SOAP parameter and other commonly used local structure parameters in detecting similar molecular environments. However, there is still much to understand about the local and extended structure of bulk systems, especially the emergence of bulk-like qualities upon aggregation. Furthermore, while structural and connectivity descriptors have been applied widely in bulk (ice and liquid) systems, there has been little work to develop descriptors of aggregation in smaller cluster systems having much smaller hydrogen bond networks with properties that are further away from the ones for condensed environments.

The goal of this work is to develop structural and connectivity descriptors for the water cluster minima in the range n = 6–25 contained in a recent database of 4 948 959 structures that was created as a result of extensive Monte Carlo Temperature Basin-Paving (MC-TBP) simulations35 with the TTM2.1-F polarizable potential.36 We specifically focus on a subset of the full database containing 162 892 water cluster minima that are within 5 kcal/mol of the putative minima for water cluster sizes n = 6–25. The restriction of the existing dataset to low-energy structures is likely sufficient in describing minimum-like structures. However, caution should be taken when using the descriptors to describe high-energy or nonstationary point structures that lie outside of the ranges used in the training and analysis. First, we define the descriptors we use and analyze the evolution of these descriptors for increasing sizes of water clusters in an attempt to provide insight into the transition from small to larger water clusters. While the binding energy (stability) of a cluster naturally evolves as a function of the size of the cluster as an extensive quantity, other structural descriptors are found to correlate strongly with the binding energy, lending insight into emerging stable features upon aggregation. Additionally, the correlations between descriptors are examined to better understand how these evolve in tandem upon aggregation. Lastly, an unsupervised learning approach was implemented to group water clusters of different sizes in the database based on similarities in their structural and connectivity descriptors. This presents a unique way of classifying water clusters of different sizes with the goal of identifying similar patterns across the size regime n = 6–25. The current study aims at developing descriptors of water aggregation up to 30 molecules so they can be used to understand the formation of a critical nucleus for water aggregation. Several of these descriptors, including their trends, are transferable to liquid water and/or ice, albeit the networks found in these smaller clusters (formed by the need to maximize hydrogen bonding, resulting to most atoms lying on the surface of the water cluster up to n ∼ 17) are quite different than those present in condensed phase environments. The fact that many highly accurate computational techniques suffer from poor scaling with respect to the system size hinders their application to very large clusters in conjunction with sampling techniques such as basin hopping Monte Carlo or to the bulk via molecular dynamics simulations. The current work offers the possibility to use the developed descriptors to compare the similarity between clusters regardless of size. Altogether, this work identifies descriptors of water aggregation using data science to better understand how hydrogen bonding structures change during the transition from small gas-phase systems to larger aggregates of water molecules, and sets the stepping stone for their future application to condensed phase aqueous environments.

The descriptors used in this work are defined below. It is intuitive to rely on the many-body concept to categorize the various descriptors as one-body (intramolecular), two-body (intermolecular and pairwise), and many-body (intermolecular and cooperative). Utilizing the Cartesian geometry of each cluster, one can collect a large set of descriptors for localized geometries that will become the one-, two-, three-, and many-body structural descriptors. Graphs were constructed as previously described,35 where each node is a water molecule and each edge is a hydrogen bond satisfying the rψ definition of the hydrogen bond outlined in the work of Kumar et al.37 These graphs are undirected and contain adjacency information as labels of each node. More delocalized properties were derived from the graph representation of each cluster. These descriptors are termed connectivity descriptors because they contain information about the hydrogen bond network. Certain descriptors can have multiple values across a whole cluster. In cases where a single value per cluster is required, the average of the descriptor over the cluster will be used and will be denoted with angled brackets, viz. ⟨⋯⟩.

Two one-body descriptors were utilized to describe the intramolecular geometry. These are the OH bond lengths (rOH) and the HOH bond angle (θHOH) shown in Fig. 1(a). Three two-body descriptors were used to describe each hydrogen bond. From the six intermolecular degrees of freedom in a water dimer, the three most important coordinates for a hydrogen bond—the OO distance (rOO), the OHO angle (θOHO), and the HOOX dihedral angle (ϕHOOX), where X lies on the bisector of the HOH angle—were chosen. These are depicted in Fig. 1(b). The rOO has previously been shown to have a strong relationship with the vibrational frequencies as well as the hydrogen bond energy.38,39 The θOHO angle describes the linearity of each hydrogen bond; close-to-linear hydrogen bonds are known to be stronger.38,39 The ϕHOOX angle is used to quantify the cis-trans-dihedral angle that is important for the “strong–weak” nearest-neighbor hydrogen bond definition introduced in the work of Kirov et al.40 These authors argued that trans (ϕHOOX near 180°)-hydrogen bonds are more favorable than cis (ϕHOOX near 60°)-hydrogen bonds in polyhedral water clusters,40 since in the gas-phase water dimer the trans-conformation is about 1 kcal/mol more stable than the cis-conformation.41 This descriptor is defined as the dihedral angle between the hydrogen on the donating water that is not participating in the hydrogen bond, the two oxygen atoms, and the bisector of the two hydrogen atoms on the accepting water molecule [Fig. 1(b)]. Three- and many-body descriptors become increasingly complex, with 21 total degrees of freedom for a trimer. It, therefore, becomes intangible to analyze all three-body degrees of freedom. Instead, we will only focus on the OOO angle (θOOO) [Fig. 1(c)]. This descriptor is the angle between a central oxygen atom and two oxygen atoms participating in a hydrogen bond with the central molecule.

FIG. 1.

Structural descriptors. Water molecules are depicted as balls and sticks, red is used for the oxygen and white for the hydrogen atoms. X denotes the bisector of the HOH angle. (a) The two intramolecular (one-body) descriptors. (b) Three intermolecular (two-body) descriptors of a hydrogen bond between adjacent molecules: hydrogen bond distance (rOO), hydrogen bond angle (θOHO), and ϕHOOX. The gray vertical plane contains the free hydrogen of the donor and the two oxygen atoms, while the purple one contains the bisector of the hydrogen atoms on the hydrogen bond acceptor and the two oxygen atoms. The dihedral angle, ϕHOOX, is the angle between these two planes. (d) The O–O–O angle (three-body) descriptor (θOOO). (e) Examples of the most common adjacencies, ranging from AD to AAADD, for the highlighted central water molecule (A = acceptor, D = donor).

FIG. 1.

Structural descriptors. Water molecules are depicted as balls and sticks, red is used for the oxygen and white for the hydrogen atoms. X denotes the bisector of the HOH angle. (a) The two intramolecular (one-body) descriptors. (b) Three intermolecular (two-body) descriptors of a hydrogen bond between adjacent molecules: hydrogen bond distance (rOO), hydrogen bond angle (θOHO), and ϕHOOX. The gray vertical plane contains the free hydrogen of the donor and the two oxygen atoms, while the purple one contains the bisector of the hydrogen atoms on the hydrogen bond acceptor and the two oxygen atoms. The dihedral angle, ϕHOOX, is the angle between these two planes. (d) The O–O–O angle (three-body) descriptor (θOOO). (e) Examples of the most common adjacencies, ranging from AD to AAADD, for the highlighted central water molecule (A = acceptor, D = donor).

Close modal
The tetrahedral order parameter (q) is a combinatorial metric of the distribution of bonds around an atom or molecule.24 Our analysis uses the definition adopted by Errington and Debenedetti,25,
q4=138i3j>i4cos(θi,c,j)cos(109.5°)2,
(1)
where θi,c,j is the angle formed by two water molecules, i and j, around a central, c, water molecule. Eq. (1) sets a range between q = 0 (stochastic distribution observed in an ideal gas) and q = 1 (completely tetrahedral order). Original and subsequent work using the tetrahedral order parameter was based on the four closest water molecules to define q.24,25,42 Water clusters, however, may lack four neighbors in the first solvation shell, especially at the surface of the cluster and most definitely for the smaller cluster size networks (n < 17). Therefore, neighbors will be determined through hydrogen bonds. The definition for tetrahedrality has been expanded in this work to accommodate a water molecule with two, three, and five (in addition to four for the standard definition) hydrogen bonds. Following the procedure outlined by Errington and Debenedetti,25 the coefficient of the summation is modified such that the expectation value is still zero for an ideal gas. That is,
qn=1Anin1j>in0πcos(θi,c,j)cos(109.5°)2×sin(θi,c,j)dθi,c,j=0.
(2)
Solving for the coefficient yields An=94n2=92n22n, where n is the number of water molecules bound to the central water of interest. Note that A4 reverts back to 38. Therefore, the general tetrahedral order parameter (qn) that will be used in this work is
qn=192n(n1)in1j>incos(θi,c,j)cos(109.5°)2.
(3)
The newly defined qn value still quantifies how close the orientations of the hydrogen bonds are to 109.5° (tetrahedral) but can also account for a different number of hydrogen bonds to the central atom. For example, q3 can be thought of as how far a three-coordinated molecule is from trigonal pyramidal. In the following, q will be used to denote the modified tetrahedrality order parameter regardless of coordination (i.e., it incorporates qm, m = 2–5).
An additional many-body structural descriptor is the radius of gyration (Rg), which describes the radial distribution of particles away from the center of mass. It is defined as
Rg=inrircom2n,
(4)
where n is the number of water molecules, ri is the position of each water molecule, and rcom is the position of the center of mass.

Many-body connectivity descriptors are derived from the graph representation of the clusters and generally describe the connectivity of the hydrogen bonding network. Specifically, each cluster is represented by an undirected graph where each node is a water molecule and each edge is a hydrogen bond. Several of these descriptors have been the focus of previous studies,17 but they will be revisited and investigated more in depth. One such descriptor is the hydrogen bond saturation (% HB) or the percent of hydrogen atoms participating in a hydrogen bond. It is simply edges2nodes, where the number of edges and nodes come from the graph representation of each cluster, or nHB2nH2O, where nHB is the number of hydrogen bonds and nH2O is the number of water molecules.

Another connectivity descriptor is the adjacency—the number of hydrogen bonds accepted (A) and donated (D) by each water molecule. The concept of adjacency has been extensively used by infrared (IR) spectroscopists to correlate with shifts in the observed vibrational spectral bands.43 These adjacencies range from D and A, or single donors and single acceptors respectively, to AAADD, which refers to a penta-coordinated water molecule that accepts three and donates two hydrogen bonds [Fig. 1(d)]. Although adjacency can be easily computed by finding the directionality of each hydrogen bond, adjacencies were provided as node labels in the graph representation of each cluster. To minimize size extensive behavior, the percentage of water molecules in each adjacency class will be used.

Other connectivity descriptors are slightly more complex. The Wiener topological index is the sum of the shortest paths from all nodes to all other nodes.44 It describes how connected a graph is, with lower values describing graphs with stronger connections and shorter paths. The Wiener index of alkanes44 was found to correlate with their respective boiling points, but applications have expanded its use to predict other thermochemical properties and toxicity of bioactive molecules.45 Closely related is the Average Shortest Path Length (ASPL), which is the Wiener index divided by the number of paths [n2, where n is the number of water molecules], making the ASPL less size extensive than the Wiener index. An example graph and corresponding shortest paths are shown in Fig. 2(a) to illustrate the computation.

FIG. 2.

Descriptors that are visual in nature are depicted for clarity. (a) The shortest paths for a pentagonal graph of (H2O)5, which are enumerated to compute the values of the Wiener index (15) and the ASPL (1.5). (b) Enumeration of the non-short-circuited cycles in the pentagonal prism (H2O)10.

FIG. 2.

Descriptors that are visual in nature are depicted for clarity. (a) The shortest paths for a pentagonal graph of (H2O)5, which are enumerated to compute the values of the Wiener index (15) and the ASPL (1.5). (b) Enumeration of the non-short-circuited cycles in the pentagonal prism (H2O)10.

Close modal

Non-short-circuited cycles are the final descriptor in the connectivity category considered in this study. Non-short-circuited cycles, as described by Rahman and Stillinger, are hydrogen-bonded rings in water that are not cut short by additional hydrogen bonds.46 Mathematically, that means the shortest paths (or tied for shortest paths) between all sets of two nodes in the cycle are along that cycle. The non-short-circuited cycles for the pentagonal prism are depicted in Fig. 2(b). This produces a nontrivial problem where brute-force method for computation, i.e., finding all cycles and checking for short-circuited behavior, is too expensive. Instead, we used an optimized algorithm to find good candidates for non-short-circuited cycles via King’s shortest paths47 and then checked for short-circuited behavior. The pseudocode is described in algorithm S1, and the code to perform such searches can be found at https://github.com/gds001/Descriptors-for-Water-Aggregation.git. To minimize size extensive behavior, the percentage of the cycles of each size were used.

To explore and utilize the correlation between the various descriptors, various supervised and unsupervised machine learning (ML) techniques were employed. Regression, the simplest supervised machine learning model, was widely used to explore the interplay of various descriptors and properties of water clusters. Regression can not only create predictive models but can also provide insight into the correlations and relationships between descriptors. Regression models were created by minimizing the root mean squared error (RMSE) between a fitting function and the descriptors using the Python library SciPy.48 Linear regressions utilize a linear function,
F(x1,x2,,xN)=i=1Naixiμiσi,
(5)
where xi are the features of the model (descriptors or products of descriptors), μi is the average value of xi, σi is the standard deviation of xi, and ai is the fitting weight. The square of the correlation coefficient (R2) served as a good metric for the strength of the relationship. When used for prediction, multidimensional and polynomial regressions were created in a similar fashion, using initial weights determined through low-dimensional principal component (PC) models.

To simplify the configuration space and create useful clustering models, principal component analysis (PCA) was employed using scikit-learn.49 PCA has been widely used for a variety of chemical applications, ranging from property prediction to group clustering.50–54 In summary, PCA linearly transforms descriptors to pull as much variance into the principal components (PCs) and reduce the dimensions required for a successful model. Mean-centering and norm-scaling were employed on the descriptors prior to PCA. PCA can be utilized with all descriptors, which can be useful when attempting to analyze the interplay between the descriptors as a whole. However, certain descriptors are not good candidates for a PCA model, such as those that are uniform across the data set and those that do not contain information about desired properties. Including such descriptors will pull unnecessary information into higher PCs and require more dimensions for a successful model. Thus, the PCA model used in this work was constructed by selecting a subset of potential descriptors and inspecting for visual separation. Descriptors were added and removed one by one until no improvements were found.

By examining how the descriptors correlate to cluster size (n), we gain insight into how these descriptors (associated with stable clusters) evolve upon aggregation.

We will first focus on the changes in water intramolecular geometry upon aggregation via the one-body descriptors. Figure 3 depicts the distributions of rOH (left) and θHOH (middle) for water cluster sizes n = 6 (bottom of each panel) to n = 25 (top of each panel). For these finite clusters, we observe a relatively constant peak around 0.950–0.955 Å for the free rOH across all cluster sizes in the rOH distributions. However, upon increasing the cluster size, we see a splitting of the peak between 0.965 and 0.975 Å upon aggregation. This splitting suggests that those molecules exist in differing hydrogen bonding environments within the water clusters. In general, the average rOH value (dashed line) lengthens for larger cluster sizes (R2 = 0.59) in a manner that is akin to the increase in % HB. Furthermore, in the θHOH distributions, we see that the small peak between 102° and 104° disappears quickly around n = 12. The peak around 106° decreases as the cluster size increases, which is also associated with the introduction of a peak around 107°. At about n = 12, a small shoulder starts to grow in around 109°. This expansion of the θHOH relative to the gas-phase monomer value (104.5°) upon aggregation is well documented in the literature.55–61 However, the overall correlation between θHOH and cluster size is weak (R2 = 0.30) on average, as larger, bulk-like angles start to appear and the smaller gas-like angles disappear quickly upon increasing the cluster size.

FIG. 3.

Distributions of the rOH (Å, left), θHOH (middle), and q descriptors (right) for water cluster sizes n = 6 (bottom of each panel) to n = 25 (top of each panel). Colors represent different clusters and are used for clarity. Dashed vertical lines denote the arithmetic mean of each distribution.

FIG. 3.

Distributions of the rOH (Å, left), θHOH (middle), and q descriptors (right) for water cluster sizes n = 6 (bottom of each panel) to n = 25 (top of each panel). Colors represent different clusters and are used for clarity. Dashed vertical lines denote the arithmetic mean of each distribution.

Close modal

Figure 4 traces the evolution of (a) the hydrogen bond saturation, (b) the radius of gyration and ASPL, (c) the θOHO, (d) the θOOO, (e) the percentage of molecule type, and (f) the percentage of cycle counts as a function of water cluster size (n = 6–25). A few of the many-body descriptors are inherently size extensive. These include the Wiener index,17 the number of hydrogen bonds, and the total binding energy (R2 > 0.97, Fig. S1). In addition, there are many descriptors that intuitively change with increasing cluster size. For example, the % HB increases for larger cluster sizes and trends toward 100% [R2 = 0.62, Fig. 4(a)]. This is expected because the molecules in the interior of a cluster are more coordinated and, consequently, there is an on average smaller fraction of molecules with a free rOH. While we expect % HB ≃ 100 for the bulk phase, it does not exceed 90% for clusters up to n = 25. The radius of gyration and ASPL are also strongly correlated with the cluster size [R2 > 0.94, Fig. 4(b)]. This is not surprising given that both metrics quantify the size and shape of a cluster, despite being derived from different representations of the cluster (Cartesian coordinates vs a graph).

FIG. 4.

Evolution of the hydrogen bond saturation (a), radius of gyration and ASPL (b), θOHO (c), θOOO (d), percentage of molecule type (e), and percentage of cycle counts (f) as a function of water cluster size (n = 6–25). The shaded regions in panels (b), (e), and (f) represent the standard deviation across clusters of a given size. The connected markers in panels (b)–(f) represent the average across a given cluster size. The boxes in panels (c) and (d) represent each decile of the distribution. Scattered points in panels (a), (c), and (d) are statistical outliers in the distributions. The dotted line in panel (c) represents the perfectly linear hydrogen bond. The dotted line in panel (d) represents the perfectly tetrahedral θOOO.

FIG. 4.

Evolution of the hydrogen bond saturation (a), radius of gyration and ASPL (b), θOHO (c), θOOO (d), percentage of molecule type (e), and percentage of cycle counts (f) as a function of water cluster size (n = 6–25). The shaded regions in panels (b), (e), and (f) represent the standard deviation across clusters of a given size. The connected markers in panels (b)–(f) represent the average across a given cluster size. The boxes in panels (c) and (d) represent each decile of the distribution. Scattered points in panels (a), (c), and (d) are statistical outliers in the distributions. The dotted line in panel (c) represents the perfectly linear hydrogen bond. The dotted line in panel (d) represents the perfectly tetrahedral θOOO.

Close modal

In general, two-body descriptors (i.e., rOO, θOHO, and ϕHOOX) have weak correlations with cluster size. The dihedral angles and resulting ratios of cis-vs trans-hydrogen bonds stays relatively consistent at 30%–40% cis-hydrogen bonds (0°–90°) and 60%–70% trans-hydrogen bonds (90°–180°). The rOO distance, on average, is smaller for large cluster sizes (R2 = 0.38), while the θOHO [R2 = 0.31, Fig. 4(c)] gradually becomes more linear with increasing cluster size. In addition, there exists a correlation between θOOO and the size of the cluster [R2 = 0.79, Fig. 4(d)]. On average, the θOOO increases from an average of 87.6° for n = 6 to an average of 103.0° for n = 25. However, the correlation with tetrahedral order (q) is weaker (R2 = 0.31). Since distribution plots for that descriptor have been reported for liquid simulations,24,25,42,62–64 we provide the distributions of the tetrahedral order (q) for cluster sizes n = 6–25, for comparison (Fig. 3, right panel). As observed in liquid simulations,25,42,62–64 the distribution is bimodal. However, the peak existing at lower q-values is more prominent than the corresponding one in liquid water, indicating the large ratio of molecules exhibiting a weaker local tetrahedral ordering in the clusters compared to liquid water. However, there is a substantial amount of molecules approaching q = 1, indicating the emergence of fully solvated molecules having four nearest neighbors with a near tetrahedral local structure.

The evolution of the hydrogen bond network (connectivity descriptors) with cluster size is of particular interest. Figure 4(e) traces the changes in the fraction of molecule types (local bonding information) as a function of cluster size. For small water clusters (n = 6–7), the AD-type water molecules dominate with small contributions from ADD and AAD. The fraction of AD water molecules sharply decreases up to size n = 8 and continues to decrease for larger cluster sizes. It is at this size regime that the energetically stable n = 8 cube appears, which is entirely ADD/AAD. This structural motif persists through larger clusters and is likely the cause for the low % of AD molecules in the low-energy subset of the full cluster database used in this study. The fractions of ADD- and AAD-type water molecules grow at nearly the same rate, reaching a maximum at n = 8 and staying somewhat constant beyond that point. Eventually, at around n = 22 the AADD-type water molecules become the most prevalent (approaching 40%, on average) while the ADD- and AAD-type ones decrease slightly. This demonstrates the point at which the AADD-type molecules become the most prevalent type in the cluster. Regardless, even at n = 25 we see that roughly 60% of the water molecules are still three-coordinated, on average.

The average number of hydrogen bonds in liquid water is about 3.5,37 suggesting limiting values of AADD-type molecules in liquid water. Furthermore, a small percentage of water molecules are penta-coordinated (AAADD-type) for n > 10. The AAADD molecules become more prevalent for larger water clusters albeit never exceeding 17% for a single cluster. The emergence of the AAADD molecules is accompanied by a simultaneous loss of AAD molecules (Fig. S4), demonstrating the role of the AAADD molecules in balancing the number of hydrogen bond donors in the cluster.

Figure 4(e) traces the changes in the fraction of molecule types (local bonding information) and the fraction of cycle counts as a function of cluster size. We observe significant changes in the cycles that are produced as part of the extended hydrogen bond environment [Fig. 3(f)]. For small water clusters, the small cycles (3 and 4) naturally dominate. While these ring sizes are often strained, they help to maximize the number of hydrogen bonds for small clusters. However, the five-membered cycles abruptly become the dominant cycle size for n > 16 with the four-membered cycles steadily decreasing. The six-membered cycles very slowly increase in prevalence but never surpass roughly 20% of the cycles in a cluster for n = 25. Given that hexagonal ice (ice Ih) has only six-membered rings,65 we anticipate the prevalence of six cycles in stable water cluster minima to increase for larger cluster sizes (n > 25).

It is expected that several descriptors will correlate with one another. Some of these correlations are an artifact of the way each descriptor is defined, whereas others provide insight into the structural and connectivity patterns present in the water cluster minima. The coefficients are supplied in Table S1, and correlation plots of the various descriptors are depicted in Figs. S2–S4. The various correlations are summarized by the corresponding strengths (correlation coefficients) traced by colors in Fig. 5. Descriptors that have no correlations, such as ⟨ϕHOOX⟩, % AAADD, and % of six cycles, are excluded for clarity.

FIG. 5.

The strengths of the correlations between the various descriptors are depicted in the chord plot. Blue lines denote positive correlations, whereas red lines denote negative correlations. The darkness of the colors encodes the strength of the correlations, with dark blue and dark red having R2 values near 1, and lighter (pale) lines representing R2 values near 0.5. Correlations with R2 < 0.33 are excluded for clarity. The outer circle shows what the descriptors best describe—energy (purple), cycles (yellow), size (orange), and adjacencies (green).

FIG. 5.

The strengths of the correlations between the various descriptors are depicted in the chord plot. Blue lines denote positive correlations, whereas red lines denote negative correlations. The darkness of the colors encodes the strength of the correlations, with dark blue and dark red having R2 values near 1, and lighter (pale) lines representing R2 values near 0.5. Correlations with R2 < 0.33 are excluded for clarity. The outer circle shows what the descriptors best describe—energy (purple), cycles (yellow), size (orange), and adjacencies (green).

Close modal

The descriptors appear to fall into two groupings that are related to size and hydrogen bond strength. There is a strong correlation between ASPL and Rg (R2 = 0.98). There is no mathematical reasoning for the correlation in the formulas for these descriptors, suggesting there are two distinct ways in describing a similar property using either the Cartesian coordinates (Rg) or graphs (ASPL). The correlation is, however, logical, since both ASPL and Rg increase with cluster size and decrease with the spherical nature of the cluster. Closely correlated to ASPL and Rg is the Wiener index (R2 = 0.93), which is similar in computation to the ASPL. Another correlation exists with the aforementioned descriptors and ⟨θOOO⟩ (R2 ≈ 0.8). The origin of this latter correlation is not clear, but it serves as yet another link between structural and connectivity descriptors. A natural correlation exists between % HB and ⟨rOH⟩ (R2 = 0.9). This arises from the lengthening of rOH due to the formation of hydrogen bonds.7 In the presence of additional hydrogen bonds (higher % HB), rOH will be more elongated, therefore ⟨rOH⟩ will be greater. Altogether, these six descriptors form the core of one group of correlations. As discussed earlier, all these descriptors have strong correlations with size.

Another set of correlations exists between the rest of the structural descriptors. The strongest correlation in this group is between ⟨q⟩ and ⟨θOHO⟩ (R2 = 0.76). It translates to the realization that water molecules in a water cluster are on average more tetrahedral when the hydrogen bonds are more linear. This is reasonable since hydrogen bonds that are far from 109.5° (tetrahedral) generally bend their hydrogen bond to increase hydrogen-lone-pair interactions.66 It is interesting that the correlation between ⟨q⟩ and ⟨θOHO⟩ is stronger (R2 = 0.76) than the correlation between ⟨q⟩ and ⟨θOOO⟩ (R2 = 0.54), which are mathematically related to one another. Since the function between q and θOOO is not one-to-one, the correlation of their average values is weak. Another strong correlation is the one between ⟨θOHO⟩ and ⟨rOO⟩ (R2 = 0.63). This is quite reasonable, as both descriptors are well-known metrics of hydrogen bond strength.38,39 This implies that the correlations between ⟨rOO⟩, ⟨θOHO⟩, and ⟨q⟩ (all descriptors focusing on the local hydrogen bonding environment) stem from the relationship between such descriptors and the underlying hydrogen bond strengths. Oddly, ⟨θHOH⟩ shows correlations only with the descriptors mentioned in this paragraph (R2 ≈ 0.35). Therefore, it will be grouped accordingly despite the weakness of the correlations. It is important to note that rOH is also known to correlate with the hydrogen bond strength, with a longer rOH associated with stronger hydrogen bonds.67,68 This relation is not seen in this analysis, as it is skewed by the inclusion of the free rOH in the averaging of the value of the descriptor.

Weak cross correlations exist between the two groups. Notably, the strongest correlations between subsets of the two groups are between ⟨θOOO⟩ and descriptors of hydrogen bond strength (R2 ≈ 0.57) and weaker correlations are between ⟨rOO⟩ and descriptors of size (R2 ≈ 0.47). This relationship is not surprising because of the existence of a weak correlation between θOOO, q, and hydrogen bond strength. The latter, however, is surprising since the correlations between the two-body descriptors and the cluster size are quite weak. This suggests that additional information besides size and hydrogen bond energy is encoded in these descriptors.

Correlations between adjacency descriptors provide important insight between the stability of hydrogen bond networks and the balance of hydrogen bond acceptors (A) and donors (D). Strong correlations exist between the % of AD, AAD, ADD, and AADD, which are most prevalent in the size regime of the water clusters examined in this study. The correlation between those descriptors is also shown in Fig. 5. The anticorrelation between AD and AAD/ADD is weak (R2 ≈ 0.17) due to the rapid disappearance of AD molecules. The correlation is stronger between AAD/ADD and AADD (R2 ≈ 0.42), a fact that demonstrates the interplay between AAD/ADD and AADD in large, highly connected clusters. The strong correlation between AAD and ADD (R2 = 0.67) arises from the natural constraint that for each acceptor there is a donor. As seen previously, there is a disappearance of AD and the appearance of AADD water molecules upon increasing cluster size. This loose correlation with cluster size manifests as weak (0.3 < R2 < 0.6), while correlations between AD and AADD descriptors are found to strongly correlate with size—⟨rOH⟩, ASPL, Wiener index, % HB, and Rg.

Correlations in the % cycles exist because more of one cycle means less of another cycle. As most cycles consist of four and five molecules, there is a strong correlation bounded by the anticorrelation line. Correlations between three-membered cycles and six-membered cycles and other cycles are extremely weak because their populations are small throughout the dataset. Although correlations with other descriptors are weak, interesting trends emerge with descriptors that predict hydrogen bond strength. There exist weak correlations (R2 ≈ 0.3) between the three-membered cycles, four-membered cycles, and five-membered cycles and ⟨q⟩, ⟨rOO⟩, ⟨θOHO⟩, and ⟨θOOO⟩. Notably, values with increased hydrogen bond strength correspond to more five-membered cycles and less three-membered cycles and four-membered cycles. This is reasonable as three-membered cycles and four-membered cycles are more strained (⟨θOOO⟩ ≤ 90°) and therefore are associated with weaker hydrogen bonds.

The PCA model constructed in this study utilizes descriptors that can effectively translate to the bulk phase. For this reason, size, ASPL, Wiener index, and Rg are excluded from the model. The resulting best model utilized the descriptors ⟨rOO⟩, ⟨θOHO⟩, ⟨θOOO⟩, ⟨q⟩, % HB, % AD, % ADD, % AAD, % AADD, % AAADD, % four-membered cycles, and % six cycles. The results of the PCA analysis are shown in Fig. 6, suggesting that interesting structural patterns emerge warranting further analysis. It was found that the one-body descriptors (⟨rOH⟩ and ⟨θHOH⟩) muddled the model by smearing groups together and were thus not used. Additionally, the average dihedral angles of hydrogen bonds (⟨ϕHOOX⟩) and the % five-membered cycles combined seemingly meaningful separations; thus, ⟨ϕHOOX⟩ and % five-membered cycles were not used. The descriptors % HB, % four-membered cycles, and % six cycles had minor effects on the separation of principal components but served to increase the visible separation. Notably, % HB improved the clustering of isomorphs with fewer PCs. The latter two are included in the model for potential benefit when using multiple PCs for analysis but will not be discussed in detail here.

FIG. 6.

The variance (left), the distribution as a histogram (right, lower triangle and diagonal), and the components (right, upper triangle) of the PCA model used. The colors in the histograms represent the number of clusters in each bin with blue in the tens and red in the hundreds.

FIG. 6.

The variance (left), the distribution as a histogram (right, lower triangle and diagonal), and the components (right, upper triangle) of the PCA model used. The colors in the histograms represent the number of clusters in each bin with blue in the tens and red in the hundreds.

Close modal

An initial analysis examined the component matrices produced by the PCA to describe how the initial descriptors map onto this variance-separated space. The components of each descriptor in the first seven PCs are shown in the upper triangle of Fig. S5. PCs 4–6 are dominated by % AAADD, % six cycles, and % four-membered cycles respectively. It is noticeable, however, that a significant proportion of % AAADD contributes to PC3. PCs 1–3, shown in Fig. 6, have a good distribution of the contribution from each of the remaining descriptors. The strongest descriptors contributing to PC1 are ⟨rOO⟩, ⟨θOHO⟩, ⟨θOOO⟩, and ⟨q⟩. PC3 is similar to PC1 but includes % HB and % AD information. PC2 is unique in that it contains information about adjacencies (% ADD, % AAD and % AADD). These differences, combined with the structural patterns that appear and the variances in Fig. 6, make these PCs good candidates for analyzing water clusters.

It is found that PCA succeeds in separating clusters based on descriptors not included in the model. Size correlates strongly with PC1 (R2 = 0.76), with similar correlations between PC1 and ASPL, Wiener index, and Rg. This is likely because PC1 contains information about ⟨θOOO⟩, which is closely related to size. The combination of ⟨rOO⟩, ⟨θOHO⟩, ⟨θOOO⟩, and ⟨q⟩ in PC1 seems to pull the size extensive behavior of ⟨θOOO⟩ into PC1, while leaving the size-independent character of ⟨θOOO⟩ in lower PCs. The implicit separation of size is especially important as it allows aspects of the information related to size to be drawn up to PC1 and aspects of information unrelated to size to remain pure in the lower PCs. When targeting attributes that must remain size-independent, it will be beneficial to remove PC1 from the model.

PCA provides a useful tool to separate and group similar data points, and this technique was employed to classify and compare certain water clusters. Since PC2 and PC3 provided the best visual separation, they will be used heavily in this section. For quantitative metrics, PC 2–7 will be used. PC1 will not be used because it largely encodes cluster size, and this study seeks to classify patterns in water clusters independent of size. As discussed earlier in this section, PC2 separates mainly on adjacency and PC3 separates mainly on % HB. These descriptors have discrete values, arising from the division of small integers whose magnitudes are <60. Therefore, their distinct separation is natural but further analysis shows it is more subtle. Figure 7 depicts the separation of the aforementioned descriptors across PC2 and PC3 for water clusters of size n = 25. PC3 nicely separates hydrogen bond saturation into distinct strata, with negative PC3 representing highly connected structures and positive PC3 representing less connected structures. Clusters along PC2 contain mixtures of the various % adjacencies encoding the detailed information on the hydrogen bond networks. Much of the information across PC2 is separated by % ADD. Nonetheless, the separation provided by PC2 and PC3 in this model provides a great way to describe similarly connected water clusters using descriptors that have measurable values in bulk water.

FIG. 7.

The separation of descriptors important for PC2 and PC3 are colored for clusters with n = 25 water molecules across PC2 and PC3. The vertical separation is determined by hydrogen bond saturation, whereas the horizontal separation is determined by the combination of adjacencies.

FIG. 7.

The separation of descriptors important for PC2 and PC3 are colored for clusters with n = 25 water molecules across PC2 and PC3. The vertical separation is determined by hydrogen bond saturation, whereas the horizontal separation is determined by the combination of adjacencies.

Close modal

Another application of the separation is to group similar clusters, such as families of isomorphs. The water hexamer has been widely studied for the emergence of a complex, non-quasiplanar landscape of low-energy isomorphs.69–71 The PC model developed here appropriately separates the isomorph family (Fig. 8 left). Notably, the common families of the water hexamer, delineated by blue borders, appear in distinct groupings. Prism- and cage-like clusters appear in quadrant III, whereas ring structures appear in quadrant I. The other isomorphs, shown with no outlines, group around similarly structured clusters. For example, the traditional “cage” is grouped with a similar but distinct cage-like structure. More clearly, the “book” structure is grouped among other fused ring structures. Similar patterns emerge for larger water cluster sizes. Isomorphs remain separated across PC2–3 (Fig. S6). Figure 8 (right) shows the separation for the water 16-mer, where the lowest energy isomer in each grouping is shown. Although each grouping is expected to be diverse for the medium-sized clusters, the structures shown in Fig. 8 are good representatives of the structures that separate via PCA.

FIG. 8.

Representative structures for the water hexamer (left) and the water 16-mer (right) are shown as examples of how PC 2 and PC 3 can separate clusters based on structural differences. Common hexamer isomorphs are identified by blue borders in the left panel. An example of a ”defect” for n = 16 (see text) is indicated by a green border in the right panel. PC3 consistently separates clusters based on their degree of connectivity, but PC2 changes separations based on structures in the size group.

FIG. 8.

Representative structures for the water hexamer (left) and the water 16-mer (right) are shown as examples of how PC 2 and PC 3 can separate clusters based on structural differences. Common hexamer isomorphs are identified by blue borders in the left panel. An example of a ”defect” for n = 16 (see text) is indicated by a green border in the right panel. PC3 consistently separates clusters based on their degree of connectivity, but PC2 changes separations based on structures in the size group.

Close modal

The power of this technique rests in its ability to identify water clusters of different sizes with similar structural motifs. As discussed earlier, water-6’s “prism” (trigonal prism) structure appears in the far negative quadrant across PC2 and PC3. Similar structures, such as water-8’s cube and water-10’s pentagonal prism, appear in a similar area, although slightly higher in PC3 due to differences in ⟨q⟩ and ⟨θOHO⟩. Fused prisms appear similarly in the negative region of PC3 but are spread out across PC2, with positive PC2 having more shared faces. Defects to fused prisms appear in proximal groupings to their “parent” cluster. An example of a “defect” (indicated by a green border in Fig. 8) involves breaking the arrangement of one end and the shifting of that cluster “upward” in PC3 (because of fewer hydrogen bonds) and “left” in PC2 (because of fewer ADD/AAD molecules).

To exemplify the PCA classification, the clusters closest to the lowest energy 25-mer in the data set were identified and shown in Fig. 9. Clusters were found by proximity (norm 2) across PC 2–7. The three closest structures interestingly belong to cluster sizes 22, 19, and 16. Each structure shows similar structural motifs, a nucleating pentagonal prism-like structure (shown on the right side of each image), and the beginnings of an ordered cage surrounding a centered water molecule (shown on the left). As these are the most structurally similar clusters, they would make great candidates as starting structures when needing to study the 25-mer. This analysis can be done for any water cluster, including structures such as clathrates, which were not included in the subset of the full database that was used for the analysis (see the discussion in the Introduction).

FIG. 9.

The putative minima of the 25-mer water cluster is depicted in the top left. The three closest water clusters from the PCA are also shown, with their distances along the PCA indicated in the figure. Each serves as a potentially good model for studying the water 25-mer.

FIG. 9.

The putative minima of the 25-mer water cluster is depicted in the top left. The three closest water clusters from the PCA are also shown, with their distances along the PCA indicated in the figure. Each serves as a potentially good model for studying the water 25-mer.

Close modal

It is finally of interest to examine which descriptors are predictive of the energetic stability [binding energy per water molecule (BEn)] of the cluster. As discussed earlier, BEn naturally evolves with the cluster size. More specifically, there exists a strong correlation between the binding energy per water molecule and the cluster size (R2 = 0.99), as seen in the left panel of Fig. 10. Note that the limiting value of BEn for n for the fit shown in the right panel of Fig. 10 is −12.7 kcal/mol when the exponent of n is considered a variable (with a value of 0.82). Incidentally, this limit is near the lattice energy of ice (14.07 kcal/mol per molecule).72 The cluster sizes (up to n = 25) are too small to be used in conjunction with a fit of the volume element n−1/3, which produces a less accurate regression for this cluster regime than when the exponent is a variable.

FIG. 10.

Left: The strongest single predictor for the binding energy per molecule (BEn) is cluster size. It follows an inverse n function, where n is the number of water molecules, and an asymptotic value of −12.7 kcal/mol for n. Right: ⟨rOO⟩, ⟨θOHO⟩, and ⟨q⟩ are predictive of the binding energy per hydrogen bond (BEHB) in the cluster. Both panels are 2D histograms, where the color indicates the density of points.

FIG. 10.

Left: The strongest single predictor for the binding energy per molecule (BEn) is cluster size. It follows an inverse n function, where n is the number of water molecules, and an asymptotic value of −12.7 kcal/mol for n. Right: ⟨rOO⟩, ⟨θOHO⟩, and ⟨q⟩ are predictive of the binding energy per hydrogen bond (BEHB) in the cluster. Both panels are 2D histograms, where the color indicates the density of points.

Close modal

Descriptors that correlate with size (rOH, θOOO, ASPL, Wiener index, % HB, and Rg) also show correlations with the binding energy per water molecule (R2 ≈ 0.95 for θOOO, ASPL, Wiener index, and Rg, R2 ≈ 0.7 for rOH and % HB). However, none of these descriptors are predictive of relative stability between isomers within a particular water cluster size. The descriptor that most improved the accuracy of the prediction (albeit by only 5%) was % HB. Therefore, it is inferred that the correlation with the aforementioned descriptors comes mostly from their relationship with size and size’s relationship with the cluster binding energy.

Another way to analyze energetic stability is to scale the cluster binding energy by the number of hydrogen bonds. This metric of energy does not scale with cluster size but instead scales strongly with structural many-body descriptors, such as rOO, θOHO, and q, which are known to greatly influence the energy of the hydrogen bond.38,39,73 It is therefore expected that they are predictive of the average energy of a hydrogen bond in a cluster. Each has relatively strong correlations (R2 between 0.70 and 0.86) with the binding energy per hydrogen bond as can be seen from the right panel of Fig. 10. Strikingly, these three descriptors combine linearly for an improved fit (R2 = 0.95), suggesting each descriptor encodes distinct properties of hydrogen bond strengths.

Alternatively, more descriptors can also be combined to predict the cluster binding energy. There is strong predictive power for the energy per molecule and the energy per hydrogen bond from the combination of all descriptors with an RMSE of 0.036 and 0.022 kcal/mol per molecule, respectively (Table I and Fig. 11). The regression weights are displayed in Table S2; the largest weights are associated with size, ASPL, and Wiener index, suggesting that these descriptors together are most predictive of the binding energy per water molecule. Since these descriptors are not transferable to bulk water, the linear regression was refitted using only the descriptors that can be computed under periodic boundary conditions (Table I), namely ⟨rOH⟩, ⟨rOO⟩, ⟨θHOH⟩, ⟨θOHO⟩, ⟨θOOO⟩, ⟨q⟩, % AD, % AAD, % ADD, % AADD, % AAAD, % AAADD, % three-membered cycles, % four-membered cycles, % five-membered cycles, % six cycles. As can be seen from Table I, the fit is less predictive and begins to incorporate cycle counting as predictive features, as it can be seen from the increase in the fitting weights of % four-membered cycles and % five-membered cycles from <0.05 to >0.10 in Table S2.

TABLE I.

The errors (RMSE, kcal/mol) of various regression models to predict the binding energy per molecule (BEn) and per hydrogen bond (BEHB) for various sets of descriptors and regression styles. “All” indicates using all 22 descriptors, whereas “Bulk” indicates a fit using the 18 descriptors relevant to the bulk phase (see text). Also reported are the number of fitting parameters (k) for each of the regression models.

AllBulkPC 1–3PC 1–7
QuantityRegression modelRMSEkRMSEkRMSEkRMSEk
BEn Linear 0.036 22 0.050 18 0.075 0.062 
Quadratic (uncoupled) 0.032 44 0.047 36 0.073 0.061 14 
Quadratic (coupled) 0.028 275 0.041 189 0.072 0.056 35 
BEHB Linear 0.022 22 0.030 18 0.043 0.034 
Quadratic (uncoupled) 0.020 44 0.028 36 0.043 0.034 14 
Quadratic (coupled) 0.018 275 0.025 189 0.041 0.032 35 
AllBulkPC 1–3PC 1–7
QuantityRegression modelRMSEkRMSEkRMSEkRMSEk
BEn Linear 0.036 22 0.050 18 0.075 0.062 
Quadratic (uncoupled) 0.032 44 0.047 36 0.073 0.061 14 
Quadratic (coupled) 0.028 275 0.041 189 0.072 0.056 35 
BEHB Linear 0.022 22 0.030 18 0.043 0.034 
Quadratic (uncoupled) 0.020 44 0.028 36 0.043 0.034 14 
Quadratic (coupled) 0.018 275 0.025 189 0.041 0.032 35 
FIG. 11.

The errors of linear regression for each model as 2D histograms. Colors indicate the density of points: blue in the tens and red in the hundreds. Left panels trace the training error of the models for all structures n ≤ 25 and those 5 kcal/mol from the putative minima. Right panels display an extrapolation error for structures 26 ≤ n ≤ 30, still lying within 5 kcal/mol from the putative minima.

FIG. 11.

The errors of linear regression for each model as 2D histograms. Colors indicate the density of points: blue in the tens and red in the hundreds. Left panels trace the training error of the models for all structures n ≤ 25 and those 5 kcal/mol from the putative minima. Right panels display an extrapolation error for structures 26 ≤ n ≤ 30, still lying within 5 kcal/mol from the putative minima.

Close modal

The developed regression models were applied to structures lying outside the dataset used for training to examine predictive behavior. A database of water clusters larger than n = 25 was constructed using the same methods as outlined in the work of Rakshit et al.35 Although the configuration space of these larger water clusters was not sampled at the same level of detail as the existing database for n = 3–25, it nonetheless can provide useful information about any divergent behavior in the models. It was found that models based solely on bulk-like descriptors were successful in predicting the energy to a comparable accuracy for clusters inside and outside the training set. The RMSE values decrease, viz., from 0.050 to 0.046 kcal/mol per molecule and from 0.030 to 0.028 kcal/mol per hydrogen bond. On the other hand, models that incorporate the size, ASPL, Wiener index, and Rg were found to overestimate the energies for larger clusters, likely due to the nonlinear correlation between energy and size (Fig. 10 left). Complete extrapolation should never be assumed, as structures with descriptors not seen by the model may diverge.

The PCs of the PCA can also serve as indicators to predict stability. When using PCA, fewer parameters are required as the data have been reduced in dimensionality. The RMSEs for predicting the binding energy per molecule and per hydrogen bond are listed in Table I. Incorporating the first seven PCs produces relatively good fits. Although the fits are worse than the regression with bulk descriptors by 13%–24%, the number of parameters required is 61% less. This current model is an excellent starting point in using reduced dimensions to predict energetic stability, but incorporating additional descriptors and higher energy local minima into the training set could improve the ability of the model to predict the binding energy.

The strong predictive power of the simplistic linear regressions is promising in that these cluster-averaged descriptors can be used to predict the energy of water cluster minima. Water cluster properties that strongly affect stability, such as proton ordering in the polyhedral cage water clusters (H2O)n, n = 20, 24, 28,40,74,75 are not explicitly incorporated in the model. Proton ordering might be implicitly described by other descriptors such as rOO, resulting in predictive fits. With the incorporation of descriptors that explicitly account for proton ordering, the model is likely to improve. This assumption would be consistent with better-predicting models that use geometries explicitly as descriptors.76,77 Furthermore, more predictive models can be developed by using nonlinear machine learning techniques. Polynomial regressions improve the fits by between 16% and 18% (Table I). Naturally, additional improvement is anticipated with more sophisticated models, such as neural networks, that are not the focus of this work.17,78,79

The database of water cluster minima generated in the work of Rakshit et al.35 has provided a useful resource to analyze structural and connectivity descriptors in water cluster minima (n = 6–25). It was found that some descriptors, such as rOH and % HB, correlated strongly with cluster size, while others, such as rOO, θOHO, and q, correlate with one another due to their relationships with hydrogen bond strength. Cluster size was found to be the strongest predictor of stability (binding energy per molecule), while when descriptors are grouped together, it results in providing sufficient information to predict cluster stability. Energy prediction appears better when scaled for each hydrogen bond, as most descriptors directly affect the strength of individual hydrogen bonds.

The principal component analysis demonstrated the use of descriptor decomposition in classifying and predicting properties of water clusters. The PCA predicts the energy of each water cluster with ∼60% fewer free parameters than linear regressions with a 13% and 24% loss in accuracy for the binding energy per molecule and the binding energy per hydrogen bond, respectively. The PCA also shows promise in identifying similarly structured clusters of different sizes, allowing for the use of smaller clusters to model computationally expensive large clusters. The PCA model developed in this study is publicly available and can be used to further study the structural and connectivity patterns in either larger clusters or different aqueous systems. Improvements to the model, such as additional descriptors, additional isomers, and explicit/robust tests on bulk systems, will be the focus of future studies. Although the models described in this work were not trained on large clusters or liquid water/ice configurations, they should be transferable to structures with similar descriptors (containing four-membered cycles, five-membered cycles, six-membered cycles, and mostly ADD, AAD, AADD adjacencies). Clusters and bulk structures that have descriptors well outside the ranges considered in this work will require additional training information to create a reliable model. Although some of the conclusions are expected, especially those regarding changes in cluster size and correlations with hydrogen bond strength (albeit not previously reported since the water cluster database has been made available only recently), many others are novel. These include the drastic transitions in the % of five-membered cycles and four-membered cycles at n = 17 and the correlation between ASPL and Rg. An additional novelty in this work is the synthesis of these structural and connectivity descriptors in describing these clusters, the structure classification of the PCA, and the energy regression from cluster-averaged descriptors. These findings set the stage for future investigations, as well as provides evidence that these traits persist throughout a wide range of water clusters.

Since the described approach is general, it can be easily extended to other training sets and other hydrogen-bonded systems to further assist in their structure prediction and structural motif analysis and classification, a field that has seen great improvements with the use of simplistic and descriptive machine learning techniques.80–83 We plan to use the descriptors and developed protocol to analyze the structural motifs for liquid water at different temperatures in forthcoming studies.

See the supplementary material for algorithms for computing cycles efficiently, figures correlating descriptors and describing the PCA, and a table describing the fitting parameters for linear recession models.

This work was supported by U.S. Department of Energy (DOE), Office of Science, Office of Basic Energy Sciences, Division of Chemical Sciences, Geosciences, and Biosciences, Condensed Phase and Interfacial Molecular Science (CPIMS) program, Molecular Theory and Modeling Grant No. FWP 16249 at Pacific Northwest National Laboratory (PNNL), a multi-program national laboratory operated for DOE by Battelle. This research also used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

The authors have no conflicts to disclose.

Garrett D. Santis: Data curation (lead); Formal analysis (lead); Investigation (equal); Methodology (equal); Software (lead); Validation (lead); Writing – original draft (lead); Writing – review & editing (equal). Kristina M. Herman: Data curation (supporting); Formal analysis (supporting); Software (supporting); Validation (supporting); Visualization (supporting); Writing – original draft (supporting); Writing – review & editing (supporting). Joseph P. Heindel: Formal analysis (supporting); Investigation (supporting); Writing – original draft (supporting). Sotiris S. Xantheas: Conceptualization (lead); Funding acquisition (lead); Methodology (supporting); Project administration (lead); Resources (lead); Supervision (lead); Writing – original draft (supporting); Writing – review & editing (supporting).

The data that support the findings of this study are openly available in https://github.com/gds001/Descriptors-for-Water-Aggregation.git.

1.
C.
Steinbach
,
P.
Andersson
,
M.
Melzer
,
J. K.
Kazimirski
,
U.
Buck
, and
V.
Buch
, “
Detection of the book isomer from the OH-stretch spectroscopy of size selected water hexamers
,”
Phys. Chem. Chem. Phys.
6
,
3320
3324
(
2004
).
2.
J. E.
Combariza
,
N. R.
Kestner
, and
J.
Jortner
, “
Energy-structure relationships for microscopic solvation of anions in water clusters
,”
J. Chem. Phys.
100
,
2851
2864
(
1994
).
3.
S. S.
Xantheas
, “
Quantitative description of hydrogen bonding in Chloride−Water clusters
,”
J. Phys. Chem.
100
,
9703
9713
(
1996
).
4.
E.
Wasserman
,
J. R.
Rustad
, and
S. S.
Xantheas
, “
Interaction potential of Al3+ in water from first principles calculations
,”
J. Chem. Phys.
106
,
9769
9780
(
1997
).
5.
J. P.
Heindel
,
H.
Hao
,
R. A.
LaCour
, and
T.
Head-Gordon
, “
Spontaneous formation of hydrogen peroxide in water microdroplets
,”
J. Phys. Chem. Lett.
13
,
10035
10041
(
2022
).
6.
C. T.
Wolke
,
J. A.
Fournier
,
L. C.
Dzugan
,
M. R.
Fagiani
,
T. T.
Odbadrakh
,
H.
Knorke
,
K. D.
Jordan
,
A. B.
McCoy
,
K. R.
Asmis
, and
M. A.
Johnson
, “
Spectroscopic snapshots of the proton-transfer mechanism in water
,”
Science
354
,
1131
1135
(
2016
).
7.
M. A.
Boyer
,
O.
Marsalek
,
J. P.
Heindel
,
T. E.
Markland
,
A. B.
McCoy
, and
S. S.
Xantheas
, “
Beyond Badger’s rule: The origins and generality of the structure–spectra relationship of aqueous hydrogen bonds
,”
J. Phys. Chem. Lett.
10
,
918
924
(
2019
).
8.
J.
Liu
,
J.
Yang
,
X. C.
Zeng
,
S. S.
Xantheas
,
K.
Yagi
, and
X.
He
, “
Towards complete assignment of the infrared spectrum of the protonated water cluster H+(H2O)21
,”
Nat. Commun.
12
,
6141
(
2021
).
9.
J. M.
Finney
,
T. H.
Choi
,
R. M.
Huchmala
,
J. P.
Heindel
,
S. S.
Xantheas
,
K. D.
Jordan
, and
A. B.
McCoy
, “
Isotope effects in the Zundel-Eigen isomerization of H+(H2O)6
,”
J. Phys. Chem. Lett.
14
,
4666
4672
(
2023
).
10.
S. S.
Xantheas
, “
Ab initio studies of cyclic water clusters (H2O)n, n = 1–6. II. Analysis of many-body interactions
,”
J. Chem. Phys.
100
,
7523
7534
(
1994
).
11.
M. P.
Hodges
,
A. J.
Stone
, and
S. S.
Xantheas
, “
Contribution of many-body terms to the energy for small water clusters: A comparison of ab initio calculations and accurate model potentials
,”
J. Phys. Chem. A
101
,
9163
9168
(
1997
).
12.
A.
Milet
,
R.
Moszynski
,
P. E. S.
Wormer
, and
A.
van der Avoird
, “
Hydrogen bonding in water clusters: Pair and many-body interactions from symmetry-adapted perturbation theory
,”
J. Phys. Chem. A
103
,
6811
6819
(
1999
).
13.
K. M.
Herman
and
S. S.
Xantheas
, “
An extensive assessment of the performance of pairwise and many-body interaction potentials in reproducing ab initio benchmark binding energies for water clusters n = 2–25
,”
Phys. Chem. Chem. Phys.
25
,
7120
7143
(
2023
).
14.
S.
Yoo
and
S. S.
Xantheas
, “
Structures, energetics, and spectroscopic fingerprints of water clusters n = 2–24
,” in
Handbook of Computational Chemistry
, edited by
J.
Leszczynski
,
A.
Kaczmarek-Kedziera
,
T.
Puzyn
,
M. G.
Papadopoulos
,
H.
Reis
and
M. K.
Shukla
(
Springer International Publishing
,
Cham
,
2017
), pp.
1139
1173
.
15.
A.
Lagutschenkov
,
G. S.
Fanourgakis
,
G.
Niedner-Schatteburg
, and
S. S.
Xantheas
, “
The spectroscopic signature of the “all-surface” to “internally solvated” structural transition in water clusters in the n = 17–21 size regime
,”
J. Chem. Phys.
122
,
194310
(
2005
).
16.
C. C.
Pradzynski
,
C. W.
Dierking
,
F.
Zurheide
,
R. M.
Forck
,
U.
Buck
,
T.
Zeuch
, and
S. S.
Xantheas
, “
Infrared detection of (H2O)20 isomers of exceptional stability: A drop-like and a face-sharing pentagonal prism cluster
,”
Phys. Chem. Chem. Phys.
16
,
26691
26696
(
2014
).
17.
J. A.
Bilbrey
,
J. P.
Heindel
,
M.
Schram
,
P.
Bandyopadhyay
,
S. S.
Xantheas
, and
S.
Choudhury
, “
A look inside the black box: Using graph-theoretical descriptors to interpret a Continuous-Filter Convolutional Neural Network (CF-CNN) trained on the global and local minimum energy structures of neutral water clusters
,”
J. Chem. Phys.
153
,
024302
(
2020
).
18.
H.
Doi
,
K. Z.
Takahashi
, and
T.
Aoyagi
, “
Mining of effective local order parameters to classify ice polymorphs
,”
J. Phys. Chem. A
125
,
9518
9526
(
2021
).
19.
P.
Geiger
and
C.
Dellago
, “
Neural networks for local structure detection in polymorphic systems
,”
J. Chem. Phys.
139
,
164105
(
2013
).
20.
M.
Fulford
,
M.
Salvalaglio
, and
C.
Molteni
, “
DeepIce: A deep neural network approach to identify ice and water molecules
,”
J. Chem. Inf. Model.
59
,
2141
2149
(
2019
).
21.
R.
Shi
and
H.
Tanaka
, “
Microscopic structural descriptor of liquid water
,”
J. Chem. Phys.
148
,
124503
(
2018
).
22.
C.
Faccio
,
M.
Benzi
,
L.
Zanetti-Polzi
, and
I.
Daidone
, “
Low- and high-density forms of liquid water revealed by a new medium-range order descriptor
,”
J. Mol. Liq.
355
,
118922
(
2022
).
23.
S.
Kazachenko
and
A. J.
Thakkar
, “
Are there any magic numbers for water nanodroplets, (H2O)n, in the range 36 ≤ n ≤ 50?
,”
Mol. Phys.
108
,
2187
2193
(
2010
).
24.
P.
Chau
and
A. J.
Hardwick
, “
A new order parameter for tetrahedral configurations
,”
Mol. Phys.
93
,
511
518
(
1998
).
25.
J.
Erringon
and
P.
Debenedetti
, “
Relationship between structural order and the anomalies of liquid water
,”
Nature
406
,
318
321
(
2001
).
26.
P.
Steinhardt
,
D.
Nelson
, and
M.
Ronchetti
, “
Bond-orientational order in liquids and glasses
,”
Phys. Rev. B
28
,
784
805
(
1983
).
27.
P.-R.
ten Wolde
,
M. J.
Ruiz-Montero
, and
D.
Frenkel
, “
Simulation of homogeneous crystal nucleation close to coexistence
,”
Faraday Discuss.
104
,
93
110
(
1996
).
28.
S.
Auer
and
D.
Frenkel
, “
Numerical simulation of crystal nucleation in colloids
,”
Adv. Polym. Sci.
173
,
149
(
2005
).
29.
W.
Lechner
and
C.
Dellago
, “
Accurate determination of crystal structures based on averaged local bond order parameters
,”
J. Chem. Phys.
129
,
114707
(
2008
).
30.
E.
Shiratani
and
M.
Sasai
, “
Molecular scale precursor of the liquid–liquid phase transition of water
,”
J. Chem. Phys.
108
,
3264
3276
(
1998
).
31.
Z.
Yan
,
S. V.
Buldyrev
,
P.
Kumar
,
N.
Giovambattista
,
P. G.
Debenedetti
, and
H. E.
Stanley
, “
Structure of the first- and second-neighbor shells of simulated water: Quantitative relation to translational and orientational order
,”
Phys. Rev. E
76
,
051201
(
2007
).
32.
J.
Russo
and
H.
Tanaka
, “
Understanding water’s anomalies with locally favoured structures
,”
Nat. Commun.
5
,
3556
(
2014
).
33.
B.
Monserrat
,
J. G.
Brandenburg
,
E. A.
Engel
, and
B.
Cheng
, “
Liquid water contains the building blocks of diverse ice phases
,”
Nat. Commun.
11
,
5757
(
2020
).
34.
E. D.
Donkor
,
A.
Laio
, and
A.
Hassanali
, “
Do machine-learning atomic descriptors and order parameters tell the same story? The case of liquid water
,”
J. Chem. Theory Comput.
19
,
4596
4605
(
2023
).
35.
A.
Rakshit
,
P.
Bandyopadhyay
,
J. P.
Heindel
, and
S. S.
Xantheas
, “
Atlas of putative minima and low-lying energy networks of water clusters n = 3–25
,”
J. Chem. Phys.
151
,
214307
(
2019
).
36.
G. S.
Fanourgakis
and
S. S.
Xantheas
, “
The flexible, polarizable, Thole-type interaction potential for water (TTM2-F) revisited
,”
J. Phys. Chem. A
110
,
4100
4106
(
2006
).
37.
R.
Kumar
,
J. R.
Schmidt
, and
J. L.
Skinner
, “
Hydrogen bonding definitions and dynamics in liquid water
,”
J. Chem. Phys.
126
,
204107
(
2007
).
38.
K.
Wendler
,
J.
Thar
,
S.
Zahn
, and
B.
Kirchner
, “
Estimating the hydrogen bond energy
,”
J. Phys. Chem. A
114
,
9529
9536
(
2010
).
39.
T.
Steiner
, “
The hydrogen bond in the solid state
,”
Angew. Chem., Int. Ed.
41
,
48
76
(
2002
).
40.
M. V.
Kirov
,
G. S.
Fanourgakis
, and
S. S.
Xantheas
, “
Identifying the most stable networks in polyhedral water clusters
,”
Chem. Phys. Lett.
461
,
180
188
(
2008
).
41.
I.
Morrison
,
J.-C.
Li
,
S.
Jenkins
,
S. S.
Xantheas
, and
M. C.
Payne
, “
Ab-initio total energy studies of the static and dynamical properties of ice Ih
,”
J. Phys. Chem. B
101
,
6146
6150
(
1997
).
42.
E.
Duboué-Dijon
and
D.
Laage
, “
Characterization of the local structure in liquid water by various order parameters
,”
J. Phys. Chem. B
119
,
8406
8418
(
2015
).
43.
N.
Yang
,
C. H.
Duong
,
P. J.
Kelleher
,
A. B.
McCoy
, and
M. A.
Johnson
, “
Deconstructing water’s diffuse OH stretching vibrational spectrum with cold clusters
,”
Science
364
,
275
278
(
2019
).
44.
H.
Wiener
, “
Structural determination of paraffin boiling points
,”
J. Am. Chem. Soc.
69
,
17
20
(
1947
).
45.
D. H.
Rouvray
,
Topology in Chemistry
,
Chap. The Rich Legacy of Half a Century of the Wiener Index
(
Woodhead Publishing
,
Sawston, UK
,
2002
), pp.
16
37
.
46.
A.
Rahman
and
F. H.
Stillinger
, “
Hydrogen-bond patterns in liquid water
,”
J. Am. Chem. Soc.
95
,
7943
7948
(
1973
).
47.
S. V.
King
, “
Ring configurations in a random network model of vitreous silica
,”
Nature
213
,
1112
1113
(
1967
).
48.
P.
Virtanen
,
R.
Gommers
,
T. E.
Oliphant
,
M.
Haberland
,
T.
Reddy
,
D.
Cournapeau
,
E.
Burovski
,
P.
Peterson
,
W.
Weckesser
,
J.
Bright
,
S. J.
van der Walt
,
M.
Brett
,
J.
Wilson
,
K. J.
Millman
,
N.
Mayorov
,
A. R. J.
Nelson
,
E.
Jones
,
R.
Kern
,
E.
Larson
,
C. J.
Carey
,
İ.
Polat
,
Y.
Feng
,
E. W.
Moore
,
J.
VanderPlas
,
D.
Laxalde
,
J.
Perktold
,
R.
Cimrman
,
I.
Henriksen
,
E. A.
Quintero
,
C. R.
Harris
,
A. M.
Archibald
,
A. H.
Ribeiro
,
F.
Pedregosa
,
P.
van Mulbregt
, and
SciPy 1.0 Contributors
, “
SciPy 1.0: Fundamental algorithms for scientific computing in Python
,”
Nat. Methods
17
,
261
272
(
2020
).
49.
F.
Pedregosa
,
G.
Varoquaux
,
A.
Gramfort
,
V.
Michel
,
B.
Thirion
,
O.
Grisel
,
M.
Blondel
,
P.
Prettenhofer
,
R.
Weiss
,
V.
Dubourg
,
J.
Vanderplas
,
A.
Passos
,
D.
Cournapeau
,
M.
Brucher
,
M.
Perrot
, and
E.
Duchesnay
, “
Scikit-learn: Machine learning in Python
,”
J. Mach. Learn. Res.
12
,
2825
2830
(
2011
); see https://www.jmlr.org/papers/v12/pedregosa11a.html.
50.
A. R.
Katritzky
,
M.
Kuanar
,
S.
Slavov
,
C. D.
Hall
,
M.
Karelson
,
I.
Kahn
, and
D. A.
Dobchev
, “
Quantitative correlation of physical and chemical properties with chemical structure: Utility for prediction
,”
Chem. Rev.
110
,
5714
5789
(
2010
).
51.
M.
Karthikeyan
,
R. C.
Glen
, and
A.
Bender
, “
General melting point prediction based on a diverse compound data set and artificial neural networks
,”
J. Chem. Inf. Model.
45
(
3
),
581
590
(
2005
).
52.
N.
Gilbert
,
R. E.
Mewis
, and
O. B.
Sutcliffe
, “
Classification of fentanyl analogues through principal component analysis (PCA) and hierarchical clustering of GC–MS data
,”
Forensic Chem.
21
,
100287
(
2020
).
53.
S.
Chanana
,
C. S.
Thomas
,
F.
Zhang
,
S. R.
Rajski
, and
T. S.
Bugni
, “
hcapca: Automated hierarchical clustering and principal component analysis of large metabolomic datasets in R
,”
Metabolites
10
,
297
(
2020
).
54.
K.
Blaziak
,
D.
Tzeli
,
S. S.
Xantheas
, and
E.
Uggerud
, “
The activation of carbon dioxide by first row transition metals (Sc–Zn)
,”
Phys. Chem. Chem. Phys.
20
,
25495
25505
(
2018
).
55.
W. S.
Benedict
,
N.
Gailar
, and
E. K.
Plyler
, “
Rotation-vibration spectra of deuterated water vapor
,”
J. Chem. Phys.
24
,
1139
(
1956
).
56.
I. C.
Baianu
,
N.
Boden
,
D.
Lightowlers
, and
M.
Mortimer
, “
A new approach to the structure of concentrated aqueous electrolyte solutions using pulsed NMR methods
,”
Chem. Phys. Lett.
54
,
169
(
1978
).
57.
M. A.
Floriano
,
D. D.
Klug
,
E.
Whalley
,
E. C.
Svensson
,
V. F.
Sears
, and
E. D.
Hallman
, “
Direct determination of the intramolecular O–D distance in ice Ih and Ic by neutron diffraction
,”
Nature
329
,
821
(
1987
).
58.
Y. A.
Mantz
,
F. M.
Geiger
,
L. T.
Molina
,
M. J.
Molina
, and
B. L.
Trout
, “
First-principles molecular-dynamics study of surface disordering of the (0001) face of hexagonal ice
,”
J. Chem. Phys.
113
,
10733
(
2000
).
59.
J.
Jeon
,
A. E.
Lefohn
, and
G. A.
Voth
, “
An improved Polarflex water model
,”
J. Chem. Phys.
118
,
7504
(
2003
).
60.
G. S.
Fanourgakis
and
S. S.
Xantheas
, “
The bend angle of water in ice Ih and liquid water: The significance of implementing the nonlinear monomer dipole moment surface in classical interaction potentials
,”
J. Chem. Phys.
124
,
174504
(
2006
).
61.
S.
Imoto
,
S. S.
Xantheas
, and
S.
Saito
, “
Molecular origin of the difference in the hoh bend of the IR spectra between liquid water and ice
,”
J. Chem. Phys.
138
,
054506
(
2013
).
62.
P.
Kumar
,
S. V.
Buldyrev
, and
H. E.
Stanley
, “
A tetrahedral entropy for water
,”
Proc. Natl. Acad. Sci. U. S. A.
106
,
22130
22134
(
2009
).
63.
D.
Bandyopadhyay
,
S.
Mohan
,
S. K.
Ghosh
, and
N.
Choudhury
, “
Correlation of structural order, anomalous density, and hydrogen bonding network of liquid water
,”
J. Phys. Chem. B
117
,
8831
8843
(
2013
).
64.
T. E.
Gartner
III
,
K. M.
Hunter
,
E.
Lambros
,
A.
Caruso
,
M.
Riera
,
G. R.
Medders
,
A. Z.
Panagiotopoulos
,
P. G.
Debenedetti
, and
F.
Paesani
, “
Anomalies and local structure of liquid water from boiling to the supercooled regime as predicted by the many-body MB-pol model
,”
J. Phys. Chem. Lett.
13
,
3652
3658
(
2022
).
65.
C. G.
Salzmann
,
P. G.
Radaelli
,
B.
Slater
, and
J. L.
Finney
, “
The polymorphism of ice: Five unresolved questions
,”
Phys. Chem. Chem. Phys.
13
,
18468
18480
(
2011
).
66.
F. N.
Keutsch
,
J. D.
Cruzan
, and
R. J.
Saykally
, “
The water trimer
,”
Chem. Rev.
103
,
2533
2578
(
2003
).
67.
R. M.
Badger
, “
A relation between internuclear distances and bond force constants
,”
J. Chem. Phys.
2
,
128
131
(
2004
).
68.
R. M.
Badger
and
S. H.
Bauer
, “
Spectroscopic studies of the hydrogen bond. II. The shift of the O–H vibrational frequency in the formation of the hydrogen bond
,”
J. Chem. Phys.
5
,
839
851
(
2004
).
69.
Y.
Wang
and
J. M.
Bowman
, “
Ab initio potential and dipole moment surfaces for water. II. Local-monomer calculations of the infrared spectra of water clusters
,”
J. Chem. Phys.
134
,
154510
(
2011
).
70.
K.
Liu
,
M. G.
Brown
,
C.
Carter
,
R. J.
Saykally
,
J. K.
Gregory
, and
D. C.
Clary
, “
Characterization of a cage form of the water hexamer
,”
Nature
381
,
501
503
(
1996
).
71.
Y.
Wang
,
V.
Babin
,
J. M.
Bowman
, and
F.
Paesani
, “
The water hexamer: Cage, prism, or both. Full dimensional quantum simulations say both
,”
J. Am. Chem. Soc.
134
,
11116
11119
(
2012
).
72.
E.
Whalley
, “
Energies of the phases of ice at zero temperature and pressure
,”
J. Chem. Phys.
81
,
4087
4092
(
1984
).
73.
M.
Huš
and
T.
Urbic
, “
Strength of hydrogen bonds of water depends on local environment
,”
J. Chem. Phys.
136
,
144305
(
2012
).
74.
S.
Yoo
,
M. V.
Kirov
, and
S. S.
Xantheas
, “
Low-energy networks of the T-cage (H2O)24 cluster and their use in constructing periodic unit cells of the structure I (sI) hydrate lattice
,”
J. Am. Chem. Soc.
131
,
7564
7566
(
2009
).
75.
J. P.
Heindel
,
M. V.
Kirov
, and
S. S.
Xantheas
, “
Hydrogen bond arrangements in (H2O)20,24,28 clathrate hydrate cages: Optimization and many-body analysis
,”
J. Chem. Phys.
157
,
094301
(
2022
).
76.
V.
Babin
,
G. R.
Medders
, and
F.
Paesani
, “
Development of a “first principles” water potential with flexible monomers. II: Trimer potential energy surface, third virial coefficient, and small clusters
,”
J. Chem. Theory Comput.
10
,
1599
1607
(
2014
).
77.
Q.
Yu
,
C.
Qu
,
P. L.
Houston
,
R.
Conte
,
A.
Nandi
, and
J. M.
Bowman
, “
q-AQUA: A many-body CCSD(T) water potential, including four-body interactions, demonstrates the quantum nature of water from clusters to the liquid phase
,”
J. Phys. Chem. Lett.
13
,
5068
5074
(
2022
).
78.
S.
Chmiela
,
A.
Tkatchenko
,
H. E.
Sauceda
,
I.
Poltavsky
,
K. T.
Schütt
, and
K.-R.
Müller
, “
Machine learning of accurate energy-conserving molecular force fields
,”
Sci. Adv.
3
,
e1603015
(
2017
).
79.
K. T.
Schütt
,
H. E.
Sauceda
,
P.-J.
Kindermans
,
A.
Tkatchenko
, and
K.-R.
Müller
, “
SchNet – A deep learning architecture for molecules and materials
,”
J. Chem. Phys.
148
,
241722
(
2018
).
80.
C.
Schran
,
F. L.
Thiemann
,
P.
Rowe
,
E. A.
Müller
,
O.
Marsalek
, and
A.
Michaelides
, “
Machine learning potentials for complex aqueous systems made simple
,”
Proc. Natl. Acad. Sci. U. S. A.
118
,
e2110077118
(
2021
).
81.
S.
Yao
,
R.
Van
,
X.
Pan
,
J. H.
Park
,
Y.
Mao
,
J.
Pu
,
Y.
Mei
, and
Y.
Shao
, “
Machine learning based implicit solvent model for aqueous-solution alanine dipeptide molecular dynamics simulations
,”
RSC Adv.
13
,
4565
4577
(
2023
).
82.
G.
Panapitiya
,
M.
Girard
,
A.
Hollas
,
J.
Sepulveda
,
V.
Murugesan
,
W.
Wang
, and
E.
Saldanha
, “
Evaluation of deep learning architectures for aqueous solubility prediction
,”
ACS Omega
7
,
15695
15710
(
2022
).
83.
T.
Schroeter
,
A.
Schwaighofer
,
S.
Mika
,
A.
Ter Laak
,
D.
Suelzle
,
U.
Ganzer
,
N.
Heinrich
, and
K.-R.
Müller
, “
Estimating the domain of applicability for machine learning QSAR models: A study on aqueous solubility of drug discovery molecules
,”
J. Comput.-Aided Mol. Des.
21
,
651
664
(
2007
).
Published open access through an agreement with Pacific Northwest National Laboratory

Supplementary Material