We rely on a total of 23 (cluster size, 8 structural, and 14 connectivity) descriptors to investigate structural patterns and connectivity motifs associated with water cluster aggregation. In addition to the cluster size *n* (number of molecules), the 8 structural descriptors can be further categorized into (i) one-body (intramolecular): covalent OH bond length (*r*_{OH}) and HOH bond angle (*θ*_{HOH}), (ii) two-body: OO distance (*r*_{OO}), OHO angle (*θ*_{OHO}), and HOOX dihedral angle (*ϕ*_{HOOX}), where X lies on the bisector of the HOH angle, (iii) three-body: OOO angle (*θ*_{OOO}), and (iv) many-body: modified tetrahedral order parameter (*q*) to account for two-, three-, four-, five-coordinated molecules (*q*_{m}, *m* = 2, 3, 4, 5) and radius of gyration (*R*_{g}). The 14 connectivity descriptors are all many-body in nature and consist of the AD, AAD, ADD, AADD, AAAD, AAADD adjacencies [number of hydrogen bonds accepted (A) and donated (D) by each water molecule], Wiener index, Average Shortest Path Length, hydrogen bond saturation (% HB), and number of non-short-circuited three-membered cycles, four-membered cycles, five-membered cycles, six-membered cycles, and seven-membered cycles. We mined a previously reported database of 4 948 959 water cluster minima for (H_{2}O)_{n}, *n* = 3–25 to analyze the evolution and correlation of these descriptors for the clusters within 5 kcal/mol of the putative minima. It was found that *r*_{OH} and % HB correlated strongly with cluster size *n*, which was identified as the strongest predictor of energetic stability. Marked changes in the adjacencies and cycle count were observed, lending insight into changes in the hydrogen bond network upon aggregation. A Principal Component Analysis (PCA) was employed to identify descriptor dependencies and group clusters into specific structural patterns across different cluster sizes. The results of this study inform our understanding of how water clusters evolve in size and what appropriate descriptors of their structural and connectivity patterns are with respect to system size, stability, and similarity. The approach described in this study is general and can be easily extended to other hydrogen-bonded systems.

## I. INTRODUCTION

Water clusters are useful models for understanding hydrogen bonding and chemical transformations in aqueous environments via microsolvation approaches.^{1–5} These systems have been used, among others, to study proton-transfer mechanisms in protonated water clusters,^{6} to identify the origin of spectral signatures in IR spectra^{7–9} and to quantify strong cooperative effects in hydrogen bonding networks.^{10–13} While water clusters have definitely claimed their space and significance in the chemical physics and simulation communities, the question of how the information gained from small water clusters translates to the liquid and/or ice phases still remains an active field of research. For example, water clusters typically have far more two- and three-coordinated water molecules, whereas the condensed phase has mostly four-coordinated molecules. For the smaller clusters, the need to maximize hydrogen bonding results in all atoms lying on the surface of the cluster,^{14} whereas the smallest water cluster with a “fully solvated” water molecule appears at *n* ∼ 17.^{15,16} In addition to the different connectivity, small water clusters have a smaller extended hydrogen bonding network than liquid water and ice (in which the long-range structure is ordered). While it is known that some features of water clusters evolve with increasing cluster size,^{17} it is not fully understood at what point the clusters begin to exhibit substructures and/or structural parameters akin to those found in liquid water and ice. Improving our knowledge of this transition via appropriate descriptors could aid in our fundamental understanding of water aggregation while at the same time assisting in the development of model systems to effectively probe the physical phenomena in condensed phase aqueous environments.

Previous studies have reported and used descriptors of the local and extended hydrogen bond environment in bulk systems to distinguish ice-like from liquid-like structures^{18–20} and also to identify unique types of solvation environments during liquid water simulations.^{21–23} These descriptors of bulk environments typically take into account the relative orientations of numerous surrounding molecules, such as the orientational tetrahedrality order parameter (*q*),^{24,25} the translational tetrahedrality order parameter (*s*_{k}),^{24} the Steinhardt order parameters,^{26–29} and the local structure index (LSI).^{30} Other studies have noted the importance of extended structural descriptors,^{31} such as incorporating a measure of translational order between the first and second solvation shells (*ζ*)^{32} and the node total communicability (NTC).^{22} The authors of the work of Manserrat *et al.*^{33} used the smooth overlap of atomic positions (SOAP) descriptor to demonstrate the presence of ice-like substructures in liquid simulations, based on similarities in their local environment. However, a subsequent study of Donkor *et al.*^{34} demonstrated the dissimilarity of the SOAP parameter and other commonly used local structure parameters in detecting similar molecular environments. However, there is still much to understand about the local and extended structure of bulk systems, especially the emergence of bulk-like qualities upon aggregation. Furthermore, while structural and connectivity descriptors have been applied widely in bulk (ice and liquid) systems, there has been little work to develop descriptors of aggregation in smaller cluster systems having much smaller hydrogen bond networks with properties that are further away from the ones for condensed environments.

The goal of this work is to develop structural and connectivity descriptors for the water cluster minima in the range *n* = 6–25 contained in a recent database of 4 948 959 structures that was created as a result of extensive Monte Carlo Temperature Basin-Paving (MC-TBP) simulations^{35} with the TTM2.1-F polarizable potential.^{36} We specifically focus on a subset of the full database containing 162 892 water cluster minima that are within 5 kcal/mol of the putative minima for water cluster sizes *n* = 6–25. The restriction of the existing dataset to low-energy structures is likely sufficient in describing minimum-like structures. However, caution should be taken when using the descriptors to describe high-energy or nonstationary point structures that lie outside of the ranges used in the training and analysis. First, we define the descriptors we use and analyze the evolution of these descriptors for increasing sizes of water clusters in an attempt to provide insight into the transition from small to larger water clusters. While the binding energy (stability) of a cluster naturally evolves as a function of the size of the cluster as an extensive quantity, other structural descriptors are found to correlate strongly with the binding energy, lending insight into emerging stable features upon aggregation. Additionally, the correlations between descriptors are examined to better understand how these evolve in tandem upon aggregation. Lastly, an unsupervised learning approach was implemented to group water clusters of different sizes in the database based on similarities in their structural and connectivity descriptors. This presents a unique way of classifying water clusters of different sizes with the goal of identifying similar patterns across the size regime *n* = 6–25. The current study aims at developing descriptors of water aggregation up to 30 molecules so they can be used to understand the formation of a critical nucleus for water aggregation. Several of these descriptors, including their trends, are transferable to liquid water and/or ice, albeit the networks found in these smaller clusters (formed by the need to maximize hydrogen bonding, resulting to most atoms lying on the surface of the water cluster up to *n* ∼ 17) are quite different than those present in condensed phase environments. The fact that many highly accurate computational techniques suffer from poor scaling with respect to the system size hinders their application to very large clusters in conjunction with sampling techniques such as basin hopping Monte Carlo or to the bulk via molecular dynamics simulations. The current work offers the possibility to use the developed descriptors to compare the similarity between clusters regardless of size. Altogether, this work identifies descriptors of water aggregation using data science to better understand how hydrogen bonding structures change during the transition from small gas-phase systems to larger aggregates of water molecules, and sets the stepping stone for their future application to condensed phase aqueous environments.

## II. DEFINITION OF DESCRIPTORS

The descriptors used in this work are defined below. It is intuitive to rely on the many-body concept to categorize the various descriptors as one-body (intramolecular), two-body (intermolecular and pairwise), and many-body (intermolecular and cooperative). Utilizing the Cartesian geometry of each cluster, one can collect a large set of descriptors for localized geometries that will become the one-, two-, three-, and many-body structural descriptors. Graphs were constructed as previously described,^{35} where each node is a water molecule and each edge is a hydrogen bond satisfying the *r* − *ψ* definition of the hydrogen bond outlined in the work of Kumar *et al.*^{37} These graphs are undirected and contain adjacency information as labels of each node. More delocalized properties were derived from the graph representation of each cluster. These descriptors are termed connectivity descriptors because they contain information about the hydrogen bond network. Certain descriptors can have multiple values across a whole cluster. In cases where a single value per cluster is required, the average of the descriptor over the cluster will be used and will be denoted with angled brackets, viz. ⟨⋯⟩.

### A. Structural descriptors

Two one-body descriptors were utilized to describe the intramolecular geometry. These are the OH bond lengths (*r*_{OH}) and the HOH bond angle (*θ*_{HOH}) shown in Fig. 1(a). Three two-body descriptors were used to describe each hydrogen bond. From the six intermolecular degrees of freedom in a water dimer, the three most important coordinates for a hydrogen bond—the OO distance (*r*_{OO}), the OHO angle (*θ*_{OHO}), and the HOOX dihedral angle (*ϕ*_{HOOX}), where X lies on the bisector of the HOH angle—were chosen. These are depicted in Fig. 1(b). The *r*_{OO} has previously been shown to have a strong relationship with the vibrational frequencies as well as the hydrogen bond energy.^{38,39} The *θ*_{OHO} angle describes the linearity of each hydrogen bond; close-to-linear hydrogen bonds are known to be stronger.^{38,39} The *ϕ*_{HOOX} angle is used to quantify the *cis*-*trans*-dihedral angle that is important for the “strong–weak” nearest-neighbor hydrogen bond definition introduced in the work of Kirov *et al.*^{40} These authors argued that *trans* (*ϕ*_{HOOX} near 180°)-hydrogen bonds are more favorable than *cis* (*ϕ*_{HOOX} near 60°)-hydrogen bonds in polyhedral water clusters,^{40} since in the gas-phase water dimer the *trans*-conformation is about 1 kcal/mol more stable than the *cis*-conformation.^{41} This descriptor is defined as the dihedral angle between the hydrogen on the donating water that is not participating in the hydrogen bond, the two oxygen atoms, and the bisector of the two hydrogen atoms on the accepting water molecule [Fig. 1(b)]. Three- and many-body descriptors become increasingly complex, with 21 total degrees of freedom for a trimer. It, therefore, becomes intangible to analyze all three-body degrees of freedom. Instead, we will only focus on the OOO angle (*θ*_{OOO}) [Fig. 1(c)]. This descriptor is the angle between a central oxygen atom and two oxygen atoms participating in a hydrogen bond with the central molecule.

*q*) is a combinatorial metric of the distribution of bonds around an atom or molecule.

^{24}Our analysis uses the definition adopted by Errington and Debenedetti,

^{25}

^{,}

*θ*

_{i,c,j}is the angle formed by two water molecules,

*i*and

*j*, around a central,

*c*, water molecule. Eq. (1) sets a range between

*q*= 0 (stochastic distribution observed in an ideal gas) and

*q*= 1 (completely tetrahedral order). Original and subsequent work using the tetrahedral order parameter was based on the four closest water molecules to define

*q*.

^{24,25,42}Water clusters, however, may lack four neighbors in the first solvation shell, especially at the surface of the cluster and most definitely for the smaller cluster size networks (

*n*< 17). Therefore, neighbors will be determined through hydrogen bonds. The definition for tetrahedrality has been expanded in this work to accommodate a water molecule with two, three, and five (in addition to four for the standard definition) hydrogen bonds. Following the procedure outlined by Errington and Debenedetti,

^{25}the coefficient of the summation is modified such that the expectation value is still zero for an ideal gas. That is,

*n*is the number of water molecules bound to the central water of interest. Note that

*A*

_{4}reverts back to $38$. Therefore, the general tetrahedral order parameter (

*q*

_{n}) that will be used in this work is

*q*

_{n}value still quantifies how close the orientations of the hydrogen bonds are to 109.5° (tetrahedral) but can also account for a different number of hydrogen bonds to the central atom. For example,

*q*

_{3}can be thought of as how far a three-coordinated molecule is from trigonal pyramidal. In the following,

*q*will be used to denote the modified tetrahedrality order parameter regardless of coordination (i.e., it incorporates

*q*

_{m},

*m*= 2–5).

*R*

_{g}), which describes the radial distribution of particles away from the center of mass. It is defined as

*n*is the number of water molecules, $r\u20d7i$ is the position of each water molecule, and $r\u20d7com$ is the position of the center of mass.

### B. Connectivity descriptors

Many-body connectivity descriptors are derived from the graph representation of the clusters and generally describe the connectivity of the hydrogen bonding network. Specifically, each cluster is represented by an undirected graph where each node is a water molecule and each edge is a hydrogen bond. Several of these descriptors have been the focus of previous studies,^{17} but they will be revisited and investigated more in depth. One such descriptor is the hydrogen bond saturation (% HB) or the percent of hydrogen atoms participating in a hydrogen bond. It is simply $edges2\u22c5nodes$, where the number of edges and nodes come from the graph representation of each cluster, or $nHB2\u22c5nH2O$, where *n*_{HB} is the number of hydrogen bonds and $nH2O$ is the number of water molecules.

Another connectivity descriptor is the adjacency—the number of hydrogen bonds accepted (A) and donated (D) by each water molecule. The concept of adjacency has been extensively used by infrared (IR) spectroscopists to correlate with shifts in the observed vibrational spectral bands.^{43} These adjacencies range from D and A, or single donors and single acceptors respectively, to AAADD, which refers to a penta-coordinated water molecule that accepts three and donates two hydrogen bonds [Fig. 1(d)]. Although adjacency can be easily computed by finding the directionality of each hydrogen bond, adjacencies were provided as node labels in the graph representation of each cluster. To minimize size extensive behavior, the percentage of water molecules in each adjacency class will be used.

Other connectivity descriptors are slightly more complex. The Wiener topological index is the sum of the shortest paths from all nodes to all other nodes.^{44} It describes how connected a graph is, with lower values describing graphs with stronger connections and shorter paths. The Wiener index of alkanes^{44} was found to correlate with their respective boiling points, but applications have expanded its use to predict other thermochemical properties and toxicity of bioactive molecules.^{45} Closely related is the Average Shortest Path Length (ASPL), which is the Wiener index divided by the number of paths [$n2$, where n is the number of water molecules], making the ASPL less size extensive than the Wiener index. An example graph and corresponding shortest paths are shown in Fig. 2(a) to illustrate the computation.

Non-short-circuited cycles are the final descriptor in the connectivity category considered in this study. Non-short-circuited cycles, as described by Rahman and Stillinger, are hydrogen-bonded rings in water that are not cut short by additional hydrogen bonds.^{46} Mathematically, that means the shortest paths (or tied for shortest paths) between all sets of two nodes in the cycle are along that cycle. The non-short-circuited cycles for the pentagonal prism are depicted in Fig. 2(b). This produces a nontrivial problem where brute-force method for computation, i.e., finding all cycles and checking for short-circuited behavior, is too expensive. Instead, we used an optimized algorithm to find good candidates for non-short-circuited cycles via King’s shortest paths^{47} and then checked for short-circuited behavior. The pseudocode is described in algorithm S1, and the code to perform such searches can be found at https://github.com/gds001/Descriptors-for-Water-Aggregation.git. To minimize size extensive behavior, the percentage of the cycles of each size were used.

## III. DEVELOPMENT OF MACHINE LEARNING MODELS

^{48}Linear regressions utilize a linear function,

*x*

_{i}are the features of the model (descriptors or products of descriptors),

*μ*

_{i}is the average value of

*x*

_{i},

*σ*

_{i}is the standard deviation of

*x*

_{i}, and

*a*

_{i}is the fitting weight. The square of the correlation coefficient (

*R*

^{2}) served as a good metric for the strength of the relationship. When used for prediction, multidimensional and polynomial regressions were created in a similar fashion, using initial weights determined through low-dimensional principal component (PC) models.

To simplify the configuration space and create useful clustering models, principal component analysis (PCA) was employed using scikit-learn.^{49} PCA has been widely used for a variety of chemical applications, ranging from property prediction to group clustering.^{50–54} In summary, PCA linearly transforms descriptors to pull as much variance into the principal components (PCs) and reduce the dimensions required for a successful model. Mean-centering and norm-scaling were employed on the descriptors prior to PCA. PCA can be utilized with all descriptors, which can be useful when attempting to analyze the interplay between the descriptors as a whole. However, certain descriptors are not good candidates for a PCA model, such as those that are uniform across the data set and those that do not contain information about desired properties. Including such descriptors will pull unnecessary information into higher PCs and require more dimensions for a successful model. Thus, the PCA model used in this work was constructed by selecting a subset of potential descriptors and inspecting for visual separation. Descriptors were added and removed one by one until no improvements were found.

## IV. RESULTS AND DISCUSSION

### A. Evolution of descriptors with cluster size

By examining how the descriptors correlate to cluster size (*n*), we gain insight into how these descriptors (associated with stable clusters) evolve upon aggregation.

We will first focus on the changes in water intramolecular geometry upon aggregation via the one-body descriptors. Figure 3 depicts the distributions of *r*_{OH} (left) and *θ*_{HOH} (middle) for water cluster sizes *n* = 6 (bottom of each panel) to *n* = 25 (top of each panel). For these finite clusters, we observe a relatively constant peak around 0.950–0.955 Å for the free *r*_{OH} across all cluster sizes in the *r*_{OH} distributions. However, upon increasing the cluster size, we see a splitting of the peak between 0.965 and 0.975 Å upon aggregation. This splitting suggests that those molecules exist in differing hydrogen bonding environments within the water clusters. In general, the average *r*_{OH} value (dashed line) lengthens for larger cluster sizes (*R*^{2} = 0.59) in a manner that is akin to the increase in % HB. Furthermore, in the *θ*_{HOH} distributions, we see that the small peak between 102° and 104° disappears quickly around *n* = 12. The peak around 106° decreases as the cluster size increases, which is also associated with the introduction of a peak around 107°. At about *n* = 12, a small shoulder starts to grow in around 109°. This expansion of the *θ*_{HOH} relative to the gas-phase monomer value (104.5°) upon aggregation is well documented in the literature.^{55–61} However, the overall correlation between *θ*_{HOH} and cluster size is weak (*R*^{2} = 0.30) on average, as larger, bulk-like angles start to appear and the smaller gas-like angles disappear quickly upon increasing the cluster size.

Figure 4 traces the evolution of (a) the hydrogen bond saturation, (b) the radius of gyration and ASPL, (c) the *θ*_{OHO}, (d) the *θ*_{OOO}, (e) the percentage of molecule type, and (f) the percentage of cycle counts as a function of water cluster size (*n* = 6–25). A few of the many-body descriptors are inherently size extensive. These include the Wiener index,^{17} the number of hydrogen bonds, and the total binding energy (*R*^{2} > 0.97, Fig. S1). In addition, there are many descriptors that intuitively change with increasing cluster size. For example, the % HB increases for larger cluster sizes and trends toward 100% [*R*^{2} = 0.62, Fig. 4(a)]. This is expected because the molecules in the interior of a cluster are more coordinated and, consequently, there is an on average smaller fraction of molecules with a free *r*_{OH}. While we expect % HB ≃ 100 for the bulk phase, it does not exceed 90% for clusters up to *n* = 25. The radius of gyration and ASPL are also strongly correlated with the cluster size [*R*^{2} > 0.94, Fig. 4(b)]. This is not surprising given that both metrics quantify the size and shape of a cluster, despite being derived from different representations of the cluster (Cartesian coordinates vs a graph).

In general, two-body descriptors (i.e., *r*_{OO}, *θ*_{OHO}, and *ϕ*_{HOOX}) have weak correlations with cluster size. The dihedral angles and resulting ratios of *cis*-vs *trans*-hydrogen bonds stays relatively consistent at 30%–40% *cis*-hydrogen bonds (0°–90°) and 60%–70% *trans*-hydrogen bonds (90°–180°). The *r*_{OO} distance, on average, is smaller for large cluster sizes (*R*^{2} = 0.38), while the *θ*_{OHO} [*R*^{2} = 0.31, Fig. 4(c)] gradually becomes more linear with increasing cluster size. In addition, there exists a correlation between *θ*_{OOO} and the size of the cluster [*R*^{2} = 0.79, Fig. 4(d)]. On average, the *θ*_{OOO} increases from an average of 87.6° for *n* = 6 to an average of 103.0° for *n* = 25. However, the correlation with tetrahedral order (*q*) is weaker (*R*^{2} = 0.31). Since distribution plots for that descriptor have been reported for liquid simulations,^{24,25,42,62–64} we provide the distributions of the tetrahedral order (*q*) for cluster sizes *n* = 6–25, for comparison (Fig. 3, right panel). As observed in liquid simulations,^{25,42,62–64} the distribution is bimodal. However, the peak existing at lower *q*-values is more prominent than the corresponding one in liquid water, indicating the large ratio of molecules exhibiting a weaker local tetrahedral ordering in the clusters compared to liquid water. However, there is a substantial amount of molecules approaching *q* = 1, indicating the emergence of fully solvated molecules having four nearest neighbors with a near tetrahedral local structure.

The evolution of the hydrogen bond network (connectivity descriptors) with cluster size is of particular interest. Figure 4(e) traces the changes in the fraction of molecule types (local bonding information) as a function of cluster size. For small water clusters (*n* = 6–7), the AD-type water molecules dominate with small contributions from ADD and AAD. The fraction of AD water molecules sharply decreases up to size *n* = 8 and continues to decrease for larger cluster sizes. It is at this size regime that the energetically stable *n* = 8 cube appears, which is entirely ADD/AAD. This structural motif persists through larger clusters and is likely the cause for the low % of AD molecules in the low-energy subset of the full cluster database used in this study. The fractions of ADD- and AAD-type water molecules grow at nearly the same rate, reaching a maximum at *n* = 8 and staying somewhat constant beyond that point. Eventually, at around *n* = 22 the AADD-type water molecules become the most prevalent (approaching 40%, on average) while the ADD- and AAD-type ones decrease slightly. This demonstrates the point at which the AADD-type molecules become the most prevalent type in the cluster. Regardless, even at *n* = 25 we see that roughly 60% of the water molecules are still three-coordinated, on average.

The average number of hydrogen bonds in liquid water is about 3.5,^{37} suggesting limiting values of AADD-type molecules in liquid water. Furthermore, a small percentage of water molecules are penta-coordinated (AAADD-type) for *n* > 10. The AAADD molecules become more prevalent for larger water clusters albeit never exceeding 17% for a single cluster. The emergence of the AAADD molecules is accompanied by a simultaneous loss of AAD molecules (Fig. S4), demonstrating the role of the AAADD molecules in balancing the number of hydrogen bond donors in the cluster.

Figure 4(e) traces the changes in the fraction of molecule types (local bonding information) and the fraction of cycle counts as a function of cluster size. We observe significant changes in the cycles that are produced as part of the extended hydrogen bond environment [Fig. 3(f)]. For small water clusters, the small cycles (3 and 4) naturally dominate. While these ring sizes are often strained, they help to maximize the number of hydrogen bonds for small clusters. However, the five-membered cycles abruptly become the dominant cycle size for *n* > 16 with the four-membered cycles steadily decreasing. The six-membered cycles very slowly increase in prevalence but never surpass roughly 20% of the cycles in a cluster for *n* = 25. Given that hexagonal ice (ice Ih) has only six-membered rings,^{65} we anticipate the prevalence of six cycles in stable water cluster minima to increase for larger cluster sizes (*n* > 25).

### B. Correlation between descriptors

It is expected that several descriptors will correlate with one another. Some of these correlations are an artifact of the way each descriptor is defined, whereas others provide insight into the structural and connectivity patterns present in the water cluster minima. The coefficients are supplied in Table S1, and correlation plots of the various descriptors are depicted in Figs. S2–S4. The various correlations are summarized by the corresponding strengths (correlation coefficients) traced by colors in Fig. 5. Descriptors that have no correlations, such as ⟨*ϕ*_{HOOX}⟩, % AAADD, and % of six cycles, are excluded for clarity.

The descriptors appear to fall into two groupings that are related to size and hydrogen bond strength. There is a strong correlation between ASPL and *R*_{g} (*R*^{2} = 0.98). There is no mathematical reasoning for the correlation in the formulas for these descriptors, suggesting there are two distinct ways in describing a similar property using either the Cartesian coordinates (*R*_{g}) or graphs (ASPL). The correlation is, however, logical, since both ASPL and *R*_{g} increase with cluster size and decrease with the spherical nature of the cluster. Closely correlated to ASPL and *R*_{g} is the Wiener index (*R*^{2} = 0.93), which is similar in computation to the ASPL. Another correlation exists with the aforementioned descriptors and ⟨*θ*_{OOO}⟩ (*R*^{2} ≈ 0.8). The origin of this latter correlation is not clear, but it serves as yet another link between structural and connectivity descriptors. A natural correlation exists between % HB and ⟨*r*_{OH}⟩ (*R*^{2} = 0.9). This arises from the lengthening of *r*_{OH} due to the formation of hydrogen bonds.^{7} In the presence of additional hydrogen bonds (higher % HB), *r*_{OH} will be more elongated, therefore ⟨*r*_{OH}⟩ will be greater. Altogether, these six descriptors form the core of one group of correlations. As discussed earlier, all these descriptors have strong correlations with size.

Another set of correlations exists between the rest of the structural descriptors. The strongest correlation in this group is between ⟨*q*⟩ and ⟨*θ*_{OHO}⟩ (*R*^{2} = 0.76). It translates to the realization that water molecules in a water cluster are on average more tetrahedral when the hydrogen bonds are more linear. This is reasonable since hydrogen bonds that are far from 109.5° (tetrahedral) generally bend their hydrogen bond to increase hydrogen-lone-pair interactions.^{66} It is interesting that the correlation between ⟨*q*⟩ and ⟨*θ*_{OHO}⟩ is stronger (*R*^{2} = 0.76) than the correlation between ⟨*q*⟩ and ⟨*θ*_{OOO}⟩ (*R*^{2} = 0.54), which are mathematically related to one another. Since the function between *q* and *θ*_{OOO} is not one-to-one, the correlation of their average values is weak. Another strong correlation is the one between ⟨*θ*_{OHO}⟩ and ⟨*r*_{OO}⟩ (*R*^{2} = 0.63). This is quite reasonable, as both descriptors are well-known metrics of hydrogen bond strength.^{38,39} This implies that the correlations between ⟨*r*_{OO}⟩, ⟨*θ*_{OHO}⟩, and ⟨*q*⟩ (all descriptors focusing on the local hydrogen bonding environment) stem from the relationship between such descriptors and the underlying hydrogen bond strengths. Oddly, ⟨*θ*_{HOH}⟩ shows correlations only with the descriptors mentioned in this paragraph (*R*^{2} ≈ 0.35). Therefore, it will be grouped accordingly despite the weakness of the correlations. It is important to note that *r*_{OH} is also known to correlate with the hydrogen bond strength, with a longer *r*_{OH} associated with stronger hydrogen bonds.^{67,68} This relation is not seen in this analysis, as it is skewed by the inclusion of the free *r*_{OH} in the averaging of the value of the descriptor.

Weak cross correlations exist between the two groups. Notably, the strongest correlations between subsets of the two groups are between ⟨*θ*_{OOO}⟩ and descriptors of hydrogen bond strength (*R*^{2} ≈ 0.57) and weaker correlations are between ⟨*r*_{OO}⟩ and descriptors of size (R^{2} ≈ 0.47). This relationship is not surprising because of the existence of a weak correlation between *θ*_{OOO}, *q*, and hydrogen bond strength. The latter, however, is surprising since the correlations between the two-body descriptors and the cluster size are quite weak. This suggests that additional information besides size and hydrogen bond energy is encoded in these descriptors.

Correlations between adjacency descriptors provide important insight between the stability of hydrogen bond networks and the balance of hydrogen bond acceptors (A) and donors (D). Strong correlations exist between the % of AD, AAD, ADD, and AADD, which are most prevalent in the size regime of the water clusters examined in this study. The correlation between those descriptors is also shown in Fig. 5. The anticorrelation between AD and AAD/ADD is weak (*R*^{2} ≈ 0.17) due to the rapid disappearance of AD molecules. The correlation is stronger between AAD/ADD and AADD (*R*^{2} ≈ 0.42), a fact that demonstrates the interplay between AAD/ADD and AADD in large, highly connected clusters. The strong correlation between AAD and ADD (*R*^{2} = 0.67) arises from the natural constraint that for each acceptor there is a donor. As seen previously, there is a disappearance of AD and the appearance of AADD water molecules upon increasing cluster size. This loose correlation with cluster size manifests as weak (0.3 < *R*^{2} < 0.6), while correlations between AD and AADD descriptors are found to strongly correlate with size—⟨*r*_{OH}⟩, ASPL, Wiener index, % HB, and *R*_{g}.

Correlations in the % cycles exist because more of one cycle means less of another cycle. As most cycles consist of four and five molecules, there is a strong correlation bounded by the anticorrelation line. Correlations between three-membered cycles and six-membered cycles and other cycles are extremely weak because their populations are small throughout the dataset. Although correlations with other descriptors are weak, interesting trends emerge with descriptors that predict hydrogen bond strength. There exist weak correlations (*R*^{2} ≈ 0.3) between the three-membered cycles, four-membered cycles, and five-membered cycles and ⟨*q*⟩, ⟨*r*_{OO}⟩, ⟨*θ*_{OHO}⟩, and ⟨*θ*_{OOO}⟩. Notably, values with increased hydrogen bond strength correspond to more five-membered cycles and less three-membered cycles and four-membered cycles. This is reasonable as three-membered cycles and four-membered cycles are more strained (⟨*θ*_{OOO}⟩ ≤ 90°) and therefore are associated with weaker hydrogen bonds.

### C. Partition of descriptors via principal component analysis

The PCA model constructed in this study utilizes descriptors that can effectively translate to the bulk phase. For this reason, size, ASPL, Wiener index, and *R*_{g} are excluded from the model. The resulting best model utilized the descriptors ⟨*r*_{OO}⟩, ⟨*θ*_{OHO}⟩, ⟨*θ*_{OOO}⟩, ⟨*q*⟩, % HB, % AD, % ADD, % AAD, % AADD, % AAADD, % four-membered cycles, and % six cycles. The results of the PCA analysis are shown in Fig. 6, suggesting that interesting structural patterns emerge warranting further analysis. It was found that the one-body descriptors (⟨*r*_{OH}⟩ and ⟨*θ*_{HOH}⟩) muddled the model by smearing groups together and were thus not used. Additionally, the average dihedral angles of hydrogen bonds (⟨*ϕ*_{HOOX}⟩) and the % five-membered cycles combined seemingly meaningful separations; thus, ⟨*ϕ*_{HOOX}⟩ and % five-membered cycles were not used. The descriptors % HB, % four-membered cycles, and % six cycles had minor effects on the separation of principal components but served to increase the visible separation. Notably, % HB improved the clustering of isomorphs with fewer PCs. The latter two are included in the model for potential benefit when using multiple PCs for analysis but will not be discussed in detail here.

An initial analysis examined the component matrices produced by the PCA to describe how the initial descriptors map onto this variance-separated space. The components of each descriptor in the first seven PCs are shown in the upper triangle of Fig. S5. PCs 4–6 are dominated by % AAADD, % six cycles, and % four-membered cycles respectively. It is noticeable, however, that a significant proportion of % AAADD contributes to PC3. PCs 1–3, shown in Fig. 6, have a good distribution of the contribution from each of the remaining descriptors. The strongest descriptors contributing to PC1 are ⟨*r*_{OO}⟩, ⟨*θ*_{OHO}⟩, ⟨*θ*_{OOO}⟩, and ⟨*q*⟩. PC3 is similar to PC1 but includes % HB and % AD information. PC2 is unique in that it contains information about adjacencies (% ADD, % AAD and % AADD). These differences, combined with the structural patterns that appear and the variances in Fig. 6, make these PCs good candidates for analyzing water clusters.

It is found that PCA succeeds in separating clusters based on descriptors not included in the model. Size correlates strongly with PC1 (*R*^{2} = 0.76), with similar correlations between PC1 and ASPL, Wiener index, and *R*_{g}. This is likely because PC1 contains information about ⟨*θ*_{OOO}⟩, which is closely related to size. The combination of ⟨*r*_{OO}⟩, ⟨*θ*_{OHO}⟩, ⟨*θ*_{OOO}⟩, and ⟨*q*⟩ in PC1 seems to pull the size extensive behavior of ⟨*θ*_{OOO}⟩ into PC1, while leaving the size-independent character of ⟨*θ*_{OOO}⟩ in lower PCs. The implicit separation of size is especially important as it allows aspects of the information related to size to be drawn up to PC1 and aspects of information unrelated to size to remain pure in the lower PCs. When targeting attributes that must remain size-independent, it will be beneficial to remove PC1 from the model.

PCA provides a useful tool to separate and group similar data points, and this technique was employed to classify and compare certain water clusters. Since PC2 and PC3 provided the best visual separation, they will be used heavily in this section. For quantitative metrics, PC 2–7 will be used. PC1 will not be used because it largely encodes cluster size, and this study seeks to classify patterns in water clusters independent of size. As discussed earlier in this section, PC2 separates mainly on adjacency and PC3 separates mainly on % HB. These descriptors have discrete values, arising from the division of small integers whose magnitudes are <60. Therefore, their distinct separation is natural but further analysis shows it is more subtle. Figure 7 depicts the separation of the aforementioned descriptors across PC2 and PC3 for water clusters of size *n* = 25. PC3 nicely separates hydrogen bond saturation into distinct strata, with negative PC3 representing highly connected structures and positive PC3 representing less connected structures. Clusters along PC2 contain mixtures of the various % adjacencies encoding the detailed information on the hydrogen bond networks. Much of the information across PC2 is separated by % ADD. Nonetheless, the separation provided by PC2 and PC3 in this model provides a great way to describe similarly connected water clusters using descriptors that have measurable values in bulk water.

Another application of the separation is to group similar clusters, such as families of isomorphs. The water hexamer has been widely studied for the emergence of a complex, non-quasiplanar landscape of low-energy isomorphs.^{69–71} The PC model developed here appropriately separates the isomorph family (Fig. 8 left). Notably, the common families of the water hexamer, delineated by blue borders, appear in distinct groupings. Prism- and cage-like clusters appear in quadrant III, whereas ring structures appear in quadrant I. The other isomorphs, shown with no outlines, group around similarly structured clusters. For example, the traditional “cage” is grouped with a similar but distinct cage-like structure. More clearly, the “book” structure is grouped among other fused ring structures. Similar patterns emerge for larger water cluster sizes. Isomorphs remain separated across PC2–3 (Fig. S6). Figure 8 (right) shows the separation for the water 16-mer, where the lowest energy isomer in each grouping is shown. Although each grouping is expected to be diverse for the medium-sized clusters, the structures shown in Fig. 8 are good representatives of the structures that separate via PCA.

The power of this technique rests in its ability to identify water clusters of different sizes with similar structural motifs. As discussed earlier, water-6’s “prism” (trigonal prism) structure appears in the far negative quadrant across PC2 and PC3. Similar structures, such as water-8’s cube and water-10’s pentagonal prism, appear in a similar area, although slightly higher in PC3 due to differences in ⟨*q*⟩ and ⟨*θ*_{OHO}⟩. Fused prisms appear similarly in the negative region of PC3 but are spread out across PC2, with positive PC2 having more shared faces. Defects to fused prisms appear in proximal groupings to their “parent” cluster. An example of a “defect” (indicated by a green border in Fig. 8) involves breaking the arrangement of one end and the shifting of that cluster “upward” in PC3 (because of fewer hydrogen bonds) and “left” in PC2 (because of fewer ADD/AAD molecules).

To exemplify the PCA classification, the clusters closest to the lowest energy 25-mer in the data set were identified and shown in Fig. 9. Clusters were found by proximity (norm 2) across PC 2–7. The three closest structures interestingly belong to cluster sizes 22, 19, and 16. Each structure shows similar structural motifs, a nucleating pentagonal prism-like structure (shown on the right side of each image), and the beginnings of an ordered cage surrounding a centered water molecule (shown on the left). As these are the most structurally similar clusters, they would make great candidates as starting structures when needing to study the 25-mer. This analysis can be done for any water cluster, including structures such as clathrates, which were not included in the subset of the full database that was used for the analysis (see the discussion in the Introduction).

### D. Prediction of energetic stability

It is finally of interest to examine which descriptors are predictive of the energetic stability [binding energy per water molecule (BE_{n})] of the cluster. As discussed earlier, BE_{n} naturally evolves with the cluster size. More specifically, there exists a strong correlation between the binding energy per water molecule and the cluster size (*R*^{2} = 0.99), as seen in the left panel of Fig. 10. Note that the limiting value of BE_{n} for *n* → *∞* for the fit shown in the right panel of Fig. 10 is −12.7 kcal/mol when the exponent of *n* is considered a variable (with a value of 0.82). Incidentally, this limit is near the lattice energy of ice (14.07 kcal/mol per molecule).^{72} The cluster sizes (up to *n* = 25) are too small to be used in conjunction with a fit of the volume element *n*^{−1/3}, which produces a less accurate regression for this cluster regime than when the exponent is a variable.

Descriptors that correlate with size (*r*_{OH}, *θ*_{OOO}, ASPL, Wiener index, % HB, and *R*_{g}) also show correlations with the binding energy per water molecule (*R*^{2} ≈ 0.95 for *θ*_{OOO}, ASPL, Wiener index, and *R*_{g}, *R*^{2} ≈ 0.7 for *r*_{OH} and % HB). However, none of these descriptors are predictive of relative stability between isomers within a particular water cluster size. The descriptor that most improved the accuracy of the prediction (albeit by only 5%) was % HB. Therefore, it is inferred that the correlation with the aforementioned descriptors comes mostly from their relationship with size and size’s relationship with the cluster binding energy.

Another way to analyze energetic stability is to scale the cluster binding energy by the number of hydrogen bonds. This metric of energy does not scale with cluster size but instead scales strongly with structural many-body descriptors, such as *r*_{OO}, *θ*_{OHO}, and *q*, which are known to greatly influence the energy of the hydrogen bond.^{38,39,73} It is therefore expected that they are predictive of the average energy of a hydrogen bond in a cluster. Each has relatively strong correlations (*R*^{2} between 0.70 and 0.86) with the binding energy per hydrogen bond as can be seen from the right panel of Fig. 10. Strikingly, these three descriptors combine linearly for an improved fit (*R*^{2} = 0.95), suggesting each descriptor encodes distinct properties of hydrogen bond strengths.

Alternatively, more descriptors can also be combined to predict the cluster binding energy. There is strong predictive power for the energy per molecule and the energy per hydrogen bond from the combination of all descriptors with an RMSE of 0.036 and 0.022 kcal/mol per molecule, respectively (Table I and Fig. 11). The regression weights are displayed in Table S2; the largest weights are associated with size, ASPL, and Wiener index, suggesting that these descriptors together are most predictive of the binding energy per water molecule. Since these descriptors are not transferable to bulk water, the linear regression was refitted using only the descriptors that can be computed under periodic boundary conditions (Table I), namely ⟨*r*_{OH}⟩, ⟨*r*_{OO}⟩, ⟨*θ*_{HOH}⟩, ⟨*θ*_{OHO}⟩, ⟨*θ*_{OOO}⟩, ⟨*q*⟩, % AD, % AAD, % ADD, % AADD, % AAAD, % AAADD, % three-membered cycles, % four-membered cycles, % five-membered cycles, % six cycles. As can be seen from Table I, the fit is less predictive and begins to incorporate cycle counting as predictive features, as it can be seen from the increase in the fitting weights of % four-membered cycles and % five-membered cycles from <0.05 to >0.10 in Table S2.

. | . | All . | Bulk . | PC 1–3 . | PC 1–7 . | ||||
---|---|---|---|---|---|---|---|---|---|

Quantity . | Regression model . | RMSE . | k
. | RMSE . | k
. | RMSE . | k
. | RMSE . | k
. |

BE_{n} | Linear | 0.036 | 22 | 0.050 | 18 | 0.075 | 3 | 0.062 | 7 |

Quadratic (uncoupled) | 0.032 | 44 | 0.047 | 36 | 0.073 | 6 | 0.061 | 14 | |

Quadratic (coupled) | 0.028 | 275 | 0.041 | 189 | 0.072 | 9 | 0.056 | 35 | |

BE_{HB} | Linear | 0.022 | 22 | 0.030 | 18 | 0.043 | 3 | 0.034 | 7 |

Quadratic (uncoupled) | 0.020 | 44 | 0.028 | 36 | 0.043 | 6 | 0.034 | 14 | |

Quadratic (coupled) | 0.018 | 275 | 0.025 | 189 | 0.041 | 9 | 0.032 | 35 |

. | . | All . | Bulk . | PC 1–3 . | PC 1–7 . | ||||
---|---|---|---|---|---|---|---|---|---|

Quantity . | Regression model . | RMSE . | k
. | RMSE . | k
. | RMSE . | k
. | RMSE . | k
. |

BE_{n} | Linear | 0.036 | 22 | 0.050 | 18 | 0.075 | 3 | 0.062 | 7 |

Quadratic (uncoupled) | 0.032 | 44 | 0.047 | 36 | 0.073 | 6 | 0.061 | 14 | |

Quadratic (coupled) | 0.028 | 275 | 0.041 | 189 | 0.072 | 9 | 0.056 | 35 | |

BE_{HB} | Linear | 0.022 | 22 | 0.030 | 18 | 0.043 | 3 | 0.034 | 7 |

Quadratic (uncoupled) | 0.020 | 44 | 0.028 | 36 | 0.043 | 6 | 0.034 | 14 | |

Quadratic (coupled) | 0.018 | 275 | 0.025 | 189 | 0.041 | 9 | 0.032 | 35 |

The developed regression models were applied to structures lying outside the dataset used for training to examine predictive behavior. A database of water clusters larger than *n* = 25 was constructed using the same methods as outlined in the work of Rakshit *et al.*^{35} Although the configuration space of these larger water clusters was not sampled at the same level of detail as the existing database for *n* = 3–25, it nonetheless can provide useful information about any divergent behavior in the models. It was found that models based solely on bulk-like descriptors were successful in predicting the energy to a comparable accuracy for clusters inside and outside the training set. The RMSE values decrease, viz., from 0.050 to 0.046 kcal/mol per molecule and from 0.030 to 0.028 kcal/mol per hydrogen bond. On the other hand, models that incorporate the size, ASPL, Wiener index, and *R*_{g} were found to overestimate the energies for larger clusters, likely due to the nonlinear correlation between energy and size (Fig. 10 left). Complete extrapolation should never be assumed, as structures with descriptors not seen by the model may diverge.

The PCs of the PCA can also serve as indicators to predict stability. When using PCA, fewer parameters are required as the data have been reduced in dimensionality. The RMSEs for predicting the binding energy per molecule and per hydrogen bond are listed in Table I. Incorporating the first seven PCs produces relatively good fits. Although the fits are worse than the regression with bulk descriptors by 13%–24%, the number of parameters required is 61% less. This current model is an excellent starting point in using reduced dimensions to predict energetic stability, but incorporating additional descriptors and higher energy local minima into the training set could improve the ability of the model to predict the binding energy.

The strong predictive power of the simplistic linear regressions is promising in that these cluster-averaged descriptors can be used to predict the energy of water cluster minima. Water cluster properties that strongly affect stability, such as proton ordering in the polyhedral cage water clusters (H_{2}O)_{n}, *n* = 20, 24, 28,^{40,74,75} are not explicitly incorporated in the model. Proton ordering might be implicitly described by other descriptors such as *r*_{OO}, resulting in predictive fits. With the incorporation of descriptors that explicitly account for proton ordering, the model is likely to improve. This assumption would be consistent with better-predicting models that use geometries explicitly as descriptors.^{76,77} Furthermore, more predictive models can be developed by using nonlinear machine learning techniques. Polynomial regressions improve the fits by between 16% and 18% (Table I). Naturally, additional improvement is anticipated with more sophisticated models, such as neural networks, that are not the focus of this work.^{17,78,79}

## V. CONCLUSIONS

The database of water cluster minima generated in the work of Rakshit *et al.*^{35} has provided a useful resource to analyze structural and connectivity descriptors in water cluster minima (*n* = 6–25). It was found that some descriptors, such as *r*_{OH} and % HB, correlated strongly with cluster size, while others, such as *r*_{OO}, *θ*_{OHO}, and *q*, correlate with one another due to their relationships with hydrogen bond strength. Cluster size was found to be the strongest predictor of stability (binding energy per molecule), while when descriptors are grouped together, it results in providing sufficient information to predict cluster stability. Energy prediction appears better when scaled for each hydrogen bond, as most descriptors directly affect the strength of individual hydrogen bonds.

The principal component analysis demonstrated the use of descriptor decomposition in classifying and predicting properties of water clusters. The PCA predicts the energy of each water cluster with ∼60% fewer free parameters than linear regressions with a 13% and 24% loss in accuracy for the binding energy per molecule and the binding energy per hydrogen bond, respectively. The PCA also shows promise in identifying similarly structured clusters of different sizes, allowing for the use of smaller clusters to model computationally expensive large clusters. The PCA model developed in this study is publicly available and can be used to further study the structural and connectivity patterns in either larger clusters or different aqueous systems. Improvements to the model, such as additional descriptors, additional isomers, and explicit/robust tests on bulk systems, will be the focus of future studies. Although the models described in this work were not trained on large clusters or liquid water/ice configurations, they should be transferable to structures with similar descriptors (containing four-membered cycles, five-membered cycles, six-membered cycles, and mostly ADD, AAD, AADD adjacencies). Clusters and bulk structures that have descriptors well outside the ranges considered in this work will require additional training information to create a reliable model. Although some of the conclusions are expected, especially those regarding changes in cluster size and correlations with hydrogen bond strength (albeit not previously reported since the water cluster database has been made available only recently), many others are novel. These include the drastic transitions in the % of five-membered cycles and four-membered cycles at *n* = 17 and the correlation between ASPL and *R*_{g}. An additional novelty in this work is the synthesis of these structural and connectivity descriptors in describing these clusters, the structure classification of the PCA, and the energy regression from cluster-averaged descriptors. These findings set the stage for future investigations, as well as provides evidence that these traits persist throughout a wide range of water clusters.

Since the described approach is general, it can be easily extended to other training sets and other hydrogen-bonded systems to further assist in their structure prediction and structural motif analysis and classification, a field that has seen great improvements with the use of simplistic and descriptive machine learning techniques.^{80–83} We plan to use the descriptors and developed protocol to analyze the structural motifs for liquid water at different temperatures in forthcoming studies.

## SUPPLEMENTARY MATERIAL

See the supplementary material for algorithms for computing cycles efficiently, figures correlating descriptors and describing the PCA, and a table describing the fitting parameters for linear recession models.

## ACKNOWLEDGMENTS

This work was supported by U.S. Department of Energy (DOE), Office of Science, Office of Basic Energy Sciences, Division of Chemical Sciences, Geosciences, and Biosciences, Condensed Phase and Interfacial Molecular Science (CPIMS) program, Molecular Theory and Modeling Grant No. FWP 16249 at Pacific Northwest National Laboratory (PNNL), a multi-program national laboratory operated for DOE by Battelle. This research also used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Garrett D. Santis**: Data curation (lead); Formal analysis (lead); Investigation (equal); Methodology (equal); Software (lead); Validation (lead); Writing – original draft (lead); Writing – review & editing (equal). **Kristina M. Herman**: Data curation (supporting); Formal analysis (supporting); Software (supporting); Validation (supporting); Visualization (supporting); Writing – original draft (supporting); Writing – review & editing (supporting). **Joseph P. Heindel**: Formal analysis (supporting); Investigation (supporting); Writing – original draft (supporting). **Sotiris S. Xantheas**: Conceptualization (lead); Funding acquisition (lead); Methodology (supporting); Project administration (lead); Resources (lead); Supervision (lead); Writing – original draft (supporting); Writing – review & editing (supporting).

## DATA AVAILABILITY

The data that support the findings of this study are openly available in https://github.com/gds001/Descriptors-for-Water-Aggregation.git.

## REFERENCES

^{3+}in water from first principles calculations

^{+}(H

_{2}O)

_{21}

^{+}(H

_{2}O)

_{6}

*Ab initio*studies of cyclic water clusters (H

_{2}O)

_{n}

*, n*= 1–6. II. Analysis of many-body interactions

*ab initio*benchmark binding energies for water clusters

*n*= 2–25

*n*= 2–24

*Handbook of Computational Chemistry*

*n*= 17–21 size regime

_{2}O)

_{20}isomers of exceptional stability: A drop-like and a face-sharing pentagonal prism cluster

_{2}O)

_{n}, in the range 36 ≤

*n*≤ 50?

*Ab-initio*total energy studies of the static and dynamical properties of ice Ih

*Topology in Chemistry*

*Chap. The Rich Legacy of Half a Century of the Wiener Index*

*hcapca*: Automated hierarchical clustering and principal component analysis of large metabolomic datasets in R

_{2}O)

_{24}cluster and their use in constructing periodic unit cells of the structure I (sI) hydrate lattice

_{2}O)

_{20,24,28}clathrate hydrate cages: Optimization and many-body analysis