This work reexamines seeded simulation results for NaCl nucleation from a supersaturated aqueous solution at 298.15 K and 1 bar pressure. We present a linear regression approach for analyzing seeded simulation data that provides both nucleation rates and uncertainty estimates. Our results show that rates obtained from seeded simulations rely critically on a precise driving force for the model system. The driving force vs. solute concentration curve need not exactly reproduce that of the real system, but it should accurately describe the thermodynamic properties *of the model system*. We also show that rate estimates depend strongly on the nucleus size metric. We show that the rate estimates systematically increase as more stringent local order parameters are used to count members of a cluster and provide tentative suggestions for appropriate clustering criteria.

## I. INTRODUCTION

Nucleation is important in many contexts, yet both theory and simulation have largely failed to quantitatively predict experimental nucleation rates.^{1–5} Recently, some promising results^{6,7} have been obtained using a new seeded simulation technique.^{8,9} The seeding approach was developed to circumvent certain problems with other rare event methods^{1} in studies of solute precipitate nucleation. It was not originally expected to provide highly accurate rates,^{8} and as outlined above there are many cases where it *should not* provide accurate rates. The seeding approach infers interfacial free energies,^{10} attachment frequencies, and ultimately nucleation rates from the rate at which seeded nuclei of different initial sizes grow or shrink. These ideas are also present in the classical nucleation theory (CNT), and yet seeding avoids several of the most problematic assumptions within CNT.^{11} By using a cluster size coordinate, the seeding approach avoids defining an explicit surface around the nucleus and the associated capillarity approximations.^{12,13} It also avoids commonly used assumptions about diffusion limited attachment by extracting the attachment frequency from short trajectory data. Seeding results have previously been combined with a spherical nucleus assumption to obtain an effective interfacial free energy, but in principle the seeding equations only require the interfacial free energy and the shape factor as a single lumped parameter.

The seeded simulation approach relies on accurate estimates of the critical nucleus size and the driving force Δμ. The nucleus size metric is integral to the data analyses and to the interpretation of results, e.g., different nucleus sizes ascribed to the same nucleus lead to different inferred interfacial free energies. The driving force is a potential source of error for two reasons. First, small nuclei may have very different compositions^{14} and/or structures,^{15–17} etc. from their bulk counterpart phases. The seeding approach cannot accurately predict rates for nucleation processes that involve such non-classical nuclei.^{18} Second, even when nuclei do have the structure and composition of the stable phase, nucleation rates are highly sensitive to the driving force.^{19} Precise estimates of driving forces for solute precipitation pose major challenges in simulations.^{20}

Despite these limitations, Sanz *et al.* demonstrated that seeding results are surprisingly accurate for crystal nucleation in supercooled liquids of hard-spheres, water, NaCl, and a Lennard-Jones fluid.^{6,7} The findings by Sanz *et al.* motivated Zimmermann *et al.* to examine the quantitative predictions of seeded simulation rates for solute precipitate nucleation. Zimmermann *et al.* computed NaCl nucleation rates from supersaturated brine.^{11} Salt nucleation from electrolyte solutions is important in geochemistry,^{21} and it presents unique fundamental questions about electroneutrality of nuclei, non-ideal activities, and desolvation barriers during ion attachment. Aqueous NaCl was specifically chosen because it is a simple 1-1 electrolyte, because there are extensively characterized force fields,^{20} and because experimental homogeneous nucleation rates have been reported.^{22–24} For simulations, homogeneous nucleation is an advantageous starting point for several reasons. Homogeneous nucleation is a well-defined rate process, whereas heterogeneous nucleation can occur at many types of sites with unknown characteristics and populations. Additionally, properties like attachment frequencies, interfacial free energies, and driving forces that govern homogeneous nucleation rates also influence heterogeneous nucleation rates.

To compare predicted and measured solute precipitate nucleation rates, Zimmerman *et al.* compared theoretical and experimental rates *at the same driving forces*, i.e., comparisons of *J*_{expt}(Δ*μ*) vs. *J*_{sim}(Δ*μ*), where (Δ*μ*)_{expt} = (Δ*μ*)_{sim} = Δ*μ*. These comparisons require precise knowledge of the driving force vs. composition for the experimental system *and for the model system*.^{8} The classical theory suggests that comparisons performed in this way will largely remove the effects of force field imperfections. Based on the classical theory, all residual differences in the *J*_{expt}(Δ*μ*) vs. *J*_{sim}(Δ*μ*) comparison stem from errors in the interfacial free energy, from errors in the attachment/detachment kinetics (prefactors), and from non-classical mechanisms. For rates computed with seeding, we show in this work that additional errors may also arise from the chosen nucleus size metric.

Zimmermann *et al.* chose NaCl concentrations in their seeded simulations to approximately match the driving forces in the available experiments. Unfortunately, the nucleation rates predicted by Zimmermann *et al.* were 15 to 30 orders of magnitude larger than the experimental rates.^{11} Because the computed homogeneous rates were *faster* than the experiments, the discrepancy could not be ascribed to experiments that inadvertently probed heterogeneous nucleation. Zimmermann *et al.* speculated that the Joung-Cheatham force field with SPC/E water (JC/SPC/E) was not able to describe interfacial properties of the nuclei and even questioned the assumptions used to extract rates from the experimental measurements.^{11}

Imperfect force fields and experimental interpretations are potentially important, but more importantly the electrolyte chemical potentials for the JC/SPC/E force field differ from those which Zimmermann *et al.* used to define the driving force. Joung and Cheatham estimated the solubility limit for their model as 7.2*m*,^{25} where *m* refers to molality in (mol NaCl/kg of water). Aragones *et al.* estimated the solubility as 4.8*m* by directly computing the chemical potentials of the solid and the electrolyte solution.^{26} They recommended the similar and independently obtained value of 5.1*m* based on long time scale coexistence simulations. Recent studies starting with Moucka *et al.*^{27–30} and followed by Mester and Panagiotopoulos,^{31,32} Kolafa,^{33} Benavides *et al.*,^{34} and Espinosa *et al.*^{35} have all arrived at estimates near 3.7*m*.

This work reviews the seeded simulation approach and revises our predicted NaCl nucleation rates to account for the revised driving force estimates. We illustrate an analysis of variance (ANOVA) procedure^{36} to quantify how errors in the driving force and in the seeded simulation data influence the predicted rates. We also examine the effects of different nucleus size metrics in the seeding analysis.

## II. BRIEF REVIEW OF SEEDING APPROACHES

The seeded simulation approach relies on some, but not all, aspects of CNT:

The seeding approach assumes that the reaction coordinate is nucleus size, an assumption that is justified by several recent analyses.

^{37–39}The seeding approach assumes that the free energy (in units of

*k*_{B}*T*) to form a nucleus is

where (*βϕγ*)_{eff} is an effective dimensionless parameter that lumps together the shape factor *ϕ*, the interfacial free energy *γ*, and *β =* 1*/*(*k*_{B}*T*). The lumped parameter is not required to match the value for nuclei with spherical shapes and/or macroscopic interfacial properties.^{40}

The commonly used CNT rate expression is

where *ρ* is the density of monomers, *Z* is the Zeldovich factor,^{41} where *D*_{+} is the attachment frequency, Γ is the Girshick-Chiu correction^{42} {i.e., Γ = exp[*F*(1)*/*(*k*_{B}*T*)]}, and the *B* parameter is^{13}

Equations (2) and (3) emerge from the Frenkel-Zeldovich equation^{13} or equivalently from an overdamped Langevin model in which the diffusivity along the (continuous) nucleus size coordinate is the attachment frequency.^{43} By analogy, the seeding approach assumes that short trajectory swarms are realizations of an overdamped Langevin equation. After integrating away the noise, one obtains an equation that describes how the average nucleus size in a swarm of trajectories should evolve from some initial size *n*_{0},^{8}

The quantity in brackets is proportional to *∂F*/*∂n* as predicted by CNT. The expression should only be used at short times. If nuclei evolve for too long, the forces that influence their growth/dissolution will change.^{44}

A seeding calculation begins with the creation of carefully solvated and annealed seeds ranging from very small pre-critical sizes to very large post-critical sizes. See the work of Zimmermann *et al.* for details on seed preparation.^{11} For each group of initial seed sizes, many independent configurations are initiated and allowed to run for a short time. The seeded trajectories are used to estimate *d*⟨*n*⟩/*dt* for each initial nucleus size. Given *β*Δ*μ*, the seeded simulation data enable an estimate of all remaining properties that are needed to compute the rate. For example, Zimmermann *et al.* obtained *D*_{+} and (*βϕγ*)_{eff} by fitting their data to Eq. (4) and then computed the nucleation rate according to Eq. (2).^{11} Zimmermann *et al.* used five seeds per initial size and a seeded trajectory duration of 4 ns.^{11} 4 ns is sufficiently short to avoid any appreciable depletion of the overall saturation level in the simulation box. The data presented by Zimmermann *et al.* suggest that 4 ns is also sufficiently long to identify an initial slope as *d*⟨*n*⟩*/dt*. Seeding enables a straightforward uncertainty analysis (see below), but the most appropriate seed duration time for extracting an attachment frequency is not clear.

Consider, for example, swarm evolution from a post-critical seed. Because of the annealing procedure, the initial concentration around the seed should begin at the bulk supersaturation. During the first moments of swarm evolution, i.e., immediately after release of constraints, the surface concentration should drop rapidly to that given by the Ostwald-Freundlich equation.^{45} As time progresses, the post-critical nucleus and the depleted boundary layer thickness around the nucleus should both grow. Estimates of the attachment frequency at both short and long times are complicated by the moving phase boundary and the changing depletion layer thickness.

Seeding approaches enable a prediction of trends, while methods like forward flux sampling^{46} and direct simulation estimates of mean first passage times^{47} provide the rate only at the simulated supersaturation. In particular, by predicting the barrier parameter *B*, seeded simulation results from one *β*Δ*μ*-value can be used to predict nucleation rates at other supersaturations.^{48} But before a seeding calculation can be completed, we also require an accurate model for Δ*μ*. Section III reviews the challenges and recent evolution of chemical potential driving force estimates for nucleation in a supersaturated NaCl solution.

## III. DRIVING FORCES

The seeded simulations of Zimmermann *et al.* examined NaCl concentrations from 8*m* to 12*m*, but directly computed electrolyte chemical potentials were only available up to 6*m*. An accurate model of the driving force in the metastable zone is critical for most versions of the seeded simulation approach (see the work of Lifanov *et al.* for the exception).^{18} As noted in the Introduction, the driving force and solubility involve difficult calculations, and a consensus for the JC/SPC/E model has only recently emerged. Moucka *et al.*^{27,29,30} obtained the first accurate solubility estimates at 298.15 K and 1 bar pressure using their osmotic ensemble Monte Carlo method, reporting a value of 3.64 ± 0.20 *m* for the JC/SPC/E force field. Mester and Panagiotopoulos subsequently used MD particle insertion and thermodynamic integration techniques to estimate the solubility as 3.59 ± 0.04 *m* or 3.71 ± 0.04 *m*^{31,32} (the second estimate incorporates an improved treatment of the infinite dilution limit). Benavides *et al.* then obtained a value of 3.71 ± 0.20 *m* using a “global” MD thermodynamic integration calculation.^{34,49} Finally, Kolafa^{33} and Espinosa *et al.*^{35} separately implemented a correct version of the direct MD coexistence method to estimate the solubility as 3.6 ± 0.4 *m* and 3.7 ± 0.4 *m*, respectively.

Most of the above solubility estimates required the electrolyte chemical potential *μ*_{NaCl}(*m*) at a discrete series of molalities up to a supersaturated value of 6*m*. The computed values of *μ*_{NaCl}(*m*) are used to parameterize analytic models for the chemical potential. According to convention, the reference chemical potential *μ*^{†}_{NaCl} is obtained by extrapolating Henry’s law to a hypothetically ideal 1.0*m* NaCl solution.^{50} Deviations from the ideal Henry’s law behavior are captured in the activity coefficient *γ*_{±}:

where *γ*_{±} is the overall NaCl electrolyte activity coefficient and *μ*^{†}_{NaCl} for the JC/SPC/E model is approximately −391.6 kJ/mol.^{31} Different investigators have used different models for the ln *γ*_{±} part of *μ*_{NaCl}(*m*). Several of the proposed models accurately fit the computed *μ*_{NaCl}(*m*) values, but in our study *μ*_{NaCl}(*m*) must be extrapolated beyond 6*m* for comparisons to experimental nucleation rates. The functional form of ln *γ*_{±} that is used to extrapolate *μ*_{NaCl}(*m*) into the metastable supersaturation zone influences the estimated driving force in this concentration range. A satisfactory model of ln *γ*_{±} for extrapolation deep into the metastable zone must account for the finite size of ions and the concentration dependence of the dielectric constant.^{51} We have used an extension of the Debye-Hückel limiting law^{52} similar to the expression of Hamer and Wu^{53} used to fit experimental data, but restricted to a linear term appended to the Debye-Hückel term,

We used the theoretical expression for *a* depending on the H_{2}O dielectric constant and density,^{52} yielding *a* = 0.568 and fitted *b* and *c* to the combined simulation data of Moucka *et al.*^{27} and of Mester and Panagiotopoulus^{31} along with new high-*m* data points from this work using the same methodology as in Ref. 34. The fit yields values *b* = 1.177 69 *m*^{−1/2} and *c* = 0.177 157 *m*^{−1}. Figure 1 shows the computed solid NaCl chemical potential as well as the computed and extrapolated electrolyte chemical potentials.

The revised calculations of *μ*_{NaCl}(*m*) leave almost no uncertainty in the solubility, but there remains considerable uncertainty in the driving force for molalities deep in the metastable zone. Specifically, we determined that the uncertainties increased with increasing molality: 0.111, 0.125, 0.168, 0.226, and 0.363 kJ/mol for *m* = 3.7, 8, 10, 12, and 16, respectively. These uncertainties are calculated (i) by constructing a new data set by perturbing each data point around its initial position according to a normal distribution with width 0.2 kJ/mol (the reported simulation uncertainty^{31}) and (ii) then refitting the new data set to Eq. (6); this procedure is repeated 1000 times to yield average chemical potentials and standard deviations at varying molalities.

The driving force basis should be consistent with the definition of nucleus size. In our case, nucleus size is counted in ions, not in formula units of NaCl. The driving force for nucleation *on a per-ion basis* is

i.e., the difference between the electrolyte chemical potential in solution and the solid NaCl chemical potential, *μ*_{solid} = *μ*_{NaCl}(*m*_{sat}), at the same temperature and pressure.

## IV. RATES AND UNCERTAINTIES FROM SEEDING

Equation (4) suggests that one can directly regress the seeded simulation data (*n*^{−1/3} and *d*⟨*n*⟩/*dt*) to infer *D*_{+} and (*βϕγ*)_{eff} from the slope and intercept of the best fit line. The resulting estimates of *D*_{+} and (*βϕγ*)_{eff} can be used to estimate the rate via Eqs. (2) and (3). *D*_{+} and (*βϕγ*)_{eff} are both random variables (estimates from linear regression), and their uncertainties propagate into the rate calculation. The slope and intercept estimates in linear regression [and therefore *D*_{+} and (*βϕγ*)_{eff}] are correlated to one another.^{36} The complicated dependence of the rate on *D*_{+} and (*βϕγ*)_{eff}, as well as their interdependence, complicates efforts to quantify uncertainty in the rate estimate.

To facilitate the rate calculation and error analysis, note that the free energy barrier within the exponential part of rate expression (2) can be written as^{4}

Equation (8) is an identity that emerges from classical nucleation theory.^{12,13} To exploit Eq. (8), we can estimate *n*_{‡} and *D*_{+} directly from the seeded simulation data, with *β*Δ*μ* being separately computed using methods listed in Sec. III. To estimate *n*_{‡}, rearrange Eq. (4) to give *n*^{−1/3} as a function of *dn/dt* rather than *dn/dt* as a function of *n*^{−1/3}. The rearrangement still allows identification of *D*_{+} and (*βϕγ*)_{eff} from the slope and intercept so that the prefactor and *B*-parameter can be determined. However, the regression line also provides an estimate of *n*_{‡} (where *dn*/*dt* = 0) and its uncertainty for the rate calculation [see Fig. 2(b)]. An alternative seeding procedure^{9} uses an algorithm somewhat like bisection to identify the size at which *dn*/*dt* = 0.

Figure 2(a) shows seeded simulation data from the work of Zimmermann *et al.*^{11} on rocksalt NaCl nucleation from an aqueous solution of sodium chloride at a molality of 8*m*. Figure 2(b) plots these data as *n*^{−1/3} vs. *d*⟨*n*⟩/*dt*. This procedure assumes an *n*-independent attachment frequency, but the attachment frequency should actually depend on nucleus size.^{1} Zimmermann *et al.*^{11} showed that ion-attachment to NaCl nuclei is desolvation limited, so *D*_{+} should be proportional to *n*^{2/3}. More elaborate regression procedures could account for *n*-dependence in *D*_{+}, but the effects are expected to be small. As the molality changes from m = 8, 10, to 12, the critical nucleus size changes from 31, 14, to 6 ions. The corresponding attachment frequencies 1.03/ns, 1.47/ns, and 1.47/ns, obtained from ⟨(δ*n*)^{2}⟩ vs *t*,^{54} are nearly constant.

The separate estimate of *β*Δ*μ* is needed to interpret the linear regression results, to predict the nucleation rate, *and* to place predictions and experimental measurements on the ln *J* vs. 1/(*β*Δ*μ*)^{2} plot as shown in Fig. 3. The original rate predictions of Zimmermann *et al.*^{11} using *μ*_{NaCl}(*m*) and *m*_{sat} from the work of Aragones *et al.* were between 10^{15} and 10^{30} orders of magnitude faster than the experimental rates. Figure 3 shows that most of the error is reconciled by calculations based on the revised chemical potential driving forces.

Unfortunately, the revised driving forces used in the simulations no longer match those used in the experiments. It would be useful to study nucleation at lower NaCl molalities in future work. According to CNT, the critical nucleus size *n*_{‡} scales as *n*_{‡} ∼ (*β*Δ*μ*)^{−3}. Based on directly computed critical sizes in 8*m*, 10*m*, and 12*m* solutions, the critical nuclei at 6*m* would have sizes near 200 ions (see the supplementary material). These exceed our largest simulations, but it should be possible to reach such sizes in future seeded simulations.

Two of the experimental rates shown in Fig. 3, from the studies of Na *et al.*^{24} and Gao *et al.*,^{23} were measured in droplets that were levitated and passed through an exposure chamber, respectively. The levitation experiments were designed to favor homogeneous nucleation by eliminating all potential heterogeneous nucleation sites except for an air-water interface. By contrast, Desarnaud *et al.* examined NaCl nucleation within a high surface area mesoporous silica.^{22} Because of the large disordered silica-water interface, heterogeneous nucleation may explain the results of Desarnaud *et al.* which are between 10 and 20 orders of magnitude faster than the trend suggested by Gao *et al.*,^{23} Na *et al.*,^{24} and the revised calculations.

The revised computational results suggest that seeded simulations combined with accurate driving force calculations can be a powerful combination for predicting accurate nucleation rates. However, the striking difference between the previous rates of Zimmermann *et al.*^{11} and the revised predictions of this study also illustrates the tremendous importance of a precise Δ*μ* calculation for the molecular model. Apparently, much of the original 10^{15}-10^{30} error resulted from errors in *m*_{sat} and *μ*_{NaCl}(*m*) that correspond to just fractional *k*_{B}*T* errors in the free energy calculations.^{20,34}

Why do such small errors lead to giant errors in the nucleation rate? Qualitatively, we can think of nucleation as a high order association reaction, and it is well known that reactions of high order are extremely sensitive to concentration. Quantitatively, we can estimate the uncertainty in the predicted rates by examining two key parts of the exponential term, *n*_{‡} and *β*Δ*μ.* Starting from the differential relationship

and then using typical error propagation rules to relate variances in estimates of *n*_{‡} and *n*_{‡}^{−1/3},

Inverting Eq. (10) gives $\sigma n\u20212$ = 9*n*_{‡}^{8/3}$\sigma n\u2021\u22121/32.$ The variance in *n*_{‡}^{−1/3} can be obtained from analysis of variance rules for the linear regression, as shown in Fig. 2(b).^{36}

To assess uncertainty in *β*Δ*μ*, we begin from the per-ion basis driving force

The uncertainty in the driving force in Eq. (7) is given by standard uncertainty propagation rules as

We use $\sigma \beta \mu NaCl(msat)\u2009=\u2009(0.2\u2009kJ/mol)/RT$ from the work of Nezbeda *et al.*^{20} and values of $\sigma \mu NaCl(m)$ as given in Sec. III. We assume that the uncertainty in ln *J* is dominated by the term *n*_{‡}*β*Δ*μ*/2 in the exponential. Accordingly, the uncertainty in ln[*J/ρZD*_{+}Γ] can be estimated from *σ*_{βΔμ} and *σ*_{n‡} as

Table I shows the estimates and uncertainties of the driving force, the critical nucleus size, and ln[*J/ρZD*_{+}Γ].

. | 8 m
. | 10 m
. | 12 m
. |
---|---|---|---|

βΔμ | 1.85 ± 0.05 | 2.59 ± 0.05 | 3.30 ± 0.06 |

$n\u2021$ | 27 ± 3 | 15 ± 3 | 9 ± 2 |

B | 87 ± 12 | 129 ± 25 | 158 ± 38 |

ln[J/ρZD_{+}Γ] | −25.3 ± 3.1 | −19.3 ± 3.5 | −14.5 ± 3.4 |

Due to $\sigma n\u2021$ | ± 3.0 | ± 3.5 | ± 3.4 |

Due to σ_{$\beta \Delta \mu $} | ± 0.7 | ± 0.4 | ± 0.3 |

. | 8 m
. | 10 m
. | 12 m
. |
---|---|---|---|

βΔμ | 1.85 ± 0.05 | 2.59 ± 0.05 | 3.30 ± 0.06 |

$n\u2021$ | 27 ± 3 | 15 ± 3 | 9 ± 2 |

B | 87 ± 12 | 129 ± 25 | 158 ± 38 |

ln[J/ρZD_{+}Γ] | −25.3 ± 3.1 | −19.3 ± 3.5 | −14.5 ± 3.4 |

Due to $\sigma n\u2021$ | ± 3.0 | ± 3.5 | ± 3.4 |

Due to σ_{$\beta \Delta \mu $} | ± 0.7 | ± 0.4 | ± 0.3 |

Based on the results in Table I, the error bars on the revised computational estimates of ln *J* are approximately four natural log units. In other words, the rate estimates (*within assumptions of the seeding approach*) are numerically accurate to within a factor of ca. 40. Because the error contributions from *β*Δ*μ* have probably been overestimated, the dominant source of numerical error in the revised calculations is most likely from the estimate of *n*_{‡}. Of course, tremendous computational effort has been invested in precise driving force estimates for the JC/SPC/E model of NaCl and water. For most other solutes and solvents, the dominant source of error will likely be uncertainty in the driving forces. Moreover, Sec. V shows that there are additional sources of uncertainty related to assumptions in the seeding approach.

## V. INFLUENCE OF THE NUCLEUS SIZE METRIC

For single component nucleation processes, the free energy (and equilibrium population) as a function of nucleus size can be obtained from traditional rare event sampling methods.^{56,57} Suppose that the free energy calculation is performed with two different nucleus size metrics, one that overestimates nucleus size and one that underestimates nucleus size. The two calculations will give two different free energy vs. nucleus size relationships with one peaked at a large size and one peaked at a small size. On the other hand, the two free energy *barriers* will be similar, as long as one metric consistently overestimates nucleus size while the other consistently underestimates nucleus size. The reason is that the barrier reflects the (small) population of critical clusters at equilibrium, regardless of the size ascribed to them in the clustering algorithm.

Typical nucleus size metrics use a local order parameter to identify atoms in a solute-rich and/or ordered local environment. Then a clustering algorithm groups these atoms into clusters and reports the cluster size. For the same nucleus, using different local order parameters can yield different cluster sizes. For example, consider a 3 × 3 × 3 nucleus of “solutes” on a cubic lattice and suppose that all surrounding sites are occupied by “solvents” as shown in Fig. 4. One could use coordination numbers (CNs) to determine whether individual solute atoms are part of the nucleus. Let the coordination number (CN) for a given solute be the number of solute nearest neighbors. If our local order parameter requires CN ≥ 3, then the clustering algorithm will identify a contiguous cluster of size *n* = 27 solutes. If the local order parameter requires CN ≥ 4, then corner sites on the nucleus are not counted and clustering will give *n* = 19. If a stringent threshold of CN ≥ 6 is used for the local order parameter, then only *n =* 1 solute (at the center of the 3 × 3 × 3 cube) will be counted.

This section shows that the seeding approach, in contrast to traditional rare events methods, gives results which are highly sensitive to the nucleus size metric. This sensitivity of the results to the nucleus size metric increases as the size of inserted seed decreases, given that the ratio interface/volume increases. In Fig. 5, the data from Zimmermann *et al.* have been reanalyzed to obtain rate estimates with several different nucleus size metrics. Zimmermann *et al.* used a stringent local order parameter that requires an octahedral arrangement of counterions around each ion (open blue circles). The rates obtained by Zimmermann *et al.* are similar to rates obtained from analysis with a stringent CN ≥ 6 requirement (blue diamonds). By systematically varying the CN-threshold, we can see how the computed nucleation rates (from the seeding procedure) change. Different CN-thresholds from CN ≥ 1 to CN ≥ 6 yield rate estimates that span 30 orders of magnitude. The rate estimate increases as the CN threshold becomes more stringent.

Figure 5 also shows results from the simulations of Lanaro and Patey. These authors directly observe nucleation in brute force simulations of a 15*m* NaCl solution. Lanaro and Patey^{58} used the same JC/SPC/E force field to study nucleation in direct large scale simulations at 15.6*m*, 22.2*m*, and beyond. They did not report nucleation rates, but at 15.6*m* and 22.2*m* they reported the duration of their simulations and the number of nuclei generated. We have used their data with the approximation

Equation (14) gives an approximate lower bound on *J* because multiple nucleation events occurred in the simulations by Lanaro and Patey. Growth of the earliest nuclei consumes the supersaturation and tends to suppress further nucleation. As shown in Fig. 3, Eq. (14) applied to the data of Lanaro and Patey yields *J* ≥ 6.5 × 10^{24}/cm^{3} s at 15.7m and *J* ≥ 1.2 × 10^{25}/cm^{3} s at 22.2m.

We can understand the effects of different nucleus size metrics from Eqs. (2) and (8). Given a value of the driving force, the estimated free energy barrier for nucleation is directly proportional to the estimated critical nucleus size,

Lax order parameters cause clustering algorithms to overestimate the nucleus size, to overestimate the free energy barrier, and therefore to underestimate the rate. Stringent order parameters cause clustering to underestimate the nucleus size, to underestimate the barrier, and to overestimate the rate. The dramatic effect of different nucleus size metrics is disconcerting because there is no straightforward way to choose the most suitable cluster-size definition.

Seeding analyses use a computational estimate of the critical nucleus size within the CNT rate expression; therefore, the ideal nucleus size metric should be consistent with the assumptions of CNT. Identifying an appropriate nucleus size metric for seeding analysis is non-trivial^{7} because a true critical cluster is discrete, granular, and fluxional, while the CNT model has sharp interfaces that separate continuous stable and metastable phases. As yet, there are no generally established principles to select an optimal CNT-consistent metric, but we offer the following list of considerations:

A stringent clustering threshold on the local order parameter ensures that a cluster of size

*n*contains at least*n*atoms with bulk crystal-like structure, consistent with the interpretation of the*n*Δ*μ*term in the CNT free energy expression. However, stringent local order parameter thresholds omit surface atoms from the nucleus. Because so many atoms in a small nucleus are part of the surface, seeding calculations must decide whether surface atoms are “on” or “in” the nucleus—a question that is avoided by the sharp interface assumption in CNT.A lax local order parameter threshold, e.g., CN ≥ 1 or CN ≥ 2, will include adsorbates that are clearly “on” (not “within”) the surfaces of nuclei. These adsorbates may be dynamically committed to the nucleus and still exterior to a CNT-consistent dividing surface. Extremely lax order parameter thresholds (CN ≥ 1) with highly concentrated solutes also risk crossing the percolation limit in clustering, which may lead to extremely large nuclei.

CNT can be formulated in numerous ways, but for droplet nucleation an equimolar dividing surface leads to a convenient formulation with no surface excess atoms.

^{13}An equimolar dividing surface is difficult to define for solid-fluid interfaces because the density profile is a series of sharp peaks rather than a smooth sigmoid curve.^{59}Nevertheless, the analogy to droplet nucleation theories suggests that the nucleus should include atoms whose CN is approximately halfway between the large CN for atoms in the bulk solid and the small CN for atoms in the metastable solution.

Theories of crystal growth at low supersaturation focus on the half-lattice positions, i.e., “kink sites,” on surfaces. Adding one atom to a kink site adds one atom to the bulk crystal lattice by translational symmetry arguments. For a cubic “Kossel” crystal, the kink sites have CNs that are half of the bulk coordination number.^{60,61} For a large crystal, these considerations suggest including kink sites and more strongly coordinated atoms in the nucleus size *n* and excluding atoms with lower CNs. In the case of a cubic lattice, the half-lattice CN prescription suggests that CN ≥ 3 is an optimal clustering criterion. Interestingly, the forward flux sampling results^{55} (which should be independent of the clustering criterion) agree most closely with the CN ≥ 3 and CN ≥ 4 results.

We emphasize that our work can provide only tentative recommendations, as many potentially important factors have not been considered. Complex crystal structures may present many types of kink sites and many different building units. Moreover, the facets of nuclei are not infinitely large, so any rules inspired by large facets may require amendments to account for corners and edges. Indeed, the use of an equimolar dividing surface for droplet nucleation leads to a size-dependent surface tension because of the Tolman correction.^{62,63} Beyond questions about dividing surfaces and nucleus size metrics, the procedure for solvating and annealing seeds at non-equilibrium sizes may influence their stability and dynamics. The many sources of uncertainty in our results, some not easily resolved, point to the need for more robust rare event methods that can circumvent the special difficulties in simulating solute precipitate nucleation.

## VI. CONCLUSIONS

This work examines the sources of error in seeded simulation results for NaCl nucleation from aqueous solution. Our results demonstrate that the seeding approach relies critically on a precise driving force for the model system. For accurate rate estimates, the driving force vs. solute concentration curve, i.e., Δμ(*m*), need not exactly reproduce that of the real system, but it should accurately describe the thermodynamic properties *of the model system*. We have demonstrated these calculations and presented uncertainty analyses for the rates of NaCl nucleation from a supersaturated NaCl electrolyte solution.

We have further demonstrated that the nucleus size metric also has a strong influence on the predicted nucleation rates. Because the seeding approach uses the critical nucleus size in a CNT-like rate expression, the ideal nucleus size metric should be consistent with the equimolar dividing surface definition used in classical nucleation theory. Size metrics that overestimate nucleus sizes lead to underestimated nucleation rates, and size metrics that underestimate nucleus sizes lead to overestimated rates. Further work is needed to determine the optimal local order parameters and clustering rules for computing nucleus size, but our results point to some tentative suggestions. A comparison between our results and independent forward flux sampling results^{55} suggests that the optimal nucleus size metric should include all atoms in the bulk, surface, edge, and perhaps corner sites. Solute atoms with lower coordination to other solute atoms should be viewed as adsorbates and excluded from the nucleus size estimate. The numerous and potentially gigantic sources of error in rates obtained by the seeding approach should motivate the development of more robust rare event methods to overcome the unique challenges encountered in studies of solute precipitate nucleation.

## SUPPLEMENTARY MATERIAL

See supplementary material for how rates were estimated from brute force simulations. It also shows how a single outlier data point in the reprocessed 8*m* trajectories leads to an extremely small rate estimate. Data associated with this manuscript is available for download at http://wrap.warwick.ac.uk/101981.

## ACKNOWLEDGMENTS

We thank Athanassios Panagiotopoulos for helpful discussions. W.R.S. thanks the Natural Sciences and Engineering Council of Canada for funding under STPGP No. 479466-15. D.Q. acknowledges support of an EPSRC Grant No. EP/H00341X/1. B.P. acknowledges support from a Camille Dreyfus Teacher Scholar Award and from Eli Lilly and Company. J.R.E., E.S., and C.V. thank No. FIS2016-78117-P of the MINECO for funding. N.E.R.Z. was supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division under Contract No. DEAC02-05-CH11231: Materials Project program KC23MP.