Electrolytes play an important role in a plethora of applications ranging from energy storage to biomaterials. Notwithstanding this, the structure of concentrated electrolytes remains enigmatic. Many theoretical approaches attempt to model the concentrated electrolyte by introducing the idea of ion pairs, with ions either being tightly “paired” with a counter-ion or “free” to screen charge. In this study, we reframe the problem into the language of computational statistics and test the null hypothesis that all ions share the same local environment. Applying the framework to molecular dynamics simulations, we find that this null hypothesis is not supported by data. Our statistical technique suggests the presence of two distinct local ionic environments at intermediate concentrations, whose differences surprisingly originate in like charge correlations rather than unlike charge attraction. Through considering the effect of these “aggregated” and “non-aggregated” states on bulk properties including effective ion concentration and dielectric constant, we identify a scaling relation between the effective screening length and theoretical Debye length, which applies across different dielectric constants and ion concentrations.

## I. INTRODUCTION

Understanding the correlations between charged objects in electrolytes is important to applications such as colloid science^{1,2} and energy storage in supercapacitors.^{3,4} The structure of dilute electrolytes is well-understood: the mean field Debye–Hückel theory, derived almost a century ago, accurately predicts a decrease in electrostatic screening length as ion concentration increases.^{5} However, the physical picture is less clear for concentrated electrolytes. Theoretical studies have systematically analyzed the structure of electrolytes beyond mean field theory.^{6–8} These pioneering works predict an increase in screening length after reaching a minimum when the screening length is of the order of the ion diameter in concentrated electrolytes.

Recently, a series of experimental studies using the Surface Force Balance (SFB) apparatus revealed that the screening length is longer than theoretically predicted, reaching up to 10 nm (an order of magnitude greater than the ion diameter) in solvent-free ionic liquids.^{9–12} The screening length follows a simple scaling behavior that is linear in ion concentration and the Bjerrum length and is universal across organic and inorganic electrolytes.^{13}

The physical mechanism that triggers this surprisingly long screening length is still elusive. Atomistic and coarse grained simulations have observed a screening length that seems to agree with the short-ranged “structural force” observed in experiments but not the long decay length.^{14} Many theoretical approaches attempt to explain the data by introducing the idea of ion pairs^{9,10} or effective dielectric constants.^{15} The argument is that most cations and anions are bundled up as neutral pairs in concentrated electrolytes, thus the electrolyte comprises a low concentration of “free” ions solvated in a liquid of effectively neutral “paired” ions. However, ion pairs are usually defined based on a cutoff distance selected to fit the data. It is unclear that ions closer than a certain distance are physically distinct from the rest and how this affects the dielectric constant.

In this paper, we revisit the physical picture of ion pairing using Bayesian inference and unsupervised machine learning. Rather than attempting to predict and explain the anomalously long screening length, our aim is to understand the concept of ion pairing. Our conjecture is that if there were ion pairs, one would expect the existence of two distinct local ionic environments, one corresponding to ions in a “paired” state and another corresponding to ions in a “free” state. The question then becomes whether the hypothesis that all ions share one environment is more probable than the hypothesis that there are multiple ionic environments. As such, we reframe the ion-pair hypothesis as Bayesian hypothesis testing,^{16} sidestepping any phenomenological distance-based cutoff.

We will first discuss the representation of local ionic environments using the local pair distribution function (PDF) and introduce the Bayesian Gaussian mixture model as a tool to determine the most probable number of statistically distinct local environments. We will then analyze molecular dynamics simulations of the Solvent Primitive Model (SPM) used as a surrogate model of the bulk electrolyte. We compare the hypotheses of these existing two environments as against more than two or just a single environment. At intermediate concentrations, we find that the evidence most strongly supports the existence of two distinct local environments. Close inspection of these distinct “aggregated” and “non-aggregated” environments reveals that surprisingly differences originate in the form of like charge correlations rather than unlike charge pairing. Finally, we discuss the effect of aggregation on bulk properties such as free ion concentration and effective dielectric constant; from this, we derive a mapping from the inferred fraction of particles in non-aggregated environments to an effective screening length. Notably, we observe a subsequent collapse of our data onto one single curve. We conclude by discussing the implications of our findings for future work.

## II. REPRESENTATION AND CLUSTERING OF LOCAL ENVIRONMENTS

### A. Description of local ionic environment

We describe the local environment of some central ion *a* of type *A* by its correlations with surrounding ions of type *B*,

where *V* is the system volume, *N*_{B} is the number of type *B* ions, and each neighboring ion *b* is represented by a normal distribution $N$, with mean and standard deviation equal to the inter-ion distance |**r**_{ab}| and the ion radius *σ*. Our descriptor, $g\u0303aB(r)$, is related to the pair distribution function (PDF)

where ⟨·⟩ denotes the ensemble average,^{17} except we smooth the delta function to a Gaussian and remove the ensemble average as we are considering the local environment at a particular snapshot around a particular ion.

The representation is vectorized by discretizing *r* into *D* bins; the resultant descriptor for each central ion **x** is formed from the concatenation of representations of both like and unlike charge correlations and is thus of dimensionality *d* = 2*D*. From *F* frames of molecular dynamics simulations, each with some number of type *A* ions *N*_{A} in the simulation box, we collect a total of *N* = *N*_{A}*F* local ionic environments, ${xi}i=1N$. The simulation methodology and numerical details are provided in Appendix A.

### B. Pattern identification with Bayesian unsupervised learning

Having collected the local ionic environments to form the dataset $X={xn}n=1N$, we next seek to infer the number of statistically significant groups into which the *N* data points fall. If statistically speaking, ions do not all inhabit the same local environment, we can “group” the inhabited environments such that differences within each group are much smaller than differences between groups.

This grouping problem can be solved using Bayesian inference.^{18,19} We first posit that each data point **x**_{n} is generated by the sum of *K* independent probability distributions, each describing a local environment. We then infer both the most probable number of distributions required to adequately describe the data, *K*^{*}, and the most probable parameters of those distributions. *K*^{*} is physically interpreted as the number of statistically distinct environments in the system, a key parameter that could confirm or reject the ion-pair picture.

For ease of interpretation, we consider a weighted sum of *K* Gaussian distributions,

where $\Theta K={\pi k,\mu k,\Lambda k}k=1K$ are the parameters of the *K*-environment model, *π*_{k} defines the fraction of ions inhabiting the *k*th local environment, and *μ*_{k} and **Λ**_{k} define the mean of the *k*th local environment (physically, the local ionic environment of the typical ion in that group) and the precision of environments within that group, respectively. The parameters **Θ**_{K} are considered random variables; an approximation will be found to the posterior distribution *p*(**Θ**_{K}|**X**, *K*), which is the probability of the parameters, given the observed data. The form of this Bayesian Gaussian mixture model is given in Appendix B.

The optimal number of distributions, *K*^{*}, is that which maximizes the probability *p*(**X**|*K*) of the observed data, given a *K*-environment model. *p*(**X**|*K*) is the *marginal likelihood*; it quantifies how well a model can explain the observed data and is of fundamental importance in Bayesian model selection.^{20}

The Bayesian paradigm is that the quality of a *K*-environment model is not measured by how well the “best” parameters within a model class fit the data, but whether the model, averaged over all possible model parameters, can effectively fit the data. This is seen in the explicit marginalization over all model parameters, **Θ**_{K},

Crucially, the marginal likelihood incorporates Occam’s razor effect,^{21} implicitly penalizing overly complex models. It will be maximized by the simplest model that can adequately explain the data.

We shall use the Bayes factor,

whose calculation is central in Bayesian hypothesis testing,^{16,20} to obtain a quantitative measure of the evidence in favor of a model with *K*′ environments as against another with *K* environments. Rather than computing the integral (4), which is numerically intractable, in this paper, we employ variational inference^{22–24} to compute an approximation to the marginal likelihood and the corresponding Bayes factor, inspired by the statistical physics of mean field theory.^{18} A summary of this approach is provided in Appendix C. The code used can be found at https://github.com/PenelopeJones/electrolytes.

## III. RESULTS AND DISCUSSION

### A. Validation on toy systems

Before turning to analyze electrolytic solutions, we first illustrate how our methodology can be applied to systems where the answer is well known. Through this set of toy problems, we validate the choice of hyperparameters in the Bayesian prior which we will then apply to understand the structure of electrolytes. Technical details are provided in Appendix B.

We consider two simple systems. The first is a system of 100% neutral hard spheres and the second comprises unbonded hard spheres and bonded hard spheres in a 1:1 ratio; both systems have the same total concentration of hard spheres, although in the latter case half of the hard spheres are connected by a harmonic spring. The question is that if we assume we do not know which particles are bonded and which particles are free, can the algorithm decipher that there are two distinct environments and correctly deduce the proportion of particles in each environment? Figure 1 shows that in agreement with intuition, the most probable number of statistically distinct local environments, *K*^{*}, is inferred to be one for the system comprising only neutral hard spheres but two for the system comprising neutral hard spheres and dumbbells in a 1:1 ratio.

We then study the differences in these statistically distinct environments. Rather than studying the posterior over the model parameters themselves, it is more intuitive to look at the widely studied unsmoothed PDF, following (2), averaged over all particles classified as inhabiting each environment. These are shown for the two systems in Fig. 1. For the first system, the recovered mean (*g*_{1}) corresponds exactly to the “true” mean environment for the system. For the second system, the recovered environments are seen to correspond to the “bonded” (*g*_{1}) and “unbonded” (*g*_{2}) environments, with the correct proportion of ions classified as each. It should be noted that the recovered mean PDFs do not exactly correspond to those of the dumbbells and hard spheres, but the majority of dumbbells are classified as “bonded” and the majority of hard spheres are classified as “unbonded.”

### B. Monovalent electrolytic solutions

Having corroborated the ability of the model to infer the correct number of environments in simple physical systems, we now apply the same technique to probe the structure of bulk monovalent electrolytes, modeled using the Solvent Primitive Model (SPM) as outlined in Appendix A. We consider the relative plausibility of models with one, two, three, or four distinct environments.

Figure 2 summarizes our key findings: At intermediate concentrations, a clear maximum in the marginal likelihood is observed for a two-environment model, indicating that the null hypothesis that all ions share the same environment should be rejected in favor of the hypothesis that ions inhabit one of two statistically distinct states. At higher concentrations, approaching saturation or neat ionic liquids, these two statistically distinct environments become indistinguishable, and the evidence supports a single environment model. We note that experiments also suggest a change in the form of correlations at high concentrations, as evidenced by the deviation of the screening length at high concentrations from the cubic scaling relationship observed for intermediate concentrations.^{11,25} In all of the aforementioned cases, the marginal likelihood of the three- and four-environment models is significantly less than that of either the one- or two-environment models; thus, we only present analysis of these models.

We next study the form of the two distinct environments inferred at intermediate concentrations. Surprisingly, across all systems, the preeminent difference between environments arises in the like charge correlations, as shown in Fig. 2(b). This contrasts with the differences in unlike charge correlations implied by the ion-pair hypothesis. The two environments are physically distinguished as being “aggregated” and “non-aggregated.” In the first case, several charges are bundled together in close proximity; in the second case, there will be *at most* another unlike charge in the local vicinity.

The concepts of ion pairing^{26–29} and clustering^{30,31} have been employed in the literature to explain the physics of concentrated electrolytes. However, the definition of an ion-pair (e.g., a distance threshold between oppositely charged ions) is mostly empirical.^{32} Our work provides a complementary way to define ion pairs, by inferring the number of statistically significant ionic environments. This approach leads to several insights: First, it suggests that the statistically distinct ionic environments arise from like charge correlation rather than the Bjerrum pairing picture of unlike charge correlations. Second, a corollary of our approach is that the existence, or lack thereof, of multiple types of ion clusters can be inferred naturally. Our results suggest that concentrated electrolytes are best described by a two-environment model of “aggregated” and “non-aggregated” ions.

Interestingly, for all concentrations *c* and dielectric constants *ϵ* for which a two-environment model is found to be most probable, the same qualitative differences in environment are identified. However, the proportion of ions that are classified as aggregated increases with both *c* and *ϵ*, as shown in Fig. 3. This relationship is rationalized by the reduced electrostatic repulsion between like charges at larger *ϵ* and by the reduced ability to maintain large distances between like charges at larger *c*.

We next consider the effect of aggregation on the effective screening length. In the classical ion-pair model, ion pairs are considered neutral species, whose existence reduces the free ion concentration in the solution. This leads to an increase in the screening length according to Debye–Hückel theory. If we were to consider aggregated ions playing the role of ion pairs, in that, on average, they are a neutral species, one might posit that the effective screening length takes the form, analogous to the Debye length,

where *ϵ*_{eff} is the effective dielectric constant of the mixture of non-aggregated ions and ion aggregates and *c*_{eff} is the concentration of non-aggregated ions. We model the solution as a dielectric continuum in which ion aggregates, with dielectric constant *ϵ*_{A} and molecular fraction *ϕ*, exist in a dielectric background with dielectric constant *ϵ*. Note that for each system, *ϕ* is determined directly from our Bayesian Gaussian mixture model (cf. Fig. 3). The effective concentration of non-aggregated ions is then

We compute *ϵ*_{eff} from the Bruggeman equation for dielectric mixing,^{33}

where *b* = (3*ϕ* − 1)*ϵ*_{A} + (2 − 3*ϕ*)*ϵ*. Following the scaling analysis of the experimental data,^{13} we analyze the data by plotting *λ*_{S}/*λ*_{D} against 2*σ*/*λ*_{D}, where *λ*_{D} is the Debye length computed using *ϵ* and *c* and *σ* is the ion radius. The only unknown constant is *ϵ*_{A}, the dielectric constant of the ion aggregates. We fit *ϵ*_{A} to maximize the extent to which the data collapse. Note that this is a single parameter fit, collapsing data from simulations at four different dielectric constants, with multiple ion concentrations at each dielectric constant. Figure 4 shows that the data collapse satisfactorily for *ϵ*_{A} ∼ 21.1, in qualitative agreement with experimental data.^{13} This is perhaps unsurprising as it is in the range expected for room temperature ionic liquids,^{34} which are pure ionic melts. However, the scaling exponent is 0.376 ± 0.011, which is different from the cubic scaling observed in experiments.^{13} This discrepancy between scalings observed in theory, simulations, and experiments has been widely documented. Even across the theory and simulations literature, different scalings have been identified, including 1,^{14,35} 3/2,^{36} and 2,^{35} with no simulations work recovering the experimental cubic scaling. We further note the emphasis of our work on short-ranged interactions; the scaling trend observed by experiments investigating short-ranged structural forces is much less clear.^{25} Thus, it is perhaps unsurprising that we do not recover the aforementioned exponents.

## IV. CONCLUSION

In summary, we used Bayesian unsupervised learning to infer the number of statistically distinct local ionic environments in bulk monovalent electrolytes. Surprisingly, our results indicate the presence of two statistically distinct “aggregated” and “non-aggregated” environments at intermediate concentrations, whose differences originate in like charge correlations (as opposed to unlike charge correlations, which would be implied by the ion-pair hypothesis). By considering the effect of aggregation on dielectric constant and free ion concentration, we compute an effective screening length *λ*_{S} and observe a universal scaling relationship between *λ*_{S} and the theoretical Debye length *λ*_{D} that holds across all dielectric constants over a range of concentrations.

The presence of a scaling law suggests that there might be underlying general physical insights, which could be inferred from concepts revealed by the unsupervised learning approach, such as aggregated ions and effective ion concentrations. Nonetheless, we note that, similar to other work founded on simulations and theory, the exponent of the scaling relationship obtained from our analysis is different from what is observed experimentally.

Perhaps most importantly, our results suggest that meaningful physical insights can be elucidated in soft matter systems by studying the statistical *differences* between local environments in the same macroscopic system, as opposed to more widely used statistical *averages*. We suggest this could be a novel way to recharacterize previously hidden order in soft matter and ionic systems.

## ACKNOWLEDGMENTS

P.J. and A.A.L. acknowledge the support of the Winton Programme for the Physics of Sustainability. P.J. acknowledges the support of the Ernest Oppenheimer Fund. A.A.L. acknowledges support from the Royal Society. A.H. and F.C. acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—Project Nos. 406121234 and 404913146, respectively. The software package Ovito was used for the visualization of environments in real space.^{37} This work was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service, provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (Capital Grant No. EP/P020259/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk).

## DATA AVAILABILITY

The data that support the findings of this study are openly available in zenodo at https://zenodo.org/deposit/4015136.

### APPENDIX A: DATA PRODUCTION AND FEATURE EXTRACTION

Using the ESPResSO package,^{38} molecular dynamics simulations were carried out for both the “toy” systems and for the Solvent Primitive Model (SPM).^{39} For both toy systems, molecules were modeled as neutral hard spheres of radius *σ* = 0.15 nm. Hard particle interactions were modeled by a Weeks–Chandler–Andersen (WCA) potential^{40,41} and a purely repulsive truncated and shifted Lennard-Jones potential with prefactor 10^{4} *k*_{B}*T* and cutoff distance at 2*σ*, where *k*_{B} is the Boltzmann constant and *T* is the temperature. In the second toy system, “dumbbells” were formed by connecting 50% of the spheres to another using additional harmonic bonds of equilibrium separation 0.4 nm and prefactor 15 *k*_{B}*T*/nm^{2}. Smoothed PDFs were calculated for distances [0.15 nm, 1.2 nm] using a bin size of 0.15 nm and normalized with linear scaling, such that the resulting data had zero mean and a range of [−1, 1].

To model the electrolytic system, we use the SPM, which is one of the simplest molecular solvent models,^{36,45} and it has been used extensively to study interactions within electrolytic solutions at surfaces^{42–44} where packing effects matter. More recently, it has been applied to investigate the underscreening effect in the bulk of ionic liquids and concentrated electrolytes^{36,45} and to explain an experimentally observed switch in the decay behavior of correlations.^{45}

In the SPM, cations, anions, and solvent molecules were modeled as hard spheres of radius *σ* = 0.2 nm with electrical charges +*e*, −*e*, and 0, respectively. Simulations were performed at a constant temperature of *T* = 300 K and a range of concentrations *c* and relative permittivities *ϵ*, i.e., (*c*, *ϵ*) ∈ [0.5M, 3.0M] × {20, 40, 60, 80}. Hard particle interactions were again modeled by a WCA potential, and electrostatic interactions were treated by the P3M method.^{38,46} The smoothed PDFs *g*_{+−} and *g*_{−−} were calculated for distances [0.2 nm, 1.0 nm] using a bin size of 0.2 nm and normalized as described above. Only the anionic environments are studied due to charge reversal symmetry.

### APPENDIX B: THE BAYESIAN GAUSSIAN MIXTURE MODEL

In the Gaussian mixture model, it is assumed that the observed dataset **X** = {**x**_{1}, …, **x**_{N}} is drawn from an underlying distribution comprised of a linear superposition of *K* independent Gaussian distributions, each of which has an associated mean *μ*_{k}, precision **Λ**_{k} (note that the precision is the inverse of the covariance **Σ**_{k}), and mixing coefficient *π*_{k}, where *π*_{k} represents the probability of an observed data point being drawn from the *k*th distribution. Each observation **x**_{n} has an associated latent variable **z**_{n} of dimensionality *K*, where *z*_{nk} is the label describing the mapping of data point **x**_{n} to cluster *k*. Thus, the latent variables can be denoted by **Z** = {**z**_{1}, …, **z**_{N}}. The distribution over **Z**, given $\pi ={\pi k}k=1K$, is

and the distribution over **X**, given the latent variables and model parameters, is

In the Bayesian Gaussian mixture model, it is further assumed that the parameters $\Theta K={\pi k,\mu k,\Lambda k}k=1K$ are themselves random, with distributions parameterized by hyperparameters to be determined. The aim is to find the posterior distribution *p*(**Θ**_{K}|**X**, *K*) of these model parameters, given **X**.

To compute the posterior, it is necessary to specify a prior distribution over **Θ**_{K}. We use the prior

Using this model, the only manual selection is that of the hyperparameters {*α*_{0}, *m*_{0}, *β*_{0}, **W**_{0}, *ν*_{0}}. We select uninformative hyperparameters (those that lead to a broad distribution over model parameters): *α*_{0} = 1.0, *β*_{0} = 1.0 × 10^{−11}, $W0=I$, *m*_{0} = **0**, and *ν*_{0} = *d*, where *d* is the dimensionality of **x**. We check for coherence of the resulting prior distribution via validation on well-understood physical systems as described in the main text.

### APPENDIX C: VARIATIONAL INFERENCE

Calculation of the marginal likelihood requires integration over all parameters of the model, which is intractable for most models of interest. This problem can be circumvented using variational inference,^{24} whereby the variational distribution *q*_{ϕ}(**Θ**_{K}) is introduced and used to obtain a lower bound to the marginal likelihood, termed the evidence lower bound (ELBO),

Maximization of the ELBO with respect to variational parameters *ϕ* jointly obtains an approximation to the marginal likelihood and the posterior of the model parameters *p*(**Θ**_{k}|**X**, *M*_{k}). Indeed, when *q*_{ϕ}(**Θ**_{k}) = *p*(**Θ**_{k}|**X**, *M*_{k}), the lower bound is exact.