Machine learning is an important tool in the study of phase behavior from molecular simulations. In this work, we use unsupervised machine learning methods to study the phase behavior of two off-lattice models, a binary Lennard-Jones (LJ) mixture and the Widom–Rowlinson (WR) non-additive hard-sphere mixture. The majority of previous work has focused on lattice models, such as the 2D Ising model, where the values of the spins are used as the feature vector input into the machine learning algorithm, with considerable success. For these two off-lattice models, we find that the choice of the feature vector is crucial to the ability of the algorithm to predict a phase transition, and this choice depends on the particular model system being studied. We consider two feature vectors: one whose elements are the distances of the particles of a given species from a probe (distance-based feature) and one whose elements are +1 if there is an excess of particles of the same species within a cut-off distance and −1 otherwise (affinity-based feature). We use principal component analysis and t-distributed stochastic neighbor embedding to investigate the phase behavior at a critical composition. We find that the choice of the feature vector is the key to the success of the unsupervised machine learning algorithm in predicting the phase behavior; the sophistication of the machine learning algorithm is of secondary importance. For the LJ mixture, both feature vectors accurately predict the critical point. For the WR mixture, the affinity-based feature vector provides accurate estimates of the critical point, but the distance-based feature vector does not provide a clear signature of the phase transition. This study suggests that physical insight into the choice of input features is an important aspect of implementing machine learning methods.

## I. INTRODUCTION

Machine learning (ML) has become an important tool in the study of phase transitions in molecular simulations.^{1} Supervised and unsupervised methods have been successfully applied to a variety of systems.^{2–12} While the majority of studies have focused on lattice models such as the 2D Ising model,^{8,13–18} there have also been supervised machine learning studies of complex fluids, such as polymer blends^{19} and polymers in ionic liquids.^{7} In this work, we use unsupervised machine learning methods to study the phase behavior of two continuous-space binary mixtures.

Machine learning methods can be broadly classified into two categories: supervised and unsupervised. In supervised methods, such as neural networks and support vector machines, the algorithm is trained on known data and used to predict the behavior of an unknown evaluation set. In unsupervised methods, the dimensionality of the input data is reduced, and correlations (or clustering) in the reduced representation provide insight into the behavior. The objective is to discover patterns in the data without any prior training.^{2,13} The advantage of unsupervised methods for phase behavior is that prior knowledge of the existence of a phase transition is not required. Furthermore, they can be useful for complex fluids where molecular simulation of the phase behavior using conventional methods (required to generate a training set for supervised methods) is computationally intensive. In this work, we use unsupervised machine learning with two dimensionality reduction techniques, namely, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), to study the phase behavior of off-lattice models.

There have been some studies of continuous-space systems using unsupervised ML methods. Jadrich *et al.* used PCA to study the phase behavior of 2D model (continuous-space) particle systems, such as the RandOrg model, hard ellipses, and the non-additive hard-sphere Widom–Rowlinson (WR) mixture, as well as 3D hard-sphere and patchy colloid mixtures.^{3,4} They emphasized the importance of “feature” construction in the application of PCA to particle-based systems. The feature vector is the data that is input to the ML method. In spin models, such as the 2D Ising model, the raw configuration, i.e., the values of all the spins, is usually an excellent choice as the input feature vector. For particle-based systems, the positions of the particles prove to be uninformative. Inter-particle distances are more useful, since these are invariant to rotation of the simulation cell.^{3,4,20}

In this work, we study the phase behavior of two 3D off-lattice systems exhibiting liquid–liquid phase separation, a binary Lennard-Jones (LJ) mixture and the Widom–Rowlinson (WR) mixture, using the PCA and t-SNE techniques. We find that the construction of the feature vector is crucial to the ability of the ML method to predict the phase behavior. A distance-based feature provides good estimates for the critical point of the LJ mixture, but not for the WR mixture. We propose an affinity-based feature, where the elements are +1 if a particle has predominantly neighbors of the same species and −1 otherwise. We find that this feature provides an excellent description of the phase behavior for both models.

## II. MODELS AND SIMULATION DETAILS

The two models we investigate are an LJ mixture and the WR mixture.^{21} In the symmetric LJ mixture, the particles interact via a truncated and force-shifted 12-6 LJ potential, i.e., the potential of interaction between a particle of species *A* and a particle of species *B*, Φ_{AB}, is given by

$$\Phi_{AB}(r) = \begin{cases} \phi_{AB}(r) - \phi_{AB}(r_{c}) - (r - r_{c})\,\phi_{AB}^{\prime}(r_{c}), & r \leq r_{c} \\ 0, & r > r_{c}, \end{cases}$$

where

$$\phi_{AB}(r) = 4\epsilon_{AB}\left[\left(\frac{\sigma_{AB}}{r}\right)^{12} - \left(\frac{\sigma_{AB}}{r}\right)^{6}\right],$$

*σ*_{AA} = *σ*_{BB} = *σ*_{AB} = *σ*, *r*_{c} = 2.5*σ*, and *ϵ*_{AA} = *ϵ*_{BB} = 2*ϵ*_{AB} = *ϵ*. Since the attraction between like species (A–A and B–B) is stronger than that between unlike (A–B) species, this system phase separates upon cooling. We define a reduced temperature *T** = *k*_{B}*T*/*ϵ*, where *k*_{B} is Boltzmann’s constant and *T* is the temperature, and a reduced density *ρ* = *Nσ*^{3}/*V*, where *N* = *N*_{A} + *N*_{B} is the total number of particles, *N*_{A} and *N*_{B} denote the numbers of A-type and B-type particles, respectively, and *V* is the volume of the simulation box.

In the WR mixture, there is no interaction between particles of the same species, and particles of unlike species (A–B) interact via a hard-sphere interaction with diameter *σ*_{HS}. This system phase separates when the density is increased. We define a reduced density $\rho = N\sigma_{HS}^{3}/V$. All simulations are performed for a mole fraction *x* = 0.5, which, by symmetry, is the critical composition of both mixtures.

Configurations of the WR mixture are obtained from Monte Carlo simulations in the NVT (fixed number of particles, volume, and temperature) ensemble. Initial configurations are obtained by placing *N* particles in a cubic cell with linear dimension *L*. The system is evolved via single-particle moves, where a randomly chosen particle is displaced in a random direction by a distance drawn uniformly from (0, *σ*_{HS}). The system is equilibrated for 10^{6} steps per particle. The production run consists of 10^{7} attempted moves per particle, from which 1000 configurations are extracted. A total of 51 number densities are explored, ranging from $\rho\sigma_{HS}^{3}$ = 0.5 to 1. We study six system sizes with *N* = 1024, 2048, 3072, 4096, 8192, and 16 384.
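The single-particle move described above can be sketched as follows. This is a minimal illustration rather than the production code; the function and variable names are ours, and a cubic box with periodic boundaries is assumed. A trial move is rejected only if it creates an overlap with a particle of the *other* species, since like particles do not interact in the WR model.

```python
import numpy as np

def wr_mc_sweep(pos, species, L, sigma_hs, rng):
    """One Monte Carlo sweep (N attempted single-particle moves) for the
    Widom-Rowlinson mixture: a move is accepted unless it places the
    particle within sigma_hs of any particle of the other species."""
    N = len(pos)
    for _ in range(N):
        i = rng.integers(N)
        # random direction; displacement magnitude uniform on (0, sigma_hs)
        d = rng.normal(size=3)
        d *= rng.uniform(0.0, sigma_hs) / np.linalg.norm(d)
        trial = (pos[i] + d) % L
        # minimum-image distances to all unlike particles
        dr = pos[species != species[i]] - trial
        dr -= L * np.round(dr / L)
        if np.all(np.einsum('ij,ij->i', dr, dr) >= sigma_hs**2):
            pos[i] = trial  # no unlike overlap: accept
    return pos
```

By construction, the hard-sphere constraint is an invariant: if the starting configuration has no unlike overlaps, every configuration generated by the sweep is also overlap-free.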

Configurations of the binary LJ mixture are obtained via molecular dynamics (MD) simulations in the canonical (NVT) ensemble using the LAMMPS package.^{22} Initial configurations are obtained by placing *N* particles in a cubic cell with linear dimension *L* = *V*^{1/3}, such that *ρ* = 1. The system is propagated using a velocity Verlet integrator with a time step of Δ*τ** = 0.004, where $\tau^{*} = t\sqrt{\epsilon/m\sigma^{2}}$ is the reduced time and *m* is the mass of a particle. A Nosé–Hoover thermostat is used to maintain the desired simulation temperature.^{23,24} The initial configuration is equilibrated for 2 × 10^{6} time steps (8000 *τ**), and properties are averaged over 10^{7} time steps (40 000 *τ**), from which 1000 configurations are extracted. Simulations are performed for temperatures ranging from 1.0 ≤ *T** ≤ 1.8 for the distance feature matrix calculations and from 0.9 ≤ *T** ≤ 2.375 for the affinity matrix calculations. For the affinity matrix calculations, we study varying system sizes with *L*/*σ* = 10, 12, 14, 16, 18, 20, 24, and 30.

## III. FEATURE CONSTRUCTION

The feature vector is a one-dimensional vector that comprises the data from each configuration that is input into the ML algorithm. In spin systems, this is usually a list of the values of all the spins in a configuration. In continuous-space systems, the choice is not obvious. We test two different choices for the feature vector in this work. The first is based on the work of Jadrich *et al.*,^{4} which we refer to as the distance-based feature. The second is a choice that we propose, inspired by spin systems, and we refer to this as the affinity-based feature.

The distance-based feature vector is constructed as follows [see Fig. 1(a)]: For each configuration, we choose a probe particle of species A and calculate the distances from this particle to its *N*_{c} nearest neighbors of the same species. If *v*_{i} is the distance from the probe to the *i*th nearest neighbor, the average distance is

$$\bar{v} = \frac{1}{N_{c}} \sum_{i=1}^{N_{c}} v_{i}.$$

We define a new distance $x_{i} = v_{i} - \bar{v}$ and re-order the elements such that $x_{1} \leq x_{2} \leq \cdots \leq x_{N_{c}}$. The feature vector for a configuration is then

$$\mathbf{f} = (x_{1}, x_{2}, \ldots, x_{N_{c}}).$$

For the WR mixtures, the particles’ co-ordinates are first normalized by the box length.

The feature matrix **X** is constructed by stacking all the feature vectors,

$$\mathbf{X} = \begin{pmatrix} \mathbf{f}_{1} \\ \mathbf{f}_{2} \\ \vdots \\ \mathbf{f}_{N_{T}} \end{pmatrix},$$

where *N*_{T} is the number of configurations. Since there are *N*_{c} neighbors, **X** is a matrix of dimension *N*_{T} × *N*_{c}. We average the results over ten probes in the case of the LJ mixture and *N*/2 probes in the case of the WR mixture; that is, we perform PCA on the feature matrix constructed from each probe and average the resulting mean and standard deviation profiles to obtain the final results.
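The construction above can be made concrete with a short sketch. This is an illustration under the definitions just given, not the production code: the function names are ours, and periodic boundary conditions are omitted for brevity.

```python
import numpy as np

def distance_feature(config, probe_index, n_c):
    """Distance-based feature vector for one configuration: sorted,
    mean-subtracted distances from a probe particle to its n_c nearest
    same-species neighbors. `config` holds the positions of the probe's
    species only, shape (n_particles, 3)."""
    d = np.linalg.norm(config - config[probe_index], axis=1)
    v = np.sort(d[d > 0])[:n_c]      # n_c nearest-neighbor distances
    x = np.sort(v - v.mean())        # subtract the mean, re-order ascending
    return x

def feature_matrix(configs, probe_index, n_c):
    # stack one feature vector per configuration: shape (N_T, n_c)
    return np.array([distance_feature(c, probe_index, n_c) for c in configs])
```

By construction, each row of the feature matrix is in ascending order and sums to zero, since the mean of the selected distances has been subtracted.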

The affinity-based feature vector is constructed as follows [see Fig. 1(b)]: For each particle, we calculate the number of particles of each species within a cut-off distance *r*_{cut}, which we set equal to half the box length. If the number of particles of the same species around particle *i* is greater than the number of particles of the other species, then the quantity *g*_{i} = 1, otherwise *g*_{i} = −1.

For each configuration, the feature vector is given by

$$\mathbf{g} = (g_{1}, g_{2}, \ldots, g_{N}),$$

where *N* is the total number of particles. The feature matrix is an *N*_{T} × *N* matrix,

$$\mathbf{X} = \begin{pmatrix} \mathbf{g}_{1} \\ \mathbf{g}_{2} \\ \vdots \\ \mathbf{g}_{N_{T}} \end{pmatrix}.$$

We optimize the value of *r*_{cut} in the affinity-based feature vector by selecting the value that best distinguishes between different structures. Figure 2 depicts the probability distribution function, $P(\bar{g})$, of the average feature value, $\bar{g} = \frac{1}{N}\sum_{i=1}^{N} g_{i}$, for two values of *r*_{cut}. There is a greater distinction between the different values of $\bar{g}$ for *r*_{cut} = 0.5*L*. We choose the value of *r*_{cut} for which the variance $\sigma_{\bar{g}}^{2}$ is the largest; for both the WR and LJ mixtures, this occurs for *r*_{cut} = 0.5*L*.
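The affinity-based feature described above can be sketched as follows, assuming a cubic periodic box with the minimum image convention; the function name is ours and this is an illustration, not the production code.

```python
import numpy as np

def affinity_feature(pos, species, L, r_cut):
    """Affinity-based feature vector: g_i = +1 if particle i has more
    same-species than other-species neighbors within r_cut (minimum image
    convention), else -1."""
    N = len(pos)
    g = np.empty(N)
    for i in range(N):
        dr = pos - pos[i]
        dr -= L * np.round(dr / L)          # periodic boundaries
        r2 = np.einsum('ij,ij->i', dr, dr)
        near = r2 < r_cut**2
        near[i] = False                      # exclude the particle itself
        same = np.count_nonzero(near & (species == species[i]))
        other = np.count_nonzero(near & (species != species[i]))
        g[i] = 1.0 if same > other else -1.0
    return g
```

For a fully demixed configuration, every particle sees predominantly its own species within *r*_{cut}, so all elements are +1; in a well-mixed configuration, the signs fluctuate from particle to particle.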

## IV. UNSUPERVISED MACHINE LEARNING

We use Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE)^{25,26} to obtain a low-dimensional representation of our data. PCA uses an orthogonal linear mapping of the original high dimensional data into a lower dimensional space. The method we use is standard, but we describe it briefly here for completeness.

The first step is standardization of the feature matrix to give a matrix **X**_{S} in which the mean of each column is zero. The covariance matrix,

$$\mathbf{C} = \frac{1}{N_{T} - 1} \mathbf{X}_{S}^{T} \mathbf{X}_{S},$$

is a positive semi-definite, symmetric matrix (of dimension *N*_{c} × *N*_{c} or *N* × *N*) with non-negative eigenvalues and orthogonal eigenvectors. The eigenvalues *b*_{n} and eigenvectors **w**_{n} are obtained by solving

$$\mathbf{C} \mathbf{w}_{n} = b_{n} \mathbf{w}_{n}.$$

The transformation matrix has the eigenvectors as columns, i.e., $\mathbf{W} = (\mathbf{w}_{1}, \mathbf{w}_{2}, \ldots, \mathbf{w}_{M})$, where the eigenvectors are ordered so that they correspond to decreasing eigenvalues, *b*_{1} > *b*_{2} > *b*_{3}, etc. The principal component matrix is given by

$$\mathbf{P} = \mathbf{X}_{S} \mathbf{W}.$$

The principal component (PC) axes are the columns of **P**. Since the eigenvalues are ordered in decreasing magnitude, the first PC has the largest variance. A dimensionality reduction is achieved by keeping the first few principal components. If *b*_{n} is the variance of PC *n*, called the explained variance, the relative explained variance *λ*_{n} is defined as

$$\lambda_{n} = \frac{b_{n}}{\sum_{m=1}^{M} b_{m}},$$

where *M* is the dimension of **W**.
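The PCA steps above amount to a few lines of linear algebra. The sketch below is an illustration of the same eigendecomposition (in practice, scikit-learn's PCA is equivalent for this purpose); the function name is ours.

```python
import numpy as np

def pca_explained_variance(X):
    """PCA via eigendecomposition of the covariance matrix: returns the
    principal-component matrix P and the relative explained variances
    lambda_n, ordered by decreasing eigenvalue."""
    Xs = X - X.mean(axis=0)             # standardize: zero column means
    C = Xs.T @ Xs / (len(Xs) - 1)       # covariance matrix
    b, W = np.linalg.eigh(C)            # eigh returns eigenvalues ascending
    order = np.argsort(b)[::-1]         # re-order: decreasing variance
    b, W = b[order], W[:, order]
    P = Xs @ W                          # principal-component scores
    lam = b / b.sum()                   # relative explained variance
    return P, lam
```

The variance of the *n*th column of **P** equals the eigenvalue *b*_{n}, which is what makes λ_{n} the fraction of total variance captured by PC *n*.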

Figure 3 depicts the relative explained variances plotted against the PC index for the LJ mixture and WR models. In both models, *λ*_{n} decreases slowly with *n* with the distance-based feature, but the first index is dominant with the affinity-based feature. The affinity-based feature, therefore, provides a better reduction in dimensionality.

We quantify the phase transition using PCA-derived order parameters (OPs). These OPs are calculated using only the first two PC scores, in other words, only the first two columns of the final matrix **P**, which contain the first two principal components *P*_{1} and *P*_{2} for each state point (temperature or density). One OP is the ensemble average of *P*_{1}, denoted $\langle P_{1}\rangle$, which we normalize so that its range is between 0 and 1. The second OP is related to the variance of the principal components: if $\sigma_{1}^{2}$ and $\sigma_{2}^{2}$ are the ensemble averages of the variances of *P*_{1} and *P*_{2}, respectively, the OP is defined as $\sigma_{12} \equiv (\sigma_{1}^{2} + \sigma_{2}^{2})/2$.
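In code, the two order parameters might be computed as follows. This is a sketch: the use of the absolute value and min-max normalization for ⟨*P*_{1}⟩ is our assumption (the overall sign of a principal component is arbitrary), and the function name is ours.

```python
import numpy as np

def order_parameters(P_by_state):
    """Compute the two PCA-derived order parameters over a scan of state
    points. P_by_state is a list of (n_configs, 2) arrays of PC scores,
    one array per temperature or density."""
    # OP 1: ensemble average of P1, min-max normalized to [0, 1] over the scan
    mean_p1 = np.array([np.abs(P[:, 0].mean()) for P in P_by_state])
    op1 = (mean_p1 - mean_p1.min()) / (mean_p1.max() - mean_p1.min())
    # OP 2: sigma_12 = (var(P1) + var(P2)) / 2 at each state point
    sigma12 = np.array([(P[:, 0].var() + P[:, 1].var()) / 2
                        for P in P_by_state])
    return op1, sigma12
```

The peak in `sigma12` across the scan of state points is then identified with the transition.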

The theoretical framework of t-SNE is based on minimization of the divergence between two probability distributions (*P* and *Q*). The high-dimensional input dataset is **X**_{S} = $(\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{N_{T}})$; for the input, we use the PCA-transformed data. The first step is computing a conditional probability from the pairwise distances between the input data points based on a Gaussian distribution,

$$P_{j|i} = \frac{\exp\left(-\|\mathbf{x}_{i} - \mathbf{x}_{j}\|^{2} / 2\sigma_{i}^{2}\right)}{\sum_{k \neq i} \exp\left(-\|\mathbf{x}_{i} - \mathbf{x}_{k}\|^{2} / 2\sigma_{i}^{2}\right)},$$

with *P*_{i∣i} = 0. Here, *σ*_{i} is the variance of the Gaussian centered on each high-dimensional data point **x**_{i}. This parameter is automatically determined by the algorithm for a given value of a user-provided perplexity, which is essentially the effective number of neighbors around **x**_{i}, given as

$$\mathrm{Perp}(P_{i}) = 2^{H(P_{i})},$$

where $H(P_{i}) = -\sum_{j} P_{j|i} \log_{2} P_{j|i}$ is the Shannon entropy. To symmetrize *P*_{i∣j} and *P*_{j∣i}, t-SNE uses the joint probability *P*_{ij}, the mean of the two conditional probabilities,

$$P_{ij} = \frac{P_{j|i} + P_{i|j}}{2N_{T}},$$

where *N*_{T} is the size of the input dataset.

The second probability distribution (*Q*) measures the similarity between the data points in the low-dimensional space. The low-dimensional mapping is represented as **Y** = $(\mathbf{y}_{1}, \mathbf{y}_{2}, \ldots, \mathbf{y}_{N_{T}})$, and the embedding similarity *Q*_{ij} between two low-dimensional data points **y**_{i} and **y**_{j} is calculated using a Student-t kernel,

$$Q_{ij} = \frac{\left(1 + \|\mathbf{y}_{i} - \mathbf{y}_{j}\|^{2}\right)^{-1}}{\sum_{k \neq l} \left(1 + \|\mathbf{y}_{k} - \mathbf{y}_{l}\|^{2}\right)^{-1}},$$

with *Q*_{ii} = 0. t-SNE minimizes the Kullback–Leibler divergence between *P*_{ij} and *Q*_{ij}, defined through the cost function

$$C = \sum_{i} \sum_{j} P_{ij} \ln \frac{P_{ij}}{Q_{ij}}.$$

The cost function is minimized by descending along the gradient

$$\frac{\partial C}{\partial \mathbf{y}_{i}} = 4 \sum_{j} (P_{ij} - Q_{ij}) (\mathbf{y}_{i} - \mathbf{y}_{j}) \left(1 + \|\mathbf{y}_{i} - \mathbf{y}_{j}\|^{2}\right)^{-1}.$$

Iterating this gradient descent allows t-SNE to find a low-dimensional projection **y**_{i} of the corresponding input **x**_{i}. Since we project our data to a two-dimensional space, the final low-dimensional matrix is of size *N*_{T} × 2, and the columns give the first and second components of the projections, labeled *S*_{1} and *S*_{2} in the scatter plots. The order parameters for t-SNE are defined in the same way as for the PCA, although they are denoted $\langle S_{1}\rangle$ and *σ*_{12}.

All the PCA and t-SNE calculations were performed using the libraries available in scikit-learn.^{27}
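A minimal scikit-learn sketch of this workflow is shown below, on toy data standing in for a feature matrix; the cluster parameters, seeds, and variable names are illustrative and not from the production analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# toy stand-in for a feature matrix: 50 "mixed" and 50 "demixed" configurations
X = np.vstack([rng.normal(-2.0, 0.5, (50, 10)),
               rng.normal(2.0, 0.5, (50, 10))])

# PCA: columns of P are the principal-component scores (P1, P2)
P = PCA(n_components=2).fit_transform(X)

# t-SNE run on the PCA-transformed data, projecting to (S1, S2)
Xp = PCA().fit_transform(X)
S = TSNE(n_components=2, perplexity=30.0, init="pca",
         random_state=0).fit_transform(Xp)
```

Scatter plots of the columns of `P` (or `S`), colored by state point, correspond to the panels discussed in Sec. V.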

## V. RESULTS AND DISCUSSION

Scatter plots of the PCA and t-SNE projections as a function of state point show that the affinity-based feature vector distinguishes between the phase-separated and mixed states more clearly than the distance-based feature, for both the LJ mixture and the WR model. Interestingly, there is no significant difference in the ability of the PCA and t-SNE to distinguish the phases. These conclusions are evident in Figs. 4 and 5, which depict the scatter plots for the two models with the distance-based (Fig. 4) and affinity-based (Fig. 5) feature vectors, respectively. In the figures, the colors represent the state point going from mixed (red; high temperature in the LJ mixture and low density in the WR model) to phase separated (blue; low temperature in the LJ mixture and high density in the WR model).

A comparison of the PCA with the t-SNE (compare left to right panels in all figures) shows that there is no significant distinction between the ability of the PCA and t-SNE to distinguish mixed and phase separated states, i.e., the degree of clustering of data points is similar. With the distance-based feature, both methods show some clustering in the case of the LJ mixture, but are unable to display a meaningful separation in the case of the WR model. With the affinity-based feature, both methods show a clear clustering, with the mixed phase (red) distinct from the phase separated phase (blue). This suggests that the affinity-based feature vector is a more promising route to extracting the phase behavior of these models.

Often, higher-order components in the PCA are required to distinguish different structures. This is particularly relevant here because, with the distance-based feature, the explained variances show more than one dominant component (see Fig. 3). In this case, however, the higher components do not provide significant additional information: Fig. 6, a three-dimensional plot of the first three principal components (*P*_{1}, *P*_{2}, and *P*_{3}), does not show a clear distinction between the mixed and separated states.

The order parameters obtained with the affinity-based feature vectors provide an accurate estimate of the critical point of both models. The order parameters from the PCA are depicted in Fig. 7, and those from the t-SNE are in the supplementary material (Fig. S1). The order parameter $\langle P_{1}\rangle$ is zero in the mixed phase and increases rapidly in the two-phase region, and *σ*_{12} shows a peak in the neighborhood of what could be the critical point. Interestingly, the behavior of *σ*_{12} is quite similar to that of the constant-volume specific heat (Fig. S2).

For the smaller systems, the PCA is not able to distinguish between the mixed and phase-separated states for either model, but a clear transition is seen for larger systems. The order parameter $\langle P_{1}\rangle$ often shows a gradual, rather than rapid, increase to non-zero values, and we therefore identify the peak in *σ*_{12} with the phase transition. For the systems shown in Figs. 7(b) and 7(d), which are the largest system sizes in each case, we estimate a critical temperature of 1.425 for the LJ mixture, which compares well with the literature value of 1.423 ± 0.0005.^{28} Similarly, the transition density for the WR model is 0.75, which compares well with the literature value of 0.762.^{29} Results for other system sizes are shown in Figs. S3–S8 of the supplementary material.

We estimate the infinite-system transition point using finite size scaling. We find the transition temperatures and densities from the position of the maximum in *σ*_{12}. Uncertainties in the temperature or density are measurement uncertainties, i.e., the simulation intervals of temperature (LJ) or density (WR). The machine learning transition temperatures do not show the usual scaling expected from finite-size simulations in the Ising universality class.^{30} Usually, *T*_{c}(*L*) ∼ *T*_{c,∞} + *AL*^{−1/ν}, where *T*_{c,∞} is the critical temperature of the infinite system, *ν* = 0.629 is the Ising correlation-length exponent, and *A* is a positive constant. This implies that *T*_{c}(*L*) > *T*_{c,∞} for finite *L*, because, in a finite system, the critical point is reached when the length scale of fluctuations exceeds the box size, which happens at a higher temperature for smaller systems.

We find that for the WR model, *T*_{c} is essentially independent of *L* (within uncertainties), and for the LJ mixture, *T*_{c} increases with increasing *L*. The variation of *T*_{c} with *L*^{−1/ν} for the LJ mixture and the WR model is shown in Figs. 8(a) and 8(b), respectively. From a fit to *T*_{c}(*L*) = *T*_{c,∞} + *AL*^{−1/ν}, we extract *T*_{c,∞} = 1.43 ± 0.01 for the LJ mixture and *ρ*_{c,∞} = 0.76 ± 0.01 for the WR mixture. These estimates are in good agreement with the critical point obtained using conventional simulations.
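Since the scaling form is linear in *x* = *L*^{−1/ν} with *ν* fixed at the 3D Ising value, the extrapolation reduces to a straight-line fit. The sketch below illustrates this; the function name is ours.

```python
import numpy as np

NU = 0.629  # 3D Ising correlation-length exponent

def fss_extrapolate(L, tc_L):
    """Extrapolate finite-size transition points to the infinite system
    via T_c(L) = T_c_inf + A * L**(-1/nu), which is linear in
    x = L**(-1/nu). Returns (T_c_inf, A)."""
    x = np.asarray(L, dtype=float) ** (-1.0 / NU)
    A, tc_inf = np.polyfit(x, np.asarray(tc_L, dtype=float), 1)
    return tc_inf, A
```

The same function applies to the WR model with densities in place of temperatures.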

## VI. CONCLUDING REMARKS

We use unsupervised machine learning methods to obtain the critical point of three-dimensional off-lattice systems. The most important result is that the construction of the input feature vector plays a crucial role in the ability of the ML method to predict the critical point. We use two feature vectors, one based on the distances between particles (“distance-based”) and one based on the *local* concentration (“affinity-based”), and study two systems, a symmetric binary Lennard-Jones (LJ) mixture and a non-additive hard-sphere Widom–Rowlinson (WR) mixture. We find that the distance-based feature vector is not able to distinguish clearly between mixed and two-phase states for the WR mixture, although it does show a distinction for the LJ mixture. The affinity-based feature vector, however, shows a distinct separation of one-phase and two-phase configurations for both models. We use two ML methods, PCA and t-SNE, and find that the choice of the ML algorithm is not important; the key choice is the construction of the feature vector.

We estimate the critical point from the variation of the peak in the variance of the first two principal components with temperature and density, and extrapolate to an infinite system using the critical scaling of the Ising universality class. The critical point obtained in this fashion is in good agreement with previous estimates from conventional simulations, and requires only standard simulations, with no particle insertions or deletions.

Supervised and unsupervised ML methods have been widely used for simple lattice systems and some off-lattice systems. In lattice systems, the choice of the feature vector is clear, namely, the values of the spins. We find that in off-lattice systems, a judicious choice of the input features can play a crucial role in the predictions, and it remains to be seen if the choice of feature is system dependent. We hope this paves the way to study other complex fluids using unsupervised methods.

An interesting direction is to connect the machine learning methods to statistical mechanics and liquid state theory. For the LJ mixture, we find empirically that the order parameter *σ*_{12} shows behavior quite similar to the constant volume heat capacity. Although this suggests that there might be a link, the former relies only on the locally averaged concentrations, whereas the latter is a measure of energy fluctuations. Note that ML relies on how configurations at a given state are different from those at other states, and it is, therefore, important to have a range of temperatures (in the LJ case) that span the critical point. In statistical mechanics, on the other hand, one only requires a single state to determine all the properties. In this regard, it is important to note that identifying the peak in *σ*_{12} with the critical point is merely an ansatz, and one that must eventually be verified to represent the critical point from a more fundamental standpoint.

## SUPPLEMENTARY MATERIAL

See the supplementary material for analysis of the phase behavior using the distance-based feature vector and order parameters with the t-SNE method.

## ACKNOWLEDGMENTS

This work was supported by the National Science Foundation through Grant No. CHE-1856595. All simulations presented here were performed using computational resources provided by UW-Madison Department of Chemistry HPC Cluster under NSF Grant No. CHE-0840494.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Inhyuk Jang**: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Writing – original draft (equal). **Supreet Kaur**: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Writing – original draft (equal). **Arun Yethiraj**: Conceptualization (equal); Funding acquisition (lead); Investigation (equal); Project administration (lead); Resources (lead); Supervision (lead); Writing – review & editing (lead).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.