Several enhanced sampling methods, such as umbrella sampling or metadynamics, rely on the identification of an appropriate set of collective variables. Recently two methods have been proposed to alleviate the task of determining efficient collective variables. One is based on linear discriminant analysis; the other is based on a variational approach to conformational dynamics and uses time-lagged independent component analysis. In this paper, we compare the performance of these two approaches in the study of the homogeneous crystallization of two simple metals. We focus on Na and Al and search for the most efficient collective variables that can be expressed as a linear combination of X-ray diffraction peak intensities. We find that the performances of the two methods are very similar. Wherever the different metastable states are well-separated, the method based on linear discriminant analysis, based on its harmonic version, is to be preferred because simpler to implement and less computationally demanding. The variational approach, however, has the potential to discover the existence of different metastable states.

## I. INTRODUCTION

Many chemical and physical phenomena are characterized by the occurrence of long lived metastable states. Under these circumstances, accurate sampling becomes computationally expensive or even prohibitive. In order to overcome this problem, many methods have been suggested.^{1} A large fraction of these methods depends on the definition of collective variables (CVs). Typical examples are umbrella sampling,^{2–5} metadynamics,^{6–9} and variationally enhanced sampling.^{10} The efficiency of these simulations depends very much on the quality of the CV, and hence, the finding and improving of CVs is the object of intense investigations.^{11}

In our group, two methods have recently been developed, harmonic linear discriminant analysis (HLDA)^{12} and variational approach to conformational dynamics (VAC).^{13} In HLDA, one constructs low dimensional CVs from the local fluctuations in the different metastable states. In contrast, in the VAC approach, one starts with a biased simulation with non-optimal CVs and attempts to improve this initial guess using a variational principle approach that is based on the time-lagged independent component analysis (TICA).

In this work, we compare the performance of HLDA-generated CVs with those obtained using VAC. We focus on a typical application area of enhanced sampling methods, namely, homogeneous crystallization. In particular, we shall present results on two systems, Na and Al, already studied elsewhere.^{4,5,9,14}

It has recently been shown^{15} that the X-ray diffraction (XRD) intensities can be useful CVs. However, sometimes a single peak is not sufficient. Thus we shall search with HLDA and VAC for the best linear combinations of the diffraction peak intensities. We are not implying here that XRD intensities are necessarily the best possible CVs to study nucleation. However, they have been successfully used and provided a good example on how to improve from reasonable CVs, which is the situation commonly encountered in the practice.

## II. METHODS

### A. Well-tempered metadynamics

Here, we use as an enhanced sampling method well-tempered metadynamics (WTMetaD).^{16,17} We recall that WTMetaD is a procedure in which the dynamical evolution of the system is altered by the addition of an external bias potential *V*(** s**) periodically updated as

where *γ* is the bias factor and *β* = 1/*k*_{B}*T* is the inverse temperature. The modification of the potential in Eq. (1) results from the multiplication of a Gaussian kernel *G*(** s**,

*s*_{n}) centered at the current CV value

*s*_{n}and scaled by $exp\u22121\gamma \u22121\beta Vn\u22121(sn)$. This scaling factor decreases as 1/n; thus, the change of the external bias potential becomes smaller as the metadynamics simulation progresses. The bias potential

*V*(

**(**

*s***)) asymptotically takes the form**

*R*### B. Harmonic linear discriminant analysis

Very recently, a simple way of obtaining efficient CVs, called harmonic linear discriminant analysis (HLDA), has been proposed.^{12} In its simplest version, one assumes that there are two separated states that can be identified by *N*_{d} descriptors *d*_{i}(** R**) that are functions of the atomic coordinates

**and are arranged to form a vector**

*R***(**

*d***). We assume that the average values in the two basins,**

*R*

*μ*_{A}and

*μ*_{B}, are different. The fluctuation of

**(**

*d***) in the two states is given by the covariance matrices**

*R***Σ**

_{A}and

**Σ**

_{B}.

The idea of HLDA is to find the direction along which the projected data of the two states are well-separated. This is obtained by maximizing the ratio of the between-class variation *S*_{B} = (*μ*_{A} − *μ*_{B})(*μ*_{A} − *μ*_{B})^{T} and the within-class variance calculated by a harmonic average *S*_{W} = **Σ**_{A}**Σ**_{B}/(**Σ**_{A} + **Σ**_{B}), which leads to the object function

where *J*(*W*) is maximized by

Thus the HLDA based CV takes the form

This procedure has been inspired from linear discriminant analysis (LDA). However, it differs from the standard version of LDA in that the within-class variance is estimated from the harmonic average of the covariance matrices and not by the arithmetic average as done in LDA. The reasons for this choice have been illustrated in Ref. 12 and also in Ref. 19.

An extension of HLDA to the case in which several metastable states need to be considered is possible and following the nomenclature of the artificial intelligence community we refer to this as a multi-class situation and thereby we shall name the method as multi-class HLDA (MC-HLDA).^{19} We assume that there are *M* classes, the between-class variance matrix in this case is expressed as $SB=\u2211i=1M\mu i\u2212\mu \mu i\u2212\mu T$, where $\mu =1M\u2211i=1M\mu i$, and the within-class variance matrix is expressed as *S*_{W} = **1**/((**1/Σ**_{1}) + (**1**/**Σ**_{2}) + ⋯ + (**1**/**Σ**_{M}**))**. The projection matrix $K=W1W2\cdots |WM$ that maximizes the ratio $JK=KTSBKKTSWK$ is the one whose columns are the eigenvectors corresponding to the generalized eigenvalue problem

The optimized CVs are thus given by

Notice that, because *S*_{B} is of rank (*M* − 1) or less, there will be at most *M* − 1 eigenvectors with non-zero eigenvalues *λ*_{i}.

### C. Variational approach to conformational dynamics

Another way one can use to optimize CVs is based on a variational approach to conformational dynamics (VAC).^{13} It has been shown that time-lagged independent component analysis (TICA) can provide an optimal solution of VAC. TICA is a well-known method for blind source separation problems in signal processing and recently has been applied in finding slow modes in molecular dynamics simulations.^{20–26}

The first version of this method assumes that a sufficient number of trajectories in which the system underwent transitions from one well to another were already available.^{21,22} However, recently, this method has been extended and it is now possible to perform a VAC analysis starting from biased simulations by WTMetaD with non-optimal CVs.^{13} These CVs are then optimized. In the application of this principle, one expands the slow modes into a basis set of *N*_{d} descriptors *d*_{i}(*R*_{t}) (*i* = 1, 2, …, *N*_{d}). The descriptors are normalized to one, and the average value is subtracted to get a quantity of zero mean. The best linear combination in these basis sets is obtained by maximizing the ratio *f*(*W*_{1}) with respect to a projection vector *W*_{1},

where $Cij\tau =Edit\u0303djt\u0303+\tau $ is the time lagged correlation between *d*_{i} and *d*_{j}, $C\xaf\tau =C\tau +CT\tau /2$ is the symmetrized correlation matrix, and $C\xaf0$ is the correlation at lag time 0. In Refs. 13 and 24, it has been shown that Eq. (8) can be applied also to biased simulations provided that the time scale is modified as follows:

where *V*(** s**,

*t*) is the instantaneous value of bias in WTMetaD.

is an energy offset, and *F*(** s**) is the free energy of the system. The function

*c*(

*t*) asymptotically tends to the reversible work done on the system by the external bias.

Maximizing the ratio in Eq. (8) leads to the generalized eigenvalue problem

The eigenvectors of Eq. (11) are ordered in descending λ(*τ*) values. The slowest decreasing modes are then chosen as CVs

### D. X-ray diffraction intensities

In a recent paper,^{15} it has been suggested that the intensities of the X-ray diffraction (XRD) peaks could act as good CVs since they are physically meaningful and can distinguish between different states.

The XRD intensities are computed by the Debye scattering function^{27}

where *N* is the total number of atoms, *Q* is the modulus of the scattering vector, *f*_{i}(*Q*) and *f*_{j}(*Q*) are the atomic scattering form factors, *R*_{ij} is the distance between atoms *i* and *j*, and *W*(*R*_{ij}) = sin(*πR*_{ij}/*R*_{c})/(*πR*_{ij}/*R*_{c}) is a window function that handles the problem of the finite simulation box and *R*_{c} is the upper limit of *R*_{ij}.

## III. COMPUTATIONAL DETAILS

### A. Details of the molecular dynamics simulations

All simulations were performed in the isothermal-isobaric (NPT) ensemble. We simulated Na and Al using embedded atom models.^{28,29} Biased metadynamics simulations were performed using the LAMMPS molecular dynamics simulation code^{30} patched with PLUMED 2 plugin.^{31} The integration of the equations of motion was carried out with a time step of 2 fs. We employed the stochastic velocity rescaling thermostat^{32} with a relaxation time of 0.1 ps. The target pressure of the isotropic barostat^{33} was set to the atmospheric value, and a relaxation time of 10 ps was used.

We simulated the two systems at temperatures close to their melting points (375 K for Na and 850 K for Al). The total numbers of atoms were 250 in Na and 256 in Al.

### B. Details of the WTMetad simulations

In order to determine the free energy surface *F*(** s**) as a function of the CVs, we used WTMetaD.

^{16}The temporal length of the WTMetad runs ranges from 350 to 600 ns. The calculation details are summarized in Table I.

Number of . | . | . | . | . | . | . | . | Simulation time . | Time interval . |
---|---|---|---|---|---|---|---|---|---|

trajectories . | System . | T (K) . | T/T_{m}
. | CVs . | γ . | ω (kJ/mol) . | σ (a.u.) . | (ns) . | (ps) . |

1 | Na | 375 | 1.025 | I^{{011}} | 30 | 3 | 2 | 350 | 1 |

2 | Na | 375 | 1.025 | HLDA | 30 | 3 | 0.012 | 600 | 1 |

3 | Na | 375 | 1.025 | VAC | 30 | 3 | 0.012 | 600 | 1 |

4 | Al | 850 | 0.918 | I^{{111}} HLDA | 70 | 7.5 | 2 | 600 | 1 |

5 | Al | 850 | 0.918 | Eig1 and Eig2 VAC | 70 | 7.5 | 0.014 | 600 | 1 |

6 | Al | 850 | 0.918 | Eig1 and Eig2 | 70 | 7.5 | 0.014 | 600 | 1 |

Number of . | . | . | . | . | . | . | . | Simulation time . | Time interval . |
---|---|---|---|---|---|---|---|---|---|

trajectories . | System . | T (K) . | T/T_{m}
. | CVs . | γ . | ω (kJ/mol) . | σ (a.u.) . | (ns) . | (ps) . |

1 | Na | 375 | 1.025 | I^{{011}} | 30 | 3 | 2 | 350 | 1 |

2 | Na | 375 | 1.025 | HLDA | 30 | 3 | 0.012 | 600 | 1 |

3 | Na | 375 | 1.025 | VAC | 30 | 3 | 0.012 | 600 | 1 |

4 | Al | 850 | 0.918 | I^{{111}} HLDA | 70 | 7.5 | 2 | 600 | 1 |

5 | Al | 850 | 0.918 | Eig1 and Eig2 VAC | 70 | 7.5 | 0.014 | 600 | 1 |

6 | Al | 850 | 0.918 | Eig1 and Eig2 | 70 | 7.5 | 0.014 | 600 | 1 |

## IV. RESULTS AND DISCUSSION

### A. Choosing the descriptors

Both HLDA and VAC require defining an adequate set of descriptors. As mentioned in the Introduction, we shall use some of the XRD peak intensities. The cutoff *R*_{c} for Na and Al is 11 Å and 9 Å, respectively. Due to the small system sizes, the peaks resulting from Eq. (13) are broad, as shown in Fig. 1, but they can still distinguish well between solid and liquid.

In order to perform our analysis, we found that the intensities of the first four Bragg peaks were sufficient for our purpose. The four descriptors are labeled by the peak Miller indices {hkl}. They are *d*_{1} = *I*^{{011}}, *d*_{2} = *I*^{{002}}, *d*_{3} = *I*^{{112}}, and *d*_{4} = *I*^{{022}} in Na, and *d*_{1} = *I*^{{111}}, *d*_{2} = *I*^{{002}}, *d*_{3} = *I*^{{022}}, and *d*_{4} = *I*^{{113}} in Al.

In order to make a meaningful comparison of the HLDA and VAC coefficients from the original runs in which only *I*^{{011}} for Na and *I*^{{111}} for Al is biased, we rescaled the value *I*^{{hkl}} to $\u0128hkl$, which covers the range [0, 1], where 0 is the minimum value sampled and 1 is the maximum.

In the following, we shall measure the quality of the CVs not only by their ability to discriminate between states but also by the re-crossing frequency. The re-crossing frequency is the number of transitions between solid and liquid per unit time during the WTMetaD. They could be measured from the CV trajectory (see Figs. 4 and 10). The rational for this is that the re-crossing frequency between different basins is an important figure of merit. In fact, the slowest mode not included in the CV determines the re-crossing frequency.^{34}

### B. The Na case

From the existing literature, it is known that the *F*(*s*) of Na close to 375 K is dominated by 2 minima corresponding to the disordered liquid state and the ordered body centered cubic (BCC) state.^{9}

We first applied HLDA to find the direction that well discriminate the two states. In each of them, we estimate the average values and the covariance matrix in 2-ns long unbiased runs, a time that is sufficient for these estimations to converge, and the points sampled are shown in Fig. 2. We then applied Eq. (5) to obtain the CV, and the result is shown in Table II. The descriptor *I*^{{011}}, which corresponds to the highest peak in the XRD pattern, dominates in the HLDA coefficients.

. | I^{{011}}
. | I^{{002}}
. | I^{{112}}
. | I^{{022}}
. |
---|---|---|---|---|

HLDA | 0.857 | 0.201 | 0.464 | 0.096 |

VAC | 0.923 | 0.011 | −0.331 | −0.198 |

. | I^{{011}}
. | I^{{002}}
. | I^{{112}}
. | I^{{022}}
. |
---|---|---|---|---|

HLDA | 0.857 | 0.201 | 0.464 | 0.096 |

VAC | 0.923 | 0.011 | −0.331 | −0.198 |

The VAC based calculation was started with a biased run in which only *I*^{{011}} is used as CV. Following the procedure described in Ref. 13, we extract the slowest decaying modes in the basis of the selected descriptors. The eigenvalue of the VAC equation is displayed in Fig. 3(b) as a function of lag-time. Obviously one eigenvalue dominates the long time decay. The coefficients for the descriptors are shown in Table II. They are very similar to those of HLDA and also dominated by *I*^{{011}}.

The probability distributions related to these two new CVs were computed and shown in Fig. 3(c). Both of them can discriminate the two states, while the transition region is better separated by the direction of the HLDA CV *s*^{H} than that of the VAC CV *s*^{V}.

We now compare the performance of *I*^{{011}}, *s*^{H}, and *s*^{V} as a metadynamics CVs. Thus two additional runs based on *s*^{H} and *s*^{V} were performed. In Fig. 4, it is obvious that both *s*^{H} and *s*^{V} improve substantially the performance of the original simulation. As to be expected from the theory of metadynamics, after an initial steep rise, the offset energy *c*(*t*) reaches a slowly increasing behavior, indicating that the asymptotic regime in which WTMetad is valid has been reached (Fig. 5). In both systems, this regime is reached after about 200 ns. If we count the number of recrossing in this regime, we find an average of 3.8 (*I*^{{011}}), 9.0 (HLDA), and 7.3 (VAC) re-crossings per 100 ns.

### C. The Al case

The other case studied was the crystallization of Al. We first explored the free energy surface biasing *I*^{{111}} at 850 K, close to the melting temperature. A visual inspection of the solid-like minimum reveals that the structure in the solid minimum is a combination of perfect face centered cubic (FCC) and stacking disorder close packed (CP) structures. Fortunately, we found that if we project the free energy surface (FES) onto *I*^{{111}} and *I*^{{113}} by reweighting,^{35} the solid minima are separated [see Fig. 6(a)]. The CP structure is stable even with a larger simulation cell with 2048 atoms, and the stacking faults remain throughout the 2-ns long unbiased simulation (Fig. 7). Such a phase of Al has indeed been stabilized in the experiments performed in cold rolled Al crystal^{36} and nanocrystalline Al.^{37,38}

Having established somewhat empirically, that we are in the presence of a multi-class problem and can apply the multi-class generalization of HLDA.^{19} To prepare for the MC-HLDA simulation, we performed three independent 2-ns-long runs start from the three states, and on this time scale no transition between these states was observed (Fig. 8). From this using Eq. (7), we obtain two CVs, $sEig\u20091H$ and $sEig\u20092H$, and performed a biased run based on these CVs. The result shows not only that sampling is accelerated, but that the three classes are better separated [Fig. 6(c)].

We then constructed a VAC calculation starting from the biased trajectory that uses *I*^{{111}} only as CV. Two eigenvalues dominate the long time decay [Fig. 6(b)], leading also to a 2-dimensional CV space, and the coefficients of the corresponding eigenvectors are listed in Table III. This is an important result since it shows that VAC can find out automatically that the Al case is in a multi-class scenario. This is to be contrasted with HLDA requiring that the classes have to be known beforehand.

. | . | I^{{111}}
. | I^{{002}}
. | I^{{022}}
. | I^{{113}}
. |
---|---|---|---|---|---|

HLDA | Eig1 | 0.946 | 0.028 | 0.077 | 0.315 |

Eig2 | 0.871 | −0.044 | −0.237 | −0.429 | |

VAC | Eig1 | 0.695 | −0.023 | 0.217 | 0.685 |

Eig2 | 0.932 | −0.087 | −0.220 | −0.273 |

. | . | I^{{111}}
. | I^{{002}}
. | I^{{022}}
. | I^{{113}}
. |
---|---|---|---|---|---|

HLDA | Eig1 | 0.946 | 0.028 | 0.077 | 0.315 |

Eig2 | 0.871 | −0.044 | −0.237 | −0.429 | |

VAC | Eig1 | 0.695 | −0.023 | 0.217 | 0.685 |

Eig2 | 0.932 | −0.087 | −0.220 | −0.273 |

Both the HLDA CV *s*^{H} and the VAC CV *s*^{V} spaces (Fig. 9) can discriminate better between the three basins than the XRD peak intensities (Fig. 8). While the coefficients of the descriptors appear to be different, the two FESs are very similar with the three minima well-separated, apart from a slight relative rotation [Figs. 6(c) and 6(d)].

To compare the relative performance of the *s*^{{111}}, *s*^{H}, and *s*^{V}, we performed two extra simulations using *s*^{H} and *s*^{V}, and the trajectories are shown in Fig. 10. Similarly to what done for Na, we measure how many transitions take place on average every 100 ns after the metadynamics has reached its asymptotic equilibrium (Fig. 11). If we count the number of re-crossings in this regime, we find an average of 0.8 (*I*^{{111}}), 3.0 (HLDA), and 3.3 (VAC) re-crossings per 100 ns.

## V. CONCLUSION

In this paper, we tested the performance of HLDA and VAC in the case of homogeneous nucleation. We applied them to look for the most efficient collective variables that can be expressed as a linear combination of X-ray diffraction peak intensities. Both methods gave a good separation among the different states and substantially enhanced sampling. When the number of relevant classes is known in advance, HLDA is to be preferred to VAC because of its simplicity and the fact that it does not require a preliminary biased run. On the other hand, as we have seen in the case of Al, VAC can automatically find out the different classes that is not a minor advantage.

## ACKNOWLEDGMENTS

The authors thank Faidon Brotzakis, Yi Issac Yang, Valerio Rizzi, and Pablo M. Piaggi for useful discussions. Some of the calculations reported here were performed on the EULER cluster at ETH Zurich, and the others were on the Mönch cluster at the Swiss National Supercomputing Center (CSCS). This research was supported by the VARMET European Union (Grant No. ERC-2014-ADG-670227).