The dream of machine learning in materials science is for a model to learn the underlying physics of an atomic system, allowing it to move beyond the interpolation of the training set to the prediction of properties that were not present in the original training data. In addition to advances in machine learning architectures and training techniques, achieving this ambitious goal requires a method to convert a 3D atomic system into a feature representation that preserves rotational and translational symmetries, smoothness under small perturbations, and invariance under re-ordering. The atomic orbital wavelet scattering transform preserves these symmetries by construction and has achieved great success as a featurization method for machine learning energy prediction. Both in small molecules and in the bulk amorphous Li_{α}Si system, machine learning models using wavelet scattering coefficients as features have demonstrated a comparable accuracy to density functional theory at a small fraction of the computational cost. In this work, we test the generalizability of our Li_{α}Si energy predictor to properties that were not included in the training set, such as elastic constants and migration barriers. We demonstrate that statistical feature selection methods can reduce over-fitting and lead to remarkable accuracy in these extrapolation tasks.

## I. INTRODUCTION

Machine learning (ML) is a powerful tool in chemical physics.^{1,2} Both kernel-based^{3–14} and neural-network-based^{15–24} learning algorithms have found success in predicting physical properties such as energies, forces, and potential energy surfaces starting from atomic coordinates. ML models have been used for molecular dynamics (MD),^{13,25,26} prediction of free energy surfaces,^{27–30} and generation of thermodynamic ensembles^{25,31} on systems for which they have been trained. Much as ML models have revolutionized fields such as computer vision,^{32} automated content generation,^{33} and natural language processing,^{34} an ML model could, in principle, predict the physical properties of broad classes of atomic systems with an accuracy competitive with the best current methods at a small fraction of the computational cost. However, such an ML model has not yet been developed, and many obstacles still remain before general atomistic ML models can compete with the existing quantum chemistry methods. The most fundamental obstacle is perhaps the generalizability problem (also referred to as transferability). Quantum chemical methods such as density functional theory (DFT) are predictive because they work from first principles—physical properties emerge from solutions to equations that describe underlying physics—allowing these techniques to work on systems that have never been studied before. In contrast, machine learning is at its heart a fitting technique. The model knows nothing about the physical equations, and its predictions are statistical inferences based on the training data. This raises the question: are quantum chemistry models based on machine learning inherently limited to the interpolation of the training data? Or can a machine learning model “learn the underlying physics” and provide new insights beyond the training data? If we want to answer a question with an ML model, will it always be necessary to compute the answer to the same question beforehand in thousands or tens of thousands of similar cases? Or could ML truly teach us something new? The hope that it can is not entirely without foundation. Machine learning models have shown the ability to capture and generalize abstract patterns within training data. For example, a neural net trained to translate languages recently achieved “zero-shot translation”: translation between a pair of languages for which it had no dictionary without going through an intermediate language.^{35} This suggests that learning of underlying meaning from diverse examples may be possible.

Of course, the full answer to this question is beyond the scope of the present work. We look at a specific example of the generalizability problem: energy prediction for amorphous Li_{α}Si. Due to its potential for developing high energy density lithium-ion batteries, this system has recently been studied by a variety of machine learning methods. Onat *et al.* generated an implanted neural network potential for this system.^{27} Artrith *et al.* developed a machine-learning potential that enabled ensemble generation and molecular dynamics for Li_{α}Si.^{25} Brumwell *et al.* created a three-dimensional ML model, similar to a convolutional neural network, with a physically motivated filter based on the wavelet scattering transform and were able to achieve chemical accuracy in energy prediction for this class of structures.^{37} The wavelet formulation has a number of advantages. The inclusion of wavelet dilations and second-order scattering coefficients makes the wavelet scattering approach inherently multiscale, including length scales on the order of the support of the smallest wavelet up to the length scale of the largest unit cell in the Li_{α}Si database. This allows the wavelet scattering transform to capture a broad class of physical interactions, making it more general (see Sec. II D for details on the multiscale nature). The fact that the wavelets themselves are based on atomic orbitals may allow them to more naturally capture interactions arising from electronic bonds (though without sacrificing generality since the wavelet frame is overcomplete) and may help with the generalizability of the model. Other methods that sum up atomic energies based on their local environments (e.g., Refs. 7, 25, 27, 38, and 39) can similarly separate length scales by setting the cutoff to the length scale of the unit cell and selecting a variety of radial functions with support along the intermediate length scales down to the distance between nearest-neighbor atoms. These methods provide feature descriptions that separate length scales but must be used in an appropriate training or energy fitting framework in order to nonlinearly couple these scales. In contrast, the wavelet scattering coefficients used here include the nonlinear coupling in the feature set.

In this work, we build on and improve the wavelet-based model,^{37,40–42} achieving similar accuracy to Onat *et al.*^{27} and Artrith, Urban, and Ceder,^{25} but using a simpler architecture with fewer parameters than the neural network approaches and with fewer features than any of the previous models including Brumwell *et al.*^{37} More importantly, we test the model on extrapolation tasks, predicting physical properties that were not present in the training set. We perform detailed analysis on how to balance under-fitting and over-fitting in order to achieve high generalizability. The three tasks are predicting migration barriers based on transition state theory, energies of amorphous systems significantly larger than the training set systems, and elastic properties based on deformations of amorphous Li_{α}Si. In each of these extrapolation tasks, we find that the model is able to achieve reasonable accuracy, and in some cases does quite well, thus providing evidence of the model’s ability to generalize to new types of systems and tasks.

The remainder of this paper is organized as follows: In Sec. II, we present the methods used in this work, including the data generation process and descriptions of the machine learning algorithms. Numerical results are presented and discussed in Sec. III, and Sec. IV contains a short conclusion. Appendixes A–C and supplementary material provide additional details on certain aspects of this paper.

## II. METHODS

Methods consist of data generation for training (Sec. II A) as well as data generation for testing on extrapolation tasks (Sec. II B). Algorithms used for training the machine learned models are discussed in Sec. II C, whereas the wavelet scattering representation of an atomic state is described in Sec. II D. Appendix B explains how to compute such features efficiently.

### A. Training data generation

The training, validation, and interpolation testing data for the machine-learned model consist of amorphous Li_{α}Si structures labeled by formation energies calculated using Density Functional Theory (DFT). These structures are in cubic boxes under periodic boundary conditions containing from 55 to 100 atoms, with lithium-to-silicon ratio *α* ranging from 0.1 to 3.75. The initial disordered structures are generated by evolving random structures under ReaxFF^{43} molecular dynamics (MD) at 2500 K for 10 ps, and ten different disordered structures are randomly picked from the MD trajectory for each of the 37 chosen concentrations. The accuracy of the force field used to obtain the initial amorphous structure is not important due to the following DFT calculations. In particular, each structure is fully relaxed at constant volume using DFT. The structures and formation energies along the relaxation paths make up the amorphous dataset used in this work, which contains a total of 90 252 structures. A histogram of the quantity of these structures by energy and concentration is shown in Fig. 1. We note that the structures are heavily concentrated near the endpoint of the relaxation, so we expect the resulting model to do better on near-equilibrium amorphous structures. This is desirable because the low-energy structures are more likely to arise in realistic simulations. We also calculate voltages vs Li/Li^{+}^{44–46} and radial distribution functions for Si–Si, Li–Si, and Li–Li pairs and find good agreement with similar datasets from the literature^{25,47} (see the supplementary material). This confirms that our amorphous structures are physically realistic.

Formation energies and relaxations were performed in the Vienna *Ab initio* Simulation Package (VASP)^{48} using the Projector-Augmented Wave method^{49} and the PBE exchange–correlation functional^{50} with a plane-wave energy cutoff of 500 eV. The Brillouin zone was sampled using the Gamma point only during relaxation. After relaxation, the energies along each relaxation path were corrected for *k*-point sampling errors by calculating the energy of each fully relaxed structure using a 3 × 3 × 3 Gamma-centered grid and applying the resulting constant shift to the rest of the structures in the relaxation path. The mean absolute *k*-point sampling correction was 27 meV/atom. The total formation energy of a structure with *N*_{Li} lithium atoms and *N*_{Si} silicon atoms is defined based on DFT total energies,

where *E*_{tot}(Li_{NLi}Si_{NSi}) is the total energy of the system and *E*(Li) and *E*(Si) are the DFT total energy per atom of elemental lithium and silicon, respectively. The structure Li_{NLi}Si_{NSi} has reduced formula Li_{α}Si with *α* = *N*_{Li}/*N*_{Si} and per-atom formation energy,

The per-atom formation energy is the quantity of interest for machine learning. Notice, though, it includes the terms *N*_{Li}*E*(Li) and *N*_{Si}*E*(Si), which require no additional quantum mechanical calculations beyond the one-time cost of computing *E*(Li) and *E*(Si). The difficulty is in computing $Etot(LiNLiSiNSi)$, which requires a costly DFT calculation for each new state. When fitting our machine learned models, we regress the per-atom total energy, defined as

or

Even though it is simple to convert total energies into per-atom total energies, we regress the latter since per atom energies remove the effect of varying unit cell sizes and the number of atoms per unit cell on the total energy. Since we use the squared loss as our measure of error when training, regressing total energies would bias the models toward systems containing larger numbers of atoms since the total energy scales with the number of atoms.

### B. Data generation for extrapolation tests

In order to test the machine learning model’s generalizability to extrapolation tasks, additional DFT data are required to compare with the results of the machine learning model. We test three different extrapolation tasks: prediction of migration barriers, energy prediction for systems with larger unit cells, and prediction of elastic properties.

Diffusion barriers cannot be defined uniquely in amorphous structures due to the lack of order. Rather, paths that move an atom from one favorable coordination environment to another through a relatively unfavorable environment are abundant. An endpoint for such a pathway was found by locating void spaces in the amorphous structure through Voronoi analysis and inserting a test lithium atom at each void to find the most energetically favorable position. Nearby lithium atoms to this void were subsequently identified, and the minimum-energy path (MEP) for each lithium to travel to the void was calculated using the nudged elastic band (NEB) method.^{51} The NEB images along 6 calculated MEPs (for a total of 50 image structures) were used as testing data for this extrapolation task. The primary quantity of interest is the migration barrier, i.e., the difference between the lowest-energy and highest-energy points along the MEP.

Large structure testing data were generated by two methods: independently relaxing larger AIMD-generated structures (the “from-scratch” method) or tiling structures from the dataset, randomly perturbing all atomic positions by 0.1 Å, and performing a single-point calculation (the “tiled” approach). The testing data consist of 37 from-scratch structures, 40 2 × 2 × 2 tiled structures, and 108 2 × 1 × 1 tiled structures.

Finally, elastic property testing data were generated by applying hydrostatic strain from −9% to +9% on fully relaxed structures at each concentration. The bulk modulus *K* is calculated by fitting data near the minimum to the following equation:^{52}

where *V* is the volume, *V*_{0} is the equilibrium volume, and *E* is the energy. In total, a bulk modulus value is calculated at each of the 37 concentrations, based on a total of 333 structures under hydrostatic strain.

### C. Linear regression and model fitting

The purpose of using machine learning is to reduce the computational burden relative to quantum mechanical calculations while maintaining accurate predictions. To accomplish this task, we need to derive features from our data and recombine them in some meaningful way. We use a linear model over a predefined set of features, which consist of nonlinear, multiscale maps of the original atomic state. Using a linear model over a universal feature set allows us to leverage several well-studied techniques in regression, regularization, and statistical learning theory that increases the accuracy, stability, and generalizability of the model.

Let $x={(Zk,Rk)\u2208N\xd7R3}k=1Nx$ denote the list of atoms in the unit cell of a Li_{α}Si system. The value $Zk\u2208N$ denotes the protonic charge of the atom, i.e., *Z*_{k} = 3 for lithium and *Z*_{k} = 14 for silicon, and $Rk\u2208R3$ denotes the position of the atom in the simulation cell. The quantity *N*_{x} = *N*_{Li} + *N*_{Si} is the total number of atoms in the unit cell. Let $\Phi (x)\u2208Rd$ be a *d*-dimensional representation of the state *x*, which is described in detail in Sec. II D. A linear regression with weights $w=(w\gamma )\gamma =0d\u2208Rd+1$ of the per-atom total energy $Etot*(x)$ over the representation $\Phi (x)=(\varphi \gamma (x))\gamma =1d$ computes

where $w0$ is a bias term and the coordinates *ϕ*_{γ}(*x*) of Φ(*x*) are weighted with the scalars $w$_{γ}. We regularize the regression by selecting a parsimonious model via a sparsity constraint on the weights $w$,

for some hyper-parameter *M* that determines the number of nonzero weights.

The weights $(w\gamma )\gamma =0d$ and the hyper-parameter *M* are solved for using the DFT generated training data. Let $Xt={(xi,E(xi))}i=1nt$ denote a training set consisting of Li_{α}Si states *x*_{i} and their DFT generated per-atom total energies $E(xi)=Etot*(xi)$; denote by $Xv={xi\u2032,E(xi\u2032)}i=1nv$ another such set, also consisting of Li_{α}Si states and their associated per-atom total energies, non-overlapping with the training set, which we use as the validation set. For each *M* up to some maximum value, 1 ≤ *M* ≤ *M*_{max}, we compute weights $wM=(w\gamma M)\gamma =0d$ by solving the following:

where $L(w,Xt)$ is the mean squared loss function with respect to the training set $Xt$,

As *M* increases, the model $E\u0303(x;wM)$ becomes more complex as it has more non-zero weights. This increasing complexity is reflected by the fact that the loss function $M\u21a6L(wM,Xt)$ is a decreasing function of *M*. That is, the training error decreases as *M* increases (see also the red curves in Fig. 5).

However, it is well known in machine learning and statistical learning theory that more complex models do not necessarily generalize better. The optimal regularization, here controlled by the sparsity parameter *M*, is determined via cross-validation using the validation set $Xv$. That is, for each 1 ≤ *M* ≤ *M*_{max}, we compute the loss of the model $E\u0303(x;wM)$ on the validation set and select the *M*^{⋆} that minimizes the validation error,

The model that is used on the test data is $E\u0303(x;wM\u22c6)$. In general, *M*^{⋆} ≠ *M*_{max}, since unlike the training error $M\u21a6L(wM,Xt)$, the validation error $M\u21a6L(wM,Xv)$ is not monotonically decreasing but rather generically decreases up to *M*^{⋆} and then increases after *M*^{⋆} (see the green dashed curves in Fig. 5). We remark that the value *M*^{⋆} is the best estimate of the optimal model for testing on states similar to those in the validation set; in other words, it balances the model between under-fitting and over-fitting. However, states for the extrapolation tasks (Sec. II B) are not necessarily similar to the states in the validation set. Models that extrapolate must be formulated in such a way that when trained, the cross-validation procedure selects a complexity *M*^{⋆} that captures underlying physical phenomena while ignoring non-physical patterns in the training and validation data. We achieve such a result by leveraging the universal wavelet scattering features (see Sec. II D) and by careful partitioning of the training and validation sets, which are described in more detail in Sec. III A.

Computationally, solving (3) is NP-Hard. We thus solve a relaxed problem that obtains the weights in a greedy fashion using the same orthogonal least squares (OLS) approach described by Hirn, Mallat, and Poilvert.^{40} While the resulting weights do not in general solve (3) due to the relaxation, the OLS approach is optimal among greedy approaches since it reduces the mean square error by the maximum amount with each greedy step. Otherwise, using a greedy approach has two benefits. First, it is significantly more efficient; solving (3) requires $O(dM)$ floating point operations, whereas the greedy approach requires *O*(*dM*) floating point operations. Second, it is an iterative process that can be solved using a QR factorization, which, in this case, means that after *M* nonzero weights are selected, the computation for *M* + 1 nonzero weights requires solving only for the one additional weight. This lets us efficiently construct an array of models for 1 ≤ *M* ≤ *M*_{max}, which, in turn, enables an efficient solution to the cross-validation problem given in (4).

Finally, we augment the learning process by leveraging empirical bootstrapping and feature bagging. Given an initial database of *n* states and their energies (that does not include the withheld testing set), the empirical bootstrap algorithm samples the database with replacement to obtain the training set. Those states not selected for the training set are placed in the validation set. This approach allows us to construct many different models from one database, which are then averaged. The resulting averaged model, which is still a linear model over the representation Φ(*x*), is superior to any one individually fitted model since the averaging reduces random fluctuations in the fitting process that result from spurious patterns in a single training set. In order for this averaging process to have a maximum effect, the weights of the individual models must be as uncorrelated as possible. Feature bagging, which is a prominent component of random forests, decorrelates the models by restricting the greedy selection at each greedy step. In particular, at each greedy step in the OLS algorithm, approximately $d+1$ features are sampled without replacement from among the full set of *d* features in Φ(*x*) plus the bias term, minus the features that have already been selected up to that point. The OLS algorithm at each step must then select from among the sampled features, which due to the randomness in the feature sampling, results in models that are significantly less correlated. Indeed, in our own numerical experiments, the most significant features selected with empirical bootstrapping, but without feature bagging, are very often identical. While restricting the number of possible features at each greedy step means that each model has larger error on the training set, the aggregated average model improves on the test set.^{53}

### D. Atomic orbital wavelet scattering

We now describe how we construct the feature vector Φ(*x*). Since our regression $E\u0303(x;w)=w0+\u27e8w,\Phi (x)\u27e9$ is a linear model over Φ(*x*), the representation Φ(*x*) should have the same properties as *E*(*x*). In particular, as has been noted by several machine learning papers for many-particle physics,^{7} *E*(*x*) is invariant to translations, rotations, and reflections (i.e., isometries) of the atomic coordinates ${Rk}k=1Nx$, and therefore, Φ(*x*) should also be invariant to isometries. Additionally, *E*(*x*) is independent of the atom indexation in the list *x*, and thus, Φ(*x*) must be invariant to index permutations. Like the formation energy, Φ(*x*) should be a continuous function of the atomic coordinates ${Rk}k=1Nx$, which is particularly important since our data consists of structural relaxation paths. Furthermore, since we are fitting total energy per atom, the feature values should be independent of the number of atoms in the unit cell. Finally, the amorphous Li_{α}Si systems are periodic, and thus, the features Φ(*x*) must be invariant to equivalent representations of *x* with respect to the periodicity of the state.

In addition to those basic physical properties, electronic interactions encoded by the molecular Hamiltonian are inherently multiscale in nature and thus range over a number of length scales. The resulting total energy of the system is a nonlinear function of these length scales. We thus seek a representation Φ(*x*) that on the one hand can separate the length scales of the system, while on the other hand can recombine them in a nonlinear fashion. Our approach is to use the atomic orbital wavelet scattering transform of Brumwell *et al.*,^{37} which itself is an adaptation of the three-dimensional solid harmonic wavelet scattering transform proposed by Eickenberg *et al.*^{41,42} for molecules. We review its construction in this section, emphasizing certain nuances specific to atomic states with periodic unit cell and more specifically to Li_{α}Si systems.

The wavelet scattering transform is based upon the wavelet transform of three-dimensional signals. We identify the state *x* with such a signal, which will encode permutation invariance into all representations derived from this signal. Let $Qx\u2282R3$ be the unit cell of the state *x*, which in the case of all systems we consider is a cube. We encode the state *x* as a superposition of Dirac delta functions,

We use the notation $\rho xf$ because one can think of it as a type of nuclear density for the state *x*, but in which we allow some additional flexibility. In particular, the function $f:N\u2192R$ encodes a weight that responds based on the type of atom. We use five different functions *f*, which can be thought of as channels of the state *x*, similar to how a color image has red, green, and blue channels. The five channels we use are lithium, silicon, valence, ionic, and kinetic. The lithium and silicon channels partition the state *x* along atom species, whereas the valence and core channels separate the state *x* according to electron type. The kinetic channel, inspired by the Thomas–Fermi–Dirac–von Weizsacker model in quantum chemistry, encodes a different scaling in the number of electrons than the other four channels. The precise definitions of these channels are given in Appendix A.

A simple translation and rotation invariant representation of *x* is obtained by summing over $fx=(f(Zk))k=1Nx$ for each channel *f*,

We compute four different types of summations by taking *q*th powers of *f*(*Z*_{k}) for

These powers are also inspired by the Thomas–Fermi–Dirac–von Weizsacker model and will be discussed more later in this section. By dividing by *N*_{x}, the features *ϕ*_{γ}(*x*) are also invariant to system size. Finally, since they are constant with respect to the atomic coordinates, they are trivially continuous functions of them.

The zero order features (5) satisfy all the required invariance properties, but they remove all geometric information from the state *x* and are constant for a given lithium–silicon ratio *α*. We compute complimentary features that separate the length scales of *x* and encode multiscale geometric invariants. These features will be derived from a three-dimensional wavelet transform of $\rho xf$, which gives a multiscale representation of the signal. Following Ref. 37, we define a family of atomic orbital wavelets $\psi n,\u2113m:R3\u2192C$,

where $Y\u2113m$ is the usual spherical harmonic function and *Q*_{n,ℓ} is a radial function defined as

Here, *C*_{n,ℓ} is a normalizing constant and the functions $Lk\nu $ are the associated Laguerre polynomials. We refer to the family of wavelets $\psi n,\u2113m$ as atomic orbital wavelets since they mimic the shape of the hydrogen atomic orbitals. Indeed, (*n*, *ℓ*) = (1, 0), (2, 0), and (2, 1) corresponds to the 1*s*, 2*s*, and 2*p* orbitals, respectively, with similar correspondences for larger values of *n* (see Fig. 2). While the hydrogen atomic orbitals have exponential scaling, here, we use a Gaussian function, which mimics the well-known Gaussian type orbitals from the quantum chemistry literature.

We use these wavefunctions as wavelets, though, in which the wavelet transform dilates the wavelet at different dyadic scales 2^{j},

which increases the size of the wavelet. Let *s*_{x} be the side length of the cubic unit cell $Qx$. Unlike the molecular systems studied by Eickenberg *et al.*,^{41,42} here, *x* is a periodic system and so we compute a periodic wavelet transform of the density $\rho xf$ using a circular convolution ⊛ defined as

The operation * is the usual convolution over $R3$, which for the nuclear density-type function $\rho xf$ yields

Examining (8) and (9), we have the following interpretations. The standard convolution $\rho xf*\psi j,n,\u2113m$ emits the wavelet $\psi j,n,\u2113m$ from the location of each atom in the unit cell of *x*, with a strength given by *f*(*Z*_{k}). The interference patterns encoded by these emissions encode geometric information of the state of the system at different scales 2^{j}, which we shall aggregate to form multiscale, invariant features. The circular convolution $\rho xf\u229b\psi j,n,\u2113m$ wraps the wavelets periodically in the unit cell $Qx$, thus giving us a periodic function that respects the periodicity of the system. The parameter *β* in (7) in the definition of *Q*_{n,ℓ}, which encodes the smallest wavelet scale, is selected so that $\psi n,\u2113m(u\u2212Rk)$ and $\psi n,\u2113m(u\u2212Rl)$ interfere only if |*R*_{k} − *R*_{l}| is small, i.e., if the atoms located at *R*_{k} and *R*_{l} are neighboring atoms. The maximum scale 2^{J−1} is selected so that the size of the wavelet $\psi J\u22121,n,\u2113m$ is on the order of the maximum side length *s*_{x} of the unit cells $Qx$ across all training states, thus enabling the corresponding wavelet coefficients $\rho xf\u229b\psi J\u22121,n,\u2113m$ to encode macroscopic patterns in the arrangement of the atoms in *x*. These choices allow us to capture short-range interactions in features derived from wavelet filters with small *j*, while wavelet filters with large *j* capture interactions across a larger span of the system.

Convolution operators are, in general, translation equivariant, but not rotation equivariant. However, the atomic orbital wavelet filters are designed to admit a rotationally equivariant representation by combining information across the magnetic quantum number *m* through the following nonlinear transform *σ*:

The collection of maps $\sigma (\rho xf\u229b\psi j,n,\u2113)$ constitutes a multiscale, isometry equivariant representation of the state *x* (see Fig. 3 for plots of these maps). We note that the wavelet filters with *ℓ* = 0 are radially symmetric, and therefore, their convolutions are equivariant to rotations with or without the nonlinear transform *σ*.

Equivariant representations yield invariant representations via integral operators that integrate over the space variable *u*. We compute $Lq(Qx)$ norms, for the same four *q* values in (6), of the maps $\sigma (\rho xf\u229b\psi j,n,\u2113)$,

The selection of powers *q* is motivated by the Thomas–Fermi–Dirac–von Weizsacker model in quantum chemistry in which the 4/3 scaling is used to approximate the exchange energy, the 5/3 scaling is used to approximate the kinetic energy, and the power of 2 encodes an additional part of the kinetic energy and pairwise Coulombic interactions (see also Ref. 40). The power *q* = 1 is also used since these integrals scale linearly with ∑_{k}*f*(*Z*_{k}).

We normalize the norms (10) to be invariant to system size, which defines first order wavelet scattering features,

In numerical experiments reported on in Sec. III, there are five channels *f*, five scales *j* (i.e., *J* = 5), *n* = 3 (that is, a single *n* is used), 0 ≤ *ℓ* < *n* = 3, and there are the four *q* values specified in (6), which yields 300 first order features. The smallest wavelet scale (at *j* = 0) is set with the parameter *β* in (7) to be large enough to ensure the nontrivial overlap between nearest neighbor atoms in the database, which corresponds to a length scale of 0.9 Å. The largest length scale (at *j* = *J* – 1 = 4) is chosen so that the wavelet supports envelop the largest unit cell in the database, which has side length 11.8 Å. These first order wavelet scattering features encode isometry and size invariant descriptions of the state *x* across multiple length scales 2^{j} for 0 ≤ *j* < *J*. Furthermore, since the atomic orbital wavelets $\psi n,\u2113m$ are continuous functions, the resulting maps $\sigma (\rho xf\u229b\psi j,n,\u2113)$ are continuous functions of the atomic coordinates, which means that their integrals are as well. The use of circular convolution ensures that the maps are invariant to the representation of *x* with respect to its periodicity.

First order wavelet scattering features are complemented by second order wavelet scattering features that incorporate multiple length scales of *x* into a single feature. They are computed by iterating the nonlinear wavelet transform, which couples the scales $2j1$ and $2j2$,

The second order maps (11), which resemble the architecture of a convolutional neural network as well as aspects of tensor field networks,^{24} are equivariant with respect to translations and rotations, and extract coupled geometric information at the scales $2j1$ and $2j2$ from the state *x*. Figure 4 plots the examples of these maps, which are noticeably different than their first order counterparts.

Second order invariant wavelet scattering features are computed analogously to the first order features, by taking normalized $Lq(Qx)$ norms of the equivariant maps,

Using the same parameters as the first order features, plus setting *n*_{2} = 3 and 0 ≤ *ℓ*_{2} < *n*_{2} = 3 and max(0, *j*_{1} − 1) ≤ *j*_{2} < *J* = 5, we see that there are 3420 second order wavelet scattering features. They thus greatly expand the representation of the state *x* and satisfy all the required invariance properties.

We collect the zero, first, and second order invariant wavelet scattering features into a single feature representation $\Phi (x)\u2208R3740$. Let *λ* = (*j*, *n*, *ℓ*) denote the triplet of wavelet parameters. With this feature representation, our energy model (2) can be written as

## III. NUMERICAL RESULTS

We report empirical errors for formation energy prediction and related tasks. Section III A presents interpolation type test errors on predicting relaxation path formation energies for amorphous Li_{α}Si of a similar nature to those in the training set. Sections III B–III D report extrapolation task errors, focusing on diffusion barrier estimation, formation energies of amorphous states with larger unit cells, and bulk modulus prediction, respectively. Appendix C compares these results with results obtained by varying the training procedures and model formulation presented in Sec. II. In particular, it considers a non-randomized training procedure that does not utilize bootstrapping and feature bagging (originally described in Sec. II C), as well as an energy model that utilizes only zero and first order wavelet scattering features (defined in Sec. II D). In each case, we see the advantage of the full algorithm.

### A. Training and testing on amorphous dataset

Recall from Sec. II A and Fig. 1, we have a training database of 90 252 amorphous Li_{α}Si structures with DFT computed energies spread across 37 concentrations *α*, ranging from 0.1 to 3.75. These 90 252 structures correspond to 370 relaxation paths of an initial set of 370 high energy states, with 10 relaxation paths per concentration. Unlike many standard machine learning approaches that obtain a training set by uniformly sampling data points from the initial database, here we uniformly randomly sample relaxation paths. Using fivefold cross-validation, we randomly partition the relaxation paths into five sets of 74 relaxation paths with two paths per concentration in each of the sets of 74. We place four of these sets, 296 relaxation paths total, in the training/validation set, and one set of 74 paths in the test set. We rotate through using each set separately as a test set, meaning that we carry out all numerical experiments five times, each time with a different training/validation and test set split.

We select the training/validation/testing sets according to relaxation data paths, and not structures, because it leads to more physically realistic training and testing scenarios. In particular, it is reasonable to assume that whole relaxation paths, computed via DFT, would be used to train a model, which is then used to compute relaxation paths of new high energy states. Here, the validation and testing sets are simpler in that we require the model to predict all formation energies along a new relaxation path in which the structures along the path are given. Nevertheless, empirical results indicate that this training paradigm significantly restricts the degree to which the machine learned model can fit non-physical spurious patterns in the data. We leave for future work developing a model that can predict the entire relaxation path starting with the only the highest energy state.

Using the 296 training relaxation paths, we carry out the model fitting algorithm described in Sec. II C. For the training set, we randomly select according to a uniform distribution, with replacement, 296 relaxation paths from the training set. Those paths selected more than once are repeated in the training set with the number of copies equaling the number of times the path was selected. Those paths that are not selected are placed in the validation set. The sparse linear model is trained using the greedy OLS algorithm with randomized feature bagging, with the number of features *M* ranging from *M* = 1 to *M* = *M*_{max} = 1000. The optimal number of features *M* = *M*^{⋆} is selected by minimizing the loss on the validation set. This procedure is repeated 100 times, resulting in 100 sparse linear models of the form (12), which are averaged together to yield the final model.

This final model is evaluated on the withheld test set. Figure 5 depicts the training, validation, and testing errors as a function of the number of model features *M*. It indicates that best models have, generally, between 64 and 256 features, with an average of 121 features per model, a small number given that there are ∼70 000 training structures. Furthermore, the validation curve closely follows the test curve, indicating that our cross-validation procedure is nearly optimal for these test data. The average root mean squared error (RMSE) and the average mean absolute error (MAE) over the five test folds, along with the standard deviation, are reported in the first row (relaxation paths) of Table I. Despite the small number of features, the RMSE is 7.44 meV/atom and the MAE is 5.52 meV/atom, which is comparable to the results reported by Onat *et al.*^{27} and Artrith, Urban, and Ceder,^{25} both of which used neural networks, and is small enough to be of use in materials science applications. However, the model developed here is significantly simpler than neural network models, being a linear model over multiscale, invariant features that utilize a universal set of filters. As such, the model is adept at generalization, as reported in Subsections III B and III C.

. | RMSE (meV/atom) . | MAE (meV/atom) . |
---|---|---|

Relaxation paths | 7.44 ± 0.49 | 5.52 ± 0.34 |

Diffusion | 12.3 ± 0.50 | 11.7 ± 0.51 |

Large states | 9.54 ± 0.25 | 6.81 ± 0.23 |

Bulk modulus | 12.8 ± 1.36 | 8.92 ± 0.68 |

. | RMSE (meV/atom) . | MAE (meV/atom) . |
---|---|---|

Relaxation paths | 7.44 ± 0.49 | 5.52 ± 0.34 |

Diffusion | 12.3 ± 0.50 | 11.7 ± 0.51 |

Large states | 9.54 ± 0.25 | 6.81 ± 0.23 |

Bulk modulus | 12.8 ± 1.36 | 8.92 ± 0.68 |

### B. Extrapolate: Diffusion in amorphous system

One important application of atomistic simulation is the study of atomic migration from site to site. The energetic barrier to migration determines diffusion constants and ionic conductivity.^{54} The diffusion process may be simulated by directly tracking the mean square displacement using molecular dynamics or by calculating the migration barrier and using the Nernst–Einstein relationship. The first step in the explicit calculation is to find the minimum-energy path (MEP) for an atom to travel between two stable sites. This is typically done using optimization techniques such as the Nudged Elastic Band (NEB) method.^{51} The barrier is defined as the energy difference between the stable position and the highest-energy position (saddle point) along the MEP.

There are a number of reasons why calculation of diffusion barriers may present a challenge for our ML model. Our present models do not predict forces, so they cannot be used with the NEB for prediction of the path itself. We therefore simply predict energies along the DFT-calculated MEP. A more fundamental challenge is the fact that the transition state structure, with one atom in a high-energy state and the rest in relatively low-energy states, is qualitatively different from the training items in the amorphous Li_{α}Si dataset. Calculation of diffusion barrier is thus an extrapolation task. Furthermore, there is only one diffusing atom in the simulation box during calculation of the diffusion barrier. This means that energy per atom is no longer the most relevant measure of error. Instead, total energy differences between similar structures along the MEP are the relevant quantity. Cancellation of systematic errors in DFT allows the calculation of energy differences along diffusion paths with much higher accuracy than would be suggested based on the accuracy of the method in total energy per atom.^{54} It remains to be seen if similar cancellation of errors can improve the accuracy of diffusion barriers predicted by an ML model.

To test the extrapolation of our model to diffusion barriers, void spaces were identified in Li_{0.2}Si and Li_{0.5}Si by Voronoi analysis. Candidate endpoint structures were created by moving nearby lithium atoms into the voids and relaxing the resulting structure while keeping the target lithium atom fixed. Six endpoints were identified in which the void space was a local optimum for the lithium atom and in which the relaxation for the rest of the structure was minimal. These endpoints were then used together with the original Li_{α}Si structures as the basis for NEB calculations. The structures along the resulting NEB path were then passed to the ML model for comparison with the DFT results.

The learning curves are shown in Fig. 6. The RMSE for the diffusion path structures is less smooth than the RMSE for test folds consisting of the relaxation paths in the amorphous Li_{α}Si data. Table I, second row (diffusion), shows the RMSE and MAE of the per atom energy across all diffusion barrier structures. The RMSE for these structures is about 12.3 meV/atom, which is worse than on the relaxation path test but by less than a factor of two. Nevertheless, reduced accuracy is expected given the extrapolative nature of the task.

However, these errors are not the diffusion barrier errors, which is the quantity of interest. The energies along the diffusion paths are shown in Fig. 7. The first row plots the absolute energies for both the DFT calculation and the model prediction. The second row shifts the DFT and predicted energy curves to both start at zero, to more easily compare and read off the barriers, which are given in Table II. The third row of Fig. 7 plots the predicted energy curves as a function of the number of model features *M*, showing the learning rate of the model with respect to this task. The plots indicate that even with a small number of features, for example, *M* = 21 or *M* = 41, the energy curve and resulting barrier is qualitatively correct, with additional features serving to refine the curves and better align the total energies.

Path . | Barrier (ML model) . | Barrier (DFT) . |
---|---|---|

1 | 0.228 | 0.226 |

2 | 0.819 | 0.341 |

3 | 2.256 | 2.139 |

4 | 0.230 | 0.402 |

5 | 2.613 | 2.224 |

6 | 0.326 | 0.354 |

Path . | Barrier (ML model) . | Barrier (DFT) . |
---|---|---|

1 | 0.228 | 0.226 |

2 | 0.819 | 0.341 |

3 | 2.256 | 2.139 |

4 | 0.230 | 0.402 |

5 | 2.613 | 2.224 |

6 | 0.326 | 0.354 |

Visual inspection of the energy along the diffusion paths shows that much of the error is systematic. The Li_{0.2}Si structures contain 60 atoms, so 12.3 meV/atom corresponds to 0.74 eV in total energy. If these errors were random, we would expect at least 0.74 eV error in prediction of the diffusion barrier. However, the curves show that the ML model can successfully distinguish between small-barrier paths and large-barrier paths, and the MAE in barrier prediction is 0.20 eV. While there is certainly room for improvement, we believe that these data show evidence that the ML model is able to partially capture the physics involved in the diffusion process.

### C. Extrapolate: Larger amorphous systems

It is desirable for an energy-predictor to generalize to structures in simulation cells with a different size than the training set so that it can be applied to simulation cells large enough to contain the geometries of experimental interest. As system size increases, the computation becomes challenging to carry out with DFT, but the wavelet scattering transform and linear regression scales efficiently with system size (for more details, see Appendix B), and we are thus much less inhibited by large systems.

As discussed in Sec. II B, the data for this task were generated by two different methods: “from scratch” and “tiled.” The learning curves for each are shown in the right panel of Fig. 8. Since our model predicts global energies per atom, it gives the exact same result for a system that is simply periodically duplicated. This suggests that the predictions made when extrapolating to tiled systems that have been perturbed should maintain reasonable accuracy. This figure agrees with this conjecture since the corresponding error lines follow a similar trajectory to the line for the small system test data. This figure also shows that simpler models are favored for the independently relaxed AIMD-generated systems (the “from scratch” systems). These systems are less likely than the tiled systems to be similar to examples from the training set. The rapid increase in error on large systems for higher model complexity illustrates the sensitivity of the task to over-fitting. Nevertheless, as depicted in the left panel of Fig. 8, the optimal number of features for interpolation on amorphous Li_{α}Si data is approximately the same as the optimal number of features for energy predictions on the collection of states with larger unit cells. From Table I (third row, “large states”), we see that while the prediction errors are higher for the larger systems, it is not an unreasonable increase from errors on the smaller systems.

### D. Extrapolate: Bulk modulus

Elastic properties are another important output of atomistic simulations. These are typically calculated by applying small strains to the system in question and fitting elastic constants to the energy-vs-strain.^{54} This too is an extrapolation task for our model because uniformly expanded or compressed structures do not appear in the training set. Testing data for this task were generated based on the lowest-energy structure at each concentration by applying hydrostatic strain and varying the side-length of the simulation box from −9% to 9%.

The energy vs volume of the strained structures (Fig. 9) shows remarkable agreement between DFT and the ML model. The RMSE curves shown in Fig. 10 and the average errors in the last row of Table I (bulk modulus) are also quite low. The predicted bulk modulus of the structures is shown to decrease as a function of lithium content in Fig. 10. The ML method accurately captures lithiation-induced softening of the silicon. Energy-vs-strain curves along different deformation paths may also be used for the estimation of additional thermodynamic parameters, including Young’s modulus, shear modulus free energies, and heat capacities through the Debye–Grüneisen model.^{55,56}

## IV. CONCLUSION

We have demonstrated a machine-learning model based on atomic orbital wavelet scattering that can achieve an accuracy of 5.52 meV/atom (mean absolute error) in energy on the prediction of amorphous Li_{α}Si structures. We have tested the generalizability of this energy predictor on three extrapolation tasks: diffusion barriers, large systems, and bulk moduli. As expected based on the nature of regression-based ML, if care is not taken to avoid over-fitting, the model performs poorly on these extrapolation tasks. However, we have shown that a statistically based feature randomization procedure, using the universal wavelet scattering features, can significantly enhance performance on the extrapolation tasks without significant reduction in performance on the interpolative test set.

Although the present work is limited to amorphous Li_{α}Si, it provides general lessons for those wishing to apply ML models to new problems in chemical physics. This is often a daunting task because ML is generally an interpolative technique. Before a model can be used, it must be trained on large amounts of data similar to the task at hand. If the problem is new or challenging to solve by conventional means, the generation of these data can be quite difficult. Extrapolation from the well-known systems may be possible, but off-the-shelf ML models do not extrapolate well. However, extrapolation performance can be greatly improved by taking a different approach to training the ML model.

Simpler models generalize better. In our model, “simplicity” corresponds to the number of features (wavelet scattering coefficients) used and the fact that these features provide unsupervised descriptions of atomic states, but the concept is general. Validation sets are often used in machine learning to choose a model complex enough to describe the training data but simple enough to avoid over-fitting. By utilizing randomized feature selection and the aggregation of an ensemble of models (bootstrapping), we obtain a robust and accurate model when applied to the aforementioned extrapolation tasks. From this perspective, typical ML metrics such as testing and validation error are not the only criteria for a “good” model.

In order to apply these principles to harder extrapolation tasks and to incorporate *a priori* uncertainty quantification, it will be necessary to leverage statistical methods that allow one to predict which properties will be difficult for the model, suggesting possibilities for efficient training set expansion to further improve generalizability. Training set expansion could be automated using “active learning,” allowing a model to improve itself based on problems presented to it. The linear regression model over unsupervised nonlinear wavelet scattering features is well positioned for such future work, as it is relatively simple (compared to fully supervised neural networks) to incorporate new data on the fly.

In future work, we will extend the model to include force predictions. As described in (12), our energy predictions are given as linear combinations of features that are each dependent on the atomic positions. The features are differentiable and we can carry out this differentiation analytically, which opens up fast force computations and fits the weights of our model to force data. In this case, all the methods of Sec. II C would still be applicable. This has a computational advantage over features with learned filters, which would likely use automatic differentiation. Including forces in the weight learning process could affect the weights that are learned for the energy predictions as each system would have 3*N*_{x} more points of training data. We expect that this will boost the generalization ability of the model. In this regard, training on forces will act as a regularizer for energy predictions.

## SUPPLEMENTARY MATERIAL

See the supplementary material for voltages vs Li/Li^{+} and radial distribution functions for Si–Si, Li–Si, and Li–Li pairs.

## AUTHORS’ CONTRIBUTIONS

P.S. and M.W.S. contributed equally to this work.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## ACKNOWLEDGMENTS

This work was supported, in part, by the Defense Advanced Research Projects Agency (Young Faculty Award No. D16AP00117 to M.H., supporting P.S., X.B., J.L., and K.J.K.), the Alfred P. Sloan Foundation (Sloan Fellowship Grant No. FG-2016-6607 to M.H.), the National Science Foundation (Grant No. 1832808 to Y.Q., Grant No. 1620216 to M.H., and CAREER Award No. 1845856 to M.H.), and the Michigan State University Foundation (Strategic Partnership Grant to Y.Q.). This work used computational resources provided by the Institute for Cyber-Enabled Research at Michigan State University. X.B. also acknowledges support from the Institute for Pure and Applied Mathematics at UCLA, where he was in residence during fall 2019.

### APPENDIX A: CHANNEL DEFINITIONS

In this appendix, we give precise definitions of the five input channels:

Lithium channel:

*f*(*Z*_{k}) =*Z*_{k}if*Z*_{k}= 3 and*f*(*Z*_{k}) = 0 otherwise.Silicon channel:

*f*(*Z*_{k}) =*Z*_{k}if*Z*_{k}= 14 and*f*(*Z*_{k}) = 0 otherwise.Valence channel:

*f*(*Z*_{k}) = No. of valence electrons.Ionic channel:

*f*(*Z*_{k}) = No. of core electrons.Kinetic channel: $f(Zk)=Zk$.

### APPENDIX B: FAST WAVELET SCATTERING COMPUTATIONS

In this appendix, we describe how to efficiently compute the wavelet scattering features described in Sec. II D.

In practice, all computations are carried out over a discrete sampling of the unit cell $Qx$. Such a sampling is a three-dimensional grid $Gx\u2282Qx$ with *L*_{x} grid points along each dimension. Due to the multiscale sizes of the wavelet filters $\psi j,n,\u2113m$, a direct computation of the circular convolution (8) over the grid $Gx$ will require $O(Lx6)$ floating point operations. This computational cost can be significantly reduced by carrying out these computations in frequency space.

Recalling (9), the Fourier transform of $\rho xf*\psi j,n,\u2113m$ is

where $F[h](\omega )=h^(\omega )$ is the Fourier transform of the function $h\u2208L1(R3)$. The Fourier transform of $\psi n,\u2113m$ can be computed analytically,

Therefore, (B1) can be evaluated directly for any $\omega \u2208R3$. We do so in a box $\u2212\omega x,\omega x3$, where

The maximum frequency *ω*_{x} is chosen so that the essential support of $\psi ^n,\u2113m$ is contained within $\u2212\omega x,\omega x3$, which, in turn, determines the number of grid points *L*_{x} along each side length of the unit cell $Qx$. This maximum frequency depends on the wavelet width *β* and (weakly) on the (*n*, *ℓ*) parameters. Evaluations within the box $\u2212\omega x,\omega x3$ are restricted to a grid $\Omega x\u2282\u2212\omega x,\omega x3$ with grid spacing 2*π*/*s*_{x}, which yields *L*_{x} frequency grid points. In particular, we compute, via direct numerical evaluation, a tensor $\Psi j,n,\u2113m\u2208CLx\xd7CLx\xd7CLx$ defined as

Due to the discretization in (B2), taking the inverse fast Fourier transform (iFFT) of $\Psi j,n,\u2113m$ recovers the circular convolution $\rho xf\u229b\psi j,n,\u2113m$ evaluated on the spatial grid $Gx$,

The direct computation of $\Psi j,n,\u2113m$ requires $CNxLx3$ floating point operations, whereas the iFFT calculation requires $CLx3\u2061logLx$ floating point operations. Therefore, the total cost is reduced to $O((Nx+logLx)Lx3)$.

First order wavelet scattering features are estimated by applying the pointwise nonlinear operator *σ* to (B3) and estimating the $Lq(Qx)$ integrals with a Riemann sum approximation. Second order wavelet scattering features are computed by taking the fast Fourier transform (FFT) of $\sigma (\rho xf\u229b\psi j,n,\u2113)Gx$ and computing the second circular wavelet convolution via frequency multiplication with a direct evaluation of $\psi ^n2,\u21132m2(2j2\omega )$ on the grid *ω* ∈ Ω_{x}, followed by another iFFT, application of *σ*, and Riemann sum. The cost of each second order feature, given that (B3) must already be computed for the first order features, is $O(Lx3\u2061logLx)$.

### APPENDIX C: ALTERNATIVE MODELS COMPARISON

In this appendix, we describe two models similar to the one used in the main body of this text and compare the results.

The model used in the main body (hereafter referred to as the full model) has numerical results on the test set summarized in Table I, and the training method is described in Sec. II C. The results of two alternative models on the various tasks of this work are listed in Tables III and IV. The test folds of relaxation paths and Li_{α}Si states of the diffusion, large states, and bulk modulus states are identical in all three model comparisons.

. | RMSE (meV/atom) . | MAE (meV/atom) . |
---|---|---|

Relaxation paths | 8.04 ± 0.59 | 5.99 ± 0.39 |

Diffusion | 11.8 ± 0.48 | 9.51 ± 0.95 |

Large states | 14.0 ± 0.68 | 10.2 ± 0.42 |

Bulk modulus | 39.5 ± 4.82 | 25.1 ± 2.92 |

. | RMSE (meV/atom) . | MAE (meV/atom) . |
---|---|---|

Relaxation paths | 8.04 ± 0.59 | 5.99 ± 0.39 |

Diffusion | 11.8 ± 0.48 | 9.51 ± 0.95 |

Large states | 14.0 ± 0.68 | 10.2 ± 0.42 |

Bulk modulus | 39.5 ± 4.82 | 25.1 ± 2.92 |

. | RMSE (meV/atom) . | MAE (meV/atom) . |
---|---|---|

Relaxation paths | 7.50 ± 0.39 | 5.64 ± 0.28 |

Diffusion | 11.6 ± 1.01 | 11.0 ± 1.03 |

Large states | 9.78 ± 1.98 | 6.60 ± 0.81 |

Bulk modulus | 16.6 ± 4.91 | 11.5 ± 3.55 |

. | RMSE (meV/atom) . | MAE (meV/atom) . |
---|---|---|

Relaxation paths | 7.50 ± 0.39 | 5.64 ± 0.28 |

Diffusion | 11.6 ± 1.01 | 11.0 ± 1.03 |

Large states | 9.78 ± 1.98 | 6.60 ± 0.81 |

Bulk modulus | 16.6 ± 4.91 | 11.5 ± 3.55 |

The first alternative model (hereafter the 0–1 model) is trained identically to the full model with five test folds (the test folds are identical for both models) and 100 randomly selected sets of relaxation strings (with replacement) for training, but with only zero and first order wavelet scattering features available for selection in training. This results in a total of 321 features (with bias) to select from compared to 3741 in the full model. Note that at each step of the greedy OLS training, the best feature is chosen from $321\u224817$ features that are randomly selected from the remaining unselected features compared to $3741\u224861$ in the full model. The model size *M*^{⋆} averaged over all 500 permutations of the training data is 121 in the full model and 108 for the 0–1 model, with standard deviations of 64 and 38, respectively. The numerical results for the 0–1 model are listed in Table III. The performance is comparable on the relaxation paths. On the diffusion states, the 0–1 model has slightly better performance in RMSE and MAE, but the standard deviation in MAE across the fivefolds is nearly double the full model. Furthermore, inspection of the barriers computed by the 0–1 model reveals that they are, in fact, slightly worse than the full model. The performance of the 0–1 model is significantly worse than the full model on the large and bulk states, again with a large spread in errors. This indicates that we get a statistically significant benefit by including second order features in the models.

The second alternative model (hereafter the non-randomized model) has the same features available as the full model and the same five test folds as the prior two models. The training set is randomly partitioned into four equally sized sets (selection by relaxation strings) with a model trained for each selection of a set as validation and the remaining three for training (i.e., nested fivefold cross-validation, as in Ref. 57). This results in four trainings for each test fold for a total of 20 models trained compared to the 500 trainings (five test sets with 100 training/validation splits) of the full model. This non-randomized procedure ensures uniform representation of the strings in the training, validation, and testing folds. During training of the non-randomized model, the OLS algorithm seeks the next best feature at each step from all remaining features rather than randomly selecting a subset of features to choose from as in the prior two models. The average value of *M*^{⋆} is 153 for the non-randomized model with a standard deviation of 82 across the 20 trainings. The performance of this model is similar to the full model on relaxation paths but with a significantly larger spread of the errors between models on diffusion, large states, and bulk states. Thus, the chance of a catastrophic error is higher. Furthermore, the RMSE and MAE are significantly larger on the bulk states. This indicates that the model over-fit the training data and did not generalize as well to the extrapolation tasks. Randomized training in the full model appears to mitigate the possibility of over-fitting.

## REFERENCES

_{x}Si using machine-learning-assisted sampling with an evolutionary algorithm

_{1−x}Sn

_{x}films