In this article, we present a systematic study on developing machine learning force fields (MLFFs) for crystalline silicon. While the mainstream approach of fitting an MLFF is to use a small and localized training set from molecular dynamics simulations, such a set is unlikely to cover the global features of the potential energy surface. To remedy this issue, we used randomly generated symmetric crystal structures to train a more general Si-MLFF. Furthermore, we performed substantial benchmarks among different choices of material descriptors and regression techniques on two different sets of silicon data. Our results show that neural network potential fitting with bispectrum coefficients as descriptors is a feasible method for obtaining accurate and transferable MLFFs.
I. INTRODUCTION
Atomistic modeling methods such as molecular dynamics (MD) or Monte Carlo (MC) play important roles in investigating time-dependent physical and chemical processes. In these methods, energy and forces need to be recalculated iteratively as the atomic configuration evolves. Consequently, atomistic simulations crucially depend on the accuracy of the underlying potential energy surface (PES). Modern quantum mechanical modeling based on density functional theory (DFT) can consistently generate accurate energetic descriptions for many solid systems.1 However, MD simulations based on DFT suffer from a highly demanding computational cost; they are only suitable for systems of up to a few thousand atoms over tens of picoseconds. On the other hand, the classical force field (FF) method is widely employed to simulate materials with millions of atoms over hundreds of nanoseconds. This method has enabled many explorations that revealed interesting physical and chemical phenomena.2–4 However, the construction of a reliable PES with the classical FF method remains problematic. In developing a classical FF, a set of parameters is fitted to a few DFT and/or experimental data points to compute the potential energy of a system, given an analytic functional form. Due to the constraints of the functional form and the limitations of the training dataset, the accuracy of classical FFs is often unreliable.
Meanwhile, in silico materials discovery requires an accurate yet efficient energy model to screen materials’ properties in a high-throughput manner. In the past decade, discoveries of new materials have been strongly driven by advanced structure prediction methods such as crystal structure prediction (CSP)5 and data mining.6 In both cases, the DFT method is used to perform geometry relaxation and energy evaluation. Despite the power of modern supercomputers, the computational cost of DFT simulation remains a bottleneck for many important and fascinating puzzles in materials science. Ideally, an approach that preserves DFT accuracy at a fraction of the computational cost is desirable.
To resolve the limitations described above, many efforts have been devoted to establishing the machine learning force field (MLFF) method. Compared to the DFT method, the MLFF approach demands a far lower computational cost (2–4 orders of magnitude lower) while retaining accuracy at the DFT level. The power of the MLFF method is illustrated by many applications to a range of materials.7–10 A large amount of DFT data (structures, energies, forces, and stresses) is required to develop an accurate MLFF. The structures must be represented by appropriate descriptors (high-dimensional real-valued arrays) in order to identify the similarities and/or dissimilarities in the atomic environments. In MLFF fitting, a variety of regression techniques are used to correlate the descriptors with energy/forces. Several machine learning techniques for developing MLFFs have been successfully implemented: linear/polynomial regression,11–14 Gaussian process regression,15,16 and high-dimensional neural network potentials (NNPs).17,18 A benchmark study of these machine learning methods has been carried out, inspecting performance and cost for many elemental systems.19 Nevertheless, many of the published MLFFs lack transferability/versatility, which is crucial in crystal structure prediction.
In the past few years, many researchers have attempted to improve transferability for many different systems.10,20–25 Two approaches, advanced sampling and structure prediction, have recently become popular. One is to force ordinary MD simulations to escape from the already explored equilibrium states,26,27 while the other attempts to identify the low energy configurations by sampling many different basins, mostly based on geometry optimizations. A heterogeneous training dataset (diverse in structural types) enhances transferability across different types of structures, curing the extrapolation problem.24,25 Zeni et al.24 achieved a good trade-off between transferability and overall accuracy by applying Gaussian process regression to a diverse dataset (including high temperature structures). Similarly, many physical properties were reproduced within 10% relative error to the DFT.25 Hajinazar et al. employed a structure prediction technique to generate datasets more diverse than those produced by the common MD-based approach.20 In addition, it was proposed that the generation of MLFFs could be performed in conjunction with structure prediction processes. The active learning approach of constructing MLFFs on-the-fly was employed to automatically deal with extrapolation outside the training domain; the MLFFs then gradually replaced DFT for structural relaxations and energy evaluations at a much lower computational cost. The active learning technique has been successfully applied to predict PES reconstructions of several challenging elemental systems21,22 and multi-component systems.23 For instance, Deringer et al.21 used Gaussian process regression combined with random structure searching (RSS) algorithms to systematically construct an interatomic potential for boron; Podryabinkin et al.22 employed the evolutionary algorithm USPEX to build machine-learning interatomic potentials for several elemental allotropes; and similar ideas were also applied to investigate surface reconstructions23 and nanoparticles.28
In this report, we discuss our attempts at developing accurate and transferable MLFFs for elemental silicon as the prototypical system. Many silicon MLFFs have been developed using training datasets obtained by running MD simulations and selecting known structural prototypes manually.10,15,19,25,29–32 Configurations from MD trajectories tend to possess strong correlations with the initial geometry. Hence, the resulting MLFFs can only describe a few energy basins of the entire PES. We believe that two main factors influence the transferability of an MLFF. First, a training dataset generated with a high-throughput structure prediction method can enhance the transferability. Here, we generate a diverse silicon dataset by using our in-house code, PyXtal,33 a Python package for random crystal structure generation. The DFT-quality dataset spans a large space in the PES covering many energy basins; the DFT settings are provided in Sec. II A. Second, we enable a machine learning infrastructure that allows Behler–Parrinello descriptors and bispectrum coefficient descriptors to be trained with generalized linear regression and neural networks. The details of the descriptors and the regression techniques are available in Secs. II B and II C, respectively. Finally, we systematically construct the NNP with bispectrum coefficients as the descriptors in Sec. III.
II. COMPUTATIONAL METHODOLOGIES
A. Ab initio calculation
Ab initio calculations are necessary to provide the training dataset for MLFF development. In this study, we employed the PyXtal33 software to generate several thousand structural configurations. For each configuration, the total energy and forces were calculated at the DFT level through the ASE package.34 ASE provides an interface to the Vienna Ab initio Simulation Package (VASP) code35 with the projector augmented wave methodology36 to perform geometry relaxations. In our calculations, we used the Perdew–Burke–Ernzerhof generalized gradient approximation (PBE-GGA)37 as the exchange-correlation functional with an energy cutoff of 600 eV and a Γ-centered k-point mesh generated with KSPACING = 0.15.
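For reference, the single-point step of this workflow can be scripted through ASE. The snippet below is a minimal sketch under the settings stated above; the structure file name is hypothetical, and a working VASP installation configured for ASE is assumed.

```python
from ase.io import read
from ase.calculators.vasp import Vasp

# Hypothetical file name; any ASE-readable structure works here.
atoms = read("si_random.cif")

# PBE exchange-correlation, 600 eV plane-wave cutoff, and a
# Gamma-centered k-point grid generated from KSPACING = 0.15.
atoms.calc = Vasp(xc="pbe", encut=600, kspacing=0.15, kgamma=True)

energy = atoms.get_potential_energy()  # total energy in eV
forces = atoms.get_forces()            # forces in eV/Angstrom
```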
B. Descriptors
Descriptors, as the unique numerical representations of atomic structures, play an essential part in constructing MLFFs. It is crucial for a descriptor to be able to distinguish the local environments of atomic structures. The most common choice of representation by atomic coordinates is convenient, but it poorly describes the structural environments. The Cartesian coordinates of a crystal structure can change through translational or rotational operations, while the energy remains invariant. Thus, physically meaningful descriptors must be unaffected by these alterations to the structural environment, and any permutation of atoms should not change the descriptors. Additionally, the descriptors must be continuously differentiable within the domain of the local atomic environment. In the last decade, atom-centered descriptors, which probe the atomic environment via the neighboring vectors, became popular because they fit these criteria. The descriptors usually operate with a cutoff function to ensure that they smoothly vanish at a given cutoff radius, Rc. A popular choice is the so-called cosine cutoff function,

$$f_c(R_{ij}) = \begin{cases} \dfrac{1}{2}\left[\cos\left(\dfrac{\pi R_{ij}}{R_c}\right) + 1\right], & R_{ij} \le R_c,\\[4pt] 0, & R_{ij} > R_c, \end{cases} \qquad (1)$$

where $R_{ij}$ is the distance between the center atom $i$ and the neighbor atom $j$.
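As a concrete illustration, Eq. (1) takes only a few lines of NumPy (a sketch, not our production implementation):

```python
import numpy as np

def cosine_cutoff(r, rc):
    """Eq. (1): decays smoothly from 1 at r = 0 to 0 at the cutoff rc."""
    return np.where(r <= rc, 0.5 * (np.cos(np.pi * r / rc) + 1.0), 0.0)
```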
1. Behler–Parrinello descriptors
Behler–Parrinello descriptors are regularly used to represent the local atomic environments of crystal structures in NNP development. Commonly used Behler–Parrinello descriptors are the two-body (G2) and three-body (G4) symmetry functions,

$$G^{(2)}_i = \sum_{j \neq i} e^{-\eta (R_{ij} - R_s)^2}\, f_c(R_{ij}), \qquad (2)$$

$$G^{(4)}_i = 2^{1-\zeta} \sum_{j \neq i} \sum_{k \neq i,j} \left(1 + \lambda \cos\theta_{ijk}\right)^{\zeta} e^{-\eta \left(R_{ij}^2 + R_{ik}^2 + R_{jk}^2\right)}\, f_c(R_{ij})\, f_c(R_{ik})\, f_c(R_{jk}). \qquad (3)$$

G2 is mainly designed to capture the radial environment, while G4 describes the angular part by including the three-body ijk terms. Rs shifts the center of the Gaussian functions to a certain radius, resulting in a spherical shell with Gaussian width η. ζ controls the angular resolution, and λ usually takes the values +1 and −1 to invert the cosine function. The cutoff function (fc) is consistent with Eq. (1). A set of $G^{(2)}_i$ and $G^{(4)}_i$ descriptors specifies the center atom i in relation to the neighboring atoms j in terms of the radial and angular parts. For a real material system, this set of parameters needs to be optimized by a more extensive search.38–41
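For illustration, the two symmetry functions can be sketched as follows for a single center atom, reusing the cosine_cutoff function defined above; the neighbor distances and angle cosines are assumed to be precomputed NumPy arrays over pairs and triplets.

```python
import numpy as np

def g2(r_ij, eta, rs, rc):
    """Radial symmetry function, Eq. (2): sum over neighbors j."""
    return np.sum(np.exp(-eta * (r_ij - rs) ** 2) * cosine_cutoff(r_ij, rc))

def g4(r_ij, r_ik, r_jk, cos_theta, eta, zeta, lam, rc):
    """Angular symmetry function, Eq. (3): sum over neighbor pairs (j, k)."""
    radial = np.exp(-eta * (r_ij**2 + r_ik**2 + r_jk**2))
    cutoff = (cosine_cutoff(r_ij, rc) * cosine_cutoff(r_ik, rc)
              * cosine_cutoff(r_jk, rc))
    return 2.0 ** (1 - zeta) * np.sum(
        (1 + lam * cos_theta) ** zeta * radial * cutoff)
```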
2. Bispectrum coefficients
Similar to the Behler–Parrinello descriptors, the SO(4) bispectrum can be used to represent the local atomic environments. It was first introduced by Bartók et al. for the training of machine learning FFs on the elemental systems of Group IVA.15 A detailed study of the SO(4) bispectrum as a descriptor, along with several alternative implementations [SO(3) bispectrum, angular Fourier series, and SOAP kernel], is available in Ref. 29. Later, Thompson et al. proposed the spectral neighbor analysis (SNAP) method and demonstrated that the SO(4) bispectrum could achieve satisfactory accuracy based on simple linear11 and quadratic regressions.12 Following the original work, the SO(4) bispectrum is formed from the expansion coefficients of the 4D hyperspherical harmonics,

$$B_{l_1,l_2,l} = \sum_{m,m'} \left(c^{l}_{m',m}\right)^{*} \sum_{m_1,m'_1} \sum_{m_2,m'_2} H^{\,l\,m\,m'}_{l_1 m_1 m'_1,\, l_2 m_2 m'_2}\; c^{l_1}_{m'_1,m_1}\, c^{l_2}_{m'_2,m_2}, \qquad (4)$$

where $H^{\,l\,m\,m'}_{l_1 m_1 m'_1,\, l_2 m_2 m'_2}$ is the analog of the Clebsch–Gordan coefficients on a 3-sphere; in application, it is the product of two ordinary Clebsch–Gordan coefficients on a 2-sphere. The $c^{l}_{m',m}$ are the expansion coefficients of the hyperspherical harmonic ($U^{l}_{m',m}$) functions onto which the atomic neighborhood density within a cutoff radius, projected onto the surface of a four-dimensional sphere, is expanded,

$$\rho(\boldsymbol{r}) = \delta(\boldsymbol{r}) + \sum_{j} f_c(R_{ij})\, \delta(\boldsymbol{r} - \boldsymbol{r}_{ij}) = \sum_{l=0}^{\infty} \sum_{m',m=-l}^{l} c^{l}_{m',m}\, U^{l}_{m',m}, \qquad (5)$$

where the expansion coefficients are defined as

$$c^{l}_{m',m} = \left\langle U^{l}_{m',m} \,\middle|\, \rho \right\rangle = U^{l\,*}_{m',m}(0,0,0) + \sum_{j} f_c(R_{ij})\, U^{l\,*}_{m',m}(\omega_j, \theta_j, \phi_j). \qquad (6)$$
In this work, our implementation of the SO(4) bispectrum, or bispectrum descriptor, is very similar to the SNAP method11 implemented in the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) code.42 However, we adopt another method to calculate the hyperspherical harmonics and their gradients.43 The benefit of this method is that it removes the singularities at the north and south poles of the 3-sphere that exist in the traditional implementation. Furthermore, we also include an option to normalize the expansion coefficients from the hyperspherical harmonics. The impact of normalization on the MLFF training will be discussed in Sec. III C.
C. Machine learning force field fitting
The total energy (Etotal) of a structure is obtained as the sum of the atomic energies (Ei) evaluated from the atom-centered descriptors, Xi,

$$E_{\text{total}} = \sum_{i=1}^{N} E_i(\boldsymbol{X}_i), \qquad (7)$$

where N is the number of atoms in the structure. The atomic energy contributions depend on the local structural environment within a cutoff radius around the center atom i. Furthermore, an accurate representation of the PES also depends on the contributions of forces. The force acting on atom j can be expressed as the negative gradient of the total energy with respect to its atomic position (rj),

$$\boldsymbol{F}_j = -\nabla_{\boldsymbol{r}_j} E_{\text{total}} = -\sum_{i=1}^{N} \frac{\partial E_i}{\partial \boldsymbol{X}_i} \cdot \frac{\partial \boldsymbol{X}_i}{\partial \boldsymbol{r}_j}. \qquad (8)$$
1. Generalized linear regression
Linear regression is the most fundamental approach in curve fitting. In this context, each atomic energy is assumed to be linearly correlated with the descriptors. Thus, the total energy can be expressed as follows:

$$E_{\text{total}} = \sum_{i=1}^{N} \left(\gamma_0 + \boldsymbol{\gamma} \cdot \boldsymbol{X}_i\right), \qquad (9)$$

where γ0 and γ are the weights in scalar and vector forms, respectively, and N is the total number of atoms in a structure.
In general, the total energy can be described by a generalized linear regression with extended polynomial terms. The following equation is the second-order (quadratic) expansion in the Taylor series:

$$E_{\text{total}} = \sum_{i=1}^{N} \left(\gamma_0 + \boldsymbol{\gamma} \cdot \boldsymbol{X}_i + \frac{1}{2}\, \boldsymbol{X}_i^{T} \cdot \boldsymbol{A} \cdot \boldsymbol{X}_i\right), \qquad (10)$$

where A is the symmetric weight matrix (i.e., A12 = A21) describing the quadratic terms. From linear to quadratic regression, the number of weight coefficients increases from d + 1 to (d + 1)(d + 2)/2, where d is the length of the descriptor. Indeed, the energy can be expanded to even higher orders; however, we restrict it to the second-order expansion due to the drastic increase in the number of weight coefficients.
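To make this counting concrete, the following sketch expands a descriptor of length d into the (d + 1)(d + 2)/2 features of the quadratic model; the helper name is ours, not part of any published code.

```python
import numpy as np

def quadratic_features(x):
    """Expand descriptor x into [1, x, upper triangle of outer(x, x)].

    The result has 1 + d + d*(d+1)/2 = (d+1)(d+2)/2 entries, matching
    the weight count of the second-order model.
    """
    x = np.asarray(x)
    iu = np.triu_indices(len(x))
    return np.concatenate(([1.0], x, np.outer(x, x)[iu]))
```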
Correspondingly, the force on an atom j can be obtained by expanding the terms of Eq. (8) with Eq. (10),

$$\boldsymbol{F}_j = -\sum_{i=1}^{N} \left(\boldsymbol{\gamma} + \boldsymbol{X}_i^{T} \cdot \boldsymbol{A}\right) \cdot \frac{\partial \boldsymbol{X}_i}{\partial \boldsymbol{r}_j}. \qquad (11)$$
Both the energy and force terms have a linear correlation with the expanded descriptors through a set of weight coefficients {γ0, γ1, …, γd, Γ11, Γ12, …, Γdd}. For convenience, we call this set of coefficients w from now on. To obtain the best w, we minimize an objective cost function following the least squares formula for both energy and force,

$$\Delta = \frac{1}{2s} \sum_{i=1}^{s} \left[ \left( \frac{E_i - E_i^{\text{Ref}}}{N_{\text{atom}}^{i}} \right)^{2} + \frac{\beta}{3 N_{\text{atom}}^{i}} \sum_{j=1}^{3 N_{\text{atom}}^{i}} \left( F_{i,j} - F_{i,j}^{\text{Ref}} \right)^{2} \right], \qquad (12)$$

where s is the total number of structures, i loops over all structures, and j loops over all force components (three per atom) of each structure i. $N_{\text{atom}}^{i}$ is the total number of atoms in the ith structure. β is the force coefficient; it balances the energy and force contributions, since the number of force components is much larger. The cost function compares the predicted values obtained from the regression (Ei and Fi,j) to the true values $E_i^{\text{Ref}}$ and $F_{i,j}^{\text{Ref}}$.
To prevent overfitting, it is useful to add a penalty term accounting for the complexity of the entire set of weights (of size m) to Eq. (12),

$$\Delta_{\text{p}} = \Delta + \frac{\alpha}{2m} \sum_{k=1}^{m} w_k^{2}, \qquad (13)$$

where α is a dimensionless number that controls the degree of penalty. Adding such a penalty function is known as regularization in the context of machine learning. The optimum solution can then be found as the w for which the partial derivative of Δp with respect to each element of w is zero. Accordingly, we use the numpy.linalg.lstsq44 solver for the generalized linear regression problems.
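A minimal sketch of such a regularized least squares fit on energies alone is given below (the force rows extend the design matrix analogously; the function name and data layout are illustrative assumptions):

```python
import numpy as np

def fit_linear_energy(descriptors, energies, alpha=1e-4):
    """Solve for w = (gamma_0, gamma) such that
    E_total ~ N * gamma_0 + gamma . sum_i X_i for each structure.

    descriptors: list of (N_atoms, d) arrays, one per structure
    energies:    list of reference total energies
    alpha:       L2 penalty strength
    """
    rows = [np.concatenate(([X.shape[0]], X.sum(axis=0))) for X in descriptors]
    A = np.asarray(rows)
    b = np.asarray(energies)
    # Augmenting with sqrt(alpha)*I rows is the standard least-squares
    # formulation of the L2 (ridge) penalty.
    n = A.shape[1]
    A_aug = np.vstack([A, np.sqrt(alpha) * np.eye(n)])
    b_aug = np.concatenate([b, np.zeros(n)])
    w, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return w
```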
2. Neural network regression
In this section, the high-dimensional NN (Fig. 1) is introduced. Regression based on a NN can be considered an extension of the linear regression model. For a crystal structure consisting of N atoms, there are N atomic positions (RN). N atom-centered descriptors (Xi) can be mapped from this atomic configuration. Each atom-centered descriptor is then fed into a NN architecture [Fig. 1(b)]. The NN architecture consists of input, hidden, and output neurons organized in layers, as shown. The neurons in the first (input) layer are occupied by the atom-centered descriptors. The neuron at the output layer defines the atomic energy, Ei. Hidden layers lie between the input and output layers; in the case of Fig. 1(b), there are two hidden layers. We will refer to this NN architecture as 2-3-3: 2 represents two neurons in the input layer, and 3-3 represents two hidden layers with 3 neurons each. It is redundant to mention the output layer, as it always contains a single node. The neurons in hidden layers carry no physical meaning; they act as a functional form to predict the atomic energy. There is no limit to the number of hidden layers, but the flexibility of the NNP depends on the number of neurons in the NN architecture. The connectivities between the neurons are the weight (fitting) parameters. Mathematically, the value of a neuron is calculated as

$$X^{l}_{n_i} = a^{l}_{n_i}\!\left( b^{l-1}_{n_i} + \sum_{n_j=1}^{N} W^{l-1}_{n_j n_i}\, X^{l-1}_{n_j} \right). \qquad (14)$$
The value of a neuron ($X^{l}_{n_i}$) at layer l is determined by the weights ($W^{l-1}_{n_j n_i}$), the bias ($b^{l-1}_{n_i}$), and all neurons from the previous layer ($X^{l-1}_{n_j}$). $W^{l-1}_{n_j n_i}$ specifies the connectivity of neuron $n_j$ at layer l − 1 to neuron $n_i$ at layer l. $b^{l-1}_{n_i}$ represents the bias of the previous layer belonging to neuron $n_i$. These contributions are summed over the total number of neurons (N) at layer l − 1. Finally, an activation function ($a^{l}_{n_i}$) is applied to the summation to induce non-linearity in the neuron ($X^{l}_{n_i}$). At the output layer, $X$ is equivalent to an atomic energy, while at the input layer it represents an atom-centered descriptor. Since the atomic energy has no reference value at the DFT level, the atomic energies are collected as in Eq. (7) to obtain the total energy of a crystal structure. The accuracy of the NNP relies on the ability of the NN architecture to predict the energy.
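A bare-bones forward pass illustrating Eqs. (7) and (14) is sketched below in NumPy; the 2-3-3 shapes mirror the example architecture of Fig. 1, and the random weights are purely illustrative.

```python
import numpy as np

def atomic_energy(x, weights, biases, act=np.tanh):
    """Propagate one atom-centered descriptor through the network [Eq. (14)]."""
    h = np.asarray(x)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = act(h @ W + b)                        # hidden layers with activation
    return (h @ weights[-1] + biases[-1]).item()  # linear output neuron: E_i

def total_energy(X, weights, biases):
    """Sum the atomic energies over all atoms in the structure [Eq. (7)]."""
    return sum(atomic_energy(x, weights, biases) for x in X)

# Example: a 2-3-3 architecture (2 inputs, two hidden layers of 3 neurons).
rng = np.random.default_rng(0)
weights = [rng.normal(size=s) for s in [(2, 3), (3, 3), (3, 1)]]
biases = [rng.normal(size=s) for s in [(3,), (3,), (1,)]]
X = rng.normal(size=(5, 2))  # 5 atoms, 2 descriptor values each
print(total_energy(X, weights, biases))
```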
(a) A schematic diagram of the high-dimensional neural networks. The red diagrams are parts of (b) the neural network architecture. Each atom in a structure is first mapped into atom-centered descriptors according to the atomic environment of the structure. The atom-centered descriptors serve as inputs in the neural network architecture that outputs the atomic energy. Finally, the collection of the atomic energies is the total energy of the structure.
To train the NNP, we consistently use the cost function in Eqs. (12) and (13). The minimization problem is then solved by our in-house stochastic gradient descent and Adaptive Moment Estimation (ADAM)45 optimizers. Alternatively, we interfaced our in-house code with the SciPy package,46 making it possible to use the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method47 for this study.
III. RESULTS
In this section, we discuss the development of accurate and transferable MLFFs. First, we introduce two types of datasets: a localized dataset and a diverse dataset. Second, we validate our machine learning framework with the localized dataset as the baseline. Third, we explore the interplay between bispectrum coefficients and the two machine learning regressions (generalized linear regression and NN) on the localized dataset; this subsection further validates the localized dataset with a new NNP fitting strategy. Finally, we develop a transferable silicon MLFF based on the new strategy.
A. Data sets
Here, we present two silicon datasets. Set #1 is the localized dataset, obtained from Ref. 19. Set #1 contains 244 structures in total (219 for training and 25 for testing), including the crystalline ground state, strained structures, slabs, vacancies, and liquid configurations from MD simulations. To generate the diverse dataset, we utilized our in-house PyXtal code33 to produce thousands of silicon structures with 1, 2, 4, 6, 8, or 16 atoms in the unit cell. A random space group (1–230) was assigned to each structure, as sketched below. For each random structure, we performed four consecutive geometry optimization stages at the DFT level with steadily increasing precision; the maximum number of ionic steps for the four stages was 10, 25, 50, and 50, respectively. The relaxed images were then added to our training pool to represent the shape of the PES toward the energy minima. With this scheme, we ensured that not only the minima but also the configurations around the minima would be captured during the energy fitting. Afterward, we performed single-point DFT calculations for all configurations in the training pool using the parameters described in Sec. II A. Finally, 5352 silicon structures (Set #2) were selected by removing structures with energies higher than −4.000 eV/atom (i.e., 1.400 eV/atom above the ground state). In total, Set #1 has 15 078 atoms, and Set #2 has 31 004 atoms. We note that the energy cutoff (600 eV) used in our DFT calculations is slightly higher than the one (520 eV) used in Ref. 19; however, this resulted in negligible differences in our tests on the same structures. Therefore, we will use these two datasets for direct comparison in Secs. III D–III F.
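A sketch of this generation loop is shown below. The exact PyXtal calls may differ between versions, and error handling is needed because not every space group/composition combination is compatible.

```python
import random
from pyxtal import pyxtal

structures = []
while len(structures) < 1000:
    sg = random.randint(1, 230)              # random space group
    n = random.choice([1, 2, 4, 6, 8, 16])   # atoms per unit cell
    xtal = pyxtal()
    try:
        xtal.from_random(3, sg, ["Si"], [n]) # 3D crystal with n Si atoms
    except Exception:
        continue                             # incompatible sg/composition
    if xtal.valid:
        structures.append(xtal.to_ase())     # hand off to ASE for DFT
```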
As shown in Fig. 2, Set #2 covers more diverse atomic environments in terms of energy, force, and density. Set #1 includes 244 structures that span from −4.560 eV/atom to −5.425 eV/atom in energy and 17.56 Å3/atom to 40.89 Å3/atom in density. The energy of Set #2 ranges from −4.000 eV/atom to −5.425 eV/atom, and the density ranges from 8.295 Å3/atom to 52.81 Å3/atom. The force distribution in Set #2 is wider than that in Set #1. To probe the similarity between the two datasets, we further assessed them with the principal component analysis (PCA) technique. The projection onto the two most dominant principal components is shown in Fig. 2; the principal components were fitted to the bispectrum coefficients mapped from the Set #2 structures. The inset shows that the data points of Set #1 occupy mostly the empty space in the concentrated area. In other words, Set #1 and Set #2 rarely overlap. This indicates that the two datasets encompass different atomic environments, which is expected since two different strategies were employed in generating the atomic configurations. Therefore, the two datasets are complementary and can be used to cross-validate each other in MLFF development.
(a) The energy vs volume plot for training Set #1 and Set #2. The histograms of energy and forces are presented in (b) and (c), respectively. (d) The projection onto the two most dominant principal components of the atomic bispectrum coefficients. The inset illustrates a zoomed-in view of the concentrated area. In this area, Set #1 is highly concentrated, whereas Set #2 is more widely spread.
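The PCA assessment can be reproduced along the following lines (a sketch with scikit-learn; the placeholder arrays stand in for the actual atomic bispectrum coefficients of the two sets):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder arrays; in practice each row holds the bispectrum
# coefficients of one atom from Set #1 or Set #2.
X1 = np.random.rand(200, 55)
X2 = np.random.rand(1000, 55)

pca = PCA(n_components=2).fit(X2)  # axes fitted on the diverse set
proj1 = pca.transform(X1)          # Set #1 projected onto those axes
proj2 = pca.transform(X2)
```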
It is important to note that these two sets of data were obtained through entirely different approaches. Set #1 was not designed to generate an accurate force field for Si but rather to compare different MLFFs on a small, standardized dataset applicable to several elemental systems (for example, only 60 snapshots from ab initio MD were included in it). In a typical MLFF development, a few thousand or more configurations are needed for both Gaussian process21,48 and NN regressions.20,27 Therefore, the training results from Set #1 would be expected to improve somewhat with a larger version of Set #1 built with the same strategy (e.g., adding more MD snapshots); however, many other features of the PES would remain missing. Compared to Set #1, Set #2 covers more energy basins in the PES since it was obtained from an unbiased and more uniform sampling. For instance, we found that Set #2 contains the high pressure β-Sn phase of silicon and many other phases with five- and six-coordinated silicon atoms. While such atomic environments can also exist in silicon grain boundaries and other types of defects generated from high temperature MD simulations, the MLFF training may not describe these atomic energetics accurately when it attempts to fit the total energy of the system. Therefore, we believe that a MLFF with better coverage of the PES landmarks by small structures is more effective for accurate modeling of rare events under various conditions (e.g., phase transitions, pronounced deformations, and chemical reactions). As we will discuss in Sec. III E, fitting on Set #2 is considerably more challenging than on Set #1. While many relatively simple models can yield satisfactory errors for Set #1, the overall accuracy for Set #2 is notably lower, regardless of the machine learning method. Therefore, the goal of this work is to fit a Si-MLFF that describes Set #2 reasonably well while retaining a similar level of accuracy for Set #1. Prior to this, we will verify our MLFF implementation with Set #1 in Sec. III B.
B. Verification with the localized data set
In Ref. 19, the authors presented an extensive benchmark for silicon (as well as several other elemental systems) with different MLFF approaches. This provided us with a foundation to verify our MLFF implementations by using their data for training and testing. With Set #1, we attempted to reproduce the results based on the NNP, SNAP, and quadratic SNAP (qSNAP) methods. To compute the descriptors, we employed the same parameter settings as reported in Ref. 19, which are summarized in Table I. In the original literature, there were 9 G2 and 18 G4 descriptors. We inspected the histograms of the computed symmetry functions over the entire Set #1 and identified that descriptors with large η values span a very narrow range. Such narrow-range descriptors are less likely to discriminate different local atomic environments, and they can introduce numerical noise. Therefore, we reduced the parameter set to only 6 G2 and 8 G4 descriptors for this study; the reduced parameter sets are marked with asterisks. For convenience, we name the full Behler–Parrinello descriptors G27 and the reduced Behler–Parrinello descriptors G14. For bispectrum coefficients, the expansion is limited to a few finite orders, since higher indices of l are only beneficial for detecting subtle signals in the neighbor density map. In this study, we only considered band limits (lmax) of up to 8, with focus on 3, 4, and 5 (30, 55, and 91 bispectrum coefficients). They are denoted B30, B55, and B91. Furthermore, we investigated the case of B with normalization, denoted B̄30, B̄55, and B̄91. Correspondingly, the labels combining regression techniques and descriptors read NNP+G27 for NN regression with G27 descriptors, LR+B55 for linear regression with B55 descriptors, and QR+B55 for quadratic regression with B55 descriptors.
The settings used to compute the atom-centered descriptors in this study. The Behler–Parrinello descriptors are consistent with Ref. 19, except that Rc was set to 4.8 Å for the quadratic regression in the previous literature. Moreover, we considered bispectrum coefficients with band limits lmax up to 8. The asterisk denotes the reduced parameter set for Behler–Parrinello descriptors.
| Descriptors | Parameters | Values |
|---|---|---|
| G2 | Rc (Å) | 5.2 |
| | Rs (Å) | 0 |
| | η (Å−2) | 0.036*, 0.071*, 0.179*, 0.357*, 0.714*, 1.786*, 3.571, 7.142, 17.855 |
| G4 | Rc (Å) | 5.2 |
| | λ | −1, 1 |
| | ζ | 1 |
| | η (Å−2) | 0.036*, 0.071*, 0.179*, 0.357*, 0.714, 1.786, 3.571, 7.142, 17.855 |
| B | Rc (Å) | 4.9 |
| | lmax | 2, 3, 4, 5, 6, 7, 8 |
| | Normalization | True and false |
For linear and quadratic regressions, the results are deterministic once the force coefficient in Eq. (12) is given. Figure 3 displays the gradual changes in the mean absolute error (MAE) values for energy and forces as the force coefficient (β) is varied from 1 × 10−6 to 1 for both LR+B55 and QR+B55. For each regression, these points form a Pareto front: no single point beats the other points in both the energy and force MAE values. Here, we choose the point on the Pareto front where the energy and force errors change approximately evenly, which corresponds to the force coefficient closest to 1 × 10−4. When β = 1 × 10−4, LR+B55 yields MAE values of 6.94 (6.28) meV/atom for energy and 0.11 (0.12) eV/Å for force on the training (test) dataset. For QR+B55, the results improve significantly: the final energy MAE value is 2.50 (2.21) meV/atom, and the force MAE value is 0.06 (0.08) eV/Å. These results are expected since the quadratic form allows the coupling of bispectrum coefficients.12 However, the number of weight parameters also increases notably, from 56 to 1596, which increases the computational cost of both FF training and prediction.
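The Pareto front mentioned above can be extracted mechanically from the scan results; below is a small sketch over (energy MAE, force MAE) pairs, with made-up example numbers.

```python
import numpy as np

def pareto_front(points):
    """Keep the points not dominated by any other point.

    A point is dominated if another point is no worse in both the
    energy MAE and the force MAE, and strictly better in at least one.
    """
    pts = np.asarray(points)
    keep = []
    for i, p in enumerate(pts):
        dominated = np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))
        if not dominated:
            keep.append(i)
    return pts[keep]

# Example: (energy MAE, force MAE) pairs from a scan over force coefficients.
scan = [(6.9, 0.11), (5.2, 0.14), (8.1, 0.10), (6.0, 0.16)]
print(pareto_front(scan))  # the last point is dominated and dropped
```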
The comparison of fitting between linear and quadratic regressions based on the B55 descriptors (lmax = 4) applied to Set #1. For each regression, the energy MAE and force MAE values were collected by gradually varying the force coefficients from 1 × 10−6 to 1. The numbers of weight parameters are given in parentheses. The marked black asterisks correspond to the results when the force coefficient is at 1 × 10−4.
For NNP+G27, we tested the NNP fitting with the NN architecture 27-24-24. The predicted MAE values are 5.65 meV/atom on the training dataset and 5.60 meV/atom on the test dataset. These metrics are close to the previously reported values of 5.88 meV/atom and 5.60 meV/atom in Ref. 19. Our force MAE values are 0.095 eV/Å and 0.106 eV/Å, also agreeing with the previous report. Furthermore, we applied the reduced Behler–Parrinello descriptors to the NNP fitting (NNP+G14) and found that NNP+G14 yields comparable metrics. This indicates that the removed Behler–Parrinello descriptors were indeed redundant and could cause numerical noise during the NNP training. Correspondingly, we adjusted our NNP training strategy toward G14 to investigate the impact of hyperparameters on NNP training. In contrast to linear regression, the NNP training is much less sensitive to the choice of force coefficient, since the NNP can compensate with more flexible functional forms; it is instead more reliant on the hidden layer size. Figure 4 shows the energy MAE values scanned across hidden layer sizes for NNP+G14 with β fixed at 0.03. The overall picture suggests that NNP performance tends to improve as the NNP model becomes more flexible. However, the NNP accuracy saturates at some point; beyond the saturation point, increasing the hidden layer size only raises the computational cost and lowers the chance of finding optimal weight parameters. We also note that QR+G14 yields better performance than NNP with the same number of parameters. In principle, the NNP should be able to self-learn a model similar to QR with the same number of weight parameters; however, different NNP trainings from different initial random guesses may yield somewhat less optimal solutions. This suggests that quadratic regression can be an alternative approach when the descriptor size is relatively small.
The performance of NN regression on G14 and B30 as a function of weight parameters. For comparison, the results from linear and quadratic regressions are also included.
The verification results with different training strategies are summarized in Table II. Compared to Ref. 19, our results are close or slightly better, especially in the force performance of the generalized linear regressions. Therefore, we proceed to further investigations on Set #1 using different strategies.
The comparison of mean absolute error (MAE) values between this work and Ref. 19 for the same 244 Si dataset (Set #1). The results from Ref. 19 are shown in parentheses. For LR+B55 and QR+B55, the results are shown when the force coefficient is at 1 × 10−4. For the NNP fitting, we used NN architectures of 27-24-24 and 14-12-12.
| Fitting method | Train energy (meV/atom) | Test energy (meV/atom) | Train force (eV/Å) | Test force (eV/Å) |
|---|---|---|---|---|
| LR+B55 | 6.94 (6.38) | 6.28 (6.89) | 0.11 (0.21) | 0.12 (0.22) |
| QR+B55 | 2.50 (3.98) | 2.21 (3.81) | 0.06 (0.18) | 0.08 (0.17) |
| NNP+G27 | 5.65 (5.88) | 5.60 (5.60) | 0.09 (0.12) | 0.11 (0.11) |
| NNP+G14 | 5.95 | 6.33 | 0.10 | 0.11 |
C. Bispectrum coefficients and algorithms interplay
In this section, MLFF fitting with bispectrum coefficients is discussed in detail using both generalized linear and NN regressions on Set #1. First, we examine whether normalizing the bispectrum coefficients prior to the MLFF fitting improves the performance of generalized linear regression. In the original implementation of SNAP,11 the bispectrum coefficients are not normalized prior to the MLFF fitting. Figure 5 compares the fits with and without normalization: with normalization, linear regression achieves better performance for both energy and forces as lmax increases. At lmax > 5, there are no significant further gains in the MAE values, while the computational cost keeps increasing. The modest impact of normalization may be due to the limited ability of linear regression to express the complexity of the PES.
The performance of linear regression based on the bispectrum coefficients without (a) and with (b) normalization. In each plot, lmax values from 2 to 8 were considered. The number of descriptors is given in parentheses.
Second, Fig. 4 shows the overall NNP fitting with bispectrum coefficients as the inputs to the neural network architecture. NNP+B30 models were trained with different hidden layer sizes; the best accuracy is achieved with a hidden layer size of [24, 24]. This 30-24-24 architecture contains 1369 parameters in total. The training MAE values are 3.18 meV/atom and 0.07 eV/Å, and the test MAE values are 3.54 meV/atom and 0.08 eV/Å. These metrics are comparable to those of QR+B55 (see Table II) while using fewer bispectrum coefficients. For reference, the linear and quadratic regression results with the corresponding number of bispectrum coefficients are also marked in Fig. 4. NNP with bispectrum coefficients gains notable improvements over linear and quadratic regressions. The improvements are expected since the NN allows more flexible functional forms to describe the deviation from linearity. Meanwhile, quadratic regression achieves a significant improvement in accuracy over linear regression due to its extended polynomial form; however, similar accuracy can be attained with NNP fitting using a smaller number of weight parameters.
D. Transferability of the MLFF from a localized data set
Our in-house code can apply various descriptors and regression techniques to train MLFFs with satisfactory accuracy (energy MAE of <10 meV/atom and force MAE of <0.15 eV/Å) on Set #1. From a computational perspective, bispectrum coefficients form a more orthogonal basis and are easier to expand systematically. Therefore, we focus on bispectrum coefficients as the main descriptors from now on. Using the MLFFs trained on Set #1, we attempted to validate their predictive power on Set #2 (the more diverse dataset). The models include NNP with the 30-10-10 architecture (431 parameters, with β at 0.03), linear regression (31 parameters), and quadratic regression (496 parameters). All three scenarios use normalized bispectrum coefficients with lmax of 3, as normalized bispectrum coefficients offer a slight improvement in accuracy. Table III summarizes the results. In general, the prediction power of the MLFFs on Set #2, especially in energy, is still poor, though the force errors are acceptable. This is not surprising, as the extrapolation ability of machine learning models is known to be poor. An MLFF is accurate only within the domain of its training dataset; the atomic environments of Set #2 are too broad, and most of its data points lie outside of Set #1. Therefore, the predicted energies and forces are no longer reliable.
The MAE values of the predicted energy and forces of Set #2 by training on Set #1. The NN architecture of 30-10-10 is used for providing comparable weight parameters as the quadratic regression. The numbers inside parentheses are the test MAEs.
| | NN | LR | QR |
|---|---|---|---|
| Energy (meV/atom) | 4.7 (70) | 7.5 (110) | 4.0 (265) |
| Force (eV/Å) | 0.08 (0.13) | 0.12 (0.15) | 0.08 (0.21) |
| Number of parameters | 431 | 31 | 496 |
Despite the unsatisfactory accuracy, some insights can be gained from this numerical experiment. NN regression achieves better transferability than linear and quadratic regressions. Although quadratic regression yields the best accuracy in training (3.99 meV/atom for energy and 0.08 eV/Å for force), it also produces the largest error on the test set. NN regression achieves a similar level of accuracy in training (4.70 meV/atom for energy and 0.08 eV/Å for force), but its errors on the test set (69.8 meV/atom energy MAE and 0.13 eV/Å force MAE) are much smaller. This can be partially explained by the fact that the NN adopts more flexible functional forms during fitting.
E. Training with data from random structure generator
For the sake of data diversity, it is more natural to train the MLFF on Set #2 and test its performance on Set #1. To train a reliable MLFF on Set #2, we decided to use more bispectrum coefficients and a larger NN architecture. In addition, polynomial fittings were again included for comparison. For polynomial regression, lmax of 5 with a cutoff radius of 4.9 Å was applied. According to Fig. 5, normalizing the bispectrum coefficients had a negligible effect on the results; hence, normalization was not applied here. The β value was fixed at 1 × 10−4 for quadratic regression and 1 × 10−3 for linear regression. The NN architecture 91-34-34 was used to give a number of weight parameters comparable to the quadratic regression.
Figure 6 summarizes the results of the Set #2 training. In terms of energy, quadratic regression achieves the best accuracy (5.90 meV/atom), whereas NNP predicts less accurate energies (9.81 meV/atom) but better forces (0.08 eV/Å). It should be emphasized that Set #2 contains smaller unit cells (1–16 atoms) than Set #1 (up to 64 atoms). This transition from smaller to larger cells can introduce long-range effects that were not accounted for in the training.49 Therefore, the MAE values on the test set are consistently larger than on the training set. Furthermore, Set #1 may contain some manually selected atomic configurations that are not fully covered by our randomly generated structures. While linear regression predicts the forces well, its energy predictions are unsatisfactory. This may be due to the limitation of the regression technique, as the smaller number of parameters fails to describe the true PES. Therefore, our recommendation is to use either quadratic regression (similar to the recently proposed qSNAP method12) or NN for a better fitting of a diverse dataset. Compared to quadratic regression, NN is our preferred choice due to its flexibility.
Table IV compares our results with previous studies.29,49 In Ref. 29, several atom-centered descriptors, including bispectrum coefficients, were juxtaposed in combination with Gaussian process regression. In particular, the reported root mean square errors (RMSEs) for energy and forces with lmax of 5 are 20.2 meV/atom and 0.25 eV/Å. Meanwhile, the training RMSEs of our quadratic regression are 9.7 meV/atom and 0.22 eV/Å, and the training RMSEs of the 91-34-34 architecture are 14.8 meV/atom and 0.16 eV/Å. In another study, Kuritz et al. focused on training atomic forces using a deep learning model with environmental distances as the descriptors. Their force predictions, scaled from 16 atoms to 128 atoms, yield a MAE of 0.12 eV/Å,49 with NN layers on the order of 103 nodes each. This comparison demonstrates that a suitable choice of descriptor can reduce the complexity of an MLFF.
F. Physical properties
One of the critical requirements for MLFFs is to predict basic material properties, including but not limited to the lattice parameters, elastic constants, and bulk moduli of diamond cubic Si. To obtain the elastic constants, we computed the stress–strain relations and fitted them to the set of linear equations dictated by the crystal symmetry. For each applied deformation, the geometry of the structure was optimized until the net forces vanished. The properties are summarized in Table V.
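As an illustration of this procedure for a cubic crystal, the sketch below estimates C11 and C12 from small uniaxial strains. It assumes an ASE Atoms object and a hypothetical get_stress callback that returns the 3 × 3 stress tensor (in GPa) predicted by the trained MLFF after the internal coordinates have been relaxed.

```python
import numpy as np

def cubic_c11_c12(atoms, get_stress, delta=0.005):
    """Estimate C11 and C12 of a cubic crystal from uniaxial strains along x."""
    cell0 = atoms.get_cell()[:].copy()
    eps, s_xx, s_yy = [], [], []
    for e in np.linspace(-delta, delta, 5):
        strain = np.diag([1.0 + e, 1.0, 1.0])   # uniaxial strain along x
        atoms.set_cell(cell0 @ strain, scale_atoms=True)
        sigma = get_stress(atoms)               # hypothetical MLFF stress call
        eps.append(e)
        s_xx.append(sigma[0, 0])
        s_yy.append(sigma[1, 1])
    atoms.set_cell(cell0, scale_atoms=True)     # restore the reference cell
    c11 = np.polyfit(eps, s_xx, 1)[0]           # slope d(sigma_xx)/d(eps_xx)
    c12 = np.polyfit(eps, s_yy, 1)[0]           # slope d(sigma_yy)/d(eps_xx)
    return c11, c12
```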
The experimental elastic constants50 of cubic-diamond silicon are given as zero-Kelvin values, while the DFT data are obtained from Ref. 19. For comparison, results from the Gaussian approximation potential (GAP) of Si25 are also shown. The numbers of weight parameters are displayed in parentheses. EF and EFS stand for energy-force and energy-force-stress training. LR, QR, and NN are linear, quadratic, and neural network regressions, respectively. The NN architecture is 30-10-10 for Set #1 and 91-34-34 for Set #2.
| | Expt. | DFT | GAP | Set #1: EF-LR (56) | EFS-LR (56) | EF-QR (1596) | EFS-QR (1596) | EF-NN (431) | EFS-NN (431) | Set #2: EF-LR (91) | EFS-LR (91) | EF-QR (4371) | EFS-QR (4371) | EF-NN (4353) | EFS-NN (4353) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a (Å) | 5.429 | 5.469 | … | 5.467 | 5.466 | 5.462 | 5.467 | 5.473 | 5.468 | 5.415 | 5.469 | 5.503 | 5.468 | 5.509 | 5.467 |
| C11 (GPa) | 167 | 156 | 153 | 153 | 151 | 149 | 152 | 157 | 154 | 137 | 167 | 173 | 158 | 167 | 153 |
| C12 (GPa) | 65 | 65 | 56 | 100 | 62 | 60 | 57 | 96 | 58 | 76 | 73 | 55 | 55 | 128 | 57 |
| C44 (GPa) | 81 | 76 | 72 | 69 | 70 | 75 | 75 | 66 | 68 | 73 | 85 | 81 | 71 | 43 | 76 |
| BVRH (GPa) | 99 | 95 | 89 | 118 | 92 | 90 | 89 | 117 | 90 | 96 | 104 | 94 | 89 | 141 | 89 |
First, it is crucial to validate our code on Set #1. In the Set #1 columns of Table V, the performances of the MLFFs are presented for different training strategies: energy-force linear regression (EF-LR), energy-force-stress linear regression (EFS-LR), energy-force quadratic regression (EF-QR), energy-force-stress quadratic regression (EFS-QR), energy-force NN (EF-NN), and energy-force-stress NN (EFS-NN). All of the trainings used bispectrum coefficients as the descriptor: linear and quadratic regressions used lmax of 4, whereas NN used lmax of 3 with the NN architecture 30-10-10. Moreover, EF was trained with DFT energies and forces only as the reference values, while EFS also included the DFT stress information. Without stress involvement, the quadratic regression performance is the closest to the DFT values, while the linear and NN regressions fail to reproduce C12. However, when stress is included, the C12 values approach the DFT result with only a tiny sacrifice in the accuracy of C11.
Second, without stress information, linear and quadratic regressions appear more transferable in predicting the physical properties of Set #2. Evidently, adding stress information to the linear regression brings no prominent refinement without a trade-off between elastic constants; nevertheless, its values are the closest to the experimental ones. On the other hand, quadratic regression exhibits accuracy boosts in C11 and the lattice constant relative to the DFT. When stress training is employed, NNP benefits the most in terms of transferability. Consequently, it is crucial to include stress tensors during the training of NNPs.
IV. DISCUSSION
A. Training dataset
In general, an MLFF lacks extrapolative ability, unlike the traditional force field method; hence, the training dataset plays an extremely important role in MLFF development. A more complete dataset grants the trained MLFF more powerful predictive ability. The use of randomly generated pre-symmetrized crystal structures can produce a dataset with a highly diverse atomic distribution.21,33,51 In this work, we prepared the training dataset with crystal structure prediction techniques. However, several recent works demonstrated that an MLFF can also be trained on-the-fly.21,22 In addition to generating random structures, advanced sampling techniques such as metadynamics10,52 and stochastic surface walking27 have also been used to provide training data for MLFF development. In general, each method focuses on different aspects of the PES. For instance, the surface walking method may work better for describing the transition paths between different low energy configurations, while random structure generation covers more energy basins of the PES; the metadynamics method, in turn, excels at describing the liquid and amorphous states. At the moment, it remains challenging to obtain a universal MLFF that fully replaces DFT simulation for general purposes.25 Given the increasing power of regression techniques such as deep learning,53,54 it will be interesting to know whether an MLFF can ultimately achieve DFT accuracy by considering training configurations from all of these sampling techniques.
B. Limitation of the objective function
A typical DFT calculation outputs the total energy of each configuration. Thus, the MLFF is trained to describe the total energy of a structure. However, it is possible that the MLFF fails to distinguish the atomic energies of the trained structures.55 To prevent an incorrect energy decomposition, one can either develop an approach to extract the site energies from the DFT simulation56 or intentionally prepare structures with nearly identical atomic environments in the training. Random crystal structure generation with pre-symmetrization follows the latter route. Accordingly, Set #2 includes many structures with small unit cells to allow for a better description of the PES, which helps the performance in predicting the total energy. In addition, Set #2 can be further extended with more varieties of atomic environments to enhance the capability of the current NNP. For example, it was shown above that adding stress tensors can help improve elastic constant predictions.
C. Descriptors
As the complexity of a system’s PES increases, different atomic descriptors can yield different accuracies in MLFF development.29 For instance, thousands of nodes per layer were needed in Ref. 49 to achieve a transferability similar to that obtained with 34 nodes in this study. The key to extracting reliable descriptors is reconstructing the atomic neighbor density function. The expansion of bispectrum coefficients as the descriptor is more straightforward to apply than the Behler–Parrinello descriptors. Nevertheless, it is important to take into account the relation between computational cost and accuracy in MLFF training. The current MLFF is developed through the reconstruction of the neighbor density function, which is described by Dirac δ functions. The true neighbor density can only be partially represented by a finite spherical harmonic expansion; in addition, it is numerically unstable to compare the differences between two δ functions. A better descriptor design uses smooth Gaussian functions to express the atomic neighbor density, as recently developed in the SOAP method.29 A comparison between the SO(4) bispectrum and SOAP descriptors for NNP development will be made in future code development. Moreover, other similar types of descriptors, such as the moment tensor potential (MTP),14 will be investigated in the future.
D. Fitting scheme
Linear regression, as the simplest method in curve fitting, has been used in developing several MLFFs.11,14 In particular, the MTP approach19 can predict energies and forces with great accuracy while maintaining an acceptable computational cost. The advantage of the linear regression method lies in its simple algorithm, which provides easy and fast computation. Despite the simplicity, linear/quadratic regressions are usually sensitive to noise in the dataset. In this work, we focused on NN regression since it has more flexibility, which can yield better accuracy. Compared to the linear/quadratic regressions, including stress training in the NNP is critical to promote transferability. Besides NN, some non-parametric regression techniques, such as Gaussian process regression, have also proved efficient in MLFF development.16 However, these are beyond the scope of the current study.
E. Applicability
For the purpose of MD simulation around the equilibrium state, fitting the MLFF with a localized dataset generated from MD simulation is perhaps sufficient. However, the primary goal of this work is to generate a high quality silicon MLFF for a more general purpose, which requires a complete description of the PES for a given chemical system. As discussed above, the MLFF trained with Set #2 is generally better at describing the entire PES of the crystalline system. We expect that the MLFF generated in this work can be used to replace DFT simulation in predicting the structures of crystalline silicon, given that similar work has been done for several elemental systems.21,22 Yet, one needs to keep in mind that the quality still depends on the coverage of the training dataset. For instance, additional data are needed to enable predictions for surfaces and clusters.25 Moreover, the trained MLFF may not describe high energy configurations well, since Set #2 only contains structures within 1.400 eV/atom of the ground state. We found that some nonphysical configurations (e.g., unphysically short interatomic distances and overly clustered atoms) may be favored in high temperature MD simulations. In this case, it is useful either to add a few explicit two-body and three-body terms to prevent the nonphysical configurations48 or to include some highly strained configurations in the training. We will consider the combination of physical and machine learning terms in the training and investigate the applicability.
V. CONCLUSIONS
In summary, we present a systematic investigation of MLFF fitting for elemental silicon using our in-house code. The silicon MLFFs are developed by implementing different regression techniques with Behler–Parrinello descriptors and bispectrum coefficients. The MLFFs trained on Set #1 (the localized dataset) describe both energy and forces accurately with generalized linear regression and NN for both descriptor choices. Among the MLFFs, fitting an NNP with bispectrum coefficients is the most favorable option, as the expansion of bispectrum coefficients is more straightforward than that of the Behler–Parrinello descriptors. In addition, the NNP provides a more flexible framework in which the functional form can be easily adjusted by increasing/reducing the number of weight parameters. For Set #2, generated from random symmetric structures, the NNP fitting with bispectrum coefficients achieves an accuracy of 9.8 meV/atom for energy and 80 meV/Å for force, which is comparable to the current state of the art based on other approaches. A thorough study of the applicability of the Set #2 silicon MLFF to more challenging simulations, such as crystal structure search, will be the subject of our future work.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
ACKNOWLEDGMENTS
The authors acknowledge the NSF (I-DIRSE-IL, Grant No. 1940272) and NASA (Grant No. 80NSSC19M0152) for financial support. H.Y. was also supported by the Science Graduate Student Research (SCGSR) program, which is administered by the Oak Ridge Institute for Science and Education (ORISE) for the DOE under Contract No. DE-SC0014664. The computing resources are provided by XSEDE (No. TG-DMR180040). A portion of this work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344. The authors thank Dr. A. Thompson at Sandia for insightful discussions on the computation of bispectrum coefficients. They also thank the anonymous referees for excellent suggestions during the revision.