We propose a scheme for *ab initio* configurational sampling in multicomponent crystalline solids using Behler–Parrinello-type neural network potentials (NNPs) in an unconventional way: the NNPs are trained to predict the energies of relaxed structures from the perfect lattice with configurational disorder instead of the usual way of training to predict energies as functions of continuous atom coordinates. An active learning scheme is employed to obtain a training set containing configurations of thermodynamic relevance. This enables bypassing of the structural relaxation procedure that is necessary when applying conventional NNP approaches to the lattice configuration problem. The idea is demonstrated on the calculation of the temperature dependence of the degree of A/B site inversion in three spinel oxides, MgAl_{2}O_{4}, ZnAl_{2}O_{4}, and MgGa_{2}O_{4}. The present scheme may serve as an alternative to cluster expansion for “difficult” systems, e.g., complex bulk or interface systems with many components and sublattices that are relevant to many technological applications today.

## I. INTRODUCTION

Configurational order/disorder determines many properties of functional materials including mechanical strength,^{1} catalytic activity,^{2,3} ion/electron/phonon conductivity,^{4–6} and so on. Configurational disorder can be predicted using Monte Carlo (MC) methods based on statistical mechanics, but MC methods usually require a huge number of energy evaluations. Because of this, most works in the past have relied on effective low-cost models that are fitted to density functional theory (DFT) calculations.^{7–9} For solid-state systems that can be mapped onto a lattice, the cluster expansion method^{10–17} is often considered the *de facto* standard for obtaining such effective models. This method describes configuration energies using an expansion based on effective cluster interactions (ECIs),

$$E(\vec{\sigma}) = \sum_{\alpha} m_{\alpha} V_{\alpha} \langle \Phi_{\alpha}(\vec{\sigma}) \rangle, \tag{1}$$

where *α* denotes empty, single-site, and multisite clusters, *m*_{α} is the cluster multiplicity, *V*_{α} is the ECI, and $\vec{\sigma}$ is the occupation vector. Φ_{α} is the cluster basis function, which for a binary system takes the simple form of

$$\Phi_{\alpha}(\vec{\sigma}) = \prod_{p \in \alpha} \sigma_{p}, \tag{2}$$

where *p* are site indices and *σ*_{p} = ±1 depending on which component occupies site *p*. Appropriate basis functions for systems with more components can also be generated according to, e.g., Ref. 14. We note that the validity and formal approximations involved in mapping a realistic alloy model with vibrations and intra-atomic excitations to an on-lattice model have been explored rather thoroughly by Ceder.^{11}
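To make the binary basis functions concrete, the following minimal sketch evaluates Eqs. (1) and (2) on a toy lattice stored as a dictionary of ±1 occupation variables. The helper names are hypothetical illustrations, not part of any cluster-expansion package:

```python
def cluster_correlation(spins, cluster_instances):
    """Lattice average <Phi_alpha>: the product of occupation variables
    sigma_p = +/-1, averaged over all symmetry-equivalent instances of
    one cluster alpha."""
    total = 0
    for cluster in cluster_instances:
        prod = 1
        for site in cluster:
            prod *= spins[site]
        total += prod
    return total / len(cluster_instances)


def ce_energy(spins, expansion):
    """Cluster-expansion energy as in Eq. (1): a sum over clusters alpha of
    multiplicity * ECI * <Phi_alpha>.  `expansion` is a list of
    (multiplicity, eci, cluster_instances) triples."""
    return sum(m * v * cluster_correlation(spins, insts)
               for m, v, insts in expansion)
```

For a four-site ring with alternating occupations, the nearest-neighbor pair correlation is −1, so a single pair term with ECI *V* contributes −*V* to the energy.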

The ECIs are usually fitted to energies of relaxed structures from small-scale DFT calculations and are then used to calculate energies of a much larger supercell in the course of MC sampling calculations. Although this cluster expansion approach has seen much success, especially for metallic alloys, certain limitations and difficulties have also been pointed out over the years. For example, it is known that the accuracy of the cluster expansion energy prediction degrades when there is significant lattice relaxation.^{18} In fact, cluster expansion with constant ECIs is, in a way, a zeroth order approximation and formally cannot provide an exact fitting, including relaxation effects.^{15} Another difficulty is in choosing the finite number of clusters that will give the best predictions. For example, Seko and Tanaka reported a spurious prediction of a discontinuous order–disorder transition in an oxide system with long-range interactions when choosing clusters that perform best within limited supercell sizes.^{19} Still another issue is in how to avoid undesirable training set bias. These issues are, in fact, highly nontrivial, and developing robust methods for cluster and training set selection is still an area of active research.^{13,17,19–21} Also, cluster expansion becomes computationally demanding in a combinatorial way as the number of components and sublattices increases, and this limits the number of clusters that can be included in the expansion.^{16}

In view of these issues, we as well as some other workers have opted to bypass fitted models and sample directly on DFT energies.^{22–25} Using more sophisticated MC schemes that are suited for parallel computation, such as Wang–Landau^{26} or replica exchange Monte Carlo (RXMC)^{27} sampling, sufficient statistical sampling has been achieved on calculation models with up to a few hundred atoms. However, our works^{24,25} required weeks of calculations on up to 100 supercomputer nodes, and those by others are also likely to be as costly although wall times are seldom reported in such papers. Thus, much acceleration is needed if this type of approach is to be used in a high-throughput setting for materials design. Toward this end, we consider, in this work, the so-called “active learning” approach, which has been developed to accelerate first-principles molecular dynamics simulations or MC calculations.^{28–36} The basic idea is to use machine-learning potentials (MLPs) that have been fitted to DFT results to accelerate the calculations. Since MLPs are usually good at interpolating but not at extrapolating, a relearning is performed when the system wanders into a previously unlearned region of structure space, then the simulation is restarted with the newly tuned potential. Here, we apply this idea to the lattice configuration problem.

Many forms of MLPs have been proposed in the literature;^{37} in this work, we employ the high-dimensional neural network potential (NNP) scheme proposed by Behler and Parrinello (BPNNP).^{38–41} This scheme assumes that the total energy can be decomposed into atomic energies determined by the environment around each atom and uses neural networks to fit these atomic energies. That is, the total energy is expressed as

$$E(\vec{\sigma}) = \sum_{i} \mathrm{NNP}_{t_i}\!\left(f\!\left[\vec{\sigma}_i^{R_c}\right]\right), \tag{3}$$

where the configuration vector $\vec{\sigma}$ is represented by continuous atom coordinates in contrast to the cluster expansion [Eq. (1)], where $\vec{\sigma}$ represents occupations on a discrete lattice. *t*_{i} represents the atom type, meaning that a unique NNP is fitted for each atom type. $\vec{\sigma}_i^{R_c} \subset \vec{\sigma}$ represents coordinates of atoms within a cutoff radius *R*_{c} of atom *i*, and *f* is a fingerprint function that transforms $\vec{\sigma}_i^{R_c}$ into a multidimensional descriptor of the atomic environment. To be physically meaningful and transferable, *f* must be chosen so that the resulting descriptor is invariant with respect to translation and rotation of the system. To this end, several fingerprinting schemes have been devised, including *symmetry functions*,^{38–40} *smooth overlap of atomic positions* (*SOAP*),^{42} and *Deep Potential*.^{43,44} In this work, we employ the Chebyshev polynomial-based fingerprint proposed by Artrith *et al.*;^{45} the effectiveness of this fingerprint function has been demonstrated, especially for multicomponent systems. There are many other local fingerprints or MLP forms as well as the more recently proposed graph network representations that show potential for enabling construction of universal models for energies, forces, and/or other physical properties covering the entire Periodic Table.^{46–50} The effectiveness of such alternatives for our purposes is certainly of interest and may be considered in future works.
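As an illustration of the radial part of such a fingerprint, the sketch below expands a cutoff-weighted list of neighbor distances in Chebyshev polynomials. This is a simplified reading of the idea in the scheme of Artrith *et al.*, not the production ænet implementation, which uses a different normalization and adds species-weighted and angular channels:

```python
import numpy as np


def cutoff_fn(r, rc):
    """Smooth cosine cutoff: 1 at r = 0, decaying to 0 at r = rc."""
    return np.where(r < rc, 0.5 * (np.cos(np.pi * r / rc) + 1.0), 0.0)


def chebyshev_rdf_descriptor(distances, rc=8.0, order=16):
    """Expansion coefficients of the cutoff-weighted radial distribution
    around one atom in Chebyshev polynomials T_0..T_order.

    Each neighbor distance r is mapped to x = 2r/rc - 1 in [-1, 1], and
    T_n(x) = cos(n * arccos(x)) is accumulated with the cutoff weight."""
    r = np.asarray(distances, dtype=float)
    x = np.clip(2.0 * r / rc - 1.0, -1.0, 1.0)
    w = cutoff_fn(r, rc)
    return np.array([np.sum(np.cos(n * np.arccos(x)) * w)
                     for n in range(order + 1)])
```

A single neighbor at half the cutoff (x = 0, weight 0.5) gives coefficients 0.5, 0, −0.5 for T_0, T_1, T_2, since T_1(0) = 0 and T_2(0) = −1.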

The main distinguishing point of this work compared to the literature on BPNNPs is that we train the NNP model to predict the total energies of relaxed structures from unrelaxed ideal lattice structures with configurational disorder. This is quite different from the usual approach of learning the total energies as a function of continuous atom coordinates (the only exception we are aware of is Ref. 51, which used the NNP scheme to describe the relaxation energy when a single Cu interstitial is dissolved in amorphous Ta_{2}O_{5}). The obvious merit compared to conventional NNP approaches is that we can bypass structural relaxation, while the demerit is that we lose access to ion dynamics, and that we cannot expect transferability to other lattice structures. Since relaxation calculations in a moderately sized supercell can sometimes take hundreds of steps, the speedup attained in this manner can be significant. On the other hand, the above-mentioned demerits are not an issue when using the NNP model for MC sampling on a lattice (it should be noted that MC sampling in continuous coordinate space using random trial displacements is quite inefficient because atoms are more or less close-packed and such trial steps will almost never lead to exchange of atom sites^{24}).

The task of predicting properties from unrelaxed structures is also of importance in a slightly different context of high-throughput materials search using universal graph network models. For example, the Open Catalyst Project identifies “Initial Structure to Relaxed Energy (IS2RE)” as a critical task for catalyst design.^{52,53} Traditional graph networks performed poorly at predictions based on unrelaxed structures because they were trained on relaxed structures to predict, e.g., formation energies or adsorption energies.^{46,48} However, the issue is being tackled through various approaches, including Bayesian optimization utilizing crystal symmetry^{54} or augmentation of the training data with unrelaxed structures.^{55} We tentatively suggest that the on-lattice mapping approach presented in this work may also be useful in this context. Alternatively, one can utilize more recent graph network models that can predict interatomic forces^{47,49,50} to perform the relaxation procedure, but this can be more than two orders of magnitude slower than a direct approach, as already mentioned.

In a sense, our approach is a “lazy” alternative to the cluster expansion method without the need to explicitly consider optimized bases, i.e., clusters; there is also no need to perform any explicit basis reduction based on lattice symmetry. This circumvents various issues associated with choosing the finite number of clusters to be used for prediction that were mentioned above. Also, promising atom–environment descriptors have been proposed for treating multicomponent and multi-sublattice systems without combinatorial explosion,^{45} which is an unsolved issue for cluster expansion (we note that graph networks are also promising in this regard^{50}). We can also use the same fingerprint functions for bulk, surface, and interface systems, while cluster expansion requires more clusters for non-bulk systems.^{56} Another merit is that the nonlinearity of the NNP model can lead to better convergence, in practice, compared to linear models (such as the cluster expansion).^{45} This may lead to more accurate description of systems with significant lattice relaxation where cluster expansion exhibits some difficulties.^{18} We will explore this aspect in the following sections. A remaining issue is in the choice of input structures for training, which is just as problematic as other approaches. In fact, we demonstrate a rather spectacular failure when training only on randomly generated configurations and then show how well the active learning approach can solve this issue.

As a side note, we point out that the NNP can be directly trained to predict the energy as a function of the configuration vector $\sigma \u20d7$, as was done in Ref. 57. However, the transferability of the NNP to larger supercells in such a scheme is nontrivial, and thus we advocate the use of atom-centric fingerprint functions. It is also possible to perform nonlinear cluster expansion, i.e., fitting the total energy as a function of cluster correlations using a neural network.^{58} Yet another successful model is the low rank potential, which expresses the interatomic potential as a low rank decomposition of a tensor whose indices specify the local atomic environment.^{33,59} We do not claim that the approach presented in this work is an overall better method, as it is difficult to compare overall performance, including calculation speed, accuracy, and ease-of-use on equal footing. However, as noted above, cluster selection or explicit mapping to a tensor-like expression required in other schemes can become intractable with increasing complexity of the system. Thus, we believe that the current approach will enable configurational sampling in complex bulk or interface systems with many components and sublattices, i.e., those systems where it is technically difficult to employ previously proposed approaches.

## II. MODEL AND METHODS

### A. Benchmark system: Spinel oxides

We demonstrate the above idea on the calculation of the degree of A/B-site inversion in a few oxides with spinel-type structures. Spinel oxides share the general formula AB_{2}O_{4} with A (B) representing a divalent (trivalent) cation and are of interest in mineralogy as well as technological applications, including magnets, photocatalysis, and battery materials.^{60–62} In an “ordered” or “normal” spinel, all divalent cations occupy the tetrahedrally coordinated sites, and all trivalent cations occupy the octahedral sites. However, it is known that spinels with various divalent/trivalent cation combinations can exhibit significant deviation from the “normal” occupation, and this is often quantified by the degree of inversion (DOI) defined as the fraction of A-sites (tetrahedral) occupied by B (trivalent) cations. The DOI corresponding to a completely random occupation is 2/3, while a nearly “inverse” occupation with DOI of nearly unity is also known for some compounds. Prediction of the DOI as a function of temperature is of interest because of its potential impact on the magnetic, optical, thermal, and electrochemical properties of the material.
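In code, the DOI is simply a site-occupancy fraction. A minimal sketch (hypothetical helper; occupants given as chemical symbols):

```python
def degree_of_inversion(tet_site_occupants, b_species):
    """Fraction of tetrahedral (A) sites occupied by trivalent B cations.

    tet_site_occupants: list of species symbols on the A sublattice.
    b_species: symbol of the trivalent cation (e.g., "Al" in MgAl2O4)."""
    n_b_on_a = sum(1 for s in tet_site_occupants if s == b_species)
    return n_b_on_a / len(tet_site_occupants)
```

A normal spinel gives a DOI of 0, an inverse spinel gives 1, and a fully random cation arrangement gives 2/3 on average.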

In this work, we calculate the degree of inversion in MgAl_{2}O_{4}, ZnAl_{2}O_{4}, and MgGa_{2}O_{4} and compare to a series of cluster expansion works by Seko *et al.*^{13,19,20,63} The former two are known to be nearly normal, while significant inversion has been reported for MgGa_{2}O_{4}. Spinels constitute a prototypical multivalent system (i.e., A^{2+} and B^{3+} share lattice sites) where a naive cluster expansion can lead to large qualitative errors due to overfitting in a small calculation cell; augmentation with a screened point charge model (CE-SPCM) was found to be necessary to obtain accurate predictions without considering rather long-range pair clusters in Ref. 19. Also, the feasibility of avoiding cluster expansion fitting and sampling directly on DFT-relaxed energies was demonstrated in combination with RXMC sampling in Ref. 24 with a 48-cation model. Here, we consider a 192-cation model, which is completely beyond the reach of direct DFT sampling for two reasons: (1) the longer time required for each DFT calculation and (2) the *much* larger configurational space that has to be sampled. Regarding (2), the number of cation configurations for the 48-cation model is _{48}C_{16} ∼ 2 × 10^{12}, while that for the 192-cation model is _{192}C_{64} ∼ 7 × 10^{51}. This explosion in the configuration space should attest to the curse of dimensionality that we are fighting against, although there is actually no need to perform calculations on all of these configurations if good sampling schemes (such as RXMC) are available.
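The configuration counts quoted above are plain binomial coefficients (choosing which cation sites host the divalent species) and can be checked directly:

```python
from math import comb

# 48-cation model: choose 16 A-cation sites out of 48 -> ~2e12 configurations
n_48 = comb(48, 16)

# 192-cation model: choose 64 A-cation sites out of 192 -> ~7e51 configurations
n_192 = comb(192, 64)

print(f"{n_48:.2e}, {n_192:.2e}")
```

Note that these counts ignore symmetry equivalence; they are upper bounds on the number of distinct configurations, which is what matters for the sampling burden.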

### B. DFT calculations

The reference DFT calculations were performed using the Vienna Ab initio Simulation Package (VASP) code.^{64,65} We employed the projector augmented wave method^{66} to describe electron–ion interactions. A plane wave basis set with an energy cutoff of 400 eV was used to expand the wave functions. The Brillouin zone was sampled only at the Γ point. The Perdew–Burke–Ernzerhof generalized gradient approximation (GGA-PBE) functional^{67} was used to approximate the exchange-correlation energy. In this work, we relaxed the lattice vectors as well as the internal coordinates until the residual forces were below 0.04 eV/Å. It should be noted that lattice vector relaxation leads to changes in the calculation mesh density, which means that the effective cutoff energy deviates from the preset value. To alleviate this issue, the relaxations were repeated several times so that the final structures and energies are consistent with the cutoff energy of 400 eV.

### C. NNP training and evaluation

The NNP training and subsequent evaluation were performed using the Atomic Energy NETwork (ænet) code.^{45,68} As noted above, the NNP model is trained to predict relaxed energies. This may be represented using a slightly modified version of Eq. (3),

$$E^{\mathrm{rel}}(\vec{\sigma}) = \sum_{i} \mathrm{NNP}_{t_i}^{\mathrm{rel}}\!\left(f\!\left[\vec{\sigma}_i^{R_c}\right]\right), \tag{4}$$

where $E^{\mathrm{rel}}(\vec{\sigma})$ represents the energy after structural relaxation from a starting structure $\vec{\sigma}$, $f[\vec{\sigma}_i^{R_c}]$ represents atomic environment descriptors for the starting structure, and $\mathrm{NNP}_{t_i}^{\mathrm{rel}}$ represents atomic energy contributions to the relaxed total energy. The training and use of the NNP are restricted to $\vec{\sigma} \in \{\vec{\sigma}_{\mathrm{lattice}}\}$, where $\{\vec{\sigma}_{\mathrm{lattice}}\}$ represents the set of ideal lattice structures with configurational disorder. Equation (4) states that the relaxed energies can be calculated as a sum of atomic contributions, which, in turn, can be calculated from coordinates corresponding to the ideal lattice. This is an ansatz that we are making. In general, it is impossible to prove the exactness of a certain machine learning model in reproducing a physics model, and our approach is no exception. This is in contrast to cluster expansion for which a rigorous proof exists for mapping fully relaxed energies to a lattice-based Hamiltonian.^{11,15} The ansatz will need to be verified for each system under study, and active learning is an ideal approach in this regard.

We used the fingerprinting scheme by Artrith *et al.*,^{45} where the radial and angular distribution functions (RDF and ADF) with an appropriately chosen cutoff are expanded by an orthogonal basis set based on Chebyshev polynomials, and the expansion coefficients are fed into the neural network as the input descriptors. A key point in their scheme is that the structural and compositional descriptors are given separately; the structural descriptor is given by the usual RDF and ADF, and the compositional descriptor is also given by RDF and ADF but with atomic contributions weighted differently for each chemical species. In this work, we expand the RDF with a cutoff of 8.0 Å and expansion order 16 and the ADF with a cutoff of 6.5 Å and expansion order 4. The employed neural network thus has an input layer with 44 nodes,^{69} and we chose to use two hidden layers with 15 nodes each and the tanh activation (there is also an additional bias node to accommodate arbitrary references for the energy^{68}). The neural network weights were optimized using the bounded limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm.^{70,71} The training epochs were monitored, and training was terminated when the error on a separate validation set began to increase. We note that it is necessary to evaluate the performance of the model on an additional dataset that is separate from the validation data used for early stopping of the training epochs. This is done, in a sense, through the active learning procedure detailed below.
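The early-stopping logic used here is generic and can be sketched as follows. The function names (`step_fn`, `val_error_fn`) are hypothetical placeholders, not the ænet API:

```python
def train_with_early_stopping(step_fn, val_error_fn, patience=10,
                              max_epochs=10000):
    """Run training epochs and stop when the validation error has not
    improved for `patience` consecutive epochs.

    step_fn(epoch):   performs one optimization epoch on the training set.
    val_error_fn():   returns the current error on the held-out validation set.
    Returns the epoch with the best validation error and that error."""
    best_err, best_epoch, stale = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        step_fn(epoch)
        err = val_error_fn()
        if err < best_err:
            best_err, best_epoch, stale = err, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_err
```

In practice, one would also keep a copy of the weights from the best epoch and restore them after stopping.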

### D. Replica exchange Monte Carlo sampling combined with active learning

The configurational sampling was performed using the replica exchange Monte Carlo (RXMC) method combined directly with NNP evaluation using ænet. The RXMC method employs several copies, or replicas, of the system under study. Each replica is sampled using the usual Metropolis MC algorithm at a different temperature. At preset intervals, temperatures are swapped between replicas according to the Metropolis criterion with the following probability,

$$P_{\mathrm{swap}}(i, j) = \min\left[1, \exp\left\{(\beta_i - \beta_j)(E_i - E_j)\right\}\right]. \tag{5}$$

Here, *i* and *j* are the replica indices, *β* is the inverse temperature 1/(*k*_{B}*T*), and *E* is the energy of the replica. Higher temperature replicas are given the task of global exploration of the energy landscape, while lower temperature replicas basically perform local optimization. The swapping of temperatures occurs when a higher temperature replica finds a new energy minimum; local optimization is then performed at the lower temperature. The detailed theory behind RXMC is given in Ref. 27. Our previous studies^{24,25} may also be helpful for understanding the basic concept. A parallelized code for RXMC calculations in combination with DFT codes as well as ænet is available as part of the *ab initio* Configuration Sampling Toolkit (abICS) at https://github.com/issp-center-dev/abICS. In this work, 15 replicas of the system were sampled in parallel with temperatures ranging from 400 to 1900 K, and temperature exchange was attempted every four steps to speed up the global sampling. Equilibration was performed for 80 000 steps, and then the DOI, which is calculated as the ratio of Al ions on Mg sites, was averaged from 400 000 steps to obtain the temperature dependence.
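The swap acceptance test above is a one-liner in practice. A minimal sketch (hypothetical helper; β and E in consistent units):

```python
import math
import random


def attempt_temperature_swap(beta_i, beta_j, e_i, e_j, rng=random.random):
    """Metropolis acceptance for exchanging temperatures between two
    replicas: accept with probability min(1, exp[(beta_i - beta_j)(E_i - E_j)])."""
    delta = (beta_i - beta_j) * (e_i - e_j)
    return delta >= 0.0 or rng() < math.exp(delta)
```

For example, when the hotter replica (smaller β) holds the lower energy, delta is positive and the swap is always accepted, which is exactly the mechanism that carries newly found minima down to lower temperatures.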

The NNPs used in the RXMC sampling were refined iteratively using the active learning scheme shown in Fig. 1. We note that a very similar process using an effective pair interaction model was reported recently.^{72} First, we train an NNP set with randomly generated samples and then use those NNPs to perform RXMC sampling. In subsequent iterations, we sample a preset number of configurations from the previous RXMC run and perform DFT relaxations on those configurations. Comparison of the DFT energies with the NNP predictions serves as the validation procedure for the machine-learning model, although the validation dataset will be biased somewhat as we discuss below. Discrepancies between DFT energies and NNP predictions signify that the simulation has wandered into previously unlearned regions of configuration space. We then add those DFT results to the training set, retrain the NNPs, and run RXMC sampling again. This process is repeated until convergence is achieved in the physical quantities of interest. In this work, we sampled 300 structures from the first 6400 RXMC steps in each active learning iteration. The final DOIs were calculated from a separate longer RXMC run with 480 000 RXMC steps (80 000 for equilibration and 400 000 for thermodynamic averaging) as mentioned above.
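The cycle just described can be summarized in a short skeleton. All function names here are hypothetical placeholders; in practice the steps map onto DFT relaxations (e.g., VASP), ænet training, and the abICS RXMC driver:

```python
def active_learning_loop(random_configs, dft_relax_energy, train_nnp,
                         run_rxmc, sample_configs, n_iter=3):
    """Skeleton of the active-learning cycle:
    1. Train an initial NNP on randomly generated configurations.
    2. Run RXMC sampling with the current NNP.
    3. Sample configurations from the RXMC trajectory and relax them with DFT.
    4. Add the DFT results to the training set, retrain, and repeat."""
    training_set = [(c, dft_relax_energy(c)) for c in random_configs]
    nnp = train_nnp(training_set)
    for _ in range(n_iter):
        trajectory = run_rxmc(nnp)
        new_configs = sample_configs(trajectory)
        training_set += [(c, dft_relax_energy(c)) for c in new_configs]
        nnp = train_nnp(training_set)
    return nnp, training_set
```

The loop terminates after a fixed number of iterations here; in the actual workflow, iteration continues until the physical quantity of interest (the DOI) stops changing.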

We note that the above procedure biases the training dataset toward thermodynamically relevant configurations predicted by the NNP model in that active learning step. The thermodynamic biasing, in itself, is usually not a bad thing since we are often only concerned with thermodynamically relevant, i.e., realistic configurations. However, the resulting NNPs cannot be expected to perform well when used to predict energies of configurations with both low entropy and high energy, since such configurations are thermodynamically unstable over all temperatures and rarely visited in the RXMC procedure. Such an example in the case of AB_{2}O_{4} spinels considered in this work would be a phase-separated configuration with “A” cations in one-half of the cell and “B” cations in the other half. If, for some reason, we want to correctly reproduce the energy of such situations, we would have to add such configurations to the dataset, or we may accumulate training data using active learning with varying compositions. The performance of the on-lattice NNP model for varying compositions is tested below.

On the other hand, the fact that we are relying on an inaccurate NNP model to generate the training set during the active learning iterations may pose issues with uncontrolled biasing of the training data. In the case where the model underestimates the energy of certain configurations, the RXMC algorithm at lower temperatures will bias the sampling toward those configurations so that the underestimation is quickly corrected in the subsequent iteration. The biggest foreseeable problem is when the model overestimates the energy. This is covered to some extent by setting a high enough temperature for the RXMC sampling so that configurations with higher predicted energies will still be sampled. It is also noted that RXMC in each iteration is initialized with random configurations, i.e., with an effective temperature of infinity. Ultimately, it is necessary to evaluate if such issues are severe through computational experiments, which we present in the following sections.

We should also mention that we are using the term “active learning” in a rather broad sense that the training set configurations are not given *a priori* but are chosen according to the given calculation procedure. In our case, this is done through thermodynamic importance sampling as explained above. On the other hand, many works in the active learning MLP literature employ various uncertainty estimates to determine if a structure met during simulation is in an extrapolation region and should be added to the training set.^{28,36} The efficacy of using such uncertainty indicators to further limit the number of expensive *ab initio* calculations may be explored in the near future.

## III. RESULTS AND DISCUSSION

### A. Active learning

We started the active learning process with 300 randomly chosen configurations. DFT relaxation and energy calculations were performed on these configurations, and then an NNP model was trained to predict relaxed energies from the input configurations. The oxygen sublattice is neglected in the training process since the input structures all have identical oxygen coordinates and do not provide any useful information for the prediction. We used 90% of input structures for training and 10% for validation-based early stopping. For the spinel systems studied in this work, we typically obtained mean absolute errors (MAEs) and root mean squared errors (RMSEs) of less than a few meV/cation in the training and validation sets, which is close to the accuracy limitations of the DFT calculations themselves. However, such low fitting error does not guarantee high accuracy in predicting the energies of structures obtained in the RXMC sampling. This was checked by performing DFT relaxations on the structures encountered during RXMC sampling with the NNP. As shown in Fig. 2, the NNP trained on random samples (indicated by blue circles in Fig. 2) shows poor accuracy, especially in predicting the energies of the lower energy configurations. This is because the model was trained on random configurations with higher energies, and the training set did not include lower energy structures with less disorder (please note that the splitting of samples between higher and lower reference DFT energies seen here and in subsequent figures originates from our sampling procedure and is not an inherent property of the system^{73}). Increasing the number of random samples does not improve the results as we show later. The importance of having configurations with varying degrees of order in the training set has already been pointed out in the cluster expansion literature. A suggested solution was to perform a grouping of input structures based on correlation functions (which roughly corresponds to the degree of disorder) and to perform cross validation on each group during the fitting process.^{20} Here, we take an alternative approach, i.e., active learning, as was detailed in Sec. II D. The predictions at lower energies progressively improve with the number of active learning iterations because of the thermodynamic biasing mentioned above; the prediction RMSE for the three spinel systems studied here converged to within 0.5 meV after the third iteration. This is clearly smaller than the ∼10 meV/cation reported in Ref. 19 using cluster expansion, although it may not be a fair comparison because of the difference in the dataset used to calculate prediction errors.

Convergence is also attained in the physical quantity of interest, i.e., the DOI, as shown in Fig. 3. For the three spinel oxides treated here, 2 or 3 active learning iterations with 300 DFT relaxation calculations each were sufficient to obtain a converged DOI plot (Fig. 3, purple diamonds), which looks completely different from that predicted using NNPs trained on randomly generated configurations (Fig. 3, blue circles). The final DOI plots are in good agreement with most of the cluster expansion literature as well as our previous work employing direct RXMC sampling on DFT energies (Fig. 4). There is, however, a sizable deviation for MgGa_{2}O_{4}, and there is one cluster expansion work for MgAl_{2}O_{4} that deviates from the others and shows a discontinuous transition around 600 K^{13} (denoted by “Seko2009” in Fig. 4). As for MgGa_{2}O_{4}, the current work predicts a DOI that is about 0.1 smaller than cluster expansion in Ref. 63. Our result is within the range of most of the available experimental data that report 0.75,^{74} 0.67,^{75} 0.84–0.90,^{76} and 0.81–0.84,^{77} although the first work that put forth the idea of an “inverse spinel” reported unity, i.e., complete inversion.^{78} The discrepancy may come from two sources apart from the difference in the employed models (i.e., cluster expansion and NNP): the supercell size used in the training stage and that used in the MC sampling stage. The cluster expansion literature has performed ECI fitting in supercells with 24 cation sites or smaller, while they performed MC simulations on supercells with as many as 48 000 cations. Here, both the training and sampling were performed on 192-cation supercells. The impact of these supercell sizes on the results is discussed in more detail in Sec. III B. As for “Seko2009,”^{13} the discontinuous transition was reported in a later work to be a spurious result due to truncation of the range of pair clusters in the cross validation process.^{19}

To confirm that the active learning procedure is indeed more efficient than random sampling, we performed DFT relaxation on 1200 randomly generated configurations of MgAl_{2}O_{4}, trained the NNP on those configurations, and compared the results to those from active learning with the same number of DFT calculations [1 random + 3 active learning iterations, which was enough to converge the DOI results as shown in Fig. 3(a)]. The training on the random dataset was performed independently twice with different training/validation data separation and initial neural network weights. As shown in Fig. 5(a), the energy predictions from NNPs trained on random configurations deviate from the DFT value for configurations with lower energies, and it is also clearly less accurate than the NNP trained through the active learning procedure. Again, this is not surprising because lower energy configurations are seldom included in the randomly generated training set. Moreover, the deviation behavior is rather uncontrolled as indicated by different predictions in the energies and the DOIs [Fig. 5(b)] for the two independent training runs; thus, it is virtually impossible to obtain the correct temperature-dependent results with only random generation of the training dataset.

Savings in computer time achieved by the current approach turned out to be quite significant compared to sampling directly on DFT energies. On the Xeon E5-2680v3 central processing unit (CPU), the NNP prediction using 1 CPU core is a few hundred to a few thousand times faster than DFT relaxation using 216 cores, depending on the number of relaxation steps necessary for convergence. We also note that the number of necessary DFT calculations turned out to be orders of magnitude smaller than the number of NNP evaluations, i.e., RXMC steps, necessary for converging the DOI. A converged DOI was obtained for ZnAl_{2}O_{4} and MgGa_{2}O_{4} at the first active learning iteration, while the DOI for MgAl_{2}O_{4} was converged at the third active learning iteration (Fig. 3), which correspond to 600 and 900 DFT relaxations, respectively. On the other hand, ∼7 200 000 NNP evaluations (=480 000 MC steps/replica × 15 replicas) were necessary to obtain a converged DOI vs temperature. The ∼1000 DFT calculations can be completed within a few hours on a modern supercomputer system since the calculations can be performed completely in parallel. The RXMC calculations took roughly one day using 15 CPU cores, which would fit in one node of a modern workstation. We also note that the RXMC simulations were initialized with random configurations, and yet they still found the single ground state ordered spinel configuration out of the ∼10^{51} possible configurations for the normal spinels MgAl_{2}O_{4} and ZnAl_{2}O_{4}. This attests to the effectiveness of modern sampling methods in battling the curse of dimensionality.

We also tested active learning cycles with smaller numbers of DFT calculations to examine whether we can attain similar accuracy with less data. The prediction RMSEs for MgAl_{2}O_{4} were found to degrade slightly; after the first active learning cycle, the RMSEs were 0.9, 1.3, and 1.7 meV/cation for 300, 150, and 75 DFT calculations per cycle (the total numbers of DFT data are 600, 300, and 150, respectively, including the initial random configurations). The values improved to 0.7, 0.8, and 1.2 meV/cation after the second cycle. This means that we cannot attain RMSEs below 1 meV/cation with fewer than about 600 DFT calculations for this system. Nonetheless, the necessary amount of DFT training data should depend on the complexity of the system under study and will need to be tested for each case.

The fact that fewer than 1000 reference samples were necessary to obtain an accurate NNP model might be surprising, as previous studies on NNPs usually report reference sets at least one order of magnitude larger (see, e.g., Refs. 44 and 68). This is most likely because the input configuration space is restricted to $\vec{\sigma} \in \{\vec{\sigma}_{\mathrm{lattice}}\}$, and this is a clear merit of our approach. It should be pointed out, however, that each of our reference samples is the result of a DFT relaxation calculation, which comes at an additional cost compared to conventional NNP approaches that can be trained on structures without relaxation. Still, it should be *much* easier to attain sufficient sample coverage in this restricted configuration space. The restriction also makes the problem easier for the neural network, as it needs to distinguish between *much* fewer structures. These facts, combined with the ability to bypass lattice relaxation, make the present approach more suitable for the lattice configuration problem than conventional continuous-coordinate NNP methods. That is not to say that those methods cannot be applied to systems with configurational disorder. In fact, the *Deep Potential* approach has been applied to randomly generated high entropy alloys^{44,79} with promising results, and it can provide vibrational properties while the present approach cannot. However, the computational expense for NNP training and simulation of configurational order/disorder behavior will be much larger than in our lattice-restricted approach, and such methods are yet to be applied to this type of calculation.
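The overall active learning workflow discussed in this section can be sketched as a simple loop. This is a toy illustration, not the abICS API: every function below (`dft_relax_energy`, `train_nnp`, `run_rxmc_and_sample`) is a hypothetical stand-in, with the DFT oracle and the RXMC sampler replaced by cheap surrogates so that only the loop structure is conveyed.

```python
import random

rng = random.Random(42)

def dft_relax_energy(config):               # stand-in for a DFT relaxation run
    return sum(config) + 0.01 * rng.random()

def train_nnp(dataset):                     # stand-in for NNP fitting
    mean = sum(e for _, e in dataset) / len(dataset)
    return lambda config: mean              # "model" that predicts the mean energy

def run_rxmc_and_sample(nnp, n):            # stand-in for RXMC sampling + thinning
    return [[rng.randint(0, 1) for _ in range(8)] for _ in range(n)]

def active_learning(n_cycles=3, n_per_cycle=5):
    # cycle 0: label randomly generated configurations with the "DFT" oracle
    configs = run_rxmc_and_sample(None, n_per_cycle)
    dataset = [(c, dft_relax_energy(c)) for c in configs]
    for _ in range(n_cycles):
        nnp = train_nnp(dataset)                            # refit on all labels so far
        new = run_rxmc_and_sample(nnp, n_per_cycle)         # sample with the cheap model
        dataset += [(c, dft_relax_energy(c)) for c in new]  # label new samples
    return train_nnp(dataset), dataset

model, data = active_learning()
print(len(data))  # (n_cycles + 1) * n_per_cycle = 20 labeled configurations
```

The key design point mirrored here is that each cycle's training set is drawn from RXMC sampling on the current model, so thermodynamically relevant (low-energy) configurations are progressively added rather than relying on random generation alone.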

### B. Convergence vs supercell size used for training and Monte Carlo sampling

The usual approach in the NNP and cluster expansion literature is to use a set of many small-scale DFT calculations to fit the model and then use that model to accelerate calculations in an expanded supercell. In doing so, one must be careful that (1) the fitting procedure does not lead to overfitting in the small cells at the cost of transferability to larger cells not contained in the training set, and (2) thermodynamically relevant correlations are contained within the training set supercell size. These points are especially important here since spinel oxides are multivalent ionic oxides where significant long-range interactions are expected.

In the case of cluster expansion, cross validation is usually performed within the training set to truncate the number of clusters to a computationally tractable amount, but for systems such as MgAl_{2}O_{4}, this procedure can lead to large prediction errors for structures with longer periods than those in the training set.^{19} This led to a spurious prediction of a discontinuous order–disorder transition in MgAl_{2}O_{4} at ∼600 K in Ref. 13 (denoted by “Seko 2009” in Fig. 4); it was found to be necessary to either forgo the truncation procedure and use as many long-distance pair clusters as possible, or to use fewer pair clusters and augment the cluster expansion Hamiltonian with a screened Coulomb term, which led to the prediction of a more continuous transition in Ref. 19 (“Seko 2014” in Fig. 4). It is difficult to foresee whether similar issues may arise in our approach. Although there is no explicit truncation procedure, neural networks are prone to overfitting to the available data; the network might decide that interactions beyond the training set supercell dimensions are unimportant and effectively truncate the range of interactions. Even if it does not, we would have to check whether the cutoff radii of 8 Å for the radial descriptor and 6.5 Å for the angular descriptor are long enough for sufficient precision.

To check for such issues, we examined the convergence behavior against the supercell size used for training and MC sampling as follows: First, we performed active learning of the NNP model in a conventional cell of cubic spinel having 24 cation sites, in addition to the 192-cation model examined above. Two active learning iterations were sufficient to converge the results for the 24-cation model. Then, we performed RXMC sampling calculations on the 24-cation, 192-cation, and 648-cation models using those NNP models. This means that we performed six RXMC calculations, resulting in DOI data that can be indexed by the number of cation sites in the training and sampling steps as DOI(*N*_{train}, *N*_{sample}). Figure 6 shows the results for MgAl_{2}O_{4} and MgGa_{2}O_{4}, where the left-hand side plot shows the dependence on *N*_{sample} at *N*_{train} = 192 and the right-hand side plot shows the dependence on *N*_{train} at *N*_{sample} = 648. It is clear that we have attained convergence against the RXMC sampling cell size, as the results for *N*_{sample} = 192 and 648 are virtually identical, and *N*_{sample} = 24 also results in a decent prediction in the case of MgAl_{2}O_{4}. These sampling cell sizes are surprisingly small compared to those employed for MC calculations in the cluster expansion literature (48 000 as mentioned above), but we have obtained essentially identical results for MgAl_{2}O_{4} and ZnAl_{2}O_{4}, as seen in Fig. 4. On the other hand, the convergence with respect to the training supercell size *N*_{train} is more difficult to confirm. Sizable differences are seen between *N*_{train} = 24 and 192, especially for MgGa_{2}O_{4}. This shows that the NNP trained in the 24-cation supercell of MgGa_{2}O_{4} is not capable of reproducing the correct energetics in larger supercells.
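The six runs can be laid out as a small bookkeeping grid (the cell sizes are those stated above):

```python
# DOI(N_train, N_sample): the six RXMC runs performed in this convergence check.
train_sizes = (24, 192)          # cation sites in the training supercell
sample_sizes = (24, 192, 648)    # cation sites in the RXMC sampling supercell
runs = [(nt, ns) for nt in train_sizes for ns in sample_sizes]
print(len(runs), runs)  # 6 combinations, e.g., (192, 648)
```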
Now, the ideal way to check whether we have achieved convergence at *N*_{train} = 192 would be to perform active learning on a larger supercell, but that would be prohibitively expensive in terms of computer resources. As a compromise, we decided to test the accuracy of the *N*_{train} = 192 NNP model by performing DFT relaxations on a small subset of the 648-cation configurations from the (*N*_{train}, *N*_{sample}) = (192, 648) calculation. As shown in Fig. 7, the *N*_{train} = 192 NNP turns out to be quite accurate in predicting the energies in the 648-cation cell. Thus, we can be fairly confident that the results are converged with respect to the training set supercell size at 192 cations. The discrepancy of the MgGa_{2}O_{4} DOI with Ref. 63 is at least partly due to the fact that they employed only up to 24-cation supercells in the fitting of the ECIs.

These results show that regardless of the model used (NNP or cluster expansion), the training set needs to include structures with periods that are long *enough* for the system under study. Also, if the model includes hyperparameters pertaining to the range of interactions, then those must be set long enough as well. For the spinel oxide compounds examined here, the periods were found to be rather long but still tractable with the help of modern-day supercomputers. This may be surprising because our NNP model (or cluster expansion, for that matter) does not consider truly long-range Coulomb interactions along the lines of, e.g., Ref. 80, as all explicit interactions are truncated by the cutoff in the descriptors. However, our results suggest that the cutoff is long enough, especially for the lower energy or higher entropy structures encountered during the Monte Carlo procedure, which should have relatively small polarization. The structural relaxation should also reduce the required cutoff due to dielectric screening.

### C. Robustness against relaxation

One point worth stressing is that the success of the present approach in predicting the relaxed energies from ideal lattice structures is not a trivial consequence of negligibly small relaxation energies in this system. As shown in Fig. 8 for MgAl_{2}O_{4}, the relaxation energies are roughly one order of magnitude larger than energy differences between relaxed configurations (notice the difference in scales of the horizontal and vertical axes). Also, although the relaxation energies are correlated with energies before relaxation, the correspondence is not one-to-one and the scatter is as large as 50 meV/cation. This is clearly significant when comparing against relaxed energy differences between distinct configurations.

We also examine the magnitude of relaxation, as it has been shown to be correlated with prediction errors in the cluster expansion literature.^{18} To quantify the amount of relaxation in the three spinel systems on an equal footing, we use the normalized displacement (ND), which we define as

$$\mathrm{ND} = \Delta d / V_0^{1/3},$$

where Δ*d* is the atom displacement upon relaxation and *V*_{0} is the volume per atom *after* volume relaxation; $V_0^{1/3}$ can be interpreted as an average lattice parameter, so the normalized displacement defined in this manner is an indicator of local strain. As shown in Fig. 9, the relaxation in the three spinel oxides is far from negligible. A significant number of atoms show NDs larger than 5%, and the maximum ND reaches ∼19% for ZnAl_{2}O_{4}. The correlation between relaxation and prediction errors is shown in Fig. 10, where the prediction RMSEs for the three systems are plotted vs the mean normalized displacement (MND), which is calculated by averaging the normalized displacements over all atoms from all configurations in the dataset (Fig. 9). Active learning calculations without relaxation were also performed on MgAl_{2}O_{4} to examine the prediction errors not attributable to relaxation. As a result, we see an increasing trend in the errors vs MND, suggesting that relaxation does lead to higher errors. However, the RMSEs are all less than 0.5 meV per atom, which is as good as we can reasonably expect when considering numerical noise in the relaxation process.
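The normalized displacement and its dataset average translate directly into code (a minimal sketch; the displacement and volume numbers here are made up for illustration, not taken from the paper):

```python
def normalized_displacement(delta_d, v0):
    """ND = Δd / V0^(1/3): displacement relative to an average lattice parameter."""
    return delta_d / v0 ** (1.0 / 3.0)

# Illustrative values only: displacements in Å and volume per atom in Å^3
# after volume relaxation.
v0 = 8.0                          # => V0^(1/3) = 2.0 Å "average lattice parameter"
displacements = [0.02, 0.10, 0.36]
nds = [normalized_displacement(d, v0) for d in displacements]
mnd = sum(nds) / len(nds)         # mean normalized displacement (MND)
print([round(x, 6) for x in nds], round(mnd, 6))  # ≈ [0.01, 0.05, 0.18], MND ≈ 0.08
```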

We also note that the normalized mean *squared* displacement (NMSD) was suggested in Ref. 18 as a relaxation metric that correlates somewhat with prediction errors, and they heuristically categorized NMSD values into “Good,” “Maybe,” and “Bad” regions in terms of expected reliability and quality of cluster expansion fitting. The NMSDs for MgAl_{2}O_{4}, ZnAl_{2}O_{4}, and MgGa_{2}O_{4} calculated from the active learning procedure are 0.158%, 0.156%, and 0.147%, respectively, and they would all fall into the “Maybe” category. The fact that we were able to obtain RMSEs of less than 0.5 meV/atom for those systems indicates that our neural network scheme is at least as robust as cluster expansion against relaxation. Ultimately, we would need to accumulate experience in applying this approach to a wider variety of systems, but that is beyond the scope of this work.

### D. Robustness with respect to input precision

As noted in Sec. II C, we designed our NNP model to be trained and used only on coordinates of the ideal lattice $\vec{\sigma} \in \{\vec{\sigma}_{\mathrm{lattice}}\}$. However, there is no way to specify the *exact* lattice coordinates due to the nature of floating point numbers used to represent non-integer values in computers. Thus, to see how robust our method is against deviations in the input coordinates, we examined the predictions of our final NNP model for randomly perturbed lattices (Fig. 11). We find that the predictions are virtually indistinguishable from predictions on the corresponding perfect lattice structures (that is, perfect within the numerical precision of our software stack) when each atom is perturbed by 0.01 Å. The NNP looks usable up to a perturbation of ∼0.05 Å, while at 0.1 Å, we start to observe sizable deviations. A perturbation of 0.01 Å is not particularly small, and it should not be too difficult to make sure that the errors in the input coordinates fall within this range. The robustness may originate from the fact that the NNP maps similar inputs to similar energies and because there are no other similar structures in the training set space except the corresponding ideal lattice structure.
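The perturbation procedure can be sketched as follows. This is a hedged illustration: `perturb` is our own helper and the coordinates are arbitrary demo values; the actual test fed perturbed spinel supercells to the trained NNP and compared predictions against the ideal lattice.

```python
import math
import random

def perturb(coords, magnitude, seed=0):
    """Displace each atom by `magnitude` (in Å) along a random direction."""
    rng = random.Random(seed)
    out = []
    for x, y, z in coords:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]   # random direction from
        norm = math.sqrt(sum(c * c for c in v))       # normalized Gaussian deviates
        out.append((x + magnitude * v[0] / norm,
                    y + magnitude * v[1] / norm,
                    z + magnitude * v[2] / norm))
    return out

ideal = [(0.0, 0.0, 0.0), (1.9, 1.9, 0.0)]            # arbitrary demo coordinates
for mag in (0.01, 0.05, 0.10):                        # magnitudes examined in Fig. 11
    perturbed = perturb(ideal, mag)
    # each atom moves by exactly `mag`; the perturbed structure would then be
    # passed to the NNP and compared with the ideal-lattice prediction
    d = math.dist(ideal[0], perturbed[0])
    print(f"{mag:.2f} Å -> displacement {d:.4f} Å")
```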

### E. Training on varying compositions

Being able to predict energies at varying compositions with a single NNP model is necessary if the model is used for grand canonical simulations or if phase separation occurs in the system of interest, as mentioned in Sec. II D. Thus, we test the capability of our model to reproduce energies at varying compositions, and we also examine the transferability to compositions that are not in the training set by varying the composition during active learning cycles.^{81} First, we trained the NNP model for stoichiometric MgAl_{2}O_{4} (i.e., Mg_{64}Al_{128}O_{256}) in the same manner as in Sec. III A with one random cycle and two active learning cycles (Fig. 12 AL0–AL2). The model was then used for RXMC sampling of Mg_{1.125}Al_{1.875}O_{4} (Mg_{72}Al_{120}O_{256}), and a subset of the samples was recalculated using DFT and compared with the NNP predictions, as shown in Fig. 12 AL3. There is a systematic error in the total energy prediction, but the error is corrected after these configurations are added to the training dataset (Fig. 12 AL4). The stoichiometry was then changed to Mg_{1.0625}Al_{1.9375}O_{4} (Mg_{68}Al_{124}O_{256}), which lies between the two previous compositions, and the prediction accuracy is quite high even before this composition is added to the training set (Fig. 12 AL5). This means that our model is capable of interpolating between compositions. Finally, we checked that the resulting model can reproduce energies of all three compositions, as shown in Fig. 12(b).

### F. Applicability to hard-to-relax or frustrated systems

Finally, we note that for truly complicated systems, local frustration may lead to several local minima when starting the relaxation from ideal lattice structures. This means that the implicitly assumed one-to-one correspondence between lattice configuration and relaxed energy is broken. In such cases, it may be advisable to perform the active learning scheme for a few iterations, then use the final NNP in nested Monte Carlo^{82–87} or upsampling schemes,^{88} where MC or MD steps are carried out with low-cost potentials and then corrected with DFT calculations at preset intervals. These schemes are guaranteed to converge to the same result as sampling based only on DFT energies, and the speedup will depend on the quality of the low-cost potential. The frustration problem mentioned above can be solved by nested DFT relaxations during the MC sampling: by randomly perturbing the structure slightly before relaxation, the system has a chance of visiting any of those local minima. This is in the spirit of basin-hopping MC.^{89} We plan on implementing and testing such nesting schemes in the near future.
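A nested MC step of the kind referenced above can be sketched as follows. This is a toy illustration under simplifying assumptions: `cheap_energy` stands in for the NNP, `expensive_energy` for DFT, and the state is a single real number rather than a lattice configuration; only the two-level acceptance structure is the point.

```python
import math
import random

def metropolis_accept(delta_e, beta, rng):
    return delta_e <= 0 or rng.random() < math.exp(-beta * delta_e)

def nested_mc_step(state, cheap_energy, expensive_energy, propose, beta, n_inner, rng):
    """One nested MC step: inner walk on the cheap surface, outer correction."""
    trial = state
    for _ in range(n_inner):              # inner Metropolis on the cheap model
        cand = propose(trial, rng)
        if metropolis_accept(cheap_energy(cand) - cheap_energy(trial), beta, rng):
            trial = cand
    # Outer acceptance on the *difference* of (expensive - cheap) energies
    # between the trajectory endpoints; this restores exact sampling of the
    # expensive distribution, with speedup set by the quality of the cheap model.
    dd = ((expensive_energy(trial) - cheap_energy(trial))
          - (expensive_energy(state) - cheap_energy(state)))
    return trial if metropolis_accept(dd, beta, rng) else state

# Toy surfaces: the "expensive" one adds a small correction to the cheap one.
cheap = lambda x: x * x
expensive = lambda x: x * x + 0.1 * math.sin(5.0 * x)
propose = lambda x, rng: x + rng.uniform(-0.5, 0.5)

rng = random.Random(0)
x = 2.0
for _ in range(200):
    x = nested_mc_step(x, cheap, expensive, propose, beta=5.0, n_inner=10, rng=rng)
print(f"final state: {x:.3f}")  # the walker relaxes toward the minimum near 0
```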

## IV. SUMMARY

In this work, we proposed a new method to obtain lightweight models that can be used to sample millions of configurations in the *ab initio* lattice configuration problem: Behler–Parinello NNPs are trained to predict relaxed energies from perfect-lattice structures with configurational disorder, and the NNPs are iteratively refined using an active learning scheme involving several RXMC runs. The idea was tested successfully in calculating the degree of A/B site inversion in three spinel oxides: MgAl_{2}O_{4}, ZnAl_{2}O_{4}, and MgGa_{2}O_{4}. Merits compared to conventional continuous-coordinate NNP models include the ability to bypass structural relaxation, as well as the restriction on the input coordinate space, which makes machine learning much easier. This approach may also be considered a “lazy” alternative to cluster expansion; no explicit cluster selection is necessary, and it also avoids the combinatorial explosion in the amount of computation that makes cluster expansion practically unusable for many-component systems with multiple sublattices. We also demonstrated that the scheme is at least as robust as cluster expansion with respect to the magnitude of local relaxation, and also in treating systems with long-range interactions. Due to the merits outlined above, we believe that this approach will enable configurational sampling in complex bulk or interface systems where it is technically difficult, if not impossible, to apply previous approaches.

## ACKNOWLEDGMENTS

The software used in this work (abICS) was developed, in part, by the “Project for advancement of software usability in materials science” by the Institute for Solid State Physics, the University of Tokyo, and the calculations were performed on the joint-use supercomputer system at the same Institute. This research was supported by the “Program for Promoting Research on the Supercomputer Fugaku” (Fugaku Battery & Fuel Cell Project) and JST CREST Program (Grant No. JPMJCR18J3, Japan). S.K. acknowledges funding from the Japan Society for the Promotion of Science (JSPS) KAKENHI, Grant No. 20H05284 (Grant-in-Aid for Scientific Research on Innovative Areas “Interface IONICS”), and support from the JST FOREST Program (Grant No. JPMJFR2037, Japan). T.O. was funded by JSPS KAKENHI Grant No. 19H05792 (Grant-in-Aid for Scientific Research on Innovative Areas “Crystal Defect Core”) and JSPS KAKENHI Grant No. 21K04648 [Grant-in-Aid for Scientific Research (C)]. Plots of the data were generated using Matplotlib^{90} and seaborn.^{91}

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Shusuke Kasamatsu**: Conceptualization (equal); Data curation (lead); Formal analysis (lead); Funding acquisition (lead); Investigation (lead); Methodology (equal); Project administration (lead); Resources (lead); Software (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (lead). **Yuichi Motoyama**: Software (equal); Writing – original draft (equal); Writing – review & editing (equal). **Kazuyoshi Yoshimi**: Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal). **Ushio Matsumoto**: Investigation (equal); Methodology (equal); Validation (equal); Writing – original draft (equal). **Akihide Kuwabara**: Funding acquisition (equal); Supervision (equal); Writing – original draft (equal). **Takafumi Ogawa**: Conceptualization (equal); Funding acquisition (equal); Methodology (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The dataset of configurations and corresponding energies for MgAl_{2}O_{4}, ZnAl_{2}O_{4}, and MgGa_{2}O_{4} as well as trained models that can be used for energy evaluation using aenet code are available at ISSP Data Repository https://isspns-gitlab.issp.u-tokyo.ac.jp/abics-dev/abics-gallery. Other data that support the findings of this study are available from the corresponding author upon reasonable request.

## REFERENCES

We have 16 coefficients for the RDF and 4 for the ADF, resulting in 20 descriptors, and this is multiplied by 2 since we have separate RDFs/ADFs for composition and structure.

The splitting of samples between higher and lower energies, with few samples in the interval of, e.g., ∼50–75 meV/cation for MgAl_{2}O_{4}, originates from our sampling procedure: we always sampled the first step in the RXMC calculations, which is generated randomly, and most of the replicas relax quickly to lower energy configurations before the next sampling step.

Note that limited composition variation is expected for spinel oxides under normal experimental conditions, so MgAl_{2}O_{4} is used here just as a proof-of-concept.