Inferring energy-composition relationships with Bayesian optimization enhances exploration of inorganic materials

Computational exploration of the compositional spaces of materials can provide guidance for synthetic research and thus accelerate the discovery of novel materials. Most approaches employ high-throughput sampling and focus on reducing the time for energy evaluation for individual compositions, often at the cost of accuracy. Here, we present an alternative approach focusing on effective sampling of the compositional space. The learning algorithm PhaseBO optimizes the stoichiometry of the potential target material while improving the probability of and accelerating its discovery without compromising the accuracy of energy evaluation.


Main text
A fundamental challenge in materials science is establishing the relationships between the materials' compositions and their synthetic accessibility.For any set of chemical elements (phase field), only a small proportion of viable compositions will afford experimentally accessible phases [1].There is a global effort to accelerate materials discovery; most approaches focus on reducing the cost of assessment of candidate compositions in high-throughput screening of the phase fields [2][3][4][5][6][7][8].In this Letter, we investigate an alternative approach: can we represent the energy landscape of a phase field as a function of composition (i.e., stoichiometry) and thus accelerate the search for accessible compositions via optimization of such a function?This hypothesis is motivated by theoretical and experimental observations, in which synthetically accessible phases were realized in the vicinity of computed low-energy compositions with similar stoichiometry [9][10][11][12][13][14].Here, we present the learning algorithm PhaseBO that approximates the energy landscape with a simple function and, by selective sampling of the phase field, iteratively improves energy approximation and discovers energy minima.
Computationally, the synthetic accessibility of a composition can be approximated by energy differences with the phases reported in the phase field -the convex hull [9].There is an increasing number of methods, including density-functional theory (DFT)-based [15][16][17], interatomic force fields [18][19][20][21], and machine learning models [22][23][24][25][26] that aim to improve the accuracy and speed of energy estimation.Most of these approaches use the crystal structure as input; thus, crystal structure prediction (CSP) of a new composition is the most intensive part of its energy evaluation, making exhaustive sampling of a phase field computationally intractable.Composition-based approaches on the other hand offer less reliable energy estimates in comparison to the DFT methods, disfavouring their incorporation into exploratory experimental workflows [27].
In addition to the uncertainties in energy estimation, high-throughput screening methods inherently introduce discretization errors by sampling the phase field.This raises an important question of uncertainty in the approximation of the energy landscape in a compositional space.To address this question, the learning process can incorporate the previous assessments of energy, while taking the uncertainties into account; from the statistical viewpoint, the exploration process should make posterior inference possible, i.e., it should learn to produce the full distribution of possible energies for every point in a continuous compositional space.
In this work, we demonstrate that the energy profile of the compositional space can be approximated as a function of stoichiometry and that this functional dependency can be effectively exploited to accelerate its exploration.Namely, the search for the stable phasea point on the convex hull -can be approached as an global optimization problem.For this, the energy profile of a compositional space can be approximated with a Gaussian process (GP) and optimized via Bayesian optimization (BO) [28,29].BO has been proven effective in the exploration of costly-to-evaluate functions and has been increasingly used in materials science to design experiments and optimize sampling [30][31][32][33][34][35][36][37][38][39][40][41][42][43][44][45][46].
Here by employing BO with GP, incorporating previous assessments of energy of compositions into the learning process and taking the uncertainties into account, we enable posterior inference and uncertainty quantification for the whole compositional space, including chemical formulae that are impossible to reliably assess with CSP due to combinatorial complexity.We demonstrate the efficacy of this approach in two examples of previously studied combinatorial spaces Li-Sn-S-Cl [10] and Y-Sr-Ti-O [47], where PhaseBO discovers the experimentally stable and the lowest-energy compositions more consistently and with a higher success probability in comparison to conventional random sampling (RS) and two-step grid search.We also illustrate the capability of the approach to study previously unexplored multi-dimensional compositional spaces, in which high-throughput screening would be extremely costly.In the example of the unexplored Li-Mg-P-Cl-Br phase field, we identify 6 likely candidates for synthetically accessible phases by evaluating only 38 compositions.With demonstrated efficiency in three different solid-state inorganic chemistries, PhaseBO offers routes towards significant acceleration and automation of computationally-driven materials discovery without compromising the accuracy of energy evaluation.
Bayesian optimization represents a class of machine learning methods aiming to find the global optimum in the problem: where the analytical form of the objective function  in -dimensional space is unknown and its evaluation for any point  is expensive.Here, we represent the search for stable compositions in the materials phase fields as optimization defined in (1).In this formulation, we search for the energy minima above a convex hull, presented as a function of stoichiometry in -element space, in which feasible values of  are stoichiometric coefficients that form a -dimensional simplex { ∈ ℝ  : ∑    = 1}.To find the minima, we employ BO, illustrated in the example quaternary phase field in Figure 1. 2 ,  ′′  3 ,  ′′′  4 ) that we will denote as (),  = ( 1 ,  2 ,  3 ,  4 ), which is unknown and modelled as  ̃ with Gaussian processes (GP) with prior probability distribution ( ̃).b   = (  ) energy above the convex hull is calculated for  reference compositions in the phase field (red triangles).c The posterior probability distribution ( ̃ |   ,   ) is updated, its mean represents the model energy function  ̃ that best fits the data {  ,   } =1  within this GP.d From the posterior, an acquisition function () is built, and its optima suggest compositions   for the next evaluation of (  ) at stage e.Stages c-e are repeated while the computational budget allows (number of CSP-studied compositions  is less than a set budget number ).
The search for stable compositions and the corresponding minima of energy above the convex hull in the compositional space, e.g., the quaternary  −  ′ −  ′′ − ′′′, starts with the approximation of the unknown objective energy function of stoichiometric coefficients (  1 ,  ′  2 ,  ′′  3 ,  ′′′  4 ), which we will denote as (),  = ( 1 ,  2 ,  3 ,  4 ) with the GP prior ( ̃) (Figure 1a): where the prior is a normal probability distribution of functions  ̃(),  is the mean of the distribution , (,  ′ ) is a covariance function, for which we use a Matérn 5/2 kernel [48] (Supplementary Information),  is a unit matrix and the Gaussian noise   2 with variance   reflects imprecision in energy evaluations.
In BO, information about the objective function is obtained iteratively, following sampling of compositions in the phase field.We start with  reference compositions   of a phase field, e.g., in Li-Sn-S-Cl, the references include all reported materials in the phase field (e.g., Li, Sn, S, etc., Supplementary Table 1), whose structures are obtained from databases such as [49], and we evaluate their energy   = (  ) with DFT (Figure 1 b), that updates the posterior according to Bayes' theorem [28] (Figure 1 c): where the denominator (  |  ) is a normalizing constant, optimized with the GP hyperparameters (Supplementary Information).
The posterior mean represents the model energy function  ̃ that best fits the data in this GP; it can be used to predict energy for unexplored .From the posterior, the acquisition function () is constructed (Figure 1 d), which upon optimization, e.g., with gradient-descent, suggests the next composition for evaluation.
() can be derived in different forms [50] to incorporate a strategy for posterior exploration and exploitation: we employ expected improvement (EI) for sequential and Thompson sampling (TS) [51][52][53] for batch sampling and compare their performance (Supplementary Information).The posterior update, composition selection and evaluation (Figure 1c -Figure 1e) are repeated until the stopping criteria, such as stable composition discovery or a number of CSP evaluations , defined by a computational budget are satisfied.
To validate this approach, we compare the exploration of a compositional space with conventional methods (random sampling (RS), grid search) and PhaseBO on the examples of two different chemistries, the quaternaries Li-Sn-S-Cl and Y-Sr-Ti-O, which were previously studied with CSP [10,47] (Supplementary Fig. 1).Several compositions on the convex hull have been identified in the Li + -Sn 4+ -S 2--Cl -phase field with specified charges, and a new compound (Li3.3SnS3Cl0.7)was discovered experimentally [10] in the vicinity of one of these (Li3SnS3Cl) ; the lowest-energy composition identified in the Y 3+ -Sr 2+ -Ti 4+ -O 2-phase field is 57 meV/atom above the convex hull.In RS, the compositions for evaluation with CSP are selected at random: the sequential choices among accessible compositions are not related.An improved strategy could incorporate results from the previous evaluations, e.g., in a 2-step grid search, where a coarse grid (centroids of clustered compositions) identifies lower energy regions, followed by dense-grid examination of those areas (Supplementary Fig. 2).However, the success of such a strategy depends strongly on the discretization, i.e., on whether a grid dissects a phase field in the vicinity of the low energy compositions.PhaseBO uses previous evaluations to model energy-composition relationships.In Figure 2a, the Li + -Sn 4+ -S 2--Cl - phase field is represented in 2 dimensions: ′ =  1 /( 1 +  2 ), ′=  3 /( 3 +  4 ) from the combinations of the stoichiometric coefficients  1 ,  2 ,  3 ,  4 of constituent elements.In (2 3 −  1 −  2 ), with  4 derived from charge balance.Starting from the reference compositions, for which energies are calculated (Supplementary Table 1-2), the posterior is calculated, and at each iteration, argmax () suggests further compositions for evaluation.
The posterior and uncertainties associated with the GP variance (Supplementary Fig. 3) are updated and the process is repeated until the global minimum is found.The mean posterior can be used to predict the energies for the compositions that cannot or have not yet been assessed with CSP and DFT.In Figure 2, we use the mean posterior to predict the for 49820 and 25736 compositions in Li-Sn-S-Cl and Y-Sr-Ti-O, respectively.Importantly, the main features of the energy trends can be reproduced by fitting the energy to only a fraction of all CSP-studied compositions (Supplementary Fig. 4), therefore rapidly highlighting compositional regions of low energy.Thus, the compositional region of the experimentally realized Li3.3SnS3Cl0.7 (lime star in Figure 2) can be discovered, on average, in CSP evaluation for only 40 candidates (Figure 3a, inset).
In Figure 3  In the insets of Figure 3, we compare the average number of evaluations,  ̅ * , required for discovery.PhaseBO requires an average of 40 evaluations to discover the ground state composition in Li-Sn-S-Cl , and on average 57 evaluations in Y-Sr-Ti-O -far fewer evaluations compared to both conventional approaches.Performance of PhaseBO with different acquisition functions, EI presented in Figure3 and TS, is also compared in Supplementary Fig. 2, where the advantage of EI is likely due to the larger number of iterations (GP regressions, posterior updates) performed sequentially with EI.This suggests an optimal strategy for the simultaneous exploration of multiple chemistries: via parallelisation over compositional spaces and sequential exploration of individual phase fields.For an unexplored individual compositional space, deceleration of PhaseBO with TS in comparison to EI can be offset against acceleration due to parallel CSP runs, as undertaken below.
We illustrate application of PhaseBO to study an uncharted compositional space, on example of the unexplored Li-Mg-P-Cl-Br phase field, chosen based on synthetic accessibility [10,54] and ionic-conductivity classification [55].This phase field does not contain any reported quinary phases [49], and makes conventional sampling challenging due to the large combinatorial space.Starting from 8 random compositions, we use USPEX [56] and VASP [57] to generate structures and compare their energies to reference compositions reported in the phase field, constructing a 3-dimensional convex hull.Sampling the posterior with TS of 4 compositions at each iteration, PhaseBO finds 6 compositions with energy above the convex hull < 25 meV/atom in 8 iterations (Figure 4).The lowest-energy compositions found are: LiMg2PCl10 (4 meV/atom) and LiMg2PCl9Br (6 meV/atom), which share a similar predicted structure (Supplementary Fig. 5).The full list of the discovered compositions, energies and reference materials are listed in Supplementary Table 1.The sampled energies above the convex hull calculated with CSP (Supplementary Fig. 6) after 8 PhaseBO iterations can suggest promising compositional regions where stable materials may be found.
As demonstrated for 3 CSP methods in 3 different chemistries, PhaseBO coupled with CSP offers a versatile approach to overcome challenges for accurate exploration of compositional spaces at scale, such as intractable computational costs for dense sampling and evaluation of large supercells for modelling disorder.PhaseBO enables rapid identification of low-energy compositional regions, predictions for challenging compositions, and evaluation of uncertainty.It increases the probability of discovery compared to conventional sampling approaches and discovers more potentially attractive candidates more rapidly.It can be applied to other fields of materials discovery, including combinatorial chemistry, where success of the synthesis can be judged by the strength of the diffraction signals.More fundamentally, the successful optimization of energy landscapes suggests a functional relationship between energy and composition, motivating its further examination.

Figure 1 .
Figure 1.Schematics of the search for stable compositions -minima of energy E -in a compositional field  −  ′ −  ′′ − ′′′ with Bayesian optimization.a The true dependence of energy above the

Figure 2 .
Figure 2. Exploration of phase fields.a Li + -Sn4+ -S 2--Cl -, compositions   = Li  1 Sn  2 S  3 Cl  4 presented in 2 dimensions (main text), markers correspond to the total dataset,  = 195 compositions with maximum 16 atoms per unit cell, for which energies were calculated with CSP and DFT[10]; the ground state composition Li3SnS3Cl, is marked with a cyan star.The mean posterior calculated with PhaseBO predicts the energy for 49820 unsampled compositions (background colour), including the synthesized Li3.3SnS3Cl0.7 (lime star).b Y 3+ -Sr 2+ -Ti 4+ -O 2-is represented in triangular coordinates (main text), markers correspond to the total dataset,  = 145 composition with maximum 35 atoms per unit cell, for which energies were calculated with CSP and force field approach in[47].The mean PhaseBO posterior predicts the energy for 25736 unsampled compositions (background).

Figure 2b ,
Figure 2b, the Y 3+ -Sr 2+ -Ti 4+ -O 2-phase field is represented in 2 dimensions: ′ = 1 2 ( 2 −  1 ), , we compare the performance of RS, 2-step search and PhaseBO in 300 search experiments.In RS, the probability of randomly discovering a composition depends linearly on the number of accessible compositions, as expected (red dashed line).PhaseBO outperforms conventional approaches in discovering the global minima in Li-Sn-S-Cl and Y-Sr-Ti-O.PhaseBO achieves 100% success of discovery with fewer evaluations, requiring only 50% or less of all compositions.

Figure 3 .
Figure 3. Performance of conventional and PhaseBO methods for exploration of phase fields.Legend: Random search (black markers for simulations and dashed line for the trend calculated analytically); 2-step clustering search (blue); PhaseBO (green).a Probability of discovering the stable composition (Li3SnS3Cl) among 195 compositions in Li-Sn-S-Cl is higher with PhaseBO, and reaches 100% after 82 compositions are evaluated (cf 170 for 2-step and 195 randomly).b Probability of discovering the lowest-energy composition among 145 compositions in Y-Sr-Ti-O, where PhaseBO reaches 100% after 72 compositions are evaluated (cf 138 for 2-step and 145 randomly).Insets: With PhaseBO, the average number of evaluations plateaus at  ̅ * = 40 compositions for Li-Sn-S-Cl and at 57 compositions for Y-Sr-Ti-O.The shaded areas correspond to standard deviations in 300 experiments.