Discovering novel chemicals and materials can be greatly accelerated by iterative machine learning-informed proposal of candidates—active learning. However, standard *global error* metrics for model quality are not predictive of discovery performance and can be misleading. We introduce the notion of *Pareto shell error* to help judge the suitability of a model for proposing candidates. Furthermore, through synthetic cases, an experimental thermoelectric dataset, and a computational organic molecule dataset, we probe the relation between acquisition function fidelity and active learning performance. Results suggest novel diagnostic tools, as well as new insights for acquisition function design.

## I. INTRODUCTION

Accelerated design, optimization, and tuning of chemicals and materials via machine learning is receiving increasing interest in science and industry. A major driver of this interest is the potential to reduce the substantial cost and effort involved in manual development, synthesis, and characterization of large numbers of candidates. The primary aim is to reduce the number of both failed candidates and development cycles.

A data-driven approach to achieve this acceleration is *active learning* (AL),^{1} an iterative procedure in which a machine-learning model suggests candidates, a selection of which are synthesized, characterized, and fed back into the model to complete a learning iteration. The objective of this procedure varies; in chem- and materials-informatics, the objective is often to identify promising candidates by optimizing properties of interest. Recent work has leveraged AL for candidate discovery when there is a *single objective*.^{2–17} However, new issues arise when optimizing multiple objectives simultaneously, as is frequently the case beyond proof-of-principle settings.

Furthermore, the aim to identify promising candidates is distinct from the aim to improve model accuracy, a frequent objective in AL.^{18} In fact, model accuracy can be at odds with acquiring optimal candidates: Figure 1 demonstrates that the usual global notion of model accuracy is not necessarily associated with optimal chemicals and materials discovery. In this work, we introduce a notion of model accuracy more closely associated with rapidly discovering new candidates.

We offer two primary contributions: First, we introduce novel concepts to judge the performance of multi-objective AL, with an aim toward the specific concerns of candidate discovery. In addition to the usual notion of multi-objective optimality^{19}—non-dominance—we use criteria informed by the concept of *strata*—recursive non-dominance—from the database literature.^{20} We also introduce *scoped* error metrics, which emphasize regions of interest in performance (property) space, particularly bands about the Pareto frontier we call *Pareto shells*. We demonstrate that the usual global error provides less usable signal for candidate discovery performance, while Pareto shell error—under certain conditions—correctly signals when AL will likely identify performant candidates. Second, we compare multi-objective acquisition functions (also known as *improvement criteria*) in terms of their quantitative and qualitative performance. Specifically, we compare a collection of exploitation/exploration-navigating acquisition functions: Probability of Joint Exceedance (PJE), Hyperplane Probability of Improvement (HPI), and Probability Non-Dominated (PND). These acquisition functions differ in terms of the fidelity with which they represent the Pareto frontier, allowing us to study how this affects AL performance.

In this work, we review AL for single-objective candidate discovery and relevant concepts from multi-objective optimization and synthesize scientifically relevant concepts for judging AL performance. We introduce a family of acquisition functions (decision-making strategies) and compare their performance across several synthetic datasets and a real thermoelectric dataset. Our results illustrate how new concepts can help articulate the difference between AL strategies and how model accuracy does—and does not—relate to AL performance in discovering novel candidates.

## II. ACTIVE LEARNING FOR SINGLE-OBJECTIVE CANDIDATE DISCOVERY

AL is a specialized problem setting in machine learning related to optimal experimental design.^{1,11,21} In the chemical and materials science context, consider a description $\mathbf{x}$ of a candidate with corresponding observed property of interest $y$, thought to be linked by an unknown function $f: \mathbf{x} \mapsto y$ that is expensive to evaluate, for example, by synthesizing and characterizing a candidate. To systematically identify novel candidates $\mathbf{x}$ with desirable changes in $y$, a statistical (machine-learning) model $\hat{f}$ is built that predicts the unknown function $f$. Trained on an initial set of characterized candidates $\tilde{X}_0 = \{x_1, \dots, x_{n_0}\}$ with measured properties $\tilde{Y}_0 = \{y_1, \dots, y_{n_0}\}$, the initial model $\hat{f}_0$ is used to select new candidates $x_{n_0+1}, \dots, x_{n_0+n_1}$. These are characterized and added to the dataset, $\tilde{X}_1 = \{x_1, \dots, x_{n_0+n_1}\}$, which is then used to train a new, improved model $\hat{f}_1$. This results in a sequence of datasets $\tilde{X}_k, \tilde{Y}_k$ for $k = 0, \dots, K$. While this cycle can, in principle, be fully automated, new candidates are usually selected by domain experts based on AL suggestions ("human-in-the-loop").

Various objectives can drive the design of AL strategies; for instance, a common objective in general machine learning is to improve the model $\hat{f}$. In informatics, the objective is often to identify an improved material $\mathbf{x}$ using as few physical experiments as possible.^{6} We call the design space of experimentally accessible, that is, synthesizable and measurable, candidates the *global scope* and call the error computed on this scope the *global error*.

For a *single objective*, measuring the relative performance of candidates is straightforward (Fig. 2) as they can be ranked unambiguously based on their scalar performance, for example, the strength-to-weight ratio for structural alloys. One way to measure the performance of AL is to retrospectively simulate AL on a known dataset and count the number of AL iterations necessary to reach a known top candidate.^{6} A key decision in designing an AL method is the choice of *acquisition function*. We will discuss these in greater detail in Sec. III; briefly, an acquisition function is a decision rule used to rank potential candidates in an AL context. Such criteria navigate the “exploration-exploitation” trade-off:^{22} The algorithm should seek improved candidates but should also try “risky” candidates to improve its model and enable later discoveries.^{18} Many improvement criteria for the single-objective case have been proposed and tested in the literature.^{2,6,14–16,23} The multi-objective setting, however, requires a more nuanced understanding of candidate ranking.

## III. METHODS

### A. Multi-objective optimization background

In multi-objective optimization, the categories of “greater” and “lesser” are insufficient (Fig. 2). Since two candidates can compete along multiple axes, it is possible for them to be mutually *non-dominated*. Multiple objectives occur naturally in chemical and materials science problems and are often unavoidable, e.g., the strength-toughness trade-off, which arises from fundamental, competing effects.^{24} *In lieu* of more preference information, one must navigate the resulting multi-objective trade-off space. Having access to more candidates in this trade-off space enables more complete scientific understanding and more informed engineering decisions. In this section, we review concepts from multi-objective optimization and introduce acquisition functions for the multi-objective setting.

#### 1. Dominance and strata

In the single-objective case, candidates $y, y' \in \mathbb{R}$ can be unambiguously ranked: A candidate $y$ is either lesser ($y < y'$), greater ($y > y'$), or equal ($y = y'$) to another candidate $y'$. However, the introduction of multiple objectives $\mathbf{y}, \mathbf{y}' \in \mathbb{R}^D$ introduces new complexities. Figure 2 illustrates the universe of possible comparisons; in addition to lesser, greater, and equal, in the multi-objective setting, there exist *non-dominated* points^{19} (Chap. 12). *Dominance* (in the Pareto sense) is a pairwise relationship; the candidate $\mathbf{y}$ is said to *dominate* $\mathbf{y}'$ if

$$y_d \geq y'_d \ \text{for all}\ d = 1, \dots, D, \quad \text{and} \quad y_d > y'_d \ \text{for at least one}\ d. \tag{1}$$

We use the notation $\mathbf{y}' \prec \mathbf{y}$ to denote that $\mathbf{y}$ dominates $\mathbf{y}'$. Note that the definition above pre-supposes that optimization is posed in terms of maximization of all objectives. This is not a limitation: if minimization is desired for a given axis $y_d$, then one must simply reverse the relevant inequalities in definition (1). Furthermore, proximity to a desired value can be encoded as minimization of the absolute distance from the target value, e.g., a bandgap as close as possible to 1.2 eV.
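For concreteness, definition (1) can be sketched in a few lines of Python (the function names here are illustrative, not from a particular library):

```python
def dominates(y, y_prime):
    """Return True if candidate y Pareto-dominates y_prime.

    All objectives are assumed maximized, as in definition (1): y must be
    at least as good on every axis and strictly better on at least one.
    For a minimized objective, negate that axis before comparison.
    """
    at_least_as_good = all(a >= b for a, b in zip(y, y_prime))
    strictly_better = any(a > b for a, b in zip(y, y_prime))
    return at_least_as_good and strictly_better


def bandgap_objective(gap, target=1.2):
    """Encode 'bandgap near 1.2 eV' as a maximization objective:
    the negative absolute distance to the target."""
    return -abs(gap - target)
```

For example, `dominates((2, 2), (1, 1))` holds, while `(1, 2)` and `(2, 1)` are mutually non-dominated.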

In the multi-objective setting, the possibility of non-dominance implies that a single "best" multi-objective output value $\mathbf{y}$ may not exist. Instead, there may be a set of "best" values. Given a set of candidates $A \subseteq \mathbb{R}^D$, the set of non-dominated points is called the *Pareto frontier*, defined by

$$P(A) = \{\mathbf{y} \in A \mid \text{there is no } \mathbf{y}' \in A \text{ such that } \mathbf{y} \prec \mathbf{y}'\}.$$

The Pareto frontier $P(A)$ represents the set of trade-offs one must navigate in choosing an optimization candidate. Further selection must be made through external considerations: possibly a prioritization of the objectives or through harder-to-quantify concerns such as the corrosion resistance of a material^{25} (Chap. 5).

While the Pareto frontier is an important set of candidates, points outside the frontier are not without utility, particularly if aforementioned external concerns exist. Candidates near the frontier can be useful as training data for a machine learning model, while measurement or model uncertainties may lead to the false classification of a point as dominated. To describe points near the Pareto frontier, we use the notion of *strata*.

The *strata* are defined via a recursive relationship.^{20} Let $A$ be a set of candidates as above and define the $s$-th stratum $S_s$ via

$$S_1 = P(A), \qquad S_s = P\Big(A \setminus \bigcup_{i=1}^{s-1} S_i\Big) \ \text{for}\ s > 1.$$

Figure 3 illustrates a few strata on a thermoelectric dataset (introduced in Subsection IV A). Note that by definition, we have $S_i \cap S_j = \emptyset$ if $i \neq j$. This allows us to define a *stratum number* for each point $\mathbf{y} \in A$ via

$$s(\mathbf{y}) = s \ \text{such that}\ \mathbf{y} \in S_s.$$

The candidates along the Pareto frontier then have $s(\mathbf{y}) = 1$, while points with a larger stratum number lie further from it. We will use the stratum number to rank the performance of AL with greater resolution than counting frontier points alone.

We also define the *Pareto s-shell* via

$$P_s = \bigcup_{i=1}^{s} S_i.$$

This definition allows one to select a "band" of points along and near the Pareto frontier. Below, we will use the Pareto shell $P_s$ as a targeted scope for computing model accuracy. We use the nomenclature *s*-shell to denote $P_s$; Figure 3 illustrates a Pareto 2-shell for the thermoelectric dataset.
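The recursive definitions above translate directly into code. The following sketch (helper names ours) peels off successive Pareto frontiers to assign stratum numbers and to build an *s*-shell:

```python
def dominates(y, y_prime):
    # Maximization convention: at least as good everywhere, better somewhere.
    return (all(a >= b for a, b in zip(y, y_prime))
            and any(a > b for a, b in zip(y, y_prime)))

def pareto_frontier(points):
    """Non-dominated subset of `points` (a list of tuples)."""
    return [y for y in points
            if not any(dominates(z, y) for z in points)]

def stratum_numbers(points):
    """Map each point to its stratum number s(y); frontier points get 1."""
    remaining = list(points)
    strata = {}
    s = 1
    while remaining:
        front = pareto_frontier(remaining)
        for y in front:
            strata[y] = s
        # Remove the current stratum and recurse on what is left.
        remaining = [y for y in remaining if y not in front]
        s += 1
    return strata

def pareto_shell(points, s):
    """Pareto s-shell: the union of strata 1 through s."""
    strata = stratum_numbers(points)
    return [y for y in points if strata[y] <= s]
```

For the toy set `[(1, 3), (3, 1), (2, 2), (1, 1), (0, 0)]`, the first three points form stratum 1, `(1, 1)` falls in stratum 2, and `(0, 0)` in stratum 3, so the 2-shell contains the first four points.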

### B. Proposed performance measures

#### 1. Candidate improvement performance

Prior work assessing AL in informatics has focused primarily on the number of Pareto frontier points acquired during AL.^{6,26} While relevant, this "all or nothing" measurement of success provides no accounting for candidates near the Pareto frontier, which may still be of scientific interest. This is particularly troublesome when studying datasets that have very few Pareto frontier points, such as the "sparse" frontier we will consider below. To provide a measure with more granularity, we consider the *mean stratum number* of each AL iterate $\tilde{Y}_k$, using the ground-truth strata from the full dataset.

#### 2. Model accuracy performance

In addition to the candidate improvement performance of AL, we also consider trajectories of model performance. We use the notion of *non-dimensional error* (NDE) across each of the output quantities; given a set of true responses $\{y_i\}_{i=1}^{n} = Y$ and estimated responses $\{\hat{y}_i\}_{i=1}^{n} = \hat{Y}$, we compute the NDE for output $d$ via

$$\mathrm{NDE}_d = \frac{\sum_{i=1}^{n} (y_{i,d} - \hat{y}_{i,d})^2}{\sum_{i=1}^{n} (y_{i,d} - \bar{y}_d)^2},$$

where $\bar{y}_d = \frac{1}{n}\sum_{i=1}^{n} y_{i,d}$ is the sample mean. If we have $D$ output quantities, then there are $D$ NDE values to compute. Note that the NDE is closely related to the *coefficient of determination* $R^2$ via the expression $\mathrm{NDE} = 1 - R^2$.^{27} Since the NDE is dimensionless, we may safely average NDE values across the $D$ outputs to compute a *mean non-dimensional error* (MNDE), given by

$$\mathrm{MNDE} = \frac{1}{D} \sum_{d=1}^{D} \mathrm{NDE}_d. \tag{6}$$

Given the retrospective nature of our test cases, the global error is naturally evaluated by (6) across the entire dataset. However, this *global scope* includes regions of output space that are not emphasized in our frontier-seeking context. To compute error metrics with *targeted scope*, we use the notion of a Pareto shell to compute error on the relevant portions of the output domain (Fig. 3). This notion of scope is closely related to the concept of a *domain of applicability* (input space), with the contrast that the scope is defined with respect to the range (output space).^{28}
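Under the relation $\mathrm{NDE} = 1 - R^2$, these error metrics can be sketched as follows (helper names ours); evaluating the same function on only the indices belonging to a Pareto shell yields the targeted-scope error:

```python
def nde(y_true, y_pred):
    """Non-dimensional error for one output: the residual sum of squares
    normalized by the total sum of squares, i.e., NDE = 1 - R^2."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return ss_res / ss_tot

def mnde(Y_true, Y_pred):
    """Mean NDE across the D outputs (valid because each NDE is
    dimensionless). Y_true, Y_pred: lists of per-output value lists."""
    return sum(nde(t, p) for t, p in zip(Y_true, Y_pred)) / len(Y_true)

def scoped_mnde(Y_true, Y_pred, in_scope):
    """MNDE restricted to an index set, e.g., points in a Pareto s-shell."""
    Y_t = [[col[i] for i in in_scope] for col in Y_true]
    Y_p = [[col[i] for i in in_scope] for col in Y_pred]
    return mnde(Y_t, Y_p)
```

A perfect predictor gives NDE = 0, while predicting the sample mean everywhere gives NDE = 1, matching the $1 - R^2$ interpretation.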

### C. Acquisition functions for benchmarking

To assess the performance measures above, we must perform active learning with some form of decision criterion that summarizes available multi-objective property predictions and automatically ranks candidates. To this end, we define an *acquisition function* $f_a(x; M)$ as any function that is used to select an optimal candidate via

$$x^{*} = \operatorname*{argmax}_{x \in \tilde{X}^C} f_a(x; M),$$

where $\tilde{X}^C$ is the complement of the training set and $M$ is a trained machine learning model, which returns both a prediction $\hat{y}$ and a predictive distribution $Y \sim \hat{D}$, both of which are available to the acquisition function $f_a$.

Many generalizations of single-objective acquisition functions are available from the literature, including the probability of improvement and expected improvement criteria,^{29} the max–min criterion,^{30} and the expected hyper-volume improvement (EHVI) criterion.^{26} Comparing these and other criteria is outside the scope of this work; instead, we are focused on (i) how the fidelity with which the Pareto frontier is represented relates to AL performance and (ii) how optimal candidate selection is (or is not) related to model accuracy. Below, we introduce the acquisition functions to be studied in this work after a remark on dimensional homogeneity.

#### 1. Importance of dimensional homogeneity

Before introducing specific acquisition functions, we first make a remark on the importance of *dimensional homogeneity*. In order to measure the performance of multi-objective AL, we may wish to measure the “distance” of candidates to the Pareto frontier. However, we will see here that a naive notion of distance is problematic as it does not respect dimensional homogeneity—the ranking results are not independent of the analyst’s choice of the unit system.

To illustrate, let $\mathbf{y}, \mathbf{y}' \in \mathbb{R}^D$, where the $y_d$ potentially have different physical units. The ordinary notion of distance is given by

$$\|\mathbf{y} - \mathbf{y}'\|_p = \Big(\sum_{d=1}^{D} |y_d - y'_d|^p\Big)^{1/p}.$$

Note that for all $p \in (0, +\infty)$, this expression involves the addition of terms of potentially different physical units. This expression violates dimensional homogeneity, which introduces an artificial dependence on the chosen unit system. Thus, ranks computed from such distances are not necessarily stable under a change of the unit system; we provide an example of this pathology in the supplementary material. This illustrates that arbitrary choices can drastically affect decisions based on this sort of distance computation. To overcome this issue, we only consider acquisition functions that respect dimensional homogeneity.
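This pathology is easy to demonstrate numerically: rescaling one axis (say, from volts to millivolts) can swap distance-based ranks while leaving all dominance relations untouched. A minimal illustration with made-up numbers:

```python
def euclidean(y, y_prime):
    # Ordinary p = 2 distance; adds terms with different physical units.
    return sum((a - b) ** 2 for a, b in zip(y, y_prime)) ** 0.5

# A reference point and two candidates; axis 0 carries units of volts.
ref = (1.0, 1.0)
a = (1.3, 1.0)   # 0.3 V away on axis 0
b = (1.0, 1.4)   # 0.4 away on axis 1

closer_in_volts = euclidean(a, ref) < euclidean(b, ref)

# Express axis 0 in millivolts instead: multiply axis-0 coordinates by 1000.
to_mv = lambda y: (y[0] * 1000.0, y[1])
closer_in_mv = euclidean(to_mv(a), to_mv(ref)) < euclidean(to_mv(b), to_mv(ref))

# closer_in_volts is True while closer_in_mv is False: the rank of the two
# candidates depends on the analyst's choice of unit system.
```

Dimensionless criteria such as probabilities avoid this instability entirely, which motivates the acquisition functions studied below.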

#### 2. Uncertainty sampling

On the scale from exploitation to exploration, *uncertainty sampling* leans heavily toward the latter; one chooses candidates based on where the model is most uncertain. In the scalar case, this is easily accomplished by choosing the candidate with the greatest predictive variance $\hat{\sigma}_i^2$.^{18} This approach does not immediately generalize to the multi-objective case, as the component variances $\hat{\sigma}_{i,d}^2 = \hat{\Sigma}_{i,dd}$ do not necessarily have the same units. To respect dimensional homogeneity, we generalize uncertainty sampling by considering a sum of dimensionless quantities.

*Sum of Coefficients of Variation* (SCV): The SCV is defined via

$$\mathrm{SCV}_i = \sum_{d=1}^{D} \mathrm{COV}_{i,d},$$

where the $\mathrm{COV}_{i,d} = \hat{\sigma}_{i,d} / |\hat{\mu}_{i,d}|$ are coefficients of variation and $\hat{\mu}_{i,d}, \hat{\sigma}_{i,d}$ are the $d = 1, \dots, D$ components of the mean and standard deviation of the predictive distribution $\hat{D}_i$. Note that this definition is problematic if any of the $\hat{\mu}_{i,d}$ are exactly zero. However, the coefficients of variation $\mathrm{COV}_{i,d}$ are dimensionless quantities with a normalizing scale $\hat{\mu}_{i,d}$ set not by the user, but rather by the available data.
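With per-output predictive means and standard deviations in hand, the SCV is a one-liner. A sketch (function names ours), with candidate selection by maximum SCV as in the MSCV strategy of Table I:

```python
def scv(mu, sigma):
    """Sum of coefficients of variation for one candidate.

    mu, sigma: per-output predictive means and standard deviations.
    Each term sigma_d / |mu_d| is dimensionless, so the sum respects
    dimensional homogeneity. Undefined if any mu_d is exactly zero.
    """
    return sum(s / abs(m) for m, s in zip(mu, sigma))

def pick_most_uncertain(mus, sigmas):
    """MSCV acquisition: index of the candidate with the largest SCV."""
    scores = [scv(m, s) for m, s in zip(mus, sigmas)]
    return max(range(len(scores)), key=scores.__getitem__)
```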

#### 3. Frontier modeling strategies

Here, we introduce a family of acquisition functions that seek improvement over an existing Pareto frontier, in the order of increasing fidelity with which they model the Pareto frontier. Each of these strategies is a form of probability statement; by construction, these quantities respect dimensional homogeneity.

*Probability of Joint Exceedance* (PJE): The PJE is defined via

$$\mathrm{PJE}_i = \mathbb{P}\Big[\bigcap_{d=1}^{D} \big\{Y_{i,d} > \max_{\tilde{\mathbf{y}} \in \tilde{Y}} \tilde{y}_d\big\}\Big], \tag{11}$$

where $Y_i \sim \hat{D}_i$ is the random variable which follows the predictive distribution $\hat{D}_i$ for candidate $i$. In words, definition (11) is the probability that candidate $Y_i$ will exceed the performance observed in our existing training data $\tilde{Y}$ along every axis of comparison. This is a very "aggressive" acquisition function that ignores much of the structure of the Pareto frontier. Figure 4 illustrates the PJE, alongside the other frontier modeling criteria (Fig. 5).
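Under an assumption of independent Gaussian predictive marginals (a simplification on our part; the model in general returns an arbitrary predictive distribution), the PJE factorizes into a product of per-axis tail probabilities:

```python
import math

def gauss_tail(x, mu, sigma):
    """P[N(mu, sigma^2) > x], via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

def pje(mu, sigma, Y_train):
    """Probability of Joint Exceedance for one candidate.

    mu, sigma: per-output predictive mean/std (independence assumed).
    Y_train: observed outputs (tuples). The candidate must exceed the
    best observed value on every axis simultaneously.
    """
    D = len(mu)
    best = [max(y[d] for y in Y_train) for d in range(D)]
    p = 1.0
    for m, s, b in zip(mu, sigma, best):
        p *= gauss_tail(b, m, s)
    return p
```

For example, a candidate whose predictive mean sits exactly at the per-axis maxima of the training data has PJE $= 0.5^D$, reflecting how demanding joint exceedance is as the number of objectives grows.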

*Hyperplane Probability of Improvement* (HPI): The HPI is defined in terms of a hyperplane fit of the Pareto frontier in the available training data. Modeling the Pareto frontier as a hyperplane requires fitting a normal direction $\hat{\mathbf{w}} \in \mathbb{R}^D$ and an offset $\hat{b}$. Once these are fit, the HPI is defined in terms of the appropriate Z-score via

$$\mathrm{HPI}_i = \mathbb{P}\big[\hat{\mathbf{w}}^\top Y_i > \hat{b}\big] = \Phi\!\left(\frac{\hat{\mathbf{w}}^\top \hat{\boldsymbol{\mu}}_i - \hat{b}}{\sqrt{\hat{\mathbf{w}}^\top \hat{\Sigma}_i \hat{\mathbf{w}}}}\right),$$

where $\Phi$ is the standard normal cumulative distribution function. In words, the HPI is the probability that a candidate $Y_i$ will present improvement over the hyperplane fit of the existing Pareto frontier.

There are many ways to fit a hyperplane to a set of data. One option is to arbitrarily select one output dimension $Y_d$ and perform linear regression using the remaining outputs as regression variables. We recommend against this approach, as it suffers from the *regression fallacy*^{31} (Chap. 9.1). In practice, we perform a principal component analysis of the available Pareto frontier data and use the least-variance direction to define the hyperplane direction $\hat{\mathbf{w}}$. Together with the mean of the Pareto frontier data $\bar{Y}$, we then define the offset via $\hat{b} = \hat{\mathbf{w}}^\top \bar{Y}$.

Figure 4 illustrates the HPI: Note that this definition assumes a hyperplane structure, which will not be appropriate for all Pareto frontiers. Furthermore, the HPI “fills in” the staircase structure posed by the true Pareto frontier—the final frontier modeling acquisition function captures this structure.

*Probability Non-Dominated* (PND): The PND is defined via

$$\mathrm{PND}_i = \mathbb{P}\big[\text{there is no } \tilde{\mathbf{y}} \in \tilde{Y} \text{ such that } Y_i \prec \tilde{\mathbf{y}}\big],$$

with dominance $\prec$ defined in (1). In words, the PND computes the probability that a given candidate $Y_i$ will be non-dominated with respect to the available data. This criterion is studied in an approximate form in Ref. 29. The PND fully considers the known Pareto frontier, introducing no modeling assumptions. Figure 4 illustrates this acquisition function against the aforementioned definitions.

Note that while the PND captures the true geometry of the Pareto frontier, the frontier is allowed to have quite general structure, so no simple analytic expression exists as for the PJE or HPI. Instead, one may approximate the PND via ordinary Monte Carlo, drawing samples from the predictive distribution $\hat{D}_i$ and counting the proportion that are non-dominated. This leads to a greater computational expense to evaluate the PND, as compared with the PJE or HPI. The experiments below will demonstrate that this added expense can be valuable if higher AL performance is desired.
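A Monte Carlo sketch of the PND, under our simplifying assumption of independent Gaussian predictive marginals (helper names ours):

```python
import random

def dominates(y, y_prime):
    # Maximization convention, as in definition (1).
    return (all(a >= b for a, b in zip(y, y_prime))
            and any(a > b for a, b in zip(y, y_prime)))

def pnd(mu, sigma, Y_train, n_samples=2000, seed=0):
    """Probability that a candidate is non-dominated w.r.t. Y_train,
    estimated by ordinary Monte Carlo over the predictive distribution."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        draw = tuple(rng.gauss(m, s) for m, s in zip(mu, sigma))
        if not any(dominates(y, draw) for y in Y_train):
            hits += 1
    return hits / n_samples
```

Each PND evaluation costs `n_samples` dominance checks against the full training set, which is the extra expense relative to the closed-form PJE and HPI noted above.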

## IV. RESULTS

### A. Test cases

To compare the acquisition functions introduced above, we simulate AL on a collection of synthetic and physical databases. The synthetic cases are constructed to present different Pareto frontier geometries, including a linear frontier, as well as examples of convex and concave frontiers. We also construct a “sparse” frontier containing relatively few low-stratum points. Figure 6 presents these test-cases’ two-dimensional output spaces. The functional forms for these models are given in the Appendix.

In the motivating example above (Fig. 1), we considered a published dataset of experimental thermoelectric materials.^{32} The performance (property) space consists of thermal conductivity *κ*, electrical resistivity *ρ*, and the Seebeck coefficient *S*. The inputs are computed using the Magpie featurization library: We use the matminer package to compute features (descriptors) including stoichiometry, valence orbital, and ion properties as inputs.^{33,34} Full code to load this dataset and featurization is available in the supplementary material.

We also consider a computational dataset of organic molecules: the QM9 dataset, which contains 134 k small organic molecules with ground-state geometry and several properties computed at the density-functional level of theory.^{35} We featurize these data using the Chemistry Development Kit^{36} and model the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) properties. This case study is inspired by, but not directly related to, the design of organic photovoltaics.^{37} For optimization, we consider maximization of HOMO and minimization of LUMO, corresponding to a minimization of the HOMO–LUMO gap, while seeking a low LUMO value. To promote scalability for replications, we filter the dataset to contain only the first 25 strata, resulting in 2108 molecules. Figure 5 illustrates this frontier.

### B. Retrospective active learning experiments

The primary evidence of this work is based on a battery of AL simulations, which support the comparison of different acquisition functions against a common setup. For all experiments, we use the random forest model implemented in the lolo package.^{38}

Chance phenomena such as initial data selection can affect AL results, implying that single runs of AL are insufficient for measuring performance. We consider an ensemble of runs against a randomized selection of initial data in order to provide a robust estimate of relative performance.

To perform a single run of AL, we carry out the following steps:

1. Choose an initial random subset $\tilde{X}_0 \subseteq X$ of size $|\tilde{X}_0| = C$ for the training data and reveal the paired labels $\tilde{Y}_0$. Set the iteration counter to $k = 0$.
2. Fit a random forest model to the available paired data $\tilde{X}_k, \tilde{Y}_k$. This returns predictions $\hat{y}_i$ and predictive densities $\hat{D}_i$.
3. Rank the candidates $x_i \in (X - \tilde{X}_k)$ remaining in the dataset according to the chosen acquisition function $f_a(x_i)$. Select the top-performing candidate $x^{*}$ and add it to the database, $\tilde{X}_{k+1} = \tilde{X}_k \cup \{x^{*}\}$. Reveal the label for $x^{*}$.
4. Repeat from Step 2 for $k = 1, \dots, K - 1$ total iterations.
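The retrospective loop above can be sketched end to end in a few lines. Here the model-fitting step is abstracted behind the acquisition callback (our placeholder structure, not the lolo random forest), which is enough to show the bookkeeping:

```python
import random

def run_al(X, Y, acquire, C=5, K=10, seed=0):
    """Retrospective active learning on a fully labeled dataset.

    X, Y: lists of inputs and (multi-objective) outputs, ground truth known.
    acquire: callback (pool indices, training indices) -> chosen index;
             in practice it would fit a model on the training indices and
             rank the pool by an acquisition function.
    Returns the list of acquired indices, in order.
    """
    rng = random.Random(seed)
    train = rng.sample(range(len(X)), C)   # Step 1: initial random subset
    picks = []
    for _ in range(K):                     # Steps 2-4: fit, rank, reveal
        pool = [i for i in range(len(X)) if i not in train]
        if not pool:
            break
        best = acquire(pool, train)        # acquisition ranks the pool
        train.append(best)                 # reveal the label for x*
        picks.append(best)
    return picks

# The random-selection baseline ("Rand." in Table I):
def random_acquire(pool, train, rng=random.Random(1)):
    return rng.choice(pool)
```

An ensemble of runs is then a loop over seeds, with results aggregated across the `R` repetitions as described above.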

To perform an ensemble of AL runs, we select different random subsets $\tilde{X}_{0,1}, \dots, \tilde{X}_{0,R}$ for $R$ repetitions and aggregate the results across these runs. We vary the initial candidate pool size $C$, the total number of iterations $K$, the considered test case, and the chosen acquisition function. We performed 150 AL runs for each choice of acquisition function and test case: Table I summarizes experimental details, while the supplementary material provides numerical details on $C$, $K$, and $R$.

| Short | Acquisition function | Fidelity |
|---|---|---|
| Rand. | Random selection | NA |
| MSCV | Maximum sum of coefficients of variation | NA |
| MPJE | Maximum probability of joint exceedance | Low |
| MHPI | Maximum hyperplane probability of improvement | Medium |
| MPND | Maximum probability non-dominated | High |


### C. Candidate improvement

Here, we report results on candidate discovery performance. We consider both the usual metric of the number of non-dominated points (NNDP) acquired (Fig. 7) and the proposed measure of the mean stratum number (Fig. 8). Generally, the acquisition functions that acquire the most NNDP are (1) MPND, (2) MHPI, and (3) MPJE; this matches the descending order of Pareto frontier representation fidelity. The ranking results for the same criteria are mixed for shorter-term performance. These results suggest that capturing the true geometry of the Pareto frontier is most important in "later stage" AL, where the space of candidates is limited. As seen in Fig. 1, MPND also outperforms the other criteria on the thermoelectric dataset in terms of long-run performance. These results suggest that incorporating high-fidelity modeling of the Pareto frontier into the acquisition function is particularly important in well-studied problems with few possible candidates. Finally, note that the hyperbolic test case has only five non-dominated candidates in total; this fundamentally limits the resolution of AL results for this case. Figure 8 analyzes the same AL results in terms of the NNDP and mean stratum number, demonstrating how the two metrics indicate different trends.

The difference in "early stage" performance is better revealed by the mean stratum results (Fig. 8). In the early stages of AL, MHPI tends to give the best performance, delivering more candidates at or near the Pareto frontier, while MPND consistently achieves the best long-run performance. Note that, rather than simply ranking behind the other criteria as under the NNDP metric, MPJE performance tends to saturate earlier than the other criteria; this suggests that MPJE begins selecting candidates far from the Pareto frontier in later-stage AL. The reasons for these differences are explained in Subsection IV E, which analyzes qualitative performance.

### D. Accuracy and acquisitions

Here, we report results on how model accuracy relates to acquisitions during AL. Figure 1 illustrates that global scope error can be misleading for judging the relative performance of active learning techniques; Fig. 9 compares shell and global error with acquisition performance. The results on the analytically simple linear case indicate that shell scope error is more closely associated with the acquisition of higher-performing candidates.

However, in Fig. 9, we also see complicating factors: Results on the QM9 dataset show that shell error erroneously highlights MPJE as performant but correctly identifies MHPI as performant. Note that the shell scope is based solely on recursive non-dominance, a feature of output property space. Other factors, such as structure–property relationships, are only accounted for indirectly through a shell error calculation. These results indicate that while shell error is more informative for AL performance, the shell scope is evidently (and intuitively) not the only salient factor.

### E. Qualitative performance

Here, we report how the acquisition functions of Subsection III C behave in terms of qualitative performance, that is, where their selections tend to lie in output space. Figure 10 reports empirical densities for selected candidates in output space. Broadly, MPJE tends to select candidates throughout the output space, MHPI focuses on "hotspots" along the frontier, and MPND thoroughly explores the frontier but infrequently selects strongly dominated candidates. This qualitative analysis helps to explain the difference in NNDP (Fig. 7) and mean stratum (Fig. 8) performance: MPND is able to explore the frontier thoroughly, selecting many non-dominated points. However, MPND is also more liable than MHPI to select strongly dominated candidates, leading to a higher mean stratum number. Conversely, MHPI tends to concentrate its selections closer to a limited region of the Pareto frontier, leading to fewer possible non-dominated selections but a lower mean stratum.

## V. CONCLUSIONS

In this work, we explored the relationship between different aspects of machine learning accuracy and candidate acquisition in multi-objective candidate discovery. We showed that AL schemes which optimize for the usual notion of model accuracy—global error—do not guarantee optimal candidate discovery. To ameliorate the situation, we introduced Pareto shell error, which we found to be more closely associated with discovering improved candidates.

We also studied the relationship between the fidelity with which acquisition functions represent the Pareto frontier and AL performance. We demonstrated that dimensionally inhomogeneous acquisition functions can lead to non-robust decision making and so limited our attention to dimensionally homogeneous acquisition functions. We found that the long-run discovery of non-dominated candidates was improved by modeling the Pareto frontier with greater fidelity. However, an acquisition function, which rendered the Pareto frontier with lesser fidelity (MHPI), uncovered more candidates at or near the Pareto frontier in the early stages of AL, leading to a lower mean stratum number than the highest fidelity acquisition function (MPND). We found that these acquisition functions tended to select material candidates at varying locations along the Pareto frontier with different frequency, leading to preferred “hot-spots” in material property space.

These results have ramifications for scientists seeking to use AL for candidate discovery. Since global error is not always predictive of optimal candidate discovery, an analyst should check both global and shell error when deciding whether a model is sufficiently accurate to be used to rank candidates. Furthermore, an analyst can use these insights prescriptively, choosing hyperparameters to optimize shell error rather than global error.

The selection of an appropriate acquisition function is highly dependent on the analyst’s goals. Based on our results, accurately modeling the Pareto frontier in an acquisition function is not critical for early stage AL, that is, if the problem has a very large set of uncharacterized material candidates. Accurately capturing the Pareto frontier in the acquisition function logic becomes important when the design space is more thoroughly explored. In the absence of more specific preference criteria than non-dominated, the maximum probability non-dominated (MPND) strategy is the most performant among the acquisition functions tested here.

There are a number of remaining questions related to the present topic. By the simulation nature of our experiments, we were able to evaluate the true Pareto shell error; in practice, one must devise an estimation technique based on the available data. Given the suitability of different acquisition functions for different phases of active learning, a *homotopy* approach that weights different objectives *λf*_{1} + (1 − *λ*)*f*_{2} with *λ* varying between iterations may be a useful framework. In this work, we fit independent models to all output quantities: Prior work suggests that modeling improvements can be made by accurately capturing the output dependence structure among output quantities.^{39,40} It would be interesting to study whether similar improvements can be found in AL performance. A conceptually different way to treat the multi-objective setting is to turn some objectives into constraints; it would be interesting to compare the constrained approach against a multi-objective acquisition function in terms of the mean stratum number (cf. Fig. 8) and acquisition densities (cf. Fig. 10). Similarly, it would be interesting to compare performance across AL strategies using different scalarizations of the multi-objective space, e.g., the thermoelectric figure of merit *zT*.

## SUPPLEMENTARY MATERIAL

See the supplementary material for full code to reproduce the datasets used in this study.

## AUTHORS’ CONTRIBUTIONS

Z.R., Y.K., E.A., and J.L. designed the research. Z.R. performed the research and analyzed the data. Z.R. and M.R. wrote the paper. All the authors read the manuscript, commented on the contents, and agreed with the publication of the results. M.R. contributed to the present work while employed by Citrine Informatics.

## DATA AVAILABILITY

The data that support the findings of this study are available within the article (and its supplementary material).

## ACKNOWLEDGMENTS

Z.R. was funded by the Diversifying Academia, Recruiting Excellence program at Stanford University. Part of this work was performed while M.R. was visiting the Institute for Pure and Applied Mathematics (IPAM), which is supported by the National Science Foundation (Grant No. DMS-1440415).

### APPENDIX: TEST CASE DETAILS

The following are the underlying distributions and functions used to generate the synthetic data cases. These test cases were designed to provide a variety of Pareto frontier geometries. For all four synthetic test cases, the output space is two-dimensional, $\mathbf{y} \in \mathbb{R}^2$, and both outputs are to be maximized. Code for reproducing these datasets is provided in the supplementary material.

#### 1. Linear test case

A simple linear frontier geometry, generated by rotating (and stretching) a uniform distribution.

#### 2. Circular test case

A circular frontier geometry, generated via trigonometric functions.

#### 3. Hyperbolic test case

The hyperbolic responses lead to a far greater density of points near the origin. This results in a "sparse" Pareto frontier, which often has very few non-dominated candidates.

#### 4. Bat test case

A non-convex Pareto frontier generated by perturbing the radius of the circular test case.

## REFERENCES
