The recent boom in computational chemistry has enabled several projects aimed at discovering useful materials or catalysts. We acknowledge and address two recurring issues in the field of computational catalyst discovery. First, calculating macro-scale catalyst properties is not straightforward when using ensembles of atomic-scale calculations [e.g., density functional theory (DFT)]. We attempt to address this issue by creating a multi-scale model that estimates bulk catalyst activity using adsorption energy predictions from both DFT and machine learning models. The second issue is that many catalyst discovery efforts seek to optimize catalyst properties, but optimization is an inherently exploitative objective that is in tension with the explorative nature of early-stage discovery projects. In other words, why invest so much time finding a “best” catalyst when it is likely to fail for some other, unforeseen problem? We address this issue by relaxing the catalyst discovery goal into a classification problem: “What is the set of catalysts that is worth testing experimentally?” Here, we present a catalyst discovery method called myopic multiscale sampling, which combines multiscale modeling with automated selection of DFT calculations. It is an active classification strategy that seeks to classify catalysts as “worth investigating” or “not worth investigating” experimentally. Our results show an ∼7–16 times speedup in catalyst classification relative to random sampling. These results were based on offline simulations of our algorithm on two different datasets: a larger, synthesized dataset and a smaller, real dataset.

## I. INTRODUCTION

Recent advances in computing hardware and software have led to substantial growth in the field of computational materials science. In particular, databases of high-throughput calculations^{1–6} have increased the amount of information available to researchers. These databases facilitate the development of models that supplement human understanding of physical trends in materials.^{7–9} These models can then be used in experimental discovery efforts by identifying promising subsets of the search space, resulting in increased experimental efficiency.^{10–15}

However, many materials design efforts use material properties and calculation archetypes that are too problem-specific to be tabulated in generalized databases. When such efforts coincide with design spaces too large to search in a feasible amount of time, we need a way to search through the design space efficiently. Sequential learning, sometimes referred to as optimal design of experiments or active learning, can fill this role. Sequential learning is the process of using the currently available data to decide which new data would be most valuable for achieving a particular goal.^{16–18} In practice, this usually involves fitting a surrogate model to the available data and then pairing the model with an *acquisition function* that calculates the values of a new, potential data point. Then, we *query* the most valuable data points, add them to the dataset, and repeat this process. These sequential learning methods have been estimated to accelerate materials discovery efforts by up to a factor of 20.^{19}

Sequential learning has numerous sub-types of methods that can and have been used for different goals. One such sub-type is active learning. With many active learning algorithms, the goal is to replace a relatively slow data-querying process with a faster-running surrogate model.^{20} Since the surrogate model may be used to query any point, the acquisition functions focus on ensuring that the entire search space is explored. Another sub-type of sequential learning is active optimization.^{21} With this sub-type, the goal is to maximize or minimize some objective function. Thus, the acquisition functions generally focus on parts of the search space where maxima or minima are more likely to occur. One of the most common types of active optimization is Bayesian optimization.^{21} Yet another sub-type of sequential learning is online or on-the-fly learning.^{22} The goal of these methods is to accelerate the predictions of streams of data. In the field of computational materials science, this is often applied to predicting trajectories for density functional theory (DFT) or molecular dynamics calculations.^{23,24}

In computational materials discovery, we often have the following task: we have a set of available materials $X={xi}i=1n$, where each material *x*_{i} has an associated quantity *y*_{i}, denoting its value for some application. Examples of common properties for *y*_{i} include—but are not limited to—formation energies of materials, catalyst activity, tensile strength, or conductivity. The value *y*_{i} is unknown and must be calculated, which can be costly in time, money, or other resources. Furthermore, theoretical calculations of material properties may be inconsistent with experimental results. As per a common aphorism among statisticians, “All models are wrong, but some are useful.”

Due to these potential model errors and due to the exploratory nature of materials discovery, we propose reframing the materials discovery question. Instead of trying to discover materials with optimal *y*_{i} values, what if we instead classify materials as having promising or unpromising *y*_{i} values? In other words, what if we frame materials discovery efforts as classification problems rather than optimization problems? The estimated classes could then be used to design physical experiments. Mathematically, this is akin to assuming that the material *i* has a binary value *y*_{i} ∈ {0, 1}, where 0 denotes “*not of interest*” and 1 denotes “*of interest*.”

The goal is then to determine the values *y*_{i} for each $xi\u2208X$ as cheaply as possible. One can view this as the task of most-efficiently learning a classifier that, for each *x*_{i}, correctly predicts its value *y*_{i}. In this way, materials discovery problems can be framed as problems of *active classification*. Active classification is the task of choosing an ordering of $xi\u2208X$, over which we will iterate and sequentially measure their values *y*_{i}, in order to most efficiently (using the fewest measurements) learn a classifier that predicts the correct label for all materials $xi\u2208X$.^{25,26}

Another aspect of computational materials discovery is the ability to turn calculations into recommendations—e.g., how can we convert DFT results into actionable experiments? This conversion is relatively straight-forward when properties are directly calculable, which is the case for properties such as the enthalpy of formation.^{27} If we perform a single DFT calculation that suggests a single material may be stable, then we can suggest that single material for experimentation. However, for many applications, the properties of interest may not be calculable directly. For example, let us say we are interested in finding active catalysts. One way to do this is to use DFT to calculate the adsorption energy between the catalyst and particular reaction intermediates and then couple the resulting adsorption energy with a Sabatier relationship.^{28} However, *in situ*, a catalyst comprises numerous adsorption sites and surfaces. Thus, the true activity of a catalyst may be governed by an ensemble of adsorption energies and, therefore, may need multiple DFT calculations. How do we address the fact that we need multiple DFT queries to resolve the properties of a single material?

Here, we attempt to address both outlined issues: (1) we need an ensemble of DFT queries to calculate a single experimental property of interest and (2) we need a sequential learning method designed for high-throughput discovery/classification. We overcome both issues by creating the Myopic Multiscale Sampling (MMS) method (Fig. 1). MMS addresses the first aforementioned issue by using a multiscale modeling framework for estimating the activity of a catalyst using an ensemble of both DFT and Machine Learning (ML) predicted adsorption energies. MMS then addresses the second issue by combining this multiscale modeling framework with a number of sequential learning methods, including active classification. Note that MMS, as we describe it in this paper, is tailored to discovering active catalysts. Although this method may not be directly transferable to other applications, we hope that others may be able to adapt the principles of the method to their own applications.

## II. METHODS

### A. Multiscale modeling

In this paper, we use the discovery of active catalysts as a case study. Catalyst activity is often correlated with the adsorption energy of particular reaction intermediates, as per the volcano relationships stemming from the Sabatier principle.^{28,29} These adsorption energies can be calculated using DFT. Each DFT-calculated adsorption energy is specific to a particular binding site of a particular surface of a particular catalyst. Thus, the relationship between DFT-calculated adsorption energies and a catalyst’s activity is not simple.

For example, in cases of lower adsorbate coverage on the catalyst surface, adsorbates tend to adsorb to stronger-binding sites before weaker-binding sites. In cases of higher adsorbate coverage, adsorption energies are difficult to calculate, so it is not uncommon to assume low adsorbate coverage.^{29–31} We make this assumption here. Specifically, we assume that the adsorption energy of a surface is equivalent to the strongest adsorption energy seen across all sites on that surface. We then feed this adsorption energy into an activity volcano to predict the activity of the surface. Note that the activity volcano we used here^{32} used changes in free energy (Δ*G*) to predict activity, while the datasets we used here contained changes in enthalpy (Δ*H*). We assumed that free energy corrections remained constant at 0.5 eV such that Δ*G* = Δ*H* + 0.50 eV, as explained in our previous work.^{14}

Given the activities of the surfaces of a catalyst, the next step is to estimate the activity of the entire catalyst. One way to do this would be to perform a weighted average of the surface activities, where higher weights are given to surfaces that are more stable. For the sake of simplicity, we instead propose a uniform average. We recognize that future work may involve investigating more sophisticated averaging methods. Such methods may be especially important because the exponential nature of the catalyst activity often leads to orders-of-magnitude shifts in activity between different surfaces. Thus, the estimates of bulk activity may be sensitive to the weights used for high-activity surfaces.

Concretely, suppose we have *n* catalyst candidates ${xi}i=1n$, where each candidate *x*_{i} has *m* surfaces ${ui,j}j=1m$, and the surface *u*_{i,j} has *ℓ* sites ${si,j,k}k=1\u2113$. For a given site *s*_{i,j,k}, denote its free energy of adsorption by Δ*G*(*s*_{i,j,k}), and for a given surface *u*_{i,j}, denote its catalytic activity by *α*(*u*_{i,j}). Likewise, for a given catalyst material candidate *x*_{i}, denote its average catalytic activity by $\alpha (xi)=1m\u2211j=1m\alpha (ui,j)$. Suppose we have a predictive uncertainty estimate for the adsorption energy Δ*G*(*s*_{i,j,k}) of a site, represented by a normal distribution with mean *μ*_{i,j,k} and variance $\sigma i,j,k2$. We can then perform simulation-based uncertainty quantification of catalyst activity by using the multiscale modeling process we described above to propagate uncertainties from sites’ adsorption energies. Specifically, for each material candidate *x*_{i}, we generate *H* samples of its catalytic activity, ${\alpha \u0303ih}h=1H$, by simulating from the following generative process:

where *t*^{*} is the optimal absorption energy for a given volcano relationship and *M*_{1}, *M*_{2}, *B*_{1}, and *B*_{2} are the linear coefficients associated with the two sides of the log-scaled volcano relationship of a given chemistry. Furthermore, we use the notation $\Delta \u0303Gi,j,1:\u2113h$ to denote the set ${\Delta \u0303Gi,j,kh}k=1\u2113$. In other words, we first generate *H* samples from the adsorption energy distributions for each adsorption site. Then, we pair each of the *H* samples from one adsorption site to other samples from sites on the same surface. Within each set of pairings, we find the minimum/strongest/most negative adsorption energy and treat that as one sample for the adsorption energy of the surface. This leads to *H* samples for the adsorption energy of a surface. Next, we map these *H* adsorption energies for each surface through a literature-sourced volcano relationship^{32} to generate *H* samples of activity for each surface. Due to the non-continuous and opaque nature of the volcano relationship, we opt to simplify this mapping process by approximating the volcano as the junction of two linear approximations [Eq. (1)]. Last, we pair each of the *H* activity samples from each surface with corresponding activity samples from other surfaces within the same bulk and then average them to generate a sample for bulk activity. This leads to *H* samples for bulk activity, which can then be composed into a distribution of bulk activity. Figure 2 illustrates this process of using our multiscale modeling method to estimate catalyst activity from DFT-calculated adsorption energies, including uncertainty quantification.

Each catalyst material candidate $x\u2208X$ has some true catalytic activity level *α*(*x*). Our goal will be to determine the top *p*-% of catalyst material candidates in terms of their activity levels, which we denote $Xp={x\u2208X:r(\alpha (x))\u2265pn100}$, where $r:R+\u2192{1,\u2026,n}$ is a function mapping the activity level *α*(*x*) to an index denoting its rank (from highest to lowest activity). Given a specified *p*, if a candidate material is in this set, i.e., $xi\u2208Xp$, then we say that its associated binary value *y*_{i} = 1, or say *y*_{i} = 0 otherwise. In simpler terms, we want to find the top *p*-% most active catalysts. For this paper, we choose *p* = 10% arbitrarily. Any catalyst that falls within the top 10% in terms of activity will be labeled active, and anything below the top 10% will be labeled inactive.

We can, therefore, frame our goal as determining the associated binary value *y*_{i} for each catalyst material candidate $xi\u2208X={xi}i=1n$. Suppose we have formed point estimates for each of the binary values, written as ${\u0177i}i=1n$. To assess the quality of this set of estimates with respect to the set of true candidate values, we focus on the *F*_{1} score—a popular metric for classification accuracy, defined as

Given a set of ground-truth values ${yi}i=1n$, we are able to compute the *F*_{1} score for a chosen set of value estimates ${\u0177i}i=1n$.

However, in practice, we will typically not have access to these ground-truth values and, thus, cannot compute this score in an online procedure. For use in online experiments, we will take advantage of a metric that yields an estimate of the change in *F*_{1} score. This metric is computable using only our model of the activity of each catalyst, without requiring access to ground-truth values ${yi}i=1n$, and can be used to assess and compare the convergence of our methods. Furthermore, it can be used to provide an early stopping method for our active procedures. We will show experimentally in Sec. III that this metric shows a strong correlation with the *F*_{1} score.

### B. Sampling strategy

The goal of MMS is to discover catalysts that are likely to be experimentally active. Optimization of catalytic activity is not the main priority because we assume that experimentalists have the ability to investigate multiple candidate catalysts and that unforeseen experimental issues may make some of these redundant. Instead, a greater focus is given on identification of a large number of candidates rather than finding “the most active” candidate. That is why the core sequential learning algorithm we use in MMS is active classification.^{25,26} To be specific, we use Level Set Estimation (LSE) to identify catalysts for DFT sampling. After identifying catalysts for DFT sampling, we then need to choose which surface of the catalyst to sample; here, we use techniques from active regression to choose a given surface to query (note that, here, regression is used simply as a subroutine within our ultimate task of active classification). Once a surface is chosen, we then attempt to find the strongest binding site on that surface by using active optimization of the adsorption energies. Thus, we combine three different sequential learning strategies across three different length scales to decide which site-based DFT calculation will help us classify active vs inactive catalysts (Fig. 3).

We first describe the initial step of our sampling strategy, which consists of selecting a catalyst material candidate from our candidate set $X={xi}i=1n$. Note that our high-level goal is binary classification in which we want to efficiently produce accurate estimates ${\u0177i}i=1n$ of the binary value for each material candidate. Based on our definition of $yi=1xi\u2208Xp$, this problem can be equivalently viewed as the task of LSE in which we aim to efficiently produce an accurate estimate of the superlevel set $Xp={x\u2208X:r(\alpha (x))\u2265pn100}$. There has been a body of work on developing acquisition functions for choosing candidates to query in the task of LSE.^{33,34} In particular, we focus on the *probability of incorrect classification* acquisition function,^{35} defined for an $xi\u2208X$ as

where

Thus, to select a subsequent catalyst candidate, we compute *φ*(*x*_{i}) for each $xi\u2208X$ and return the maximizer $x*=arg maxxi\u2208X\phi (xi)$. In simpler terms, we choose the catalyst that we are most likely to classify incorrectly. Note how this implies that we do *not* query catalysts that we are confident are active, which is different from active optimization methods. This provides a more exploratory method—i.e., focuses on bulks that are near the classification boundary—rather than an exploitative one—i.e., focusing on bulks with highest activity, as would be done for an optimization task. Given our classification objective designed for early-stage computational discovery and screening, the former method is more appropriate.

The selection of a catalyst candidate *x*_{i} depends on its estimated catalytic activity, which we model as an average of the catalytic activities across the surfaces of the candidate, i.e., $\alpha (xi)=1m\u2211j=1m\alpha (ui,j)$. Though we select a candidate based on its ability to help improve our estimate of the superlevel set $Xp$, once selected, we then wish to most efficiently improve our estimate of this candidate’s catalytic activity. Our goal at this stage is, therefore, to most efficiently learn the catalytic activities for each surface of that candidate. This can be viewed as an active regression task, where we aim to sample a surface that will most reduce the uncertainty of our surface activity estimates. To select a surface, we use an *uncertainty sampling for regression* acquisition function from the active learning literature,^{36} defined as

which selects a surface $ui*$ of the material candidate *x*_{i} that has the greatest variance. In simpler terms, we choose the surface of a catalyst that has the most uncertainty because we suspect that this choice is most likely to reduce our uncertainty estimate of catalyst activity.

The catalytic activity of a given surface *α*(*u*_{i,j}) is a function of the adsorption energies of the sites on this surface, according to the relationship $\alpha (ui,j)=exp(\u2212|M\Delta \u0303Gi,j,1:\u2113+B|)$ from Eq. (1), where $\Delta \u0303Gi,j,1:\u2113$ is the set of adsorption energies over all sites on the surface. Therefore, given a selected surface *u*_{i,j}, we wish to determine efficiently the site on this surface with minimum adsorption energy. This can be viewed as an optimization task. We, therefore, use the *expected improvement* acquisition function from Bayesian optimization,^{37} defined as

where $\mu \u0303\u2009=\u20091H\u2211h=1H\Delta G\u0303i,j,kh$ is the expected adsorption energy, $\sigma \u0303\u2009=\u20091H\u22121\u2211h=1H\Delta G\u0303i,j,kh\u2212\mu \u03032$ is its standard deviation, Φ is the cumulative density function (CDF) of a standard normal distribution, *ϕ* is the probability density function (PDF) of a standard normal distribution, and Δ*G*^{*} is the minimum observed adsorption energy. This selects a site $si,j*$ that is expected to most reduce the site adsorption energy relative to the current minimum observed energy and allows for efficient estimation of the minimum energy site on the surface *u*_{i,j}. In simpler terms, we choose the site on a surface that is most likely to help us identify the strongest/lowest energy binding site on the surface.

### C. Active learning stopping criteria

Assessing convergence of an active algorithm is useful for enabling early stopping, which can save resources. Measures of convergence can also provide diagnostics in online use settings. To quantify convergence, we use the *predicted change in F*1 *score* ($\Delta F^$).^{38} Intuitively speaking, this rule says to stop an active learning procedure when $\Delta F^$ drops and stays below a predefined threshold *ϵ* for *κ* consecutive windows. We refer to *κ* as the stopping horizon, i.e.,

In our setting, $\Delta F^$ is defined to be

where *a* is the number of bulks for which the models at iterations *i* and *i* + 1 both yield a positive label, *b* is the number of bulks for which the model at iteration *i* yields a positive label, while that at iteration *i* + 1 yields a negative label, and *c* is the number of bulks for which the model at iteration *i* yields a negative label, while that at iteration *i* + 1 yields a positive label. Each of *a*, *b*, and *c* is computed over the previous *k* iterations. This measure provides an estimate of the change in accuracy at each iteration, and it allows us to control how conservatively (or aggressively) we stop early via an interpretable parameter *ϵ*. We show results of this measure alongside our *F*1 score in Sec. III. Note that Altschuler and Bloodgood^{38} recommend using a *stop set* of unlabeled points over which to calculate $\Delta F^$. Here, we use the entire search space of catalysts *in lieu* of a stop set because it was non-trivial for us to define a stop set that was representative of the search space.

### D. Management of data queries

Implementation of MMS also involves definition of several hyper-parameters. For example, most surrogate models require training data before making predictions to feed the sampling method. This means that we needed to seed MMS with initial training data. We chose to create the initial training data by randomly sampling 1000 adsorption energies from the search space. We used random sampling for simplicity, and we sampled 1000 adsorption energies because that was the minimum amount of data on which the Convolution-Fed Gaussian Process (CFGP) (described below in more detail) could train on and maintain numerical stability.

Another consideration for MMS is the *batch size* and how to handle queries in tandem. Normal sequential learning assumes that we can make one query at a time. However, in applications such as ours, it may be possible to make multiple queries in parallel—i.e., we can perform multiple DFT calculations at a time. There are several methods for handling queries in parallel; we chose to use a type of look-ahead sampling.^{39} With look-ahead sampling, we began by choosing the first point to sample using the standard acquisition strategy. Then, while that point was still “being queried,” we assumed that the first point was queried successfully and set the “observed” value equal to our predicted value. In other words, we pretend that we sampled the first data point and that our prediction of it was perfect. This allowed us to then recalculate our acquisition values to choose a second point. This process of “looking ahead” one point at a time was continued until a predetermined number of points were selected for querying—i.e., the batch size. Here, we chose a batch size of 200 points because that was roughly the number of DFT calculations that we could perform in a day during our previous high-throughput DFT studies.^{14} Note that we did not re-train the surrogate models within each batch of 200 points; we only re-calculated acquisition values between each sample within each batch. We skipped re-training of surrogate models within each batch to reduce the amount of model training time required to perform this study. Although this may have reduced the effectiveness of the look-ahead method, we found the increased algorithm speed to be worthwhile.

### E. Estimating performance through simulation

We aim to experimentally assess the performance of MMS and compare it with that of a variety of baseline methods without incurring the high cost of repeated DFT calculations. To do this, we simulate each procedure using a database of pre-determined adsorption energies. Specifically, suppose we have chosen a set of *n* catalyst material candidates ${xi}i=1n$ of interest. For each candidate *x*_{i}, we already have all the adsorption energies Δ*G*(*s*_{i,j,k}) for the full set of sites across the full set of surfaces on *x*_{i}. We can then run our procedures in a relatively fast manner, where we can quickly query the database at each iteration of a given method rather than running DFT. Similar offline-data discovery procedures have been pursued by previous work in optimization and active learning, where expensive evaluations have been collected offline and used for rapid online evaluation.^{40–42}

One notable baseline method is *random search*, which at each iteration samples sites to carry out DFT calculations uniformly at random from the full set of sites over all catalyst material candidates. We provide simulation results using random search as a benchmark to compare MMS against.

#### 1. Surrogate models used

Our objective in this paper is to assess the performance of MMS. The performance of MMS is likely to depend on the surrogate model used to predict adsorption energies from atomic structures. We assume that surrogate models with high predictive accuracy and calibrated uncertainty estimates^{43} will outperform models with low accuracy and uncalibrated uncertainty estimates, but we are unsure of the magnitude of this difference. We, therefore, propose to pair at least two different models with MMS: a “perfect” model and an “ignorant” model.

We define the “perfect” model, hereby referred to as the “prime” model, as a model that returns the true adsorption energy of whatever data point is queried. This perfect prediction ensures high model accuracy. When asked for a standard deviation in the prediction, the prime model will return a sample from a *χ*^{2} distribution whose mean is 0.1 electron volts (eV). This uncertainty ensures a sharp and calibrated^{43,44} measure of uncertainty. We do not use standard deviation of zero because (1) it causes numerical issues during multiscale modeling and (2) any model in practice should not be returning standard deviations of zero.

We define the “ignorant” model, hereby referred to as the “null” model, as a model that returns the optimal adsorption energy no matter what is queried. This constant prediction ensures a relatively low model accuracy. When asked for a standard deviation in the prediction, the null model will return 1 eV. This uncertainty ensures a relatively dull and uncalibrated measure of uncertainty.

Last, we also choose to use a third, most practical model: CFGP.^{44} The CFGP is a Gaussian process regressor whose features are the output of the final convolutional layer in a trained graph convolutional neural network. This model is our best current estimate of both an accurate and a calibrated model that could be used in practice. Thus, we have three models: null, CFGP, and prime, which are intended to give quantitative estimates of the minimal, medial, and maximal performances of MMS, respectively.

#### 2. Search spaces used

Previous studies have shown that different materials discovery problems have varying difficulties.^{18} Searching for a needle in a hay stack is generally more difficult than searching for a leaf on a branch. Thus, any simulation we do depends on the search space we use. To obtain a range of potential MMS performances, we perform simulations using two different datasets. Both datasets comprise thousands of atomic structures that represent CO adsorbing onto various catalyst surfaces, as well as corresponding adsorption energies. We then use Sabatier relationships from the literature to transform the adsorption energies into estimates of activity.^{32}

We defined our first search space by synthesizing it randomly. We did so by retrieving a database of enumerated adsorption sites from the Generalized Adsorption Simulator for Python (GASpy).^{14,45} These sites composed all the unique sites on all surfaces with Miller indices between −2 and 2 across over 10 000 different bulk crystal structures. We then randomly selected 200 of the bulk crystals along with all of the resulting surfaces and sites, yielding over 390 000 adsorption sites. Then, for each bulk crystal, we randomly sampled its “bulk mean adsorption energy” from a unit normal distribution. For each surface within each crystal, we randomly sampled its “surface mean adsorption energy” from a normal distribution whose mean was centered at the corresponding bulk mean and whose standard deviation was set to 0.3 eV. Then, for each site within each surface, we randomly sampled its adsorption energy from a normal distribution whose mean was centered at the corresponding surface mean and whose standard deviation was set to 0.1 eV. Thus, the adsorption energies were correlated within each bulk, and they were also correlated within each surface. We used a standard deviation of 1.0 eV for the catalysts because it was similar to the ∼1 eV variability in energy shifts we saw in our GASpy dataset. We used a standard deviation of 0.1 eV for the sites because DFT precision does not often exceed 0.1 eV. We used a standard deviation of 0.3 eV for the surfaces because it was between the 1.0 and 0.1 eV standard deviations of the catalysts and sites, respectively.

We defined our second search space by retrieving our database of ∼19 000 DFT-calculated CO adsorption energies calculated by GASpy, hereafter referred to as the GASpy dataset. The sites in this database were chosen using previous iterations of our sequential learning methods,^{14} and they, therefore, have bias in the locations at which they were sampled. Specifically, the sites in this database were chosen based on the likelihood that their adsorption energies were close to the optimal value of −0.67 eV.^{14,32}

There are several advantages of using the synthesized dataset over the real GASpy dataset, and vice versa. The synthesized dataset contains pseudo-random adsorption energies that are difficult for the CFGP to predict, thereby hindering its performance unfairly. Therefore, we should not and did not use the CFGP with the synthesized dataset; we used it with the GASpy dataset only. On the other hand, the number of surfaces per bulk and the number of sites per surface in the GASpy dataset were relatively sparse compared to those in the synthesized dataset. This can result in catalysts that require relatively few site queries to sample fully, which reduces the number of queries necessary to classify a catalyst. This reduction in the number of required queries per catalyst could artificially improve the observed performance of MMS.

## III. RESULTS

At the beginning of the simulations, the multiscale models made their catalyst class predictions (i.e., active or inactive) using the adsorption energy predictions and uncertainties of the models. As the simulations progressed and adsorption energies were queried, the models’ predictions of each queried energy were replaced with the “true” value of the query and the corresponding uncertainty was collapsed to 0 eV. This was done to mimic a realistic use case where we would not use model predictions when we had the “real” DFT data instead. It follows that as the simulations progressed and nearly all points were queried, most models performed similarly because they all had comparable amounts of “true” data to use in the multiscale model. Conversely, the number of “predictions” at the end of the simulations decreased, and so differences between methods also decreased as simulations progressed.

### A. Performance on synthesized data

This behavior is seen in Fig. 4(a), which shows how the *F*1 score changes at each point in the simulation of the synthesized dataset. Here, we see that the simulations using the prime model began with an *F*1 score of ∼0.6 that increased to 1 over time. On the other hand, simulations using the null model began with an *F*1 score closer to 0 or 0.2 before gradually increasing to 1. This shows that more accurate surrogate models for adsorption energies led to more accurate multiscale models, even initially. Note also that the rate at which the *F*1 score improved was better when using MMS than when using random sampling, especially when using the null model. These data may suggest that the rate of improvement is governed by the acquisition strategy, while the initial performance is governed by the model.

Figure 4(b) shows how the $\Delta F^$ changes at each point in the simulation of the synthesized dataset. The simulations using random search generally yielded higher $\Delta F^$ values. This indicates slower convergence, which is consistent with the slower *F*1 score increase seen in the random search curves in Fig. 4(a). Note also how the $\Delta F^$ values for the MMS-prime simulation decreased at around 500 batches, which is the number of batches it took for the *F*1 score to reach ∼1. Last, we note that the $\Delta F^$ values for the MMS-null simulation were often zero. This is because the null model was a “stiff” learner that did not result in any multiscale modeling changes unless a low-coverage adsorption site was found. This shows that slow-learning models may result in relatively low $\Delta F^$ values, which may necessitate longer stopping horizons (i.e., *κ*) to offset this behavior. In other words, worse models may need longer horizons before stopping the discovery to mitigate the chances of missing important information.

These simulations provided us with an estimate of the improvement in active classification that we may get from using MMS. With the synthesized dataset, we saw that the MMS-with-null case achieved an *F*1 score of ∼0.6 after ∼250 batches (or 50 000 queries). This was over seven times faster than the random-sample-with-null case, which achieved an *F*1 score of ∼0.6 after ∼1800 batches (or 360 000 queries). When using the prime model, MMS was able to achieve an *F*1 score of ∼0.75 in 200 batches, while the random search achieved the same performance in ∼1200 batches, or six times slower.

We remind readers that we allowed each method to supplant model predictions with “true” values from their respective queries. This is why all methods converged on identically perfect performance near the end of the simulations, where each method had near-perfect information to use.

### B. Performance on DFT data

Figure 5 shows the *F*1 score and the $\Delta F^$ of the multiscale model at each point in the simulation of the GASpy dataset. Interestingly, the system performance when using the CFGP was similar to the performance when using the null model, both of which were overshadowed by the relatively good performance when using the prime model. This suggests that there is a large room for improvement for the CFGP model. Note also how the MMS strategy outperforms random sampling for this dataset as well.

These simulations provided us with a second estimate of the improvement in active classification that we may get from using MMS. With the GASpy dataset, we saw that the MMS-with-null case achieved an *F*1 score of ∼0.8 after ∼6 batches (or 1200 queries). This was over 16 times faster than the random-sample-with-null case, which achieved an *F*1 score of ∼0.6 after ∼80 batches (or 16 000 queries). When using the prime model, both MMS and random search were able to achieve an *F*1 score of ∼0.8 after only a single batch.

To validate our multiscale model, we used it to classify the 5% most active catalysts in the GASpy dataset. We found that copper alloys composed a third of this set of most active catalysts, which is consistent with the research community’s findings that copper is one of the most active elements for reducing CO_{2}.^{46,47} We also found aluminum, tin, and gallium alloys in this active set of alloys, which have also appeared in CO_{2} reduction literature.^{15,48,49} Yttrium, scandium, and silicon alloys also appear in this active set. We acknowledge that these results are based on analyses of a dataset that was built using more heuristic methods, and further investigation is needed to verify the integrity of these candidates. However, at minimum, these results are not inconsistent with current CO_{2} reduction literature.

### C. Recommended diagnostics

We note that the *F*1 scores illustrated in Figs. 4(a) and 5(a) cannot be calculated without knowing all the true classes, which are not possible to know during a real discovery process. We need metrics to monitor the behavior of both our discovery algorithms. We recommend monitoring the $\Delta F^$ as well as the accuracy, calibration, and sharpness (i.e., the magnitude of the predicted uncertainties) of the surrogate model over time. Figure 6 shows an example of such diagnostic metrics over the course of our simulation that used MMS and CFGP on the GASpy dataset.

$\Delta F^$ estimates the amount of overall improvement in the discovery process. Sustained low values of $\Delta F^$ are a necessary but not sufficient indicator of convergence. To improve our confidence in the predictive strength of $\Delta F^$, we can test one of its underlying assumptions: the multiscale model becomes progressively more accurate as it receives more data. This assumption is true when we replace surrogate model predictions with incoming DFT results, but it is not necessarily true for unqueried points. We can estimate the accuracy on unqueried points by calculating the residuals between the surrogate model and the incoming DFT results [Fig. 6(b)]. As each “batch” of queries is received, we compare the queried, true adsorption energies with the energies predicted by the surrogate model just before retraining—i.e., the predictions used to choose that batch. Any improvements in accuracy on these points show that the overall, multiscale model is improving over time and that the $\Delta F^$ metric is an honest indicator of convergence. Figure 6(b) shows that model accuracy improves within the first ∼10 batches (or 2000 adsorption energy queries) but plateaus afterward. This indicates that, after ten batches, improvements in overall classification accuracy came from receipt of additional DFT data rather than improvements in surrogate model predictions.

Prediction accuracy of adsorption energies is not the only indicator of improved model performance. If a surrogate model’s accuracy does not change but its uncertainty predictions decrease/improve, then our confidence in the overall material classification may still improve. Of course, improvements in uncertainty must not be obtained at the expense of worse calibration. In other words, reductions in predicted uncertainties may also indicate improved model performance and better confidence in $\Delta F^$, but only if the expected calibration error^{44} does not increase. In our illustrative example, Fig. 6(c) shows the predicted uncertainty, while Fig. 6(d) shows the calibration. Unfortunately, the uncertainty predictions do not decrease over the course of the discovery process. Note that all uncertainty and calibration estimates for each batch should be calculated using the surrogate model predictions used to choose that batch, just as was done for the residuals.

Last, we also recommend monitoring the negative-log-likelihood^{44} of the surrogate model for each incoming batch. This metric incorporates model accuracy, calibration, and sharpness into a single metric. Lower values of negative-log-likelihood indicate better model performance. Figure 6(e) shows that this metric improves until ∼2000 queries, after which it stagnates. This is consistent with the improvement in accuracy until 2000 queries and subsequent stagnation of all performance metrics thereafter.

## IV. CONCLUSIONS

Here, we created a multi-scale modeling method for combining atomic-scale DFT results with surrogate/ML models to create actionable plans for experimentalists—i.e., a classification of catalysts as “worthy of experimental study” or “not worthy.” We then coupled this modeling method with a Myopic Multiscale Sampling (MMS) strategy to perform automated catalyst discovery via active classification. We tested this strategy on two hypothetical datasets using three different surrogate models, giving us an estimate on the range of performances we might see in the future. In some cases, the results show up to a 16-fold reduction in the number of DFT queries compared to random sampling. The degree of speed-up depends on the quality of the ML model used, the homogeneity of the search space, and the hyperparameters used to define convergence of the active classification. Speed-up estimates on more realistic use cases show a more conservative seven-fold reduction in the number of DFT queries. Finally, we provide a set of recommended diagnostic metrics to use during active classification (Fig. 6): $\Delta F^$ and the ML model’s residuals, uncertainty estimates, and calibration.

Our results elucidated a number of qualitative behaviors of active classification. First, we observed that higher-quality ML models yielded better initial performance of the classification process. Conversely, we observed that higher-quality sampling strategies yielded better rates of improvement over time. We also observed that our latest ML model (CFGP) yielded performance closer to a naive, ignorant model than to a perfect, omniscient model. This suggests that there is a relatively large amount of potential improvement left in the ML modeling space. Next, we observed that better sampling strategies (as quantified by the *F*1 score) led to lower rates of change in classes (as quantified by $\Delta F^$), suggesting that $\Delta F^$ may be an indicator of sampling strategy performance. Conversely, we observed that slow-learning ML models may also reduce $\Delta F^$. This phenomenon could be counteracted by using more conservative convergence criteria. All these details were observed in specific and synthetic use cases though. The behaviors seen here may not be observed in situations where search spaces and/or ML models differ.

We encourage readers to focus on the main goals of this work: (1) converting atomic-scale simulations and ML models into actionable decisions for experimentalists and (2) relaxing the active discovery process from an optimization/regression problem to a classification problem. The ability to convert computational results into experimental recommendations helps us serve the research community better. Simultaneously, relaxing the discovery process to a classification problem helps us prioritize exploration rather than exploitation, which is more appropriate for early-stage discovery projects.

We also recognize several future directions that may stem from this research. Future work might include incorporation of DFT-calculated surface stability by performing weighted averaging of surface activities when calculating bulk activities. Future work may also include cost-weighted sampling such that less computationally intensive calculations are chosen more frequently than more intensive ones, which may improve discovery rates in real-time. Perhaps most importantly, future work should incorporate some ability to feed experimental data and information to computational sampling strategies—e.g., multi-fidelity modeling.

## AUTHORS’ CONTRIBUTIONS

Kevin Tran and Willie Neiswanger contributed equally to this work.

## ACKNOWLEDGMENTS

This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. K.B. acknowledges the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Data Science for Knowledge Discovery for Chemical and Materials Research program under Award No. DESC0020392. W.N. was supported by the U.S. Department of Energy Office of Science under Contract No. DE-AC02-76SF00515.

## DATA AVAILABILITY

The data that support the findings of this study are openly available at https://github.com/ulissigroup/catalyst-acquisitions/.

## REFERENCES

_{2}reduction and H

_{2}evolution

_{2}electrocatalysts using active machine learning

_{2}reduction: The quest for electrocatalytic materials

_{2}electroreduction at low overpotentials

_{2}to highly reduced products at low overpotentials