Designing functional molecules and advanced materials requires complex design choices: tuning continuous process parameters such as temperatures or flow rates, while simultaneously selecting catalysts or solvents. To date, the development of data-driven experiment planning strategies for autonomous experimentation has largely focused on continuous process parameters, despite the pressing need for efficient strategies for the selection of categorical variables. Here, we introduce Gryffin, a general-purpose optimization framework for the autonomous selection of categorical variables driven by expert knowledge. Gryffin augments Bayesian optimization based on kernel density estimation with smooth approximations to categorical distributions. Leveraging domain knowledge in the form of physicochemical descriptors, Gryffin can significantly accelerate the search for promising molecules and materials. Gryffin can further highlight relevant correlations between the provided descriptors to inspire physical insights and foster scientific intuition. In addition to comprehensive benchmarks, we demonstrate the capabilities and performance of Gryffin on three examples in materials science and chemistry: (i) the discovery of non-fullerene acceptors for organic solar cells, (ii) the design of hybrid organic–inorganic perovskites for light-harvesting, and (iii) the identification of ligands and process parameters for Suzuki–Miyaura reactions. Our results suggest that Gryffin, in its simplest form, is competitive with state-of-the-art categorical optimization algorithms. However, when leveraging domain knowledge provided via descriptors, Gryffin outperforms other approaches while simultaneously refining this domain knowledge to promote scientific understanding.
I. INTRODUCTION
The discovery of functional molecules and advanced materials is recognized as one of the fundamental obstacles to the development of emerging and future technologies which are needed to face immediate challenges in clean energy, sustainability, and global health.1,2 To date, scientific discovery across chemistry, materials science, and biology has been accelerated by combinatorial high-throughput strategies and automated experimentation equipment.3–7 Despite remarkable successes with high-throughput approaches,8–12 the combinatorial explosion of molecular and materials candidates renders exhaustive evaluations impossible. This limitation can be alleviated by the use of adaptive strategies, which selectively explore the search space and only evaluate the most promising materials candidates.13 Autonomous platforms combine such adaptive, data-driven strategies with automated experimentation systems and might constitute a next-generation approach to the advancement of scientific discovery.14–17
Recently, data-driven experiment planning has experienced increased attention by the scientific community. Eminent examples include the search for antimicrobial peptides,18 the synthesis of organic molecules,19,20 the development of a process for the synthesis of short polymer fibers,21 the discovery and crystallization of polyoxometalates,22 the discovery of metallic glasses,23 the optimization of carbon dioxide–assisted nanoparticle deposition,24 the optimization of Mitsunobu and deoxyfluorination reactions,25 and the creation of Bose–Einstein condensates.26 Motivated by the successes of data-driven experiment planning, the development and deployment of autonomous workflows for scientific discovery, as well as their benefits over conventional experimentation strategies, are being actively explored.27 For example, autonomous platforms have been applied to the optimization of reaction conditions,28–31 the unsupervised growth of carbon nanotubes,32,33 autonomous synchrotron x-ray characterization,34,35 the discovery of thin-film materials,36 the synthesis of inorganic photoluminescent quantum dots,37 and the discovery of photostable quaternary polymer blends for organic photovoltaics.38
Although autonomous experimentation platforms and data-driven experiment planning strategies are on the rise, the previously mentioned examples mostly targeted optimization tasks involving continuous process parameters. Yet, scientific discovery in chemistry and materials science typically requires the simultaneous optimization of continuous and categorical variables, such as the selection of a catalyst or solvent, which cannot be targeted efficiently with continuous optimization methods. In fact, the iterative optimization of a molecule itself may be framed as a categorical optimization task. Approaches for the data-driven selection of categorical parameters are often handcrafted and demand human decisions, which can adversely affect the experimentation throughput and its efficiency.25 Examples of experimentation workflows involving the selection of categorical variables with partial human interaction have been demonstrated in the context of reaction optimization.39–41 More recently, strategies for the data-driven optimization of categorical spaces have started being proposed.25,42 Yet, the limited availability of general-purpose approaches for the efficient optimization of categorical variables in chemistry and materials science is a challenge to autonomous discovery workflows and a major obstacle to the successful deployment of autonomous experimentation platforms.
The machine learning community has been exploring algorithmic approaches to the data-driven selection of categorical variables in the context of hyperparameter optimization43 and robotics.44 Yet, these applications are different from optimization tasks in chemistry and materials science, where categorical variables can usually be characterized by a notion of similarity between individual choices. In fact, the concept of molecular, or chemical, similarity is well established in chemistry and is central to cheminformatics.45 For example, co-polymers for hydrogen production can be synthesized from monomeric units that may be clustered based on their reactivities.46 More generally, a quantitative measure of similarity between molecules and materials can be defined based on their physical, chemical, and structural properties. An experiment planning strategy that actively leverages physical, chemical, or structural descriptors is expected to accelerate scientific discovery, thanks to the exploitation of expert domain knowledge. Furthermore, the ability of these descriptors to predict target properties of interest would allow one to gain scientific insights from optimization campaigns, thus potentially inspiring the design of additional molecules and materials not initially considered in the chemical search space.
In this work, we introduce Gryffin, a global optimization strategy for the selection of categorical variables in chemistry and materials science, which can be readily integrated into autonomous workflows. Gryffin implements a Bayesian optimization framework leveraging kernel density estimation directly on the categorical space, which can be accelerated with domain knowledge in the form of physicochemical descriptors by locally redefining the metric on the categorical space. In addition to accelerating the identification of optimal categorical options via the use of descriptors, Gryffin can construct more informative ones on-the-fly. This feature can highlight the relevance of the provided descriptors, enabling the interpretation of their role in determining the target properties of interest. Gryffin's approach to categorical optimization differs from how Bayesian optimization has been adopted in automatic chemical design. In the latter, categorical input spaces are mapped onto continuous ones with variational autoencoders and using large datasets.47–49 Bayesian optimization is then used to search the continuous space for promising molecules or materials. Gryffin is a Bayesian optimization approach that directly searches the categorical space instead, without having to learn a map between categorical and continuous spaces.
We highlight the applicability and performance of Gryffin on a set of synthetic benchmark functions and three real-world tasks: the discovery of small molecule non-fullerene acceptors for organic solar cells, the discovery of hybrid organic–inorganic perovskites for light-harvesting, and the combined selection of ligands and process parameters for the optimization of Suzuki–Miyaura coupling reactions. We identify four key advantages of Gryffin: it (i) provides a framework for categorical optimization that is competitive with state-of-the-art algorithms in its naïve formulation, (ii) outperforms the state-of-the-art when taking advantage of expert knowledge in the form of descriptors, (iii) can inspire scientific insights, and (iv) can be readily integrated with continuous optimization strategies to enable the robust and efficient optimization of mixed continuous–categorical domains in both sequential and batched workflows.
II. BACKGROUND AND RELATED WORK
Experiment planning can be formulated as an optimization task, where we consider a set of controllable parameters z within a defined domain, z ∈ Z, and an experimental response, f(z), for each of the parameter choices. In the context of reaction optimization, the controllable parameters could, for example, include the reaction temperature, the amount of solvent, or the choice of catalyst, while the experimental response could be the reaction yield, or the rate at which the product is formed. The optimization domain Z is also referred to as the design space or the candidate space. The optimization task in experiment planning consists of the identification of specific parameter values, z* ∈ Z, which yield the desired experimental outcome, f(z*). For simplicity, we will consider minimization tasks from here on, i.e., we formulate f such that z* = argmin_{z ∈ Z} f(z) corresponds to the desired experimental result. The optimization task can be approached with a closed-loop strategy, which iteratively evaluates a set of options z_k and records associated responses, f(z_k), to gradually collect a set of observations, D = {(z_k, f(z_k))}, as feedback to the experiment planning strategy.
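As a minimal illustration of this closed-loop formulation, the following sketch iterates a planner over a toy mixed continuous–categorical domain. The objective function, the parameter grids, and the random planner are hypothetical placeholders (a data-driven strategy such as Gryffin would replace `plan_next`):

```python
import random

random.seed(42)

# Hypothetical parameter domain Z: a discretized continuous parameter
# (temperature) and a categorical parameter (catalyst).
TEMPERATURES = [25, 50, 75, 100]
CATALYSTS = ["Pd", "Ni", "Cu"]

def objective(temperature, catalyst):
    # Toy stand-in for an experimental response f(z); lower is better.
    penalty = {"Pd": 0.0, "Ni": 1.0, "Cu": 2.0}[catalyst]
    return abs(temperature - 75) / 25 + penalty

def plan_next(observations):
    # Placeholder planner: random search. A Bayesian optimizer would
    # condition this choice on the collected observations instead.
    return random.choice(TEMPERATURES), random.choice(CATALYSTS)

observations = []                       # the data set D of (z, f(z)) pairs
best = (None, float("inf"))
for _ in range(20):
    z = plan_next(observations)         # select the next option z_k
    response = objective(*z)            # record the response f(z_k)
    observations.append((z, response))
    if response < best[1]:
        best = (z, response)
```

In a real campaign, `objective` is the experiment itself, and each loop iteration corresponds to one round of feedback to the experiment planner.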
The optimization of categorical parameters poses additional challenges compared to the optimization of continuous or discrete variables due to the lack of a natural ordering between individual parameter values, which is illustrated in the supplementary material (see Sec. S.2). Regardless of the optimization strategy, the confidence of having identified the best-performing candidate in the search space increases with the number of evaluated candidates. In the best case, only one evaluation is required, while in the worst case, all candidates in the search space need to be evaluated. Yet, the choice of the optimization strategy modulates the chance of having identified the best-performing candidate after a certain number of evaluations and consequently, the average fraction of the candidate space that needs to be evaluated to identify the most desired one.
Straightforward search strategies rely on exhaustive random50–52 or systematic53–55 evaluations of all candidates without leveraging any feedback from collected responses to refine the search policy. In the absence of accurate prior expectations on the performance of individual candidates, both random and systematic search strategies require the evaluation of 50% of all candidates, on average, to identify the best performing candidate and are therefore only applicable to relatively small search spaces. Yet, exhaustive strategies are massively parallelizable and thus well suited for high-throughput experimentation. Genetic algorithms and evolutionary strategies56–58 extend the idea of a random exploration of the search space. While they typically start with random selection of candidate solutions, they then condition exploration policies based on a population of candidate solutions whose fitness has already been evaluated. Poorly performing candidates in the population are dropped, while new ones are selected based on local perturbations of the best performing candidates. Thus, the set of new candidate solutions to be evaluated is constantly updated based on a local, rather than global, search strategy.59
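The 50% average for uninformed search can be verified with a quick simulation: if the unique best candidate occupies a uniformly random position in the evaluation order, the expected fraction of the space evaluated before finding it is (n + 1)/2n ≈ 0.5. The candidate labels below are arbitrary:

```python
import random

random.seed(0)

def fraction_to_find_best(n_candidates):
    # Shuffle the candidates and report the fraction evaluated when the
    # (unique) best one, labeled 0 here, is first encountered.
    order = list(range(n_candidates))
    random.shuffle(order)
    return (order.index(0) + 1) / n_candidates

trials = [fraction_to_find_best(100) for _ in range(20000)]
mean_fraction = sum(trials) / len(trials)   # expected to be close to 0.505
```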
Bayesian optimization60,61 has recently gained increasing attention as a competitive global optimization strategy across various fields,62–64 including automatic machine learning,65–67 and experimental design.68–70 The common framework of Bayesian optimization strategies follows two basic steps: (i) the construction of a surrogate to the unknown response surface from a probabilistic model based on collected measurements, and (ii) the selection of new candidates with an acquisition function that balances the expected performance of each candidate and the uncertainty of this estimate. Various machine learning models have been suggested to construct the surrogate, including Gaussian processes,71 random forests,72 and Bayesian neural networks,73 and different acquisition functions, such as probability of improvement,61 expected improvement,74 upper (lower) confidence bound,75 predictive entropy,76 and kernel density–based formulations77 are commonly employed.
Extensions of Bayesian optimization frameworks to categorical parameter domains are under active development. One approach consists of the representation of categorical parameters as one-hot encoded vectors.43,78,79 This representation expresses the j-th option of a categorical variable with n different options as an n-dimensional vector e_j with elements (e_j)_i = δ_ij for i = 1, …, n, which can be interpreted as the corners of an n-dimensional simplex [see Fig. 1(a)]. Standard Bayesian optimization strategies for continuous parameter domains can be deployed on these one-hot encoded categorical variables, such that even optimizations of mixed continuous-categorical domains are possible. However, the most promising choices for future evaluations are determined by projecting promising candidates from the continuous space to the one-hot boundaries. This strategy presents two limitations. First, redundancies in the projection arise from the fact that the continuous optimization space contains an additional degree of freedom compared to the categorical domain. These redundancies can be reduced by imposing constraints on the acquisition function. For example, the acquisition function can be modified such that covariances are computed after the projection operation.80,81 This modification results in a stepwise acquisition function from which choices for future evaluations can be suggested directly. However, stepwise functions are generally more challenging to optimize than smooth ones. A second limitation arises because one-hot encoding imposes an equal measure of covariance between all choices of the categorical variables. As a consequence, all categorical options are considered equally similar to each other, and one cannot account for imbalanced similarities. Mixed categorical-continuous Bayesian optimization displays similar challenges and is the subject of current research, too.82,83
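The one-hot construction and the projection step discussed above can be sketched in a few lines; the solvent names are hypothetical placeholders:

```python
import numpy as np

options = ["toluene", "ethanol", "water"]   # hypothetical categorical options
n = len(options)

# One-hot encoding: the j-th option is the j-th corner of the simplex.
one_hot = np.eye(n)

# A continuous optimizer proposes an interior point of the simplex ...
proposal = np.array([0.2, 0.7, 0.1])
assert np.isclose(proposal.sum(), 1.0)

# ... which must be projected back to a realizable categorical option.
# Keeping only the largest entry discards the extra degree of freedom --
# the source of the projection redundancies discussed above.
chosen = options[int(np.argmax(proposal))]
```

Note that under this encoding every pair of distinct corners is equidistant, which is precisely why one-hot representations cannot express imbalanced similarities between options.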
Illustrations of the naïve and the static Gryffin strategies for the (descriptor-guided) optimization of categorical variables. (a) Illustration of a categorical variable with three options represented on a simplex. Color contours indicate the affiliation of any given point on the simplex to one of the categorical options at the corners. (b) Representation of a continuous descriptor space, where descriptors are associated with the categorical options shown in panel A. Color contours indicate the affiliation with individual options of the categorical variable. (c) Illustration of a generic kernel density on the simplex modeled with a concrete distribution. (d) Illustration of a descriptor-guided transformation of the kernel density shown in panel C based on the descriptors shown in panel B.
Because of the importance of categorical variables in chemistry and materials science, recent advances in Bayesian optimization algorithms have sought to overcome the aforementioned issues to find application in reaction optimization25 and materials design.42 Shields et al.25 encoded categorical variables as continuous descriptor vectors and pruned highly correlated descriptors. Zhang et al.42,84 have proposed a latent-variable Gaussian process approach that maps categorical options to a low-dimensional numerical latent space. In both cases, these continuous representations reflected physicochemical properties of their respective categorical options (e.g., solvent, catalyst, metal ion, etc.), such that a measure of similarity between options could be established. These approaches were shown to be superior to using one-hot encoded vectors to describe the categorical options.
III. FORMULATING Gryffin
Building upon previous work (see Sec. II), we base the formulation of Gryffin on a one-hot encoding of categorical variables. Yet rather than constructing the surrogate on the continuous space spanned by the one-hot encoded categorical choices, where each dimension is bounded by [0, 1], Gryffin aims to support the surrogate on the simplex to avoid projection redundancies. To this end, we extend the recently reported Phoenics approach77 from continuous to categorical domains.
Phoenics uses kernel regression, based on the Nadaraya–Watson estimator,85,86 to build a surrogate model of the objective function. The kernel density estimates used by this model are inferred by a Bayesian neural network (BNN) with an autoencoder-like architecture. The use of this BNN allows us to obtain complex multivariate density estimates that are regularized by the prior distribution of the kernels' precision. Because the BNN is used to estimate kernel densities only, evaluation of the objective function does not require additional BNN evaluations. Phoenics' acquisition function is constructed by adding a single uniform kernel, weighted by a user-defined λ parameter, to the surrogate model. This uniform kernel can be interpreted as the lack of knowledge present in regions of parameter space with low density of observations, following the assumption that the global optimum may be located anywhere in the search domain with equal probability. The λ parameter offers an intuitive way to bias the behavior of the optimizer toward exploration or exploitation. It also provides a simple yet effective means of enabling batch optimization, given that acquisitions with multiple different λ values (each with a different exploration–exploitation trade-off) can be readily obtained. Overall, this framework results in a Bayesian optimization approach that scales linearly with the number of observations and the dimensionality of the search space, and that naturally supports optimization in parallel.77
Similar to Phoenics, in Gryffin, we model categorical parameters as random variables and construct the surrogate from their reweighted kernel density estimates as inferred by a BNN. Beyond the implementation of kernel density–based Bayesian optimization on categorical domains, we further demonstrate how physical and chemical domain knowledge can be used to transform the surrogate to accelerate the search, and how this bias can be refined during the optimization to gain scientific insight.
A. Categorical optimization with naïve Gryffin
Naïve Gryffin constructs kernel densities by extending the one-hot encoding of categorical options to the entire simplex, i.e., we consider points x ∈ S_n = {x ∈ R^n : x_i ≥ 0, Σ_i x_i = 1}. The largest entry of any given point can be used to associate this point to a realizable option [see Fig. 1(a)]. Various probability distributions with support on the simplex have been introduced in the past. The Dirichlet distribution, for example, constitutes the conjugate prior to the categorical distribution.87 Another example is the logistic normal distribution, which ensures that the logit of generated samples follows a standard normal distribution.88 While both of these distributions are widely used, their deployment in a computational graph is numerically involved due to demanding inference and sampling steps. Directed probabilistic models can be implemented at low computational cost if stochastic nodes of such graphs can be reparameterized into deterministic functions of their parameters and stationary noise distributions.89 Such reparameterizations, however, are unknown for the Dirichlet and the logistic normal distribution.
The recently introduced concrete distribution90 (simultaneously introduced as Gumbel–Softmax),91 illustrated in Fig. 1(c), overcomes this obstacle. This distribution is supported on the simplex and parameterized by a set of deterministic variables with noise generated from stationary sources. As such, the concrete distribution is amenable to automatic differentiation frameworks for accelerated sampling and inference. In addition, the concrete distribution contains a temperature parameter, τ, which can be tuned to smoothly interpolate between the discrete categorical distribution and a uniform distribution on the simplex. As such, this temperature parameter controls the localization of constructed kernel densities toward the corners of the simplex.
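The reparameterization underlying the concrete distribution can be sketched in a few lines: each sample is a deterministic function of the logits and the temperature τ, with all randomness drawn from a stationary Gumbel source. The logits below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_concrete(logits, tau, rng):
    """Draw one sample from a concrete (Gumbel-Softmax) distribution.

    The sample is deterministic in `logits` and `tau`; all randomness
    comes from a stationary Gumbel noise source, which is what makes the
    distribution amenable to automatic differentiation frameworks.
    """
    gumbel = -np.log(-np.log(rng.uniform(size=len(logits))))
    y = (logits + gumbel) / tau
    y = y - y.max()                      # shift for numerical stability
    expy = np.exp(y)
    return expy / expy.sum()             # a point on the simplex

logits = np.log(np.array([0.6, 0.3, 0.1]))

# Low temperature: samples localize toward the corners (near one-hot).
cold = sample_concrete(logits, tau=0.05, rng=rng)
# High temperature: samples approach the uniform point of the simplex.
hot = sample_concrete(logits, tau=50.0, rng=rng)
```

Annealing τ from high to low values during an optimization campaign reproduces the gradual transition from a uniform prior to a sharp categorical distribution described in the text.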
Naïve Gryffin estimates kernel densities from concrete distributions and conditions the parameters of the concrete distribution on the sampled candidates, as introduced by the Phoenics framework.77 The temperature parameter is modified based on the number n of collected observations, , such that the priors gradually transition from a uniform distribution to a continuous approximation of the categorical distribution. Options for future evaluation are determined via the acquisition function of Phoenics, which compares the constructed kernel densities to the uniform distribution on the simplex. Using a sampling parameter λ to reweight the uniform distribution, this acquisition function can favor exploration or exploitation explicitly, and natively supports batch optimization. Details of this procedure are provided in the supplementary material (see Sec. S.1.A).
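The role of the λ parameter can be illustrated with a deliberately simplified acquisition over a handful of categorical options. This is a schematic stand-in for the Phoenics acquisition, not its exact functional form, and all numbers below are hypothetical:

```python
import numpy as np

# Hypothetical surrogate values over 4 categorical options (lower is
# better) and the density of past observations at each option.
surrogate = np.array([0.2, 0.6, 0.9, 0.5])
obs_density = np.array([0.6, 0.25, 0.1, 0.05])

def acquisition(lam):
    # Blend the surrogate with a flat, optimistic prior (value 0 here,
    # a simplifying assumption): small lambda trusts the surrogate where
    # observations are dense (exploitation), large lambda suppresses it,
    # favoring sparsely observed options (exploration).
    return surrogate * obs_density / (obs_density + lam)

exploit = int(np.argmin(acquisition(lam=1e-3)))   # best surrogate value
explore = int(np.argmin(acquisition(lam=10.0)))   # least observed option
```

With a small λ the acquisition selects option 0, whose surrogate value is lowest; with a large λ it selects option 3, which has barely been observed, mirroring the explicit exploration–exploitation control described above.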
Approximating the computation of kernel densities can further reduce Gryffin's overall computational cost. The approximation is based on the idea that low-density regions of the kernel densities indicate a lack of information, such that precise estimates in these regions are not required. We find that approximating the kernel densities in low-density regions significantly reduces Gryffin's computational cost without degrading optimization performance. Details on this approximation are provided in the supplementary material (see Sec. S.3.A).
B. Descriptor-guided searches with static Gryffin
The naïve Gryffin approach imposes an equal measure of covariance between individual options of categorical variables, which is undesired in cases where a notion of similarity can be established between any two options. Especially in the context of scientific discovery, similarities between the options of categorical variables can be defined, for example, via physicochemical descriptors for small molecules or material candidates. We extend the naïve approach by assuming that the metric to measure similarity between any two options is based on the Euclidean distance between real-valued d-dimensional descriptor vectors, d_j ∈ R^d, which are uniquely associated with individual categorical options [see Figs. 1(a) and 1(b)].
While the descriptors are embedded in a continuous space, their arrangement in this space is unknown for a generic optimization task and only selected points in the descriptor space can be associated with realizable categorical options. These limitations present major obstacles to optimization strategies that operate directly on the descriptor space. Instead, we propose to leverage the naïve Gryffin framework but redefine the metric on the simplex based on the provided descriptors. Following this strategy, the length of an infinitesimal line element on the simplex is conditioned not only on the corresponding infinitesimal change of location on the simplex, but also on the infinitesimal change of the associated descriptors. Imposing a linear mapping x ↦ D^T x between points x on the simplex and the descriptor space, where the rows of D are the descriptor vectors d_j, we can compute the length of a finite line element, Δx, following the redefined metric to be

ℓ(Δx)^2 = ||Δx||^2 + ||D^T Δx||^2 = Δx^T (1 + D D^T) Δx,   (1)
which is derived in detail in the supplementary material (see Sec. S.1.B). Kernel densities generated by the naïve approach can be transformed following this descriptor-based definition of distances on the simplex to reflect the similarity between individual options as illustrated in Figs. 1(a) and 1(b). As a consequence, the evaluation of one option of the categorical variable will be more informative with respect to the expected performance of other, similar options. Further implications of imposing a descriptor-guided metric on the simplex are illustrated in the supplementary material (see Sec. S.1.B).
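A sketch of the descriptor-guided distance, assuming the squared length of a line element decomposes into a simplex term and a descriptor term as described above (the descriptor values are hypothetical, and this is an illustrative form rather than necessarily Gryffin's exact expression):

```python
import numpy as np

# Three categorical options (corners of the simplex) with hypothetical
# one-dimensional descriptors: options 0 and 1 are similar, option 2 is not.
descriptors = np.array([[0.10], [0.15], [0.90]])   # row j holds d_j

corners = np.eye(3)                                # one-hot corners

def line_element(x_a, x_b, D):
    """Length of a finite line element under a descriptor-guided metric.

    Squared length = (simplex displacement)^2 + (descriptor displacement)^2,
    with the linear map x -> D^T x sending simplex points to descriptor space.
    """
    dx = x_b - x_a
    dd = D.T @ dx
    return float(np.sqrt(dx @ dx + dd @ dd))

d01 = line_element(corners[0], corners[1], descriptors)  # similar options
d02 = line_element(corners[0], corners[2], descriptors)  # dissimilar options
```

Under the plain one-hot metric all corner pairs are equidistant; with the descriptor term included, similar options end up closer together, so evaluating one option becomes informative about its neighbors.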
We refer to the descriptor-guided categorical optimization as static Gryffin, as provided descriptors are used without further transformations. The benefits of static Gryffin over naïve Gryffin with respect to an accelerated search will depend on the provided descriptors. More informative descriptors are expected to guide the algorithm toward well-performing options more efficiently than less informative descriptors. This effect is investigated empirically in Sec. IV.
C. Descriptor refinement with dynamic Gryffin
The dynamic formulation of Gryffin aims to alleviate the expected sensitivity of static Gryffin's performance to the choice of provided descriptors by transforming them during the optimization. Specifically, dynamic Gryffin infers a transformation, T, which constructs a new set of descriptors, , from the provided descriptors, , based on the feedback collected from evaluated options. The transformation T can be constructed to target two major goals: (i) the generation of more informative descriptors, , which help to navigate the candidate space more efficiently, and (ii) the interpretable identification of relevant domain knowledge to inspire design choices and scientific insights, as we will demonstrate in Sec. V. In addition to these two goals, the transformation T is required to be robust with respect to overfitting due to the low data scenarios which are commonly encountered in autonomous workflows.
In an attempt to balance flexibility, robustness, and interpretability, we suggest constructing this transformation T from a learnable combination of the provided descriptors:

T(d_j) = tanh(W d_j + b),   (2)
where W and b are the learnable parameters inferred from the feedback collected in previous evaluations. The class of transformations described by Eq. (2) includes slightly non-linear translations and rotations of the provided descriptors. While more complex transformations accounting for higher-order interactions between individual descriptors could potentially yield even more informative descriptors, a slightly non-linear transformation is inherently robust to overfitting,92 and is more amenable to intuitive interpretation than more complex models.93 We will demonstrate empirically in Secs. IV and V that this class of transformations is well suited for a variety of categorical optimization tasks.
Following a stochastic gradient optimization, the parameters W and b in Eq. (2) are adjusted to (i) increase the correlation between the newly generated descriptors and the associated measurements, (ii) reduce correlations between newly generated descriptors, and (iii) remove redundant descriptors with poor correlations with the measurements or high correlations with other newly generated descriptors. These three goals are modeled as penalties which are to be minimized at training time (see supplementary material, Sec. S.1.C for details).
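The penalty-driven refinement can be sketched as follows, with a crude random search standing in for the stochastic gradient optimization, a tanh non-linearity as one illustrative choice of slightly non-linear transformation, and synthetic data throughout; the redundancy-removal penalty (iii) is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic provided descriptors (8 options, 3 descriptors) and the
# measured responses of each option.
D = rng.normal(size=(8, 3))
y = 2.0 * D[:, 0] - 0.5 * D[:, 1] + 0.05 * rng.normal(size=8)

def transform(D, W, b):
    # Slightly non-linear translation/rotation of the provided descriptors.
    return np.tanh(D @ W + b)

def penalty(W, b):
    Dt = transform(D, W, b)
    # (i) Reward correlation between new descriptors and measurements.
    corr_y = np.array([abs(np.corrcoef(Dt[:, k], y)[0, 1]) for k in range(3)])
    # (ii) Penalize correlations among the new descriptors themselves.
    C = np.corrcoef(Dt.T)
    off_diag = np.abs(C - np.diag(np.diag(C))).sum()
    return -corr_y.sum() + 0.1 * off_diag

# Random search over (W, b), a crude stand-in for stochastic gradients.
best_W, best_b = np.eye(3), np.zeros(3)
best_p = penalty(best_W, best_b)
for _ in range(200):
    W = best_W + 0.1 * rng.normal(size=(3, 3))
    b = best_b + 0.1 * rng.normal(size=3)
    p = penalty(W, b)
    if p < best_p:
        best_W, best_b, best_p = W, b, p
```

Because the accepted parameters can only lower the penalty, the refined descriptors correlate at least as well with the measurements, relative to their mutual redundancy, as the provided ones.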
IV. SYNTHETIC BENCHMARKS
We empirically assess the performance of all proposed variants of Gryffin on a set of synthetic benchmark surfaces, which are detailed in the supplementary material (see Sec. S.2). Four of the surfaces constitute categorized adaptations of established functions commonly used to benchmark global and local optimization strategies on continuous parameter domains. In addition, we include three partially and fully randomized surfaces with responses sampled from stationary probability distributions. While the ordering of the categorical options is arbitrary, we introduce a reference ordering to illustrate the surfaces (see supplementary material, Sec. S.2, for details). Unless noted otherwise, descriptors for the categorical options are generated such that they encode the reference ordering. Implementations of all benchmark surfaces are made available on GitHub.94
Gryffin is compared to a set of qualitatively different optimization strategies that are implemented in publicly available libraries: genetic optimization available through PyEvolve,56,57,95 Bayesian optimization with random forests as implemented in SMAC,96–98 Bayesian optimization with Gaussian processes via GPyOpt,79,99–101 and Bayesian optimization with tree-structured Parzen estimators introduced in Hyperopt.102,103 In addition, we run random explorations of the candidate space as a baseline. We compare the performance of the different formulations of Gryffin to the other optimization strategies on all benchmark surfaces, probe the influence of the number of descriptors, study the scaling of Gryffin with the number of options per categorical variable and the number of categorical variables, and investigate the benefits of dynamic Gryffin over static Gryffin. For all comparisons, we measure the fraction of the candidate space that a given optimization strategy explored to locate the best candidate. Unless noted otherwise, all comparisons are averaged over 200 independent executions of each strategy.
A. Optimization performance
In a first test, we compare the optimization strategies on two-dimensional formulations of the synthetic benchmark surfaces with 21 options per dimension, as illustrated in Fig. 2. The Dejong surface generalizes the convex parabola from continuous to categorical spaces, such that we consider it as pseudo-convex. In contrast, the Ackley surface is generalized from the Ackley path function which is non-convex on the parameter domain. The Camel surface presents a degenerate global optimum as there are two different combinations of options which yield the same optimal response. Finally, in the Noise surface, the response increases linearly with the index of the option along each dimension in the reference ordering (as in Slope, see S.2), but uniform noise is added to the response to yield a substantial local variance. The fractions of the search spaces that the optimizers explored to locate their optima, averaged over 200 independent executions, are illustrated in Fig. 2. Full optimization traces are reported in the supplementary material (see Sec. S.3.B).
Performance of various optimization strategies on selected synthetic surfaces. Individual panels indicate the fraction of the candidate space each algorithm explored before finding the global minimum of the surfaces illustrated on the top right (low values are shown in yellow, high values in green), averaged over 200 independent executions. Best performing algorithms are indicated by a star. Color codes for the optimization strategies are shown in the top panel and apply to all panels in this figure.
We observe that random exploration requires the evaluation of approximately half the space for the surfaces with well-defined global optima, and about a third of the space for the Camel function with its degenerate optima. The performances of PyEvolve, SMAC, and Hyperopt are roughly comparable across the different surfaces, although PyEvolve tends to outperform SMAC and Hyperopt on the noiseless surfaces. GPyOpt generally locates global optima faster than these strategies but is slightly slower than naïve Gryffin on the non-convex surfaces. The faster optimization of convex surfaces by Gaussian process–based Bayesian optimization, compared to kernel density–based approaches, has already been observed and discussed for continuous domains.77 Notably, the static and dynamic formulations of Gryffin significantly outperform the other optimization strategies, reducing the explored space severalfold. This observation confirms that providing real-valued descriptors can substantially accelerate the search. We also observe similar performances of the static and dynamic formulations of Gryffin on the deterministic surfaces (Ackley, Dejong, Camel), while dynamic Gryffin optimizes the noisy surface at a faster rate. This observation suggests that dynamic Gryffin is indeed capable of learning a more informative set of descriptors.
B. Scaling to more options and higher dimensions
We further study the performance of Gryffin on larger candidate spaces with (i) more categorical variables, and (ii) more options per categorical variable. Increasing the number of variables or the number of options per variable generally increases the number of candidates in the space and is thus expected to require more candidate evaluations overall before the best candidate is identified. Results obtained for these benchmarks are detailed in the supplementary material (see Secs. S.3.C and S.3.D). The benchmarks suggest that Gryffin indeed uses more candidate evaluations to locate the global optimum with an increasing volume of the search space, consistently across all benchmark surfaces. However, although the number of evaluations increases, the fraction of the explored space generally decreases. More specifically, across the different surfaces, we identify a polynomial decay of the explored space with an increasing number of options per variable, with decay exponents ranging from –1.0 to –1.25, and an exponential decay for an increase in the number of parameters with decay coefficients ranging from –1.6 to –2.0 (more details in the supplementary material, Secs. S.3.C and S.3.D). Based on this observation, we conclude that Gryffin may show an onset of the curse of dimensionality104 only for a relatively large number of dimensions and thus constitutes an optimization strategy that can efficiently navigate large categorical spaces.
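The decay exponents quoted above can be extracted with a least-squares fit in log–log space. The sketch below (helper names are ours) recovers a known power-law exponent from synthetic fraction-explored data, illustrating the fitting procedure.

```python
import math

def fit_decay_exponent(n_options, fractions):
    """Least-squares slope of log(fraction explored) vs log(number of
    options): the exponent k of an assumed power-law decay f ~ n**k."""
    xs = [math.log(n) for n in n_options]
    ys = [math.log(f) for f in fractions]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# synthetic check: fractions decaying as n**-1.25 recover that exponent
ns = [11, 21, 31, 41, 51]
fracs = [n ** -1.25 for n in ns]
k = fit_decay_exponent(ns, fracs)
```

For the exponential decay in the number of parameters, the same slope estimator would be applied to log(fraction) against the (linear) number of parameters instead of its logarithm.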
C. Data-driven refinement of descriptors
The effectiveness of transforming the provided descriptors to accelerate the search for the best candidate is studied in detail on the Slope surface with 51 options per dimension, resulting in 2601 different candidates (see Fig. 3). For this benchmark, we randomly assign descriptors to each of the categorical options at a targeted correlation between the descriptors and the responses of the associated options. With decreasing correlation, the local variance increases, resulting in a less structured space that is more challenging to navigate. We therefore generally expect the performance of both static and dynamic Gryffin to degrade with decreasing correlation.
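One common way to generate a random descriptor with a targeted Pearson correlation to the responses is the Gaussian mixing trick sketched below. This is our illustration of the idea, not necessarily the exact generation procedure used for the benchmark.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation coefficient between two sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def correlated_descriptor(responses, rho, seed=0):
    """One random descriptor value per option, mixed so that its Pearson
    correlation with the responses targets rho."""
    rng = random.Random(seed)
    n = len(responses)
    mu = sum(responses) / n
    sd = math.sqrt(sum((r - mu) ** 2 for r in responses) / n)
    z = [(r - mu) / sd for r in responses]  # standardized responses
    noise = [rng.gauss(0.0, 1.0) for _ in responses]
    return [rho * zi + math.sqrt(1.0 - rho ** 2) * ni
            for zi, ni in zip(z, noise)]

# 51 options per dimension -> 2601 candidates, as in the Slope benchmark
responses = [float(i) for i in range(2601)]
descriptor = correlated_descriptor(responses, rho=0.8)
achieved = pearson(descriptor, responses)
```

Because the noise term is independent of the standardized responses, the expected correlation of the mixture is exactly rho; the sample correlation fluctuates around it with a spread that shrinks as the candidate space grows.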
Behavior of naïve, static, and dynamic Gryffin on the Slope surface for different descriptors. Descriptors have been generated randomly with a targeted correlation between the descriptors and the response of the associated options. The correlation between descriptors and response is shown on the x axis. The y axis represents the fraction of the search space explored before the identification of the optimum. Kernel density estimates show the distribution of explored space fractions across 200 repeated simulations, while square markers represent the mean. The performance of naïve Gryffin is independent of the descriptor correlation, as expected. Static and dynamic Gryffin always showed equal or superior performance to the naïve formulation. Static Gryffin outperformed the naïve formulation for correlations equal to or above 0.8. Dynamic Gryffin showed superior performance to the naïve formulation for any correlation equal to or above 0.1, suggesting that dynamic Gryffin is able to take advantage of even poorly informative descriptors to improve the optimization performance.
Figure 3 illustrates the fractions of the candidate space explored by static and dynamic Gryffin for different targeted correlations between the supplied descriptors and the responses. For comparison, we also report the performance of the naïve formulation of Gryffin, which is independent of the supplied descriptors. We observe a significant increase in the fraction of the explored space with decreasing correlation for both static and dynamic Gryffin. Although both methods require more candidate evaluations with less informative descriptors, their performance never degrades beyond the performance of the naïve formulation, indicating that even entirely uninformative descriptors do not delay the search for the best candidate compared to descriptor-less scenarios. Note that, for the purpose of redefining the metric on the simplex, negatively correlated descriptors are as informative as positively correlated ones, both for static and dynamic Gryffin.
We further find that static Gryffin can benefit from descriptors and significantly outperform the naïve approach if the Pearson correlation coefficient between descriptors and responses is at least 0.8. Below this value, the average performance of static and naïve Gryffin is comparable, although the variance on the performance is higher for the static formulation. Similar to static Gryffin, learning a more informative set of descriptors with dynamic Gryffin accelerates the search more if the correlation between the descriptors and the responses is high. However, the dynamic formulation is generally at least as fast as the static formulation and can successfully leverage descriptors to outperform descriptor-less searches even at correlations as low as 0.1. We thus confirm that the descriptor transformation introduced in Eq. (2) is sufficiently robust to be applied to low-data tasks and conclude that deploying dynamic Gryffin can be beneficial for some descriptor-guided optimization tasks without delaying the optimization compared to static Gryffin.
V. APPLICABILITY OF Gryffin TO CHEMISTRY AND MATERIALS SCIENCE
Following the empirical benchmarks of Gryffin, we now demonstrate its applicability and practical relevance to a set of optimization tasks across materials science and chemistry. Specifically, we target the discovery of non-fullerene acceptors for organic solar cells, the design of organic–inorganic perovskites for light-harvesting, and the selection of phosphine ligands and optimization of process conditions for Suzuki–Miyaura coupling reactions.
Obtaining statistically significant performance comparisons at a sufficient level of confidence requires repeated optimization runs to average out the influence of initial conditions and of probabilistic elements of the optimization strategies. As repeated optimization runs for these applications are highly resource-demanding, we construct the optimization tasks from recently reported datasets: the applications on the discovery of non-fullerene acceptors and perovskites are based on lookup tables, and the optimization of Suzuki reactions is facilitated via a probabilistic model trained on experimental data (a virtual robot) to emulate experimental uncertainties in addition to the average response. Virtual robots have recently been introduced to benchmark algorithms for autonomous experimentation.105,106
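A minimal sketch of how such benchmark oracles might be organized is given below. The class names and interfaces are illustrative assumptions on our part, not part of Gryffin or the cited virtual-robot implementations.

```python
import random

class LookupOracle:
    """Benchmark oracle backed by a precomputed table, as used for the
    acceptor and perovskite tasks."""
    def __init__(self, table):
        self.table = dict(table)
        self.evaluations = 0

    def __call__(self, candidate):
        self.evaluations += 1
        return self.table[candidate]

class VirtualRobot:
    """Emulates experimental uncertainty by adding noise around a trained
    surrogate's mean prediction (here, Gaussian noise for simplicity)."""
    def __init__(self, mean_model, noise_model, seed=0):
        self.mean_model = mean_model
        self.noise_model = noise_model
        self.rng = random.Random(seed)

    def __call__(self, params):
        return self.rng.gauss(self.mean_model(params),
                              self.noise_model(params))

oracle = LookupOracle({"candidate_A": 1.0, "candidate_B": 0.5})
robot = VirtualRobot(lambda p: 2.0, lambda p: 0.0)
```

Counting evaluations inside the oracle is what makes the fraction-of-space-explored metric trivial to record for every strategy under test.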
The selection of physicochemical descriptors associated with each option of the categorical variables was based on chemical intuition as well as their accessibility. The most informative set of descriptors for a specific task might be a priori unknown. However, a domain expert might have an intuition for which descriptors might be informative for the task at hand. In addition, given that descriptors need to be provided for all candidate solutions in the search space, it is convenient for these to be easily accessible. As such, in the following sections, descriptors are selected based on their expected suitability to each task as well as their availability.
A. Discovery of non-fullerene acceptor candidates for organic photovoltaics
Small organic molecules currently constitute the highest performing acceptor materials for organic solar cells.107,108 The large number of degrees of freedom when designing such non-fullerene acceptors, arising from complex aromatic moieties, allows us to fine-tune their relevant electronic properties, for example, the optical gap and the energy level alignment between the donor and acceptor materials. While this large design space provides the required flexibility to fine-tune desired molecular properties, it is also challenging to navigate and constitutes a major obstacle to the discovery of novel candidate molecules.
We demonstrate the applicability of Gryffin for the discovery of non-fullerene acceptors on a candidate space of 4216 different small organic molecules, which form a subset of a recently reported comprehensive study.109 Acceptor candidates in this library are constructed from a set of molecular fragments that are separated into three fragment pools [see Fig. 4(a)]. Each candidate is composed of one core fragment C (8 options), two spacer fragments S (31 options), and two terminal fragments T (17 options) following a symmetric design. Details on the library of candidate fragments are reported in the supplementary material (see Sec. S.4.A). The performance of each acceptor candidate is quantified based on the power conversion efficiency (PCE), computed from calibrated DFT frontier molecular orbital energies.109 The optimization task aims to identify molecules with the highest possible PCE.
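The symmetric construction of the candidate library can be sketched as follows; the fragment labels are hypothetical, but the pool sizes match those stated above, reproducing the 4216-candidate count.

```python
from itertools import product

# hypothetical fragment labels; only the pool sizes match the study
cores = [f"C{i}" for i in range(8)]       # 8 core fragments
spacers = [f"S{i}" for i in range(31)]    # 31 spacer fragments
terminals = [f"T{i}" for i in range(17)]  # 17 terminal fragments

# symmetric T-S-C-S-T design: one choice per pool fixes both sides
candidates = [(t, s, c, s, t)
              for c, s, t in product(cores, spacers, terminals)]
```

Because the design is symmetric, the two spacer and two terminal positions do not multiply the space; the library size is 8 x 31 x 17 rather than 8 x 31^2 x 17^2.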
Performance of Gryffin on the task of identifying non-fullerene acceptors for maximized power conversion efficiencies. (a) Non-fullerene acceptor candidates are constructed from a set of molecular fragments (core, spacer, and terminal) which are symmetrically arranged to span a library of 4216 different candidate molecules. (b) Fraction of the candidate library to be explored by each of the studied optimization strategies to identify the best performing acceptor candidate. (c) Most informative descriptors to guide the search of dynamic Gryffin. The relevance metric is computed as the normalized weights that dynamic Gryffin learned to associate to each descriptor. The distributions shown are over the 200 repeated optimizations performed. Diamonds indicate the average relevance of each descriptor.
We guide static and dynamic Gryffin with a set of electronic and geometric descriptors for each of the fragments: the HOMO and LUMO energy levels, the dipole moment, the radius of gyration, and the molecular weight. Electronic properties were computed at the B3LYP/Def2SVP level of theory on a Super-FineGrid using Gaussian,110 and the radius of gyration was computed for the ground state conformations of the fragments. The correlations of the descriptors with the PCE of the resulting non-fullerene acceptor are generally low, with the highest encountered Pearson correlation coefficients reaching values of about 0.2 (see supplementary material, Sec. S.4.A, for details). In fact, the identification of improved descriptors for the accurate prediction of PCE in organic solar cells is an active field of research.111–113
Figure 4(b) illustrates the fraction of the candidate library (averaged over 200 independent executions) that each optimization strategy had to explore before identifying the combination of fragments yielding the highest PCE. Full optimization traces for all optimization strategies are reported in the supplementary material (see Sec. S.4.A). In agreement with the synthetic tests (see Sec. IV), we find that PyEvolve explores a smaller fraction (21%) of the space than Hyperopt (27%) or SMAC (41%) before identifying the best candidate molecule. The performance of naïve Gryffin, exploring about 11%, is comparable to GPyOpt and thus significantly faster than the other benchmark strategies. However, the physical descriptors supplied for each of the fragments enable static Gryffin to find the best acceptor candidate after exploring only 8.7% of the candidate space (∼22% reduction of the required acceptor evaluations compared to naïve search). Dynamic Gryffin can refine the supplied descriptors to find the best candidate while exploring only 6.9% of the library (∼38% reduction over naïve search). This improvement of dynamic Gryffin over static Gryffin confirms that the supplied descriptors can be transformed into a more informative set to accelerate the search.
Figure 4(c) illustrates the importance of individual descriptors in guiding the search, as determined by dynamic Gryffin. Specifically, we plot the relative contributions of individual descriptors to the set of the transformed descriptors that were used when the best performing candidate was identified. We observe that the descriptor-based search emphasizes the relevance of electronic descriptors over geometric descriptors consistently across all types of fragments. These results are consistent with established design rules for acceptor materials. Indeed, the Scharber model estimates PCEs qualitatively from the electronic properties of the acceptor material.114,115 The design of non-fullerene acceptor candidates beyond the provided library could therefore be inspired mostly by the electronic properties of the fragments rather than their geometric properties, although more informative descriptors could potentially be constructed with more computational effort.111
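The relevance metric plotted in Fig. 4(c) is described as the normalized weights that dynamic Gryffin learned per descriptor. Our reading of that normalization can be sketched as follows (names and example weights are illustrative):

```python
def descriptor_relevance(learned_weights):
    """Normalize learned descriptor weights to fractions summing to one,
    so relevances are comparable across descriptors."""
    total = sum(abs(w) for w in learned_weights.values())
    return {name: abs(w) / total for name, w in learned_weights.items()}

# hypothetical learned weights for the five fragment descriptors
relevance = descriptor_relevance({
    "HOMO": 0.9, "LUMO": 0.7, "dipole": 0.3,
    "radius_of_gyration": 0.05, "molecular_weight": 0.05,
})
```

Averaging such normalized weights over the 200 repeated optimizations yields the distributions and mean markers shown in the figure.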
B. Discovery of hybrid organic–inorganic perovskites for light-harvesting
Perovskite solar cells constitute another class of light-harvesting materials. They are typically composed of inorganic lead halide matrices and contain inorganic or organic cations [see Fig. 5(a)].116–118 Recently, perovskite solar cells have experienced increased attention as breakthroughs in materials and device architectures boosted their efficiencies and stabilities.119 The discovery of a viable perovskite design involves numerous choices regarding material compositions and process parameters, which poses a challenge to the rapid advancement of this light-harvesting technology. This second demonstration of the applicability of Gryffin focuses on the discovery of hybrid organic–inorganic perovskites (HOIPs) based on a recently reported dataset.120 The HOIP candidates of this dataset are designed from a set of four different halide anions, three different group-IV cations, and 16 different organic cations, resulting in 192 different HOIP compositions. Among other properties, this dataset reports the bandgaps of the HOIP candidates obtained from DFT calculations with GGA and the HSE06 functional. In this application, we aim to minimize the HSE06 bandgaps, which are considered to be more accurate with respect to experiment than the GGA values.120
Results of the benchmarks on hybrid organic–inorganic perovskites. (a) Perovskites are assembled by choosing one of three inorganic cations, one of four inorganic anions, and one of 16 organic cations, resulting in 192 unique designs. (b) Fractions of the candidate library to be explored by each of the studied optimization strategies to identify the perovskite design with the lowest bandgap. (c) Most informative descriptors to guide the optimization for each of the three constituents identified by dynamic Gryffin. The relevance metric is computed as the normalized weights that dynamic Gryffin learned to associate to each descriptor. The distributions shown are over the 200 repeated optimizations performed. Diamonds indicate the average relevance of each descriptor.
The inorganic constituents are characterized by their electron affinity, ionization energy, mass, and electronegativity to guide the searches of the static and dynamic formulations of Gryffin. The organic compounds are described by their HOMO and LUMO energy levels, dipole moment, atomization energy, radius of gyration, and molecular weight. All electronic descriptors were computed at the HSEH1PBE/Def2QZVPP level of theory on a Super-FineGrid with Gaussian,110 and the radii of gyration were calculated for the lowest-energy conformer. Note that, in contrast to the search for viable non-fullerene acceptors, this application presents an optimization task that not only features physically different descriptors between individual categorical variables, but also varying dimensionalities of the descriptors associated with individual categorical variables. However, the correlations between individual descriptors and the expected bandgaps of the assembled HOIP materials are significantly higher compared to the descriptors used for the non-fullerene acceptors (see supplementary material, Sec. S.4.B, for additional details).
The fractions of the candidate space explored by each optimization strategy before locating the HOIP composition with the lowest bandgap are illustrated in Fig. 5(b). More detailed results are reported in the supplementary material (see Sec. S.4.B). Similar to the synthetic benchmarks (see Sec. IV) and the optimization of non-fullerene acceptors (see Sec. V A), we find that all optimization strategies outperform a purely random exploration of the candidate space. Bayesian optimization strategies tend to locate the best-performing HOIP candidate at a faster rate than PyEvolve (∼34%), with GPyOpt evaluating only about 10% of the candidate space, followed by Hyperopt (∼21%) and SMAC (∼27%). Naïve Gryffin succeeds after exploring less than 9% of the search space. Note that this fraction of the search space corresponds to roughly 17 HOIP candidates, which approximately matches the number of available organic compounds.
Despite naïve Gryffin already outperforming all other search strategies tested, the static and dynamic formulations still manage to improve upon this performance and identify the best-performing HOIP after exploring less than 8% of the design space, i.e., fewer than 16 HOIP candidates. This means that, on average, Gryffin could find the optimal design without needing to evaluate all organic cations. This observation further confirms that Gryffin's optimizations are indeed accelerated by the availability of suitable descriptors. However, in this example, we did not observe a significant performance difference between the static and the dynamic formulation of Gryffin. This behavior is expected given the high correlation (up to 0.9) between the provided descriptors and the bandgaps (see Sec. IV C).
For this application, we find that electronegativity is most relevant for the inorganic constituents, while the radius of gyration and the molecular weight are most informative for the organic compound. Although the targeted property (bandgap of the HOIP) is an electronic property, dynamic Gryffin seems to benefit the most from the geometric (and not the electronic) descriptors of the organic compound. In contrast, the mass of the inorganic compounds seems to be the least relevant, while their electronegativity is most informative. These observations suggest that the organic molecule does not directly affect the electronic properties of the HOIP material, but rather induces a change in the arrangement of the inorganic compounds which in turn modulates the bandgap. Indeed, this hypothesis has emerged in various studies on perovskite materials.121–125 These results further demonstrate how dynamic Gryffin is able to capture relevant and sometimes unexpected trends in the descriptors and inform future design choices.
C. Suzuki–Miyaura cross coupling optimization
As a final application, we demonstrate how Gryffin can aid in the optimization of Suzuki–Miyaura cross coupling reactions with heterocyclic substrates.126 These reactions are of particular interest to the pharmaceutical industry,127 and have recently been studied in the context of self-optimizing reactors for flow chemistry.39–41 The optimization of chemical reactions typically targets a maximization of the yield. The yield of a reaction can be modified by varying its process conditions, which can largely be described by continuous variables. However, the reaction rate is also affected by the choice of catalyst, which is a categorical variable and cannot be described with a continuous value.
Here, we consider a flow-based Suzuki–Miyaura cross coupling reaction in which we tune three continuous reaction conditions (temperature, residence time, and catalyst loading) and one categorical variable (the ligand of the palladium catalyst), as illustrated in Fig. 6(a). In this optimization task, we aim to maximize both the reaction yield and the turnover number (TON) of the catalyst. We employ the Chimera scalarizing strategy105 to enable this multi-objective optimization, accepting a 10% degradation of the maximum achievable reaction yield to increase the TON as the secondary objective. This acceptable degradation corresponds to a desired reaction yield above 85.4%. Gryffin extends the Phoenics algorithm to simultaneously optimize categorical and continuous parameters. We consider a set of seven ligands [see Fig. 6(b)], which are characterized by their molecular weight, number of rotatable bonds, melting point, number of valence electrons, and partition coefficient (logP). Details on the physicochemical descriptors and the ranges for the continuous parameters are provided in the supplementary material (see Sec. S.4.C).
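A simplified sketch of the hierarchical scalarization is given below; it captures the spirit of Chimera (satisfy the yield threshold first, then maximize TON) but is not the actual Chimera algorithm. The 85.4% threshold stated above implies a maximum achievable yield near 94.9%, which we use here as an assumption.

```python
def hierarchical_merit(yield_pct, ton, yield_threshold=85.4):
    """Two-level scalarization in the spirit of Chimera (simplified
    sketch): lower merit is better."""
    if yield_pct < yield_threshold:
        # primary objective unmet: merit is the (positive) yield
        # shortfall, strictly worse than any feasible point
        return yield_threshold - yield_pct
    # primary objective met: rank feasible points by TON alone
    return -ton
```

Under this merit, any conditions meeting the yield threshold dominate all conditions that miss it, and among feasible conditions the optimizer is driven purely toward higher TON.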
Results of the benchmarks on the optimization of Suzuki–Miyaura reactions. (a) Parameters and objectives for the optimization task: We target the identification of optimal values for three process parameters (temperature, residence time, and catalyst loading) and one categorical variable (ligand) to maximize the yield of the reaction and the turnover number (TON) of the catalyst. (b) Illustration of the seven available ligands. (c) Optimization traces showing the performance of individual optimization strategies. (d) Most informative descriptors to guide the optimization. The relevance metric is computed as the normalized weights that dynamic Gryffin learned to associate to each descriptor. The distributions shown are over the 200 repeated optimizations performed. Here, the relevance is indicative of the influence of each descriptor on both objectives. Diamonds indicate the average relevance of each descriptor.
As a complete performance analysis of the different optimization strategies is experimentally not tractable, we emulate noisy experimental responses with a probabilistic model trained on experimental data.105,106 Specifically, we train a Bayesian neural network to reproduce the reaction yield and TON of a previously reported flow-based reactor.40 Details on the data acquisition, model training, and prediction accuracies are provided in the supplementary material (see Sec. S.4.C). The excellent performance of the model on unseen data for both the reaction yield and the TON indicates that it is an accurate approximation of the experimental surface. With this experiment emulator, we execute 200 independent optimization runs, each with 240 evaluations, for each of the eight experiment planning strategies tested.
Figure 6(c) illustrates the individual performance of each optimization strategy. We find that GPyOpt requires about 18 evaluations to identify reaction conditions achieving the desired reaction yield above 85.4%. The other benchmark strategies, including random exploration, satisfy this first objective already after evaluating approximately 10–12 different conditions. Only the three formulations of Gryffin can locate the desired reaction conditions at even faster rates, requiring 7–8 evaluations. For the optimization of TON, the secondary objective, we observe that PyEvolve is the slowest optimization strategy. In fact, random search starts outperforming PyEvolve after about 100 evaluations. Despite its relatively poor performance on the reaction yield, GPyOpt maximizes the TON faster than random search. Yet, SMAC and Hyperopt achieve significantly higher TONs for any given number of evaluations. They are slightly outperformed only by Gryffin. In this application, we do not observe a significant difference in the performance of the three formulations of Gryffin. This result can be attributed to the fact that we only have one categorical variable with only seven options to choose from. Nevertheless, dynamic Gryffin achieves slightly higher TONs than static or naïve Gryffin.
The contributions of individual descriptors are illustrated in Fig. 6(d), where we find that the number of valence electrons shows the highest relevance among all descriptors to guide dynamic Gryffin, while the number of rotatable bonds is the least relevant. Indeed, the number of rotatable bonds correlates the least with the maximum and average reaction yields and TONs for any values of the other parameters (see the supplementary material, Sec. S.4.C), confirming that dynamic Gryffin correctly identifies non-informative descriptors within the given library of ligand candidates. The number of valence electrons also correlates strongly with the maximum achievable reaction yield for each of the ligands, confirming that this descriptor is highly informative to identify ligands that satisfy the reaction yield threshold. Melting point and molecular weight are identified as relevant likely due to their strong correlation with the number of valence electrons. Based on the indications of dynamic Gryffin, the design of more potent ligand candidates could be inspired by the number of valence electrons. In contrast to the previous applications, however, we consider a relatively small library of only seven ligands, such that the descriptor indications might not necessarily generalize well to larger libraries.
Overall, across all three applications, we find that naïve Gryffin generally constitutes a competitive strategy for the optimization of categorical variables in chemistry and materials science, which tends to outperform state-of-the-art optimization strategies. Static Gryffin can accelerate the search with provided descriptors and navigate the search space more efficiently by exploiting descriptor-based similarities between individual options, thus efficiently leveraging domain knowledge. Dynamic Gryffin can accelerate the search even further by transforming provided descriptors to improve their relevance and inspire scientific insights. Finally, Gryffin integrates well with optimization strategies for continuous variables and thus enables the simultaneous optimization of mixed continuous-categorical parameter spaces.
VI. DISCUSSION
In this work, we introduced Gryffin, an experiment planning strategy for the selection of categorical variables, such as functional molecules, catalysts, and materials. Similar to a recently introduced algorithm for optimization over continuous domains,77 Gryffin relies on Bayesian kernel density estimation for the construction of a surrogate model. To extend this formulation of Bayesian optimization to categorical spaces, Gryffin takes advantage of a smooth approximation of categorical variables.90,91 Furthermore, by locally transforming the metric of the optimization domain, Gryffin is able to leverage domain knowledge in the form of descriptors. This unique feature allows it to exploit the similarity of different options to navigate categorical spaces more efficiently. In addition, because the relevance of each descriptor for the property being optimized can be investigated, Gryffin can also inform future design choices and spark scientific insight.
We compared the performance of Gryffin to state-of-the-art categorical optimization algorithms on a set of synthetic benchmark functions. Our benchmarks indicate that the naïve formulation of Gryffin, which does not use any descriptor information, is competitive with state-of-the-art strategies on pseudo-convex surfaces and outperforms them on all other surfaces. Descriptor-guided searches with static Gryffin identify global optima at significantly faster rates across all surfaces. Dynamic Gryffin, which attempts to construct a more informative set of descriptors, can accelerate the search even further, especially when correlations between the descriptors and the properties to be optimized are moderate, and for noisy response surfaces. Importantly, across all our tests, the dynamic formulation of Gryffin was never found to be considerably inferior to its static formulation.
The capabilities of Gryffin were further demonstrated on three real-world applications across materials science and chemistry: (i) the discovery of non-fullerene acceptors for organic solar cells, (ii) the discovery of hybrid organic–inorganic perovskites for light-harvesting, and (iii) the mixed categorical-continuous selection of ligands and reaction conditions for Suzuki–Miyaura cross coupling reactions. Gryffin outperformed the other experiment planning strategies in all three applications. Static and dynamic Gryffin accelerated the searches even with moderately informative physicochemical descriptors. We further found that dynamic Gryffin was able to identify trends among descriptors that elucidate some of the prevalent phenomena giving rise to the properties of interest. This observation indicates that dynamic Gryffin has the potential to foster scientific understanding and encourage physical and chemical intuition for the studied systems.
A question that naturally arises when using Gryffin is how to select the descriptors associated with specific chemical species. In fact, cheminformatics tools like RDKit128 and Mordred129 can compute several hundred descriptors. In principle, descriptor selection should be guided by domain knowledge, and only descriptors that are expected to be informative should be included: the higher the correlation between the descriptors and the objective, the more efficient the optimization will be. In addition, the chosen descriptors should ideally not carry redundant information (i.e., they should not be highly correlated with each other). In practice, however, we found that Gryffin is able to take advantage of descriptors with Pearson correlation magnitudes as low as 0.1 (Fig. 3). Furthermore, the inclusion of up to 64 redundant descriptors was not found to have a significant impact on the performance of the algorithm (Fig. S10). A downside of using a large number (e.g., hundreds) of descriptors is the higher probability of observing spurious correlations between descriptors and the objective being optimized. Spurious correlations might result in some uninformative descriptors being considered relevant by dynamic Gryffin. However, the impact of possible spurious correlations is expected to be small when informative descriptors are present, and uninformative descriptors will be identified as such as more data are collected. Hence, while it is prudent to select descriptors carefully, Gryffin is expected to be robust against uninformative or redundant descriptors. If in doubt whether a certain descriptor is informative for the specific optimization task, we suggest including, rather than removing, the descriptor and letting dynamic Gryffin learn its true relevance from the data collected during the optimization.
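As a sketch of such a pre-screening step, the following hypothetical helper ranks candidate descriptors by the magnitude of their Pearson correlation with the objective and greedily drops near-duplicates; the threshold values are illustrative choices and not part of Gryffin:

```python
import numpy as np

def screen_descriptors(X, y, min_corr=0.1, max_redundancy=0.95):
    """Rank descriptor columns of X by |Pearson r| with the objective y,
    keep those above min_corr, and drop columns that are nearly collinear
    with an already-kept column (illustrative thresholds)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    r_obj = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    order = np.argsort(-np.abs(r_obj))  # most informative first
    kept = []
    for j in order:
        if abs(r_obj[j]) < min_corr:
            continue  # too weakly correlated with the objective
        if any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > max_redundancy
               for k in kept):
            continue  # redundant with a descriptor already kept
        kept.append(int(j))
    return kept, r_obj
```

Such a filter is optional in light of the robustness results above; in doubt, a weakly correlated descriptor can simply be passed through and left for dynamic Gryffin to down-weight.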
Several extensions and improvements to Gryffin are possible and will be considered in future work. While dynamic Gryffin learns the importance of each descriptor from data, the static formulation assumes equal importance among all descriptors provided by the user. A possible extension could allow the user to provide weights associated with each descriptor to reflect user knowledge of their relative importance. Another possible improvement is the use of unsupervised learning techniques, such as variational autoencoders,47,130,131 to automatically obtain continuous molecular representations to be used instead of or in conjunction with user-defined descriptors.
One of the most computationally demanding aspects of Gryffin is the optimization of the non-convex acquisition function. Within this step, evaluating the acquisition function itself accounts for most of the cost, given that numerous evaluations are required. The use of zeroth-order optimization approaches, such as evolutionary algorithms, might reduce this cost by avoiding the additional function evaluations that first- and second-order methods require to estimate gradients. In addition, the use of compositional solvers132,133 might improve the performance of the acquisition optimization routine when proposing experiments in batches. Overall, improvements in the algorithms used to optimize the acquisition function might result in better performance and a more computationally efficient code.
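A minimal example of such a zeroth-order approach is a simple elitist (1+λ) evolution strategy, sketched below for a generic acquisition function to be minimized; the function name, hyperparameters, and fixed step size are illustrative placeholders rather than Gryffin's optimizer:

```python
import numpy as np

def evolve(acq, x0, sigma=0.3, pop=20, generations=50, seed=0):
    """Gradient-free (1+lambda) evolution strategy: each generation samples
    pop Gaussian perturbations around the incumbent and keeps the best."""
    rng = np.random.default_rng(seed)
    best_x = np.asarray(x0, float)
    best_f = acq(best_x)
    for _ in range(generations):
        # zeroth-order step: only function values are needed, no gradients
        cand = best_x + sigma * rng.standard_normal((pop, best_x.size))
        f = np.array([acq(c) for c in cand])
        i = int(np.argmin(f))
        if f[i] < best_f:  # elitism: never discard the incumbent
            best_x, best_f = cand[i], f[i]
    return best_x, best_f
```

In practice, the population evaluations are embarrassingly parallel, which is one reason evolutionary schemes can be attractive for batched experiment proposals.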
The use of power transforms to handle heteroscedastic noise has recently been found to improve the performance of Bayesian optimization based on Gaussian processes.134 Among the other recently proposed approaches to handle this scenario, which is common in many experimental and computational settings, is the use of surrogate models that are robust against this type of uncertainty.135 Analogous approaches could be implemented and tested in Gryffin as well.
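For reference, a minimal Box-Cox power transform of the kind discussed in Ref. 134 can be sketched as follows; the grid-based maximum-likelihood choice of the exponent is standard practice and not specific to Gryffin:

```python
import numpy as np

def boxcox(y, lambdas=np.linspace(-2, 2, 81)):
    """Box-Cox power transform with the exponent chosen by maximizing the
    standard profile log-likelihood on a grid; requires strictly positive y."""
    y = np.asarray(y, float)
    assert np.all(y > 0), "Box-Cox requires strictly positive values"
    n = y.size
    log_y_sum = np.log(y).sum()
    best_lam, best_ll = None, -np.inf
    for lam in lambdas:
        z = np.log(y) if abs(lam) < 1e-12 else (y ** lam - 1.0) / lam
        # profile log-likelihood of the transformed data under a Gaussian
        ll = -0.5 * n * np.log(np.var(z)) + (lam - 1.0) * log_y_sum
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    z = np.log(y) if abs(best_lam) < 1e-12 else (y ** best_lam - 1.0) / best_lam
    return z, best_lam
```

Applying such a transform to the measured objective values before fitting the surrogate compresses heavy right tails and can make a homoscedastic noise assumption more tenable.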
VII. CONCLUSION
In summary, Gryffin constitutes a readily available strategy for the efficient selection of categorical variables across optimization tasks in science and engineering and alleviates some of the immediate challenges to the versatile deployment of autonomous experimentation platforms. The demonstrated improvement upon state-of-the-art approaches—thanks to the use of physicochemical descriptors—constitutes a step toward effective, data-driven experiment planning guided by domain knowledge. We believe that Gryffin will enable the acceleration of scientific discovery, such as the search for promising molecules and materials. We invite the community to test and deploy it in expensive optimization tasks where similarities between categorical options can be defined. Gryffin is available for download on GitHub.94
SUPPLEMENTARY MATERIAL
See the supplementary material for additional details on the derivation of Gryffin, benchmark results and their analyses, and chemistry and materials science applications.
ACKNOWLEDGMENTS
The authors thank Melodie Christensen, Dr. Pascal Friederich, Dr. Gabriel dos Passos Gomes, and Dr. Daniel P. Tabor for valuable and insightful discussions. F.H. acknowledges financial support from the Herchel Smith Graduate Fellowship and the Jacques-Emile Dubois Student Dissertation Fellowship. M.A. is supported by a postdoctoral fellowship from the Vector Institute. R.J.H. gratefully acknowledges the Natural Sciences and Engineering Research Council of Canada (NSERC) for provision of the Postgraduate Scholarships-Doctoral Program (Grant No. PGSD3-534584-2019). L.M.R. and A.A.G. were supported by the Tata Sons Limited-Alliance Agreement (Grant No. A32391). A.A.G. thanks Dr. Anders Frøseth for his support. This work relates to Department of Navy award (Contract No. N00014-19-1-2134) issued by the Office of Naval Research. The United States Government has a royalty-free license throughout the world in all copyrightable material contained herein. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Office of Naval Research. All computations reported in this paper were completed on the Arran cluster supported by the Health Sciences Platform (HSP) at Tianjin University, People's Republic of China and the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University.
DATA AVAILABILITY
The data that support the findings of this study are openly available on GitHub at https://github.com/aspuru-guzik-group/gryffin.