Data generation remains a bottleneck in training surrogate models to predict molecular properties. We demonstrate that multitask Gaussian process regression overcomes this limitation by leveraging both expensive and cheap data sources. In particular, we consider training sets constructed from coupled-cluster (CC) and density functional theory (DFT) data. We report that multitask surrogates can predict at CC-level accuracy with a reduction in data generation cost by over an order of magnitude. Of note, our approach allows the training set to include DFT data generated by a heterogeneous mix of exchange–correlation functionals without imposing any artificial hierarchy on functional accuracy. More generally, the multitask framework can accommodate a wider range of training set structures—including the full disparity between the different levels of fidelity—than existing kernel approaches based on Δ-learning although we show that the accuracy of the two approaches can be similar. Consequently, multitask regression can be a tool for reducing data generation costs even further by opportunistically exploiting existing data sources.
I. INTRODUCTION
Predicting molecular properties based on first-principles approaches drives innovation across the physical sciences. However, exact prediction is infeasible due to the exponential scaling of the Schrödinger equation—the underlying quantum-mechanical description of materials. Decades of research have yielded numerous first-principles methods to describe electronic structures approximately, each representing a compromise between computational cost and prediction accuracy. Coupled cluster (CC) methods are typical examples of high fidelity models, which offer accuracy in exchange for considerable computational cost. In particular, CCSD(T)—the “gold standard” of quantum-chemical prediction—suffers from a steep scaling, where Ne is the number of electrons in the target system.1 In contrast, density functional theory (DFT) methods are popular baseline models since they provide adequate accuracy for most systems while enjoying substantially lower cost [scaling only as ].2–6
Traditional workflows force users to settle on a single quantum chemical method that is cheap enough to result in manageable costs but still sufficiently high fidelity to give meaningful predictions. However, the advent of modern data-driven modeling has introduced new tools, which make it possible to decouple the step of estimating the properties of unseen molecular systems from the step of running expensive first-principles computations. Statistical surrogate models are trained with predictions from first-principles methods and, once trained, can be substituted into workflows, making the prediction step considerably cheaper. In the context of building potential energy surfaces and interatomic potentials, these data-driven approaches have become the de facto standard.7–10 Simultaneously, large-scale data generation is facilitated by sophisticated workflows for high-throughput calculations10–12 and robust DFT methods inspired by mathematical research.5,13–16 Increasing amounts of data are openly available18–21 and are, in principle, ready for incorporation into statistical surrogates, but quality is generally not homogeneous as each dataset’s author has made different methodological choices. Consequently, most surrogate models are single fidelity: all training data are generated using the same quantum-chemical method and the same numerical approach. Heterogeneity has thus been an obstacle to opportunistically exploiting the already available data. In this work, we construct surrogates that do not require a single choice between accurate (but expensive) and fast (but crude) methods but instead combine the advantages of both. In particular, we consider multitask methods, which construct a relationship between multiple regression problems without requiring precise information about the respective accuracy of each source.22,23
Our multitask framework can accommodate a range of heterogeneous training set structures. For example, suppose we aim to design a surrogate that predicts the three-body energy for any configuration of a water trimer with an accuracy comparable to CCSD(T). We may only have access to 100 CCSD(T) data points, all corresponding to configurations with HOH angles smaller than 110°. Additional data may be provided by including a more exhaustive dataset based on DFT simulations with the PBE functional. Notably, even though the geometries of the molecular configurations in both datasets may not match, they can both be fully employed to train a multitask surrogate. Moreover, if higher predictive accuracy is desired, further training data can be created using yet another quantum-chemical method. For example, we may employ BLYP-based calculations on all molecular configurations covered by CCSD(T) and PBE, plus some configurations that we know we will eventually target for prediction. Remarkably, in this case, all available data can be used without imposing any assumption about which of the DFT methods (here, BLYP or PBE) is more accurate for the considered systems. Internally, a multitask surrogate creates a statistical submodel for each of the categories of data we supply it with [here, different levels of theory, i.e., CCSD(T), PBE, and BLYP]. In the language used in the rest of the article, we call the submodel associated with our main prediction objective—the three-body energy at the CCSD(T) level—the primary task. The other submodels—here associated with the DFT levels of theory—strive to improve the predictive accuracy of the primary task and are called secondary tasks. For a concise overview of these and other terms employed in the article, see Table I.
Definitions of key terms.
Model . | A data-based surrogate for quantum chemical prediction . |
---|---|
training data | Possibly heterogeneous data used to construct a model |
testing data | Data used to evaluate the prediction error of a model; disjoint from training data |
level of theory | The first-principles method used to generate quantum chemical data [e.g., CCSD(T), DFT with PBE, …] |
high fidelity | An accurate, expensive method such as CCSD(T) |
low fidelity | A method, such as DFT, which is cheaper but less accurate than the high fidelity method |
task | A model component trained with homogeneous data |
primary | The task associated with the main prediction objective of the model |
secondary | Supplemental tasks, which are included in a model to support the prediction of the primary task |
core (C) | Molecular configurations for which training data are available for the primary task |
additional (A) | Molecular configurations for which training data are only available for secondary tasks |
target (T) | Molecular configurations for which the model will make predictions (i.e., the testing set) |
Model . | A data-based surrogate for quantum chemical prediction . |
---|---|
training data | Possibly heterogeneous data used to construct a model |
testing data | Data used to evaluate the prediction error of a model; disjoint from training data |
level of theory | The first-principles method used to generate quantum chemical data [e.g., CCSD(T), DFT with PBE, …] |
high fidelity | An accurate, expensive method such as CCSD(T) |
low fidelity | A method, such as DFT, which is cheaper but less accurate than the high fidelity method |
task | A model component trained with homogeneous data |
primary | The task associated with the main prediction objective of the model |
secondary | Supplemental tasks, which are included in a model to support the prediction of the primary task |
core (C) | Molecular configurations for which training data are available for the primary task |
additional (A) | Molecular configurations for which training data are only available for secondary tasks |
target (T) | Molecular configurations for which the model will make predictions (i.e., the testing set) |
In this work, we will investigate such multitask models and the impact of training set design (i.e., the distribution of training data between the primary and secondary tasks) on prediction accuracy. Figure 1 visualizes the different training set structures, which we consider. Secondary task training sets may share some molecular configurations with primary task training sets, include configurations unique to the secondary task, and may even include configurations from the testing dataset. An example of the latter case is the use of a BLYP energy prediction at the same geometry, where we later apply the surrogate to predict the CCSD(T) energy. The impact of these choices on inference is under-explored in the literature, and multitask models are useful tools for investigating the benefits of different training set configurations. Crucially, the multitask framework enables training with a “dataset of opportunity,” amalgamated from existing datasets, as an alternative to expending computational time to generate entirely new data.
Diagram of training datasets compatible with the methods considered in this work. Here, we define three sets of molecules—core, additional, and target systems—and two fidelities—primary and secondary. Our goal is to train a model that will predict our quantity of interest for the target molecules with the accuracy of the primary method. The molecular space covered by secondary training data is a design choice for which we present six options, labeled C, CT, A, AT, CA, and CAT. Key terms are defined in Table I.
Diagram of training datasets compatible with the methods considered in this work. Here, we define three sets of molecules—core, additional, and target systems—and two fidelities—primary and secondary. Our goal is to train a model that will predict our quantity of interest for the target molecules with the accuracy of the primary method. The molecular space covered by secondary training data is a design choice for which we present six options, labeled C, CT, A, AT, CA, and CAT. Key terms are defined in Table I.
In this paper, we construct submodels for each task using Gaussian process (GP) regression, a popular tool for constructing surrogates informed by heterogeneous data. Several works24–26 have explored the use of multifidelity GP models27 trained using two fidelities, namely, a high fidelity and a low fidelity DFT approach. Alternatively, Δ-learning may be employed to model the difference between two levels of theory, then add that difference to low level predictions of the target to produce high level predictions.25,28,29 Although GP regression and the related technique of kernel ridge regression are common choices for fitting such models, in principle, any method of fitting a surrogate may be employed.25,29 Hierarchical machine learning (hML) models extend the Δ-learning framework to leverage more than two training datasets by learning and summing multiple difference models.8,29 However, between each pair of training sets, data points must be in some way aligned. For example, in a model of the difference between DFT and CCSD(T), DFT predictions may only contribute to the training set if the corresponding CCSD(T) prediction is also available. Furthermore, when either hML or multifidelity GP regression learns from more than two heterogeneous training sets, these sets must be assigned some order—for example, following a hierarchy in accuracy. Recent work30,31 has proposed an optimized linear combination of submodels as an alternative, but all coefficients in this linear combination must be re-optimized whenever any submodel is altered, and opportunistic dataset construction is not fully explored.
Progress toward multifidelity approaches has also been made in the context of neural network surrogates. For example, transfer learning methods train an initial neural network on a larger secondary dataset and then re-optimize part of the parameters using a primary dataset.7,32,33 For the prediction of atomistic properties, the secondary dataset might contain millions of DFT calculations, while the primary dataset might be a smaller set of CCSD(T) predictions. In prior studies, it has been observed that transfer learning can produce higher accuracy than training a network on only one of these datasets.8,29 While the framework we discuss in this work is not restricted to GP regression strategies, we will focus exclusively on GP methods. In comparison to neural network approaches, where a rigorous mathematical explanation for the low generalization error of well-trained models is still lacking,34,35 GP approaches are more amenable to mathematical analysis.36 This comparative transparency simplifies the interpretation of the surrogate’s predictive process in the diverse training data configurations we analyze. In addition, GP methods have the advantage of yielding accurate predictions even for relatively small training set sizes. While the training cost of GP regression may become large as the size of the training set increases, many strategies for efficient scaling have been developed and can be routinely employed.37–40
A. Advantages of the multitask method
Our numerical experiments will demonstrate the following advantages of the multitask modeling approach: Key figures illustrating each claim are referenced in parentheses.
Multitask models can achieve target accuracy with less computational cost than Gaussian process regression (Fig. 3).
Accuracy of multitask inference improves as more DFT data are added to the training set. Furthermore, increasing the amount of training data from different secondary methods for the same molecule improves accuracy (Fig. 6).
Unlike the Δ and multifidelity approaches, multitask methods can model more than two observation datasets without imposing an arbitrary order on the sets (Figs. 5 and 6).
Multitask models have flexible training set requirements. The sets are constructed from multiple levels of theory, and the same molecules do not need to be included for each level (Fig. 3).
Flexibility is a key advantage of the multitask method compared to a single task or Δ-learning approach. The method is compatible with many training set structures. For instance, a multitask model can be trained with CCSD(T) data for one molecule set and DFT data for a different molecule set, with CCSD(T) and DFT data for the same molecule set, or with some in the intermediary case. Models also have the option to train with DFT data for the molecules that we aim to make predictions for at the CCSD(T) level of accuracy. The flexible multitask framework makes it possible to explore the impact of different compositions of a training dataset on inference, as shown in Fig. 3. Consequently, we can determine the balance between expensive high level and cost effective low level data, which best suits the problem setting and available data.
In Sec. II, we will review prediction with GP regression as well as the Δ and multifidelity methods before describing the multitask framework. Section III provides details on our approach to testing different training set configurations, and Sec. IV describes our numerical tests. We demonstrate the performance of the multitask method on two examples: prediction of the three-body energy of water trimer configurations and prediction of the ionization potential of small organic molecules.
II. STATISTICAL MODELS
A. Gaussian process regression
Our goal is to predict the values of some quantity of interest (QoI) f, given an input feature vector Xm describing the mth molecular configuration. We train our predictive model using , where each Ym is a noisy observation of f(Xm). In our setting, we interpret imperfect first-principles predictions of the QoI as noisy observations of the true quantity. Each Xm is a representation of the electronic structure of a molecular system. For our numerical results, we build Xm using the well-established Smooth Overlap of Atomic Potentials (SOAP).42–44 Ym is the CCSD(T) or DFT prediction of the quantity of interest for the mth molecular configuration.
B. Multiple observation sets
Suppose we have access to heterogeneous training data, which includes predictions of our QoI by multiple first-principles methods. We thus seek an inference model that leverages all data , where each n = 1, …, N represents a task and the corresponding set contains indices of the molecular configurations available to train a model for task n. A simple “pooling” approach could put all observations into one vector and proceed with inference via Eq. (4). Then, if , we can interpret Ymn and Ymn′ as independent observations of the same QoI, f(Xm), for molecular configuration Xm. The distinct behavior of each quantum chemical method is treated as a perturbation of some mean value by a realization of the noise.
Rather than doing this, we seek a model where the distribution of each Ymn is sensitive to n. Ideally, we want to map differences in electronic structure methods to differences in their predictions. Method-specific features could be a solution, but the construction of such features is challenging. One approach involves appending a vector, which indicates the method category to each feature. This strategy puts a large burden on covariance hyperparameter optimization since one set of hyperparameters must encode differences between molecular configurations and between quantum chemical methods. Alternatively, to characterize the qualitative difference between methods within the features, we would have to solve the very problem for which we are creating the features in the first place: for instance, we would need to reliably predict CCSD(T) performance given CCSD(T) and DFT data from training molecules as well as DFT data for target molecules.
Instead, we will extend the framework of GP regression to model relationships between multiple prediction tasks.
C. Δ-Learning
Consider a case where we want to build a model from an existing dataset, which consists of CCSD(T) predictions for configurations in and DFT predictions for configurations in . In the notation introduced by Table I, is the core set, C. may contain any combination of molecules from the C, A, and T sets. As (5) indicates, the training data for Δ-learning will include only molecular configurations in . If , then Δ-learning will be able to leverage more data than the conventional GP regression model described by (3) trained on the CCSD(T) data. Unfortunately, Δ-learning is not-applicable in cases where predictions made by different methods correspond to different molecular configurations. When , it is impossible to train a Δ-learning model. To include a DFT prediction for a molecule Xm in the training set for a Δ model between CCSD(T) and DFT, we also require a CCSD(T) prediction for the same molecule.
In the following subsections, we consider alternative approaches that model disparities in data sources and have the ability to use all available observations.
D. Multifidelity method
E. Multitask method
“Multitasking” refers to a category of methods that consider several regression tasks and assume some relationship between these tasks. As with Δ-learning or multifidelity fusion, multitasking can allow us to incorporate a larger dataset into our inference problem without just pooling the data into one observation vector. For instance, we can define a regression problem trained on CCSD(T) data as well as one trained on each density-functional approximation (DFA) we consider. By modeling the correlation between regression functions, we can use the DFA regression tasks to support prediction in the CCSD(T) task.
Symmetric variants of multitask modeling treat all tasks equally. For instance, we may assume that all the data are modeled by a single Gaussian process whose covariance kernel is the Kronecker product of one kernel, which relates the tasks and another, which models relationships between data from the same task.22,45 In many cases, however, the goal is to perform regression for a primary task, and all other tasks are considered secondary tasks, which provide data that may be useful in learning the primary task. In our own setting, CCSD(T) is a natural choice of data for the primary regression task because we would like to make predictions that match this method in accuracy. Each set of DFT predictions made using a different DFA could inform a secondary regression task, which supports the primary task. Thus, we consider an asymmetrical model22 in which secondary tasks are related to the primary task.
Superscripts indicate the GP prior structures used to create a vector or matrix, and subscripts indicate the features that are compared by a matrix. In both cases, s indicates evaluations for all secondary tasks, n = 1, …, N − 1. Inference proceeds as described in Eq. (4).
F. Comparison of methods for more than two fidelities
Asymmetric multitasking assumes a relationship between regression functions—given by Eq. (12)—which is structurally reminiscent of the multifidelity fusion model—Eq. (7). Both assume that the regression function of one dataset is related to the regression function of another by a scaling parameter, ρ, and an additive disparity function, δ.
Multitasking also holds a similar advantage over the hML extension to Δ-learning. The indices n = 1, …, N suggest an order of the training sets, and the difference models in (6) correspond to consecutive pairs of training sets. Such an order may be arbitrary, for example, if our datasets are generated by multiple DFAs with similar structures. In Sec. III, we will describe different compositions of a training dataset that the multitask framework allows us to test.
III. MAPPING HETEROGENEOUS DATA TO TASKS
In our numerical examples, we train inference models using data for multiple molecular configurations computed with multiple quantum chemistry methods. The multitask method leaves many questions about the particulars of the training dataset up for the researcher. In particular, it is necessary to define tasks and map molecules to the training set of each task.
A. Selecting tasks
The primary task is chosen according to our prediction goal. We aim to demonstrate that by training with data from multiple quantum chemistry methods, we can construct an inference model that has prediction accuracy comparable to CCSD(T) for less computational expense than if we trained a single task model on CCSD(T) alone. Thus, we will test our surrogates using a set of CCSD(T) predictions for some target molecules belonging to a set we call T. Our primary task must be informed by CCSD(T) predictions for a different set of molecules, which we refer to as our core training set, C. See Table I for an overview of our key definitions.
We consider two approaches to mapping between levels of theory and tasks. In the first (Subsections IV B 1 and IV C), the training data for our primary task will consist of direct computations of our QoI at the CCSD(T) level, and each secondary task will be trained by DFT predictions of the QoI obtained with different DFT functionals.
The second approach (Subsection IV B 2) we call the “multitask Δ” option. In that setting, the levels of the theory include CCSD(T) and DFT computations performed with two different DFAs. The difference between CCSD(T) and the first DFA will be the primary task, while the difference between the two DFAs informs the secondary task. Note that the Δ-learning method could be interpreted as the single task counterpart to the multitask Δ option.
B. Assigning molecular configurations to tasks
We are specifically interested in the flexibility that the multitask framework provides toward assigning molecular configurations to secondary training sets. Our pool of available molecules is divided into Core (C), Additional (A), and Target (T) subsets to systematically explore the impact of the molecular space covered by the secondary training data and its relationship to our primary training and testing data. The A set contains all molecular configurations that are not part of the primary training or testing set. In our setting, these are configurations that are not our targets for inference and for which high level CCSD(T) predictions are unavailable. A major difference between the multitask framework and Δ-learning is the former’s ability to leverage this A set.
Figure 1 shows the six main secondary training set structures considered in this work. We name the training set structures using combinations of the letters C, A, and T. The inclusion of a letter in a structure name indicates that at least one molecular configuration from the corresponding subset is included in the secondary training dataset. For example, a CT training structure in our setting would contain CCSD(T) predictions for the C set of molecules, DFT predictions for some subset of the C set, and DFT predictions for some subset of the T set. The data requirement of implementing a multitask method with this CT training set structure is comparable to the data demands of a Δ-learning model. However, a crucial distinction is that the multitask approach does not require a DFT prediction for every molecular configuration in the C and T sets. Furthermore, the multitask model can make use of an additional A training set for which CCSD(T) predictions are unavailable. Δ-learning models cannot be implemented for such a training set structure.
C. Multiple secondary tasks
The multitask framework can accommodate an arbitrary number of secondary tasks and allows the observational data used to train each task to correspond to non-identical sets of inputs. We will use the same definition for the C, A, and T molecular sets regardless of the total number of tasks, N. In this context, when a training structure is labeled CAT, this name indicates that at least one configuration from each of these sets is present in the training set for at least one secondary task. To the best of our knowledge, most work that explores training set design for multifidelity, Δ-learning, or hML models focuses on optimizing the size of the training set for each source. Questions of whether different secondary sources of information should explore the same or new areas of molecular space are underexplored in the literature. For example, if our secondary training data includes DFT predictions at the PBE level for CO2 configurations and the BLYP level for ONH4 configurations, we aim to determine how much our model will improve if we fill in the blanks (i.e., add PBE level predictions for ONH4 and BLYP level predictions for CO2 vs adding data on previously unseen molecules). Such questions are investigated in Subsection IV C.
Definition of functions, which return sets of molecular configurations corresponding to primary task, p, and secondary tasks, s1, …, sN−1. Note that by definition, is the entire core set, C, and .
Intersection between core molecules and training molecules for task t | |
Intersection between additional molecules and training molecules for task t | |
Intersection between target molecules and training molecules for task t |
Intersection between core molecules and training molecules for task t | |
Intersection between additional molecules and training molecules for task t | |
Intersection between target molecules and training molecules for task t |
Example dataset structures. C, A, and T are sets of molecular configurations. C contains all configurations that the primary task, CCSD(T), is trained on. T contains the target configurations; our goal is to estimate their properties with the accuracy of CCSD(T). The testing set is colored green. All secondary tasks are highlighted in gray. Configurations in set A can only be used for training secondary tasks. The vertical alignment of dots indicates that multiple tasks have access to the same molecular configurations.
Example dataset structures. C, A, and T are sets of molecular configurations. C contains all configurations that the primary task, CCSD(T), is trained on. T contains the target configurations; our goal is to estimate their properties with the accuracy of CCSD(T). The testing set is colored green. All secondary tasks are highlighted in gray. Configurations in set A can only be used for training secondary tasks. The vertical alignment of dots indicates that multiple tasks have access to the same molecular configurations.
D. Training with target molecules
We stress that whenever , the secondary training data include low level predictions of at least some of the molecules that we will use to test the multitask model. In Fig. 2, the testing molecules—target molecules at the primary fidelity—are represented in green. The black dots in the third column of the lower plot indicate that BLYP predictions for the test molecules are included in the training set.
All lower level predictions for target molecular configurations that are available at training time can be used to obtain a multitask model. If new lower level T predictions become available after this initial training, retraining is needed to use these data. Note that a Δ-learning model can leverage such additional T data without any retraining. However, each target data point will only contribute to a single prediction of the Δ-learning framework. In contrast, retraining the multitask model buys the benefits of this additional data point for all future predictions. Furthermore, sparse update techniques such as the Sherman–Morrison formula offer efficient alternatives to full retraining of the multitask model.
IV. NUMERICAL EXAMPLES
A. Numerical setup
We consider two case studies. In the first, the quantity of interest is the three-body (3-b) interaction energy of randomly selected water trimer configurations. The 3-b interaction term is isolated by subtraction of all 1-b and 2-b energies from the full energy of the trimer. We selected water because it is a simple molecule with unique properties, which make it crucial for life and a popular subject for research.46–48 The second case study considers the prediction of the ionization potential of small organic molecules. Ionization potential is known to be challenging for DFT to capture accurately, so it is insightful to test the ability of inference models to predict this quantity.
1. Data generation
Multitask models can be trained on heterogeneous datasets constructed opportunistically from multiple sources without the need to specify a predefined accuracy hierarchy (apart from singling out the primary task). Our test cases were selected to explore this flexibility by tapping mostly on previously published datasets supplemented with only a few additional computations.
For our first test case—inference of 3-b interaction energy—we consider 5986 water trimer configurations and corresponding CCSD(T) calculations. These were randomly selected from 45 332 configurations available at https://github.com/jmbowma/q-AQUA49 to obtain training data sizes approximately matching our second test case. To illustrate that even such truncated datasets can be employed to yield reasonable predictive qualities with multitask models, we enrich this dataset again by running a small number of cheap density-functional theory simulations. On 2992 of these configurations, we employ Psi450 to compute the 3-b interaction energies at the PBE51 and SCAN52 levels. The former was selected for its popularity, and the latter because its description of intermediate range dispersion has been found to contribute to an accurate prediction of water energy differences.46,47 Following recent work on dispersion corrections,53 we paired the PBE approximation with a D3 correction using finite damping and the SCAN approximation with a D3 correction using zero damping. Both DFT and CCSD(T) calculations employed a standard counterpoise correction to alleviate basis set superposition error.54
The second test case is informed by the ionization potential of 3165 molecular configurations, combinations of 479 different molecules, and seven configuration options.55 Configurations are subselected56 from the ANI-1 dataset21 and contain only elements from the set {H, C, N, O}. We compute57 the ionization potential for these configurations using CCSD(T) as well as secondary tasks corresponding to data from four density functionals: PBE0_DH,58 PBE0,59 PBE,51 and BLYP.60
The quantification of cost in our numerical results refers to the cost of dataset generation. The cost of inference is negligible next to quantum chemistry calculations. We base our cost model for each method on the average runtime of Psi4 computations for ten representative systems. For more details about the cost model, see Appendix B. We make the code required to reproduce our results and figures openly available, as described in statement V. Similarly, the first-principles data as well as surrogate prediction data are openly accessible, along with the cost of training and applying surrogate models.
2. Features
We require features that can distinguish between molecular systems to serve as inputs, Xi, to our Gaussian process models. Kernel functions use features to model the relationships between quantities of interest. As noted in Section II A, in our experiments we consider Smooth Overlap of Atomic Positions (SOAP) features, which have been successfully used to fit interatomic potentials in a Gaussian process based approach.42–44 SOAP features are constructed based on the neighborhood of each atom in a system, so challenges arise when two systems contain different numbers of constituent atoms of elements. Multiple strategies have been proposed for constructing “global” features to capture entire systems.61 The simplest approach is to take an average of the SOAP features for each atom in the system. Since this approach is computationally efficient and works well in practice, we apply it to our test cases. We chose SOAP parameters following experimentation and the advice of the literature.62,63 More details are available in Appendixes C and D.
3. Hyperparameter optimization
We estimate correlation parameters, ρ, using Pearson’s correlation coefficient between each secondary task optimization dataset and the corresponding primary task data. Popular belief in the literature holds that correlation parameters require a detailed optimization procedure,29 but we find that these easy to obtain estimates work reasonably well in practice. A more detailed optimization is left for future work. We estimate by maximizing its log-marginal likelihood. This approach is easily extended to multifidelity and multitask cases by treating Yij − ρjYi1 as the ith observation.64 For the organic molecules example, we also use a maximum likelihood estimate for v, while for the water trimer example, we find better results by using an estimate of v that minimizes the mean absolute prediction error. See Appendix A for a comparison of MLE and minimum MAE estimates as well as a more detailed discussion of the resources required for hyperparameter optimization.
4. Isolating the impact of target training data
Consider a molecular configuration, Xm. We are interested in contrasting the accuracy of a multitask model trained using a CA set and a model trained on a CAT set, which are identical except for the inclusion of a DFT prediction for Xm. The target set may contain a large number of molecular configurations, and we cannot make a fair comparison if we simultaneously include DFT predictions for all target molecules in the training set. To facilitate comparison, whenever for n = 1, …, N − 1, we train a new multitask model for each element of . Each model will train on only one configuration from the T set, and we will test that model by making predictions for the same configuration. We report the mean absolute error (MAE) over all these models and compare this value to the MAE of the predictions of a CA trained model for the same set of targets.
We implement this retraining procedure to investigate the impact of training set design. Retraining is not a prerequisite for leveraging all molecules in . All lower level data for target molecules that are available at training time can be simultaneously incorporated into one multitask model. See Subsection III D for further discussion.
B. Three-body interaction energy of water trimer
1. Multitask performance
We first compare the performance of multitask models with only one secondary task (N = 2) to the performance of a traditional GP model (N = 1). Figure 3 shows the error incurred by several implementations of each model when predicting the three-body (3-b) energy of water trimer configurations. We used the SCAN functional to generate the secondary task data. Pearson’s correlation coefficient between the primary and secondary task training data is 0.997, indicating a strong positive correlation. The three subplots in Fig. 3 cover six secondary training set configurations (C, CT, A, AT, CA, and CAT), labeled within the legends.
Water trimer case. Comparison of the MAE of the multitask method for a range of training set constructions (scatter points) to the MAE of a GP regression model trained only with CCSD(T) data (black line). Predictions made by the multitask method are informed by CCSD(T) and SCAN. The legends of the plots indicate the structure of the secondary training set according to the system presented in Fig. 1. The red arrows emphasize the order of magnitude improvement in data generation cost achieved by CAT multitask models, quantified in Table III.
Water trimer case. Comparison of the MAE of the multitask method for a range of training set constructions (scatter points) to the MAE of a GP regression model trained only with CCSD(T) data (black line). Predictions made by the multitask method are informed by CCSD(T) and SCAN. The legends of the plots indicate the structure of the secondary training set according to the system presented in Fig. 1. The red arrows emphasize the order of magnitude improvement in data generation cost achieved by CAT multitask models, quantified in Table III.
All models were tested on a target set of 320 molecular configurations. For multitask models, we consider seven possible sizes of the core set: |C| ∈ {5, 10, 20, 40, 80, 160, 320}. We use the same seven core sets as well as sets with |C| ∈ {640, 1280, 2560} to train the corresponding single task (N = 1) models. The performance of this single task reference is marked with a black line on each subplot. A larger |C| is considered in the single task case to demonstrate the cost required for these models to reach the accuracy of the best performing multitask models. We choose the additional training sets for multitask models so that |A| = r|C|, where r is allowed to take on each value in {0.5, 1, 2, 3, 4, 5, 6}. Whenever core data are included in the secondary training set, we only consider . Thus, we consider a total of seven different training set sizes for the C and CT structures and 49 different training set sizes for the A, AT, CA, and CAT structures. Each point on each subplot of Fig. 3 represents a model with a different training set size. The x axis indicates the cost of each model’s training set; details of our cost model are provided in Appendix B. The y axis gives the average value of the mean absolute error obtained from six models trained on a given training set size and structure. For each of these six implementations, available molecular configurations are randomly assigned to the C, A, and T sets without replacement.
For every secondary training set structure except C, the multitask method outperforms the single task reference. Including T data from DFT in the training set leads to better results than corresponding cases without T in every example. In particular, training with secondary T data can enable the multitask method to exhibit an accuracy comparable to a GP model that is an order of magnitude more expensive. Table III provides representative examples. The data points corresponding to the last line in the table are marked with arrows in Fig. 3. This analysis of MAE demonstrates that each statistical model can achieve chemical accuracy. For insight into a prediction for individual molecules, we report that the best performing multitask models produce predictions with correlation to the CCSD(T) target data.
Water trimer case. Data generation cost to achieve the threshold of accuracy given in the left column.
Accuracy (meV) . | Cost (millions of seconds) . | ||
---|---|---|---|
MAE . | Single task . | CA . | CAT . |
14 | 2.0 | 0.58 | 0.13 |
7 | 16.0 | 4.5 | 2.0 |
Accuracy (meV) . | Cost (millions of seconds) . | ||
---|---|---|---|
MAE . | Single task . | CA . | CAT . |
14 | 2.0 | 0.58 | 0.13 |
7 | 16.0 | 4.5 | 2.0 |
While the C results demonstrate no significant improvement on the single task model, both the A and CA results show steady improvement as the size of the A set increases. We can distinguish the impact of increasing |A| from the impact of increasing |C| in Fig. 3 because CCSD(T) is significantly more expensive than DFT. Thus, every jump in |C| leads to a significant increase in cost, while jumps in |A| for fixed |C| produce a comparatively slight increase in cost. The positive correlation between |A| and performance whenever suggests that for the multitask method to be beneficial, secondary training data must cover molecular space that is not included in the CCSD(T) training data. We mention that it is possible that the behavior of C models would improve if we considered task dependent features or different relationships between task regression functions. We do not investigate these alternatives further here; instead, we focus on the usual setting where secondary DFT data for additional systems is cheap to obtain.
The CT and CAT models are the best-performing multitask models, demonstrating very similar MAE per cost to each other. Furthermore, unlike the AT model, the CAT model does not show improvement as the size of the A set increases. It seems that for our setting, the combination of secondary C and T training data is so valuable to the model that any benefit from the A set is dwarfed. We can hypothesize that when the model has access to training data for the core molecules from both the primary and secondary tasks, it implicitly learns information about the differences between the tasks. Given secondary data on the target, it can leverage the captured relationship between tasks to make an accurate prediction for the target based on primary fidelity.
Figure 4 provides more insight into the relative roles of the training datasets and the inherent flexibility of the multitask framework. While Fig. 3 demonstrates that multitask models outperform single task models for several training set configurations, it does not directly compare the efficacy of the A and CA training structures. Such a comparison will illuminate the resilience of the multitask model to the loss of points in the training set that are common to primary and secondary tasks. We emphasize that this resilience is an advantage of the multitask method in comparison with Δ-learning, where no training is feasible without access to secondary C data. Each point on subplot (a) of Fig. 4 compares the average MAE of two multitask models: one with a CA secondary training set and the other with an identical A set but no secondary C data. As before, we are inferring the 3-b energy of water trimer configurations with only one secondary task. The closeness of the points to the reference y = x line shows that there is minimal effect on accuracy when the secondary C data are excluded from the training set. In fact, the median improvement in accuracy from using the CA training model instead of the A training model is 0.0007 eV. By comparison, the median improvement in accuracy from combining DFT data from the A set with CCSD(T) training data—that is, switching from a single task approach to a multitask method trained on an A set—is 0.004 eV.
Water trimer case. Effects of including core and additional data (defined in Table I) in secondary training sets. Each indigo point represents a comparison between two multitask models, where one model uses a training set that is a subset of the other. In (a), this comparison is between A and CA training configurations, and in (b), we compare AT and CAT configurations. For each pair of models compared, both the primary training data and the secondary data from the A set are identical. The y = x line is plotted in green for reference.
Water trimer case. Effects of including core and additional data (defined in Table I) in secondary training sets. Each indigo point represents a comparison between two multitask models, where one model uses a training set that is a subset of the other. In (a), this comparison is between A and CA training configurations, and in (b), we compare AT and CAT configurations. For each pair of models compared, both the primary training data and the secondary data from the A set are identical. The y = x line is plotted in green for reference.
Subplot (b) in Fig. 4 also compares models trained with and without C secondary data in the case where for all models. While the AT models perform well, there is a penalty for losing access to secondary C training data when secondary target data are also available. The median improvement in accuracy when switching from an AT model to a CAT model is 0.002 eV, a third of the 0.006 eV median improvement incurred from switching from a single task approach to an AT multitask model with the same CCSD(T) training cost. As we remarked previously, when the model trains on secondary C data, it likely gains information about the difference between the primary and secondary tasks. This information is most useful to the model when it also has access to secondary predictions for the target molecules.
2. Comparison with Δ-learning
Δ-learning is trained on differences between predictions of our quantity of interest, while the single and multitask methods are trained on absolute predictions of the quantity of interest. Table IV reports summary statistics for these datasets. In this case, both the mean and standard deviation of the differences between quantum chemistry methods are smaller by at least an order of magnitude than the corresponding statistics for any set of quantum chemistry predictions. Greater standard deviation in the distribution of target data tends to result in greater standard deviation in the prediction error of a statistical surrogate, which in turn leads to a larger MAE. Since Δ-learning is trained on differences to predict differences, it is thus expected that this method will produce a smaller mean absolute error than a method trained on absolutes to predict absolutes.
Water trimer case. Summary statistics for training data.
. | CCSD(T) (eV) . | SCAN (eV) . | PBE (eV) . |
---|---|---|---|
Absolute mean | 0.0205 | 0.0223 | 0.0225 |
Absolute median | 0.0126 | 0.0144 | 0.0155 |
Standard deviation | 0.0300 | 0.0316 | 0.0332 |
. | CCSD(T) (eV) . | SCAN (eV) . | PBE (eV) . |
---|---|---|---|
Absolute mean | 0.0205 | 0.0223 | 0.0225 |
Absolute median | 0.0126 | 0.0144 | 0.0155 |
Standard deviation | 0.0300 | 0.0316 | 0.0332 |
. | CCSD(T)−SCAN (eV) . | CCSD(T)−PBE (eV) . |
---|---|---|
Absolute mean | 0.0021 | 0.0037 |
Absolute median | 0.0013 | 0.0018 |
Standard deviation | 0.0033 | 0.0057 |
. | CCSD(T)−SCAN (eV) . | CCSD(T)−PBE (eV) . |
---|---|---|
Absolute mean | 0.0021 | 0.0037 |
Absolute median | 0.0013 | 0.0018 |
Standard deviation | 0.0033 | 0.0057 |
By modifying our choice of tasks, we can design a multitask model that trains on and predicts differences. This reinterpretation of tasks will be referred to as the multitask Δ option. Figure 5 compares this option to Δ-learning as well as our original implementation of the multitask model, trained on absolute predictions of the QoI. For each inference method, we report the average MAE obtained over six random assignments of the training data to the C, A, and T sets. We consider the same molecule set sizes that were considered in our previous experiments. The CAT secondary training set structure is used for the multitask absolute QoI models as well as the multitask Δ option models. As expected, the multitask absolute model is not as successful as those that train on and learn differences.
Water trimer case. The MAE of different inference methods: GP regression (line), multitask regression trained with a CAT secondary set (orange), two level Δ-learning (indigo), three level Δ-learning (light blue), and multitask inference on Δs (gold). The reported performance is the average of six draws from each training set. (a) Estimation of CCSD(T)-PBE. (b) Estimation of CCSD(T)-SCAN.
Water trimer case. The MAE of different inference methods: GP regression (line), multitask regression trained with a CAT secondary set (orange), two level Δ-learning (indigo), three level Δ-learning (light blue), and multitask inference on Δs (gold). The reported performance is the average of six draws from each training set. (a) Estimation of CCSD(T)-PBE. (b) Estimation of CCSD(T)-SCAN.
In subplot (a) of Fig. 5, the object of estimation is the difference between CCSD(T) and PBE computations for each molecular system. For the multitask Δ option, we let the primary task be exactly this difference. The secondary task is the difference between SCAN and PBE computations, which has a correlation coefficient of 0.782 with the primary task data. The results of the multitask Δ option are plotted in gold. Δ-learning predictions for CCSD(T)-PBE are plotted in indigo. Note that these Δ-learning models are the single task counterpart to the multitask Δ approach—they are a conventional GP model trained only on its primary task. Therefore, subplot (a) provides another example where the multitask method outperforms its single task counterpart, in this case with a smaller correlation parameter than we considered in Subsection IV B 1: ρ = 0.782.
Subplot (b) in Fig. 5 shows the performance of the same set of methods when the primary and secondary tasks do not have a significant correlation. In this case, the extra data from the secondary task does not contribute to improved inference from the primary task. Here, we estimate the difference between CCSD(T) and SCAN predictions. For all descriptions accompanying subplot (a), we reverse the roles of SCAN and PBE for subplot (b). The major difference between the cases is the correlation coefficient between the data for CCSD(T)-SCAN and PBE-SCAN: ρ = −0.271. The multitask models shown in subplot (b) demonstrate the limitations of choosing Pearson’s coefficient as the correlation parameter. The multitask Δ option and hML extension to Δ-learning do not perform as well as standard Δ-learning in this setting. Since standard Δ-learning is the single task counterpart to the multitask Δ option, this behavior suggests that the multitask method requires a stronger correlation between tasks to outperform the single task method. Furthermore, the subplot shows that the multitask Δ option outperforms hML for models trained on larger amounts of DFT data. Thus, hML seems to suffer more from problematic data than the multitask Δ option. Importantly, Δ-learning and its hML extension have no correlation parameters and cannot adjust the contribution of each level of theory. Although we have left a detailed optimization of the multitask Δ option’s correlation parameter for future work, we observe that ρ = 0 can always be chosen. Therefore, the fully optimized multitask Δ option will always perform at least as well as standard Δ-learning. Consequently, the multitask framework makes it possible to benefit from useful secondary datasets that standard Δ-learning cannot accommodate while offering more protection against uncorrelated data than hML models.
We conclude our comparison between hML and the multitask Δ option with a remark about scaling when training data are available for more levels of theory. Suppose that alongside PBE and CCSD(T), we want to include L additional DFT functionals y1, …, yL. Then, an hML model that uses all levels of theory would require that these levels be assigned some order—with CCSD(T) first and PBE last and y1, …, yL in between. The difference between neighboring pairs of methods is then learned. However, hierarchies in DFA accuracy can be difficult to predict and differ between problem settings. This necessary ordering can thus become rather arbitrary. By contrast, the multitask Δ option would incorporate the L new secondary tasks based on the differences to PBE, i.e., yℓ − PBE. Each of these secondary tasks can be directly related to the primary task through a correlation parameter, which thus eliminates the need to impose an arbitrary order.
C. Ionization potential of small organic molecules
Organic molecules case. Pearson’s correlation coefficients between each secondary task and the primary task dataset. The column headers give the DFA used to generate the secondary set.
. | PBE0_DH . | PBE0 . | PBE . | BLYP . |
---|---|---|---|---|
ρ | 0.970 | 0.967 | 0.954 | 0.955 |
. | PBE0_DH . | PBE0 . | PBE . | BLYP . |
---|---|---|---|---|
ρ | 0.970 | 0.967 | 0.954 | 0.955 |
We provide results for . These cases correspond—in order, from the top down—to the three rows in Fig. 6.
Organic molecule case. For different numbers of tasks (indicated by color), plots show MAE vs cost. Note that 10−0.3 eV is approximately half an electronvolt. For each row of plots, k determines the fraction of training data, which is shared between any pair of secondary sets. Models represented in column (a) were trained without DFT data from the T set, while models in column (b) had access to these data.
Organic molecule case. For different numbers of tasks (indicated by color), plots show MAE vs cost. Note that 10−0.3 eV is approximately half an electronvolt. For each row of plots, k determines the fraction of training data, which is shared between any pair of secondary sets. Models represented in column (a) were trained without DFT data from the T set, while models in column (b) had access to these data.
In Fig. 6, as for previous results, the reported mean absolute error is the average result of six tests produced by randomly assigning molecules to the C, A, and T sets. The left column of subplots compares results for multitask models trained using the CA structure to the single task reference, while the right column considers multitask models trained with the CAT structure. Translucent stripes encompass the results for models with a given number of tasks, N. For the two task cases, the secondary task data are generated with PBE, and for the three task cases, the secondary data come from PBE and PBE0. In all cases, secondary target data are only generated with PBE. Recall from Subsection IV A 4 that our training and testing procedures are designed to produce fair comparisons between models trained on CA and CAT training structures. Although we test the predictive performance of each training set structure on 320 different target molecular configurations, we do not simultaneously include all of these configurations in . Rather, we iterate over the target configurations and train a multitask model for each configuration such that in each case. We remark that in practice, it would not be necessary to train a new model for each molecule in the target set.
As we observed in the water trimer case, the multitask models consistently perform better than single task models of comparable cost. In particular, for relatively small training sets, the multitask method shows an improvement in performance when secondary target data are included in the training set. Furthermore, we report that for larger secondary training sets, the correlation between multitask predictions and the CCSD(T) target data exceeds 95%, but we defer a full evaluation of prediction quality at the level of individual molecular configurations to future work.
Figure 6 also informs us about the utility of including more than one secondary task in a model. Note that we consider the same combination of sizes for the C, A, and T sets regardless of the total number of tasks, N. In the top row, k = 0 implies that molecular configurations in the C and A sets are each assigned to exactly one secondary task, regardless of the total number of secondary tasks. Therefore, in this row, we can compare models implemented with different N but nearly identical training budgets. The results show that for sufficiently large training sets, given a fixed budget, there is a benefit to dividing the budget across multiple secondary tasks rather than employing only one secondary task. For smaller datasets, choosing N = 2 may lead to performance as good as or better than a larger number of tasks. In particular, the top subplot in column (b) suggests that models where N = 2 perform particularly well when given a relatively inexpensive training set of the CAT structure. Note that, in each model, only one of the secondary tasks is trained using T data: only is nonempty. Thus, in this setting, the decrease in performance when new tasks are added likely comes from siphoning off part of the computational budget to train tasks that do not have access to T data.
We can also consider whether it is useful to include the same molecular configuration in the training set of multiple secondary tasks. In the final row of Fig. 6, k = 1 implies that is the same set for all n = 1, …, N − 1. By contrast, in the middle row, to enforce , we assign 50% of the configurations in C and A to every secondary task, while the remaining configurations in these sets are distributed between the secondary tasks so that each is assigned to exactly one secondary task. The improved performance of the N = 3 and N = 5 cases in the last row compared with the middle row suggests that there is a benefit to training our model on multiple secondary predictions for the same molecule.
V. CONCLUSION
The present work has demonstrated that inference models see a general benefit from leveraging all available data. Furthermore, the use of the multitask framework can reduce the cost of generating entirely new training data by an order of magnitude and can also facilitate the opportunistic incorporation of existing datasets of heterogeneous quality into the training set. To our knowledge, our work is the first to explore an asymmetric multitask GP regression model in a materials science setting. Significantly, the multitask framework is not restricted to the choice of GP inference. Future work can preserve the same relationship between primary and secondary task submodels and investigate training each submodel with neural networks or other tools for regression.
Unlike other machine learning methods for constructing surrogates from ab initio data, the multitask model does not require all data sources to be ordered in a hierarchy of accuracy. In this work, the data that informed our primary task—generated with CCSD(T)—was the most accurate in our training set, but we were also able to leverage an arbitrary number of secondary datasets without any information on their accuracy relative to each other. This feature of the multitask method is particularly useful when employing datasets constructed from different density functional approximations. We found that multitask model performance improves as we increase the number of secondary sets in the training data, which include predictions for a molecule of interest. The influence of each secondary dataset is controlled by a correlation parameter, and we found that in most settings, an estimate of Pearson’s correlation coefficient from training data works well. The use of this estimate simplifies the implementation of the method. The multitask method also supports training on differences, and with an appropriate choice of correlation parameter, multitask difference models can perform at least as well as and often better than Δ models. To explore the framework’s flexibility, we demonstrated the success of the multitask method on diverse training dataset configurations.
ACKNOWLEDGMENTS
The authors would like to thank Yeongsu Cho, Chenru Duan, Dallas Foster, and Heather Kulik for insightful discussions. MIT SuperCloud and Lincoln Laboratory Supercomputing Center are acknowledged for providing the HPC resources that have contributed to the research results reported within this paper.65 This work was supported by the Department of Energy, National Nuclear Security Administration PSAAP-III program, under Award No. DE-NA0003965 and the National Science Foundation Graduate Research Fellowship under Grant No. 1745302. This research was also supported by the NCCR MARVEL, a National Centre of Competence in Research, funded by the Swiss National Science Foundation (Grant No. 205602).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
K. E. Fisher: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Funding acquisition (equal); Methodology (equal); Software (lead); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). M. F. Herbst: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Funding acquisition (equal); Methodology (equal); Supervision (equal); Visualization (equal); Writing – review & editing (equal). Y. M. Marzouk: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Methodology (equal); Supervision (equal); Visualization (equal); Writing – review & editing (equal).
DATA AVAILABILITY
All data supporting this work are openly available. All codes to reproduce figures as well as all DFT predictions for the three-body energy of water trimers can be found at https://doi.org/10.5281/zenodo.10387647 or https://github.com/kefisher98/Multitask_GP/. Water trimer configurations and CCSD(T) predictions for three-body energy are available at https://github.com/jmbowma/q-AQUA. The data that support the findings of this study for the small organic molecules example are available at https://doi.org/10.5281/zenodo.10215421.
APPENDIX A: LENGTHSCALE AND VARIANCE OPTIMIZATION
Water trimer case. Each point compares the test MAE of a GP model that uses the MLE variance to the test MAE of a model that uses a variance estimate obtained by minimizing MAE on the optimization set. Different data points are obtained by shuffling the optimization data, and color indicates the size of the optimization dataset. In plot (a), we consider a GP model for the primary regression task, while in plot (b), we consider a model , as defined in (12).
Water trimer case. Each point compares the test MAE of a GP model that uses the MLE variance to the test MAE of a model that uses a variance estimate obtained by minimizing MAE on the optimization set. Different data points are obtained by shuffling the optimization data, and color indicates the size of the optimization dataset. In plot (a), we consider a GP model for the primary regression task, while in plot (b), we consider a model , as defined in (12).
Each color in Fig. 7 represents a different optimization set size, M. We obtain multiple hyperparameter estimates for each M by shuffling the optimization dataset. For each shuffled realization of the data, we solve for estimates , , and . The plotted points compare MAE on a test dataset of 320 molecular configurations obtained using and to MAE for the same test dataset obtained using and . We can see that both approaches to estimating variance lead to good predictive performance, but use of generally produces lower MAE than use of for both fp and . Figure 7 also demonstrates that the optimization dataset size can be reduced without significant reductions in prediction accuracy although there is a greater variance in prediction quality for M ≤ 45.
Note that given , the expression for is closed form, and is found using . Thus, we only optimize for one hyperparameter at a time. If the optimization procedure reaches a threshold of 30 s, we halt the procedure, reshuffle the data, and restart. In the case that M = 150, we require an order of 10 s on a laptop to successfully obtain MLE estimates. MAE estimates are found in s for the primary task and s for the secondary task. The reported amounts include any time spent on false starts that time out at 30 s. By comparison, a single DFT calculation for 3-b energy using a SCAN functional approximation requires s on a cluster. Thus, compared with data generation, hyperparameter optimization accounts for a small portion of the computational budget of our numerical experiments. Our results in Fig. 7 suggest that we can construct successful statistical models even when the budget for generating the optimization dataset is much smaller than in our work. For any M, the cost of generating the optimization set will be dominated by CCSD(T) computations, which are necessary for all statistical models that we implement. Future work may fully evaluate the minimum cost required to produce hyperparameters that will achieve a specified threshold of accuracy and investigate strategies for constructing opportunistic optimization sets from existing data.
APPENDIX B: TRAINING DATA COST MODEL
For the water trimers example, we randomly selected ten molecular configurations from our dataset as the basis for our cost model. Using Psi4, we computed the 3-b interaction energy at the level of SCAN and CCSD(T) for each configuration. We applied a counterpoise correction to ameliorate the basis set superposition error. All computations for the water cost model were performed on the same Intel Xeon Platinum 8260 node of MIT SuperCloud (https://supercloud.mit.edu/systems-and-software). All computations for a given molecular configuration were performed on the same core. For this case study, we estimate that a CCSD(T) calculation incurs 37 times the cost of a DFT calculation based on the average prediction time of these methods for our example molecular configurations.
Several systems in the small organic molecules dataset contain more atoms than the water trimer configurations. The median number of electrons per system in the organic molecules set is 40, the minimum is 14, and the maximum is 50. We estimate the cost of the level of theory ℓ as the computational time required to obtain the total energy at this level of theory averaged over ten randomly sampled configurations with 36 electrons. Note that this estimate is conservative. CCSD(T) scales as , where Ne is the number of electrons in a system, while DFT scales as . Therefore, an estimate of the ratio between the cost of CCSD(T) and the cost of DFT based on 36 electrons will be smaller than the actual ratio for most molecular configurations in our dataset. Furthermore, our prediction of the ionization potential was performed using ΔCC (delta coupled cluster), which requires multiple CCSD(T) calculations to reach an IP prediction. Thus, while our reported results (Fig. 6) estimate that CCSD(T) calculations demand 244 times the expense of DFT, the true ratio between the cost of CCSD(T) and DFT is even larger. Consequently, the reported gains in efficiency for the multitask method are expected to be a lower bound on the true improvement.
APPENDIX C: SOAP PARAMETERS
To fix reasonable values for parameters required by SOAP (rcut, σatom, nmax, and lmax), we refer to established conventions as well as experimentation. The cutoff radius, rcut, controls the size of each local neighborhood, and a small value may lead to lost geometric insight. Unfortunately, surveys on feature construction62,63 hold that a large cutoff radius does not necessarily provide proportionate insight; increasing radii larger than 6–8 Å is rarely, if ever, useful. Our choice of σatom also influences our model of each atom’s neighborhood; it determines the lengthscale of the Gaussian shapes located on each of the surrounding atoms. The larger our choice of σatom, the greater the chance the Gaussian tails will slip past the rcut border. The current literature62 indicates that the best practice when working with the first three rows of the Periodic Table is to use σatom = 0.3 Å for systems with hydrogen and σatom = 0.5 Å for systems without. Finally, we choose the parameters nmax and lmax to control the size of our expansions of the local neighborhoods. Generally,62 choosing nmax = 12 with lmax = 6 is sufficient for high accuracy. In practice, the best values for nmax and lmax will be sensitive to the choice of radial basis set.
We selected reasonable candidate values for each SOAP parameter and tested the performance of the resultant features in Gaussian process regression. All features for this work were calculated using the DScribe Python package.66 Figure 8 shows the mean absolute error (MAE) obtained with different SOAP parameters when GP regression trained with PBE level data is used to predict ionization potential (IP). Tests on CCSD(T) and PBE0 data produced similar results. Our tests include all combinations of rcut ∈ {3, 4, 5}, σatom ∈ {0.3, 0.4, 0.5}, lmax ∈ {2, 4, 6, 8}, and nmax ∈ {6, 8, 10, 12}. Units for rcut and σatom are Å. We find that our accuracy is comparable to probabilistic predictions of IP found in the literature.61 Of the SOAP parameters, the choice of σatom has the largest impact on predictive performance.
Example of inference results for a range of SOAP parameters. The colorbar gives the mean absolute error of predictions made for the ionization potential of small organic molecules. Both rcut and σatom have units of Å.
Example of inference results for a range of SOAP parameters. The colorbar gives the mean absolute error of predictions made for the ionization potential of small organic molecules. Both rcut and σatom have units of Å.
For our numerical experiments in this work, we fixed σatom = 0.4 Å because this choice demonstrated the best overall performance across our tests. The other parameters are chosen to balance reasonable accuracy and cost. We fix both lmax and nmax at 8, and for the example of small organic molecules, we use rcut = 4 Å. For water trimers, we use rcut = 10 Å to capture all three components of water molecules.
APPENDIX D: GLOBAL FEATURES
When using SOAP to compare two molecular systems, we must make modeling choices to standardize our representation of entire systems. SOAP features are constructed based on the neighborhood of each atom in a system, so challenges arise when two systems contain different numbers of constituent atoms of elements. There are multiple strategies for constructing “global” features to capture entire systems.61 The simplest approach is to average the SOAP features for each atom in the system. It is possible that the loss of information from averaging may make it challenging to distinguish between the features of similar molecular systems. A variation of this approach averages the local features corresponding to each element contained in a molecular system.61 If we wish to compare systems A and B, and system B contains more elements than A, we insert SOAP features corresponding to isolated atoms of those excess elements in the representation of A. This model suggests that an atom of such an element does not interact with the rest of the system. A regularized entropy match (REMatch) strategy has also been proposed for constructing global features.61 This approach solves a regularized optimization problem to find the match between the sets of local features for each molecular system, which maximizes information entropy.
With the same set of small organic molecules that we use to test SOAP parameters, we investigate how well the global features constructed by different approaches agree. For pairs of molecules, we construct global features using each of the three approaches outlined in the previous paragraph. We represent the difference between each pair by squaring the inner product of their features. This procedure is the same as using a polynomial kernel of degree 2. By comparing the output of the kernel for different featurization strategies, we determine whether the strategies generally agree about which molecules are similar.
Figure 9 shows the correlation in kernel outputs corresponding to each pair of molecules for different global featurization strategies. Consistently, in our tests, we find a strong correlation (Pearson’s coefficient ) between the REMatch features and features computed by averaging all local representations. These two strategies are also both positively correlated with the “Average by Species” approach, which produces an averaged feature for each element in the system and inserts isolated atoms to represent elements not included in the system, but these correlations are not as strong as those between the REMatch and totally averaged features. We may see this result because the REMatch and totally averaged features share the same dimension, n, whereas features that are constructed by species-specific averaging have dimensions of 2n to 4n in the cases we considered. Because averaging features is much more computationally efficient than constructing REMatch features, we use this approach in our numerical experiments.
Correlation between different methods of constructing global features when calculating the distance between molecule pairs.60
Correlation between different methods of constructing global features when calculating the distance between molecule pairs.60