There is a perceived dichotomy between structure-based and descriptor-based molecular representations used for predictive chemistry tasks. Here, we study the performance, generalizability, and explainability of the quantum mechanics-augmented graph neural network (ml-QM-GNN) architecture as applied to the prediction of regioselectivity (classification) and of activation energies (regression). In our hybrid QM-augmented model architecture, structure-based representations are first used to predict a set of atom- and bond-level reactivity descriptors derived from density functional theory calculations. These estimated reactivity descriptors are combined with the original structure-based representation to make the final reactivity prediction. We demonstrate that our model architecture leads to significant improvements over structure-based GNNs in not only overall accuracy but also in generalization to unseen compounds. Even when provided training sets of only a couple hundred labeled data points, the ml-QM-GNN outperforms other state-of-the-art structure-based architectures that have been applied to these tasks as well as descriptor-based (linear) regressions. As a primary contribution of this work, we demonstrate a bridge between data-driven predictions and conceptual frameworks commonly used to gain qualitative insights into reactivity phenomena, taking advantage of the fact that our models are grounded in (but not restricted to) QM descriptors. This effort results in a productive synergy between theory and data science, wherein QM-augmented models provide a data-driven confirmation of previous qualitative analyses, and these analyses in turn facilitate insights into the decision-making process occurring within ml-QM-GNNs.
I. INTRODUCTION
The rationalization and prediction of reactivity trends is one of the core objectives of (theoretical) chemistry. Before advanced ab initio quantum mechanics (QM) computations became commonplace, a plethora of heuristic concepts and qualitative theory-inspired rules—usually tailored to a specific subclass of compounds or reactions—had already been proposed and developed to this end. Some iconic examples in this regard are the Woodward–Hoffmann rules for the prediction of reaction outcomes of pericyclic transformations,1 the Bell–Evans–Polanyi principle,2 and the hard–soft acid–base (HSAB) concept pioneered by Pearson.3
Later on, several competing, overarching theoretical frameworks emerged, all of which aspire to describe chemical reactions in a universally applicable manner. Some well-known examples of such frameworks are conceptual density functional theory (cDFT),4,5 the (molecular orbital-based) activation-strain model (ASM),6,7 and the valence bond (VB) reactivity model.8,9 At the core of each of these frameworks is the definition of a limited set of chemically meaningful quantities or descriptors. These probe, either directly or indirectly, the magnitude of different fundamental interactions (e.g., orbital-based or “soft–soft” vs electrostatic or “hard–hard” interactions in cDFT), which collectively characterize the chemical system and enable an insightful and internally consistent (qualitative) discussion of its reactivity. As these frameworks matured, a concerted effort has taken place to incorporate/embed the various previously proposed concepts and rules, albeit of limited scope, into them.10,11 Attempts have also been made to build bridges between the individual frameworks themselves, thus facilitating enhanced understanding.12–15
With the advent of machine learning (ML) and artificial intelligence methods, an entirely different approach to chemical reactivity prediction emerged, not constrained by any specific theoretical framework. Recent work in this area includes the prediction of reaction products,16–18 reaction yields,19–21 and bond dissociation energies.22,23 Instead of building physically motivated representations of the constituent molecules (e.g., of reactants, products, and catalysts), many ML approaches take advantage of a model’s ability to learn meaningful representations and start with simple structural descriptors/features. These include graph-based molecular representations using graph neural networks (GNNs),22–25 simpler molecular fingerprint26 representations,16,27 and even simplified molecular-input line-entry system (SMILES) representations.21,28,29 Such structure-based representations do not necessarily have a direct connection to reactivity and rely on the nonlinearity and expressivity of ML models to relate the structure to function.30 Nevertheless, given sufficient data to train these models, deep learning for reactivity prediction has been demonstrated to achieve accurate predictions of reaction products and/or energies.
Unfortunately, the setup of regular GNNs renders the decision-making process occurring inside them rather opaque. Rationalizing the predictions made by these networks, i.e., rendering them “explainable,”31 has largely been limited to brittle techniques for estimating the sensitivity of predictions to atom- and bond-level contributions.20,32–35 Therefore, these models have often been characterized as black boxes. Furthermore, structure-based GNNs are generally assumed to underperform in data-limited settings, as they must learn a meaningful representation “from scratch.”36,37 This represents a significant drawback of GNNs, as large datasets with thousands of data points are rare in the field of chemistry, especially when focusing on subtle reactivity questions.38 Other machine learning methods, e.g., kernel ridge regression (KRR)39,40 or Gaussian process (GP) regression37 in combination with molecular representations such as bag-of-bonds (BoB),41 FCHL19,42 or the spectrum of London and Axilrod–Teller–Muto (SLATM) potentials,43 are generally assumed to perform better in the face of data scarcity; even then, several thousand data points may be needed to train a model to an acceptable level of accuracy.37,39,40
One strategy to mitigate the drawbacks of these data-hungry methods is to represent molecules with functional descriptors that have a more direct (and linear) relationship with their reactivity. Instead of working within a specific theoretical framework, reactivity is modeled through statistical methods, but descriptors/features used as input are inspired by the chemistry and physics underlying the investigated reactivity problem. Champions of this approach are Sigman et al., among others, who have employed multivariate linear regression to relate sophisticated electronic and steric descriptors to complex properties such as enantioselectivity,44 and Zahrt et al., who used support vector machines (SVMs) and feed-forward neural networks for the same task.45 Recently, Ahneman et al. demonstrated the prediction of reaction yields of C–N cross-coupling reactions with a random forest model by selecting reaction-specific descriptors, starting from ∼4000 data points obtained via high-throughput experimentation.20 Other recent examples of this strategy can be found in the work of Beker et al.,46 Li et al.,47 and Jorner et al.48
While descriptor-based methods may require less data, be more generalizable, i.e., achieve an improved performance on classes of compounds not present in the training set,46 and enable at least some explainability/interpretability compared to the use of universal/non-specific representations,49 the selection of suitable descriptors is a non-trivial and problem-specific matter.38 Even more importantly, obtaining these chemically relevant descriptors often requires a dataset-specific computational workflow, which creates a bottleneck that significantly hampers the practical applicability of this strategy.38
In this work, we build upon a unifying approach proposed by Guan et al.,50 which aims to combine the advantages of structure-based GNNs with those of models based on expert-guided descriptors. Instead of feeding computationally expensive, task-specific descriptors directly to a machine learning model, we start from graph-based/structural input features but predict, as an intermediate step, a set of atom- and bond-level QM descriptors prior to the final reactivity prediction. Taking this approach, it becomes possible to construct a QM-based representation on-the-fly at minimal cost.
We assess the model performance on computational and experimental datasets (focusing on competing E2 vs SN2 reactions51 and aromatic substitution reactions,50 respectively) and find that the resulting model architecture exhibits excellent accuracy in a data-limited regime. We further evaluate the model’s ability to generalize to structures not seen during training using non-random data splits and observe a comparable improvement in performance. Most importantly, since our ml-QM-GNN models base their final reactivity predictions on a representation partially comprising QM descriptors, we are able to build a bridge to the traditional theoretical reactivity frameworks and explain the neural network’s decision-making process in terms of traditional chemistry concepts. Overall, our work underscores that machine learning techniques and conceptual models are not mutually exclusive approaches but are able to benefit each other in a synergistic manner.
II. COMPUTATIONAL METHODS
A schematic overview of the complete QM-augmented neural network architecture used in this work is presented in Fig. 1 (cf. Sec. S1 of the supplementary material for an in-depth discussion of the individual network branches).
First, the simplified molecular-input line-entry system (SMILES) representations of compounds involved in the reaction are parsed into graph-based representations using RDKit,52 and structural descriptors (atomic number, formal charge, ring status, bond order, etc.) are calculated for each heavy atom and bond. The structure-based representation is then used as input for a multitask GNN for QM descriptor prediction, based on a directed message passing neural network (D-MPNN) encoder, which has been adopted without modification from Guan et al.50 (details related to the setup and training of this D-MPNN encoder are included in Sec. S1). Note that this multitask GNN part of our network shares several characteristics with the stand-alone DeepMoleNet package and AIMNet, which have been developed to provide QM descriptors on-the-fly for qualitative reactivity analyses.53,54
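As an illustration of this featurization step, the short sketch below extracts a handful of such structural atom- and bond-level features with RDKit; the exact feature set and helper functions used in this work differ, so this is only meant to make the nature of the input representation concrete.

```python
from rdkit import Chem

def atom_features(atom):
    """A few of the structural atom-level features mentioned in the text."""
    return [
        atom.GetAtomicNum(),         # atomic number
        atom.GetFormalCharge(),      # formal charge
        int(atom.GetIsAromatic()),   # aromaticity
        int(atom.IsInRing()),        # ring status
        atom.GetTotalNumHs(),        # attached hydrogens
    ]

def bond_features(bond):
    """A few of the structural bond-level features mentioned in the text."""
    return [
        bond.GetBondTypeAsDouble(),  # bond order (1.0, 1.5, 2.0, 3.0)
        int(bond.GetIsConjugated()),
        int(bond.IsInRing()),
    ]

mol = Chem.MolFromSmiles("O=[N+]([O-])c1ccccc1")          # nitrobenzene, as an arbitrary example
atom_matrix = [atom_features(a) for a in mol.GetAtoms()]  # heavy atoms only
bond_matrix = [bond_features(b) for b in mol.GetBonds()]
```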
The set of descriptors predicted by the D-MPNN encoder can be subdivided into two main categories: bond descriptors, i.e., natural population analysis (NPA) bond orders55 and bond lengths, and atom-centered descriptors, i.e., (Hirshfeld) atomic charges,56 as well as (atom-condensed) nucleo- and electrophilic Fukui functions,5 and nuclear magnetic resonance (NMR) shielding constants.57 In the finite-difference approximation,58 the electrophilic Fukui function on site i is defined as

f_i^+ = q_i(N) − q_i(N + 1),

where q_i(N) and q_i(N + 1) are the (Hirshfeld) partial charges on atom i for the corresponding (N)- and (N + 1)-electron wave functions evaluated in the optimized N-electron geometry. The nucleophilic Fukui function in its turn is defined as

f_i^− = q_i(N − 1) − q_i(N),

where q_i(N) and q_i(N − 1) are the (Hirshfeld) partial charges on atom i for the corresponding (N)- and (N − 1)-electron wave functions evaluated in the optimized N-electron geometry.
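Both condensed Fukui functions thus follow from three sets of Hirshfeld charges evaluated at the fixed N-electron geometry. A minimal numerical sketch (the charge values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical Hirshfeld charges for a three-atom fragment, all evaluated
# at the optimized N-electron geometry (illustrative values only).
q_N       = np.array([-0.12, 0.30, -0.18])   # N-electron system
q_N_plus  = np.array([-0.52, 0.05, -0.53])   # (N + 1)-electron system
q_N_minus = np.array([ 0.23, 0.65,  0.12])   # (N - 1)-electron system

f_electrophilic = q_N - q_N_plus       # f_i^+: distribution of an added electron
f_nucleophilic  = q_N_minus - q_N      # f_i^-: distribution of a removed electron

# each condensed Fukui function sums to ~1 over all atoms
assert np.isclose(f_electrophilic.sum(), 1.0) and np.isclose(f_nucleophilic.sum(), 1.0)
```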
The bond descriptors are converted to a vector representation through the application of a radial basis function (RBF) expansion and relayed to a separate Weisfeiler–Lehman network (WLN) branch, which combines them with the initial, structure-based representation in a convolutional embedding.59 The atom-centered descriptors are kept separate and similarly post-processed, i.e., they are scaled and then turned into a vector representation through RBF expansion.
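The RBF expansion itself is a standard featurization trick: each scalar descriptor value is smeared out over a fixed grid of Gaussian basis functions, yielding a vector that downstream layers can exploit more flexibly. A minimal sketch, with the grid range, number of centers, and width chosen arbitrarily rather than taken from this work:

```python
import numpy as np

def rbf_expand(values, centers, gamma):
    """Expand scalar descriptor values (shape [n]) into RBF feature vectors (shape [n, k])."""
    values = np.asarray(values, dtype=float)[:, None]      # [n, 1]
    centers = np.asarray(centers, dtype=float)[None, :]    # [1, k]
    return np.exp(-gamma * (values - centers) ** 2)

# e.g., expand (scaled) predicted partial charges on a grid spanning [-1, 1]
charges = [-0.12, 0.30, -0.18]
centers = np.linspace(-1.0, 1.0, 20)
charge_rbf = rbf_expand(charges, centers, gamma=10.0)      # shape (3, 20)
```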
The learned representation emerging from the WLN and the expanded atomic QM descriptor representation emerging from the D-MPNN branch of the network are subsequently concatenated, after which the concatenated representation is passed through a dense activation layer followed by a global attention mechanism60 to capture the influence of distant parts of the reacting system. Because we focus on molecule- or reaction-level prediction tasks in this work, we aggregate these representations into one final feature vector by sum-pooling over the (hypothetically) reacting atoms. This global feature vector is transformed once more in a single-layer network to produce the barrier heights/activation energies in the case of a regression task and a regioselective preference in the case of a classification task (vide infra).
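The sketch below paraphrases this readout in plain NumPy with random weights and arbitrary dimensions, omitting the global attention step for brevity; it is purely illustrative and not the implementation used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_atoms, d_struct, d_qm, d_hidden = 12, 50, 40, 64

h_struct = rng.normal(size=(n_atoms, d_struct))    # learned WLN atom embeddings
h_qm     = rng.normal(size=(n_atoms, d_qm))        # RBF-expanded atomic QM descriptors
h = np.concatenate([h_struct, h_qm], axis=1)       # concatenated atom-level representation

W = rng.normal(size=(d_struct + d_qm, d_hidden))   # dense layer (weights would be learned)
h = np.maximum(h @ W, 0.0)                         # ReLU activation

reacting_atoms = [0, 3]                            # indices of the (hypothetically) reacting atoms
pooled = h[reacting_atoms].sum(axis=0)             # sum-pooling over reacting atoms

w_out = rng.normal(size=(d_hidden,))
prediction = pooled @ w_out                        # e.g., a predicted activation energy
```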
The regular GNN, used as the primary baseline in the discussions below, follows an analogous architecture as the ml-QM-GNN, except that it does not contain the QM descriptor network branch, i.e., the WLN operates on the original structural features alone and the attention mechanism takes only the learned features as input (cf. Sec. S1 of the supplementary material).
A. Datasets
We focus the evaluation of our ml-QM-GNN model on two publicly available datasets: one computational and one experimental (summary statistics, characterizing the distribution of the individual sets, are included in Sec. S3). The first dataset comprises computed stationary points along the potential energy surface for competing E2 and SN2 reactions in the gas phase, recently published by von Rudorff et al.51 In total, four distinct nucleophiles (H−, F−, Cl−, and Br−), three distinct leaving groups (F−, Cl−, and Br−), and permutations of five potential substituents (H, NO2, CN, CH3, and NH2) on an ethyl-based scaffold were considered [Fig. 2(a)]. As indicated by the authors, the substituents were selected to (i) maximize electronic effects and (ii) minimize steric hindrance, which makes this dataset ideally suited for our ml-QM-GNN model based on electronic QM descriptors.
To construct a regression task, 3647 barrier heights calculated at the DF-LCCSD/cc-pVTZ//MP2/6-311G(d) level of theory61–67 were extracted (1286 corresponding to E2 pathways and 2361 to SN2).40 An appropriate input for our GNNs was constructed by converting the 3D reactant complex geometry for each reaction into a 2D SMILES representation using xyz2mol.68 Additional details about the data pre-processing can be found in Sec. S4.
To construct a classification task, we extracted both an E2 and SN2 transition state (TS) from 791 unique reaction systems (Sec. S5). These data were used for training classification models to predict whether the E2 or SN2 pathway is kinetically favored.
The second dataset we examine is a purely experimental one consisting of regioselective aromatic C–X substitution and C–H functionalization reactions from the Pistachio database, to which plausible side products corresponding to regiochemical alternatives, identified through template extraction and application,69 were added.70 This dataset was originally curated by Guan et al.50 Since the data in Pistachio are only available to license-holders, Guan et al. filtered out a subset of 3242 data points for which the reactions are also present in the USPTO public database. Here, we examine the multi-way classification performance when predicting the regiochemical preference for this subset of reactions [Fig. 2(b)]. In other words, the model is trained to assign the highest probability to the correct/“true” product among the various regiochemical alternatives provided as input for each data point.
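Concretely, each data point thus contributes a small set of candidate products, and the model is evaluated on whether the recorded product receives the highest (softmax) probability; a minimal sketch with made-up candidate scores:

```python
import numpy as np

def rank_candidates(scores):
    """Softmax over the model scores of the candidate (regiochemical) products of one reaction."""
    scores = np.asarray(scores, dtype=float)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# one recorded product (index 0 by convention here) plus two template-generated alternatives
scores = [2.3, 0.4, -1.1]                  # illustrative model outputs
predicted, probs = rank_candidates(scores)
correct = (predicted == 0)                 # counts toward the reported top-1 accuracy
```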
As we are mainly interested in understanding the effect of QM-augmentation under data-scarce conditions, models applied to this filtered dataset were primarily trained on a sample of only 200 data points. In each fold and iteration considered, these data points were randomly selected from the original, full training set. The small standard deviations in accuracy observed across different iterations (vide infra) suggest that the model is rather insensitive to the identity of the specific reactions sampled. Learning curves, visualizing how the accuracy of the model evolves as more data points are included during training, can be found in Sec. S6 of the supplementary material.
III. RESULTS AND DISCUSSION
A. Accuracy—E2/SN2 dataset
As a first step, the performance of our ml-QM-GNN model was assessed by comparing the accuracy obtained for the barrier height/activation energy prediction for the competing E2/SN2 reactions with the (structure-based) GNN baseline model. Initially, we split up the data according to the respective reaction type and performed three 5-fold cross-validations (CV) on each. In every iteration, the number of labeled data points considered for the construction of the model, i.e., the combination of training and validation set, was limited to 125, 250, 500, and 1000 in the case of E2, and 225, 450, 900, and 1800 points in the case of SN2, matching the prior evaluation by Heinen, von Rudorff, and von Lilienfeld.40 The average mean absolute errors (MAEs) for our WL-based GNN models, obtained in this manner, are presented in the first two columns of Tables I and II.
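For concreteness, this evaluation protocol can be sketched as follows, with a ridge regressor standing in for the GNN (the validation split, used for early stopping in the real workflow, is left unused here):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import Ridge   # stand-in for the GNN, purely for illustration

def repeated_cv_mae(X, y, n_labeled, n_repeats=3, seed=0):
    """n_repeats repeats of 5-fold CV with the train+validation pool capped at n_labeled points."""
    maes = []
    for r in range(n_repeats):
        kf = KFold(n_splits=5, shuffle=True, random_state=seed + r)
        for pool_idx, test_idx in kf.split(X):
            rng = np.random.default_rng(seed + r)
            pool_idx = rng.permutation(pool_idx)[:n_labeled]              # cap the labeled pool
            train_idx, val_idx = train_test_split(pool_idx, test_size=0.2,
                                                  random_state=seed)      # 4:1 train:validation
            model = Ridge().fit(X[train_idx], y[train_idx])               # the GNN would be trained here
            maes.append(np.mean(np.abs(model.predict(X[test_idx]) - y[test_idx])))
    return float(np.mean(maes))

# usage with random placeholder data (the real X holds the reaction representations)
X = np.random.default_rng(1).normal(size=(2361, 10))
y = np.random.default_rng(2).normal(loc=30.0, scale=8.0, size=2361)
print(repeated_cv_mae(X, y, n_labeled=225))
```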
TABLE I. Averaged MAEs (kcal/mol) of the activation energy predictions for the SN2 reactions as a function of the number of labeled data points (training + validation).

| Labeled points | Baseline GNN | ml-QM-GNN | Chemprop | BoB^a | SLATM^a | FCHL19^a | One-hot encoding^a | Multivariate regression |
|---|---|---|---|---|---|---|---|---|
| 225 (180 + 45) | 9.07 ± 0.04 | 3.61 ± 0.14 | 6.71 ± 0.08 | 4.89 | 4.44 | 3.80 | 3.53 | 6.43 |
| 450 (360 + 90) | 8.89 ± 0.13 | 3.28 ± 0.04 | 4.01 ± 0.02 | 4.28 | 3.87 | 3.43 | 2.80 | 6.33 |
| 900 (720 + 180) | 8.61 ± 0.07 | 2.97 ± 0.03 | 3.23 ± 0.07 | 3.78 | 3.21 | 3.11 | 2.42 | 6.40 |
| 1800 (1440 + 360) | 8.49 ± 0.03 | 2.76 ± 0.01 | 2.85 ± 0.02 | 3.49 | 2.92 | 2.87 | 2.14 | 6.43 |
^a Taken directly from the work of Heinen, von Rudorff, and von Lilienfeld.40
TABLE II. Averaged MAEs (kcal/mol) of the activation energy predictions for the E2 reactions as a function of the number of labeled data points (training + validation).

| Labeled points | Baseline GNN | ml-QM-GNN | Chemprop | BoB^a | SLATM^a | FCHL19^a | One-hot encoding^a | Multivariate regression |
|---|---|---|---|---|---|---|---|---|
| 125 (100 + 25) | 9.03 ± 0.18 | 4.08 ± 0.06 | 7.49 ± 0.10 | 4.67 | 4.43 | 4.01 | 3.53 | 6.43 |
| 250 (200 + 50) | 8.78 ± 0.35 | 3.24 ± 0.08 | 5.78 ± 0.18 | 4.07 | 3.87 | 3.42 | 3.12 | 6.27 |
| 500 (400 + 100) | 8.18 ± 0.04 | 2.91 ± 0.02 | 3.29 ± 0.09 | 3.71 | 3.21 | 3.01 | 2.69 | 5.90 |
| 1000 (800 + 200) | 8.04 ± 0.17 | 2.65 ± 0.02 | 2.75 ± 0.06 | 3.27 | 2.92 | 2.75 | 2.40 | 6.07 |
^a Taken directly from the work of Heinen, von Rudorff, and von Lilienfeld.40
The comparison between the MAEs obtained for the QM-augmented model and for the baseline model reveals that the inclusion of the QM descriptors in the model architecture results in a 5–6 kcal/mol lower error. The accuracy also improves more rapidly as more labeled data points are included during training/validation.
In addition to the WL-based GNNs, we also considered the performance of a powerful alternative structure-based neural network architecture, Chemprop,71 recently developed by Yang et al., for the same data splits, cf. the third column in Tables I and II as well as Sec. S7. Remarkably, Chemprop, which makes use of a D-MPNN72 encoder for its convolutional embedding instead of a WL one, is significantly more data efficient than our baseline GNN for the considered task: Whereas for the smallest numbers of labeled data points, Chemprop does only marginally better (by 1–2 kcal/mol) than the regular WL-based GNN, the advantage of the former grows rapidly as the number of labeled data points considered during training increases. Nevertheless, our ml-QM-GNN model outperforms Chemprop across the entire range of training/validation-set sizes considered. For the lowest training/validation-set sizes, the difference in accuracy is stark; the MAE obtained for the QM-augmented, WL-based GNN is almost half the MAE obtained for Chemprop. When 1000 or more data points are considered during training, the advantage of our ml-QM-GNN dwindles to about 0.1 kcal/mol.
To contextualize the performance of our models, we also include kernel ridge regression (KRR) models benchmarked by Heinen et al. for four different (global) representations—BoB, SLATM, FCHL19, and one-hot encoding, respectively (in which every substituent site, as well as the nucleophile and leaving group site, is assigned a bit vector spanning all the possible substituent/nucleophile/leaving group species)—in Tables I and II.40
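As a point of reference, such a one-hot KRR baseline can be assembled in a few lines of scikit-learn; the kernel, regularization strength, and the handful of reactions and barriers below are arbitrary choices for illustration, not the exact setup of Ref. 40:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import OneHotEncoder

# each reaction is described by its categorical "slots": nucleophile, leaving group,
# and the substituents on the ethyl scaffold (illustrative entries and barriers only)
reactions = [
    ["[H-]",  "[Cl-]", "H",   "NO2", "H",  "CH3"],
    ["[F-]",  "[Br-]", "CN",  "H",   "H",  "NH2"],
    ["[Cl-]", "[F-]",  "CH3", "CH3", "H",  "H"],
    ["[Br-]", "[Cl-]", "NH2", "H",   "CN", "H"],
]
barriers = np.array([12.4, 28.1, 35.7, 30.2])   # kcal/mol, made up

X = OneHotEncoder(handle_unknown="ignore").fit_transform(reactions).toarray()
krr = KernelRidge(kernel="laplacian", alpha=1e-6, gamma=0.05).fit(X, barriers)
prediction = krr.predict(X[:1])
```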
From Tables I and II, it is straightforward to discern that the QM-augmented model outperforms BoB, SLATM, and FCHL19 in combination with KRR for even the smallest number of labeled data points (125 or 225 points). In the absence of QM augmentation, however, the baseline GNN performs significantly worse than any of these methods across the board, as does Chemprop for all but the biggest training/validation set sizes considered. Further underscoring the excellent performance of the ml-QM-GNN model is the lack of hyperparameter optimization and the use of the mean square error loss during training, rather than MAE directly (cf. Sec. S1). Nevertheless, it should also be noted that simple one-hot encoding combined with KRR still outperforms the QM-augmented model here. However, as will be demonstrated below, one-hot encoding—as well as other models that do not base their representation on physical principles—suffers from an inherent limitation related to generalization, i.e., out-of-sample predictions, which severely limits their appeal with respect to our ml-QM-GNN in practical applications.
As an additional check, we constructed multivariate linear models based on the QM descriptor values outputted by our surrogate model, cf. the final columns in Tables I and II, as well as Sec. S8 of the supplementary material. The accuracies of these models are significantly worse than those of the (non-linear) ml-QM-GNN model, and the performance does not increase as more data points are added to the training set. Furthermore, univariate analysis reveals that none of the descriptors correlate particularly well with the activation energies: The maximal R2 values obtained are 0.24 and 0.44 for the SN2 and E2 reactions, respectively. These findings confirm the established insight from the cDFT/VB perspectives that, even though these electronic descriptors determine chemical reactivity to a great extent (vide infra), there is no simple, universally valid linear relationship between their magnitude and the height of reaction barriers, i.e., non-linearity is required to fully exploit the reactivity patterns encoded in these physically motivated descriptors.15 Additional evidence for the latter point can be found in Sec. S13 of the supplementary material: (non-linear) GNNs that start from an exclusively QM-based representation do recover a reasonable accuracy, although they still underperform with respect to the full ml-QM-GNNs.
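A minimal sketch of this kind of uni-/multivariate analysis, run on synthetic data standing in for the per-reaction QM descriptor values (the feature construction is a placeholder):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))     # e.g., charges, Fukui values, and NMR shifts at the reacting sites
y = 30 + 4 * X[:, 0] - 2 * X[:, 3] ** 2 + rng.normal(scale=2.0, size=500)   # synthetic target

multivariate_r2 = LinearRegression().fit(X, y).score(X, y)
univariate_r2 = [LinearRegression().fit(X[:, [j]], y).score(X[:, [j]], y) for j in range(X.shape[1])]
```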
Since our WL-based GNN models involve pooling over reacting atoms, they enable simultaneous treatment of distinct reaction modes for the same reactant/reagent system (in contrast to the reaction-specific KRR models). Hence, we were able to combine the data for the E2 and SN2 reactions and train a common model for both sets of barrier heights. In Fig. 3, correlation plots between the “true” (computed) and predicted activation energies (Ea), aggregated across all test sets sampled during three 5-fold CVs, are presented.
The correlation between the predicted and computed values is weak for our baseline GNN model [Fig. 3(a); R2 = 0.65]; the mean MAE and root mean square error (RMSE) from three replicates is 8.4 and 10.1 kcal/mol, respectively. The QM-augmented analog achieves much stronger correlation [Fig. 3(b); R2 = 0.96] and an average MAE and RMSE of 2.9 and 3.9 kcal/mol.
B. Generalizability—E2/SN2 regression dataset
Next, we considered the ability of our regression models to generalize to new compounds not present in the training data. While one-hot encoding representations have proven to be well suited for reactivity problems focused on interpolation,73,74 they do not perform well in out-of-sample predictions.49
To assess model generalizability, training/validation and test sets were selectively sampled so that for one of the four nucleophiles in the dataset, all its reactions would consistently be part of the test set, and consequently, this nucleophile is not “seen” by the model during training. The random train and validation set sampling (in a 3:1 ratio) was iterated five times, and the resulting predictions were aggregated. In Fig. 4, the correlation plots for each of the different “held-out” nucleophiles are presented.
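A sketch of this selective-sampling split, assuming each reaction simply carries a nucleophile label (the field names and example labels are hypothetical):

```python
import numpy as np

def holdout_nucleophile_split(nucleophiles, held_out, seed=0, val_fraction=0.25):
    """Send every reaction of `held_out` to the test set; split the rest 3:1 into train/validation."""
    nucleophiles = np.asarray(nucleophiles)
    test_idx = np.flatnonzero(nucleophiles == held_out)
    rest = np.random.default_rng(seed).permutation(np.flatnonzero(nucleophiles != held_out))
    n_val = int(round(val_fraction * len(rest)))
    return rest[n_val:], rest[:n_val], test_idx   # train, validation, test indices

# e.g., hold out all hydride reactions (illustrative labels)
labels = ["[H-]", "[F-]", "[Cl-]", "[Br-]", "[H-]", "[Cl-]", "[Br-]", "[F-]"]
train_idx, val_idx, test_idx = holdout_nucleophile_split(labels, held_out="[H-]")
```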
The regular, WL-based GNN models barely manage to reproduce the qualitative trend in the computed activation energies; they significantly overestimate the quantitative values for the barriers when either the hydride or fluoride nucleophile is held out, whereas the barriers are vastly underestimated when chloride and bromide are held out during training (mean RMSEs range from 9 to 18 kcal/mol). KRR in combination with one-hot encoding40 performs even worse than the regular GNN on this task; upon selective sampling, RMSEs between 9 and 20 kcal/mol are obtained for this model architecture (cf. Table S5 of the supplementary material). Remarkably, the FCHL19 and SLATM representations do not fare much better than one-hot encoding (mean RMSEs range from 7 to 35 kcal/mol for SLATM and from 5 to 24 kcal/mol for FCHL19, cf. Table S6). Finally, Chemprop also fails to produce reasonable predictions (mean RMSEs between 9 and 16 kcal/mol, cf. Table S11 of the supplementary material), underscoring that limited generalizability is a universal issue for models that do not base their representation on physical principles. The ml-QM-GNN, on the other hand, obtains decent correlations for each of the models trained with held-out nucleophiles [Figs. 4(e)–4(h)], and the quantitative agreement between model predictions and true values is much better: For hydride and fluoride, the mean RMSEs amount to 8–9 kcal/mol, whereas for chloride and bromide, the mean RMSEs amount to a reasonable 5–6 kcal/mol.
These results constitute an unequivocal demonstration that designing a GNN model so that it constructs a QM-based representation before the final reactivity prediction not only improves the model accuracy in this data-limited setting but also improves the model’s ability to generalize to unseen nucleophiles, i.e., nucleophiles not found in any example in either the training or the validation set.
C. Explainability—E2/SN2 dataset
Finally, we aimed to gain some insights into how the model reaches its decisions/reactivity predictions. It is not always apparent what exactly deep learning models are learning and how they generalize to new, previously unseen data points, which makes their performance less predictable in prospective settings. We performed a set of ablation experiments, where we controlled the number and type of atom-centered QM descriptors that are used to supplement the structural representation. Specifically, we masked either the nucleophilic and electrophilic Fukui indices or the Hirshfeld partial charges and NMR shielding constants (the effect of inclusion of individual descriptors in the model is concisely discussed in Sec. S17 of the supplementary material).
The decision to consider the effect of these pairs of descriptors simultaneously is inspired by the core principle emerging from physical organic chemistry/cDFT that the interactions between reacting species can generally be subdivided into two main types: “hard–hard” or electrostatic interactions and “soft–soft” or (frontier) orbital interactions.3,75 Fukui functions, defined as the (atom-condensed) distribution of an added and removed electron to the system, probe the latter, whereas atomic charges and NMR shielding constants, which reflect the (de)shielding of the nuclei caused by intramolecular electron donation/withdrawal, probe the former.5,15,76 It should be noted that this theory-informed dichotomy of the considered atom-condensed descriptors can also readily be retrieved from purely data-driven feature analysis: In both datasets considered in this study, partial charges and NMR shielding constant values correlate with each other to a significant extent, whereas neither of those consistently correlates with either of the Fukui function values, cf. Sec. S18 of the supplementary material.
Table III contains the accuracies obtained after three 5-fold CVs for the baseline GNN model and the models with one or both sets of estimated QM descriptors. For the baseline model, an average RMSE of 10.1 kcal/mol was obtained; for the full QM-augmented model [“ml-QM-GNN (full)”], the average RMSE was reduced to 3.9 kcal/mol. Remarkably, the ablated model that bases its reactivity predictions solely on atomic charges and NMR shielding constants as QM descriptors [“ml-QM-GNN (charge + NMR)”] recovers the exact same accuracy as the full QM-augmented model (RMSE = 3.9 kcal/mol); the ablated model that only makes use of the Fukui functions [“ml-QM-GNN (Fukui)”] on the other hand achieves the same accuracy as the baseline model (RMSE = 10.0 kcal/mol). When the training set size during these fivefold CVs is reduced to 200 data points, similar results are obtained: The ml-QM-GNN (Fukui) model only marginally improves the accuracy of the model relative to the baseline, whereas the ml-QM-GNN (charge + NMR) model gets exceedingly close to the ml-QM-GNN (full) model (cf. Sec. S19). These findings suggest that electrostatic/hard–hard interactions are the main drivers of the variations observed in the dataset and that Fukui functions, i.e., frontier orbital interactions, are not particularly relevant in this regard.
TABLE III. Averaged RMSEs of the activation energy predictions (regression task) and accuracies of the E2/SN2 preference predictions (classification task) obtained after three 5-fold CVs for the baseline GNN and the (ablated) QM-augmented models.

| Model | RMSE (kcal/mol) | Accuracy (%) |
|---|---|---|
| Baseline GNN | 10.13 ± 0.14 | 77.0 ± 0.8 |
| ml-QM-GNN (full) | 3.92 ± 0.01 | 89.0 ± 0.6 |
| ml-QM-GNN (Fukui) | 10.02 ± 0.02 | 77.1 ± 0.1 |
| ml-QM-GNN (charge + NMR) | 3.93 ± 0.02 | 88.9 ± 0.6 |
Further evidence that Fukui functions do not play a significant role in the decision-making process of our QM-augmented network is obtained when the fully trained ml-QM-GNN model is applied to all data points after averaging Fukui function values across all atoms. In this setup, the model predictions are barely affected (R2 = 0.96 between the original predictions and the predictions with Fukui information averaged out). Averaging out the charges and NMR shielding constants on the other hand scrambles the predictions made by the model entirely (Fig. S6).
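This perturbation test is straightforward to replicate for any trained model: one descriptor channel is replaced by its dataset-wide mean and the resulting predictions are correlated with the original ones. In the sketch below, the model call and descriptor layout are hypothetical placeholders; only the helper functions are concrete:

```python
import numpy as np

def average_out(descriptors, channel):
    """Replace one atom-level descriptor channel by its mean over all atoms in the dataset."""
    perturbed = descriptors.copy()
    perturbed[..., channel] = descriptors[..., channel].mean()
    return perturbed

def squared_correlation(y_ref, y_new):
    """R^2 between reference and perturbed predictions (squared Pearson correlation)."""
    return float(np.corrcoef(y_ref, y_new)[0, 1] ** 2)

# hypothetical usage (`model` and `descriptors` would come from the trained ml-QM-GNN pipeline):
# preds_ref  = model.predict(graphs, descriptors)
# preds_flat = model.predict(graphs, average_out(descriptors, channel=FUKUI_CHANNEL))
# squared_correlation(preds_ref, preds_flat)   # high when the averaged channel is uninformative
```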
It is worthwhile to note here that our uni-/multivariate analysis (vide supra) already contained clues hinting in the same direction: While—as noted before—none of the descriptors were found to correlate particularly well with the activation energies in these linear models, the hard–hard descriptors did collectively outperform the soft–soft descriptors by a significant margin (cf. Sec. S8 of the supplementary material).
The same trends discussed above emerge for the classification models constructed for the second curated dataset, also extracted from the E2/SN2 reaction data (Table III). Again, an improved accuracy is observed when the model bases its predictions on the full set of predicted QM descriptors (77%–89% accuracy; cf. Sec. S15 for some representative failures to predict the correct preference for the ml-QM-GNN model). Removing the Fukui function information from the QM-augmented model does not affect the accuracy in a meaningful way, whereas removing the charge and NMR shielding constant data again causes the accuracy to drop to the baseline level.
Our findings about the relative importance of electrostatic/“hard–hard” vs (frontier) orbital/“soft–soft” interactions, emerging in both regression and classification tasks, are perfectly in line with a recent qualitative VB/cDFT analysis undertaken by one of the authors of the present work.77 In this analysis, it was demonstrated that the modulation of the E2/SN2 competition is primarily driven by the electrostatic interactions present in the E2-TS: the formation of a strongly stabilizing array of point-charges, i.e., (−)–(+)–(−), in this geometry tends to push its energy below that of the SN2-TS; in the case that the point-charges in this array—and thus the Coulombic interaction—are weaker, the SN2-pathway dominates [cf. Fig. 5(a)].
The observation that the Fukui function values are not really helpful to a GNN aiming to learn this mechanistic competition could also have been readily anticipated from the analysis presented in the same qualitative study: The ethylhalide substrates on which the nucleophiles attack generally carry two relatively close-lying, unoccupied frontier orbitals that are delocalized over both the α-carbon and the hydrogen on the β-carbon [Fig. 5(b)];77 providing information about only the lowest-lying of these orbitals through the (electrophilic) Fukui function is not very informative.
The discussion above demonstrates that combining a qualitative analysis rooted in conceptual reactivity frameworks with our QM-augmented machine learning approach leads to a productive synergy for the considered dataset of competing E2/SN2 reactions: On the one hand, the qualitative insights provide context to explain and understand the decision/prediction-making process of the network. At the same time, the results emerging from our ablation study can also be considered as an indirect, data-driven confirmation of the qualitative reactivity analysis.
D. Explaining the predicted regioselectivity of aromatic substitution reactions
To further explore the explainability/interpretability of ml-QM-GNN models, we revisited the classification dataset containing regiochemical data for aromatic substitution reactions, originally compiled by Guan et al. to demonstrate the potential of this model architecture in data-limited settings.50 As mentioned in Sec. II, we select only those data points that are publicly available in the USPTO database, resulting in a curated dataset of 3242 reactions. In Table IV, the accuracies achieved by the baseline GNN, the ml-QM-GNN, and the two ablated models on the curated dataset (training set limited to 200 data points) are presented. Once more, we observe that the full QM-augmented model significantly outperforms the baseline GNN. More interestingly, it can now be observed that each set of descriptors leads to a significant improvement in accuracy over the baseline, but both are needed to maximize performance.
TABLE IV. Accuracies of the regioselectivity predictions for the curated aromatic substitution dataset (training set limited to 200 data points).

| Model | Accuracy (%) |
|---|---|
| Baseline GNN | 71.4 ± 1.3 |
| ml-QM-GNN (full) | 86.5 ± 0.7 |
| ml-QM-GNN (Fukui) | 84.4 ± 1.1 |
| ml-QM-GNN (charge + NMR) | 82.2 ± 1.3 |
Even though the differences are rather small, the results presented in Table IV suggest that for this class of reactions, the Fukui functions are slightly more informative than charges and NMR shielding constants, which is in contrast to what was observed for the E2/SN2 dataset in Sec. III C. These findings can be straightforwardly reconciled with the qualitative physical organic reactivity models that have been constructed and popularized throughout the years: Aromatic compounds are strongly delocalized, and hence, orbital interactions/changes in delocalization stabilization as probed through the Fukui functions are generally considered to be the main driving force shaping their reactivity.14,78,79 At the same time, it has been underscored in recent years that electrostatics cannot be neglected—in particular under conditions favoring kinetic control—since atomic charges tend to become more pronounced as the compounds involved in the reaction approach,9 enabling a significant Coulombic stabilization/destabilization of the wave function in the transition state region.15,76
To obtain a better understanding of how the (ablated) QM-augmented models reach their decisions/predictions, we constructed a set of confusion matrices comparing the respective prediction accuracies for all test sets considered during the first cross-validation (Fig. 6).
These matrices show that there are relatively few data points (97 and 119) where an ablated model makes the correct prediction, while the model with access to all descriptors reaches the wrong conclusion; the full QM-augmented model appears to mainly rectify incorrect predictions by the ablated models (cf. the upper two square matrices in Fig. 6). Additionally, the lower-right confusion matrix in Fig. 6 demonstrates that many incorrect predictions by the two ablated models are distinct: The number of points for which there is disagreement between the models (245 + 351) is greater than the off-diagonal elements in the upper matrices.
We then considered whether incorrect prediction of regioselectivity preference is connected to failures of either the “hard–hard”/electrostatic or “soft–soft”/(frontier) orbital criterion. The “hard–hard”/electrostatic criterion is considered to be fulfilled when the reacting site on the aromatic substrate corresponds to the site carrying the highest partial negative charge (in the case of electrophilic attack) or partial positive charge (in the case of nucleophilic attack) since this would maximize the Coulombic stabilization upon approach between the reacting species. The “soft–soft”/(frontier) orbital criterion is considered to be fulfilled when the reacting site on the substrate corresponds to the site on which the nucleophilic Fukui function is most concentrated (in the case of electrophilic attack) or the site on which the electrophilic Fukui function is most concentrated (in the case of nucleophilic attack)—which would maximize the orbital interaction upon approach between the reacting species.
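Both criteria reduce to simple argmax/argmin rules over the predicted atom-level descriptors of the candidate sites; a sketch with made-up descriptor values:

```python
import numpy as np

def hard_hard_site(partial_charges, attack="electrophilic"):
    """Most negatively charged site for electrophilic attack; most positive for nucleophilic attack."""
    charges = np.asarray(partial_charges, dtype=float)
    return int(np.argmin(charges)) if attack == "electrophilic" else int(np.argmax(charges))

def soft_soft_site(fukui_nucleophilic, fukui_electrophilic, attack="electrophilic"):
    """Site with the largest nucleophilic (electrophilic attack) or electrophilic (nucleophilic attack) Fukui value."""
    values = fukui_nucleophilic if attack == "electrophilic" else fukui_electrophilic
    return int(np.argmax(np.asarray(values, dtype=float)))

# illustrative descriptor values for three candidate sites of one aromatic substrate
charges, f_nuc, f_elec = [-0.08, -0.15, -0.03], [0.10, 0.05, 0.22], [0.07, 0.12, 0.04]
print(hard_hard_site(charges), soft_soft_site(f_nuc, f_elec))   # the two criteria may disagree
```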
Since nitration reactions and halogenation reactions involving N-substituted succinimides are electrophilic without exception, are straightforward to recognize based on a SMILES representation, and collectively cover approximately half of the dataset (1534 out of 3242 data points), we decided to focus exclusively on these reaction types for this part of our analysis. A confusion matrix comparing the predictions obtained through naive evaluation of the individual physical organic criteria (vide supra) with the predictions made by the (ablated) models during fivefold CV is presented in Table S14. The predictions made by the models tend to adhere to the respective criteria: The vast majority of data points for which both criteria point to the correct reactive site are classified correctly by the ablated and the full QM-augmented GNNs, whereas reactions incorrectly classified only by the ablated model that exclusively considers Fukui function values disproportionately violate the “soft–soft” criterion (and vice versa: The reactions that are only incorrectly classified by the ml-QM-GNN (charge + NMR) disproportionately violate the “hard–hard” criterion, Fig. 7). Only a handful of reactions for which neither criterion is fulfilled are classified correctly.
Putting all this together, one can conclude that the trained QM-augmented GNN model with all descriptors included appears to be able to balance the relative importance of the electrostatic and (frontier) orbital criteria, whereas the ablated models are generally biased toward (a) assigning too much importance to the interaction type that they have access to and (b) adhering to the rudimentary assignment based on the “soft–soft”/“hard–hard” criterion, even when this assignment is incorrect.
As such, our ml-QM-GNN model essentially refines the balancing act that is implicitly part of any cDFT-based qualitative reactivity analysis. As indicated above, most human theoreticians tend to assume that the regioselectivity of aromatic reactions is primarily determined by the Fukui functions, i.e., the soft–soft interactions dominate, and they will often ignore the electrostatics altogether (or assume that these are in sync with the Fukui function values).79 Our analysis (cf. Sec. S21) shows that this is not an unreasonable approximation to make for this dataset; selecting the site to which the soft–soft criterion points—ignoring the hard–hard criterion completely—results in an accuracy of ≈80%. By relaxing this rigid guiding rule and replacing it by a more complex classification function centered around the same QM descriptors, our models already reach accuracies exceeding 86% when a mere 200 training points are used (Table IV) and reaching up to 93%–94% when the training set size is expanded to 1945 training points (cf. Sec. S6 of the supplementary material for learning curves visualizing the gradual rise in accuracy upon training set expansion).
E. Failures in predicting regioselectivity of aromatic substitution reactions
Finally, we also took a closer look at some of the reacting systems for which neither criterion is fulfilled and for which all our GNN models, trained on only 200 example reactions, fail to make the correct prediction. While it appears impossible to assign a specific reason for each failure, there are some recurring patterns.
As an example, all criteria and models consistently appear to fail for most nitration reactions involving aniline analogs, i.e., aromatic substrates with an NH2-substituent [Fig. 8(a)]. For this type of reaction, nitration in ortho- or para-position is expected, but the “true” reactive site corresponds to the meta-position. This failure is a reflection of the fact that nitration reactions are usually performed in strongly acidic reaction media to promote the protonation of nitric acid into the active nitronium ion species, which inherently results in simultaneous protonation of the NH2-substituent, rendering this substituent meta-directing instead of para-directing [Fig. 8(c)].
Another recurring failure is observed for nitration reactions involving halogenated phenols. In this type of reaction, both the criteria and models predict the reaction to occur in ortho-/para-position relative to the halogen substituent, whereas the “true”/recorded reactive site is consistently the ortho-position relative to the OH-substituent [Fig. 8(b)]. This failure can most likely be attributed to a mechanistic crossover: Instead of a conventional electrophilic aromatic substitution mechanism, reactions between phenolic compounds and nitric acid/nitronium ions have been demonstrated to involve a (pure) single electron transfer step, followed by radical recombination [Fig. 8(d)].80,81 This change in the mechanism can be expected to impact the balance in directing strengths of the halogen- and hydroxy-substituents with respect to regular electrophilic aromatic substitution reactions, resulting in a modification of the regiochemistry.
Since both the physical organic criteria and our GNN models are agnostic to reaction conditions and “concealed” reaction steps, they are unable to capture these modifications to the directing character of the NH2 and OH-substituents under data-scarce conditions. One can anticipate that by adding more and more training data, the model may learn to overrule the “regular” QM descriptor assignment when NH2/OH-substituents and nitric acid are simultaneously present in the reacting system, but since only 200 data points were used for the initial training here, there are simply not enough examples to capture these particular patterns. Indeed, we find qualitative evidence for this assertion in the case of the nitration reactions of halogenated phenols: With 1945 training points, the correct regiochemical predictions are generally recovered by the ml-QM-GNN model for this class of failures (cf. Sec. S22 of the supplementary material).
IV. CONCLUSIONS
In this work, the performance, explainability, and generalizability of the QM-augmented GNN (ml-QM-GNN) model architecture have been assessed for a few distinct predictive chemistry tasks. Our models achieve a significantly improved accuracy over an analogous, conventional GNN baseline and generalize markedly better to unseen compounds, particularly in a data-limited regime. Even when only a couple hundred labeled data points are available, our ml-QM-GNN models are competitive with traditional “low-data” KRR models.
Importantly, since the predictions made by our models are rooted in (predicted) QM descriptors, it becomes possible to build bridges between their predictions and the existing physical organic frameworks, developed to qualitatively analyze chemical reactivity. Through a series of selective descriptor ablation experiments, we have demonstrated that for competing E2/SN2 reactions, Fukui function values, i.e., information about soft–soft interactions, are of limited value; charges and NMR shielding constants are the main drivers of the improvement in model accuracy with respect to the baseline GNN model. These findings can be rationalized through consideration of a recent qualitative valence bond/cDFT analysis of this mechanistic competition.77
For aromatic substitution reactions, we observed that the ml-QM-GNN model appears to make its decision/predictions in a similar manner as human theoreticians trained in physical organic chemistry/cDFT, i.e., by considering soft–soft and hard–hard interactions separately through their corresponding local descriptors. What makes our models excel with respect to a naive conceptual treatment is their ability to fine-tune the relative importance of the individual physical organic criteria based on subtle patterns present in the data, resulting in a more complex classification function and a significantly higher accuracy.
Overall, our analysis underscores that a productive interplay between machine learning models and qualitative reactivity analysis is possible: On the one hand, qualitative insights into the considered reactivity problem provide context to explain and understand the decision-making process of the network. Additionally, they can provide clues about the suitability to include specific QM descriptors in the neural network. At the same time, the results emerging from machine learning models augmented with QM descriptor information can provide an indirect, data-driven confirmation of a qualitative reactivity analysis.
SUPPLEMENTARY MATERIAL
See the supplementary material for an in-depth technical description of the WL-based GNN model architectures, technical details related to the overall training process and cross-validation, summarizing statistics to characterize the data distributions, the effect of removing “duplicate” data points from the E2/SN2 dataset, a note regarding the data curation procedure for the E2/SN2 classification dataset, learning curves for the aromatic substitution dataset, computational details related to the application of Chemprop to the E2/SN2 dataset, multi- and univariate analysis for the E2/SN2 dataset, selective sampling results for one-hot encoding in combination with KRR, selective sampling results for Chemprop, RMSEs and standard deviations obtained for the WL-based GNN and ml-QM-GNN during selective sampling, performance of a GNN with an exclusively QM-based representation as input for the E2/SN2 dataset, the effect of averaging out the respective descriptors for the full QM-augmented model trained on the E2/SN2 dataset, representative examples of failures of the ml-QM-GNN model for the E2/SN2 classification, performance of the full ml-QM-GNN model, trained exclusively on either the E2 or SN2 data, on the held-out reaction type, the effect of inclusion of individual QM descriptors in the ml-QM-GNN, correlation analysis of the QM descriptors included in the ml-QM-GNN, methodology for the DFT calculations in Fig. 5, a confusion matrix comparing the classifications by the QM-augmented models to the physical organic criteria for the aromatic substitution dataset, and recovery of the correct regiochemical predictions as the training set increases for the aromatic substitution dataset.
ACKNOWLEDGMENTS
The authors acknowledge the Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) consortium for funding and thank both Yanfei Guan and Esther Heid for helpful discussions.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
DATA AVAILABILITY
The data that support the findings of this study are openly available in the GitHub repository containing the code associated with this project at https://github.com/coleygroup/QM-augmented_GNN.