The integration of machine learning (ML) methods into chemical catalysis is emerging as a new paradigm for cost- and time-efficient reaction development. Although there have been several successful applications of ML in catalysis, the prediction of enantioselectivity (ee) remains challenging. Herein, we describe an ML workflow to predict the ee of an important class of catalytic asymmetric transformations, namely, the relay Heck (RH) reaction. A random forest ML model, built using quantum chemically derived, mechanistically relevant physical organic descriptors as features, predicts the ee remarkably well, with a low root mean square error of 8.0 ± 1.3. Importantly, the model is effective in predicting unseen variants of the asymmetric RH reaction. Furthermore, we predicted the ee for thousands of unexplored complementary reactions, including many leading to bioactive frameworks, by engaging different combinations of catalysts and substrates drawn from the original dataset. Our ML model, developed on the available examples, should assist in exploiting the fuller potential of asymmetric RH reactions through a priori predictions before actual experimentation, thereby largely bypassing the trial-and-error loop.
See Tables S1–S4 in the supplementary material for the full lists of alkenes, coupling partners, ligands, and additives, respectively.
See Sec. 4.3 of the supplementary material for additional details on the generation of synthetic data and their effect on ee prediction.
See Secs. 4.7 and 4.8 of the supplementary material where the effect of different train-test splits and different dataset sizes is, respectively, analyzed.
We wish to point out a pertinent technical aspect of an ensemble method such as the RF when applied to labeled data with unevenly distributed output values (class imbalance). In the RF approach, an ensemble of decision trees is considered (the number of trees is a hyperparameter that can be tuned for the problem at hand), and the output ee value at a given leaf node of a decision tree represents the average over all the reactions that belong to that leaf node (by virtue of the decision-making features of those reactions). In addition, the ee of each reaction is predicted by every tree in the forest, and the average over all such decision trees is taken as the final prediction. In essence, the RF-predicted values represent a two-level averaging. An immediate ramification of such averaging is an effective upper bound on the predicted ee of about 92. Thus, reactions whose true ee is >92 also get predicted as 92, presumably owing to the larger number of samples centered around this ee value in this study (Ref. 46). These characteristics point to the persisting challenges in applying ML to class-imbalanced data in chemical reactivity and also underscore the need for better documentation of low-ee reactions.
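The two-level averaging described above can be made concrete with a minimal sketch (synthetic stand-in data, not the paper's descriptors; scikit-learn is used here purely for illustration). The forest prediction for a sample is the mean of the per-tree predictions, each of which is itself a leaf-node average:

```python
# Illustrative sketch of two-level averaging in a random forest
# regressor: each tree predicts the mean ee of its leaf, and the
# forest averages over all trees. Synthetic data throughout.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # stand-in physical organic descriptors
y = np.clip(80 + 10 * X[:, 0] + rng.normal(scale=5, size=200), 0, 99)  # ee-like labels

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x_new = X[:1]
per_tree = np.array([tree.predict(x_new)[0] for tree in rf.estimators_])
forest_pred = rf.predict(x_new)[0]
# forest_pred equals the mean of the individual tree predictions
```

Because every prediction is an average of training-set ee values, the model cannot emit a value above the largest leaf averages, which is the saturation effect noted above.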
See Sec. 4.6 of the supplementary material for the error analysis in different ee intervals across all the ML models.
Details of train and test RMSEs are provided in Table S9 of the supplementary material for all the models employed.
See Table S8 for full details about the architecture of DNN.
In the RF method, scrambling the features between different samples (x-scrambling) while keeping the output (ee) values unchanged resulted in a test RMSE of 14.9. A test RMSE of 14.1 is obtained when only the output values were scrambled (y-scrambling), keeping the features of each sample intact. See Table S11 of the supplementary material. Similarly, x-scrambling led to much worse test performance for kNN (17.3), GB (14.2), and DNN (17.7), as did y-scrambling for kNN (17.1), GB (14.9), and DNN (18.6), where the numbers in parentheses refer to test RMSEs.
Replacing the chemical features with random numbers resulted in poorer test RMSEs for kNN (11.2), GB (10.3), and DNN (16.8).
Although the one-hot encoding representation predicted well for samples in the high-ee region (60–99), it failed in the low-ee (0–60) region. The trained model also shows greater over-fitting, with a train RMSE of 4.8. See Sec. 4.9.3 of the supplementary material.
It should also be noted that the one-hot encoding of the reactants cannot be extrapolated to unseen reactions consisting of newer reactants.
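This limitation can be seen directly with a standard one-hot encoder (a minimal sketch with hypothetical reactant labels, using scikit-learn for illustration; this is not the paper's encoding pipeline): a reactant absent from the training set has no column in the fitted encoding and therefore cannot be represented at all.

```python
# Why one-hot encodings cannot extrapolate: an unseen reactant has no
# column in the fitted encoding. Reactant names are hypothetical.
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="error")  # "error" is the default
enc.fit([["alkene_A"], ["alkene_B"], ["alkene_C"]])

try:
    enc.transform([["alkene_D"]])  # reactant never seen in training
    encoded = True
except ValueError:
    encoded = False  # unseen reactant cannot be encoded
```

Descriptor-based features avoid this problem because any new reactant can be assigned the same physical organic descriptors as the training compounds.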
Different feature reduction approaches (e.g., retaining only the charge, Sterimol, or partition function descriptors) provided test RMSEs of around 8.0. See Secs. 4.11 and 4.12 of the supplementary material for details of the various feature reduction approaches and the corresponding test RMSEs.
We have analyzed the important features affecting the ee. See Fig. S8 of the supplementary material for the list of the top 30 high-ranked features.
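One common way to obtain such a ranking (a sketch on synthetic stand-in data; we do not claim this is the exact procedure behind Fig. S8) is the impurity-based feature importance that a trained random forest exposes:

```python
# Ranking features by RF impurity-based importances on synthetic data
# in which feature index 2 deliberately carries almost all the signal.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = 50 + 20 * X[:, 2] + rng.normal(scale=2, size=200)  # feature 2 dominates

rf = RandomForestRegressor(random_state=0).fit(X, y)
ranked = np.argsort(rf.feature_importances_)[::-1]  # indices, most important first
```

Here `ranked[0]` recovers the dominant feature; on real descriptor sets, permutation importance is a common complementary check because impurity-based importances can be biased toward high-cardinality features.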
All the reactants/ligands involved in the reactions in set-1 (Tables S17–S19) and set-2 (Tables S21–S23) are listed separately in the supplementary material.
The new coupling partners in the out-of-bag set mainly contain differently substituted boronic acids, alkenyl triflates, and 2-indole triflates. There are 20 new coupling partners in set-1 and 9 in set-2. See Tables S18 and S22 of the supplementary material for their full list.
See Table S25 of the supplementary material for the variance difference before and after the inclusion of samples containing the 5-nitro-Pyrox ligand.
It should be noted that the RF is an ensemble of decision trees, where the decision to split a node is made on the basis of the information gain. The inclusion of a few additional samples in the training set can have a beneficial impact on the information gain, thereby leading to an improved trained RF model.
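For regression trees, this "information gain" amounts to the reduction in the squared-error impurity achieved by a split. A toy numerical illustration (values invented for clarity, not drawn from the dataset):

```python
# Gain of a candidate split in a regression tree: the drop in the
# sum of squared errors (SSE) from parent node to child nodes.
# Toy ee values only.
import numpy as np

parent = np.array([10.0, 20.0, 80.0, 90.0])  # ee values reaching a node
left, right = parent[:2], parent[2:]         # candidate split

def sse(values):
    """Sum of squared deviations from the node mean."""
    return float(((values - values.mean()) ** 2).sum())

gain = sse(parent) - (sse(left) + sse(right))  # 5000 - (50 + 50) = 4900
```

A few well-placed new samples can sharply increase the gain of a split that separates, say, low-ee from high-ee reactions, which is consistent with the improvement noted above.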
See Fig. S11 of the supplementary material for more details about the identities of the reactants/ligand and the corresponding ee involved in each pixel.