Molecular-orbital-based machine learning (MOB-ML) provides a general framework for the prediction of accurate correlation energies at the cost of obtaining molecular orbitals. The application of Nesbet’s theorem makes it possible to recast a typical extrapolation task, training on correlation energies for small molecules and predicting correlation energies for large molecules, into an interpolation task based on the properties of orbital pairs. We demonstrate the importance of preserving physical constraints, including invariance conditions and size consistency, when generating the input for the machine learning model. Numerical improvements are demonstrated for different datasets covering total and relative energies for thermally accessible organic and transition-metal containing molecules, non-covalent interactions, and transition-state energies. MOB-ML requires training data from only 1% of the QM7b-T dataset (i.e., only 70 organic molecules with seven and fewer heavy atoms) to predict the total energy of the remaining 99% of this dataset with sub-kcal/mol accuracy. This MOB-ML model is significantly more accurate than other methods when transferred to a dataset comprising of 13 heavy atom molecules, exhibiting no loss of accuracy on a size intensive (i.e., per-electron) basis. It is shown that MOB-ML also works well for extrapolating to transition-state structures, predicting the barrier region for malonaldehyde intramolecular proton-transfer to within 0.35 kcal/mol when only trained on reactant/product-like structures. Finally, the use of the Gaussian process variance enables an active learning strategy for extending the MOB-ML model to new regions of chemical space with minimal effort. We demonstrate this active learning strategy by extending a QM7b-T model to describe non-covalent interactions in the protein backbone–backbone interaction dataset to an accuracy of 0.28 kcal/mol.

## I. INTRODUCTION

The calculation of accurate potential energies of molecules and materials at affordable cost is at the heart of computational chemistry. While state-of-the-art *ab initio* electronic structure theories can yield highly accurate results, they are computationally too expensive for routine applications. Density functional theory (DFT) is computationally cheaper and has, thus, enjoyed widespread applicability. However, DFT is hindered by a lack of systematic improvability and by an uncertain quality for many applications.

In recent years, a variety of machine learning approaches have emerged, which promise to mitigate the cost of highly accurate electronic structure methods while preserving accuracy.^{1–30} While these machine learning methods share similar goals, they differ in the representation of the molecules and in the machine learning methodology itself. Here, we will focus on the molecular-orbital-based machine learning (MOB-ML) approach.^{15,18,19} The defining feature of MOB-ML is its framing of learning highly accurate correlation energies as learning a sum of orbital pair correlation energies. These orbital pair correlation energies can be individually regressed with respect to a feature vector representing the interaction of the molecular orbital pairs. Without approximation, it can be shown that such pair correlation energies add up to the correct total correlation energy for single-reference wave function methods. Phrasing the learning problem in this manner has the advantage that a given pair correlation energy, and, hence, a given feature vector, is independent of the molecular size (after a certain size threshold has been reached) because of the inherent spatial locality of dynamic electron correlation. Consequently, operating in such an orbital pair interaction framework converts the general extrapolation task of training on small molecules and predicting on large molecules into an interpolation task of training on orbital pairs in a small molecule and predicting on the same orbital pairs in a large molecule.

In this work, we address challenges introduced by operating in a vectorized molecular orbital pair interaction framework (Sec. II). We show how changes in the feature design affect the performance and transferability of MOB-ML models within the same molecular family (Sec. IV A) and across molecular families (Secs. IV B–IV C). We probe these effects on relative-energy and total-energy predictions for organic and transition-metal containing molecules, and we investigate the applicability of MOB-ML to transition-state structures and non-covalent interactions.

## II. THEORY

MOB-ML predicts correlation energies based on information from the molecular orbitals.^{15,18,19} The correlation energy *E*^{corr} in the current study is defined as the difference between the true total electronic energy and the Hartree–Fock (HF) energy for a given basis set. Without approximation, the correlation energy is expressed as a sum over correlation energy contributions from pairs of occupied orbitals *i* and *j*,^{31}

Electronic structure theories offer different ways of approximating these pair correlation energies. For example, the second-order Møller–Plesset perturbation theory (MP2) correlation energy is^{32}

where *a*, *b* denote virtual orbitals, *F* is the Fock matrix in the molecular orbital basis, and $ia\Vert jb$ is the anti-symmetrized exchange integral. We denote a general repulsion integral over the spatial coordinates **x**_{1}, **x**_{2} of molecular orbitals *p*, *q*, *m*, *n* following the chemist’s notation as

The evaluation of correlation energies with post-HF methods such as MP2 or coupled-cluster theory [including CCSD(T)] involves computations that exceed the cost of HF theory by orders of magnitude. By contrast, MOB-ML predicts the correlation energy at negligible cost compared to a HF calculation by machine-learning the map

where **f**_{ij} denotes the feature vector into which information on the molecular orbitals is compiled. MOB-ML is a so-called Δ machine learning (Δ-ML) model, which, in this context, means that the difference between an energy calculated with a cheap (e.g., HF) and an expensive [e.g., CCSD(T)] quantum chemical method is predicted. Other machine learning models that fall into this category are DeepHF^{25} (based on a HF calculation), NeuralXC^{27} (based on a DFT calculation), and OrbNet^{29} (based on a semi-empirical calculation). Another class of machine learning models are direct ML models such as SchNet^{7} and ANI-1^{9} that predict the target energy only based on the nuclear coordinates of a molecule, which is usually considerably cheaper than a quantum chemical calculation. However, Δ-ML models in general, and MOB-ML in particular, are usually more data efficient than direct machine learning methods, which we highlight, e.g., in Fig. 4.

Following our previous work,^{18} we define a canonical order of the orbitals *i* and *j* by rotating them into gerade and ungerade combinations [see Eq. (7) in Ref. 18], creating the rotated orbitals $\u0129$ and $j\u0303$. The feature vector **f**_{ij} assembles information on the molecular orbital interactions: (i) orbital energies of the valence-occupied and valence-virtual orbitals *F*_{pp}, (ii) mean-field interaction energy of valence-occupied and valence-occupied orbitals and of valence-virtual and valence-virtual orbitals *F*_{pq}, (iii) Coulomb interaction of valence-occupied and valence-occupied orbitals, of valence-occupied and valence-virtual orbitals, and valence-virtual and valence-virtual orbitals $[Kpp]qq$, and (iv) exchange interaction of valence-occupied and valence-occupied orbitals, of valence-occupied and valence-virtual orbitals, and valence-virtual and valence-virtual orbitals $[Kpq]pq$. We note that all of these pieces of information enter either the MP2 or the MP3 correlation energy expressions, which helps to motivate their value within our machine learning framework. We remove repetitive information from the feature vector and separate the learning problem into the cases where (i) *i* ≠ *j* where we employ the feature vector as defined in Eq. (5) and (ii) *i* = *j* where we employ the feature vector as defined in Eq. (6),

Here, the index *k* denotes an occupied orbital other than *i* and *j*. For blocks in the feature vector that include more than one element, we specify a canonical order of the feature vector elements. In our previous work,^{18} this order was given by the sum of the Euclidean distances between the centroids of orbitals $\u0129$ and *p* and between the centroids of orbitals $j\u0303$ and *p*. In the current work, we introduce a different strategy to sort the feature vector elements (Sec. II A), we modify the protocol with which we obtain the feature vector elements associated with $\u0129,j\u0303$ (Sec. II B), and we revise our feature vector elements to ensure size consistency (Sec. II C). We provide a conceptual description of the changes in the feature set below, and we give the full definition of the feature vector elements and the criteria according to which the feature elements are ordered in Tables S3–S6 in the supplementary material.

### A. Defining importance of feature vector elements

Careful ordering of the elements of the feature vector blocks is necessary in the current work because Gaussian process regression (GPR) is sensitive to permutation of the feature vector elements. Furthermore, the application of a Gaussian process requires that the feature vectors be of fixed length.^{33}

Given the near-sighted nature of dynamical electron correlation, it is expected that only a limited number of orbital-pair interactions are important to predict the pair correlation energy with MOB-ML. To construct the fixed-length feature vector, a cutoff criterion must be introduced.^{15} For some feature vector elements, a robust definition of importance is straightforward. The spatial distance between the orbital centroids *i* and *a* is, for example, a reliable proxy for the importance of the feature vector elements ${[Kii]aa}$ of the feature vector **f**_{i}. However, the definition of importance is less straightforward for feature vector elements that involve more than two indices. The most prominent example is the ${[Kab]ab}$ feature vector block of **f**_{ij}, which contains the exchange integrals between the valence-virtual orbitals *a* and *b* and which should be sorted with respect to the importance of these integrals for the prediction of the pair correlation energy *ε*_{ij}. It is non-trivial to define a spatial metric that defines the importance of the feature vector elements ${[Kab]ab}$ to predict the pair correlation energy *ε*_{ij}; instead, we employ the MP3 approximation for the pair correlation energy,

where $tijab$ denotes the T-amplitude. Although we operate in a local molecular orbital basis, the canonical formulae are used to define the importance criterion; if we consider orbital localization as a perturbation (as in Kapuy–Møller–Plesset theory^{34}), the canonical expression is the leading order term. The term we seek to attach an importance to, ${[Kab]ab}$, appears in the first term of Eq. (7), and all integrals necessary to compute this term are readily available as (a combination of) other feature elements, i.e., we do not incur any additional significant computational cost to obtain the importance of the feature vector elements.

The way in which we determine the importance of the ${[Kab]ab}$ elements here is an example of a more general strategy that we employ, in which the importance is assigned according to the lowest-order perturbation theory in which the features first appear. Similar considerations have to be made for each feature vector block, all of which are specified in detail in Tables S3 and S4 in the supplementary material.

### B. Orbital-index permutation invariance

The Fock, Coulomb, and exchange matrix elements that comprise MOB features are naturally invariant to rotation and translation of the molecule. However, some care is needed to ensure that these invariances are not lost in the construction of symmetrized MOB features. In particular, rotating the valence-occupied orbitals into gerade and ungerade combinations leads to an orbital-index permutation variance for energetically degenerate orbitals *i*, *j* because the sign of the feature vector elements $Mj\u0303p$, where $Mj\u0303p$ may be $Fj\u0303p$, $[Kj\u0303j\u0303]pp$, or $[Kj\u0303p]j\u0303p$,

depends on the arbitrary assignment of the indices *i* and *j*. To rectify this issue, we include the absolute value of the generic feature vector element *M* in the feature vector instead of the signed value,

The corresponding equation,

is already orbital-index permutation invariant because we chose *M*_{pq} (*p* ≠ *q*) to be positive.^{18}

### C. Size consistency

Size consistency is the formal property by which the energy of two isolated molecules equals the sum of their dimers upon infinite separation.^{35,36} In the context of MOB-ML, satisfaction of this property requires that the contributions from the diagonal feature vectors are not affected by distant, non-interacting molecules and that

for contributions from the off-diagonal feature vectors. To ensure that MOB-ML exhibits size-consistency without the need for explicit training on the dimeric species, the following modifications to the feature vectors are made.

#### 1. Diagonal feature vector

The feature vector as defined in Eq. (6) contains three blocks whose elements are independent of the orbital *i*, {*F*_{ab}}, ${[Kaa]bb}$, and ${[Kab]ab}$. The magnitude of these feature vector elements does not decay with an increasing distance between the orbital *i* localized on the molecule *I* and an orbital (for example, *a*) localized on the molecule *J*. To address this issue, we multiply these feature vector elements by their estimated importance (see Sec. II A) so that they decay smoothly to zero. The other feature vector elements decay to zero when the involved orbitals are non-interacting albeit at different rates; we take the cube of feature vector elements of the type ${[Kpp]qq}$ to achieve a similar decay rate for all feature vector elements in the short-to medium-range, which facilitates machine learning.

#### 2. Off-diagonal feature vector

We modify the off-diagonal feature vector such that **f**_{ij} = **0** for *r*_{ij} = *∞* by first applying the newly introduced changes for **f**_{i} to also for **f**_{ij}. Further action is needed for the off-diagonal case because many feature vector elements do not decay to zero when the distance between *i* and *j* is large due to rotation of the orbitals into a gerade and an ungerade combination, e.g., $F\u0129k=12Fik+12Fjk=12Fik\u2009\u2009for\u2009\u2009rij=\u221e,rjk=\u221e$. As a remedy, we apply a damping function of the form $11+16(rij/r0)6$ to each feature vector element. The form of this damping function is inspired by the semi-classical limit of the MP2 expression as it is also used for semi-classical dispersion corrections.^{37} The damping radius, *r*_{0}, needs to be sufficiently large as to not interfere with machine learning at small *r*_{ij}. If a damping radius close to zero would be chosen, all off-diagonal feature vectors would be zero, which nullifies the information content; however, the damping radius *r*_{0} also should not be too large as size-consistency has to be fully learned until the off-diagonal feature vector is fully damped to zero. Therefore, we employ a damping radius in the intermediate-distance regime, and we empirically found *r*_{0} = 5.0 bohrs to work well. Although it could be systematically and automatically optimized in general (see, e.g., Fig. S4), we simply apply *r*_{0} = 5.0 bohrs throughout this work.

Finally, we enforce that *ε*^{ML}(**0**) = 0. The MOB features are engineered to respect this limit and would, for example, in a linear regression with a zero intercept trivially predict a zero-valued pair correlation energy without any additional training. However, the Gaussian process regression we apply in this work does not trivially yield a zero-valued pair correlation energy for a zero-valued feature vector. In the case that a training set does not include examples of zero-valued feature vectors, we need to include zero-valued feature vectors and zero-valued pair correlation energies in training to ensure that *ε*^{ML}(**0**) = 0. For no model trained in the current study were more than 5% zero-valued feature vectors included.

The resulting MOB-ML model leads to size consistent energy predictions to the degree to which the underlying MO generation is. It is not required that the dimer is explicitly part of training the MOB-ML model to obtain this result. The detailed definition of each feature vector block is summarized in Tables S5 and S6. We apply the feature set defined in Tables S5 and S6 consistently in this work.

## III. COMPUTATIONAL DETAILS

We present results for five different datasets: (i) a series of alkane molecules, (ii) the potential energy surface of the malonaldehyde molecule, (iii) a thermalized version of the QM7b and the GDB13 dataset (i.e., QM7b-T and GDB13-T),^{38} (iv) a set of backbone–backbone interactions (BBIs),^{39} and (v) a thermalized version of a subset of mononuclear, octahedral transition-metal complexes put forward by Kulik and co-workers.^{40} We refer to the supplementary material, Sec. II, for a description of how the structures were obtained or generated. All generated structures are available in Ref. 41.

The features for all structures were generated with the entos qcore^{42} package. The feature generation is based on a HF calculation applying a cc-pVTZ^{43} basis for the elements H, C, N, O, S, and Cl. We apply a def2-TZVP basis set^{44} for all transition metals. The HF calculations were accelerated with density fitting for which we applied the corresponding cc-pVTZ-JKFIT^{45} and def2-TZVP-JKFIT^{46} density fitting bases. Subsequently, we localized the valence-occupied and valence-virtual molecular orbitals with the Boys–Foster localization scheme^{47,48} or with the intrinsic bond orbital (IBO) localization scheme.^{49} We implemented a scheme to localize the valence-virtual orbitals with respect to the Boys–Foster function (for details on this implementation, see Sec. II in the supplementary material). We applied the Boys–Foster localization scheme for datasets (i), (iii), (iv), and (v) for valence-occupied and valence-virtual molecular orbitals. IBO localization for valence-occupied and valence-virtual molecular orbitals led to better results for dataset (ii). As we have previously shown, Boys localization tends to yield more efficient MOB-ML learning curves than IBO localization;^{18} however, IBO localization is well known to yield smooth descriptions of localized orbitals over large geometric changes^{49} like those considered here, thereby leading to the choice of IBO localization in this example.

The resulting orbitals are imported into the Molpro 2018.0^{50,51} package via the matrop functionality to generate the non-canonical MP2^{52} or CCSD(T)^{53–55} pair correlation energies with the same orbitals we applied for the feature generation. These calculations are accelerated with the resolution of the identity approximation. The frozen-core approximation is invoked for all correlated calculations.

We follow the machine learning protocol established and outlined in the previous work^{18} to train the MOB-ML models. In the first step, we perform MOB feature selection by evaluating the mean decrease in accuracy in a random forest regression in the Scikit-learn v0.22.0 package.^{56} We then regress the diagonal and off-diagonal pair correlation energies separately with respect to the selected features in the GPy 1.9.6 software package.^{57} We employ the Matérn 5/2 kernel with white noise regularization.^{33} We minimize the negative log marginal likelihood objective with respect to the kernel hyperparameters with a scaled conjugate gradient scheme for 100 steps and then apply the BFGS algorithm until full convergence. As indicted in the results, both random-sampling and active learning strategies were employed for the selection of molecules in the training datasets. In the active learning strategy, we use a previously trained MOB-ML model to evaluate the Gaussian process variance for each molecule and then include the points with the highest variance in the training dataset, as outlined in Ref. 58. To estimate the Gaussian process variance for each molecule, it was assumed the variances per molecular orbital pair are mutually independent.

## IV. RESULTS

### A. Transferability within a molecular family

We first examine the effect of the feature vector generation strategy on the transferability of MOB-ML models within a molecular family. To this end, we revisit our alkane dataset^{18} that contains 1000 ethane and 1000 propane geometries as well as 100 butane and 100 isobutane geometries. We perform the transferability test outlined in Ref. 18, i.e., training a MOB-ML model on correlation energies for 50 randomly chosen ethane geometries and 20 randomly chosen propane geometries to predict the correlation energies for the 100 butane and 100 isobutane geometries (see Fig. 1). This transferability test was repeated with 10 000 different training datasets (each consisting of data for 50 ethane molecules and 20 propane molecules) to assess the training set dependence of the MOB-ML models. As suggested in Ref. 25, we consider various performance metrics to assess the prediction accuracy of the MOB-ML models: (i) the mean error [ME, Eq. (S2)], (ii) the mean absolute error [MAE, Eq. (S3)], (iii) the maximum absolute error [MaxAE, Eq. (S4)], and (iv) the mean absolute relative error [MARE, Eq. (S5)] that applies a global shift setting the mean error to zero. We report the minimum, peak, and maximum encountered MAREs in Table I alongside literature values obtained in our previous work,^{18} by Dick and Fernandez-Serra,^{27} and by Chen *et al.*^{25} The MEs, MAEs, and MaxAEs are reported in Fig. S1.

. | . | MARE . | |||||
---|---|---|---|---|---|---|---|

. | . | Butane . | Isobutane . | ||||

Method . | Feature set . | Min. . | Peak . | Max. . | Min. . | Peak . | Max. . |

NeuralXC | Reference 27 | 0.15 | 0.14 | ||||

DeepHF | Reference 25 | 0.06 | 0.11 | 0.43 | 0.07 | 0.13 | 0.53 |

MOB-ML | Reference 18 | 0.20 | 0.21 | ||||

This work | 0.06 | 0.11 | 0.19 | 0.06 | 0.10 | 0.19 |

In general, MOB-ML as well as NeuralXC^{27} and DeepHF^{25} produces MAREs well below chemical accuracy for correlation energies of butane and isobutane when trained on correlation energies of ethane and propane. Updating the feature vector generation strategy for MOB-ML results in the best peak MAREs for butane as well as for isobutane, which are 0.11 kcal/mol and 0.10 kcal/mol, respectively. As in our previous work,^{18} we note that the total correlation energy predictions may be shifted with respect to the reference data so that the MEs for MOB-ML range from −0.92 kcal to 2.70 kcal/mol for butane and from −0.18 kcal to 1.02 kcal/mol for isobutane (see also Fig. S1). This shift is strongly training-set dependent, which was also observed for results obtained with DeepHF.^{25}

The results highlight that this is an extrapolative transferability test. A considerable advantage of applying GPR in practice is that each prediction is accompanied by a Gaussian process variance, which, in this case, indicates that we are in an extrapolative regime (see Fig. 1). Here, this can be seen by comparing the variances of butane and isobutane molecules to variances of ethane and propane molecules not employed in training. The maximum variance for ethane molecules is below 0.01 kcal/mol, and the one for propane molecules is below 0.06 kcal/mol (see also Fig. S3). Extrapolations might be associated with quality degradation, which we see, most prominently, for the mean error in butane. By contrast, other machine learning approaches such as neural networks are less clear in terms of whether the predictions are in an interpolative or extrapolative regime.^{59} By including the butane molecule with the largest variance in the training set (which then consists of 50 ethane, 20 propane, and 1 butane geometries), we reduce the ME from 0.78 to 0.25, MAE from 0.78 to 0.26, MaxAE from 1.11 to 0.51, and MARE from 0.11 kcal/mol to 0.09 kcal/mol for butane (see Fig. S2). These results directly illustrate that MOB-ML can be systematically improved by including training data that are more similar to the test data; the improved confidence of the prediction is then also directly reflected in the associated Gaussian process variances.

As a second example, we examine the transferability of a MOB-ML model trained within a basin of a potential energy surface to the transition-state region of the same potential energy surface. We chose malonaldehyde for this case study as it has also been explored in previous machine learning studies.^{6} We train a MOB-ML model on 50 thermalized malonaldehyde structures that all have the property that *d*(O^{1}–H) + *d*(O^{2}–H) > 0.4 Å (where *d* denotes the distance between the two nuclei), which ensures that we are sampling from the basins. We then apply this trained model to predict the correlation energies for an MP2 potential energy surface mapping out hydrogen transfer between the two oxygen atoms (see Fig. 2). MOB-ML produces an accurate potential energy surface for hydrogen transfer in malonaldehyde only from information on the basins (compare left and middle left panels of Fig. 2). The highest encountered errors on the minimum potential energy path are smaller than 1.0 kcal/mol. Unsurprisingly, the predicted minimum energy structure [*d*(O^{1}–H) = 1.00 Å, *d*(O ^{2}–H) = 1.63 Å] is very similar to the reference minimum energy structure [*d*(O^{1}–H) = 1.00 Å, *d*(O^{2}–H) = 1.64 Å]. Strikingly, the predicted energy of 2.65 kcal/mol at the saddle point at *d*(O^{1}–H) = *d*(O^{2}–H) = 1.22 Å differs from the reference energy by only 0.35 kcal/mol, although the MOB-ML model was not trained on any transition-state like structures. The highest errors are encountered in the high-energy regime, and this region is also associated with the highest Gaussian process variance indicating low confidence in the predictions (compare middle right and right panels of Fig. 2). The Gaussian process variance reflects the range of structures the MOB-ML model has been trained in and highlights again that we did not include transition-state-like structures in the training.

### B. Transferability across organic chemistry space

The Chemical Space Project^{60} computationally enumerated all possible organic molecules up to a certain number of atoms, resulting in the GDB databases.^{61} In this work, we examine thermalized subsets^{18} of the GDB13 dataset^{61} to investigate the transferability of MOB-ML models across the organic chemistry space. The application of thermalized sets of molecules has the advantage that we can study the transferability of our models for chemical and conformational degrees of freedom at the same time. To test the transferability of MOB-ML across the chemical space, we train our models on MP2 energies for a thermalized set of seven and fewer heavy-atom molecules (also known as QM7b-T^{18}), and then we test the prediction accuracy for MP2 energies for a QM7b-T test set and for MP2 energies for a thermalized set of molecules with 13 heavy atoms (GDB13-T;^{18} see also Sec. V in the supplementary material), as also outlined in our previous work.^{18,19} We demonstrated previously that MOB-ML learns other single-reference correlated wave function methods such as CCSD or CCSD(T) with similar efficiency.^{18}

We first investigate the effect of changing the feature vector generation protocol on the QM7b-T → QM7b-T prediction task (see Fig. 3). In Ref. 18, we found that training on about 180 structures is necessary to achieve a model with an MAE below 1 kcal/mol. The FCHL method yields an MAE below 1 kcal/mol when training on about 800 structures,^{26} and the DeepHF method already exhibits an MAE below 1 kcal/mol when training on their smallest chosen training set that consists of 300 structures (MAE = 0.79 kcal/mol).^{25} The refinements in the current work reduce the number of required training structures to reach chemical accuracy to about 100 structures when sampling randomly. This number is, however, strongly training set dependent. We can remove the training-set dependence by switching to an active learning strategy where we can achieve an MAE below 1 kcal/mol reliably with about 70 structures. In general, the MAE obtained with the active learning strategy is comparable to the smallest MAEs obtained with random sampling strategies. This has the advantage that a small number of reference data can be generated in a targeted manner.

In general, our aim is to obtain a machine learning model that reliably predicts broad swathes of chemical space. For an ML model to be of practical use, it has to be able to describe out-of-set molecules of different sizes to a similar accuracy when accuracy is measured size-intensively.^{36} We probe the ability of MOB-ML to describe out-of-set molecules with a different number of electron pairs by applying a model trained on correlation energies for QM7b-T molecules to predict correlation energies for GDB13-T. We collect the best results published for this transfer test in the literature in Fig. 4. Our previous best single GPR model achieved an MAE of 2.27 kcal/mol when trained on 220 randomly chosen structures.^{18} The modifications in the current work now yield a single GPR model that achieves an MAE of 1.47 kcal/mol–1.62 kcal/mol for GDB13-T when trained on 220 randomly chosen QM7b-T structures. Strikingly, MOB-ML outperforms machine learning models trained on thousands of molecules like our RC/GPR model and FCHL18.^{26} The current MOB-ML results are of an accuracy that is similar to that of the best reported results from DeepHF (an MAE of 1.49 kcal/mol);^{25} however, MOB-ML only needs to be trained on about 3% of the molecules in the QM7b dataset, while DeepHF is trained on 42% to obtain comparable results (MAE of 1.52 kcal/mol for 3000 training structures). The best reported result for DeepHF (MAE of 1.49 kcal/mol) was obtained by training on 97% of the molecules of the QM7b dataset. The datasets in this work focus on the extremely small data regime, whereas widely used neural net methods such as SchNet or ANI-1 have instead been applied with datasets that are orders of magnitude larger. For example, a direct comparison of the learning efficiency between MOB-ML and SchNet has been provided in the context of forces.^{62} We attribute the excellent transferability of MOB-ML to the fact that it focuses on the prediction of orbital-pair contributions, thereby reframing an extrapolation problem into an interpolation problem when training machine learning models on small molecules and testing them on large molecules. The pair correlation energies predicted for QM7b-T and for GDB13-T span a very similar range (0 kcal/mol to −20 kcal/mol), and they are predicted with a similar Gaussian process variance (see Fig. S7), which we would expect in an interpolation task. The final errors for GDB13-T are larger than those for QM7b-T because the total correlation energy is size-extensive; however, the size-intensive error per electron pair spans a comparable range for QM7b-T and for GDB13-T (see Fig. S6). This presents a significant advantage of MOB-ML over machine learning models that rely on a whole-molecule representation and creates the opportunity to study molecules of a size that are beyond the reach of accurate correlated wave function methods.

Most studies in computational chemistry require accurate relative energies rather than accurate total energies. Therefore, we also assess the errors in the relative energies for the sets of conformers for each molecule in the QM7b-T and GDB13-T datasets obtained with MOB-ML with respect to the MP2 reference energies (see Fig. 5). We emphasize that MOB-ML is not explicitly trained to predict conformer energies, and we include at most one conformer for each molecule in the training set. Nevertheless, MOB-ML produces on average chemically accurate relative conformer energies for QM7b-T when trained on correlation energies for only 30 randomly chosen molecules (or 0.4% of the molecules) in the QM7b set. We obtain chemically accurate relative energies for the GDB13-T dataset when training on about 100 QM7b-T molecules. The prediction accuracy improves steadily when training on more QM7b-T molecules reaching a mean MAE of 0.43 kcal/mol for the relative energies of the rest of the QM7b-T set and of 0.77 kcal/mol for the GDB13-T set.

We now present the first reported test of MOB-ML for non-covalent interactions in large molecules. To this end, we examine the backbone–backbone interaction (BBI) dataset^{39} that was designed to benchmark methods for the prediction of interaction energies encountered within protein fragments. Using the implementation of MOB-ML described here and using only MP2 energies for 20 randomly selected QM7b-T molecules for training, the method achieves a mean absolute error of 0.98 kcal/mol for the BBI dataset (see Fig. 6). However, these predictions are uncertain as indicated by the large Gaussian process variances associated with these data points, which strongly suggested that we are now, as expected, in an extrapolative regime. We further improve the predictive capability of MOB-ML by augmenting the MOB-ML model with data from the BBI set. Specifically, we can draw on an active learning strategy and consecutively include data points until all uncertainties are below 1 kcal/mol, which in this case corresponds to only two data points. This reduces the MAE to 0.28 kcal/mol for the remaining 98 data points in the BBI set. Including more reference data points would further improve the performance for this specific dataset. However, this is not the focus of this work. Instead, we simply emphasize that MOB-ML is a clearly extensible strategy to accurately predict energies for large molecules and non-covalent intermolecular interactions while providing a useful estimation of confidence.

### C. Transition-metal complexes

We finally present the first application of MOB-ML to transition-metal complexes. To this end, we train a MOB-ML model on a thermalized subset of mononuclear, octahedral transition-metal complexes introduced by Janet and Kulik,^{40} which we denote as TM-T. The chosen closed-shell transition-metal complexes feature different transition metals (Fe, Co, and Ni) and ligands. The ligands span the spectrochemical series from weak-field (e.g., thiocyanate) over to strong-field (e.g., carbonyl). We see in Fig. 7 that the learning behavior between TM-T and QM7b-T is similar when the error is measured per valence-occupied orbital. These results demonstrate that MOB-ML formalism can be straightforwardly applied outside of the organic chemistry universe without additional modifications. It is particularly notable that the learning efficiency for TM-T is comparable to that for QM7b-T, as seen in the relatively simple organic molecules in QM7b-T (Fig. 7). We note that we do not expect MP2 theory to be quantitative for transition-metal complexes.^{63,64} Instead, it provides a demonstration of the learning efficiency of MOB-ML for transition-metal complexes in the current example, and as previously demonstrated, MOB-ML learns other correlated wave function methods with similar efficiency.^{15,18}

## V. CONCLUSIONS

Molecular-orbital-based machine learning (MOB-ML) provides a general framework to learn correlation energies at the cost of molecular orbital generation. In this work, we demonstrate that preservation of physical symmetries and constraints leads to machine-learning methods with greater learning efficiency and transferability. Exploiting physical principles such as size consistency and energy invariances not only leads to a conceptually more satisfying method but also leads to substantial improvements in prediction errors for different datasets covering total and relative energies for thermally accessible organic and transition-metal containing molecules, non-covalent interactions, and transition-state energies. With the modifications presented in the current work, MOB-ML is shown to be highly data efficient, which is important due to the high computational cost of generating reference correlation energies. Only 1% of the QM7b-T dataset (containing organic molecules with seven and fewer heavy atoms) needs to be drawn on to train a MOB-ML model that produces on average chemically accurate total energies for the remaining 99% of the dataset. Without ever being trained to predict relative energies, MOB-ML provides chemically accurate relative energies for QM7b-T when training on only 0.4% of the QM7b-T molecules. Furthermore, we have demonstrated that MOB-ML is not restricted to the organic chemistry space and that we are able to apply our framework out-of-the box to describe a diverse set of transition-metal complexes when training on correlation energies for tens of molecules.

Beyond data efficiency, MOB-ML models are shown to be very transferable across the chemical space. We demonstrate this transferability by training a MOB-ML model on QM7b-T and predicting energies for a set of molecules with 13 heavy atoms (GDB13-T). We obtain the best result for GDB13-T reported to date despite only training on 3% of QM7b-T. The successful transferability of MOB-ML is shown to result from its recasting of a typical extrapolation task (i.e., larger molecules) into an interpolation task (i.e., by predicting on the basis of size-intensive orbital-pair contributions). Even when MOB-ML enters an extrapolative regime as identified by a large Gaussian process variance, accurate results can be obtained; for example, we predict the transition-state energy for proton transfer in malonaldehyde and interaction energies in the protein backbone–backbone interaction dataset to chemical accuracy without training on transition-state-like data or non-covalent interactions, respectively. In this case, the uncertainty estimates also offer a clear avenue for active learning strategies, which can further improve the model performance. Active learning offers an attractive way to reduce the number of expensive reference calculations further by picking the most informative molecules to be included in the training set. This provides a general recipe of how to evolve a MOB-ML model to describe new regions of chemical space with minimal effort.

The future work will focus on the expansion of MOB-ML to cover more of chemical space. Specifically, particular areas of focus include open-shell systems and electronically excited states. Physical insight from exact conditions in electronic structure theory^{65} will continue to guide the development of the method, with the aim of providing a machine-learning approach for energies and properties of arbitrary molecules with controlled error.

## SUPPLEMENTARY MATERIAL

See the supplementary material for details on feature generation for all datasets used in this work, definition of error metrics, expanded results for the alkane transferability test, expanded results for the transferability within the organic chemistry space, and features and labels for all datasets used in this work.

## ACKNOWLEDGMENTS

This work was supported, in part, by the U.S. Army Research Laboratory (Grant No. W911NF-12-2-0023), the U.S. Department of Energy (Grant No. DE-SC0019390), the Caltech DeLogi Fund, and the Camille and Henry Dreyfus Foundation (Award No. ML-20-196). T.H. acknowledges funding through an Early Post-Doc Mobility Fellowship by the Swiss National Science Foundation (Award No. P2EZP2_184234). S.J.R.L. thanks the Molecular Software Sciences Institute (MolSSI) for a MolSSI investment fellowship. Computational resources were provided by the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the DOE Office of Science under Contract No. DE-AC02-05CH11231.

## DATA AVAILABILITY

The data that support the findings of this study are available within this article and its supplementary material and are openly available in Caltech Data Repository.^{41}

## REFERENCES

*Ab Initio*Programs,

*GPy*: A Gaussian Process Framework in Python,