The feature vector mapping used to represent chemical systems is a key factor governing the superior data efficiency of kernel based quantum machine learning (QML) models applicable throughout chemical compound space. Unfortunately, the most accurate representations require a high dimensional feature mapping, thereby imposing a considerable computational burden on model training and use. We introduce compact yet accurate, linear scaling QML representations based on atomic Gaussian many-body distribution functionals (MBDF) and their derivatives. Weighted density functions of MBDF values are used as global representations that are constant in size, i.e., invariant with respect to the number of atoms. We report predictive performance and training data efficiency that is competitive with state-of-the-art for two diverse datasets of organic molecules, QM9 and QMugs. Generalization capability has been investigated for atomization energies, highest occupied molecular orbital–lowest unoccupied molecular orbital eigenvalues and gap, internal energies at 0 K, zero point vibrational energies, dipole moment norm, static isotropic polarizability, and heat capacity as encoded in QM9. MBDF based QM9 performance lowers the optimal Pareto front spanned between sampling and training cost to compute node minutes, effectively sampling chemical compound space with chemical accuracy at a sampling rate of molecules per core second.
I. INTRODUCTION
Modern data-driven statistical Machine Learning (ML) models have emerged as powerful tools over the past decade for inferring quantum mechanical observables throughout chemical compound space, without explicitly solving electronic Schrödinger equations.1–3 Similar success was obtained for ML based interatomic potentials and force-fields4–9 as well as electronic structure modeling throughout Chemical Compound Space (CCS).10,11 For extensive, in-depth reviews of these and other related ML applications, we refer the reader to the recent special issue of Chemical Reviews.12,13 Various aspects in the development of ML model architecture and training protocols have proven to be essential for data efficiency. In particular, the molecular representation is known to have a strong impact on the performance of similarity based ML models, such as kernel ridge regression (KRR).14–16 This is not surprising as the representation controls the information encoded about the systems, how it is weighted, and the consistency of these Quantum Machine Learning (QML) models with ab initio methods.17 These representations are non-linear mappings of the atomistic systems to a suitable Hilbert space where a statistical regression model can easily be applied. The Hilbert space constraint applies due to the requirement of measuring similarity in terms of an inner product.18 These mappings should have several desirable features, of which the most important are (i) uniqueness, such that different systems necessarily possess different representations,14 and (ii) invariance with respect to transformations that leave the target property invariant, such as global translations, rotations, and atomic index permutations of the same chemical elements. Other desirable properties include (iii) an analytical and continuous form of the representation function, (iv) differentiability with respect to nuclear coordinates, charges, number of electrons, and number of atoms, (v) generality on par with the Hamiltonian, (vi) computationally efficient evaluation, and (vii) compact or even constant size, limiting the computational cost for larger systems.19
Due to their critical role, many representations have been introduced and investigated within the context of atomistic simulation.20–32 For recent comprehensive reviews, the reader is referred to Refs. 16 and 33. These representations can either describe the molecule as a whole (global) or each atom (local or atomic) separately. For the sake of brevity, we restricted most of our comparative benchmarks within this study to the following representations, which are commonly used to train QML models throughout CCS: Faber–Christensen–Huang–Lilienfeld (FCHL19) representation,34,35 Smooth Overlap of Atomic Positions (SOAP),36 spectrum of London and Axilrod–Teller–Muto potentials (SLATM),37 atom index sorted Coulomb matrix (CM),1 and its vectorized form, Bag of Bonds (BOB).38 Other representations/models tested are mentioned in Sec. III.
While these representations satisfy most of the aforementioned qualities, given the immense size of CCS, a more compact and scalable representation would still be desirable. Formally, the number of degrees of freedom of any material or molecule would prescribe usage of a 4M dimensional feature vector (three spatial coordinates and one nuclear charge coordinate for M atoms). However, all aforementioned representations, when using optimal hyperparameters, require a higher dimensional feature vector mapping in order to be accurate and training-data-efficient at regression tasks, and some (e.g., CM or BOB) even scale quadratically with M. While verbosity facilitates the inclusion of invariances, the 4M degrees of freedom suggest that the same performance can be obtained using more compact representations. This is especially an issue for kernel based ML models, where the size of the representation directly affects the distance/kernel evaluation time.18,34 Although the scaling for kernel inversion39 is larger [O(N³) for Cholesky solvers], for highly data-efficient (i.e., efficient in training data) QML models, it is the kernel generation and evaluation that consumes the most compute time, as demonstrated later. The kernel evaluation pre-factor becomes even worse when using atomic (or local) representations in conjunction with a suitable local kernel.40 Obvious solutions, such as reducing finite local cutoffs within the representation, come at the expense of reducing the predictive power or, conversely, increasing training data needs. As such, a computationally efficient yet accurate solution is desirable, as was shown by the discretization of the FCHL18 representation.34,35 Other solutions to this problem, such as sparse kernel models41 and the recently introduced conjugate gradient based iterative approach for training kernel based models,42 would also be well complemented by a compact molecular representation.
Herein, we propose a methodology for generating representations that minimize feature size. We use functionals of many-body distributions (MBDFs) and their derivatives to encode the local chemical environment of any atom in an integrated, compact fashion. The representations thus generated preserve the system’s various symmetries (translation, rotation, and atom index invariance) and can be tailored to the physical property of interest through scaling functions. MBDFs are easily extendable to include higher order many-body interactions with minimal increase in size. In the current formulation, while including three-body interactions, MBDF scales as 5M. We further tackle the issue of storing this information in a manner that remains invariant to the number of atoms in a molecule. We do this by generating a discretized density (DF) function of MBDF values and using it as a global molecular representation. Using two diverse datasets, the performance of MBDF is tested against the aforementioned SOAP and FCHL19, which are commonly used and state-of-the-art, as well as the SLATM, BOB, and CM representations and a few other QML models mentioned later. Finally, we explore the bottleneck crossover from kernel evaluation to inversion.
II. THEORY AND DISCUSSION
A. Many-body distribution functionals
We begin the discussion of our local representation with distribution functions over the internal coordinates, which can be constructed from the atomic coordinates.
Figure 1 shows the effect of each functional on the learning capacity of MBDF for the task of predicting atomization energies of the QM948 dataset. It is apparent that both the derivative and many-body terms improve the learning capacity, albeit by different magnitudes.
MBDF based QML learning curves using concatenated increasingly higher order and body functionals [Eqs. (12)–(16)]. Mean absolute error (MAE) of predicting atomization energies of the QM948 dataset is shown as a function of training set size N.
Throughout our testing on the QM7,1 QM9,48 and QMugs49 datasets, we have found these two- and three-body functionals to be sufficient at discriminating between all of the molecular structures. Cases where the two-body information does not suffice include homometric pairs, as already discussed many years ago.14 Even three- and four-body information does not suffice for some cases, as recently discussed in Ref. 50. We note here that, whenever necessary, arbitrarily higher order derivative and many-body information could also be included in MBDFs at minimal increase in size, i.e., one additional term per order [see Eq. (8)]. In particular, we believe that the inclusion of the four-body term as a functional of the dihedrals could further improve the learning capacity for conformational isomers. Inclusion of four-body information has been shown to result in further improvements of learning curves.51 Note that the size of MBDF is invariant to the cutoffs used, which can be raised to arbitrarily higher values while employing a suitable long-range functional (hence increasing the farsightedness of the representation) without affecting the kernel evaluation cost. We further note that other weighting functions and response terms could also be useful for QML models of physical observables, such as dipole moments, vibrational frequencies, and heat capacities.
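To make the construction more concrete, the following minimal sketch computes two-body, MBDF-style functionals for a single atom: a Gaussian-smeared, nuclear-charge-weighted radial distribution is built over neighbors within the cutoff and then contracted with a few weighting functions via trapezoidal integration, yielding one compact scalar per functional. The specific weighting functions g_n and the normalization shown here are illustrative placeholders, not the published forms [cf. Eqs. (8) and (12)–(16)].

```python
import numpy as np

def two_body_functionals(coords, charges, i, sigma_r=1.0, r_cut=6.0,
                         weights=(lambda r: np.exp(-r),                # placeholder g_0
                                  lambda r: np.cos(r) * np.exp(-r))):  # placeholder g_1
    """Illustrative two-body MBDF-style functionals for atom i.

    coords: (n_atoms, 3) array, charges: (n_atoms,) nuclear charges.
    Builds a Gaussian-smeared, charge-weighted radial distribution rho_i(r)
    over neighbors within r_cut and contracts it with each weighting function
    g_n(r) by trapezoidal integration, returning one scalar per g_n.
    """
    grid = np.arange(0.0, r_cut, 0.02)               # radial integration grid
    rho = np.zeros_like(grid)
    for j, (pos, Z) in enumerate(zip(coords, charges)):
        if j == i:
            continue
        r_ij = np.linalg.norm(coords[i] - pos)
        if r_ij < r_cut:
            # Gaussian PDF centered at the interatomic distance, weighted by Z_j
            rho += Z * np.exp(-(grid - r_ij) ** 2 / (2 * sigma_r ** 2))
    # contract the distribution with each weighting function -> compact scalars
    return np.array([np.trapz(g(grid) * rho, grid) for g in weights])
```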
B. Density of functionals
C. Numerical analysis
The DF method allows generating a representation that does not scale with the number of atoms in the system. However, in order to use it as a feature vector, the density functions have to be discretized. Through convergence testing, we have set the grid spacing to 0.2 throughout our work. However, we note that this grid spacing could be changed for a different dataset in order to achieve the desirable accuracy vs computational cost trade-off. Furthermore, DF corresponds to a flattened feature vector, which can be used with global kernels (or other ML methods) and which exhibits superior performance when compared to a straightforward concatenation of all MBDF rows (see Fig. 2). The flattened MBDF representation is generated by sorting the MBDF matrix of each molecule by the row norm and then flattening the matrix by concatenation of the rows to vectorize it.1
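The DF construction can be sketched as a weighted kernel density estimate of MBDF values on a fixed grid, with one channel per chemical element and per functional column, concatenated into a single constant-size vector. In the minimal sketch below, the value range of the grid and the use of the nuclear charge as the density weight are assumptions made for illustration only.

```python
import numpy as np

def df_fingerprint(mbdf, charges, sigma_b=0.07, spacing=0.2,
                   value_range=(-5.0, 5.0), elements=(1, 6, 7, 8, 9)):
    """Illustrative density-of-functionals (DF) vector.

    mbdf: (n_atoms, n_functionals) MBDF matrix, charges: (n_atoms,) charges.
    For each chemical element, a Gaussian of bandwidth sigma_b is placed at
    every MBDF value of atoms of that element (weighted here by Z) on a fixed
    grid; all element and functional channels are concatenated into one
    global feature vector whose size is independent of the atom count.
    """
    grid = np.arange(value_range[0], value_range[1], spacing)
    charges = np.asarray(charges)
    channels = []
    for elem in elements:                      # one density per element
        mask = charges == elem
        for k in range(mbdf.shape[1]):         # one density per functional column
            dens = np.zeros_like(grid)
            for v, Z in zip(mbdf[mask, k], charges[mask]):
                dens += Z * np.exp(-(grid - v) ** 2 / (2 * sigma_b ** 2))
            channels.append(dens)
    return np.concatenate(channels)
```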
QML learning curves for MBDF with local kernel, DF (global), and flattened MBDF (global and sorted by row norm). Mean absolute error (MAE) of predicting atomization energies of the QM948 dataset as a function of training set size N.
Figure 3 shows molecular fingerprints generated using the one and five functional DF representations for three diverse and relevant organic molecules (glucose, uric acid, and testosterone) on the same grid. For each molecule, a distinct fingerprint is obtained, with peak positions depending on the local chemical environment of each atom. Consequently, peaks of atoms with chemically similar environments are located closer to each other. Peak heights encode both the number and type (because the density estimate is weighted) of chemical environments [see Eq. (20)]. Figure 3 demonstrates that for molecules of increasing size, the corresponding DF based fingerprints grow in magnitude, not in size. In the supplementary material, we also show how DF fingerprints distinguish conformational isomers, as exemplified for the chair and boat conformations of cyclohexane.
Two-body (left) and three-body (right) DF version [Eqs. (12) and (15)] representations for glucose, uric acid, and testosterone.
III. METHODS AND DATA
A. Kernel ridge regression
The ML method that we focus on, and use throughout this work, is the supervised learning method called Kernel ridge regression18,39 (KRR). This method has been covered extensively earlier,1–3,34,35 so we skip the details here.
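Since the details are skipped, the following minimal reminder sketches the KRR workflow used throughout: kernel construction, regularized Cholesky solve, and prediction, here with a global Gaussian kernel. The hyperparameter values shown are placeholders; in practice they are optimized per representation and training set size (Sec. III B).

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import cho_factor, cho_solve

def krr_train_predict(X_train, y_train, X_test, sigma=100.0, lam=1e-8):
    """Minimal KRR sketch: Gaussian kernel, Cholesky solve, prediction."""
    K = np.exp(-cdist(X_train, X_train, "sqeuclidean") / (2 * sigma ** 2))
    K[np.diag_indices_from(K)] += lam                  # ridge regularization
    alpha = cho_solve(cho_factor(K), y_train)          # regression coefficients
    K_test = np.exp(-cdist(X_test, X_train, "sqeuclidean") / (2 * sigma ** 2))
    return K_test @ alpha                              # predicted properties
```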
B. Hyperparameter optimization
The current form of the representations has been optimized for kernel based learning models. It depends on the weighting functions used and a number of hyperparameters that include the variances (σr, σθ) of the Gaussian PDFs, the weighting function hyperparameters mentioned in Eqs. (12)–(16), and the bandwidth σb for the DF representation. The hyperparameter optimization was done on a random subset of two thousand molecules from the QM7 dataset1, and the resulting values were then kept fixed for all other datasets. We note that further improvements might be possible if the hyperparameters were optimized simultaneously on all datasets. The weighting functions gn(τ) in Eq. (8) were chosen by straightforward screening of the functions mentioned earlier. The optimization minimized the atomization energy prediction errors on the QM7 subset using Gaussian Process (GP) based Bayesian Optimization (BOpt).39 Starting with a Gaussian prior, the method fits a posterior distribution over the objective function using successive function evaluations. The posterior distribution is then used to construct an efficient acquisition function, which can be optimized using, for instance, a quasi-Newton method to determine the next query point. Table I shows the optimized hyperparameter values used throughout this work.
MBDF and DF hyperparameters after optimization on atomization energies of the QM71 subset.
| Parameter | Value |
|---|---|
| σr | 1 |
| σθ | 2 |
| η | 10.8 |
| α | 1.5 |
| a0, a1, a2, a3 | 3, 100, −200, −164 |
| σb | 0.07 |
| rc (Å) | 6 |
We used the scikit-optimize60 implementation of GP based BOpt with the default Matérn kernel with unit variance and the limited memory BFGS optimizer61 for optimizing the acquisition function. In order to enable comparison to other representations (such as FCHL or SOAP) that rely on distance cutoffs, we have chosen to set the interatomic distance cutoff rc for MBDF to 6 Å throughout this work. We note that larger cutoffs for MBDF would not change its size.
All MBDF functionals were evaluated using the trapezoidal numerical integration method. The grid spacing for discretizing DF densities has been set to 0.2 throughout our work, as noted earlier. The bandwidth σb = 0.07 was found to work well on the QM7 subset; however, we recommend screening it once (in the range ∼[0.01, 1]) along with the grid spacing when applying the method to other datasets. The Numpy62 and Numba63 libraries are used in the representation generation code.
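A minimal sketch of how such a GP based Bayesian optimization could be set up with scikit-optimize is given below. The search ranges and the dummy objective (which stands in for the actual MBDF/KRR atomization energy MAE on the 2k QM7 subset) are placeholders for illustration only.

```python
from skopt import gp_minimize
from skopt.space import Real

def objective(params):
    sigma_r, sigma_theta, sigma_b = params
    # In practice: regenerate MBDF/DF with these values, train a KRR model on
    # the 2k QM7 subset, and return the atomization energy MAE. A dummy
    # surrogate is used here so the sketch runs stand-alone.
    return (sigma_r - 1.0) ** 2 + (sigma_theta - 2.0) ** 2 + (sigma_b - 0.07) ** 2

search_space = [Real(0.1, 10.0, name="sigma_r"),
                Real(0.1, 10.0, name="sigma_theta"),
                Real(0.01, 1.0, name="sigma_b")]

result = gp_minimize(objective, search_space,
                     n_calls=50,              # number of objective evaluations
                     acq_func="EI",           # expected-improvement acquisition
                     acq_optimizer="lbfgs",   # L-BFGS for the acquisition function
                     random_state=0)
print(result.x, result.fun)                   # best hyperparameters and objective
```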
C. Data and code
The QM9 dataset48 consists of ∼134k small organic molecules with up to nine heavy atoms (C, N, O, and F). The calculations were performed at the B3LYP/6-31G(2df, p)65–66 level of theory.
QMugs is a dataset containing ∼665k biologically and pharmacologically relevant drug-like molecules. It consists of larger molecules than QM9 with up to 100 heavy atoms (C, N, O, F, P, S, Cl, Br, or I) per molecule. The training and predictions were performed on the DFT (ωB97X-D/def2-SVP)67,68 values reported in the dataset. The QMugs subsets we used for Fig. 5 were drawn at random and consist of 20k molecules. Throughout, we used zero-padding for all representations studied in order to accommodate training and test molecules smaller than the maximum present in the data.
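For concreteness, the zero-padding step can be sketched as below; the function name and the convention of returning the true atom count (so that padded rows can be excluded during local kernel evaluation) are our own choices for illustration.

```python
import numpy as np

def pad_representation(rep, max_atoms):
    """Zero-pad a per-atom representation array (n_atoms, n_features) to a
    fixed (max_atoms, n_features) shape so that molecules of different sizes
    can be stacked into one array; the true atom count is returned alongside."""
    n_atoms, n_features = rep.shape
    padded = np.zeros((max_atoms, n_features))
    padded[:n_atoms] = rep
    return padded, n_atoms
```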
In order to keep the FCHL19 and SOAP kernel evaluations computationally tractable, we have (a) restricted ourselves to a maximum of 100 atoms per QMugs molecule and (b) reduced the default hyperparameters of the FCHL19 and SOAP representations to nRs2 = 12, nRs3 = 10, rcut = 6 and nmax = lmax = 3, σ = 0.1, rcut = 6, respectively. For consistency, we used the same parameters for all other results reported in this article. These two versions of the representations with the reduced basis sets are denoted as FCHL19* and SOAP* in all reported figures. Note that the latter choice of hyperparameters only negligibly deteriorates the predictive accuracy for QM9 (as assessed below and when comparing to the published prediction errors on QM9 for FCHL19 and SOAP2013). For the FCHL19 and SOAP based prediction errors reported herein for QMugs, the accuracy might improve further if these parameters were optimized.
Figure 4 also includes results for QML models based on the two- (k = 2) and three-body (k = 3) Many-Body Tensor Representations (MBTR)22 and a variant69 of the Atom Centered Symmetry Functions20 (ACSF) as implemented in the QMLcode library.70 The MBTR representations were generated with the same hyperparameters as those used in Ref. 33 for the 10k training set size on the QM9 dataset.
MBDF/DF performance and comparison on atomization energies from the QM9 dataset (drawn at random from ∼134k organic molecules).48 (a) (Left) Mean absolute error (MAE) of prediction as a function of training set size for representations CM,1 BOB,38 MBTR,22 SLATM,37 PI,81 ACSF,20,69 FCHL19,34 and SOAP.36 Numbers in legend denote representation size (feature vector dimensions) and G and L denote Global and Local kernels, respectively. (b) (Right) Timing for training and testing as a function of training set size required for making chemically accurate (MAE = 1 kcal/mol) predictions on 100k QM9 molecules. Blue, red, and green points indicate local kernels, global kernels, and neural network, respectively. The dashed gray line corresponds to the optimal Pareto front. For SchNet,73 PaiNN,74 and Allegro,76 an Nvidia RTX 3070 GPU has been used. All other timings were evaluated on a compute node equipped with a 24-core AMD EPYC 7402P @ 3.3 GHz CPU and 512 GB RAM. Timings for FCHL18,35 BOB, and CM are estimated using smaller kernels (not taking into account kernel inversion). Asterisk denotes representations with reduced hyperparameters used in this work. N values for ACSF, MBTR, BOB, and CM estimated via extrapolation. Numbers in brackets denote year of publication.
Throughout our work, the FCHL19, SLATM, and ACSF representations were generated using the QMLcode library,70 SOAP was generated using the Dscribe library71 with the default Gaussian type radial basis functions, MBTR was generated using the qmmlpack library,72 and SchNet73 and PaiNN74 were built and trained using the SchNetPack75 library. Allegro76 was built and trained using the nequip77 code78 based on the E(3)-NN framework.79,80
For FCHL19, SOAP, and ACSF, we employed the local Gaussian kernel [Eq. (28)]; for SLATM, MBTR, and DF, we used the global Gaussian kernel [Eq. (25)]; and for CM and BOB, we used the global Laplacian kernel [Eq. (26)]. For MBDF, we used the local Gaussian kernel on the MD17 dataset and the local exponential kernel [Eq. (30)] for all other models. These choices were made based on the best performances for each representation. All kernel evaluations were performed using the QMLcode library.
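The following sketch illustrates, in simplified form, the difference between a global Gaussian kernel [in the spirit of Eq. (25)] acting on molecule-level vectors and a local Gaussian kernel [in the spirit of Eq. (28)] summing over pairs of atomic environments; element matching, normalization, and the Laplacian/exponential variants are omitted for brevity.

```python
import numpy as np

def global_gaussian_kernel(A, B, sigma):
    """Global Gaussian kernel between molecule-level vectors (e.g., DF, SLATM).
    A: (n_mols_A, n_features), B: (n_mols_B, n_features)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def local_gaussian_kernel(mols_A, mols_B, sigma):
    """Local Gaussian kernel: each entry sums Gaussian similarities over all
    pairs of atomic environments of the two molecules. mols_A and mols_B are
    lists of per-molecule (n_atoms, n_features) arrays (e.g., MBDF)."""
    K = np.zeros((len(mols_A), len(mols_B)))
    for i, Xa in enumerate(mols_A):
        for j, Xb in enumerate(mols_B):
            d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
            K[i, j] = np.exp(-d2 / (2 * sigma ** 2)).sum()
    return K
```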
IV. RESULTS
A. Atomization energies
1. QM9
Figure 4(a) shows learning curves for QM9 and the size (dimension of feature vector) of the representation arrays in the legend. For the task of predicting atomization energies, local representations have previously been shown to be more efficient.34,35 Results for the local representations FCHL19 and SOAP are closely reproduced and reach chemical accuracy after training on less than 10k molecules.34 Among the global representations, SLATM has previously also been shown34,37 to perform quite well, reaching chemical accuracy after training on ∼9k molecules, although it shows a smaller slope at larger training sizes. This is closely followed by MBDF, which reaches chemical accuracy after training on ∼10k molecules (less than 10% of the dataset). The global DF representation also performs decently well, reaching chemical accuracy at ∼32k training data. The local ACSF representation shows a larger offset but a better slope, and it reaches chemical accuracy at ∼50k training set size. We note here that for consistency with the other representations used in this work, we did not optimize the hyperparameters of the MBTR representation on every training point but rather kept them fixed throughout. Only the KRR hyperparameters were optimized at each training set size as with all other representations used here. The three-body MBTR reaches chemical accuracy at ∼60k training set size, while the two-body MBTR performs better than the other two-body representations, BOB and CM. We have also included the recently introduced constant size Persistence Images81 (PI) representation for comparison.
Note that MBDF has the smallest size, requiring only five numbers per atom (145 dimensions for QM9 molecules). By contrast, other local representations, such as FCHL19 and SOAP, require ∼400 numbers per atom, while ACSF uses a 150 dimensional feature vector per atom. Encouragingly and despite its compact size, MBDF outperforms most of the other larger representations with the exception of SOAP and FCHL. We note here that while the size of the global DF representation is larger than that of MBDF, it utilizes a global kernel, implying training and prediction cost invariance with respect to system size.
This compactness of the representation translates into faster ML model evaluation timings. This is shown in Fig. 4(b), which plots the trade-off between training and prediction timings vs training data needs for reaching mean absolute prediction errors of atomization energies of 1 kcal/mol. We note that there are only two representations located on the Pareto-front, FCHL18,35 and MBDF (this work). We also point out that currently the best performing model on the QM9 dataset is the recently proposed Wigner kernel method,82 which is not included in this study.
As noted earlier, local kernels based on representations, such as FCHL18/19, SOAP, or ACSF, exhibit very good training data efficiency, but this comes at the expense of a larger computational overhead. The exception is the local MBDF based kernel, which achieves the fastest training timing of ∼0.07 compute node minutes (14k training molecules) due to its compact size. Predictions on 100k QM9 molecules using the local MBDF kernel are made in ∼1.46 compute node minutes, which translates to an unprecedented chemically accurate navigation rate of ∼1140 molecules/second. SLATM, being close to the Pareto front, and the DF representation both represent fast global kernel based KRR models. While requiring more training data than SLATM in order to reach chemical accuracy, DF has the advantage that it is largely invariant with respect to system size (see below). For the sake of completeness, Fig. 4(b) also includes results for the deep learning based models SchNet,73 PaiNN,74 and the current state-of-the-art, Allegro,76 which were all trained on a GPU. The reported timing for SchNet refers to 3000 epochs of training on 20k molecules and predictions on 100k molecules, which took about 3 h 27 min and 7 s, respectively. For PaiNN, the reported timing corresponds to 1000 epochs of training on 7k molecules, which took 56 min, and a prediction time of 9 s. Allegro reached chemical accuracy after training on 4500 molecules for 728 epochs, with early stopping, which took 17 h 8 min, while the prediction on 100k molecules took ∼7 min (using the maximum possible batch size of five on the GPU used).
MBDF achieves the fastest training time out of all models and the fastest prediction rate among the kernel based models. Numerical results for porting FCHL19 based KRR to GPUs83 suggest that the prediction rate would likely increase significantly further once MBDF is reimplemented in CUDA. Figure 2 in the supplementary material shows the training and prediction times separately for some of these methods.
2. QMugs
We have tested the generalizability of our method to larger systems and more diverse chemistries using the QMugs49 dataset. Figure 5 shows the atomization energy learning curves. Due to the large variety in the dataset, the predictive errors are larger for all representations than their QM9 counterparts, even when predicting on a much smaller test set. MBDF reaches ∼2 kcal/mol prediction error after training on 16k molecules. This is better than the QML based neural network predictions published in Ref. 84 and similar to the Δ-QML numbers they also report. In terms of speed, generating the local MBDF kernel for training and testing on 20k molecules of this dataset takes ∼1.8 compute node minutes (see below), which corresponds to a navigation rate of ∼185 molecules/second. By comparison, this is substantially faster than the GPU based prediction rates of ∼50 and five molecules per second for the direct and Δ-learning (using GFN2-xTB85) based ML models, respectively, using the convolutional neural network reported in Ref. 84. Only SLATM and FCHL19 exhibit a lower offset than MBDF, while the performance for SOAP and DF is similar, albeit slightly worse than MBDF. As mentioned before, however, in order to make FCHL19 and SOAP tractable, we have dramatically reduced their hyperparameters. In particular, we believe that the learning efficiency of SOAP for QMugs is reduced due to the use of small basis sets (nmax = lmax = 3). Note that no representation reaches chemical accuracy within 16k training molecules, indicating that QMugs possesses substantially more chemical diversity than QM9.
MBDF/DF performance and comparison to CM,1 BOB,38 SLATM,37 FCHL19,34 and SOAP36 representation based atomization energy estimates using the QMugs (∼667k organic molecules with up to one hundred heavy atoms) dataset.49 Training and testing data drawn at random. Prediction mean absolute errors (MAEs) on the holdout test set of 4k molecules shown as a function of training set size. Numbers in legend denote representation size (feature vector dimensions), and G and L denote Global and Local kernels, respectively. The shaded region denotes the standard deviation across four different learning curves (except for FCHL19 and SOAP for which only one learning curve was tractable). Asterisk denotes representations with reduced hyperparameters used in this work.
In terms of representation sizes, MBDF again remains the smallest representation since it still requires only five dimensions per atom regardless of the chemical composition. However, being a local representation, its size increases, on average, ∼3.4 times compared to QM9. FCHL19 and SOAP, on the other hand, now require more than 1000 dimensions to represent each atom for this larger dataset. CM, BOB, FCHL19, and SOAP show a larger than tenfold increase in representation size compared to QM9, followed by SLATM, which shows an increase of ∼6.6 times. This results in a considerable increase in the train/test time (vide infra), which precludes the straightforward application of these representations to the entire QMugs dataset. The DF representation changes the least in size since it does not formally scale with the number of atoms but only with the number of different chemical elements. Consequently, its size doubles compared to QM9 since a separate density function is generated for each unique chemical element in the dataset using Eq. (19), as mentioned earlier.
3. QM7b and MD17
Figure 3 in the supplementary material shows learning curves for the QM7b86,87 atomization energies and the size (dimension of feature vector) of the representation arrays in the legend. Similar trends in performance and representation size noted so far are observed on this dataset as well.
Figure 4 in the supplementary material shows learning curves of energies for a few molecules from the revised88,89 MD1790,91 molecular dynamics dataset. Although the comparative performance trend on this dataset is similar to the others, we note that the chosen functionals for MBDF and the current representation hyperparameters are not optimized for potential energy surface learning. Furthermore, the implementation of gradients for MBDF to enable force based learning should significantly enhance the performance for such tasks, as was shown88 for the FCHL19 representation. However, the relatively larger difference in performance between MBDF and FCHL19 on this dataset might indicate that compact representations are sufficient for regression through equilibrium structures across CCS, while more verbose representations are required for potential energy surface learning.
B. Timings
Figure 6 shows scaling plots of kernel evaluation timings across both the QM9 and QMugs datasets for various representations and as a function of training set size. As one would expect, the offset increases systematically with increasing representation and training molecule size. More specifically, for the larger molecules of QMugs, the FCHL19 and SOAP kernel evaluations become intractable very quickly, with the 16k kernel (Fig. 5) already taking an entire compute node day. Encouragingly and in stark contrast, DF, being a size invariant representation, shows hardly any change in computational overhead when moving from the small QM9 molecules to the much larger QMugs molecules.
Compute node time required for kernel inversion and evaluation as a function of training set size N drawn from the QM9 (squares) and QMugs (diamonds) datasets. QML results shown for SOAP,36 FCHL19,34 SLATM,37 and MBDF and DF. Dotted lines indicate extrapolation [using quadratic (kernel evaluation) and cubic (kernel inversion) polynomial fits]. G and L denote Global and Local kernels, respectively. Compute node: 24-core AMD EPYC 7402P @ 3.3 GHz CPU with 512 GB RAM.
For context, the time required in the kernel inversion step for each training kernel is shown as well. The bottleneck crossover from kernel generation to the inversion step occurs rather late. When using Cholesky decomposition, it occurs for MBDF at N ∼ 64k training molecules. For less compact representations (for SLATM and larger), the same crossover occurs at training set sizes that exceed ∼1M. As demonstrated above, chemical accuracy is already achieved at substantially smaller training set sizes. Consequently and contrary to popular belief, for any of the more modern and accurate representations, kernel inversion does not constitute a bottleneck.
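The crossover estimate can be illustrated with the same kind of polynomial extrapolation used for the dotted lines in Fig. 6: fit the kernel evaluation times with a quadratic in N and the Cholesky inversion times with a cubic, then locate where the cubic overtakes the quadratic. The timing values in the sketch below are self-consistent placeholders, not the measured data behind Fig. 6.

```python
import numpy as np

# Placeholder wall times (node minutes) at a few training set sizes N
N_meas    = np.array([2_000, 4_000, 8_000, 16_000], dtype=float)
t_kernel  = np.array([0.003, 0.012, 0.046, 0.19])   # kernel evaluation (quadratic in N)
t_inverse = np.array([1e-4, 8e-4, 6.5e-3, 0.052])   # Cholesky inversion (cubic in N)

p_kernel  = np.polyfit(N_meas, t_kernel, 2)          # quadratic fit
p_inverse = np.polyfit(N_meas, t_inverse, 3)         # cubic fit

# Locate the first N where the fitted inversion time exceeds the kernel time
N_grid = np.linspace(2_000, 300_000, 2000)
diff = np.polyval(p_inverse, N_grid) - np.polyval(p_kernel, N_grid)
crossover = N_grid[np.argmax(diff > 0)]
print(f"estimated crossover near N ~ {crossover:.0f}")
```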
Table II reports the kernel evaluation and representation generation timings for both the QM9 dataset (130k molecules) and the QMugs subset (20k molecules) used in our work. It can be seen that MBDF reduces the local kernel evaluation timings from days to a few minutes for both small and large molecules. For the representation generation step, we note that our code is currently written in Python using the Numba63 library and could be further optimized with a low-level implementation. However, the current timings do not affect the overall QML model cost much, given the kernel evaluation bottleneck.
Compute node times for generating representations (trep) and kernel matrices (tkernel) for 130k molecules from the QM9 dataset and 20k molecules from the QMugs dataset. Global and Local kernels are again denoted by (G) and (L), respectively. Representations with superscript a were generated with QMLcode70 library and b with the Dscribe71 library. Compute node: 24-core AMD EPYC 7402P @ 3.3 GHz CPU and 512 GB RAM.
| Representation | QM9 trep (min) | QM9 tkernel (min) | QM9 dimension | QMugs trep (min) | QMugs tkernel (min) | QMugs dimension |
|---|---|---|---|---|---|---|
| CMa (G) | 0.186 | 2.862 | 435 | 0.012 | 1.146 | 5 050 |
| BoBa (G) | 0.216 | 7.296 | 1 128 | 1.362 | 3.396 | 18 528 |
| SLATMa (G) | 18.60 | 86.32 | 11 960 | 15.76 | 14.58 | 79 045 |
| FCHL19a (L) | 0.846 | 1071 | 10 440 | 1.764 | 1566 | 122 000 |
| SOAPb (L) | 0.216 | 1873 | 13 920 | 0.246 | 2925 | 186 000 |
| MBDF (L) | 1.626 | 11.81 | 145 | 4.182 | 1.848 | 500 |
| DF (G) | 2.262 | 12.16 | 2 500 | 2.442 | 0.996 | 5 000 |
C. Performance for molecular quantum properties
We assessed the generalization capacities of MBDF/DF on physical properties other than atomization energies. Figure 7 shows the learning curves for the task of predicting eight important molecular quantum properties from the QM9 dataset. These include the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) eigenvalues and the HOMO–LUMO gap, internal energy at 0 K (U0), dipole moment norm (|μ|), static isotropic polarizability (α), zero point vibrational energy (ZPVE), and heat capacity (Cv). Due to substantial computational costs, the KRR hyperparameters were not optimized at each training size for the FCHL19 and SOAP representations; instead, we used the same parameters as those used for learning atomization energies in Fig. 4. We reproduce earlier trends among intensive and extensive properties when using local/global kernels.3,92 MBDF and DF match this trend: they perform better on extensive and intensive properties, respectively.
Learning curves for various representations in QML models of highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) eigenvalues, HOMO–LUMO gap (Δϵ), internal energy at 0 K (U0), dipole moment norm (|μ|), static isotropic polarizability (α), zero point vibrational energy (ZPVE), and heat capacity (Cv) from the QM9 dataset. G and L denote Global and Local kernels, respectively. Asterisk denotes representations with reduced hyperparameters used in this work.
Again, note that the performance of MBDF/DF on these properties could be further improved by including different functionals suited to the property or by augmenting them with response terms, as was done for the FCHL19 representation.92 It would also be interesting to see how the learning capacities across these different physical properties are affected by the inclusion of higher order functional terms.
V. CONCLUSION
We have introduced the compact atomic local many-body distribution functional (MBDF) and global density of functionals (DF) representations for use within kernel ridge regression based Quantum Machine Learning (QML) models for rapid sampling of chemical compound space. MBDF/DF can accurately describe any atom/molecule using a minimal number of discrete elements and thereby reduce QML training and prediction times by multiple orders of magnitude for small and large molecules. MBDF and DF correspond to functionals of analytical weighted many-body distributions in interatomic distances and angles, as well as their derivatives. Chemical identity is encoded as a prefactor to the atomic functionals. DF is a weighted density estimation of MBDF, i.e., a global molecular fingerprint that is invariant with respect to the number of atoms (though not the number of chemical elements).
We have demonstrated predictive power competitive with the state-of-the-art for a variety of quantum physical properties, as encoded in the QM9 dataset.48 On the QM9 dataset, MBDF reaches a MAE of atomization energies of only 0.69 kcal/mol after training on 32k molecules while using only five dimensions to describe an atom. Regarding different molecular properties, it is beneficial to use the DF representation along with a global kernel for intensive properties and MBDF along with a local kernel for extensive properties. MBDF and DF also generalize well to other compound spaces, as evidenced by the chemically more diverse QMugs dataset49 consisting of substantially larger molecules: After training on 16k molecules, MBDF reaches a MAE for atomization energies of 1.97 kcal/mol while still using only five dimensions per atom. Corresponding training and prediction times for both datasets are ∼2 compute node minutes for MBDF based models. Furthermore, our results indicate that using MBDF/DF brings the train/test timings of kernel based ML models to their “lower bound” as imposed by the kernel inversion bottleneck for both small and large molecules.
We have analyzed the comparative performance in terms of sampling cost vs training set size needed to reach chemical accuracy for predicting atomization energies in the QM9 dataset. MBDF has extended the corresponding optimal Pareto front toward minimal time requirements.
While the numerical evidence collected is indicative of a surprising effectiveness of MBDF, it is clear that the truncations used in this work may lead to a lack of uniqueness. However, neither within QM9 nor within the QMugs subset we sampled have we encountered a case where this small set of functionals maps two different molecular structures to the same feature vector. Furthermore, the likelihood of uniqueness can easily be increased by the inclusion of higher order derivatives and many-body terms. More rigorous numerical analysis would be desirable to quantify this trend.
Thanks to its numerical efficiency, we believe that this approach holds great promise for further accelerating the virtual discovery of novel molecules and materials. Furthermore, this framework provides a possible solution to the general problem of unfavorable scaling due to (i) the inclusion of higher order many-body and long range physics and (ii) the application of these ML models to larger molecules with greater chemical diversity. Future work will extend the representations described here to various types of conformers and assess their sensitivity to changes in molecular structure.
SUPPLEMENTARY MATERIAL
See the supplementary material for QM7b86 and MD1789–91 learning curves and a figure showing the transition of cyclohexane from chair to boat conformation alongside the response by its DF fingerprint. It also contains heat maps showing ideal KRR hyperparameters to be used with the MBDF representation and kernel PCAs of some molecules from the QM7b dataset.
ACKNOWLEDGMENTS
D.K. is thankful for discussions with B. Huang, S. Krug, D. Lemm, M. Sahre, and J. Weinreich. O.A.v.L. has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 772834). O.A.v.L. has received support as the Ed Clark Chair of Advanced Materials and as a Canada CIFAR AI Chair.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Danish Khan: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (lead); Validation (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (equal). Stefan Heinen: Conceptualization (supporting); Data curation (supporting); Resources (supporting); Supervision (supporting); Validation (supporting); Writing – original draft (supporting); Writing – review & editing (equal). O. Anatole von Lilienfeld: Conceptualization (supporting); Formal analysis (supporting); Funding acquisition (lead); Methodology (supporting); Project administration (lead); Resources (lead); Supervision (lead); Writing – original draft (supporting); Writing – review & editing (equal).
DATA AVAILABILITY
Python script for generating the MBDF and DF representations is available at https://github.com/dkhan42/MBDF. Train/test split of the QM7 dataset used for optimizing representation hyperparameters, other data and code for reproducing the results in this study, and some tools for performing KRR are openly available at https://github.com/dkhan42/QML.