The feature vector mapping used to represent chemical systems is a key factor governing the superior data efficiency of kernel based quantum machine learning (QML) models applicable throughout chemical compound space. Unfortunately, the most accurate representations require a high dimensional feature mapping, thereby imposing a considerable computational burden on model training and use. We introduce compact yet accurate, linear scaling QML representations based on atomic Gaussian many-body distribution functionals (MBDF) and their derivatives. Weighted density functions of MBDF values are used as global representations that are constant in size, i.e., invariant with respect to the number of atoms. We report predictive performance and training data efficiency that is competitive with state-of-the-art for two diverse datasets of organic molecules, QM9 and QMugs. Generalization capability has been investigated for atomization energies, highest occupied molecular orbital–lowest unoccupied molecular orbital eigenvalues and gap, internal energies at 0 K, zero point vibrational energies, dipole moment norm, static isotropic polarizability, and heat capacity as encoded in QM9. MBDF based QM9 performance lowers the optimal Pareto front spanned between sampling and training cost to compute node minutes, effectively sampling chemical compound space with chemical accuracy at a sampling rate of 48 molecules per core second.

Modern data-driven statistical Machine Learning (ML) models have emerged as powerful tools over the past decade for inferring quantum mechanical observables throughout chemical compound space, without explicitly solving electronic Schrödinger equations.1–3 Similar success was obtained for ML based interatomic potentials and force-fields4–9 as well as electronic structure modeling throughout Chemical Compound Space (CCS).10,11 For an entire set of extensive in-depth reviews on these and other related ML applications, we refer the reader to the recent special issue in Chemical Reviews.12,13 Various aspects in the development of ML model architecture and training protocols have proven to be essential for data efficiency. In particular, the molecular representation is known to have a strong impact on the performance of similarity based ML models, such as kernel ridge regression (KRR).14–16 This is not surprising as the representation controls the information about the systems, how it is weighted, and the consistency of these Quantum Machine Learning (QML) models with ab initio methods.17 These representations are non-linear mappings of the atomistic systems to a suitable Hilbert space where a statistical regression model can easily be applied. The Hilbert space constraint applies due to the requirement of measuring similarity in terms of an inner product.18 These mappings should have some desirable features of which the most important are (i) uniqueness such that different systems necessarily possess different representations14 and (ii) invariance with respect to transformations that leave the target property invariant, such as global translations, rotations, and atomic index permutations of the same chemical elements. Other desirable properties include (iii) an analytical and continuous form of the representation function, (iv) differentiability with respect to nuclear coordinates, charges, number of electrons, and number of atoms, (v) as general as the Hamiltonian, (vi) computationally efficient evaluation, and (vii) compact or even constant size, limiting the computational cost for larger systems.19

Due to their critical role, many representations have been introduced and investigated within the context of atomistic simulation.20–32 For recent comprehensive reviews, the reader is referred to Refs. 16 and 33. These representations can either describe the molecule as a whole (global) or each atom (local or atomic) separately. For the sake of brevity, we restricted most of our comparative benchmarks within this study to the following representations, which are commonly used to train QML models throughout CCS: Faber–Christensen–Huang–Lilienfeld (FCHL19) representation,34,35 Smooth Overlap of Atomic Positions (SOAP),36 spectrum of London and Axilrod–Teller–Muto potentials (SLATM),37 atom index sorted Coulomb matrix (CM),1 and its vectorized form, Bag of Bonds (BOB).38 Other representations/models tested are mentioned in Sec. III.

While these representations satisfy most of the aforementioned qualities, given the immense size of CCS, a more compact and scalable representation would still be desirable. Formally, the number of degrees of freedom of any material or molecule would prescribe usage of a 4M dimensional feature vector (three spatial coordinates and one nuclear charge coordinate for M atoms). However, all aforementioned representations when using optimal hyperparameters require a higher dimensional feature vector mapping in order to be accurate and training-data-efficient at regression tasks, and some (e.g., CM or BOB) even scale quadratically with M. While verbosity facilitates the inclusion of invariances, the 4M degrees of freedom suggest that the same performance can be obtained using more compact representations. This is especially an issue for kernel based ML models, where the size of the representation directly affects the distance/kernel evaluation time.18,34 Although the scaling for kernel inversion39 is larger [O(n3) for Cholesky solvers], for highly data-efficient (i.e., efficient in training data) QML models, it is the kernel generation and evaluation that consumes the most compute time, as demonstrated later. The kernel evaluation pre-factor becomes even worse when using atomic (or local) representations in conjunction with a suitable local kernel.40 Obvious solutions, such as reducing finite local cutoffs within the representation, come at the expense of reducing the predictive power or, conversely, increasing training data needs. As such, a computationally efficient yet accurate solution is desirable, as was shown in the discretization of the FCHL18 representation.34,35 Other solutions to this problem, such as sparse kernel models41 and the recently introduced conjugate gradient based iterative approach for training kernel based models,42 would also be well complemented by a compact molecular representation.

Herein, we propose a methodology for generating representations that minimize feature size. We use functionals of many-body distributions (MBDFs) and their derivatives to encode any local chemical environment of any atom in an integrated compact fashion. The representations thus generated preserve the system’s various symmetries (translation, rotation, and atom index invariance) and can be tailored to the physical property of interest through scaling functions. MBDFs are easily extendable to include higher order many-body interactions with minimal increase in size. In the current formulation, while including three-body interactions, MBDF scales as 5M. We further tackle the issue of storing this information in a manner that remains invariant to the number of atoms in a molecule. We do this by generating a discretized density (DF) function of MBDF values and using it as a global molecular representation. Using two diverse datasets, the performance of MBDF is tested against the aforementioned SOAP and FCHL19, which are commonly used and state-of-the-art, as well as the SLATM, BOB, and CM representations and a few other QML models mentioned later. Finally, we explore the bottleneck crossover from kernel evaluation to inversion.

We begin the discussion of our local representation using distribution functions over the internal coordinates, which can be constructed using the atomic coordinates.

An analytical and continuous distribution over the pair-wise internal coordinate defined as the inter-atomic distances (pair correlation function) is easily built using Gaussian probability density functions (PDFs) centered at each inter-atomic distance with respect to an atom i,
(1)
where ρi(r, Rij) is the normalized distribution with atom i as the origin, σr is the Gaussian length-scale (or variance) parameter, M denotes the total number of atoms in the system (or within a radial cutoff if employed), Rij denotes the inter-atomic distance between atoms i and j, and Zj is the nuclear charge. Scaling by nuclear charges defines elemental identities but could also be done in other ways, e.g., having different length-scale parameters σr for each unique chemical element or multiple dimensions, such as period and group specifications.43 In a similar fashion, a continuous distribution (triplet correlation function) over the three-body internal coordinate, inter-atomic angles θ, is defined as
(2)
where θjik is the inter-atomic angle centered on atom i. This can be generalized to define a continuous distribution, or correlation function, over any m-body internal coordinate τ.
Such analytical distribution functions of the internal coordinates have been used as descriptors of atomic environments in Atom Centered Symmetry Functions6,20 (ACSFs), their weighted variants wACSFs,23 and other similar methods.34,35,37 Instead of using these distribution functions as atomic descriptors, we define the MBDF representation as functionals of these m-body distributions, which leads to a more compact descriptor and allows inclusion of higher order terms with minimal change in size (a single scalar for each m-body term). With the two- and three-body distributions defined above, each atom can then be represented by the two zeroth-order functionals,
(3)
(4)
where rc denotes the radial cutoff distance and g0(r, Rij) and g0(θ, Rij, Rjk, Rki) denote two- and three-body weighting functions. Note that when the weighting functions g0(r, Rij) and g0(θ, Rij, Rjk, Rki) correspond to suitable two- and three-body potentials, the functionals F0(2) and F0(3) become the average of the corresponding two- and three-body inter-atomic interactions weighted by the pair and triplet correlation functions ρi(r, Rij) and ρi(θ,θjik), respectively. These functionals then form a coarse approximation to the average two- and three-body interactions experienced by a chemical species. Furthermore, we exploit the advantage of using the infinitely differentiable Gaussian PDFs to define higher order functionals, such as
(5)
(6)
with potentially different weighting functions g1(r) and g1(θ, Rij, Rjk, Rki). The derivative functionals are useful since a single functional does not uniquely determine an arbitrary distribution. These functionals also encode the change in the m-body distribution in an atom’s local neighborhood and have not been used in previous works involving internal coordinate distribution functions. For any nth derivative of the two-body distribution, we can define the functional
F_n^{(2)}[i] = \int_0^{r_c} g_n(r) \, \frac{\partial^n \rho_i(r)}{\partial r^n} \, dr \qquad (7)
where gn(r) is, again, a suitable radial weighting function. Generalizing this to all internal coordinates, a functional Fn(m)[i] can be defined over the nth derivative of any m-body distribution function centered at atom i,
(8)
where τ denotes the m-body internal coordinate, ρi is the m-body distribution function with respect to atom i, gn is the weighting function for the nth derivative of ρi, Hn denotes the Hermite polynomial of degree n, and N(τi1i2..im,στ2) denotes the normalized Gaussian distribution. The Hermite polynomials arise due to the use of Gaussian PDFs and allow convenient computation of n derivatives of the distribution function at any point τ.
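As a sketch of how such functionals can be evaluated in practice (the actual MBDF code integrates on a grid using the trapezoidal rule, as noted below), the snippet assumes a pair distribution already tabulated on a radial grid and uses placeholder weighting functions; the true g0 and g1 are the hyperparameterized forms of Eqs. (12)–(16), which are not reproduced here.

```python
import numpy as np

def two_body_functionals(rho, r_grid, g0, g1):
    """Zeroth- and first-order two-body functionals of a tabulated distribution rho(r),
    evaluated by trapezoidal integration (a sketch of Eqs. (3), (5), and (7))."""
    drho = np.gradient(rho, r_grid)              # numerical d(rho)/dr on the grid
    F0 = np.trapz(g0(r_grid) * rho, r_grid)      # F_0^(2)[i]
    F1 = np.trapz(g1(r_grid) * drho, r_grid)     # F_1^(2)[i]
    return F0, F1

# Placeholder weighting functions, for illustration only (not the optimized forms of Eqs. (12)-(16)):
g0 = lambda r: np.exp(-2.0 * r)                  # simple exponential decay
g1 = lambda r: 1.0 / (1.0 + r) ** 6              # softened inverse power law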
We note here that an alternative way to describe a (bounded) distribution in a compact form is through a moment expansion of the form
(9)
where G(m)[i] denotes the mth moment of the distribution centered at atom i. The set of m moments G(m)[i] would then form the local representation of the atom i. These moments can also be evaluated by placing a set of Gaussians (or any basis functions) on each atom i and then evaluating the moments of the atomic density ρi(r) with respect to each atomic position within a radial cutoff rc,
(10)
where |.| is any metric. This form has the advantage of being independent of many-body orders, and the computational cost of evaluating these moments scales solely with the number of atoms within the cutoff radius rc. The integral can be simplified in spherical polar coordinates by expanding the density ρi(r) in a basis set composed of spherical harmonics Ylm and orthogonal radial functions un (which is a common practice36),
(11)
Throughout our work, we use the derivative formalism from Eq. (8) since our numerical results indicated superior performance compared to the moment expansion in the internal coordinates [Eq. (9)]. However, the moment expansion in Eq. (10), being independent of many-body terms, offers a promising alternative and could be the subject of future work.
We have tested multiple weighting functions to identify the best combination. In particular, since these functionals correspond to correlation function averages of m-body interactions, we have tested the harmonic, Morse,44 and Lennard-Jones45 potentials and simple decaying functions (power laws, exponential, Gaussian decays, and their combinations) for two-body terms. For three-body terms, we have tested cosine harmonic, Axilrod–Teller,46 and Stillinger–Weber47 potentials and a scaled Fourier series. Through trial and error combined with cross-validated hyper-parameter optimization, we have identified the following five suitable functionals corresponding to two- and three-body distributions,
(12)
(13)
(14)
(15)
(16)
where η, α, an, and the various power laws are all hyperparameters of the representation. Note that the scaling functions of the MBDF contributions F0(2),F1(2),F0(3),F1(3) are, respectively, reminiscent of a Buckingham-type potential, a softened London dispersion potential, a Fourier series scaled by Lennard-Jones repulsion, and an Axilrod–Teller–Muto potential scaled by Lennard-Jones repulsion. The specific reason as to why this particular selection of weighting functions has proven advantageous will be the subject of future research.

Figure 1 shows the effect of each functional on the learning capacity of MBDF for the task of predicting atomization energies of the QM948 dataset. It is apparent that both the derivative and many-body terms improve the learning capacity, albeit by different magnitudes.

FIG. 1.

MBDF based QML learning curves using concatenated increasingly higher order and body functionals [Eqs. (12)–(16)]. Mean absolute error (MAE) of predicting atomization energies of the QM948 dataset is shown as a function of training set size N.


Throughout our testing on the QM7,1 QM9,48 and QMugs49 datasets, we have found these two- and three-body functionals to be sufficient at discriminating between all of the molecular structures. Cases where the two-body information does not suffice include homometric pairs, as already discussed many years ago.14 Even three- and four-body information does not suffice for some cases, as recently discussed in Ref. 50. We note here that, whenever necessary, arbitrarily higher order derivative and many-body information could also be included in MBDFs at minimal increase in size, i.e., one additional term per order [see Eq. (8)]. In particular, we believe that the inclusion of the four-body term as a functional of the dihedrals could further improve the learning capacity for conformational isomers. Inclusion of four-body information has been shown to result in further improvements of learning curves.51 Note that the size of MBDF is invariant to the cutoffs used, which can be raised to arbitrarily higher values while employing a suitable long-range functional (hence increasing the farsightedness of the representation) without affecting the kernel evaluation cost. We further note that other weighting functions and response terms could also be useful for QML models of physical observables, such as dipole moments, vibrational frequencies, and heat capacities.

MBDF is a local representation and its size scales linearly with the number of atoms in the system. In order to eliminate this scaling, we can transform to the frequency space of MBDF functional values. The frequencies can be evaluated by normalizing the functional values to lie within an arbitrary range and then using, e.g., kernel density estimation.52 For a finite set of MBDF functional values {X_i}_{i=1}^{5M} over the range [a, b], a “smooth histogram” of their frequencies can be constructed by placing a set of kernel functions K at each point Xi,
(17)
where f(u) gives the density of the samples at any point u ∈ [a, b]. If the set {X_i}_{i=1}^{5M} are the MBDF functional values for any molecule, their distribution density can be evaluated using Eq. (17). The density f(u) can then be used as a global molecular representation whose size is independent of the number of MBDF functional values and number of atoms by extension. The (dis-)similarity between two molecules A and B can be evaluated as, e.g., the l2-distance,
D(A, B) = \left[ \int_{-c}^{c} \left( f_A(u) - f_B(u) \right)^2 \, du \right]^{1/2} \qquad (18)
where [−c, c] is the normalization range chosen as [−10, 10] in our work. The form of the density function used in our work is
(19)
where Xi are MBDF functionals for molecule A and σb is the variance of the Gaussian function and is a hyperparameter (also called bandwidth). Comparing with Eq. (17), the function K is defined as
(20)
Note that this is a divergence53 but not a kernel function because it is asymmetric,
K(u, X_i) \neq K(X_i, u) \qquad (21)
This function is used because it weights the MBDF functional frequencies by the functional value itself, resulting in the distance measurement [Eq. (18)] being weighted by the difference in functional values. Another advantage of this function is that it eliminates the frequency of null values (or “ghost atoms”) within the MBDF representation, which might be present due to the procedure of zero-padding.1 
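A minimal sketch of this construction is given below: a value-weighted Gaussian density of normalized MBDF values on a fixed grid, and the l2-distance between two discretized fingerprints. Names and the exact weighting and normalization are our assumptions for Eqs. (18)–(20); the reference implementation may differ in these details.

```python
import numpy as np

def df_fingerprint(X, grid, sigma_b=0.07):
    """Value-weighted density of MBDF functional values on a fixed grid (a sketch of Eq. (19)).
    Zero-valued 'ghost atom' entries contribute nothing because of the X_i prefactor."""
    f = np.zeros_like(grid)
    for Xi in X:
        f += Xi * np.exp(-(grid - Xi) ** 2 / (2 * sigma_b ** 2))
    return f

def df_distance(fA, fB, grid):
    """l2-distance between two discretized DF fingerprints (a sketch of Eq. (18))."""
    return np.sqrt(np.trapz((fA - fB) ** 2, grid))

# Usage sketch: grid over the normalization range [-10, 10] with spacing 0.2.
grid = np.linspace(-10.0, 10.0, 101)
```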
In our work, we generate a separate density function f(u) for each of the five MBDF functionals in Eqs. (12)–(16) and for each unique chemical element present in the dataset. These are then concatenated to form the global representation of the molecule. Alternatively, it could be done by using multivariate Gaussian functions for the density estimation. Let xi denote the five-dimensional vector of MBDF functional values [Eqs. (12)–(16)] for any atom i in molecule A. Then, the multivariate density function fA(u) of this molecule can be evaluated as:
(22)
(23)
The l2-distance between molecules A and B then takes the form
(24)
where the integral is over the normalization region. Since the former method generates a more compact representation, we have chosen to work with it. The abbreviation DF (Density of functionals) will be used throughout for this global representation.

The DF method allows generating a representation that does not scale with the number of atoms in the system. However, in order to use it as a feature vector, the density functions have to be discretized. Through convergence testing, we have set the grid spacing to 0.2 throughout our work. However, we note that this grid spacing could be changed for a different dataset in order to achieve the desirable accuracy vs computational cost trade-off. Furthermore, DF corresponds to a flattened feature vector, which can be used with global kernels (or other ML methods) and which exhibits superior performance when compared to a straightforward concatenation of all MBDF rows (see Fig. 2). The flattened MBDF representation is generated by sorting the MBDF matrix of each molecule by the row norm and then flattening the matrix by concatenation of the rows to vectorize it.1 
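The flattened MBDF baseline used in Fig. 2 can be sketched as follows; the sorting direction and zero-padding layout are assumptions of ours, not taken from the reference code.

```python
import numpy as np

def flatten_mbdf(M, max_atoms):
    """Sort the per-atom MBDF rows by their norm, zero-pad to a fixed number of atoms,
    and concatenate the rows into a single feature vector."""
    order = np.argsort(np.linalg.norm(M, axis=1))[::-1]   # largest row norm first (assumed)
    padded = np.zeros((max_atoms, M.shape[1]))
    padded[: M.shape[0]] = M[order]
    return padded.ravel()
```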

FIG. 2.

QML learning curves for MBDF with local kernel, DF (global), and flattened MBDF (global and sorted by row norm). Mean absolute error (MAE) of predicting atomization energies of the QM948 dataset as a function of training set size N.


Figure 3 shows molecular fingerprints generated using the one and five functional DF representations for three diverse and relevant organic molecules (glucose, uric acid, and testosterone) on the same grid. For each molecule, a distinct fingerprint is obtained, with peak-positions depending on the local chemical environment of each atom. Consequently, peaks of atoms with chemically similar environments are located closer to each other. Peak heights encode both number and type (because of the density estimate being weighted) of chemical environments [See Eq. (20)]. Figure 3 demonstrates that for molecules with increasing size, corresponding DF based fingerprints will grow in magnitude, not in size. In the supplementary material, we also show how DF fingerprints distinguish conformational isomers, as exemplified for the chair and boat conformations of cyclohexane.

FIG. 3.

Two-body (left) and three-body (right) DF version [Eqs. (12) and (15)] representations for glucose, uric acid, and testosterone.


The ML method that we focus on, and use throughout this work, is the supervised learning method called Kernel ridge regression18,39 (KRR). This method has been covered extensively earlier,1–3,34,35 so we skip the details here.

The kernel functions54,55 we use in our work along with global representations are the Gaussian kernel,
k(x_I, x_J) = \exp\!\left( -\frac{\lVert x_I - x_J \rVert_2^2}{2\sigma^2} \right) \qquad (25)
and the Laplacian kernel,
k(x_I, x_J) = \exp\!\left( -\frac{\lVert x_I - x_J \rVert_1}{\sigma} \right) \qquad (26)
where xI denotes the representation vector of molecule I.
The kernel function used for the local representations FCHL19, SOAP, and MBDF is a summation of atomic kernels,
k(M_I, M_J) = \sum_{a \in I} \sum_{b \in J} \delta_{Z_a, Z_b} \, k(x_{Ia}, x_{Jb}) \qquad (27)
with the local Gaussian kernel,
k(x_{Ia}, x_{Jb}) = \exp\!\left( -\frac{\lVert x_{Ia} - x_{Jb} \rVert_2^2}{2\sigma^2} \right) \qquad (28)
the local Laplacian kernel,
k(x_{Ia}, x_{Jb}) = \exp\!\left( -\frac{\lVert x_{Ia} - x_{Jb} \rVert_1}{\sigma} \right) \qquad (29)
or the local exponential kernel with the Euclidean norm,
k(x_{Ia}, x_{Jb}) = \exp\!\left( -\frac{\lVert x_{Ia} - x_{Jb} \rVert_2}{\sigma} \right) \qquad (30)
where MI denotes the representation matrix of molecule I, xIa denotes the representation vector of atom a within molecule I, and δZa,Zb denotes a Kronecker Delta over the nuclear charges Za, Zb, which restricts the similarity measurement between atoms of the same chemical element.34 Other kernel functions will also be tested in the future.40,56,57
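A direct, if slow, Python sketch of the global Gaussian kernel and of a local kernel with the element-restricting Kronecker delta reads as follows; optimized implementations (e.g., in QMLcode) vectorize these loops.

```python
import numpy as np

def gaussian_kernel(xI, xJ, sigma):
    """Global Gaussian kernel between two molecular feature vectors (cf. Eq. (25))."""
    return np.exp(-np.sum((xI - xJ) ** 2) / (2 * sigma ** 2))

def local_gaussian_kernel(MI, MJ, ZI, ZJ, sigma):
    """Sum of atomic Gaussian kernels, restricted to atoms of identical nuclear charge
    via the Kronecker delta (cf. Eqs. (27) and (28))."""
    k = 0.0
    for xa, Za in zip(MI, ZI):
        for xb, Zb in zip(MJ, ZJ):
            if Za == Zb:                                  # delta_{Z_a, Z_b}
                k += np.exp(-np.sum((xa - xb) ** 2) / (2 * sigma ** 2))
    return k
```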
Throughout this study, we evaluate the performance of ML methods through learning curves for the task of predicting physical properties of molecular systems. Learning curves quantify the model prediction error ɛ [often measured as mean absolute error (MAE)] against the number of training samples N and are key to understanding the efficiency of ML models. It is generally known18,58,59 that they are linear on a log–log scale,
\log(\varepsilon) \approx \log(I) - S \log(N) \qquad (31)
where I is the initial error and S is the slope indicating model improvement, given more training data. We also note that according to the central limit theorem, the distribution of the errors approaches the normal distribution with standard deviation σN and mean 0 as N → ∞. Hence, Eq. (31) becomes
(32)
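In practice, the offset I and slope S of Eq. (31) can be extracted by a linear fit of log(MAE) against log(N); the sketch below uses made-up numbers purely to illustrate the procedure.

```python
import numpy as np

def learning_curve_fit(N, mae):
    """Fit log(MAE) vs log(N) to a line and return (I, S) of Eq. (31), i.e. MAE ~ I / N**S."""
    slope, intercept = np.polyfit(np.log(N), np.log(mae), 1)
    return np.exp(intercept), -slope

# Illustrative placeholder values only (not measured data):
N = np.array([1000, 2000, 4000, 8000])
mae = np.array([4.0, 2.8, 2.0, 1.4])        # kcal/mol
I, S = learning_curve_fit(N, mae)
```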

The current form of the representations has been optimized for kernel based learning models. It depends on the weighting functions used and a number of hyperparameters that include the variances (σr, σθ) of the Gaussian PDFs, the weighting function hyperparameters mentioned in Eqs. (12)–(16), and the bandwidth σb for the DF representation. The hyperparameter optimization was done on a random subset of two thousand molecules from the QM7 dataset1 and then kept fixed for all other datasets. We note that further improvements might be possible if they had been optimized simultaneously on all datasets. The weighting functions gn(τ) in Eq. (8) were chosen by straightforward screening of the functions mentioned earlier. The optimization minimized the atomization energy prediction errors on the QM7 subset using Gaussian Process (GP) based Bayesian Optimization (BOpt).39 Starting with a Gaussian prior, the method fits a posterior distribution over the objective function using successive function evaluations. The posterior distribution is then used to construct an efficient acquisition function, which can be optimized using, for instance, a quasi-Newton method to determine the next query point. Table I shows the optimized hyperparameter values used throughout this work.

TABLE I.

MBDF and DF hyperparameters after optimization on atomization energies of the QM71 subset.

Parameter           Value
σr
σθ
η                   10.8
α                   1.5
a0, a1, a2, a3      3, 100, −200, −164
σb                  0.07
rc (Å)              6

We used the scikit-optimize60 implementation of GP based BOpt using the default Matérn kernel with unit variance and the limited memory BFGS optimizer61 for the acquisition function. In order to enable comparison to other representations (such as FCHL or SOAP) that rely on distance cutoffs, we have chosen to set inter-atomic distance cutoff rc for MBDF to 6 Å throughout this work. We note that larger cutoffs for MBDF would not change its size.
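A sketch of such a search with scikit-optimize is shown below. The objective function, search ranges, and number of calls are placeholders of our own choosing; in the actual workflow, the objective would build MBDF/DF with the proposed hyperparameters, train KRR on the 2k-molecule QM7 subset, and return the cross-validated MAE.

```python
from skopt import gp_minimize
from skopt.space import Real

def objective(params):
    """Stand-in for the cross-validated atomization energy MAE on the QM7 subset."""
    sigma_r, sigma_theta, eta, alpha, sigma_b = params
    # Replace with: build MBDF/DF, train KRR, return validation MAE.
    return (sigma_r - 1.0) ** 2 + (sigma_b - 0.07) ** 2   # dummy surrogate

space = [Real(0.1, 10.0, name="sigma_r"),      # search ranges are illustrative assumptions
         Real(0.1, 10.0, name="sigma_theta"),
         Real(0.1, 50.0, name="eta"),
         Real(0.1, 10.0, name="alpha"),
         Real(0.01, 1.0, name="sigma_b")]

result = gp_minimize(objective, space, n_calls=50, acq_optimizer="lbfgs")
best_hyperparameters = result.x
```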

All MBDF functionals were evaluated using the trapezoidal numerical integration method. The grid spacing for discretizing DF densities has been set to 0.2 throughout our work as noted earlier. The bandwidth σb = 0.07 was found to work well on the QM7 subset; however, it is recommended to screen it once (in the range ∼[0.01, 1]) along with the grid spacing when working with other datasets. The Numpy62 and Numba63 libraries are used in the representation generation code.

The QM9 dataset48 consists of ∼134k small organic molecules with up to nine heavy atoms (C, N, O, and F). The calculations were performed at the B3LYP/6-31G(2df, p)65–66 level of theory.

QMugs is a dataset containing ∼665k biologically and pharmacologically relevant drug-like molecules. It consists of larger molecules than QM9 with up to 100 heavy atoms (C, N, O, F, P, S, Cl, Br, or I) per molecule. The training and predictions were performed on the DFT (ωB97X-D/def2-SVP)67,68 values reported in the dataset. The QMugs subsets we used for Fig. 5 were drawn at random and consist of 20k molecules. Throughout, we used zero-padding for all representations studied in order to accommodate training and test molecules smaller than the maximum present in the data.

In order to keep the FCHL19 and SOAP kernel evaluations computationally tractable, we have (a) restricted ourselves to a maximum of 100 atoms per QMugs molecule and (b) reduced the default hyperparameters of the FCHL19 and SOAP representations to nRs2 = 12, nRs3 = 10, rcut = 6 Å and nmax = lmax = 3, σ = 0.1, rcut = 6 Å, respectively. For consistency, we used the same parameters for all other results reported in this article. These two versions of the representations with the reduced basis sets are denoted as FCHL19* and SOAP* in all reported figures. Note that the latter choice of hyperparameters negligibly deteriorates the predictive accuracy for QM9 (as assessed below and when comparing to the published prediction errors on QM9 for FCHL19 and SOAP2013). For the FCHL19 and SOAP based prediction errors reported herein for QMugs, the accuracy might improve further if these parameters were optimized.

Figure 4 also includes results for QML models based on the two- (k = 2) and three-body (k = 3) Many-Body Tensor Representations (MBTR)22 and a variant69 of the Atom Centered Symmetry Functions20 (ACSF) as implemented in the QMLcode library.70 The MBTR representations were generated with the same hyperparameters as those used in Ref. 33 for the 10k training point on the QM9 dataset.

FIG. 4.

MBDF/DF performance and comparison on atomization energies from the QM9 dataset (drawn at random from ∼134k organic molecules).48 (a) (Left) Mean absolute error (MAE) of prediction as a function of training set size for representations CM,1 BOB,38 MBTR,22 SLATM,37 PI,81 ACSF,20,69 FCHL19,34 and SOAP.36 Numbers in legend denote representation size (feature vector dimensions) and G and L denote Global and Local kernels, respectively. (b) (Right) Timing for training and testing as a function of training set size required for making chemically accurate (MAE = 1 kcal/mol) predictions on 100k QM9 molecules. Blue, red, and green points indicate local kernels, global kernels, and neural network, respectively. The dashed gray line corresponds to the optimal Pareto front. For SchNet,73 PaiNN,74 and Allegro,76 an Nvidia RTX 3070 GPU has been used. All other timings were evaluated on a compute node equipped with a 24-core AMD EPYC 7402P @ 3.3 GHz CPU and 512 GB RAM. Timings for FCHL18,35 BOB, and CM are estimated using smaller kernels (not taking into account kernel inversion). Asterisk denotes representations with reduced hyperparameters used in this work. N values for ACSF, MBTR, BOB, and CM estimated via extrapolation. Numbers in brackets denote year of publication.


Throughout our work, the FCHL19, SLATM, and ACSF representations were generated using the QMLcode library,70 SOAP was generated using the Dscribe library71 with the default Gaussian type radial basis functions, MBTR was generated using the qmmlpack library,72 and SchNet73 and PaiNN74 were built and trained using the SchNetPack75 library. Allegro76 was built and trained using the nequip77 code78 based on the E(3)-NN framework.79,80

For FCHL19, SOAP, and ACSF, we employ the local Gaussian [Eq. (28)] kernel; for SLATM, MBTR, and DF, we use the global Gaussian [Eq. (25)] kernel; and for CM and BOB, we use the global Laplacian [Eq. (26)] kernel. For MBDF, we used the local Gaussian kernel on the MD17 dataset and the local exponential [Eq. (30)] kernel for all other models. These choices were made based on the best performances for each representation. All kernel evaluations were performed using the QMLcode library.

1. QM9

Figure 4(a) shows learning curves for QM9 and the size (dimension of feature vector) of the representation arrays in the legend. For the task of predicting atomization energies, local representations have previously been shown to be more efficient.34,35 Results for the local representations FCHL19 and SOAP are closely reproduced and reach chemical accuracy after training on less than 10k molecules.34 Among the global representations, SLATM has previously also been shown34,37 to perform quite well, reaching chemical accuracy after training on ∼9k molecules, although it shows a smaller slope at larger training sizes. This is closely followed by MBDF, which reaches chemical accuracy after training on ∼10k molecules (less than 10% of the dataset). The global DF representation also performs decently well, reaching chemical accuracy at ∼32k training data. The local ACSF representation shows a larger offset but a better slope, and it reaches chemical accuracy at ∼50k training set size. We note here that for consistency with the other representations used in this work, we did not optimize the hyperparameters of the MBTR representation on every training point but rather kept them fixed throughout. Only the KRR hyperparameters were optimized at each training set size as with all other representations used here. The three-body MBTR reaches chemical accuracy at ∼60k training set size, while the two-body MBTR performs better than the other two-body representations, BOB and CM. We have also included the recently introduced constant size Persistence Images81 (PI) representation for comparison.

Note that MBDF has the smallest size, requiring only five numbers per atom (145 dimensions for QM9 molecules). By contrast, other local representations, such as FCHL19 and SOAP, require ∼400 numbers per atom, while ACSF uses a 150 dimensional feature vector per atom. Encouragingly and despite its compact size, MBDF outperforms most of the other larger representations with the exception of SOAP and FCHL. We note here that while the size of the global DF representation is larger than MBDF, it utilizes a global kernel, implying training and prediction cost invariance with respect to the system size.

This compactness of the representation translates into faster ML model evaluation timings. This is shown in Fig. 4(b), which plots the trade-off between training and prediction timings vs training data needs for reaching mean absolute prediction errors of atomization energies of 1 kcal/mol. We note that there are only two representations located on the Pareto-front, FCHL18,35 and MBDF (this work). We also point out that currently the best performing model on the QM9 dataset is the recently proposed Wigner kernel method,82 which is not included in this study.

As noted earlier, local kernels based on representations, such as FCHL18/19, SOAP, or ACSF, exhibit very good training data efficiency, but this comes at the expense of a larger computational overhead. The exception is the local MBDF based kernel, which achieves the fastest training timing of ∼0.07 compute node minutes (14k training molecules) due to its compact size. Predictions on 100k QM9 molecules using the local MBDF kernel are made in ∼1.46 compute node minutes, which translates to an unprecedented chemically accurate navigation rate of ∼1140 molecules/second. SLATM, being close to the Pareto front, and the DF representation both represent fast global kernel based KRR models. While requiring more training data than SLATM in order to reach chemical accuracy, DF has the advantage that it is largely invariant with respect to system size (see below). For the sake of completeness, Fig. 4(b) also includes results for the deep learning based models SchNet,73 PaiNN,74 and the current state-of-the-art, Allegro,76 which were all trained on a GPU. The reported timing for SchNet refers to 3000 epochs of training on 20k molecules and predictions on 100k molecules, taking about 3 h 27 min and 7 s, respectively. For PaiNN, the reported timing corresponds to 1000 epochs of training on 7k molecules, which took 0 h 56 min, and a prediction time of 9 s. Allegro reached chemical accuracy after training on 4500 molecules for 728 epochs, with early stopping, which took 17 h 8 min, while the prediction on 100k molecules took ∼7 min (using the maximum possible batch size of five on the used GPU).

MBDF achieves the fastest training time out of all models and the fastest prediction rate among the kernel based models. Numerical results for porting FCHL19 based KRR to GPUs83 suggest that the prediction rate is likely to increase significantly once MBDF is reimplemented in CUDA. Figure 2 in the supplementary material shows the training and prediction times separately for some of these methods.

2. QMugs

We have tested the generalizability of our method to larger systems and more diverse chemistries using the QMugs49 dataset. Figure 5 shows the atomization energy learning curves. Due to the large variety in the dataset, the predictive error is larger for all representations than their QM9 counterparts even when predicting on a much smaller test set. MBDF reaches ∼2 kcal/mol prediction error after training on 16k molecules. This is better than the QML based neural network predictions published in Ref. 84 and similar to the Δ-QML numbers they also report. In terms of speed, generating the local MBDF kernel for training and testing on 20k molecules of this dataset takes ∼1.8 compute node minutes (see below), which corresponds to a navigation rate of ∼185 molecules/second. By comparison, this is substantially faster than the GPU based prediction rates of ∼50 and five molecules per second for the direct and Δ-learning (using GFN2-xTB85) based ML models, respectively, using the convolutional neural network reported in Ref. 84. Only SLATM and FCHL19 exhibit a lower offset than MBDF, while the performance for SOAP and DF is similar, albeit slightly worse than MBDF. As mentioned before, however, in order to make FCHL19 and SOAP tractable, we have dramatically reduced their hyperparameters. In particular, we believe that the learning efficiency of SOAP for QMugs is being reduced due to the use of small basis sets (nmax = lmax = 3). Note that no representation reaches chemical accuracy within 16k training molecules, indicating that QMugs possesses substantially more chemical diversity than QM9.

FIG. 5.

MBDF/DF performance and comparison to CM,1 BOB,38 SLATM,37 FCHL19,34 and SOAP36 representation based atomization energy estimates using the QMugs (∼667k organic molecules with up to one hundred heavy atoms) dataset.49 Training and testing data drawn at random. Prediction mean absolute errors (MAEs) on the holdout test set of 4k molecules shown as a function of training set size. Numbers in legend denote representation size (feature vector dimensions), and G and L denote Global and Local kernels, respectively. The shaded region denotes the standard deviation across four different learning curves (except for FCHL19 and SOAP for which only one learning curve was tractable). Asterisk denotes representations with reduced hyperparameters used in this work.


In terms of representation sizes, MBDF again remains the smallest representation since it still requires only five dimensions per atom regardless of the chemical composition. However, being a local representation, its size increases on average ∼3.4 times compared to QM9. FCHL19 and SOAP, on the other hand, now require more than 1000 dimensions to represent each atom for this larger dataset. CM, BOB, FCHL19, and SOAP show a larger than tenfold increase in representation size compared to QM9, followed by SLATM, which shows an increase of ∼6.6 times. This results in a considerable increase in the train/test time (vide infra), which precludes the straightforward application of these representations to the entire QMugs dataset. The DF representation changes the least in size since it does not formally scale with the number of atoms but only with the number of different chemical elements. Consequently, its size doubles compared to QM9 since a separate density function is generated for each unique chemical element in the dataset using Eq. (19) as mentioned earlier.

3. QM7b and MD17

Figure 3 in the supplementary material shows learning curves for the QM7b86,87 atomization energies and the size (dimension of feature vector) of the representation arrays in the legend. Similar trends in performance and representation size noted so far are observed on this dataset as well.

Figure 4 in the supplementary material shows learning curves of energies for a few molecules from the revised88,89 MD1790,91 molecular dynamics dataset. Although the comparative performance trend on this dataset is similar to the others, we note that the chosen functionals for MBDF and the current representation hyperparameters are not optimized for potential energy surface learning. Furthermore, the implementation of gradients for MBDF to enable force based learning should significantly enhance the performance for such tasks, as was shown88 for the FCHL19 representation. However, the relatively larger difference in performance between MBDF and FCHL19 on this dataset might indicate that compact representations are sufficient for regression through equilibrium structures across CCS, while more verbose representations are required for potential energy surface learning.

Figure 6 shows scaling plots of kernel evaluation timings across both the QM9 and QMugs datasets for various representations and as a function of training set size. As one would expect, the offset increases systematically with increasing representation and training molecule size. More specifically, for the larger molecules of QMugs, the FCHL19 and SOAP kernel evaluations quickly become intractable, with the 16k kernel (Fig. 5) already taking an entire compute node day. Encouragingly and in stark contrast, DF, being a size invariant representation, shows hardly any change in computational overhead when moving from the small QM9 molecules to the much larger QMugs molecules.

FIG. 6.

Compute node time required for kernel inversion and evaluation as a function of training set size N drawn from the QM9 (squares) and QMugs (diamonds) datasets. QML results shown for SOAP,36 FCHL19,34 SLATM,37 MBDF, and DF. Dotted lines indicate extrapolation [using quadratic (kernel evaluation) and cubic (kernel inversion) polynomial fits]. G and L denote Global and Local kernels, respectively. Compute node: 24-core AMD EPYC 7402P @ 3.3 GHz CPU with 512 GB RAM.


For context, the time required in the kernel inversion step for each training kernel is shown as well. The bottleneck crossover from kernel generation (O(n2)) to the inversion step (O(n3)) occurs rather late. When using Cholesky decomposition, it occurs for MBDF at N ∼ 64k training molecules. For less compact representations (for SLATM and larger), the same crossover occurs at training set sizes that exceed ∼1M. As demonstrated above, chemical accuracy is already achieved at substantially smaller training set sizes. Consequently and contrary to popular belief, for any of the more modern and accurate representations, kernel inversion does not constitute a bottleneck.
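The crossover point can be estimated from measured timings by fitting the quadratic and cubic prefactors, as in the extrapolations of Fig. 6; the sketch below is a rough estimate that ignores lower-order terms and hardware effects.

```python
import numpy as np

def crossover_estimate(N, t_kernel, t_inversion):
    """Fit t_kernel ~ a*N**2 and t_inversion ~ b*N**3 to measured timings and
    return the training set size N = a/b at which inversion overtakes kernel evaluation."""
    N = N.astype(float)
    a = np.mean(t_kernel / N ** 2)        # quadratic prefactor of kernel evaluation
    b = np.mean(t_inversion / N ** 3)     # cubic prefactor of Cholesky-based inversion
    return a / b
```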

Table II reports the kernel evaluation and representation generation timings for both the QM9 dataset (130k molecules) and the QMugs subset (20k molecules) used in our work. It can be seen that MBDF reduces the local kernel evaluation timings from days to a few minutes for both small and large molecules. For the representation generation step, we note that our code is currently written in Python and uses the Numba63 library and could be further optimized with a low-level implementation. However, even the current timings do not affect the overall QML model cost much, given the kernel evaluation bottleneck.

TABLE II.

Compute node times for generating representations (trep) and kernel matrices (tkernel) for 130k molecules from the QM9 dataset and 20k molecules from the QMugs dataset. Global and Local kernels are again denoted by (G) and (L), respectively. Representations with superscript a were generated with QMLcode70 library and b with the Dscribe71 library. Compute node: 24-core AMD EPYC 7402P @ 3.3 GHz CPU and 512 GB RAM.

                    QM9 (130k molecules)                          QMugs (20k molecules)
Representation      trep (min)   tkernel (min)   Dimension        trep (min)   tkernel (min)   Dimension
CMa (G)             0.186        2.862           435              0.012        1.146           5 050
BoBa (G)            0.216        7.296           1 128            1.362        3.396           18 528
SLATMa (G)          18.60        86.32           11 960           15.76        14.58           79 045
FCHL19a (L)         0.846        1071            10 440           1.764        1566            122 000
SOAPb (L)           0.216        1873            13 920           0.246        2925            186 000
MBDF (L)            1.626        11.81           145              4.182        1.848           500
DF (G)              2.262        12.16           2 500            2.442        0.996           5 000

We assessed the generalization capacities of MBDF/DF on physical properties other than atomization energies. Figure 7 shows the learning curves for the task of predicting eight important molecular quantum properties from the QM9 dataset. These include the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) eigenvalues and the HOMO-LUMO gap, internal energy at 0 K (U0), dipole moment norm (|μ|), static isotropic polarizability (α), zero point vibrational energy (ZPVE), and heat capacity (Cv). Due to substantial computational costs, the KRR hyper-parameters were not optimized at each training size for the FCHL19 and SOAP representations, and we picked the same parameters as those used for learning atomization energies in Fig. 4. We reproduce earlier trends among intensive and extensive properties when using local/global kernels.3,92 MBDF and DF match this trend: They perform better on extensive and on intensive properties, respectively.

FIG. 7.

Learning curves for various representations in QML models of highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) eigenvalues, HOMO–LUMO gap (Δϵ), internal energy at 0 K (U0), dipole moment norm (|μ|), static isotropic polarizability (α), zero point vibrational energy (ZPVE), and heat capacity (Cv) from the QM9 dataset. G and L denote Global and Local kernels, respectively. Asterisk denotes representations with reduced hyperparameters used in this work.


Again, note that the performance of MBDF/DF on these properties could be further improved by including different functionals suited to the property or by augmenting them with response terms as was done for the FCHL19 representation.92 It would also be interesting to see how the learning capacities across these different physical properties are affected by the inclusion of higher order functional terms.

We have introduced compact atomic local many body distribution functional (MBDF) and global density of functionals (DF) representations for use within kernel ridge regression based Quantum Machine Learning (QML) models for rapid sampling of chemical compound space. MBDF/DF can accurately describe any atom/molecule using a minimal number of discrete elements and thereby reduce QML training and prediction times by multiple orders of magnitude for small and large molecules. MBDF and DF correspond to functionals of analytical weighted many body distributions in interatomic distances and angles, as well as their derivatives. Chemical identity is encoded as a prefactor to the atomic functionals. DF is a weighted density estimation of MBDF, i.e., a global molecular fingerprint that is invariant with respect to the number of atoms (not number of chemical elements).

We have demonstrated predictive power competitive with the state-of-the-art for a variety of quantum physical properties, as encoded in the QM9 dataset.48 On the QM9 dataset, MBDF reaches a MAE of atomization energies of only 0.69 kcal/mol after training on 32k molecules while using only five dimensions to describe an atom. Regarding different molecular properties, it is beneficial to use the DF representation along with a global kernel for intensive properties and MBDF along with a local kernel for extensive properties. MBDF and DF also generalize well to other compound spaces, as evinced for the chemically more diverse QMugs dataset49 consisting of substantially larger molecules: After training on 16k molecules, MBDF reaches a MAE for atomization energies of 1.97 kcal/mol while still using only five dimensions per atom. Corresponding training and prediction times for both datasets are ∼2 compute node minutes for MBDF based models. Furthermore, our results indicate that using MBDF/DF brings the train/test timings of kernel based ML models to their “lower-bound” as imposed by the kernel inversion bottleneck for both small and large molecules.

We have analyzed the comparative performance in terms of sampling cost vs training set size needed to reach chemical accuracy for predicting atomization energies in the QM9 dataset. MBDF has extended the corresponding optimal Pareto front toward minimal time requirements.

While the numerical evidence collected is indicative of a surprising effectiveness of MBDF, it is clear that the truncations used in this work may lead to a lack of uniqueness. However, neither within QM9 nor within the QMugs subset we sampled have we encountered a case where using this small set of functionals maps two different molecular structures to the same feature vector. Furthermore, the likelihood of uniqueness can easily be increased by inclusion of higher order derivatives and many-body terms. More rigorous numerical analysis would be desirable to quantify this trend.

Thanks to its numerical efficiency, we believe that this approach holds great promise for further accelerating the virtual discovery of novel molecules and materials. Furthermore, this framework provides a possible solution to the general problem of unfavorable scaling due to (i) inclusion of higher order many body and long range physics and (ii) applying these ML models to larger molecules with greater chemical diversity. Future work will deal with the extension of the representations as described to deal with various types of conformers and an assessment of its sensitivity to changes in molecular structures.

See the supplementary material for QM7b86 and MD1789–91 learning curves and a figure showing the transition of cyclohexane from chair to boat conformation alongside the response by its DF fingerprint. It also contains heat maps showing ideal KRR hyperparameters to be used with the MBDF representation and kernel PCAs of some molecules from the QM7b dataset.

D.K. is thankful for discussions with B. Huang, S. Krug, D. Lemm, M. Sahre, and J. Weinreich. O.A.v.L. has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 772834). O.A.v.L. has received support as the Ed Clark Chair of Advanced Materials and as a Canada CIFAR AI Chair.

The authors have no conflicts to disclose.

Danish Khan: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (lead); Validation (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (equal). Stefan Heinen: Conceptualization (supporting); Data curation (supporting); Resources (supporting); Supervision (supporting); Validation (supporting); Writing – original draft (supporting); Writing – review & editing (equal). O. Anatole von Lilienfeld: Conceptualization (supporting); Formal analysis (supporting); Funding acquisition (lead); Methodology (supporting); Project administration (lead); Resources (lead); Supervision (lead); Writing – original draft (supporting); Writing – review & editing (equal).

Python script for generating the MBDF and DF representations is available at https://github.com/dkhan42/MBDF. Train/test split of the QM7 dataset used for optimizing representation hyperparameters, other data and code for reproducing the results in this study, and some tools for performing KRR are openly available at https://github.com/dkhan42/QML.

1. M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Phys. Rev. Lett. 108, 058301 (2012).
2. O. A. von Lilienfeld, Angew. Chem., Int. Ed. 57, 4164 (2018).
3. F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. von Lilienfeld, J. Chem. Theory Comput. 13, 5255 (2017).
4. T. S. Ho and H. Rabitz, J. Chem. Phys. 104, 2584 (1996).
5. S. Manzhos and T. Carrington, Jr., J. Chem. Phys. 125, 084109 (2006).
6. J. Behler and M. Parrinello, Phys. Rev. Lett. 98, 146401 (2007).
7. A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, Phys. Rev. Lett. 104, 136403 (2010).
8. C. M. Handley and P. L. A. Popelier, J. Phys. Chem. A 114, 3371 (2010).
9. J. S. Smith, O. Isayev, and A. E. Roitberg, Chem. Sci. 8, 3192 (2017).
10. J. C. Snyder, M. Rupp, K. Hansen, K.-R. Müller, and K. Burke, Phys. Rev. Lett. 108, 253002 (2012).
11. Z. D. Pozun, K. Hansen, D. Sheppard, M. Rupp, K.-R. Müller, and G. Henkelman, J. Chem. Phys. 136, 174101 (2012).
12. M. Ceriotti, C. Clementi, and O. Anatole von Lilienfeld, Chem. Rev. 121, 9719 (2021).
13. M. Ceriotti, C. Clementi, and O. Anatole von Lilienfeld, J. Chem. Phys. 154, 160401 (2021).
14. O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp, and A. Knoll, Int. J. Quantum Chem. 115, 1084 (2015).
15. L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, Phys. Rev. Lett. 114, 105503 (2015).
16. F. Musil, A. Grisafi, A. P. Bartók, C. Ortner, G. Csányi, and M. Ceriotti, Chem. Rev. 121, 9759 (2021).
17. R. Ramakrishnan and O. A. von Lilienfeld, CHIMIA 69, 182 (2015).
18. V. N. Vapnik, Statistical Learning Theory (Wiley-Interscience, 1998).
19. C. R. Collins, G. J. Gordon, O. A. von Lilienfeld, and D. J. Yaron, J. Chem. Phys. 148, 241718 (2018).
20. J. Behler, J. Chem. Phys. 134, 074106 (2011).
21. L. Zhu, M. Amsler, T. Fuhrer, B. Schaefer, S. Faraji, S. Rostami, S. A. Ghasemi, A. Sadeghi, M. Grauzinyte, C. Wolverton, and S. Goedecker, J. Chem. Phys. 144, 034203 (2016).
22. H. Huo and M. Rupp, Mach. Learn.: Sci. Technol. 3, 045017 (2022).
23. M. Gastegger, L. Schwiedrzik, M. Bittermann, F. Berzsenyi, and P. Marquetand, J. Chem. Phys. 148, 241709 (2018).
25. B. J. Braams and J. M. Bowman, Int. Rev. Phys. Chem. 28, 577 (2009).
26. M. J. Hirn, N. Poilvert, and S. Mallat, arXiv:1502.02077 (2015).
27. A. V. Shapeev, Multiscale Model. Simul. 14, 1153 (2016).
28. J. Nigam, S. Pozdnyakov, and M. Ceriotti, J. Chem. Phys. 153, 121101 (2020).
29. V. Zaverkin and J. Kästner, J. Chem. Theory Comput. 16, 5410 (2020).
30. R. Jinnouchi, F. Karsai, C. Verdi, R. Asahi, and G. Kresse, J. Chem. Phys. 152, 234102 (2020).
31. A. P. Thompson, L. P. Swiler, C. R. Trott, S. M. Foiles, and G. J. Tucker, J. Comput. Phys. 285, 316 (2015).
32. M. J. Willatt, F. Musil, and M. Ceriotti, Phys. Chem. Chem. Phys. 20, 29661 (2018).
33. M. F. Langer, A. Goeßmann, and M. Rupp, npj Comput. Mater. 8, 41 (2022).
34. A. S. Christensen, L. A. Bratholm, F. A. Faber, and O. Anatole von Lilienfeld, J. Chem. Phys. 152, 044107 (2020).
35. F. A. Faber, A. S. Christensen, B. Huang, and O. A. von Lilienfeld, J. Chem. Phys. 148, 241717 (2018).
36. A. P. Bartók, R. Kondor, and G. Csányi, Phys. Rev. B 87, 184115 (2013).
37. B. Huang and O. A. von Lilienfeld, "Quantum machine learning using atom-in-molecule-based fragments selected on-the-fly," Nat. Chem. 12, 945 (2020).
38. K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, J. Phys. Chem. Lett. 6, 2326 (2015).
39. C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning (MIT Press, 2006), pp. I–XVIII, 1–248.
40. S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, Phys. Chem. Chem. Phys. 18, 13754 (2016).
41. A. P. Bartók and G. Csányi, Int. J. Quantum Chem. 115, 1051 (2015).
42. S. Chmiela, V. Vassilev-Galindo, O. T. Unke, A. Kabylda, H. E. Sauceda, A. Tkatchenko, and K.-R. Müller, Sci. Adv. 9, eadf0873 (2023).
43. F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Phys. Rev. Lett. 117, 135502 (2016).
45. J. E. Jones, Proc. R. Soc. London, Ser. A 106, 441–462 (1924).
46. B. M. Axilrod and E. Teller, J. Chem. Phys. 11, 299 (1943).
47. F. H. Stillinger and T. A. Weber, Phys. Rev. B 31, 5262 (1985).
48. R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, Sci. Data 1, 140022 (2014).
49. C. Isert, K. Atz, J. Jiménez-Luna, and G. Schneider, Sci. Data 9, 273 (2022).
50. S. N. Pozdnyakov, M. J. Willatt, A. P. Bartók, C. Ortner, G. Csányi, and M. Ceriotti, Phys. Rev. Lett. 125, 166001 (2020).
51. B. Huang and O. A. von Lilienfeld, J. Chem. Phys. 145, 161102 (2016).
52.
53. S.-i. Amari and H. Nagaoka, Methods of Information Geometry (American Mathematical Society, 2000), Vol. 191.
54. M. P. Deisenroth, A. A. Faisal, and C. S. Ong, Mathematics for Machine Learning (Cambridge University Press, 2020).
55. K. T. Schütt, S. Chmiela, O. A. von Lilienfeld, A. Tkatchenko, K. Tsuda, and K.-R. Müller, Machine Learning Meets Quantum Physics (Springer, 2020).
56. O. Çaylak, O. Anatole von Lilienfeld, and B. Baumeier, Mach. Learn.: Sci. Technol. 1, 03LT01 (2020).
57. R. Fabregat, P. van Gerwen, M. Haeberle, F. Eisenbrand, and C. Corminboeuf, Mach. Learn.: Sci. Technol. 3, 035015 (2022).
58. C. Cortes, L. D. Jackel, S. A. Solla, V. Vapnik, and J. S. Denker, in Advances in Neural Information Processing Systems (1994), pp. 327–334.
59. K.-R. Müller, M. Finke, N. Murata, K. Schulten, and S. Amari, Neural Comput. 8, 1085 (1996).
60. T. Head, M. Kumar, H. Nahrstaedt, G. Louppe, and I. Shcherbatyi, Scikit-optimize/scikit-optimize, 2021.
61. D. C. Liu and J. Nocedal, Math. Program. 45, 503 (1989).
62. C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant, Nature 585, 357 (2020).
63. S. K. Lam, A. Pitrou, and S. Seibert, in Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM'15 (Association for Computing Machinery, New York, 2015).
64. A. D. Becke, J. Chem. Phys. 98, 5648 (1993).
65. C. Lee, W. Yang, and R. G. Parr, Phys. Rev. B 37, 785 (1988).
66. G. A. Petersson, A. Bennett, T. G. Tensfeldt, M. A. Al-Laham, W. A. Shirley, and J. Mantzaris, J. Chem. Phys. 89, 2193 (1988).
67. J.-D. Chai and M. Head-Gordon, Phys. Chem. Chem. Phys. 10, 6615 (2008).
68. F. Weigend and R. Ahlrichs, Phys. Chem. Chem. Phys. 7, 3297 (2005).
69. K. Yao, J. E. Herr, D. W. Toth, R. Mckintyre, and J. Parkhill, Chem. Sci. 9, 2261 (2018).
70. A. S. Christensen, L. A. Bratholm, F. A. Faber, B. Huang, A. Tkatchenko, K. R. Müller, and O. A. von Lilienfeld, QML: A Python toolkit for quantum machine learning, 2017.
71. L. Himanen, M. O. J. Jäger, E. V. Morooka, F. Federici Canova, Y. S. Ranawat, D. Z. Gao, P. Rinke, and A. S. Foster, Comput. Phys. Commun. 247, 106949 (2020).
72. M. Rupp, Int. J. Quantum Chem. 115, 1058 (2015).
73. K. T. Schütt, P. Kessel, M. Gastegger, K. A. Nicoli, A. Tkatchenko, and K.-R. Müller, J. Chem. Theory Comput. 15, 448 (2018).
74. K. T. Schütt, O. T. Unke, and M. Gastegger, in Proceedings of the 38th International Conference on Machine Learning (PMLR, 2021), Vol. 139, pp. 9377–9388.
75. K. Schütt, P.-J. Kindermans, H. E. Sauceda Felix, S. Chmiela, A. Tkatchenko, and K.-R. Müller, in Advances in Neural Information Processing Systems, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Curran Associates, Inc., 2017), Vol. 30.
76. A. Musaelian, S. Batzner, A. Johansson, L. Sun, C. J. Owen, M. Kornbluth, and B. Kozinsky, Nat. Commun. 14, 579 (2023).
77. S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky, Nat. Commun. 13, 2453 (2022).
78. M. Geiger, T. Smidt, M. Alby, B. K. Miller, W. Boomsma, B. Dice, K. Lapchevskyi, M. Weiler, M. Tyszkiewicz, M. Uhrin, S. Batzner, D. Madisetti, J. Frellsen, N. Jung, S. Sanborn, Jkh, M. Wen, J. Rackers, M. Rød, and M. Bailey, e3nn/e3nn: 2022-12-12, 2022.
79. M. Geiger, T. Smidt, M. Alby, B. K. Miller, W. Boomsma, B. Dice, K. Lapchevskyi, M. Weiler, M. Tyszkiewicz, S. Batzner, D. Madisetti, M. Uhrin, J. Frellsen, N. Jung, S. Sanborn, M. Wen, J. Rackers, M. Rød, and M. Bailey, Euclidean neural networks: e3nn, 2022.
80. M. Geiger and T. Smidt, e3nn: Euclidean neural networks, 2022.
81. J. Townsend, C. P. Micucci, J. H. Hymel, V. Maroulas, and K. D. Vogiatzis, Nat. Commun. 11, 3230 (2020).
82. F. Bigi, S. N. Pozdnyakov, and M. Ceriotti, "Wigner kernels: Body-ordered equivariant machine learning without a basis," arXiv:2303.04124 [physics.chem-ph] (2023).
83. N. J. Browning, F. A. Faber, and O. Anatole von Lilienfeld, J. Chem. Phys. 157, 214801 (2022).
84. K. Atz, C. Isert, M. N. A. Böcker, J. Jiménez-Luna, and G. Schneider, Phys. Chem. Chem. Phys. 24, 10775 (2022).
85. C. Bannwarth, S. Ehlert, and S. Grimme, J. Chem. Theory Comput. 15, 1652 (2019).
86. G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, and O. Anatole von Lilienfeld, New J. Phys. 15, 095003 (2013).
87. L. C. Blum and J.-L. Reymond, J. Am. Chem. Soc. 131, 8732 (2009).
88. A. S. Christensen and O. A. von Lilienfeld, Mach. Learn.: Sci. Technol. 1, 045018 (2020).
89. A. S. Christensen and O. A. von Lilienfeld, "Revised MD17 dataset (rMD17)," figshare (2020), https://doi.org/10.6084/m9.figshare.12672038.v3.
90. S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, and K.-R. Müller, Sci. Adv. 3, 1 (2017).
91. S. Chmiela, H. E. Sauceda, K.-R. Müller, and A. Tkatchenko, Nat. Commun. 9, 3887 (2018).
92. A. S. Christensen, F. A. Faber, and O. A. von Lilienfeld, J. Chem. Phys. 150, 064105 (2019).

Supplementary Material