We introduce a representation of any atom in any chemical environment for the automatized generation of universal kernel ridge regression-based quantum machine learning (QML) models of electronic properties, trained throughout chemical compound space. The representation is based on Gaussian distribution functions, scaled by power laws and explicitly accounting for structural as well as elemental degrees of freedom. The elemental components help us to lower the QML model’s learning curve, and, through interpolation across the periodic table, even enable “alchemical extrapolation” to covalent bonding between elements not part of training. This point is demonstrated for the prediction of covalent binding in single, double, and triple bonds among main-group elements as well as for atomization energies in organic molecules. We present numerical evidence that resulting QML energy models, after training on a few thousand random training instances, reach chemical accuracy for out-of-sample compounds. Compound datasets studied include thousands of structurally and compositionally diverse organic molecules, non-covalently bonded protein side-chains, $(H2O)40$-clusters, and crystalline solids. Learning curves for QML models also indicate competitive predictive power for various other electronic ground state properties of organic molecules, calculated with hybrid density functional theory, including polarizability, heat-capacity, HOMO-LUMO eigenvalues and gap, zero point vibrational energy, dipole moment, and highest vibrational fundamental frequency.

## I. INTRODUCTION

Ground-state properties of chemical compounds can generally be estimated with acceptable accuracy using methods such as *ab initio* quantum chemistry or density functional theory (DFT).^{1} However, these can be computationally expensive and therefore have a limited applicability, especially for larger systems. Alternatively, inductive machine learning of quantum mechanical properties, i.e., quantum machine learning (QML) models, can infer properties directly or even predict the electron density which in turn can be used to calculate properties,^{2} by training on large datasets of reference property/compound pairs. QML models can have an exceptional trade-off between predictive accuracy and computational cost. For example, in 2017 we showed that QML models can estimate hybrid DFT atomization energies as well as several other properties of medium-sized organic molecules with prediction errors lower than chemical accuracy (∼0.04 eV)—multiple orders of magnitude faster than hybrid DFT.^{3}

The system variable defining the ground-state properties of a given compound is its external potential, a simple function of interatomic distances and nuclear charges. However, using this information directly to measure similarity results in QML models with rather disappointing predictive power. This can be mitigated by the transformation of system variables into “representations.” Such transformations can either be designed by human intuition or be included in the learning problem, e.g., when using neural networks (NNs) which include representation learning in the supervised learning task. Letting a NN find the representation has proven to yield models with low out-of-sample prediction errors.^{4–6} This approach, however, has the drawback that the representation and model are intermingled within the NN, making it less amenable to human understanding, interpretation, adaptation, and further improvement. Furthermore, such machine designed representations do not necessarily lead to better QML performance than representations designed by humans (*vide infra*).

There are many ways of manually encoding the 3D structure and chemical composition of a compound into a suitable representation. For example, we can represent a compound as a list of interatomic potentials.^{7–9} Another approach consists of creating a fingerprint of the compound, transforming internal coordinates into a fixed set of numbers. For example, this can be done by projecting the coordinates on to a set of basis functions^{10} or by using the topology of the structure.^{11} Distributions of internal coordinates represent another systematic approach, shown to yield well performing QML models applicable throughout chemical space.^{12,13} Additional use of bags containing angular and dihedral distributions has led to further improvements in resulting QML models.^{3,9,14,15} Bagging based on atom types, however, severely hinders resulting QML models from transferring what has been learned from one atom type to another—a desirable feature for chemically diverse systems.

In this work, we introduce a new atomic environment representation, with two key differences to previous distribution-based work. (i) The representation is not binned by atomic species. Instead, compositional information is encoded directly into the distributions. This allows measuring not only structural differences but also “alchemical” differences between elements in the atomic environments. The idea of computational alchemy, amounting to continuous interpolation of Hamiltonians of two different systems, is well established in quantum chemistry and statistical mechanics and can be exploited for virtual exploration campaigns in chemical space with increased efficiency.^{16} Recently, it has been shown that alchemical estimates of covalent bond potentials can even surpass generalized gradient approximated DFT accuracy.^{17} The foundation of a continuous chemical space has been reviewed previously.^{12} Alchemical distance measures in the context of QML were already exploited previously when using the Coulomb matrix (CM),^{7} the Fourier series distribution based representations,^{18} the Faber, Lindmaa, Lilienfeld, Armiento (FLLA) crystal representation,^{19} and within smooth overlap of atomic potential (SOAP) representations.^{20} For this work, we have identified a new functional form with improved performance due to alchemical contributions to the distance measure. (ii) We use a set of multidimensional distributions of interatomic many-body expansions rather than several 1D bins of internal coordinates. The distributions are built recursively so that an *m*-body distribution contains the same information as the (*m* − 1)-body distribution plus additional *m*-body information. This particular combination combines similarity to the potential energy target function and compliance with many known (translational, rotational, permutational) invariances.

## II. THEORY

In this section, we first motivate the ideas that have led to this study. Thereafter, we discuss the functional form and the variational degrees of freedom, which we have introduced, as well as the resulting compound distances. Then, an analysis of the functional form is performed using the molecule water as an example. Finally, numerical results for parameters’ optimization runs are discussed.

### A. Kernel ridge regression

In order to profit from robustness, ease of error convergence, computational efficiency, and simplicity, we base our studies preferably on kernel ridge regression (KRR) models.^{21–24} However, we consider this rather a question of taste and believe that other regressors, such as neural networks, will produce similar results if properly converged.

KRR estimates property *p* of query compound **C** as a weighted sum of kernel basis functions placed on each of *N* training compounds {**C**_{k}},

where the solution for the weights {*α*_{k}} is obtained through linear regression with regularizer *λ* (typically negligibly small because of the absence of noise in training data obtained from quantum chemistry calculations).

Throughout this work we rely on atomistic (scalable) Gaussian kernels, $K(C,C\u2032)=\u2211I\u2208C\u2211J\u2208C\u2032k(\Delta (AM(I),AM(J)))$, as already used in Refs. 14 and 25–27. As such, KRR renders the selection of a functional form which represents an atom in its chemical environment mandatory. Obviously, this choice is fundamentally related to our understanding of chemistry and is known to dramatically affect the performance of resulting QML models, see, e.g., Refs. 3 and 9. It is for this reason that we draw our inspiration from the fundamental laws of quantum mechanics which specify the definition of system (Hamiltonian) and property (observable) and which spell out the numerical recipe which links the two.^{1}

The genesis of this study is due to the fact that the total potential energy, the expectation value of a compound’s electronic Hamiltonian, constitutes the central figure of merit for convergence towards the wavefunction by virtue of the variational principle. When considering Eq. (2), it should be obvious that the kernel (and thereby representation) is independent of a specific property. Units and property dependence are introduced through the regression weights only. This has also already been demonstrated numerically for multiple properties using the same kernel.^{28} As such, the role of the kernel is reminiscent of the wavefunction which can be used to predict arbitrarily many observables by evaluating the expectation values of the corresponding operators, always using the same wavefunction: Once the kernel is inverted, arbitrarily many sets of regression coefficients can easily be generated provided that their corresponding property reference values are known. The Hamiltonian’s expectation value, i.e., the potential energy, therefore governs the shape of the wavefunction. We therefore assume that a representation, optimized for energy predictions only, is fundamentally more advantageous than representations obtained by minimizing prediction errors of some integrated observable. Consequently, the focus employed in this study has been to identify a representation which is inspired by the energy changes occurring due to changes in chemical composition and covalent and non-covalent bonding. The accuracy of quantum mechanics when predicting other properties (observables) as expectation values of operators depends crucially on the quality of the wavefunction. Here, we follow a similar argument: The better the representation the better the energy prediction, implying that energy prediction errors should be minimized in the functional space of the representation.

### B. Representation

We use a set of interatomic *M*-body expansions $AM(I)={A1(I),A2(I),A3(I),\u2026,AM(I)}$ which contain up to *M*-body interactions to represent the structural and chemical environment of an atom *I* in compound **C**. *A*_{m}(*I*) is a weighted sum that runs over all *m*-body interactions. Each element in the sums consists of Gaussian basis functions, placed on structural and elemental degrees of freedom and multiplied by a scaling function *ξ*_{m}. Structural values encode geometrical information about the system, such as interatomic distances or angles. As elemental parameters, we use the period *P* and group *G* from the periodic table. The scaling functions *ξ*_{m} are used to weigh the importance of each Gaussian, based on internal system coordinates. We now consider only the first three distributions in $AM(I)$ for an atom *I*. We have also derived, implemented, and tested the 4-body *A*_{4}(*I*) distributions. However, the predictive accuracy improvements of resulting QML models were found to be negligible in comparison to the 3-body expansion. Also, the computational cost for generating large kernel matrices increases substantially when going from third to fourth order terms. We refer to the supplementary material for the derivation.

The first-order expansion *A*_{1}(*I*) accounts for chemical composition (stoichiometry) and is modeled by a Gaussian function placed on period *P*_{I} and group *G*_{I} in the periodic table of element *I*,

where $xI(1)={PI,\sigma P;GI,\sigma G}$, with respective widths *σ*_{P} and *σ*_{G}. *σ*_{P} and *σ*_{G} can be seen as elemental smearing parameters, which control the near-sightedness of elements in the periodic table. *χ*_{1} and *χ*_{2} represent dummy variables for the period and group, to be integrated out when evaluating the Euclidean distance [see Eq. (4)]. For *A*_{1}(*I*), the scaling function is set to unity since stoichiometry is geometry independent. We are not aware of other representations in the literature which employ similar distribution functions in the periodic table.

*A*_{2}(*I*) is a product of *A*_{1}(*I*) and a sum that runs over all neighboring atoms *i*: $A2(I)=N(xI(1))\u2211i\u2260IN(xiI(2))\xi 2(diI)$, $xiI(2)=diI,\sigma d;Pi,\sigma P;Gi,\sigma G$, where *d*_{iI} and *σ*_{d} correspond to the interatomic distance at which a Gaussian is placed and its width, respectively. *ξ*_{2} corresponds to the 2-body, interatomic distance dependent, scaling function which takes the form of the power laws discussed below. Note that letting *σ*_{P} and *σ*_{G} approach zero is equivalent to using a radial distribution function (RDF) for each element pair. This attribute of the representation holds for any of *A*_{m}(*I*), i.e., *σ*_{P}, *σ*_{G} → 0 is equivalent to creating a separate distribution for each chemical element *m*-tuple in *A*_{m}(*I*). *A*_{3}(*I*) is the logical extension from *A*_{2}(*I*); it has a different scaling function with an additional summation, running over all neighboring atoms *j*: $A3(I)=N(x(1))\u2211i\u2260IN(xiI(2))\u2211j\u2260i,IN(xijI(3))\xi 3diI,djI,\theta ijI$, $xijI(3)=\theta ijI,\sigma \theta ;Pj,\sigma P;Gj,\sigma G$. *P*_{j} and *G*_{j}, similar to *P*_{i} and *G*_{i}, correspond to the period and group of atom *j*. Again, $\xi 3diI,djI,\theta ijI$ is the (three-body) scaling function, and $\theta ijI$ is the principal angle between the two distance vectors $r\u2192Ii$ and $r\u2192Ij$ which span from *I* to *i* and *I* to *j*, respectively. *σ*_{θ} is the width of the Gaussian placed at $\theta ijI$. Letting *σ*_{d} go to infinity in *A*_{3} is equivalent to using a type of angular distribution function (ADF), which in one form or another has already been used in several representations.^{3,14,15} *A*_{3} can therefore be seen as a generalized ADF containing more structural information. Figure 1 illustrates how *A*_{3}(*I*) looks for hydrogen, carbon, and oxygen atoms in ethanol.

The scaling functions *ξ* we have chosen for this work correspond to simple power laws. They have been modified from the leading order two- and three-body dispersion laws by London, 1/*r*^{6}, and Axilrod-Teller-Muto,^{29,30} 1/*r*^{9}. Such dispersion expressions were previously already used by some of us.^{14} Our scaling functions, however, use different exponents for the radial decay and set the *C*_{6} and *C*_{9} coefficients to unity, as early tests indicated better performance for this choice. For periodic systems, however, a very large cutoff radius would be needed in order to converge the distances between two atomic environments, when using the optimized exponents. We have therefore augmented the scaling functions by a previously used soft cutoff function,^{31} which goes to zero at 9 Å.

### C. Distances and scalar products

In order to train and evaluate the KRR model in Eq. (1), proper distance measures must be specified. We have found good performance when using a weighted sum of the distances between each *m*-body expansion $\Delta (AM(I),AM(J))2\u2261\u2211m=0M\beta m\Delta (Am(I),Am(J))2$ as a distance between two atomic environments $AM(I)$ and $AM(J)$. Here, *β*_{m} is another hyperparameter which weighs the importance of each expansion order.

The distances between each distribution term are evaluated as Euclidean (*L*_{2}) norms, as shown in Eq. (4). *ς*_{m} are normalization constants, which ensure that all individual basis functions integrate to 1 in the *L*_{2}-norm. All integrals can be solved analytically since they consist of a sum of Gaussian products. The explicit form of the *A*_{m} integrals for *m* = 1, …, 3 is shown in Eqs. (5)–(7). Details on how to evaluate the *A*_{3} and *A*_{4} integrals in Fourier space can be found in the supplementary material,

Note that third and fourth order terms become prohibitively expensive to calculate directly. However, this can be circumvented to a large extent by slightly modifying the distributions and solving the angular integrals in Fourier space. Further details about the corresponding equations and derivations can also be found in the supplementary material.

### D. Comparison to other distribution based representations

Probably the largest difference in how $A$ represents nuclear configurations, when compared to many of the previously published distribution based representations, lies in the 3-body term (since *A*_{2} is a radial distribution function if *σ*_{P} and *σ*_{G} go to zero). In this subsection, we highlight the differences between *A*_{3}, or conventional ADF or RDF for representing the structure of the water molecule.

As ADF, we use *A*_{3} with the limit *σ*_{d} → *∞* and we model RDF by *A*_{2}. Furthermore, no scaling function (*ξ*_{2} = *ξ*_{3} = 1) is used and we let *σ*_{P} and *σ*_{G} go to zero since we only examine how representations distinguish structural differences among different geometries of the water molecule. This results in *A*_{3} and ADF being $\u2211i\u2260IN(diI,\sigma d)\u2211j\u2260i,IN(\theta ijI,\sigma \theta )$ and $\u2211i\u2260I\u2211j\u2260i,IN(\theta ijI,\sigma \theta )$, respectively, for each element triple and RDF being $\u2211i\u2260IN(diI,\sigma d)$ for each element pair.

Figure 2 shows how the distance measure between two water molecules changes as one distorts the geometry of one of them away from its equilibrium structure. Both, RDF and ADF result for oxygen as well as for H in large configurational domains with substantially zero distance to the minimum, implying a severe lack of sensitivity. *A*_{3} by contrast produces a qualitatively meaningful picture with a single well-defined well around the minimum.

We have also studied the performance for modeling the energy of the water molecule. In Fig. 3, the training error for atomization energies is shown for a linear kernel KRR model with *A*_{3}, ADF, RDF, or RDF + ADF as representations. The linear kernel is used as a difficult test in how far representations can model a nonlinear property, such as the energy, in terms of linear basis functions. The errors are significantly lower when using *A*_{3} instead of the other representations, including RDF + ADF. Generally, potential energy surfaces of a three-atom system cannot be decomposed into a sum of functions of only one internal coordinate (internuclear distance **d** or angle *θ*). That is, *E*(**d**, *θ*) ≠ *E*(**d**) + *E*(*θ*). Using an ADF, an RDF, or a linear combination of the two however would result precisely in such a model, as well as most force fields. This also explains the relatively large errors for these representations as well as the unreliable performance of pair-wise potentials when it comes to distorted molecules. *A*_{3} on the other hand does not decouple distances and angles and can, by construction, model any three-body potential.

These observations give insight as to why our new representation performs better than the other distribution based representations: Using ADF’s and RDF’s as representations one might be able to capture slices of the many-body picture, the fact that there is a linear mapping between *A*_{n} and a *n*-body potential energy surface, however, appears to make it easier to improve the performance also for non-linear kernels.

### E. Optimization

#### 1. Hyperparameters

The use of our representation in combination with KRR yields multiple hyperparameters. While one could, in principle, attempt to optimize all of them, using several datasets, and efficient optimizers, such as gradient, Monte Carlo, genetic, or simplex methods, we have found that the problem is sensitive only to a small subset of parameters. As such, the exact choice of many hyperparameters is not critical for the out-of-sample errors, and resulting models perform typically well as long as values are used which have a similar order of magnitude. Unless otherwise specified, the following hyperparameter values have been used: *σ*_{P} = *σ*_{G} = 1.6, *σ*_{d} = 0.2 Å, *σ*_{θ} = *π* black, *β*_{1} = 2, *β*_{2} = $8$, *β*_{3} = 1.6. For the water cluster and the protein-side chain-side chain interaction (SSI) dataset there is no to little variation in chemical composition, and no elemental smearing has been used.

#### 2. Scaling power law parameters

We have screened radial exponents *n*2 and *n*3 for the scaling functions $\xi 2(diI)=1diIn2$ and $\xi 3(diI,djI,\theta ijI)=1\u22123\u2061cos(\theta ijI)cos(\theta Iji)cos(\theta iIj)(diIdjIdij)n3$, using atomization energies for a subset of the QM9 dataset in order to identify the optimal exponents. Corresponding learning curves are shown in Fig. 4. First, we have screened *ξ*_{2}, using $A2$ as a representation, yielding the lowest offset for *n*2 = 4. Keeping this exponent for *ξ*_{2} fixed, we then proceeded to screen the exponent *ξ*_{3} in $A3$. We found that *n*3 = 2 corresponded to the best exponents for *ξ*_{3}. We have used these values throughout this work, and unless something else is specified, the optimal scaling functions read

#### 3. Alchemical smearing

Parameters associated with the elemental smearing have a strong effect on the predictive power of the QML models. We have screened the corresponding values of *σ*_{P} and *σ*_{G} using energy prediction errors for the open quantum materials database (OQMD) and QM9 dataset for different training set sizes. These two datasets have been used due to their (relatively) high (OQMD) and low (QM9) chemical diversity in terms of number of differing elements in the dataset. The optimal alchemical Gaussian widths vary only slightly across the two sets, as shown in Fig. 5. A circular Gaussian with width *σ*_{P} = *σ*_{G} = ∼1.6, which amounts to ∼90% overlap between neighboring elements, corresponds to a relatively deep well with minimal mean absolute error (MAE) for the OQMD dataset, no matter the training set size. The fact that the optimal width stays constant with respect to training set size is beneficial: the elemental smearing can be optimized using relatively small training sets and can then be applied to larger training sets. Comparing the MAE from a model with *σ*_{P} = *σ*_{G} = 0.1 (which in practice is equivalent to zero overlap between different atomic types), using the optimal *σ*_{P} = *σ*_{G} lowers the MAE by ∼9.9% for the OQMD dataset at 100 training samples, which increases up to ∼34% when 1k training samples are used. Prediction errors for the QM9 dataset indicate similar behavior, yet much less pronounced. For the largest training set (1000 molecules), the optimization well becomes very shallow, consistent with the lack of compositional diversity in QM9.

Unsurprisingly, datasets with higher chemical diversity benefit more from using the optimized elemental widths. It may therefore not always be beneficial to include any elemental overlap, especially for datasets with low elemental diversity, as it is computationally more expensive to do so.

## III. DATASETS

We have used multiple datasets to benchmark out-of-sample accuracy of energy predictions of our model. These datasets include organic molecules, crystals, biomolecular dimers, water clusters, and main-group diatomics. Some of the datasets are high-quality, have already been published, and are in widespread use. Additional low quality datasets have been generated, merely in order to accumulate additional evidence for the relative improvement of the new representation. Since test set predictions are always close to zero by construction, we exclusively report prediction errors as out-of-sample errors (averaged through cross-validation) with respect to reference validation numbers. All errors reported correspond to at least 10 cross-validation runs for each training set size.

The reader should note that we only report errors of ML models trained on individual datasets. Simultaneous training on several datasets would introduce significant noise (due to the datasets originating from different levels of theory) and thereby hamper an unequivocal comparative analysis of the results. For future applications and models generated within multi-fidelity ML frameworks,^{32} pooling the various datasets to train a single ML model might be more desirable.

### A. Organic molecules: QM9

The QM9 dataset^{33} corresponds to hybrid DFT^{34} based structures and properties of 134k organic molecules with up to nine atoms (C, O, N, or F), not counting hydrogen. SMILES strings of these molecules correspond to a subset of the GDB-17 dataset.^{35} The 3k organic molecules, which fail SMILES consistency tests,^{33} were removed before use, i.e., structures where the SMILES strings of the relaxed structure differ from the original GDB-17 SMILES strings.

A random subset of 22k molecules was selected from QM9 for training and testing. 2k molecules were used for testing and up to 20k molecules were used for training.

The datasets were sampled differently from QM9 in Sec. IV B when we investigated how excluding elements from the training set affected the out-of-sample predictions. In total, two tests sets were used, each associated with two training sets. The first test set consisted of molecules containing nitrogen, and the second test set consisted of molecules containing oxygen. The two training sets associated with the first test set consisted of molecules containing nitrogen or not, respectively. The two training sets associated with the second test set consisted of molecules containing oxygen or not, respectively.

### B. Organic molecules: QM7b

Due to widespread use, we also included the more established QM7b dataset.^{36} QM7b was also derived from GDB.^{37} It contains hybrid DFT (PBE0^{38,39}) structures and properties of ∼7k organic molecules with up to seven atoms (C, O, N, S, or Cl), not counting H. We have drawn at random up to 5k molecules for training and 2k molecules for testing.

### C. Biomolecular dimers: SSI

For intra-molecular and non-equilibrium interactions, we used a subset of 2356 neutral dimers from the recently published SSI dataset.^{40} The SSI dataset is a collection of dimers mimicking configurations of interacting amino-acid side chains as observed in a set of 47 high-resolution protein crystal structures. The energies correspond to the DW-CCSD(T**)-F12 level of theory.^{41}

### D. Water cluster

We also include a dataset which we calculated for 4000 structures containing 40 water molecules, drawn from an NVE-molecular dynamics (MD) trajectory of a water droplet, simulated at 300 K, treated with the TIP3P potential^{42} as implemented in CHARMM C41a1.^{43} For each structure, a single-point energy was calculated at the DFT level using the PBEh-3c method.^{44} Additional details, as well as the full dataset, can be found in the supplementary material.

### E. Solids: OQMD

We have used the Inorganic Crystal Structure Database^{45,46} subset corresponding to the OQMD by Wolverton and co-workers.^{47,48} This dataset has already been used to develop and benchmark random forest based QML model (Voronoi).^{49} The dataset consists of ∼30k crystal structures and formation energies, calculated using high-throughput DFT with generalized gradient approximation (GGA + U). We have used a random subset consisting of 3k structures with less than 40 atoms in the unit cell and formation energies lower than 5 eV/atom for training and testing. 1k crystals were used for testing and up to 2k crystals were used for training.

### F. Solids: Elpasolites

We have also tested our representation for the elpasolite crystal structure dataset.^{19} This dataset consists of ∼10k elpasolite structures and DFT (PBE^{50}) formation energies. The crystals correspond to quaternary main group elemental composition with all elements up to bismuth (39 in total). We have used a random subset of 7k structures, with up to 6k and 1k structures for training and testing, respectively.

### G. Main group diatomics

To test the predictive power for alchemical interpolation, we have also included a set of previously published DFT (PBE^{50}) results for single, double, and triple bonds among main-group diatomics saturated with hydrogens.^{17} The training/test splits are generated differently, as explained in Sec. IV B.

## IV. RESULTS AND DISCUSSION

Using learning curves (generally resulting in straight lines when recorded on log-log plots due to their inverse power law relationship^{51}), we first present numerical results which indicate the predictive power of our QML model for atomization and formation energies in various datasets. When available for the same dataset, we also compare to other QML models in the literature. Thereafter, the alchemical extrapolation capacity is demonstrated for predicting covalent bonds in molecules with elements that were not part of training. Finally, log-log plots of learning curves for nine electronic ground-state properties of organic molecules (QM9) are reported and discussed.

### A. Energies of molecules, clusters, and solids

Figure 6 displays the performance overview for energy predictions on six different datasets (QM9, QM7b, SSI, water, elapsolites, OQMD). Mean absolute out-of-sample energy prediction errors are shown as a function of training set size. The results indicate remarkable performance for all datasets, indicating a well-working QML model yielding systematic improvement with increasing training set size. The learning curves also indicate out-of-sample MAEs which are consistently lower, or similar, than previously published models in the literature. For QM9, the MAE reaches the highly coveted chemical accuracy threshold (1 kcal/mol or ∼0.043 eV for enthalpy of formation) with only 2k training points on the QM9 dataset. Previously published QML models had to include an order of magnitude more training molecules to reach such accuracy. This is similar to the amount of training molecules necessary when using the Coulomb matrix representation in conjunction with semi-empirical or DFT based baselines in order to estimate electron correlated energies, as demonstrated in 2015 with the Δ-ML model.^{52}

For QM9, atomic Spectral London Axilrod-Teller-Muto (aSLATM)^{14} and SOAP multi-kernel model^{20,53} reach a performance nearly as good as our QML model. aSLATM, however, performs worse for the SSI and the water cluster. The SOAP multi-kernel QML model, however, performs an expansion in kernel function space acting on the distance for which all degrees of freedom have already been integrated out. As such it is, strictly speaking, not the same as an improved representation, but rather an improved regressor. Note that single kernel based SOAP QML models perform significantly worse. The reader should take notice however that in the SOAP learning curve results presented in Fig. 6, the ∼3k structures which had failed the SMILES consistency test, were included. As such, these QML models are not exactly comparable, and the SOAP results are still likely to slightly improve if these faulty structures were to be removed. One should also note that the SOAP results shown for QM7b correspond to the multi-kernel SOAP kernel.^{20,54}

Other models presented correspond to the Coulomb matrix (CM),^{7} bags of bonds (BOB),^{8} Bonds and Angles based Machine Learning (BAML),^{9} Histogram of Distances, Angles, and Dihedrals (HDAD),^{3} Spectral London Axilrod-Teller-Muto (SLATM), aSLATM,^{14} crystal representation by Faber, Lindmaa, Lilienfeld, Armiento (FLLA),^{19} sine matrix,^{55} and many-body tensor representation (MBTR).^{15} We also compared to QML models which are not based on KRR, such as the message passing neural network model (enn-s2s),^{5} and a Voronoi-tessellation based random forest model (Voronoi).^{49}

The MAE of our new QML model is consistently the lowest for all datasets and large training sets. For the set of 4000 non-equilibrium water clusters, there is a noticeable difference between the global (CM, BOB, and SLATM) and the atomic representations (i.e., aSLATM and the new model we introduce in this work): The global models exhibit very little learning at first, only for larger *N* the learning curves begin to turn downward. The atomic models, however, our new representation based QML model as well as aSLATM, improve rapidly with increasing training dataset size. We believe that sorting and crowding in the global representations make it difficult to accurately account for the purely geometrical changes in structures that contribute to total energy variations.

Impressive predictive power is also observed for the OQMD dataset, a structurally and compositionally very diverse set of solids. Our new model has a lower out-of-sample MAE for all *N* when compared to the sine matrix representation on the OQMD dataset. The offset of the learning curve of our new model is larger than that of the Voronoi-based random-forest model.^{49} However, the learning rate of our QML model is significantly steeper, surpassing the Voronoi model already at just ∼250 training samples. Results for a solid state variant of the CM, designed for use in periodic systems, have also been included (SineMatrix).^{55} It has a similar slope as the Voronoi model, but a substantially larger offset.

For the elpasolite dataset,^{19} with large composition diversity but identical crystal structures, the learning-curve of the FLLA representation has a slightly higher offset than our new QML model, yet exhibits a steeper learning curve, though our model converges toward the same slope for larger training set sizes. We can only speculate on the reasons for such behavior. The FLLA representation differs qualitatively from the other representations in this study: It does not include any explicit information about coordinates and only encodes periodic row and column of the elements which populate each crystal structure site. The QML model then learns to infer ground state energies without knowing the exact configuration. This leads to a very low dimensional model that is still unique for the system, which might be the cause of the steep slope. This however needs to be investigated more carefully before any conclusions can be drawn.

### B. Alchemical predictions

Our new scaled many-body expansion explicitly accounts not only for distributions of interatomic distances and angles but also for elemental distributions in the periodic table. We have studied its capability to predict covalent binding of molecules containing chemical elements which were not present in the molecules used for training. More specifically, we have investigated single, double, and triple bonds with one bonding atom coming from group (IV), i.e., C, Si, or Ge. In order to increase covalent bond order, we have varied the valency of their bonding partner as follows: For single bonds, group IV atoms are bound to halogens (group VII). For double bonds, group IV atoms are bound to chalcogen atoms (group VI), and for triple bonds, group IV atoms are bound to group V atoms. Dangling valencies of group IV atoms have been saturated with hydrogen. Similar covalent bonding potentials have also recently been used in order to assess the predictive power of first- and second-order perturbation theory based alchemical predictions.^{17}

In order to test the alchemical “extrapolation,” we trained on the covalent bonds of all other compounds (16 curves) which did contain neither the group IV atom nor the corresponding bonding partner in question. The predictive power for the out-of-sample molecule, as displayed in Fig. 7, is impressive. Albeit not quantitative (chemical accuracy is not reached), the results are semi-quantitative and certainly provide a physically very adequate picture of the covalent bonding in single, double, and triple bonds for main-group atoms in periods 2–4. The fact that predictions for the central elements H_{2}SiS are more accurate (easier to interpolate) than others is consistent with this interpretation. We also note that the deviation is the worst for 2nd-row elements (due to lack of *d*-orbitals they differ substantially more from 3rd and 4th rows than 3rd and 4th rows differ from each other). Because of poor performance, we do not compare to other representations in this test.

We have also investigated thousands of organic molecules in order to obtain improved quantitative statistics on the question of how out-of-sample prediction errors of different representations are affected when elements in the test set are excluded from training. We tested our new model, with and without elemental smearing (*σ*_{P}, *σ*_{G} = 1.6 and *σ*_{P}, *σ*_{G} → 0, respectively), and we also included BOB and CM representations for comparison. Details about how the training and test sets were selected can be found in Sec. III A.

Corresponding prediction errors [MAE, root mean squared error (RMSE), and maximal error] for test (training) sets with (without) nitrogen or with (without) oxygen are shown in Fig. 8. Obviously, the models perform best when both, training *and* test molecules, contain the same elements. However, even for models trained on sets without nitrogen/oxygen and without elemental smearing, considerable predictive power and, maybe more importantly, systematic improvement with training set size is observed for our new model. Use of elemental smearing results in a slight improvement.

The loss in accuracy due to the absence of elements in training is substantially worse for BOB and CM. Notably, BOB, in general considered to be more accurate than CM,^{3,8} experiences a more dramatic loss than CM resulting in BOB being worse than CM. This can be understood if one considers the fact that BOB bags nuclear Coulomb repulsion terms by element pairs, whereas the CM matrix directly compares the Coulomb interactions without (explicit) regard for which elements-pairs are being compared, effectively already performing an “alchemical” comparison. This allows the CM-based models to meaningfully interpolate even toward elements, not part of training. Our model, however, clearly outperforms CM and BOB: For example, its MAE reaches chemical accuracy (∼0.03 eV) for the nitrogen lacking training set at ∼6.4k molecules (rather than at ∼1.6k training molecules containing N). Note that CM and BOB based KRR models are far from reaching such accuracy even after being trained on molecules *containing* the element in question!

The relatively high accuracy of our model achieved without any of the elemental smearing, however, is surprising. We would have expected that the lack of the appropriate elements in training introduces prediction errors which can no longer be decreased through the addition of more molecules. However, the learning curves in Fig. 8 do not indicate any worsening of the learning rate. While possible that the expected deterioration could still be observed for larger training sets, we do find it surprising that such high accuracy can be reached without any elemental smearing at all.

These results clearly demonstrate that alchemical extrapolation is possible when interpolating elemental groups and periods in the periodic table through an appropriate representation. Since the representation is continuous in the corresponding compositional space, we also believe that indication is given that the calculation of alchemical derivatives is meaningful, similar in spirit to Ref. 57.

### C. Other ground state properties of molecules

Finally, we also investigated how well QML models, based on our new representation and optimized for energies, perform when predicting other ground-state quantum properties, part of the QM9 dataset. More specifically, we have included atomization energies, highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) eigenvalues as well as the HOMO-LUMO gap, dipole moment, polarizability, zero point vibrational energy (ZPVE), heat capacity, and the vibrational frequency of the highest lying fundamental (*ω*). Results are shown in Fig. 9 and provide overwhelming evidence that resulting models enable predictions systematically improving with training set size, no matter what property. For comparison, we have also included results for the aSLATM model. aSLATM results are typically worse when dealing with extensive properties, such as energies, polarizability, or heat-capacity. When dealing with intensive properties, such as eigenvalues or dipole moment, aSLATM is on par or even slightly better than our model, with the exception of *ω*. *ω* corresponds to the vibrational stretch frequency of CH, NH, or OH bonds, a property with hardly any variance at all. Previously we have seen that this property is best predicted by a random forest model which has relatively poor performance for all other properties.^{3} The interpretation is that predicting this property is much more a classification problem than a supervised learning task.

Figure 9 also includes learning curves for the RMSE, indicating a slightly higher offset (to be expected) and systematic improvement with training set size with similar slopes as the mean absolute error. This is an assuring result, indicating that also predictions for outliers improve as training set size is increased, as already discussed in Refs. 9 and 52.

Furthermore, for Fig. 9 we have also distinguished between two- and three-body contributions (as well as four-body for BAML). For all properties but for *ω*, the trend meets the expectation, as also already confirmed previously for BAML:^{9} Addition of the higher order term systematically lowers the learning curves by a significant amount.

## V. CONCLUSION

We have introduced a universal representation of an atom in a chemical compound for use in QML models. An atom is represented by a sum of multidimensional Gaussians, each term corresponding to elemental, atom-pairwise, and angular distributions and scaled by respective power laws. For the compounds and properties studied, we have found four-body contributions to be insignificant. Most system-independent hyperparameters, such as exponents in scaling functions and Gaussian widths, were found to be not critical to the preference of resulting ML-models, as long as “reasonable” heuristics^{28} was used. This could, however, be explored more systematically within future work. Analytical expressions have been derived for corresponding distances between arbitrary chemical compounds. These distances can directly be used within KRR based QML models of electronic ground state properties. For energies of organic molecules, water clusters, amino-acid side chains, and crystalline solids, the resulting QML models lead to learning curves with a very low offset and steep learning rate. For compositionally diverse systems, chemical accuracy (∼1 kcal/mol) can now be reached using only thousands of training instances. We have also studied the effect of explicitly accounting for inter-elemental distances in the periodic table: Our new QML model can produce semi-qualitatively accurate covalent bonding potentials for single, double, and triple bonds which include chemical element-pairs which were not part of training. For thousands of organic molecules, we also demonstrated that our model, after being trained on molecules which do not contain nitrogen or oxygen, still outperforms by a margin CM or BOB based models trained on molecules which do contain nitrogen or oxygen. For various electronic ground state properties of organic molecules, numerical results indicate that remarkable predictive power can be reached.

While the reference data used in this study has mostly been obtained at the hybrid DFT level of theory, the steep learning curves of our QML models suggest that it has now become a realistic possibility to obtain a sufficiently large training set at the post-Hartree-Fock level of theory (or from experiment) and to use it for the training of QML models which enable subsequent high-throughput screening efforts with similar accuracy.

Combining our new representation with the recently proposed QML model trained on molecular quasi-particles representing atoms-in-molecules (also known as “am-on” approach) might provide the possibility to generate accurate models which scale with the size of a query system.^{14} Subsequent work will deal with forces and other properties.

## SUPPLEMENTARY MATERIAL

See supplementary material for a definition of the four body distribution ($A4$); derivation of the overlap integrals where the angular parts are solved in Fourier space; and a full description of the water cluster calculations as well as the water cluster data.

## ACKNOWLEDGMENTS

F.A.F. would like to thank Kuang-Yu Samuel Chang and Michele Ceriotti for data provided. O.A.v.L. acknowledges funding from the Swiss National Science Foundation (Nos. PP00P2_138932 and 407540_167186 NFP 75 Big Data). This research was partly supported by the NCCR MARVEL, funded by the Swiss National Science Foundation. Calculations were performed at sciCORE (http://scicore.unibas.ch/) scientific computing core facility at the University of Basel.