We present a machine learning algorithm for the prediction of molecule properties inspired by ideas from density functional theory (DFT). Using Gaussian-type orbital functions, we create surrogate electronic densities of the molecule from which we compute invariant “solid harmonic scattering coefficients” that account for different types of interactions at different scales. Multilinear regressions of various physical properties of molecules are computed from these invariant coefficients. Numerical experiments show that these regressions have near state-of-the-art performance, even with relatively few training examples. Predictions over small sets of scattering coefficients can reach a DFT precision while being interpretable.

## I. INTRODUCTION

Many chemical and physical properties of atomistic systems result from the underlying quantum mechanical behavior of the electrons in the system, which are modeled through the Schrödinger equation with molecular Hamiltonian. However, computing numerically the wave function,^{1} the electronic density,^{2,3} and other molecular properties requires significant computational resources that restrict the size and number of atomistic systems that can be studied. This has led to substantial interest recently in developing machine learning techniques that can efficiently and accurately predict the chemical properties of atomistic systems after being trained on examples where these properties are known. The scope of these approaches ranges from estimating potential energy surfaces^{4–15} to methods that estimate a variety of properties of molecular configurations spread across chemical compound space.^{16–24}

These machine learning algorithms are designed to output predictions that respect the underlying physical laws of the system. In particular, global chemical and physical properties of atomistic systems are invariant to rigid movements and vary smoothly with respect to the change of single atomic positions. In addition, chemical systems have multiscale structures and interactions. The systematic treatment of these multiscale properties is absent among the machine learning methods put forward to date. This trend is, in part, due to the types of databases currently being studied in the machine learning context. These databases include neutral small organic molecules,^{25} as well as certain types of materials for which the potential energy can be separated into atom centered components that have localized interactions and a long-range component incorporating electrostatics and dispersion.^{8} Nevertheless, traits such as energies, vibrational frequencies, energy gaps, and dipole moments are complex properties of the atomistic state that are dependent upon all pairwise interactions in the system. These interactions occur at different scales, e.g., ionic and covalent bonds at short range and van der Waals interactions at the meso-scale, in addition to long-range Coulomb interactions. Furthermore, many chemical properties are not linearized over scales, but are rather the result of complex nonlinear coupling of small- to large-scale interactions in the system.

In order to build a learned approximation that accounts for the multiscale aspect of the quantum interaction, we extended the two-dimensional scattering transform for quantum energy regression^{26} to a three-dimensional spherical scattering transform.^{27} The molecule is represented by several 3D pseudo-electronic density channels, corresponding to core and valence electrons, approximated by Gaussians centered at atomic locations. A translation- and rotation-invariant representation is computed with a solid harmonic scattering transform that separates interactions at different scales and across scales, with cascaded wavelet transforms. This is illustrated in Fig. 1.

Scattering coefficients have the same invariance and stability properties as a quantum energy functional: they are invariant to rigid body motion and stable to deformations of the input density, as proved in Ref. 28. This computational approach has some resemblance with quantum chemical theory. Similar to the density functional introduced by Hohenberg and Kohn,^{2} which computes the ground-state energy for an input density, our procedure takes as input a surrogate density *ρ* and outputs an approximation of the ground-state energy. It involves solid harmonic wavelets which are similar to Gaussian Type Orbitals (GTOs), often used as a basis function for molecular orbitals in Kohn-Sham density functional theory (DFT).^{3} However, in contrast to DFT, we do not tune the wavelets’ parameters to compute a valid electronic density. We rather use them to parametrize the complex non-linear relation between the molecule geometry and its energy.

Numerical experiments are performed on the QM9 database,^{25} which contains ground-state energies of thousands of organic molecules computed with DFT using the hybrid Becke, 3-parameter, Lee-Yang-Parr (B3LYP) functional. A multilinear regression of ground-state energies from scattering invariants achieves a mean absolute error of 0.50 kcal/mol. This result is close to state-of-the-art machine learning techniques – in particular, those computed with deep neural networks. By contrast, we use a much simpler multiscale representation whose mathematical properties are relatively well understood and which can lead to interpretable results with sparse linear regressions. Using an orthogonal least squares (OLSs) regression, we obtain an error of 1.96 kcal/mol with only 400 coefficients and 10 000 samples. Resamplings show that the first few selected coefficients are always the same and thus seem to consistently capture chemical information of specific, conditionally ordered levels of importance to the task.

## II. MULTISCALE INVARIANTS

Let $x={(zk,rk)\u2208Z\xd7R3}k$ be an atomistic state, consisting of atoms with nuclear charges ${zk}k$, located at positions ${rk}k$. We denote by *f*(*x*) a scalar physical property of the state *x*, e.g., its energy, its HOMO-LUMO gap, or the magnitude of its dipole moment. To accurately approximate *f*(*x*) from examples, we must take advantage of the known regularity properties. We summarize some well-known properties of *f*(*x*) as follows:

Invariance to indexation of the atoms in the state

*x*;invariance to rigid motion of the state

*x*;Lipschitz continuity with respect to relative variations of distances between atoms;

dependence upon multiscale geometric properties of

*x*, with nonlinear couplings between scales.

An approximation $f\u0303(x)$ of the energy functional *f*(*x*) is computed as a multilinear regression over coefficients having the same invariance and stability properties as *f*(*x*). Indeed, invariance properties reduce the dimensionality of the approximation. This is implemented with a scattering regression illustrated in Fig. 1, which can be summarized into three steps. The first step, which is described in Sec. II A, maps *x* to several three-dimensional electronic densities, denoted here by *ρ*_{x}(*u*). They carry information about the spatial distributions of core and valence electrons, as well as bonds between atoms. These electronic embeddings are invariant to the permutation of atomic indexation. The second step computes an invariant scattering transform *Sρ*_{x} of each density channel. It separates the variations of *ρ*_{x} at different scales and includes scale interaction terms. These coefficients are Lipschitz continuous to deformations of *x* and hence to relative variations of distances between atoms. The computation of these coefficients is described in Secs. II B, II C, and III. These coefficients satisfy the properties listed above. A multilinear regression, described in Sec. IV, computes the estimation $f\u0303(x)$ of the molecular property *f*(*x*). Regression coefficients are optimized from the scattering transforms of a training set of molecules with their molecular values ${S\rho xi,f(xi)}i\u2264n$.

### A. Three-dimensional densities

The Hohenberg-Kohn theorem^{2} implies that the ground-state energy of an electronic system can be computed as a functional of the electronic density. DFT for molecules relies on this and models the electronic density as the square sum of a linear combination of many Gaussian-type orbitals centered on atom positions. We represent the state *x* by several density surrogates, computed as sums of non-interacting densities which depend upon the position and charge of each atom. This initial embedding ensures that the resulting prediction $f\u0303(x)$ will be invariant to permutations of the indexation of the atoms (property 1).

We compute three non-interacting density channels: one for core electrons, one for valence electrons, and a total density, which is the sum of the core and valence densities. These densities are computed as sums of non-interacting single-atom density functions placed at the atom locations. We use Gaussians to model these single atom densities, scaled to integrate to the amount of charge they represent. A three-dimensional density is computed as

where *γ*_{k} is a number of electrons at atom location *r*_{k} and *g*(*u*) is a three-dimensional Gaussian normalized to have unit integral.

Core and valence densities are obtained by setting *γ*_{k} to be the number of core electrons or the number of valence electrons of atom *k*. The total electronic density is the sum of the core and valence densities,

Due to the normalizations of the core and valence densities, the total density integrates to the total charge of the system: $\u222b\rho xtotal(u)=\u2211kzk$.

To disentangle bond information from the atom type, we create an additional channel to encode atomic bond information explicitly. This is a surrogate density concentrated along line segments between pairs of bonded atoms, whose integral is equal to the bond order. Let *B* = {(*i*, *j*)∣ atoms *i* and *j* are bonded} and

where *d*_{ij}(*u*) is the distance between *u* and the segment between the atoms *i* and *j*, *d*_{0} is a characteristic width, *γ*_{ij} is the number of electrons involved in the bond, and *C* is a constant so that $\u222b\rho xbonds(u)\u2009du=\u2211ij\gamma ij$.

All density channels provide unique geometric and chemical information about the molecule. Indeed, the position on the two axes of the periodic table, which encodes a large amount of chemical variability, is accounted for in the valence and core channels. This gives direct access to the properties that vary along these axes, such as the distance of the outer orbitals to the nucleus and the number of available electrons for bonding. The full density channel gives direct access to the actual atom identity, which can help encode its typical purposes within the molecules. The bonds channel can, when this information is available, provide the additional structure indicator of how many valence electrons are actually bonded. Figure 2 plots a slice of each of the core, valence, and bonds channels of *ρ*_{x} for an organic molecule with planar symmetry.

An alternative, more detailed approach would be to model the densities for small submolecular structures involving more than one atom, as described in Ref. 29.

### B. Scale separation with solid harmonic wavelets

We separate the different scale components of each density channel by computing a solid harmonic wavelet transform. In the following, we write *ρ* for any density channel $\rho x\alpha $ of an arbitrary state *x*.

A wavelet transform integrates the density function *ρ*(*u*) against a family of localized oscillating waveforms, called wavelets, which have zero mean and are dilated to different scales. The resulting wavelet coefficients give a complete representation of *ρ* (and hence *x*), but reorganize its information content according to the different scales within the system.

We use solid harmonic wavelets, first introduced in Ref. 27 in the context of a molecular energy regression, but which have also been used in generic 3D image processing.^{30} They are defined as

where ${Y\u2113m:\u2113\u22650,\u2212\u2113\u2264m\u2264\u2113}$ are the Laplacian spherical harmonics. Solid harmonic wavelets are solid harmonics, $|u|\u2113Y\u2113m(u/|u|)$, multiplied by a Gaussian, which localizes their support around 0. Decomposing angular frequencies using spherical harmonics will facilitate the computation of rotational invariants, which are described in Sec. II C. Dilations of $\psi \u2113m$ at the scale 2^{j} are given by

Figure 3 plots slices of their response in $R3$.

Solid harmonic wavelets are of the same functional form as Gaussian-type orbitals (GTOs),^{31} which are often used in computational quantum chemistry, e.g., as basis sets for the Kohn-Sham orbitals in DFT. In these approaches, one electronic density may be made up of a linear combination of hundreds of atom-centered GTOs, in which several GTOs are used to model each electron orbital of each atom and the corresponding electronic interactions that make up the bonds in the system.

Here, solid harmonic wavelets are used as convolution filters. A solid harmonic wavelet transform integrates the density *ρ*($v$) against each solid harmonic wavelet $\psi j,\u2113m(v)$ centered at each spatial location $u\u2208R3$,

where the scales go from *j* = 0, …, *J* − 1, and the angular frequency bands go from *ℓ* = 0, …, *L* − 1, for prescribed values of *J* and *L*, respectively. The wavelet coefficients *Wρ* separate the geometry of *ρ* into different scales *j* and angular frequency bands *ℓ*. The additional frequency variable *m* further subdivides the angular frequency information, while the spatial variable *u* is necessary due to the localization of the waveform and encodes the response of *ρ* against $\psi j,\u2113m$ at this location.

For an electronic density $\rho x(u)=\u2211k\gamma k\u2009g(u\u2212rk)$,

Since $g*\psi j,\u2113m$ is another (wider) Gaussian-type orbital, we may view these coefficients as emitting the same GTO from each atom, scaled by the electronic charge *γ*_{k}. These coefficients encode interferences of the solid harmonic waveforms at different scales. For small scales, these interferences resemble those found in the computation of electronic orbitals in computational chemistry software. Solid harmonic wavelet coefficients resulting from larger-scale waveforms aggregate subgroups of atoms within the system and encode macroscopic geometric patterns. While these coefficients do not have direct analogues in classical density estimation procedures, they are required for deriving invariant coefficients that can account for long-range interactions within the system.

### C. Solid harmonic wavelet invariants

Although the solid harmonic wavelet transform ${\rho *\psi j,lm(u)}$ forms a complete representation of *ρ*, it is not suitable as input to a regression for global chemical properties. Indeed, a translation of *ρ* [obtained by shifting the spatial input variable *u*, $\tau v$*ρ*(*u*) = *ρ*(*u* − $v$)] results in a translation of the wavelet coefficients. Rotations of *ρ* similarly rotate the variable *u*, but also redistribute the wavelet coefficients across the *m* angular frequencies within each angular frequency band *ℓ*. Using these coefficients for regression would require the regression to learn all invariance properties that are already known, entailing a very high sample complexity. We address this point by computing coefficients with the same invariances as the target model. In order to obtain invariant coefficients to rotations of the state *x*, we first aggregate the energy of the solid harmonic wavelet coefficients across *m*,

The value *U*[*j*, *ℓ*]*ρ* is computed with a Euclidean norm that we shall call a *modulus operator*. It eliminates the rotational phase subspace information from the wavelet coefficients, and one can prove that the resulting coefficients are covariant to both translations and rotations (meaning that a translation or rotation of *ρ* results in the same translation or rotation of *U*[*j*, *ℓ*]*ρ*). See Fig. 4 for the plots of *U*[*j*, *ℓ*]*ρ*_{x}.

It follows from this covariance property that rigid motion invariant coefficients may be obtained by integrating over *u*,

We refer to *Sρ*[*j*, *ℓ*, *q*] as first-order solid harmonic wavelet scattering coefficients. Each value *Sρ*[*j*, *ℓ*, *q*] collapses *U*[*j*, *ℓ*]*ρ* (illustrated by a single box in Fig. 4) into a single coefficient. They give invariant descriptions of the state *x* that are stable to deformations of the atomic positions. Furthermore, they separate the scales of the system by aggregating geometric patterns related to the arrangement of the atoms and bonds at each scale.

The power *q* = 1 scales linearly with the number of electronic particles represented by the density *ρ*, while *q* = 2 encodes pairwise interactions, which are related to electrostatic Coulomb interactions in Ref. 26. Specifically, the wavelet modulus at *ℓ* = 1, *U*[*j*, 1]*ρ*, computes the gradient magnitude of the density at scale *j*. When integrated over space with *q* = 1, this results in the total variation of the scaled density, a quantity well known in functional analysis. It measures the lengths of density level sets with emphasis on inflection lines. At a sufficiently large scale, it measures the perimeter of the molecule (see the first line of Fig. 4). Powers smaller than 1 provide sparsity information: A small exponent *q* gives more importance to small non-zero values, essentially counting them. Large exponents are sensitive to peaks of large amplitudes.

First-order solid harmonic scattering coefficients initially encode the state *x* in a family of Gaussian densities *ρ*_{x} and use spherical harmonics to obtain rotational invariants; in this regard, they are similar to the smooth overlap of atomic positions (SOAP) algorithm introduced in Ref. 10 and subsequently used in Refs. 11 and 21. SOAP representations and resulting similarity measures, however, are computed on localized atom-centered molecular fragments at a fixed scale, which are then combined to form global similarity measures between different molecules.^{21} First-order scattering coefficients *Sρ*_{x}[*j*, *ℓ*, *q*] are global integrated descriptions of the state *x*, which are computed at each scale 2^{j}.

## III. SCALE INTERACTION COEFFICIENTS

First-order scattering coefficients separate the scales of the system *x*. They encode both micro- and macroscopic invariant geometric descriptions of *x*, corresponding to short- and long-range interactions. However, the chemical and physical properties of complex quantum systems with many particles are the result of not only interactions at each scale but also the nonlinear mixing of scales (as in London dispersion forces^{32}). We thus complement first-order scattering coefficients with second-order scattering coefficients that combine information across scales within the system. If a certain sub-molecule property becomes visible to a solid harmonic scattering modulus at a given scale, then interactions between these sub-molecule units can be captured by analyzing this modulus at a larger scale.

Solid harmonic wavelet modulus coefficients *U*[*j*, *ℓ*]*ρ* are decomposed with a second wavelet modulus transform, with wavelets $\psi j\u2032,\u2113m$ that are larger than the first wavelet,

The resulting coefficients mix relatively small-scale geometries of *ρ* across the larger scale 2^{j′}, in a distinctly different fashion than, for example, computing the product *Sρ*[*j*, *ℓ*, *q*]*Sρ*[*j*′, *ℓ*, *q*]. Indeed, wavelet modulus coefficients *U*[*j*, *ℓ*]*ρ* are covariant to rotations and translations and thus maintain oriented geometric information within the state *x* at the scale 2^{j}. This information is analogous to localized multipole moments, such as local dipoles that arise due to the polarizability of atoms or sub-molecules within the state *x*. These weak localized interactions within the state *x*, in aggregate, non-trivially impact the large-scale structure of the system. The second wavelet $\psi j\u2032,\u2113m$ sets the scale at which this structural impact is measured and the localized contributions at the smaller scale are aggregated via integration,

The resulting coefficients *Sρ*[*j*, *j*′, *ℓ*, *q*] are second-order solid harmonic wavelet scattering coefficients, which are depicted in Fig. 1. These coefficients are invariant to the rigid motion of *ρ*, stable to deformations of the state *x*, and couple two scales within the atomistic system. We conjecture that the resulting terms are qualitatively similar to first-order approximations of van der Waals interactions, which are given by *Cα*_{1}*α*_{2}/*R*^{6}, where *α*_{k}, *k* = 1, 2, are the polarizabilities of two local substructures within *x* that are well separated by a relatively large distance *R*. A second-order scattering coefficient can potentially capture such a term if the substructures are of size proportional to 2^{j} and *R* ∼ 2^{j′}.

These van der Waals terms, in addition to electrostatic terms and localized energy terms, are utilized in expansions of the total energy of *x* via perturbation theory. Since we do not have a direct bijective correspondence with such terms, we instead learn and compute such energy expansions via multilinear regressions, which are described in Sec. IV.

While the described forces play a subordinate role in the systems we study in Sec. V, we find empirically that the inclusion of the described scale interaction terms significantly increases the predictive power of our analysis method. Furthermore, we expect that for systems with significant van der Waals interactions second-order scattering coefficients will be necessary to achieve reasonable regression errors. The mathematical and empirical study of this hypothesis is the subject of ongoing and future work.

## IV. MULTILINEAR REGRESSION

Physical properties of atomistic states are regressed with multilinear combinations of scattering coefficients *Sρ*_{x}. A multilinear regression of order *r* is defined by

For *r* = 1, it gives a linear regression that we write by expanding the inner product

where different scales *j* and *j*′, oscillation numbers *ℓ*, and integration exponents *q* are separated in different coefficients. For *r* = 2, this form introduces a non-linearity similar to those found in factored gated autoencoders.^{33} We optimize the parameters of the multilinear regression by minimizing the quadratic loss over the training data,

using the Adam algorithm for stochastic gradient descent.^{34}

If *f*(*x*) is the total energy of the state *x* and *x* is decomposable into a number of simpler subsystems, then perturbation theory analysis expands the energy *f*(*x*) into an infinite series of higher order energy terms,

in which *f*_{0} captures the energies of the isolated subsystems, *f*_{1} captures their electrostatic interactions, and *f*_{2} captures induction and dispersion energies, which result from van der Waals interactions. The linear regression (1) similarly expands the total energy into successively higher order energy terms, but which are defined by wavelet scattering coefficients. As described in Secs. II and III, first-order scattering coefficients for *q* = 2 likewise encode pairwise Coulombic interactions, while second-order scattering coefficients couple scales within the system, analogous to the first-order approximations of induction and dispersion energies.

## V. NUMERICAL RESULTS

We train the regression algorithm to predict energies and other properties of organic molecules with first- and second-order solid harmonic wavelet scattering coefficients. Detailed numerical estimates and comparisons with the state of the art are presented in Sec. V A. In Sec. V B, we interpret scattering embeddings of organic molecules by analyzing sparse linear regressions.

### A. Numerical regression errors

We use our algorithm to approximate molecular energies, i.e., the atomization energy at 0 K *U*_{0}, atomization energy at room temperature *U*, enthalpy of atomization at room temperature *H*, and atomization free energy at room temperature *G*. We use the values of the QM9 dataset,^{25} which contains 133 885 different small organic molecules with maximally 29 atoms of the elements C, H, O, N, and F, and maximally 9 heavy (non-hydrogen) atoms. Their molecular configurations and properties were estimated using B3LYP DFT.

First- and second-order scattering invariants were computed for core, valence, full, and bonds channels, with *J* = 4 and *ℓ* ∈{1, 2, 3}. The moduli were raised to the powers *q* ∈ {1/2, 1, 2, 3, 4}.

Multilinear regressions were trained and evaluated on five random splits of the dataset, for *r* = 1, i.e., linear regression, and *r* = 3, trilinear regression. The evaluation criterion for the fits was mean absolute error, which is the most prevalent error measure in the literature. We use 5-fold cross-validation, separating the data in 107 108 training datapoints and 26 777 test datapoints per fit.

For the atomization energy at room temperature *U*, trilinear regression achieves close to state-of-the-art error rates on core, valence, and full densities (0.56 kcal/mol). When incorporating bond information via the bonds channel, the result improves to 0.51 kcal/mol. This is expected since the bonds channel contains relevant information pertaining to the known bond structure, which are not immediately deducible from the number of valence electrons of the constituent atoms.

Notably, the error of a simple linear regression is much lower than that of Coulomb matrices fit with a kernel ridge regression at full sample complexity.

The QM9 dataset contains eight other molecular properties that we are able to predict. However, our technique has been primarily optimized to regress the energy properties which are at the top of Table I. The bottom part provides prediction errors for the dipole moment *μ* (Debye), the static polarizability *α* ($a03$, Bohr radius cubed), the *ϵ*_{HOMO} energy (kcal/mol), the *ϵ*_{LUMO} energy (kcal/mol), the *ϵ*_{gap} gap energy (kcal/mol), the electronic spatial extent ⟨*R*^{2}⟩ ($a02$), the zero point vibrational energy (eV), and the heat capacity at room temperature $Cv$ (cal/mol/K). The results show good prediction performance on all properties of the molecules, relative to their DFT estimated values. Some error values are close to state of the art and the majority is no further than a factor 2 larger in error.

. | L-Scat . | T-Scat . | NMP . | CM . | SchNet . |
---|---|---|---|---|---|

U_{0} | 1.89 | 0.50 | 0.45 | 2.95 | 0.31 |

U | 2.4 | 0.51 | 0.45 | 2.99 | |

H | 1.9 | 0.51 | 0.39 | 2.99 | |

G | 1.87 | 0.51 | 0.44 | 2.97 | |

μ | 0.63 | 0.34 | 0.030 | 0.45 | |

α | 0.52 | 0.16 | 0.092 | 0.43 | |

ϵ_{HOMO} | 4.08 | 1.97 | 0.99 | 3.06 | |

ϵ_{LUMO} | 5.39 | 1.76 | 0.87 | 4.22 | |

ϵ_{gap} | 7 | 2.73 | 1.60 | 5.28 | |

⟨R^{2}⟩ | 6.67 | 0.41 | 0.18 | 3.39 | |

zpve | 0.004 | 0.002 | 0.0015 | 0.0048 | |

$Cv$ | 0.10 | 0.049 | 0.04 | 0.12 |

. | L-Scat . | T-Scat . | NMP . | CM . | SchNet . |
---|---|---|---|---|---|

U_{0} | 1.89 | 0.50 | 0.45 | 2.95 | 0.31 |

U | 2.4 | 0.51 | 0.45 | 2.99 | |

H | 1.9 | 0.51 | 0.39 | 2.99 | |

G | 1.87 | 0.51 | 0.44 | 2.97 | |

μ | 0.63 | 0.34 | 0.030 | 0.45 | |

α | 0.52 | 0.16 | 0.092 | 0.43 | |

ϵ_{HOMO} | 4.08 | 1.97 | 0.99 | 3.06 | |

ϵ_{LUMO} | 5.39 | 1.76 | 0.87 | 4.22 | |

ϵ_{gap} | 7 | 2.73 | 1.60 | 5.28 | |

⟨R^{2}⟩ | 6.67 | 0.41 | 0.18 | 3.39 | |

zpve | 0.004 | 0.002 | 0.0015 | 0.0048 | |

$Cv$ | 0.10 | 0.049 | 0.04 | 0.12 |

Using linear regression, we checked the impact of omission of one or several of the input channels. While the valence atomic channel accounts for the strongest change in variance explained, both core and full density channels are required to achieve satisfactory results. The additional accuracy provided by the bonds channel is smaller, but significant. Notably, comparing the linear regression error incorporating all four channels (1.89 kcal/mol) to the four linear regression models in which each channel is left out in turn, we obtain the following error reductions: Removing the full density incurs 0.64 kcal/mol, removing the core density incurs 0.51 kcal/mol, removing the valence channel incurs 0.73 kcal/mol, and removing the bonds channel results in an error increase of 0.16 kcal/mol from the joint prediction error of 1.89 kcal/mol. Removing any of the channels results in a non-negligible error increase. Future directions include the exploration of molecular dynamics via force derivatives of the energy functional. In this setting, the bonds channel will be omitted since in that setting bonds are subject to temporal variation and may be broken or created.

### B. Scattering model interpretation

Linear regressions from scattering invariants have good prediction properties. We use this fact to compute small-size interpretable models, with sparse regressions. These sparse regressions are computed with an Orthogonal Least Square (OLS) pursuit, which is a greedy algorithm that decorrelates regressors. At every iteration, this algorithm^{26} selects the scattering coefficient that predicts the target best and then orthogonalizes the system with respect to that predictor. It gives us a list of coefficients ordered by conditional importance. We apply this procedure to different splits of the dataset, which can yield different coefficient lists if selection is unstable (e.g., when two features add similar amounts of predictive power such that the randomness of data splitting can make either one be selected first). Figure 5 (left) shows the average error for the atomization energy *U* with respect to the number of features selected by OLS.

We can see clearly that the more the coefficients are included in a regression, the further the estimation error decreases. In particular, we see a strong initial drop in error with each new coefficient and with only 40 coefficients, and we are already within the range of the Coulomb matrix reference error. The error continues to monotonically decrease to 1.96 kcal/mol at 400 coefficients.

We also analyze how many samples are needed for the training: with 6 coefficients, only 100 samples are needed to train almost fully, but the error cannot go under 21 kcal/mol; with 40 coefficients, the error asymptote of 4.63 kcal/mol is reached at roughly 1000 samples. With 400 coefficients, the performance is similar to the full linear regression, and the error asymptote of 1.96 kcal/mol is reached at around 10 000 samples.

Given this knowledge, we can conclude that our current linear estimators can be trained on very small subsets of QM9. Indeed, our sparse linear models require considerably less data when compared to scattering trilinear regressions and the current state of the art—message passing neural network algorithms and SchNet. Furthermore, while kernel methods with appropriate kernels may be able to decrease the error to 0 given a large or infinite amount of data, their representation sizes increase with the number of samples. Using fixed-size and especially linear regressions, as we do, we are able to detect their capacity limit and use the relevant prediction depending on the case.

Another direction of extrapolation relevant to predictions evaluation is molecule size. Ideally, a machine learning technique should be able to predict properties of molecules of a different size from those on which it was trained. An estimator with this property could be useful for computing chemical properties of larger-scale structures. Defining molecule size by the number of heavy atoms it contains, we created training sets of molecules with up to 7, up to 8, and up to 9 heavy atoms. We evaluate predictive accuracy of the linear regression on test sets likewise composed of up to 7, 8, or 9 heavy atom molecules. Learning on up to 7 or 8 heavy atoms incurs an error of around 4 kcal/mol on molecules with 9 heavy atoms and 2 kcal/mol on the smaller ones. Learning with up to 9 heavy atoms gives the usual 2 kcal/mol error. It thus seems as though scattering coefficients permit scaling to larger molecule sizes while incurring a constant error. For larger datasets to come, this test will permit clearer conclusions. We believe that such molecular extension properties should be included in the evaluation procedure of machine learning algorithms.

As stated above, we select the coefficients with OLS using different splits of the dataset. Notably, the first 6 selected coefficients are always the same, independently of the split of the dataset. The stability in the selection of the same first 6 coefficients invites closer examination. We apply the same OLS selection of 6 coefficients for the internal energy *U*_{int}, which is equal to the atomization energy *U* plus the internal energy of all the atoms that form the molecule. The 6 selected coefficients are shown in Table II.

U
. | Density . | j
. | j′
. | ℓ
. | q . |
---|---|---|---|---|---|

1 | Bonds | 2 | 1 | 1/2 | |

2 | Bonds | 0 | 1 | 1 | 1/2 |

3 | Bonds | 1 | 2 | 1/2 | |

4 | Core | 0 | 2 | 3 | 4 |

5 | Valence | 3 | 3 | 3 | |

6 | Full | 0 | 2 | 1/2 |

U
. | Density . | j
. | j′
. | ℓ
. | q . |
---|---|---|---|---|---|

1 | Bonds | 2 | 1 | 1/2 | |

2 | Bonds | 0 | 1 | 1 | 1/2 |

3 | Bonds | 1 | 2 | 1/2 | |

4 | Core | 0 | 2 | 3 | 4 |

5 | Valence | 3 | 3 | 3 | |

6 | Full | 0 | 2 | 1/2 |

U_{int}
. | Density . | j
. | j′
. | ℓ
. | q . |
---|---|---|---|---|---|

1 | Valence | 0 | 1 | 2 | |

2 | Core | 3 | 4 | 2 | 2 |

3 | Full | 3 | 4 | 2 | 2 |

4 | Full | 1 | 1 | 3 | |

5 | Full | 0 | 1 | 1 | 2 |

6 | Valence | 0 | 1 | 1 | 4 |

U_{int}
. | Density . | j
. | j′
. | ℓ
. | q . |
---|---|---|---|---|---|

1 | Valence | 0 | 1 | 2 | |

2 | Core | 3 | 4 | 2 | 2 |

3 | Full | 3 | 4 | 2 | 2 |

4 | Full | 1 | 1 | 3 | |

5 | Full | 0 | 1 | 1 | 2 |

6 | Valence | 0 | 1 | 1 | 4 |

Interestingly, we see that the coefficients computed from the bonds density are highly relevant to the atomization energies, but accurate estimation requires information about the nature of the atoms that are contained in the coefficients computed from the core and full densities. On the other hand, coefficients computed from the full, core, and valence density are highly relevant to the internal energy *U*_{int} of the molecule, which contains the individual free atom energies as well as the bond energies. We also see the importance of the second-order coefficients (coefficients for which *j*′ ≠ $\u2205$) that account for multiscale interactions.

## VI. CONCLUSION

We have presented an algorithm that introduces a representation of molecules that is invariant to three-dimensional rotations and translations. The algorithm relies on a multiscale representation of molecular densities based on solid harmonic wavelets. Solid harmonic wavelets are also used as Gaussian-type orbitals in molecular DFT, establishing an interesting qualitative link between the methods. Our method does not optimize the parameters of the solid harmonics, but rather uses them to analyze surrogate densities created to resemble the molecule electronic density. The representation is general enough to be applied to any three-dimensional density dataset and will extend to molecules with a higher number of atoms than the ones presented in the training set. We have shown that the representation can achieve near state-of-the-art predictive performance using a nonlinear regression.

Notably, compromising in predictive performance, we can use sparse linear regressions to add interpretability to our results. We show that specific coefficients are reliably selected by an orthogonal least squares method across random data splits. The selected coefficients contain bond information when atomization energy is fitted, whose primary fluctuations are derived from bonding, and contain atomic information when full ground-state energy is fitted because free atom contributions then dominate the term. We thus have a precise handle on the interpretation of the coefficients.

## ACKNOWLEDGMENTS

M.E., G.E., and S.M. are supported by ERC Grant InvariantClass No. 320959; M.H. is supported by the Alfred P. Sloan Fellowship (Grant No. FG-2016-6607), the DARPA YFA (Grant No. D16AP00117), and NSF Grant No. 1620216.

### APPENDIX: SOFTWARE AND DATA AVAILABILITY

Software to reproduce the results is available at http://www.di.ens.fr/data/software.