We present a machine learning algorithm for the prediction of molecule properties inspired by ideas from density functional theory (DFT). Using Gaussian-type orbital functions, we create surrogate electronic densities of the molecule from which we compute invariant “solid harmonic scattering coefficients” that account for different types of interactions at different scales. Multilinear regressions of various physical properties of molecules are computed from these invariant coefficients. Numerical experiments show that these regressions have near state-of-the-art performance, even with relatively few training examples. Predictions over small sets of scattering coefficients can reach DFT precision while remaining interpretable.

Many chemical and physical properties of atomistic systems result from the underlying quantum mechanical behavior of the electrons in the system, which are modeled through the Schrödinger equation with molecular Hamiltonian. However, computing numerically the wave function,1 the electronic density,2,3 and other molecular properties requires significant computational resources that restrict the size and number of atomistic systems that can be studied. This has led to substantial interest recently in developing machine learning techniques that can efficiently and accurately predict the chemical properties of atomistic systems after being trained on examples where these properties are known. The scope of these approaches ranges from estimating potential energy surfaces4–15 to methods that estimate a variety of properties of molecular configurations spread across chemical compound space.16–24 

These machine learning algorithms are designed to output predictions that respect the underlying physical laws of the system. In particular, global chemical and physical properties of atomistic systems are invariant to rigid movements and vary smoothly with respect to the change of single atomic positions. In addition, chemical systems have multiscale structures and interactions. The systematic treatment of these multiscale properties is absent among the machine learning methods put forward to date. This trend is, in part, due to the types of databases currently being studied in the machine learning context. These databases include neutral small organic molecules,25 as well as certain types of materials for which the potential energy can be separated into atom centered components that have localized interactions and a long-range component incorporating electrostatics and dispersion.8 Nevertheless, traits such as energies, vibrational frequencies, energy gaps, and dipole moments are complex properties of the atomistic state that are dependent upon all pairwise interactions in the system. These interactions occur at different scales, e.g., ionic and covalent bonds at short range and van der Waals interactions at the meso-scale, in addition to long-range Coulomb interactions. Furthermore, many chemical properties are not linearized over scales, but are rather the result of complex nonlinear coupling of small- to large-scale interactions in the system.

In order to build a learned approximation that accounts for the multiscale aspect of the quantum interaction, we extended the two-dimensional scattering transform for quantum energy regression26 to a three-dimensional spherical scattering transform.27 The molecule is represented by several 3D pseudo-electronic density channels, corresponding to core and valence electrons, approximated by Gaussians centered at atomic locations. A translation- and rotation-invariant representation is computed with a solid harmonic scattering transform that separates interactions at different scales and across scales, with cascaded wavelet transforms. This is illustrated in Fig. 1.

FIG. 1.

Scattering regression of a molecular property f(x). The atomistic state x is represented by several three-dimensional densities. An invariant scattering transform is applied to each density. It computes a solid harmonic wavelet transform of the density, and the moduli of these coefficients are transformed again by a second solid harmonic wavelet transform. Invariant scattering coefficients are spatial L^q norms of the resulting first- and second-order modulus densities for several exponents q. A multilinear regression computes an estimation f̃(x) of the molecular property from these invariant scattering coefficients.


Scattering coefficients have the same invariance and stability properties as a quantum energy functional: they are invariant to rigid body motion and stable to deformations of the input density, as proved in Ref. 28. This computational approach has some resemblance to quantum chemical theory. Similar to the density functional introduced by Hohenberg and Kohn,2 which computes the ground-state energy for an input density, our procedure takes as input a surrogate density ρ and outputs an approximation of the ground-state energy. It involves solid harmonic wavelets, which are similar to Gaussian-type orbitals (GTOs), often used as basis functions for molecular orbitals in Kohn-Sham density functional theory (DFT).3 However, in contrast to DFT, we do not tune the wavelets’ parameters to compute a valid electronic density. We rather use them to parametrize the complex nonlinear relation between the molecule geometry and its energy.

Numerical experiments are performed on the QM9 database,25 which contains ground-state energies of thousands of organic molecules computed with DFT using the hybrid Becke, 3-parameter, Lee-Yang-Parr (B3LYP) functional. A multilinear regression of ground-state energies from scattering invariants achieves a mean absolute error of 0.50 kcal/mol. This result is close to state-of-the-art machine learning techniques, in particular those computed with deep neural networks. By contrast, we use a much simpler multiscale representation whose mathematical properties are relatively well understood and which can lead to interpretable results with sparse linear regressions. Using an orthogonal least squares (OLS) regression, we obtain an error of 1.96 kcal/mol with only 400 coefficients and 10 000 samples. Resamplings show that the first few selected coefficients are always the same, suggesting that they consistently capture chemical information in a stable order of conditional importance to the task.

The remainder of the paper is organized as follows. Sections II and III describe the wavelet scattering coefficients. Section IV details multilinear regressions. Numerical results, discussion, and interpretation are presented in Sec. V.

Let x = {(z_k, r_k) ∈ ℤ × ℝ³}_k be an atomistic state, consisting of atoms with nuclear charges {z_k}_k, located at positions {r_k}_k. We denote by f(x) a scalar physical property of the state x, e.g., its energy, its HOMO-LUMO gap, or the magnitude of its dipole moment. To accurately approximate f(x) from examples, we must take advantage of the known regularity properties. We summarize some well-known properties of f(x) as follows:

  1. Invariance to indexation of the atoms in the state x;

  2. Invariance to rigid motion of the state x;

  3. Lipschitz continuity with respect to relative variations of distances between atoms;

  4. Dependence upon multiscale geometric properties of x, with nonlinear couplings between scales.

An approximation f̃(x) of the energy functional f(x) is computed as a multilinear regression over coefficients having the same invariance and stability properties as f(x). Indeed, invariance properties reduce the dimensionality of the approximation. This is implemented with a scattering regression illustrated in Fig. 1, which can be summarized into three steps. The first step, which is described in Sec. II A, maps x to several three-dimensional electronic densities, denoted here by ρ_x(u). They carry information about the spatial distributions of core and valence electrons, as well as bonds between atoms. These electronic embeddings are invariant to the permutation of atomic indexation. The second step computes an invariant scattering transform Sρ_x of each density channel. It separates the variations of ρ_x at different scales and includes scale interaction terms. These coefficients are Lipschitz continuous to deformations of x and hence to relative variations of distances between atoms. The computation of these coefficients is described in Secs. II B, II C, and III. These coefficients satisfy the properties listed above. A multilinear regression, described in Sec. IV, computes the estimation f̃(x) of the molecular property f(x). Regression coefficients are optimized from the scattering transforms of a training set of molecules with their molecular values {(Sρ_{x_i}, f(x_i))}_{i≤n}.
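Schematically, the full pipeline reads as the sketch below. The function names are placeholders for the operations defined in Secs. II–IV; this is an outline of the data flow only, not the actual implementation.

```python
# Outline of the scattering regression pipeline (placeholder function names;
# each step is defined in the sections indicated in the comments).

def predict_property(x, regression_parameters):
    # Step 1 (Sec. II A): map the atomistic state x = {(z_k, r_k)} to
    # permutation-invariant density channels (core, valence, full, bonds).
    density_channels = compute_density_channels(x)

    # Step 2 (Secs. II B, II C, and III): compute translation- and
    # rotation-invariant scattering coefficients of each channel,
    # including first-order and scale-interaction (second-order) terms.
    invariants = []
    for rho in density_channels:
        invariants.extend(solid_harmonic_scattering(rho))

    # Step 3 (Sec. IV): multilinear regression on the invariant coefficients.
    return multilinear_regression(invariants, regression_parameters)
```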

The Hohenberg-Kohn theorem2 implies that the ground-state energy of an electronic system can be computed as a functional of the electronic density. DFT for molecules relies on this and models the electronic density as a sum of squares of molecular orbitals, each expanded as a linear combination of Gaussian-type orbitals centered on atom positions. We represent the state x by several density surrogates, computed as sums of non-interacting densities which depend upon the position and charge of each atom. This initial embedding ensures that the resulting prediction f̃(x) will be invariant to permutations of the indexation of the atoms (property 1).

We compute three non-interacting density channels: one for core electrons, one for valence electrons, and a total density, which is the sum of the core and valence densities. These densities are computed as sums of non-interacting single-atom density functions placed at the atom locations. We use Gaussians to model these single atom densities, scaled to integrate to the amount of charge they represent. A three-dimensional density is computed as

$\rho_x(u) = \sum_k \gamma_k \, g(u - r_k),$

where γ_k is the number of electrons at atom location r_k and g(u) is a three-dimensional Gaussian normalized to have unit integral.

Core and valence densities are obtained by setting γk to be the number of core electrons or the number of valence electrons of atom k. The total electronic density is the sum of the core and valence densities,

$\rho_x^{\mathrm{total}}(u) = \rho_x^{\mathrm{core}}(u) + \rho_x^{\mathrm{valence}}(u).$

Due to the normalizations of the core and valence densities, the total density integrates to the total charge of the system: $\int \rho_x^{\mathrm{total}}(u)\,du = \sum_k z_k$.
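As a concrete illustration, a minimal numpy sketch of these atom-centered channels on a regular grid could look as follows; the grid, the Gaussian width sigma, and the water-like geometry are illustrative choices, not the settings used in the experiments.

```python
import numpy as np

def gaussian_density(positions, charges, grid, sigma):
    """Sum of unit-integral Gaussians g(u - r_k), each scaled by gamma_k electrons."""
    norm = 1.0 / ((2.0 * np.pi) ** 1.5 * sigma ** 3)  # normalizes each 3D Gaussian to integral 1
    rho = np.zeros(grid.shape[:-1])
    for r_k, gamma_k in zip(positions, charges):
        d2 = np.sum((grid - r_k) ** 2, axis=-1)
        rho += gamma_k * norm * np.exp(-d2 / (2.0 * sigma ** 2))
    return rho

# Illustrative water-like geometry (positions in arbitrary units).
positions = np.array([[0.0, 0.0, 0.0],       # O
                      [0.76, 0.59, 0.0],     # H
                      [-0.76, 0.59, 0.0]])   # H
core_electrons = np.array([2.0, 0.0, 0.0])
valence_electrons = np.array([6.0, 1.0, 1.0])

# Regular grid over a box containing the molecule.
axis = np.linspace(-4.0, 4.0, 64)
dV = (axis[1] - axis[0]) ** 3                 # volume element, used to approximate integrals
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)

rho_core = gaussian_density(positions, core_electrons, grid, sigma=0.5)
rho_valence = gaussian_density(positions, valence_electrons, grid, sigma=0.5)
rho_total = rho_core + rho_valence            # integrates to the total charge sum_k z_k
```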

To disentangle bond information from the atom type, we create an additional channel to encode atomic bond information explicitly. This is a surrogate density concentrated along line segments between pairs of bonded atoms, whose integral is equal to the bond order. Let B = {(i, j)∣ atoms i and j are bonded} and

$\rho_x^{\mathrm{bonds}}(u) = C \sum_{(i,j) \in B} |r_i - r_j|^{-1} \, \gamma_{ij} \, \exp\!\big(-d_{ij}^2(u) / 2 d_0^2\big),$

where d_{ij}(u) is the distance between u and the segment between the atoms i and j, d_0 is a characteristic width, γ_{ij} is the number of electrons involved in the bond, and C is a constant chosen so that $\int \rho_x^{\mathrm{bonds}}(u)\,du = \sum_{(i,j)\in B} \gamma_{ij}$.
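A sketch of the bond channel under the same illustrative assumptions, reusing the grid, positions, and volume element dV from the previous sketch; the bond list, bond electron counts, and the width d0 are placeholders, and the normalization constant C is computed numerically here.

```python
import numpy as np

def point_segment_distance(u, a, b):
    """Distance from grid points u (..., 3) to the segment between a and b."""
    ab = b - a
    t = np.clip(np.sum((u - a) * ab, axis=-1) / np.dot(ab, ab), 0.0, 1.0)
    closest = a + t[..., None] * ab
    return np.linalg.norm(u - closest, axis=-1)

def bond_density(positions, bonds, bond_electrons, grid, d0, dV):
    """Surrogate density concentrated along the segments joining bonded atoms."""
    rho = np.zeros(grid.shape[:-1])
    for (i, j), gamma_ij in zip(bonds, bond_electrons):
        r_i, r_j = positions[i], positions[j]
        d_ij = point_segment_distance(grid, r_i, r_j)
        rho += gamma_ij / np.linalg.norm(r_i - r_j) * np.exp(-d_ij ** 2 / (2.0 * d0 ** 2))
    # Global constant C chosen so the channel integrates to the total number of bond electrons.
    rho *= sum(bond_electrons) / (rho.sum() * dV)
    return rho

bonds = [(0, 1), (0, 2)]            # the two O-H bonds of the example geometry
bond_electrons = [2.0, 2.0]         # two electrons per single bond
rho_bonds = bond_density(positions, bonds, bond_electrons, grid, d0=0.3, dV=dV)
```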

All density channels provide unique geometric and chemical information about the molecule. Indeed, the position on the two axes of the periodic table, which encodes a large amount of chemical variability, is accounted for in the valence and core channels. This gives direct access to the properties that vary along these axes, such as the distance of the outer orbitals to the nucleus and the number of available electrons for bonding. The full density channel gives direct access to the actual atom identity, which can help encode its typical purposes within the molecules. The bonds channel can, when this information is available, provide the additional structure indicator of how many valence electrons are actually bonded. Figure 2 plots a slice of each of the core, valence, and bonds channels of ρx for an organic molecule with planar symmetry.

FIG. 2.

Slice plots of the core and valence electronic densities (thresholded and color-coded by charge count) and of the bond density of a planar molecule. The core density does not show hydrogen atoms.


An alternative, more detailed approach would be to model the densities for small submolecular structures involving more than one atom, as described in Ref. 29.

We separate the different scale components of each density channel by computing a solid harmonic wavelet transform. In the following, we write ρ for any density channel ρ_x^α of an arbitrary state x.

A wavelet transform integrates the density function ρ(u) against a family of localized oscillating waveforms, called wavelets, which have zero mean and are dilated to different scales. The resulting wavelet coefficients give a complete representation of ρ (and hence x), but reorganize its information content according to the different scales within the system.

We use solid harmonic wavelets, first introduced in Ref. 27 in the context of a molecular energy regression, but which have also been used in generic 3D image processing.30 They are defined as

$\psi_\ell^m(u) = \frac{1}{\sqrt{(2\pi)^3}} \, e^{-|u|^2/2} \, |u|^\ell \, Y_\ell^m\!\left(\frac{u}{|u|}\right),$

where $\{Y_\ell^m : \ell \ge 0, \ |m| \le \ell\}$ are the Laplacian spherical harmonics. Solid harmonic wavelets are solid harmonics, $|u|^\ell Y_\ell^m(u/|u|)$, multiplied by a Gaussian, which localizes their support around 0. Decomposing angular frequencies using spherical harmonics will facilitate the computation of rotational invariants, which are described in Sec. II C. Dilations of ψ_ℓ^m at the scale 2^j are given by

$\psi_{j,\ell}^m(u) = 2^{-3j} \, \psi_\ell^m(2^{-j} u).$

Figure 3 plots slices of their real parts in ℝ³.

FIG. 3.

Solid harmonic wavelets’ real parts at a fixed scale viewed on an off-center slice at a fixed x-value. For a given ℓ, the global order of oscillation is always the same, but across different values of m, the oscillations are traded off between the visible plane and the sliced z-dimension. The planar cuts are also circular harmonics.

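A minimal sketch of how these wavelets can be evaluated on the grid of the earlier density sketch, using scipy's spherical harmonics; normalization conventions may differ slightly from the ones used in the experiments.

```python
import numpy as np
from scipy.special import sph_harm

def solid_harmonic_wavelet(grid, j, ell, m):
    """psi_{j,ell}^m(u) = 2^{-3j} psi_ell^m(2^{-j} u), with
    psi_ell^m(u) = (2 pi)^{-3/2} exp(-|u|^2 / 2) |u|^ell Y_ell^m(u / |u|)."""
    u = grid * 2.0 ** (-j)
    x, y, z = u[..., 0], u[..., 1], u[..., 2]
    r = np.sqrt(x ** 2 + y ** 2 + z ** 2)
    r_safe = np.where(r > 0.0, r, 1.0)                # avoid dividing by zero at the origin
    theta = np.arctan2(y, x)                          # azimuthal angle
    phi = np.arccos(np.clip(z / r_safe, -1.0, 1.0))   # polar angle
    Y = sph_harm(m, ell, theta, phi)                  # scipy convention: Y_ell^m(azimuth, polar)
    psi = (2.0 * np.pi) ** (-1.5) * np.exp(-r ** 2 / 2.0) * r ** ell * Y
    return 2.0 ** (-3 * j) * psi                      # complex-valued filter on the grid

# Example: the ell = 2, m = 1 wavelet at scale 2^1 on the grid defined earlier.
psi_21 = solid_harmonic_wavelet(grid, j=1, ell=2, m=1)
```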

Solid harmonic wavelets are of the same functional form as Gaussian-type orbitals (GTOs),31 which are often used in computational quantum chemistry, e.g., as basis sets for the Kohn-Sham orbitals in DFT. In these approaches, one electronic density may be made up of a linear combination of hundreds of atom-centered GTOs, in which several GTOs are used to model each electron orbital of each atom and the corresponding electronic interactions that make up the bonds in the system.

Here, solid harmonic wavelets are used as convolution filters. A solid harmonic wavelet transform integrates the density ρ(v) against each solid harmonic wavelet ψ_{j,ℓ}^m(v) centered at each spatial location u ∈ ℝ³,

$W\rho = \Big\{ \rho * \psi_{j,\ell}^m(u) = \int_{\mathbb{R}^3} \rho(v) \, \psi_{j,\ell}^m(u - v) \, dv \Big\}_{j, \ell, m},$

where the scales go from j = 0, …, J − 1 and the angular frequency bands go from ℓ = 0, …, L − 1, for prescribed values of J and L, respectively. The wavelet coefficients separate the geometry of ρ into different scales j and angular frequency bands ℓ. The additional frequency variable m further subdivides the angular frequency information, while the spatial variable u is necessary due to the localization of the waveform and encodes the response of ρ against ψ_{j,ℓ}^m at this location.

For an electronic density $\rho_x(u) = \sum_k \gamma_k \, g(u - r_k)$,

$\rho_x * \psi_{j,\ell}^m(u) = \sum_k \gamma_k \, g * \psi_{j,\ell}^m(u - r_k).$

Since g ∗ ψ_{j,ℓ}^m is another (wider) Gaussian-type orbital, we may view these coefficients as emitting the same GTO from each atom, scaled by the electronic charge γ_k. These coefficients encode interferences of the solid harmonic waveforms at different scales. For small scales, these interferences resemble those found in the computation of electronic orbitals in computational chemistry software. Solid harmonic wavelet coefficients resulting from larger-scale waveforms aggregate subgroups of atoms within the system and encode macroscopic geometric patterns. While these coefficients do not have direct analogues in classical density estimation procedures, they are required for deriving invariant coefficients that can account for long-range interactions within the system.

Although the solid harmonic wavelet transform {ρ ∗ ψ_{j,ℓ}^m(u)}_{j,ℓ,m} forms a complete representation of ρ, it is not suitable as input to a regression for global chemical properties. Indeed, a translation of ρ [obtained by shifting the spatial input variable u, τ_vρ(u) = ρ(u − v)] results in a translation of the wavelet coefficients. Rotations of ρ similarly rotate the variable u, but also redistribute the wavelet coefficients across the m angular frequencies within each angular frequency band ℓ. Using these coefficients for regression would require the regression to learn invariance properties that are already known, entailing a very high sample complexity. We address this point by computing coefficients with the same invariances as the target model. In order to obtain coefficients invariant to rotations of the state x, we first aggregate the energy of the solid harmonic wavelet coefficients across m,

$U[j, \ell]\rho(u) = \Big( \sum_{m=-\ell}^{\ell} |\rho * \psi_{j,\ell}^m(u)|^2 \Big)^{1/2}.$

The value U[j, ℓ]ρ is computed with a Euclidean norm that we shall call a modulus operator. It eliminates the rotational phase subspace information from the wavelet coefficients, and one can prove that the resulting coefficients are covariant to both translations and rotations (meaning that a translation or rotation of ρ results in the same translation or rotation of U[j, ℓ]ρ). See Fig. 4 for the plots of U[j, ℓ]ρ_x.
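A sketch of this modulus, computed with FFT-based convolutions on the grid of the previous sketches (circular boundary conditions are assumed, with a box large enough that the densities vanish near its edges).

```python
import numpy as np

def wavelet_modulus(rho, grid, j, ell, dV):
    """U[j, ell] rho(u) = ( sum_m |rho * psi_{j,ell}^m(u)|^2 )^{1/2}."""
    rho_hat = np.fft.fftn(rho)
    energy = np.zeros(rho.shape)
    for m in range(-ell, ell + 1):
        psi = solid_harmonic_wavelet(grid, j, ell, m)        # filter from the previous sketch
        psi_hat = np.fft.fftn(np.fft.ifftshift(psi))         # center the filter at index 0
        conv = np.fft.ifftn(rho_hat * psi_hat) * dV          # discretized convolution integral
        energy += np.abs(conv) ** 2
    return np.sqrt(energy)

U_valence_11 = wavelet_modulus(rho_valence, grid, j=1, ell=1, dV=dV)
```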

FIG. 4.

Solid harmonic wavelet modulus coefficients U[j, ℓ]ρ_x^valence of a planar molecule from the QM9 dataset, for ℓ = 1, 2, 3 oscillations and five scales. Depending on angular frequency and scale, different interference patterns arise, which capture complementary aspects of the molecule.


It follows from this covariance property that rigid motion invariant coefficients may be obtained by integrating over u,

$S\rho[j, \ell, q] = \int_{\mathbb{R}^3} |U[j, \ell]\rho(u)|^q \, du = \int_{\mathbb{R}^3} \Big( \sum_{m=-\ell}^{\ell} |\rho * \psi_{j,\ell}^m(u)|^2 \Big)^{q/2} du.$

We refer to Sρ[j, ℓ, q] as first-order solid harmonic wavelet scattering coefficients. Each value Sρ[j, ℓ, q] collapses U[j, ℓ]ρ (illustrated by a single box in Fig. 4) into a single coefficient. They give invariant descriptions of the state x that are stable to deformations of the atomic positions. Furthermore, they separate the scales of the system by aggregating geometric patterns related to the arrangement of the atoms and bonds at each scale.

The power q = 1 scales linearly with the number of electronic particles represented by the density ρ, while q = 2 encodes pairwise interactions, which are related to electrostatic Coulomb interactions in Ref. 26. Specifically, the wavelet modulus at ℓ = 1, U[j, 1]ρ, computes the gradient magnitude of the density at scale j. When integrated over space with q = 1, this results in the total variation of the scaled density, a quantity well known in functional analysis. It measures the lengths of density level sets with emphasis on inflection lines. At a sufficiently large scale, it measures the perimeter of the molecule (see the first line of Fig. 4). Powers smaller than 1 provide sparsity information: a small exponent q gives more importance to small non-zero values, essentially counting them. Large exponents are sensitive to peaks of large amplitudes.
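Building on the modulus sketch above, the first-order invariants are obtained by integrating the moduli raised to the powers q. This is a sketch only; the scales, bands, and exponents shown are the illustrative settings listed later in Sec. V A.

```python
import numpy as np

def first_order_scattering(rho, grid, J, ells, qs, dV):
    """S rho[j, ell, q] = integral of U[j, ell] rho(u)^q over the grid."""
    coefficients = {}
    for j in range(J):
        for ell in ells:
            U = wavelet_modulus(rho, grid, j, ell, dV)   # covariant modulus from the sketch above
            for q in qs:
                coefficients[(j, ell, q)] = np.sum(U ** q) * dV
    return coefficients

S1_valence = first_order_scattering(rho_valence, grid, J=4, ells=(1, 2, 3),
                                    qs=(0.5, 1, 2, 3, 4), dV=dV)
```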

First-order solid harmonic scattering coefficients initially encode the state x in a family of Gaussian densities ρ_x and use spherical harmonics to obtain rotational invariants; in this regard, they are similar to the smooth overlap of atomic positions (SOAP) algorithm introduced in Ref. 10 and subsequently used in Refs. 11 and 21. SOAP representations and resulting similarity measures, however, are computed on localized atom-centered molecular fragments at a fixed scale, which are then combined to form global similarity measures between different molecules.21 First-order scattering coefficients Sρ_x[j, ℓ, q] are global integrated descriptions of the state x, which are computed at each scale 2^j.

First-order scattering coefficients separate the scales of the system x. They encode both micro- and macroscopic invariant geometric descriptions of x, corresponding to short- and long-range interactions. However, the chemical and physical properties of complex quantum systems with many particles are the result of not only interactions at each scale but also the nonlinear mixing of scales (as in London dispersion forces32). We thus complement first-order scattering coefficients with second-order scattering coefficients that combine information across scales within the system. If a certain sub-molecule property becomes visible to a solid harmonic scattering modulus at a given scale, then interactions between these sub-molecule units can be captured by analyzing this modulus at a larger scale.

Solid harmonic wavelet modulus coefficients U[j, ℓ]ρ are decomposed with a second wavelet modulus transform, with wavelets ψ_{j′,ℓ}^m that are larger than the first wavelet,

$U[j, j', \ell]\rho(u) = \Big( \sum_{m=-\ell}^{\ell} |U[j, \ell]\rho * \psi_{j',\ell}^m(u)|^2 \Big)^{1/2}, \qquad j < j'.$

The resulting coefficients mix relatively small-scale geometries of ρ across the larger scale 2^{j′}, in a distinctly different fashion than, for example, computing the product Sρ[j, ℓ, q] Sρ[j′, ℓ, q]. Indeed, wavelet modulus coefficients U[j, ℓ]ρ are covariant to rotations and translations and thus maintain oriented geometric information within the state x at the scale 2^j. This information is analogous to localized multipole moments, such as local dipoles that arise due to the polarizability of atoms or sub-molecules within the state x. These weak localized interactions within the state x, in aggregate, non-trivially impact the large-scale structure of the system. The second wavelet ψ_{j′,ℓ}^m sets the scale at which this structural impact is measured, and the localized contributions at the smaller scale are aggregated via integration,

$S\rho[j, j', \ell, q] = \int_{\mathbb{R}^3} |U[j, j', \ell]\rho(u)|^q \, du.$

The resulting coefficients Sρ[j, j′, ℓ, q] are second-order solid harmonic wavelet scattering coefficients, which are depicted in Fig. 1. These coefficients are invariant to the rigid motion of ρ, stable to deformations of the state x, and couple two scales within the atomistic system. We conjecture that the resulting terms are qualitatively similar to first-order approximations of van der Waals interactions, which are given by α₁α₂/R⁶, where α_k, k = 1, 2, are the polarizabilities of two local substructures within x that are well separated by a relatively large distance R. A second-order scattering coefficient can potentially capture such a term if the substructures are of size proportional to 2^j and R ∼ 2^{j′}.
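A sketch of these second-order coefficients under the same assumptions, obtained by applying the wavelet modulus of the earlier sketch a second time, at the larger scales 2^{j′}, to each first-order modulus.

```python
import numpy as np

def second_order_scattering(rho, grid, J, ells, qs, dV):
    """S rho[j, j', ell, q] for j < j', built from the first-order moduli."""
    coefficients = {}
    for ell in ells:
        for j in range(J):
            U1 = wavelet_modulus(rho, grid, j, ell, dV)
            for j2 in range(j + 1, J):
                # Second wavelet modulus transform of U[j, ell] rho at the larger scale 2^{j2}.
                U2 = wavelet_modulus(U1, grid, j2, ell, dV)
                for q in qs:
                    coefficients[(j, j2, ell, q)] = np.sum(U2 ** q) * dV
    return coefficients

S2_valence = second_order_scattering(rho_valence, grid, J=4, ells=(1, 2, 3),
                                     qs=(0.5, 1, 2, 3, 4), dV=dV)
```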

These van der Waals terms, in addition to electrostatic terms and localized energy terms, are utilized in expansions of the total energy of x via perturbation theory. Since we do not have a direct bijective correspondence with such terms, we instead learn and compute such energy expansions via multilinear regressions, which are described in Sec. IV.

While these forces play a subordinate role in the systems we study in Sec. V, we find empirically that the inclusion of the scale interaction terms significantly increases the predictive power of our analysis method. Furthermore, we expect that for systems with significant van der Waals interactions, second-order scattering coefficients will be necessary to achieve reasonable regression errors. The mathematical and empirical study of this hypothesis is the subject of ongoing and future work.

Physical properties of atomistic states are regressed with multilinear combinations of scattering coefficients Sρ_x. A multilinear regression of order r is defined by

$\tilde{f}_r(x) = b + \sum_i \nu_i \prod_{k=1}^{r} \big( \langle S\rho_x, w_i^{(k)} \rangle + c_i^{(k)} \big).$

For r = 1, it gives a linear regression that we write by expanding the inner product

$\tilde{f}(x) = b + \sum_{j,\ell,q} w_{j,\ell,q} \, S\rho_x[j,\ell,q] + \sum_{j'>j,\,\ell,\,q} w_{j,j',\ell,q} \, S\rho_x[j,j',\ell,q], \qquad (1)$

where different scales j and j′, oscillation numbers ℓ, and integration exponents q are separated into different coefficients. For r = 2, this form introduces a non-linearity similar to those found in factored gated autoencoders.33 We optimize the parameters of the multilinear regression by minimizing the quadratic loss over the training data,

$\sum_{i=1}^{n} \big| f(x_i) - \tilde{f}_r(x_i) \big|^2,$

using the Adam algorithm for stochastic gradient descent.34 
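A minimal PyTorch sketch of this optimization is shown below; the number of multilinear terms, the learning rate, and the number of epochs are placeholders, and S_train and f_train stand for the precomputed scattering matrix and target values of the training set.

```python
import torch

class MultilinearRegression(torch.nn.Module):
    """f_r(x) = b + sum_i nu_i prod_{k=1..r} ( <S rho_x, w_i^(k)> + c_i^(k) )."""
    def __init__(self, n_features, n_terms, r):
        super().__init__()
        self.w = torch.nn.Parameter(0.01 * torch.randn(r, n_terms, n_features))
        self.c = torch.nn.Parameter(torch.zeros(r, n_terms))
        self.nu = torch.nn.Parameter(torch.ones(n_terms) / n_terms)
        self.b = torch.nn.Parameter(torch.zeros(1))

    def forward(self, s):                                            # s: (batch, n_features)
        factors = torch.einsum("bf,rtf->brt", s, self.w) + self.c    # inner products plus offsets
        return self.b + (self.nu * factors.prod(dim=1)).sum(dim=1)   # product over the r factors

model = MultilinearRegression(n_features=S_train.shape[1], n_terms=64, r=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(200):
    optimizer.zero_grad()
    loss = ((model(S_train) - f_train) ** 2).sum()   # quadratic loss of the text
    loss.backward()
    optimizer.step()
```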

If f(x) is the total energy of the state x and x is decomposable into a number of simpler subsystems, then perturbation theory analysis expands the energy f(x) into an infinite series of higher order energy terms,

$f(x) = f_0(x) + f_1(x) + f_2(x) + \cdots,$

in which f0 captures the energies of the isolated subsystems, f1 captures their electrostatic interactions, and f2 captures induction and dispersion energies, which result from van der Waals interactions. The linear regression (1) similarly expands the total energy into successively higher order energy terms, but which are defined by wavelet scattering coefficients. As described in Secs. II and III, first-order scattering coefficients for q = 2 likewise encode pairwise Coulombic interactions, while second-order scattering coefficients couple scales within the system, analogous to the first-order approximations of induction and dispersion energies.

We train the regression algorithm to predict energies and other properties of organic molecules with first- and second-order solid harmonic wavelet scattering coefficients. Detailed numerical estimates and comparisons with the state of the art are presented in Sec. V A. In Sec. V B, we interpret scattering embeddings of organic molecules by analyzing sparse linear regressions.

We use our algorithm to approximate molecular energies, i.e., the atomization energy at 0 K U0, the atomization energy at room temperature U, the enthalpy of atomization at room temperature H, and the atomization free energy at room temperature G. We use the values of the QM9 dataset,25 which contains 133 885 different small organic molecules with at most 29 atoms of the elements C, H, O, N, and F, and at most 9 heavy (non-hydrogen) atoms. Their molecular configurations and properties were estimated using B3LYP DFT.

First- and second-order scattering invariants were computed for the core, valence, full, and bonds channels, with J = 4 and ℓ ∈ {1, 2, 3}. The moduli were raised to the powers q ∈ {1/2, 1, 2, 3, 4}.
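As a rough, back-of-the-envelope count of the resulting representation size under these settings (a sketch only; the exact number of invariants depends on implementation choices, such as whether coefficients with equal scales are included):

```python
J, n_ells, n_qs, n_channels = 4, 3, 5, 4           # scales, ell bands, exponents q, density channels
first_order = J * n_ells * n_qs                    # 4 * 3 * 5 = 60 coefficients per channel
second_order = (J * (J - 1) // 2) * n_ells * n_qs  # j < j' pairs: 6 * 3 * 5 = 90 per channel
total = n_channels * (first_order + second_order)  # 4 * 150 = 600 invariant coefficients
```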

Multilinear regressions were trained and evaluated on five random splits of the dataset, for r = 1, i.e., linear regression, and r = 3, trilinear regression. The evaluation criterion for the fits was the mean absolute error, which is the most prevalent error measure in the literature. We use 5-fold cross-validation, separating the data into 107 108 training datapoints and 26 777 test datapoints per fit.

For the atomization energy at room temperature U, trilinear regression achieves close to state-of-the-art error rates on the core, valence, and full densities (0.56 kcal/mol). When incorporating bond information via the bonds channel, the result improves to 0.51 kcal/mol. This is expected since the bonds channel contains relevant information pertaining to the known bond structure, which is not immediately deducible from the number of valence electrons of the constituent atoms.

Notably, the error of a simple linear regression is much lower than that of Coulomb matrices fit with a kernel ridge regression at full sample complexity.

The QM9 dataset contains eight other molecular properties that we are able to predict. However, our technique has been primarily optimized to regress the energy properties which are at the top of Table I. The bottom part provides prediction errors for the dipole moment μ (Debye), the static polarizability α (a₀³, Bohr radius cubed), the ϵHOMO energy (kcal/mol), the ϵLUMO energy (kcal/mol), the ϵgap gap energy (kcal/mol), the electronic spatial extent ⟨R²⟩ (a₀²), the zero point vibrational energy (eV), and the heat capacity at room temperature Cv (cal/mol/K). The results show good prediction performance on all properties of the molecules, relative to their DFT estimated values. Some error values are close to the state of the art, and most are no more than a factor of 2 larger.

TABLE I.

Prediction errors for molecular properties of the QM9 dataset (energies in kcal/mol; units for the other properties are given in the text). From left to right: scattering with a linear regression (L-Scat), scattering with a trilinear regression (T-Scat), neural message passing (NMP), Coulomb matrices (CM), and SchNet.

          L-Scat   T-Scat   NMP      CM      SchNet
U0        1.89     0.50     0.45     2.95    0.31
U         2.4      0.51     0.45     2.99
H         1.9      0.51     0.39     2.99
G         1.87     0.51     0.44     2.97
μ         0.63     0.34     0.030    0.45
α         0.52     0.16     0.092    0.43
ϵHOMO     4.08     1.97     0.99     3.06
ϵLUMO     5.39     1.76     0.87     4.22
ϵgap               2.73     1.60     5.28
⟨R²⟩      6.67     0.41     0.18     3.39
zpve      0.004    0.002    0.0015   0.0048
Cv        0.10     0.049    0.04     0.12

Using linear regression, we checked the impact of omitting one or several of the input channels. While the valence atomic channel accounts for the strongest change in variance explained, both the core and full density channels are required to achieve satisfactory results. The additional accuracy provided by the bonds channel is smaller, but significant. Notably, comparing the linear regression error incorporating all four channels (1.89 kcal/mol) to the four linear regression models in which each channel is left out in turn, we obtain the following error increases: removing the full density channel increases the error by 0.64 kcal/mol, removing the core density by 0.51 kcal/mol, removing the valence channel by 0.73 kcal/mol, and removing the bonds channel by 0.16 kcal/mol. Removing any of the channels thus results in a non-negligible error increase. Future directions include the exploration of molecular dynamics via force derivatives of the energy functional. In that setting, the bonds channel will be omitted, since bonds are then subject to temporal variation and may be broken or created.

Linear regressions from scattering invariants have good prediction properties. We use this fact to compute small, interpretable models with sparse regressions. These sparse regressions are computed with an orthogonal least squares (OLS) pursuit, which is a greedy algorithm that decorrelates regressors. At every iteration, this algorithm26 selects the scattering coefficient that best predicts the target and then orthogonalizes the remaining regressors with respect to that predictor. It gives us a list of coefficients ordered by conditional importance. We apply this procedure to different splits of the dataset, which can yield different coefficient lists if the selection is unstable (e.g., when two features add similar amounts of predictive power, so that the randomness of data splitting can make either one be selected first). Figure 5 (left) shows the average error for the atomization energy U with respect to the number of features selected by OLS.
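A minimal numpy sketch of such a greedy OLS selection is given below (the actual implementation may differ in its normalization and stopping criteria); X_train denotes the matrix of scattering coefficients and y_train the target property over the training set.

```python
import numpy as np

def orthogonal_least_squares(X, y, n_select):
    """Greedy OLS: pick the feature most correlated with the residual,
    then orthogonalize the residual and the remaining features against it."""
    X = X - X.mean(axis=0)                        # center features and target
    y = y - y.mean()
    R, residual, selected = X.copy(), y.copy(), []
    for _ in range(n_select):
        norms = np.linalg.norm(R, axis=0)
        norms[norms < 1e-12] = np.inf             # skip directions already explained
        scores = np.abs(R.T @ residual) / norms
        k = int(np.argmax(scores))
        selected.append(k)
        q = R[:, k] / np.linalg.norm(R[:, k])     # unit vector of the chosen regressor
        residual = residual - q * (q @ residual)  # remove its contribution from the residual
        R = R - np.outer(q, q @ R)                # orthogonalize the remaining features
    return selected

# Example: select the first 6 scattering coefficients for the atomization energy.
top6 = orthogonal_least_squares(X_train, y_train, n_select=6)
```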

FIG. 5.

Left: Mean absolute error of orthogonal least squares regression for property U on a held-out test set as a function of the sparsity level. For every property, the first six coefficients selected are always the same across random splits. Error decreases rapidly for sparse regressions up to 40 coefficients. Sparse regression with 400 coefficients is close to or better than a full linear regression. Right: MAE prediction error for three fixed sparse OLS regressions (6, 40, and 400 coefficients) and a full linear regression as a function of the number of samples. Different numbers of coefficients lead to different bias-variance trade-offs. Smaller numbers of coefficients require few samples to train, but cannot improve given more samples. Larger numbers of coefficients require more samples, but the error decreases further.


We can see clearly that the more coefficients are included in the regression, the further the estimation error decreases. In particular, there is a strong initial drop in error with each new coefficient; with only 40 coefficients, we are already within the range of the Coulomb matrix reference error. The error continues to decrease monotonically, reaching 1.96 kcal/mol at 400 coefficients.

We also analyze how many samples are needed for training: with 6 coefficients, only 100 samples are needed to nearly saturate performance, but the error does not go below 21 kcal/mol; with 40 coefficients, the error asymptote of 4.63 kcal/mol is reached at roughly 1000 samples. With 400 coefficients, the performance is similar to the full linear regression, and the error asymptote of 1.96 kcal/mol is reached at around 10 000 samples.

Given this, we can conclude that our current linear estimators can be trained on very small subsets of QM9. Indeed, our sparse linear models require considerably less data than the scattering trilinear regressions and the current state of the art, namely message passing neural networks and SchNet. Furthermore, while kernel methods with appropriate kernels may be able to decrease the error to 0 given a large or infinite amount of data, their representation sizes increase with the number of samples. Using fixed-size, and especially linear, regressions, as we do, we can identify their capacity limits and choose the relevant predictor for a given data regime.

Another direction of extrapolation relevant to the evaluation of predictions is molecule size. Ideally, a machine learning technique should be able to predict properties of molecules of a different size from those on which it was trained. An estimator with this property could be useful for computing chemical properties of larger-scale structures. Defining molecule size by the number of heavy atoms it contains, we created training sets of molecules with up to 7, up to 8, and up to 9 heavy atoms. We evaluate the predictive accuracy of the linear regression on test sets likewise composed of molecules with up to 7, 8, or 9 heavy atoms. Learning on up to 7 or 8 heavy atoms incurs an error of around 4 kcal/mol on molecules with 9 heavy atoms and 2 kcal/mol on the smaller ones. Learning with up to 9 heavy atoms gives the usual 2 kcal/mol error. It thus seems that scattering coefficients permit scaling to larger molecule sizes while incurring a constant error. For larger datasets to come, this test will permit clearer conclusions. We believe that such molecular extension properties should be included in the evaluation procedure of machine learning algorithms.

As stated above, we select the coefficients with OLS using different splits of the dataset. Notably, the first 6 selected coefficients are always the same, independently of the split of the dataset. The stability in the selection of the same first 6 coefficients invites closer examination. We apply the same OLS selection of 6 coefficients for the internal energy Uint, which is equal to the atomization energy U plus the internal energy of all the atoms that form the molecule. The 6 selected coefficients are shown in Table II.

TABLE II.

Specification of the first six coefficients used to regress the internal energy Uint and the atomization energy U. These six coefficients are stable across 16 resamplings. Note the systematic selection of bonds channels for the estimation of atomization energy.

First six coefficients selected for the atomization energy U (density channel, in order of selection): Bonds (q = 1/2), Bonds (q = 1/2), Bonds (q = 1/2), Core, Valence, Full (q = 1/2).

First six coefficients selected for the internal energy Uint (density channel, in order of selection): Valence, Core, Full, Full, Full, Valence.

Interestingly, we see that the coefficients computed from the bonds density are highly relevant to the atomization energies, but accurate estimation requires information about the nature of the atoms, which is contained in the coefficients computed from the core and full densities. On the other hand, coefficients computed from the full, core, and valence densities are highly relevant to the internal energy Uint of the molecule, which contains the individual free atom energies as well as the bond energies. We also see the importance of the second-order coefficients (those that include a second scale j′), which account for multiscale interactions.

We have presented an algorithm that introduces a representation of molecules that is invariant to three-dimensional rotations and translations. The algorithm relies on a multiscale representation of molecular densities based on solid harmonic wavelets. Solid harmonic wavelets have the same functional form as the Gaussian-type orbitals used in molecular DFT, establishing an interesting qualitative link between the methods. Our method does not optimize the parameters of the solid harmonics, but rather uses them to analyze surrogate densities created to resemble the molecular electronic density. The representation is general enough to be applied to any three-dimensional density dataset and extends to molecules with a larger number of atoms than those present in the training set. We have shown that the representation can achieve near state-of-the-art predictive performance using a nonlinear regression.

Notably, at some cost in predictive performance, we can use sparse linear regressions to add interpretability to our results. We show that specific coefficients are reliably selected by an orthogonal least squares method across random data splits. The selected coefficients contain bond information when the atomization energy, whose primary fluctuations derive from bonding, is fitted, and contain atomic information when the full ground-state energy is fitted, because free-atom contributions then dominate. We thus have a precise handle on the interpretation of the coefficients.

M.E., G.E., and S.M. are supported by ERC Grant InvariantClass No. 320959; M.H. is supported by the Alfred P. Sloan Fellowship (Grant No. FG-2016-6607), the DARPA YFA (Grant No. D16AP00117), and NSF Grant No. 1620216.

Software to reproduce the results is available at http://www.di.ens.fr/data/software.

1. R. J. Bartlett and M. Musiał, “Coupled-cluster theory in quantum chemistry,” Rev. Mod. Phys. 79(1), 291–352 (2007).
2. P. Hohenberg and W. Kohn, “Inhomogeneous electron gas,” Phys. Rev. 136(3B), B864–B871 (1964).
3. W. Kohn and L. J. Sham, “Self-consistent equations including exchange and correlation effects,” Phys. Rev. 140, A1133–A1138 (1965).
4. B. G. Sumpter and D. W. Noid, “Neural networks and graph theory as computational tools for predicting polymer properties,” Macromol. Theory Simul. 3(2), 363–378 (1994).
5. S. Manzhos, X. Wang, R. Dawes, and T. Carrington, “A nested molecule-independent neural network approach for high-quality potential fits,” J. Phys. Chem. A 110(16), 5295–5304 (2006).
6. J. Behler and M. Parrinello, “Generalized neural-network representation of high-dimensional potential-energy surfaces,” Phys. Rev. Lett. 98(14), 146401 (2007).
7. J. Behler, “Atom-centered symmetry functions for constructing high-dimensional neural network potentials,” J. Chem. Phys. 134(7), 074106 (2011).
8. A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, “Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons,” Phys. Rev. Lett. 104(13), 136403 (2010).
9. A. Sadeghi, S. A. Ghasemi, B. Schaefer, S. Mohr, M. A. Lill, and S. Goedecker, “Metrics for measuring distances in configuration spaces,” J. Chem. Phys. 139(18), 184118 (2013).
10. A. P. Bartók, R. Kondor, and G. Csányi, “On representing chemical environments,” Phys. Rev. B 87(18), 184115 (2013).
11. W. J. Szlachta, A. P. Bartók, and G. Csányi, “Accuracy and transferability of Gaussian approximation potential models for tungsten,” Phys. Rev. B 90(10), 104108 (2014).
12. Z. Li, J. R. Kermode, and A. D. Vita, “Molecular dynamics with on-the-fly machine learning of quantum-mechanical forces,” Phys. Rev. Lett. 114, 096405 (2015).
13. L. Zhu, M. Amsler, T. Fuhrer, B. Schaefer, S. Faraji, S. Rostami, S. A. Ghasemi, A. Sadeghi, M. Grauzinyte, C. Wolverton, and S. Goedecker, “A fingerprint based metric for measuring similarities of crystalline structures,” J. Chem. Phys. 144(3), 034203 (2016).
14. A. Shapeev, “Moment tensor potentials: A class of systematically improvable interatomic potentials,” Multiscale Model. Simul. 14(3), 1153–1173 (2016).
15. S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, and K.-R. Müller, “Machine learning of accurate energy-conserving molecular force fields,” Sci. Adv. 3(5), e1603015 (2017).
16. M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, “Fast and accurate modeling of molecular atomization energies with machine learning,” Phys. Rev. Lett. 108(5), 058301 (2012).
17. K. Hansen, G. Montavon, F. Biegler, S. Fazli, M. Rupp, M. Scheffler, O. A. von Lilienfeld, A. Tkatchenko, and K.-R. Müller, “Assessment and validation of machine learning methods for predicting molecular atomization energies,” J. Chem. Theory Comput. 9(8), 3404–3419 (2013).
18. G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, “Machine learning of molecular electronic properties in chemical compound space,” New J. Phys. 15, 095003 (2013).
19. R. Ramakrishnan, M. Hartmann, E. Tapavicza, and O. A. von Lilienfeld, “Electronic spectra from TDDFT and machine learning in chemical space,” J. Chem. Phys. 143, 084111 (2015).
20. K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, “Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space,” J. Phys. Chem. Lett. 6, 2326–2331 (2015).
21. S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, “Comparing molecules and solids across structural and alchemical space,” Phys. Chem. Chem. Phys. 18(20), 13754–13769 (2016).
22. K. T. Schütt, F. Arbabzadah, S. Chmiela, K.-R. Müller, and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,” Nat. Commun. 8, 13890 (2017); e-print arXiv:1609.08259.
23. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017.
24. K. T. Schütt, P.-J. Kindermans, H. E. Sauceda, S. Chmiela, A. Tkatchenko, and K.-R. Müller, “SchNet: A continuous-filter convolutional neural network for modeling quantum interactions,” in Advances in Neural Information Processing Systems 30 (NIPS), edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (NIPS, 2017), pp. 992–1002.
25. R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, “Quantum chemistry structures and properties of 134 kilo molecules,” Sci. Data 1, 140022 (2014).
26. M. Hirn, S. Mallat, and N. Poilvert, “Wavelet scattering regression of quantum chemical energies,” Multiscale Model. Simul. 15(2), 827–863 (2017); e-print arXiv:1605.04654.
27. M. Eickenberg, G. Exarchakis, M. Hirn, and S. Mallat, “Solid harmonic wavelet scattering: Predicting quantum molecular energy from invariant descriptors of 3D electronic densities,” in Neural Information Processing Systems (NIPS), 2017.
28. S. Mallat, “Group invariant scattering,” Commun. Pure Appl. Math. 65(10), 1331–1398 (2012).
29. M. Gastegger, L. Schwiedrzik, M. Bittermann, F. Berzsenyi, and P. Marquetand, “WACSF—Weighted atom-centered symmetry functions as descriptors in machine learning potentials,” J. Chem. Phys. 148(24), 241709 (2018); e-print arXiv:1712.05861.
30. M. Reisert and H. Burkhardt, “Harmonic filters for generic feature detection in 3D,” in Pattern Recognition. DAGM 2009, Lecture Notes in Computer Science Vol. 5748 (Springer, Berlin, Heidelberg, 2009).
31. S. F. Boys, “Electronic wave functions. I. A general method of calculation for the stationary states of any molecular system,” Proc. R. Soc. London, Ser. A 200(1063), 542–554 (1950).
32. F. London, “The general theory of molecular forces,” Trans. Faraday Soc. 33, 8–26 (1937).
33. R. Memisevic, “Gradient-based learning of higher-order image features,” in 2011 IEEE International Conference on Computer Vision (ICCV) (IEEE, 2011), pp. 1591–1598.
34. D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference for Learning Representations, San Diego, CA, USA, 2015.