Mapping an atomistic configuration to a symmetrized N-point correlation of a field associated with the atomic positions (e.g., an atomic density) has emerged as an elegant and effective solution to represent structures as the input of machine-learning algorithms. While it has become clear that low-order density correlations do not provide a complete representation of an atomic environment, the exponential increase in the number of possible N-body invariants makes it difficult to design a concise and effective representation. We discuss how to exploit recursion relations between equivariant features of different order (generalizations of N-body invariants that provide a complete representation of the symmetries of improper rotations) to compute high-order terms efficiently. In combination with the automatic selection of the most expressive combination of features at each order, this approach provides a conceptual and practical framework to generate systematically improvable, symmetry adapted representations for atomistic machine learning.
I. INTRODUCTION
Equivariant, atom-centered structural representations have driven the progress of atomistic machine-learning over the past decade.1–15 These representations preserve the transformation rules of the target property with respect to fundamental symmetries such as translation, rotation, inversion, and permutation of identical atoms. Such atomic descriptions have found widespread usage because the incorporation of symmetries, as well as the locality that derives from their atom-centered nature, makes the data-driven regression models more transferable and efficient when learning invariant (or covariant16,17) target properties. Most of these equivariant representations can be seen as projections of the many-body correlation functions of a decorated atom density onto more or less arbitrary choices of bases, and linear regression models based on these features are equivalent to a body-ordered expansion of the target property.9,13,14,18 The expansion is typically truncated at the third or fourth order correlation,3,4 which is problematic because low-order correlations are incomplete, so that one can build atomic environments, which have the same features despite having different structures and properties.19 With increasing body order, however, the number of terms in the projection grows exponentially.
In this Communication, we discuss a recursive construction for equivariant features, which avoids some of the formally (and computationally) daunting expressions that one encounters when writing explicitly the form of high-order invariant features,14,20,21 while greatly simplifying the relations between different body orders. To prevent the exponential increase in the total number of features, we then introduce an N-body iterative contraction of equivariants (NICE) framework and demonstrate it on a simple—yet challenging—dataset of CH4 configurations, as well as on a more realistic benchmark comprising ≈130 000 organic molecules.22
II. THEORY
The notation we use is a refinement of that introduced in Refs. 13 and 23. The bra-ket ⟨I|A⟩ indicates a feature (labeled by I), which describes a structure and its associated properties (labeled by A). Both indices can be expressed in a contracted or expanded form, depending on the level of detail that is needed for a given manipulation. We start by defining (ν + 1)-body equivariants as averages of the atom-centered density |ρi⟩ over improper rotations,
where ν indicates the number of densities included in the correlation function and is the Wigner matrix associated with the rotation . The ket indicates that for each feature ⟨I|, we must compute a set of 2λ + 1 entries, labeled by μ, that transform under rotations as the spherical harmonic of order λ.17 The label σ indicates parity with respect to inversion î; σ = 1 features behave as scalars/tensors, and σ = −1 features behave as pseudoscalars/pseudotensors. The expansion in radial functions and spherical harmonics of the atom density centered on atom i can be written as,
where g is a localized function (possibly a Dirac δ, a limit for which this formulation yields the atomic cluster expansion14 as a special case), and rji = rj − ri is the distance vector between atoms i and j. The construction is easily extended to multiple atomic species, as well as to other attributes of the atoms, by computing a tensor product between the atom density and these additional quantities.5 The indices associated with the atomic nature can be gathered together with the radial index n—so all of the developments in this work apply equally well to the more general case by considering n as a compound index.
A. Recursive construction of equivariant features
As discussed in the supplementary material, Eq. (1) is somewhat redundant: L is bound to be equal to λ, M can be fixed to any value and only introduces some inconsequential phases, and there are constraints on the values of mi—corresponding to the loss of degrees of freedom associated with the covariant integration. For instance, ∑jmj = −M and = −μ. To obtain a more transparent and concise enumeration of the N-body equivariants, we devise a labeling that makes it simpler to identify those that are linearly independent and a recursion relation to build them efficiently. We start by defining the ν = 1 equivariants
Given that many indices are redundant and that all ν = 1 terms behave with σ = 1 parity, we also introduce the shorthand notation . Higher-order terms can be obtained using an iterative formula modeled after the addition of angular momenta,
where ⟨lm; kq|λμ⟩ is a Clebsch–Gordan coefficient and . The index σ tracks the parity of the equivariants, ensuring that terms with an even λ + ∑jlj are associated with σ = +1 and those with an odd sum with σ = −1. The mapping between the integral form (1) and the recursive form (4) is derived in the supplementary material.24
1. Linearly independent covariants
This construction simplifies the determination of the features that are linearly independent: well-established angular-momentum theory results25 show that one only needs to consider terms where the l indices are sorted in ascending order; if the density is expanded up to an angular momentum cutoff lmax, this implies l1 ≤ l2 ≤ l3 ⋯ ≤ lν ≤ lmax. When two l indices are equal, the covariants are symmetric with respect to an exchange of the corresponding n indices. Thus, although, in general, the n indices need not be sorted, whenever lν = lν+1, the terms with nν+1 < nν can be discarded. Further terms can be found that are linearly dependent or evaluate to zero, some classes of which are discussed in the supplementary material. Except when ν becomes comparable to the numbers of radial and angular momentum channels included in the density expansion, these constitute a small fraction of the equivariants, as can be seen by enumerating the independent invariants with computer algebra.26 While angular-momentum theory result can reduce substantially the number of independent terms, they do not avoid the exponential scaling of the feature size with . As we discuss in the next section, a data-driven contraction step can reduce the complexity of the model much more substantially.
2. Polynomially independent invariants
The case of invariant features is particularly important as they provide a basis to expand properties such as the potential energy that are left unchanged by rigid rotations and inversion. Equation (4) shows clearly how they can also be obtained efficiently by keeping track of all the equivariants of lower-order to compute
an expression that neatly encompasses the well-known formulas for the SOAP power spectrum and bispectrum.3 In the case of invariant features, in addition to the rules that identify linearly independent terms based on angular-momentum theory, it is also relevant whether higher-order terms can be written as polynomials of lower-order invariants. This is because non-linear regression schemes (kernel methods, polynomial regression, or neural networks) produce arbitrary polynomial combinations of the low-order invariants. Thus, terms that cannot be written as polynomials of lower invariants are likely to be more informative and more worthy of being retained for use in non-linear regression. As shown in the supplementary material, when one of the intermediate couplings k is zero, the higher-body order term becomes a polynomial of two lower-order terms as
B. Iterative construction of contracted equivariants
Even though Eq. (4) makes it possible to compute a given equivariant feature with a cost that scales only linearly with the body order, and even though linear and polynomial relationships between features allow one to discard several terms, the number of independent features scales exponentially with ν, making it impractical to enumerate all the equivariants. Selecting the most important terms for a given application is therefore crucial to obtain a viable scheme based on high-order features. Several of such schemes have been proposed,4,27–32 either based on selecting the most relevant features from a large pool of candidates (that still requires computing all of them at least in a preliminary phase), by selecting a subset based on heuristic arguments, or by designing a neural network architecture that incorporates SO(3) combination rules33–35 analogous to (4). We propose a strategy to generate a set of high-order equivariants based on a combination of an iteration rule and a simple selection scheme based on principal component analysis, which serves as a proof of principle for more sophisticated schemes based on linear or non-linear feature extraction (Fig. 1).
A schematic representation of the NICE framework. A hierarchy of N-body equivariant features is built by iterative combination with the atom density coefficients, and the exponential increase in feature space size is kept at bay by successive contractions.
A schematic representation of the NICE framework. A hierarchy of N-body equivariant features is built by iterative combination with the atom density coefficients, and the exponential increase in feature space size is kept at bay by successive contractions.
1. Feature selection/contraction
Assume that a pool of O(3) equivariant features has been computed for a given order ν. We disregard the internal structure of the featurization and simply tag these features as . We look for a contraction that extracts the largest amount of (linearly) independent information. The most straightforward approach is to compute correlation matrices
where the sum runs over all structures A and environments i in a reference dataset. This matrix can then be diagonalized as Cν;σλ = U diag(v)UT, and the most significant features extracted as
For clarity of exposition, we consider a correlation matrix built exclusively on terms of order ν, but it would clearly be possible to combine all of the features that have been retained up to the current iteration.
2. Body order iteration
The contracted features can be combined with the density coefficients, using Eq. (4), to build a set of (ν + 1)-order equivariants:
Pooling together all of the feature indices when constructing the correlation matrix mixes the nlk channels, which makes it impossible to keep track of the trivial linear dependencies between angular momentum combinations. They are however identified automatically by the contraction step, together with non-trivial correlations between the features, and are immediately discarded. Another important consideration is that for a given angular cutoff of the density expansion, each application of (9) doubles the maximum possible value of λ. To prevent an exponential increase in the number of covariant components, we cutoff λ to the same lmax used for the density expansion.
We also want to stress that this scheme is just one of the many conceivable combinations of a body-order recursion and feature-selection steps. Completely disregarding the physical significance of the nlk indices, one can generate high-order features by combining lower-order equivariants using the sum rules for angular momenta, similar to what is done in covariant neural networks.33 One can also introduce, at each level, an arbitrary non-linear function of the invariant terms of lower-order, as suggested in Ref. 13, yielding an expression of the form
We name the family of schemes that builds body-order equivariants or invariants using a sequence of angular-momentum iterations and feature selections the N-body iterative contraction of equivariants (NICE) framework.
III. RESULTS
We demonstrate the construction of NICE features on a subset of the 3 × 106 quasi-random configurations CH4 that has recently been introduced in Ref. 19. The high number of structures and their random nature reveals the behavior of N-body invariants in a way that is less biased by the nature of the reference dataset. In a way, this dataset represents a worst-case scenario for the iterative contraction since it does not benefit from a reduction in the intrinsic dimensionality associated with the distribution of input configurations. To provide a direct comparison with the existing approaches, we also present the results for the atomization energy of the ≈130 000 organic compounds that are part of the QM9 dataset.22,36
A. Correlations between N-body features
We begin by showing how the contraction process behaves during successive body order iterations. We consider C-centered features, so each sample in the dataset is associated with a single environment. In implementing the NICE scheme, we apply two additional optimizations. First, we keep track of the eigenvalues vν;σλ associated with the principal components computed at each step, and we use them to “screen” the body order iteration (9), computing only the terms for which is greater than a set threshold. Second, after each body-order iteration and before computing the contraction, we project out the components of the new features that can be expressed as a linear combination of lower-order equivariants.
Figure 2 shows the eigenvalues of Cν;σλ at different stages of the procedure. The atom-centered density is expanded on a basis of nmax = 10 Gaussian-type radial functions, including angular momentum channels up to lmax = 10. Even though we consider a separate density for C and H atoms, there are only ≈10 independent ν = 1 equivariants, reflecting the fact that the density contribution corresponding to carbon is identical for every C centered environment and thus irrelevant. For ν = 2 equivariants (corresponding to the most common implementation of λ-SOAP17), the correlation spectrum decays very rapidly, which is consistent with the observation that the power spectrum can be truncated very aggressively with little loss of regression performance.27 Higher-ν spectra decay more slowly. Nevertheless, at each body order, a few hundreds of features represent 99% of the dataset variance. The decay of the covariance underpins the viability of the NICE framework and is in striking contrast with the expected exponential scaling of the number of linearly independent terms: for the angular and radial cutoff we use, there would be more than 66 000 invariants just for ν = 3. Figure 2 also reflects the importance of contracting separately equivariants of different parity. For ν = 2, there is no pseudoscalar component, and all of the pseudotensor features decay faster than the corresponding tensorial equivariant. Even though in this proof-of-principle work, we ignore non-asymptotic optimizations, exploiting the different behavior of low-order features of different parities can provide a noticeable reduction in memory and computational requirements.
Eigenvalues of the correlation matrix between NICE features of order ν for 2000 C-centered environments extracted from the random CH4 dataset. Increasing the number of environments does not change significantly the eigenvalue spectra. At each NICE iteration, we save 400 most important invariant features and retain 150 sets of contracted equivariants that are combined with in the next iteration. Full lines correspond to polar tensor terms ( =1) and the dashed lines to the pseudotensor terms ( = −1).
Eigenvalues of the correlation matrix between NICE features of order ν for 2000 C-centered environments extracted from the random CH4 dataset. Increasing the number of environments does not change significantly the eigenvalue spectra. At each NICE iteration, we save 400 most important invariant features and retain 150 sets of contracted equivariants that are combined with in the next iteration. Full lines correspond to polar tensor terms ( =1) and the dashed lines to the pseudotensor terms ( = −1).
B. Regression performance
Figure 3 shows learning curves for linear NICE models based on increasingly high body-order features and highly converged parameters of the density expansion. In the case of CH4 structures, C-centered features of order ν = 4 provide, in principle, a complete linear basis to describe the interatomic potential. In practice, however, the learning curves of linear models saturate at a relatively small train set size. Including terms of increasing ν delays the saturation and improves the asymptotic accuracy, but the improvement becomes less dramatic with increasing body order. The bottom panel of Fig. 3 shows that indeed the slower decay of the principle component analysis (PCA) spectrum is reflected in a slow convergence of the error with the number of PCA components. It is possible to systematically improve a linear NICE model by ramping up nPCA, nmax, lmax even further (see supplementary material for a discussion of convergence with body order and number of invariants), but one should contrast this with the use of more flexible models of the target property. For instance, a ≈50% drop in error can be achieved by simultaneously using features centered on C and H atoms. Incorporating H centered information helps, for instance, achieve an accurate description of the binding of two H atoms in the region far from the carbon, which would otherwise require extremely high resolutions for the density expansion. For ntrain < 105 the resulting model outperforms both linear and non-linear models based on SOAP and bispectrum features (see supplementary material). Another way to increase the flexibility of the model involves using a non-linear functional form. Indeed, Fig. 3 shows that a NN model based on NICE invariant features avoids saturation altogether, and outperforms all linear models in the data-rich limit.
Learning curves for the formation energy of CH4 structures using linear models based on NICE features truncated to increasing body order (rcut = 6 Å, nmax = 10, and lmax = 10, up to 3200 invariants retained at each body order), a NN model using NICE features up to ν = 4. Errors are expressed both in absolute terms and as a percentage of the standard deviation of the dataset. The top panel shows the results for models using only C-centered features; and the middle panel uses features centered on both C and H. The bottom panel shows the convergence of linear models trained on 100 K CH4 structures, using only C-centered NICE features and including increasingly large numbers of PCA components – i.e., a ν = 3, nPCA = 100 model contains all NICE features with ν = 1, 2 and the top 100 ν = 3 features.
Learning curves for the formation energy of CH4 structures using linear models based on NICE features truncated to increasing body order (rcut = 6 Å, nmax = 10, and lmax = 10, up to 3200 invariants retained at each body order), a NN model using NICE features up to ν = 4. Errors are expressed both in absolute terms and as a percentage of the standard deviation of the dataset. The top panel shows the results for models using only C-centered features; and the middle panel uses features centered on both C and H. The bottom panel shows the convergence of linear models trained on 100 K CH4 structures, using only C-centered NICE features and including increasingly large numbers of PCA components – i.e., a ν = 3, nPCA = 100 model contains all NICE features with ν = 1, 2 and the top 100 ν = 3 features.
C. Performance comparison on the QM9 dataset
A thorough comparison of the performance of a NICE-based model with existing frameworks would require to benchmark many schemes that can be used to perform the contraction and the non-linear models that can be built on top of NICE features, however this is beyond the scope of this communication. In order to provide a preliminary assessment of the potential of this framework in building ML models of more complicated systems we use the QM9 dataset,22 that has been used as the go-to benchmark for chemical machine learning37 and that allows us to compare with kernel models based on SOAP,23,38 SLATM and aSLATM,39 FCHL18,10 and FCHL1940 representations, as well as the neural networks PhysNet41 and SchNet.42 Given that QM9 molecules are composed of HCNOF, this dataset also allows us to demonstrate how easily NICE can be applied to multi-component systems. As shown in Fig. 4, the quality of NICE features is such that – using a simple linear model, and without specifically optimizing the hyperparameters of the representation – they are competitive with state-of-the-art models, including those (such as FCHL1810 and the alchemically optimized SOAP model of Ref. 23) that have been specifically tuned for QM9 energy predictions. The saturation of the learning curve indicates that a non-linear model, or further increase of the number of contracted features that are retained at each iteration, would be required to reduce the error below 0.3 kcal/mol.
Learning curves for the formation energy of molecules in the QM9 dataset. The NICE curve corresponds to a linear model with the same density expansion details as in Ref. 23, nmax = 12, lmax = 7, and 1000 invariants retained up to order ν = 5.
Learning curves for the formation energy of molecules in the QM9 dataset. The NICE curve corresponds to a linear model with the same density expansion details as in Ref. 23, nmax = 12, lmax = 7, and 1000 invariants retained up to order ν = 5.
IV. CONCLUSIONS
The description of atomic structures in terms of features that can be construed as symmetrized N-point correlation functions of the atom density has proven to be a very successful approach to construct accurate and transferable machine-learning models of atomic-scale properties. A formulation of these representations in terms of a recursion for equivariant features simplifies the calculation of high body-order terms, and can be combined with a contraction step—which we demonstrate in its simplest form using principal component analysis—to keep the exponential increase in complexity at bay. Even though the N-body iterative contraction of equivariants provides a practical approach to construct a systematically improvable linear basis to model atomic-scale properties, whether doing so constitutes the most robust and computationally efficient approach to atomistic machine-learning remains an open research problem.43 In the data-rich regime, NN models could provide a computationally more efficient approach to increase the flexibility of the model than increasing the complexity of a linear model.
Systematic benchmarking on more diverse datasets, the incorporation of a contraction step informed by supervised-learning criteria,44,45 the use of sparse feature selection methods,27,30 the combination with N-point correlation features designed to treat long-range physics,46 and ultimately, the comparison with kernel methods and more general non-linear regression strategies33 are just some of the many lines of investigation that can be pursued based on the NICE framework to increase the accuracy and reduce the computational effort of atomistic machine learning.
SUPPLEMENTARY MATERIAL
The supplementary material includes a detailed derivation of the transformation of the symmetrized features onto the contracted basis and of all the analytical results discussed in the main text, as well as further demonstrations of the convergence of the contraction with number of PCA features, and the comparison with linear and non-linear regression schemes.
AUTHORS’ CONTRIBUTIONS
J.N. and S.P. contributed equally to this work.
DATA AVAILABILITY
The random methane dataset is freely available in the Materials Cloud Archive, DOI: 10.24435/materialscloud:s6-nq. An implementation of the NICE evaluation library, and a list of independent invariants evaluated by CAS are made available from Ref. 26
ACKNOWLEDGMENTS
We thank Christoph Ortner, Gábor Csányi, and Geneviève Dusson for insightful discussion on the independence of invariant features, and Anders Christensen for sharing up-to-date benchmark data for the QM9 dataset. M.C. and S.P. acknowledge support from the Swiss National Science Foundation (Project No. 200021-182057). J.N. was supported by a MARVEL INSPIRE Potentials Master’s Fellowship. MARVEL is a National Center of Competence in Research funded by the Swiss National Science Foundation.
REFERENCES
Given that k1 and k2 have to be equal to l1, we suggest to use a compact notation , similar to the one for the = 1 term in Eq. (3).