In this paper, we address the challenge of obtaining a comprehensive and symmetric representation of point particle groups, such as atoms in a molecule, which is crucial in physics and theoretical chemistry. The problem has become even more important with the widespread adoption of machine-learning techniques in science, as it underpins the capacity of models to accurately reproduce physical relationships while being consistent with fundamental symmetries and conservation laws. However, some of the descriptors that are commonly used to represent point clouds— notably those based on discretized correlations of the neighbor density that power most of the existing ML models of matter at the atomic scale—are unable to distinguish between special arrangements of particles in three dimensions. This makes it impossible to machine learn their properties. Atom-density correlations are provably complete in the limit in which they simultaneously describe the mutual relationship between all atoms, which is impractical. We present a novel approach to construct descriptors of finite correlations based on the relative arrangement of particle triplets, which can be employed to create symmetry-adapted models with universal approximation capabilities, and have the resolution of the neighbor discretization as the sole convergence parameter. Our strategy is demonstrated on a class of atomic arrangements that are specifically built to defy a broad class of conventional symmetric descriptors, showing its potential for addressing their limitations.
The construction of data-driven models of physics starts by defining a suitable mathematical representation of the problem, which is then used as the input to different types of model architectures.1 In many cases, this involves describing the spatial arrangement of point particles2,3—e.g., atoms in the construction of microscopic models of matter—in a way that is consistent with the fundamental physical symmetries governing the relationship between a structure and its properties.4,5 It would be desirable to determine a complete set of invariants, i.e., a set of geometric descriptors that are equal for two configurations if and only if these configurations are the same, modulo translations, rotations, inversion, and permutations of equivalent atoms (see also Sec. 5 of Ref. 6 for a discussion of different definitions of completeness). Unfortunately, it has been recently shown that many popular structural representations are incomplete,7,8 leading to the outright impossibility of predicting different values for the properties of structures even though they are unrelated by symmetry and more broadly to numerical instabilities that extend to nearby configurations.9,10
Even though the significance of this problem for data-driven atomistic modeling has been recognized only recently, the problem of describing an arrangement of particles based on limited and/or symmetrized information is a long-standing one.11 This issue is the subject of a substantial body of literature, touching upon open problems in invariant theory12 and the Euclidean distance geometry problem,13–15 with applications to fields as diverse as signal processing and the determination of protein structure by nuclear magnetic resonance (NMR). The broad connection to geometric deep learning and graph convolution,2 as well as to the construction of equivariant physical models,5 is also well-understood. In the application domain that we focus on here, symmetry has been widely used since the early efforts to construct machine-learning interatomic potentials,16–18 and it also underlies efforts geared toward recognizing the (dis)similarity between structures in simulations and databases.19,20
A systematic study of the representations used to characterize atomic structures6,21 has revealed that many of these frameworks rely on the same physics-inspired descriptors to characterize groups/clusters of atoms relative to a central one: discretized versions of structural correlation functions of the neighbor density with increasing body order22—which are equivalent to the unordered list of interatomic distances, angles, and dihedrals, relative to the central atom. While it was shown that keeping all orders of correlations provides a complete, symmetry-adapted linear basis to expand structure–property relations,23,24 the low-order versions—which are most commonly adopted, in practice—cannot distinguish groups of degenerate configurations.7 Similar problems affect other machine-learning architectures, such as distance-based graph convolutional networks,8 and even equivariant neural networks,4,25 which are, in theory, universal approximators, but rely on constructing a hierarchy of symmetric polynomials that are closely related to the density-correlation features discussed above.26,27 Completeness, in this context, is only guaranteed in the limit of infinitely deep networks, which makes it harder to allocate computational effort between different architecture parameters, such as network depth and basis resolution.24 In this Letter, we discuss a construction that relies only on correlations between a finite number of neighbors to define invariant descriptors of a local atomic environment, which are complete in the sense that they can be used to build simple universal approximators for atomic properties and to differentiate between arbitrary configurations while scaling favorably with respect to system size and allowing for efficient implementations.
To appreciate the significance of this result without delving into the theory of density-based representations (interested readers may refer to the discussion in Ref. 6 or to the summary in the supplementary material), it is useful to consider the concrete case of two-neighbor representations [Fig. 1(a)]. To evaluate a property y(A) of an atomic structure A, it is convenient to express it as a sum of atom-centered contributions y(Ai), which in turn can be expanded into body-ordered terms y(ν)(Ai) simultaneously depending on the relative position of ν of the neighbors of the central atom i. In line with previous literature, we indicate terms associated with correlations between ν neighbors as having body order (ν + 1). In the case of a scalar target (e.g., the interatomic potential), the two-neighbor term can be expressed as a sum of a function of two distances and one angle over all neighbors pairs. Alternatively, and equivalently,21,23,24,28 we can expand the histogram of (r, r′, θ) on a discrete basis ⟨q| with coefficients and approximate y(2)(Ai) as a linear combination of these coefficients, . Many widespread descriptors from atom-centered symmetry functions29 to the SOAP (smooth overlap of atomic positions) power spectrum30 can be written in one of these forms. The resolution of the basis is a convergence parameter that, in practical applications24,31,32 and, in particular, when using the descriptors as inputs of a non-linear model,33 can be usually converged with ease.
(a) Representations of an atomic environment Ai in terms of neighbor-atom pairs or symmetrized three-point correlations. The two schemes are equivalent in the complete basis set limit and generalize to higher-order representations. (b) A pair of environments that are degenerate to ν = 1 (list-of-distances) representations.
(a) Representations of an atomic environment Ai in terms of neighbor-atom pairs or symmetrized three-point correlations. The two schemes are equivalent in the complete basis set limit and generalize to higher-order representations. (b) A pair of environments that are degenerate to ν = 1 (list-of-distances) representations.
Higher-order terms can be obtained by computing (ν + 1)-body symmetric density correlations, which we indicate with the shorthand . This body-ordered expansion of y(A) converges systematically with ν but is rather cumbersome. Instead, many frameworks rely on low-ν descriptors and use them as inputs to artificial neural networks.16 The non-linear functional form generates some, but not all, the terms with higher-body order,26,34 leading to more flexible models and making more effective use of the features restricted to a finite-body order ν.
Finite body-order descriptors, however, have a fundamental limitation: low-ν invariants are not complete in a way that cannot be remedied by non-linear transformations. For example, if one uses only interatomic distances , it is easy to realize multiple structures with neighbors placed in arbitrary directions but the same list of distances {rji} [Fig. 1(b)]. All these structures will have identical features and, therefore, also the same output for any non-linear model built on them. Less trivial constructions have been found that lead to degenerate pairs of configurations for triangle-histogram features (the power spectrum ) as well as for tetrahedra-histogram features (the bispectrum ).7 Counterexamples for higher ν are not known, but there is no proof of completeness either: for a general function, no finite order ν of rotationally symmetric correlations suffices for a complete reconstruction.35 The existence of degeneracies means that, for finite ν, there are pairs of distinct structures for which it is impossible to build an ML model that predicts the differences in their properties, regardless of the resolution of the descriptors. It also determines numerical instabilities in their vicinity, and therefore, models built on these incomplete features can suffer from degraded accuracy and transferability.10,36,37
(a) 3-center-1-neighbor features describe the position of atoms relative to two tagged atoms around the center i. This is equivalent to a list of tetrahedra sharing the ii1i2 triangle. (b) Tagging two atoms defines a local orthogonal coordinate system—except when the tagged atoms are collinear, and the azimuthal directions are ill-defined. (c) An encoder–decoder architecture can be used to determine non-linear environment features that contain enough information to reproduce high-order representations.
(a) 3-center-1-neighbor features describe the position of atoms relative to two tagged atoms around the center i. This is equivalent to a list of tetrahedra sharing the ii1i2 triangle. (b) Tagging two atoms defines a local orthogonal coordinate system—except when the tagged atoms are collinear, and the azimuthal directions are ill-defined. (c) An encoder–decoder architecture can be used to determine non-linear environment features that contain enough information to reproduce high-order representations.
It is important to note the relationships between this procedure and several ideas that have been used in the literature, built around the definition of a rigid coordinate system relative to which one can build local (or global) geometry descriptors39,45 or to that of summing over all (or some) neighbor index permutations, a representation that is invariant to rotations but not to neighbor ordering.46,47 These approaches typically suffer from a lack of smoothness because the procedure that determines the reference environment, or the index permutations that should be included in the summation, leads to discontinuous variations as the underlying structure is deformed continuously. The recently introduced Weighted Matrices Invariant (WMI) framework42,48 is the closest to the spirit of our construction. It uses the unordered set of local atomic coordinates as an invariant and provides a practical algorithm to compare structures using a suitably defined smoothly varying metric. Our approach, instead, effectively defines body-order-complete, smooth, invariant descriptors that can be readily understood and used in relation with several popular schemes based on discretized correlations of the neighbor density. A more in-depth comparison of the current approach with alternatives can be found in the supplementary material.
The expression in Eq. (2) does, in principle, have sufficient flexibility to fit arbitrary symmetric functions of the atoms within Ai. For example, a single universal approximator ψ would suffice to make a complete predictor of a (short-ranged) interatomic potential, which would effectively be obtained as a sum of triplet energies. From this point of view, it is interesting to note the excellent performance that has been recently shown49 for ML potentials that are built as a sum of pair terms, which goes one step toward the provably complete framework we propose here. It is easy to see, however, that a single feature cannot provide a complete description of an environment: a simple counting argument implies that at least (3ni − 6) features are needed to completely describe the degrees of freedom of an environment that contains ni atoms. We propose to use an encoder–decoder architecture to generate a set of non-linear features that can predict a large set of higher-order invariants (ideally with ν approaching the number of neighbors), as a way to generate a complete description [Fig. 2(c)].
To demonstrate this idea, we take the family of degenerate pairs of environments from Ref. 7 that are indistinguishable with respect to all atom-centered invariants up to the ν = 3 bispectrum. This construction involves seven neighbors around a central atom and generates a manifold with six continuous geometric parameters and can be decorated with up to three atom types. For simplicity, we realize it with a single element, choosing boron clusters B8 that are known to exhibit remarkable structural diversity.50,51 By varying the parameters in an appropriate range, we generate a dataset of 4000 pairs, computing their energy with Perdew–Burke–Ernzerhof Density Functional Theory (PBE-DFT) in the cc-pVDZ basis, using PySCF52 (details and resulting structures are given in the supplementary material). We then compute both 3-center-1-neighbor descriptors of triplets, as well as atom-centered invariant correlation features of order ν = 7, based on the NICE (N-body iterative contraction of equivariants) framework.53 We emphasize that—given that the degeneracy affects the environment and is resolved by including multiple centers—all descriptors and models are built for a single atom at the origin. We build two simple encoder/decoder architectures that combine 3-center descriptors, contract them into environment descriptors, and then apply a multi-layer perceptron (MLP) with SiLU activation to approximate the high-order features. One architecture (AL model) uses only a linear transformation before summing over (i1, i2), while the other (ANL model) also includes a non-linear layer (another MLP) before the contraction. Details of the architectures are given in the supplementary material, but are not essential to our findings. We use the intermediate features in the two models to learn the energies of the B8 clusters. We also build “direct” models for the energy as a sum of linear and non-linear triplet energies (BL and BNL), full-body-ordered atom-centered features (CL and CNL), and bispectrum features (DL and DNL). Table I summarizes the validation set accuracy of all these architectures. The accuracy in reproducing the full-body-order correlation features can be interpreted as a measure of the resolving power of features, while the accuracy in predicting energies measures the impact of lack of resolution on a common regression task.
Accuracy of different models (we indicate with “L” and “NL” linear and non-linear layers) for learning the high-ν features and the energy of a validation set of 500 bispectrum-degenerate B8 pairs. We separately report the error on the mean of the features and energies of degenerate pairs (A+, A−) and on their difference, normalized by the standard deviation in the validation set—i.e., for the error on the mean and the difference . We report the error relative to the standard deviations and σ(E(A+) − E(A−)), respectively. Absolute energy errors can be obtained by multiplying the values by 4.26 eV (mean A±) and 0.245 eV (Diff. A±).
Model . | Relative validation error . | ||||
---|---|---|---|---|---|
Name . | Architecture . | Mean A± . | Diff. A± . | ||
0.017 | 1.00 | ||||
AL | 0.172 | 1.00 | |||
0.006 | 0.143 | ||||
ANL | 0.110 | 0.457 | |||
BL | E | 0.385 | 1.00 | ||
BNL | E | 0.017 | 0.199 | ||
CL | E | 0.127 | 0.887 | ||
CNL | E | 0.100 | 0.431 | ||
DL | E | 0.389 | 1.00 | ||
DNL | E | 0.168 | 1.00 |
Model . | Relative validation error . | ||||
---|---|---|---|---|---|
Name . | Architecture . | Mean A± . | Diff. A± . | ||
0.017 | 1.00 | ||||
AL | 0.172 | 1.00 | |||
0.006 | 0.143 | ||||
ANL | 0.110 | 0.457 | |||
BL | E | 0.385 | 1.00 | ||
BNL | E | 0.017 | 0.199 | ||
CL | E | 0.127 | 0.887 | ||
CNL | E | 0.100 | 0.431 | ||
DL | E | 0.389 | 1.00 | ||
DNL | E | 0.168 | 1.00 |
As expected, all models relying on a linear contraction of the 3-center descriptors , being equivalent to models based on bispectrum, are incapable of resolving the degenerate pairs, leading to a fractional error of 1 on the difference of features and energies. The lack of resolving power and the consequent loss of regression accuracy are also evident in the sensitivity analysis [Fig. 3(a)] and in the energy prediction parity plot [Fig. 3(b)]. Even though the performance of individual models depends in a non-trivial way on the model details and the train-set composition and size, one can make a few observations based on the fact that we kept a consistent architecture for the non-linear layers among all models. For starters, the relative errors on feature and energy differences among degenerate pairs are also large for the models that have—in principle—universal approximating power. This is, in part, due to the fact that these differences are small compared to the overall variation in energy and that the magnitude of correlation features is dominated by low-order terms. Furthermore, the practical model accuracy depends on the fine-tuning of the architecture, which we deliberately avoided to facilitate a qualitative comparison between models. For example, one can clearly see that the NICE discretization of is not sufficiently converged to provide a complete linear basis: the CL model has a validation set error of almost 90% for the energy difference between pairs. As for many frameworks in the literature, this lack of basis set convergence can be mitigated by applying a non-linear layer to the high-order features: both the CNL and ANL models (the latter being computed on top of the latent features that allow for the reconstruction of to a very good accuracy) achieve an error on the energy differences around 45%. From this point of view, the fact that a latent layer with only 128 features performs similarly to the full set of (that is iteratively contracted to hold ≈8000 features) underscores the effectiveness of non-linearities in tackling issues of basis set convergence. The error in approximating converges quickly with latent-space size (see the supplementary material), which is consistent with the notion that a minimal set of symmetric descriptors does not have to be much larger than the number of degrees of freedom in the atomic environment being described, even though we cannot provide a rigorous upper bound to the number of non-linear features needed to be complete. Finally, we observe that the BNL model based directly on non-linear functions of the 3-center features performs much better than all other options—not only in terms of accuracy in describing pair differences but also the mean energy of each pair. The impact of degeneracies on realistic datasets, which do not explicitly include them, depends on the details of the model,37 and previous studies of power spectrum degeneracies showed that a clear effect is only observed in the limit of very large datasets.7
(a) Correlation plot of the distances between pairs of structures from the B8 dataset, computed based on the bispectrum (d3, red), the intermediate encoded features from model (, blue), and the full-body features (d7), taken as fully discriminating descriptors. Configurations are randomly distorted ( in the image) to reveal the behavior in the vicinity of the singular points (a bispectrum degenerate pair A+ and A− is also shown in the inset), for which the bispectrum distance would be exactly zero. (b) Predictions for representative 250 pairs of the validation set from model BNL (blue) and DNL (red). Degenerate pairs are joined by a line. The bispectrum-based predictions are identical for the pair, whereas the triplet features can resolve the difference in energy of the pair and achieve better accuracy overall.
(a) Correlation plot of the distances between pairs of structures from the B8 dataset, computed based on the bispectrum (d3, red), the intermediate encoded features from model (, blue), and the full-body features (d7), taken as fully discriminating descriptors. Configurations are randomly distorted ( in the image) to reveal the behavior in the vicinity of the singular points (a bispectrum degenerate pair A+ and A− is also shown in the inset), for which the bispectrum distance would be exactly zero. (b) Predictions for representative 250 pairs of the validation set from model BNL (blue) and DNL (red). Degenerate pairs are joined by a line. The bispectrum-based predictions are identical for the pair, whereas the triplet features can resolve the difference in energy of the pair and achieve better accuracy overall.
Summarizing our findings, we propose a practical solution to the open problem of determining a complete local representation of an atomic structure (more generally, a 3D point-cloud) based on a finite-dimensional and low body-order O(3)-invariant description of an environment. We demonstrate the idea on an artificial, but very challenging, set of B8 cluster configurations, designed to defy linear atom-centered descriptors of the same-body order. The accuracy and computational cost of an ML model in real-life scenarios depend on many different considerations. This approach indicates a way forward to designing ML frameworks that have universal approximation power while being smooth, symmetric, and free from the restrictions of architectures based on O(3) equivariant layers (see, e.g., Ref. 43 for a very recent example of a model that achieves excellent performance, enforcing equivariance by averaging over an ensemble of coordinate systems). The close connection with established, well-understood body-ordered descriptors means that the ideas we expose here can be easily incorporated into existing frameworks, such as those based on atom-centered symmetry functions or those that build non-linear models based on invariant density-correlation features, which can now be safely truncated to a fixed and low order by evaluating features and models at the level of neighbor triplets.
SUPPLEMENTARY MATERIAL
The supplementary material contains a detailed derivation of the main result and a discussion of the relation to alternative approaches to build symmetry-invariant descriptors, as well as a description of the architecture of the models we used in our example. The dataset used for benchmarks as well as code to train and evaluate the different models are available in Ref. 54.
M.C. and J.N. acknowledge funding from the European Research Council (ERC) under the Research and Innovation Program (Grant Agreement No. 101001890-FIAMMA) and the NCCR MARVEL, funded by the Swiss National Science Foundation (SNSF, Grant No. 182892). M.C., S.N.P., and K.K.H.-D. acknowledge support from the Swiss Platform for Advanced Scientific Computing (PASC). M. C. acknowledges support from the SNSF Project (Grant No. 200020_214879).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Jigyasa Nigam: Conceptualization (equal); Data curation (lead); Formal analysis (equal); Investigation (equal); Software (equal); Validation (lead); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Sergey N. Pozdnyakov: Conceptualization (equal); Formal analysis (equal); Software (supporting); Validation (supporting); Writing – original draft (supporting). Kevin K. Huguenin-Dumittan: Formal analysis (equal); Writing – original draft (supporting). Michele Ceriotti: Conceptualization (equal); Data curation (supporting); Formal analysis (equal); Investigation (equal), Funding acquisition (lead); Resources (lead); Software (equal); Supervision (lead); Validation (supporting); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are openly available in Bispectrum degenerate B8 data at https://doi.org/10.5281/zenodo.8003294.54