In this paper, we address the challenge of obtaining a comprehensive and symmetric representation of point particle groups, such as atoms in a molecule, which is crucial in physics and theoretical chemistry. The problem has become even more important with the widespread adoption of machine-learning techniques in science, as it underpins the capacity of models to accurately reproduce physical relationships while being consistent with fundamental symmetries and conservation laws. However, some of the descriptors that are commonly used to represent point clouds— notably those based on discretized correlations of the neighbor density that power most of the existing ML models of matter at the atomic scale—are unable to distinguish between special arrangements of particles in three dimensions. This makes it impossible to machine learn their properties. Atom-density correlations are provably complete in the limit in which they simultaneously describe the mutual relationship between all atoms, which is impractical. We present a novel approach to construct descriptors of *finite* correlations based on the relative arrangement of particle triplets, which can be employed to create symmetry-adapted models with universal approximation capabilities, and have the resolution of the neighbor discretization as the sole convergence parameter. Our strategy is demonstrated on a class of atomic arrangements that are specifically built to defy a broad class of conventional symmetric descriptors, showing its potential for addressing their limitations.

The construction of data-driven models of physics starts by defining a suitable mathematical representation of the problem, which is then used as the input to different types of model architectures.^{1} In many cases, this involves describing the spatial arrangement of point particles^{2,3}—e.g., atoms in the construction of microscopic models of matter—in a way that is consistent with the fundamental physical symmetries governing the relationship between a structure and its properties.^{4,5} It would be desirable to determine a *complete* set of invariants, i.e., a set of geometric descriptors that are equal for two configurations if and only if these configurations are the same, modulo translations, rotations, inversion, and permutations of equivalent atoms (see also Sec. 5 of Ref. 6 for a discussion of different definitions of completeness). Unfortunately, it has been recently shown that many popular structural representations are *incomplete*,^{7,8} leading to the outright impossibility of predicting different values for the properties of structures even though they are unrelated by symmetry and more broadly to numerical instabilities that extend to nearby configurations.^{9,10}

Even though the significance of this problem for data-driven atomistic modeling has been recognized only recently, the problem of describing an arrangement of particles based on limited and/or symmetrized information is a long-standing one.^{11} This issue is the subject of a substantial body of literature, touching upon open problems in invariant theory^{12} and the Euclidean distance geometry problem,^{13–15} with applications to fields as diverse as signal processing and the determination of protein structure by nuclear magnetic resonance (NMR). The broad connection to geometric deep learning and graph convolution,^{2} as well as to the construction of equivariant physical models,^{5} is also well-understood. In the application domain that we focus on here, symmetry has been widely used since the early efforts to construct machine-learning interatomic potentials,^{16–18} and it also underlies efforts geared toward recognizing the (dis)similarity between structures in simulations and databases.^{19,20}

A systematic study of the representations used to characterize atomic structures^{6,21} has revealed that many of these frameworks rely on the same physics-inspired descriptors to characterize groups/clusters of atoms relative to a central one: discretized versions of structural correlation functions of the neighbor density with increasing *body order*^{22}—which are equivalent to the unordered list of interatomic distances, angles, and dihedrals, relative to the central atom. While it was shown that keeping *all* orders of correlations provides a complete, symmetry-adapted *linear* basis to expand structure–property relations,^{23,24} the low-order versions—which are most commonly adopted, in practice—cannot distinguish groups of *degenerate* configurations.^{7} Similar problems affect other machine-learning architectures, such as distance-based graph convolutional networks,^{8} and even equivariant neural networks,^{4,25} which are, in theory, universal approximators, but rely on constructing a hierarchy of symmetric polynomials that are closely related to the density-correlation features discussed above.^{26,27} Completeness, in this context, is only guaranteed in the limit of infinitely deep networks, which makes it harder to allocate computational effort between different architecture parameters, such as network depth and basis resolution.^{24} In this Letter, we discuss a construction that relies only on correlations between a finite number of neighbors to define invariant descriptors of a local atomic environment, which are complete in the sense that they can be used to build simple universal approximators for atomic properties and to differentiate between arbitrary configurations while scaling favorably with respect to system size and allowing for efficient implementations.

To appreciate the significance of this result without delving into the theory of density-based representations (interested readers may refer to the discussion in Ref. 6 or to the summary in the supplementary material), it is useful to consider the concrete case of two-neighbor representations [Fig. 1(a)]. To evaluate a property **y**(*A*) of an atomic structure *A*, it is convenient to express it as a sum of atom-centered contributions **y**(*A*_{i}), which in turn can be expanded into body-ordered terms **y**^{(ν)}(*A*_{i}) simultaneously depending on the relative position of *ν* of the neighbors of the central atom *i*. In line with previous literature, we indicate terms associated with correlations between *ν* neighbors as having body order (*ν* + 1). In the case of a scalar target (e.g., the interatomic potential), the two-neighbor term can be expressed as a sum of a function of two distances and one angle $y(2)(ri1i,ri2i,r\u0302i1i\u22c5r\u0302i2i)$ over all neighbors pairs. Alternatively, and equivalently,^{21,23,24,28} we can expand the histogram of (*r*, *r*′, *θ*) on a discrete basis ⟨*q*| with coefficients $yq(2)$ and approximate *y*^{(2)}(*A*_{i}) as a linear combination of these coefficients, $\u2211qyq(2)\u2329q|\rho i\u22972\u0304\u232a$. Many widespread descriptors from atom-centered symmetry functions^{29} to the SOAP (smooth overlap of atomic positions) power spectrum^{30} can be written in one of these forms. The resolution of the basis is a convergence parameter that, in practical applications^{24,31,32} and, in particular, when using the descriptors as inputs of a non-linear model,^{33} can be usually converged with ease.

Higher-order terms can be obtained by computing (*ν* + 1)-body symmetric density correlations, which we indicate with the shorthand $|\rho i\u2297\nu \u0304\u232a$. This body-ordered expansion of **y**(*A*) converges systematically with *ν* but is rather cumbersome. Instead, many frameworks rely on low-*ν* descriptors and use them as inputs to artificial neural networks.^{16} The non-linear functional form generates *some*, but not all, the terms with higher-body order,^{26,34} leading to more flexible models and making more effective use of the features restricted to a *finite*-body order *ν*.

Finite body-order descriptors, however, have a fundamental limitation: low-*ν* invariants are not *complete* in a way that cannot be remedied by non-linear transformations. For example, if one uses only interatomic distances $(|\rho i\u22971\u0304\u232a)$, it is easy to realize multiple structures with neighbors placed in arbitrary directions but the same list of distances {*r*_{ji}} [Fig. 1(b)]. All these structures will have identical $|\rho i\u22971\u0304\u232a$ features and, therefore, also the same output for any non-linear model built on them. Less trivial constructions have been found that lead to degenerate pairs of configurations for triangle-histogram features (the power spectrum $|\rho i\u22972\u0304\u232a$) as well as for tetrahedra-histogram features (the bispectrum $|\rho i\u22973\u0304\u232a$).^{7} Counterexamples for higher *ν* are not known, but there is no proof of completeness either: for a general function, no finite order *ν* of rotationally symmetric correlations suffices for a complete reconstruction.^{35} The existence of degeneracies means that, for finite *ν*, there are pairs of distinct structures for which it is *impossible* to build an ML model that predicts the differences in their properties, regardless of the resolution of the descriptors. It also determines numerical instabilities in their vicinity, and therefore, models built on these incomplete features can suffer from degraded accuracy and transferability.^{10,36,37}

*N*+ 1)-center quantities, i.e., descriptors of

*N*-tuples of atoms in the environment

*A*

_{i}, which are invariant to the ordering of their neighbors, but equivariant to the labeling of the

*N*tagged particles.

^{38}The central idea in our construction is to build invariant descriptors representing the histogram of one neighbor (

*ν*= 1) in the coordinate system defined by the central atom and two tagged neighbors (

*N*= 2), translating complete frameworks based on local coordinate systems

^{39–43}to the language of density correlation descriptors. Using Dirac’s bra-ket notation to enumerate the components of the descriptor, we compute invariant features of the following form:

*i*

_{1}and

*i*

_{2}in the basis of radial functions

*R*

_{nl}and spherical harmonics $Ylm$. ⟨

*nlm*|

*ρ*

_{i}⟩ = ∑

_{j}⟨

*nlm*|

**r**

_{ji}⟩ is the expansion of the neighbor density on the same basis. The contraction with Clebsch–Gordan coefficients ⟨

*l*

_{1}

*m*

_{1};

*l*

_{2}

*m*

_{2}|

*l*

_{3}

*m*

_{3}⟩ selects the rotationally invariant parts. The first ingredient in our approach is the realization that Eq. (1) provides a complete description of the environment

*A*

_{i}. A detailed proof is reported in the supplementary material, but the reasoning can be explained in relatively simple terms. As long as atoms do not overlap, the peaks in the neighbor density precisely encode information on the position of all the atoms. Averaging |

*ρ*

_{i}⟩ directly over rotations discards angular information. However, the presence of descriptors of the position of two tagged neighbors (

*i*

_{1},

*i*

_{2}) means that rotational averaging of the entire tensor product preserves the full information on the position of all neighbors

*relative to these two atoms*[Fig. 2(a)]. In other terms, the descriptors in Eq. (1) are equivalent to the knowledge of the position of the neighbors in the coordinate system defined by $(ri1i,ri2i)$. Provided that $ri1i$ and $ri2i$ are not collinear, it is possible to reconstruct the position of all the atoms within the environment relative to the center

*i*, modulo rotations. Thus, by feeding a sufficiently fine-grained discretization of $|\rho ii1i2\u22971\u0304\u232a$ to a universal approximator, any function of the coordinates of the atoms in

*A*

_{i}can be approximated in a way that is invariant to rotations and to neighbor permutations,

*except for the tagged atoms*

*i*

_{1}and

*i*

_{2}. As we shall see, the convergence with the number of three-center descriptors and non-linear features can be achieved easily. In order to restore invariance and recover a description of the single-center environment, we sum over these two indices. Performing this sum for Eq. (1) recovers the bispectrum $|\rho i\u22973\u0304\u232a$ (recall that ∑

_{j}|

**r**

_{ji}⟩ = |

*ρ*

_{i}⟩) that we know to be incomplete.

^{7}Instead, we apply a non-linear layer on top of the three-center (

*N*= 2) features

*before*summing over

*i*

_{1}and

*i*

_{2}. Information on individual pair-dependent features is discarded, but any property of the environment

*A*

_{i}can be learned, provided that one uses a sufficiently expressive architecture. Defining the shorthand $\xi q(Aii1i2)=\u2329q|\rho ii1i2\u22971\u0304\u232a$, where

*q*is a compound index that encompasses the labels (

*n*

_{1},

*l*

_{1},

*n*

_{2},

*l*

_{2},

*n*

_{3},

*l*

_{3}), we can write the environment features,

*ψ*

_{p}to be a scalar universal approximator, Eq. (2) implements a

*deep set*architecture,

^{44}which is known to provide

*permutation invariant*universal approximation. Combining information on

*all*(

*i*

_{1},

*i*

_{2}) pairs within

*A*

_{i}also ensures completeness in the presence of collinearity of certain pairs. For pairs corresponding to collinear vectors, $|\rho ii1i2\u22971\u0304\u232a$ cannot resolve the position of the neighbors along the azimuthal direction [Fig. 2(b)]. However, if

*some*of the neighbors in

*A*

_{i}are not collinear, there will be at least one pair that forms a basis for three dimensions and provides a complete description, and the compound features $\xi \u0303$ incorporate this information. If all the atoms are aligned along the same axis, then there is no azimuthal information to be encoded, and the description remains complete.

It is important to note the relationships between this procedure and several ideas that have been used in the literature, built around the definition of a rigid coordinate system relative to which one can build local (or global) geometry descriptors^{39,45} or to that of summing over all (or some) neighbor index permutations, a representation that is invariant to rotations but not to neighbor ordering.^{46,47} These approaches typically suffer from a lack of smoothness because the procedure that determines the reference environment, or the index permutations that should be included in the summation, leads to discontinuous variations as the underlying structure is deformed continuously. The recently introduced Weighted Matrices Invariant (WMI) framework^{42,48} is the closest to the spirit of our construction. It uses the unordered set of local atomic coordinates as an invariant and provides a practical algorithm to compare structures using a suitably defined smoothly varying *metric*. Our approach, instead, effectively defines body-order-complete, smooth, invariant *descriptors* that can be readily understood and used in relation with several popular schemes based on discretized correlations of the neighbor density. A more in-depth comparison of the current approach with alternatives can be found in the supplementary material.

The expression in Eq. (2) does, *in principle*, have sufficient flexibility to fit arbitrary symmetric functions of the atoms within *A*_{i}. For example, a single universal approximator *ψ* would suffice to make a complete predictor of a (short-ranged) interatomic potential, which would effectively be obtained as a sum of triplet energies. From this point of view, it is interesting to note the excellent performance that has been recently shown^{49} for ML potentials that are built as a sum of *pair* terms, which goes one step toward the provably complete framework we propose here. It is easy to see, however, that a single feature cannot provide a complete description of an environment: a simple counting argument implies that at least (3*n*_{i} − 6) features are needed to completely describe the degrees of freedom of an environment that contains *n*_{i} atoms. We propose to use an encoder–decoder architecture to generate a set of non-linear features $\xi \u0303$ that can predict a large set of higher-order invariants (ideally with *ν* approaching the number of neighbors), as a way to generate a complete description [Fig. 2(c)].

To demonstrate this idea, we take the family of degenerate pairs of environments from Ref. 7 that are indistinguishable with respect to all atom-centered invariants up to the *ν* = 3 bispectrum. This construction involves seven neighbors around a central atom and generates a manifold with six continuous geometric parameters and can be decorated with up to three atom types. For simplicity, we realize it with a single element, choosing boron clusters B_{8} that are known to exhibit remarkable structural diversity.^{50,51} By varying the parameters in an appropriate range, we generate a dataset of 4000 pairs, computing their energy with Perdew–Burke–Ernzerhof Density Functional Theory (PBE-DFT) in the cc-pVDZ basis, using PySCF^{52} (details and resulting structures are given in the supplementary material). We then compute both 3-center-1-neighbor descriptors of $Aii1i2$ triplets, as well as atom-centered invariant correlation features of order *ν* = 7, based on the NICE (N-body iterative contraction of equivariants) framework.^{53} We emphasize that—given that the degeneracy affects the *environment* and is resolved by including multiple centers—all descriptors and models are built for a single atom at the origin. We build two simple encoder/decoder architectures that combine 3-center descriptors, contract them into environment descriptors, and then apply a multi-layer perceptron (MLP) with SiLU activation to approximate the high-order features. One architecture (A_{L} model) uses only a linear transformation before summing over (*i*_{1}, *i*_{2}), while the other (A_{NL} model) also includes a non-linear layer (another MLP) *before* the contraction. Details of the architectures are given in the supplementary material, but are not essential to our findings. We use the intermediate $\xi \u0303(Ai)$ features in the two models to learn the energies of the B_{8} clusters. We also build “direct” models for the energy as a sum of linear and non-linear triplet energies (B_{L} and B_{NL}), full-body-ordered atom-centered features (C_{L} and C_{NL}), and bispectrum features (D_{L} and D_{NL}). Table I summarizes the validation set accuracy of all these architectures. The accuracy in reproducing the full-body-order $|\rho i\u22977\u0304\u232a$ correlation features can be interpreted as a measure of the resolving power of features, while the accuracy in predicting energies measures the impact of lack of resolution on a common regression task.

Model . | Relative validation error . | ||||
---|---|---|---|---|---|

Name . | Architecture . | Mean A^{±}
. | Diff. A^{±}
. | ||

$AL\rho $ | $\rho ii1i2\u22971\u2192L$ | $\xi \u0303p$ | $\u2192NL\rho i\u22977$ | 0.017 | 1.00 |

A_{L} | $\u21aaNLE$ | 0.172 | 1.00 | ||

$ANL\rho $ | $\rho ii1i2\u22971\u2192NL$ | $\xi \u0303p$ | $\u2192NL\rho i\u22977$ | 0.006 | 0.143 |

A_{NL} | $\u21aaNLE$ | 0.110 | 0.457 | ||

B_{L} | $\rho ii1i2\u22971\u2192L$ E | 0.385 | 1.00 | ||

B_{NL} | $\rho ii1i2\u22971\u2192NL$ E | 0.017 | 0.199 | ||

C_{L} | $\rho i\u22977\u2192L$ E | 0.127 | 0.887 | ||

C_{NL} | $\rho i\u22977\u2192NL$ E | 0.100 | 0.431 | ||

D_{L} | $\rho i\u22973\u2192L$ E | 0.389 | 1.00 | ||

D_{NL} | $\rho i\u22973\u2192NL$ E | 0.168 | 1.00 |

Model . | Relative validation error . | ||||
---|---|---|---|---|---|

Name . | Architecture . | Mean A^{±}
. | Diff. A^{±}
. | ||

$AL\rho $ | $\rho ii1i2\u22971\u2192L$ | $\xi \u0303p$ | $\u2192NL\rho i\u22977$ | 0.017 | 1.00 |

A_{L} | $\u21aaNLE$ | 0.172 | 1.00 | ||

$ANL\rho $ | $\rho ii1i2\u22971\u2192NL$ | $\xi \u0303p$ | $\u2192NL\rho i\u22977$ | 0.006 | 0.143 |

A_{NL} | $\u21aaNLE$ | 0.110 | 0.457 | ||

B_{L} | $\rho ii1i2\u22971\u2192L$ E | 0.385 | 1.00 | ||

B_{NL} | $\rho ii1i2\u22971\u2192NL$ E | 0.017 | 0.199 | ||

C_{L} | $\rho i\u22977\u2192L$ E | 0.127 | 0.887 | ||

C_{NL} | $\rho i\u22977\u2192NL$ E | 0.100 | 0.431 | ||

D_{L} | $\rho i\u22973\u2192L$ E | 0.389 | 1.00 | ||

D_{NL} | $\rho i\u22973\u2192NL$ E | 0.168 | 1.00 |

As expected, all models relying on a linear contraction of the 3-center descriptors $|\rho ii1i2\u22971\u0304\u232a$, being equivalent to models based on bispectrum, are incapable of resolving the degenerate pairs, leading to a fractional error of 1 on the difference of features and energies. The lack of resolving power and the consequent loss of regression accuracy are also evident in the sensitivity analysis [Fig. 3(a)] and in the energy prediction parity plot [Fig. 3(b)]. Even though the performance of individual models depends in a non-trivial way on the model details and the train-set composition and size, one can make a few observations based on the fact that we kept a consistent architecture for the non-linear layers among all models. For starters, the relative errors on feature and energy differences among degenerate pairs are also large for the models that have—in principle—universal approximating power. This is, in part, due to the fact that these differences are small compared to the overall variation in energy and that the magnitude of correlation features is dominated by low-order terms. Furthermore, the practical model accuracy depends on the fine-tuning of the architecture, which we deliberately avoided to facilitate a qualitative comparison between models. For example, one can clearly see that the NICE discretization of $|\rho i\u22977\u0304\u232a$ is not sufficiently converged to provide a complete *linear* basis: the C_{L} model has a validation set error of almost 90% for the energy difference between pairs. As for many frameworks in the literature, this lack of basis set convergence can be mitigated by applying a non-linear layer to the high-order features: both the C_{NL} and A_{NL} models (the latter being computed on top of the latent features that allow for the reconstruction of $|\rho i\u22977\u0304\u232a$ to a very good accuracy) achieve an error on the energy differences around 45%. From this point of view, the fact that a latent layer with only 128 features performs similarly to the full set of $|\rho i\u22977\u0304\u232a$ (that is iteratively contracted to hold ≈8000 features) underscores the effectiveness of non-linearities in tackling issues of basis set convergence. The error in approximating $|\rho i\u22977\u0304\u232a$ converges quickly with latent-space size (see the supplementary material), which is consistent with the notion that a minimal set of symmetric descriptors does not have to be much larger than the number of degrees of freedom in the atomic environment being described, even though we cannot provide a rigorous upper bound to the number of non-linear features needed to be complete. Finally, we observe that the B_{NL} model based *directly* on non-linear functions of the 3-center features performs much better than all other options—not only in terms of accuracy in describing pair differences but also the mean energy of each pair. The impact of degeneracies on realistic datasets, which do not explicitly include them, depends on the details of the model,^{37} and previous studies of power spectrum degeneracies showed that a clear effect is only observed in the limit of very large datasets.^{7}

Summarizing our findings, we propose a practical solution to the open problem of determining a complete local representation of an atomic structure (more generally, a 3D point-cloud) based on a finite-dimensional and low body-order *O*(3)-invariant description of an environment. We demonstrate the idea on an artificial, but very challenging, set of B_{8} cluster configurations, designed to defy linear atom-centered descriptors of the same-body order. The accuracy and computational cost of an ML model in real-life scenarios depend on many different considerations. This approach indicates a way forward to designing ML frameworks that have universal approximation power while being smooth, symmetric, and free from the restrictions of architectures based on *O*(3) equivariant layers (see, e.g., Ref. 43 for a very recent example of a model that achieves excellent performance, enforcing equivariance by averaging over an ensemble of coordinate systems). The close connection with established, well-understood body-ordered descriptors means that the ideas we expose here can be easily incorporated into existing frameworks, such as those based on atom-centered symmetry functions or those that build non-linear models based on invariant density-correlation features, which can now be safely truncated to a fixed and low order by evaluating features and models at the level of neighbor triplets.

## SUPPLEMENTARY MATERIAL

The supplementary material contains a detailed derivation of the main result and a discussion of the relation to alternative approaches to build symmetry-invariant descriptors, as well as a description of the architecture of the models we used in our example. The dataset used for benchmarks as well as code to train and evaluate the different models are available in Ref. 54.

M.C. and J.N. acknowledge funding from the European Research Council (ERC) under the Research and Innovation Program (Grant Agreement No. 101001890-FIAMMA) and the NCCR MARVEL, funded by the Swiss National Science Foundation (SNSF, Grant No. 182892). M.C., S.N.P., and K.K.H.-D. acknowledge support from the Swiss Platform for Advanced Scientific Computing (PASC). M. C. acknowledges support from the SNSF Project (Grant No. 200020_214879).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Jigyasa Nigam**: Conceptualization (equal); Data curation (lead); Formal analysis (equal); Investigation (equal); Software (equal); Validation (lead); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). **Sergey N. Pozdnyakov**: Conceptualization (equal); Formal analysis (equal); Software (supporting); Validation (supporting); Writing – original draft (supporting). **Kevin K. Huguenin-Dumittan**: Formal analysis (equal); Writing – original draft (supporting). **Michele Ceriotti**: Conceptualization (equal); Data curation (supporting); Formal analysis (equal); Investigation (equal), Funding acquisition (lead); Resources (lead); Software (equal); Supervision (lead); Validation (supporting); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are openly available in Bispectrum degenerate B8 data at https://doi.org/10.5281/zenodo.8003294.^{54}

## REFERENCES

*Neural Information Processing Systems Vol. 34*(Curran Associates, Inc. 2021) pp. 28848–28863, https://proceedings.neurips.cc/paper_files/paper/2021/file/f1b0775946bc0329b35b823b86eeb5f5-Paper.pdf.

*n*-point configurations from the distribution of distances or areas

*Statistical Mechanics: Entropy, Order Parameters, and Complexity*

*Oxford Master Series in Statistical, Computational, and Theoretical Physics Vol. 14*

*n*-body force fields from machine learning

*N*-center atomic-scale properties

*Advances in Neural Information Processing Systems*

*Advances in Neural Information Processing Systems*

_{n}(n = 2–14)

*N*-body equivariant features