Faithfully representing chemical environments is essential for describing materials and molecules with machine learning approaches. Here, we present a systematic classification of these representations and then investigate (i) the sensitivity to perturbations and (ii) the effective dimensionality of a variety of atomic environment representations and over a range of material datasets. Representations investigated include atom centered symmetry functions, Chebyshev Polynomial Symmetry Functions (CHSF), smooth overlap of atomic positions, many-body tensor representation, and atomic cluster expansion. In area (i), we show that none of the atomic environment representations are linearly stable under tangential perturbations and that for CHSF, there are instabilities for particular choices of perturbation, which we show can be removed with a slight redefinition of the representation. In area (ii), we find that most representations can be compressed significantly without loss of precision and, further, that selecting optimal subsets of a representation method improves the accuracy of regression models built for a given dataset.

## I. INTRODUCTION

Machine Learning (ML) as a predictive modeling tool has gained much attention in recent years in fields ranging from biology^{1} and chemistry^{2} to materials science,^{3,4} building on decades of successful applications in image recognition, natural language processing, and artificial intelligence (AI).^{5,6} While machine learning has also been increasingly popular in many fields due to its relatively simplicity of application and powerful prediction features, a key driving force is the increasing availability of information through data repositories, archives, and databases of materials and molecules that include billions of structures.^{3,7} In recent years, a large number of ML models for materials and molecules have been developed and many novel representation methods have been proposed.^{8–15} Representations are the feature sets used as input data for data-driven machine learning models, which, after training, can be used to predict quantities of interest.

Some of these databases are focused on molecular configurations, such as PubChem,^{16} DrugBank,^{17} and ChEMBL,^{18} which are widely used in biological, chemical, and pharmacological applications driven by high-throughput screening for drug design and novel molecule discovery. Other databases such as the Cambridge Structural Database (CSD)^{19} include crystals and small molecules. While these databases contain mostly molecular structures with their chemical or biological properties, recent efforts such as the Materials Project,^{20,21} Automatic FLOW for Material Discovery Library (AFLOWLIB.org),^{22,23} Open Quantum Materials Database (OQMD),^{24,25} and the Novel Materials Discovery (NOMAD) archive^{26} (which collates data from many other databases and contains data that are not available at them) now provide extensive electronic structure results for molecules, bulk materials, surfaces, and nanostructures in a range of ordered and disordered phases, with chemical composition ranging from inorganic systems to metals, alloys, and semiconductors. With recent advancements in materials databases, access is available to billions of properties of materials and molecules and millions of high-accurate calculations through online archives. With such an extensive and diverse collection of molecular and materials data to hand, we can ask questions such as how to identify of the most informative subset of data for a particular class of materials and how best to interpret it for prediction through ML models. These questions are relevant for both classification and regression applications: for example, classifying materials as metallic or semiconducting or predicting the bandgap from structure, respectively. In both cases, the input data are constructed from information ranging from basic physical and chemical properties to precise geometrical information based on atomic structure and bonding topology. There have been many studies to determine these input information, which are named descriptors, fingerprints, or representations^{13,27–32}

Here, we define the prediction problem as

where **t** is the target property of the material, *χ*_{k} represents the structure with index *k* within the database, *M*_{a} is the machine learning model with identifier *a*, and **V**_{b} is the input representation determined with identifier *b* that is used for optimizing *f*. We define *representations* as

where *x*_{j} are the coordinates of atom with index *j* within structure *k*, *α*_{m} is a physical or chemical property of the material and chemical environment, *h*_{i} is the *descriptor* mapping from the input space to hyper-dimensional space with each dimension being an input *feature* of model *M*, and *g* is the encoding function that combines all of the *h*_{i} descriptors to make an overall *representation*. In Eq. (1), the optimization of the parameters of *f* can be performed either for a single model *M* or for multiple models *M*_{1}, *M*_{2}, … in combination with potentially multiple different representations **V**_{b}.

From this general perspective, many approaches can be combined to form a predictive model, raising the critical open question of how to optimally choose which pieces of information are needed to represent a material through descriptors and how to combine these to form representations. These can vary from defining sets of physical and chemical properties and atomic geometries to specialized hyper-dimensional mapping algorithms (see Fig. 1). These representations and their building blocks and the descriptors can be classified into three broad classes based on their construction: (i) atomic neighborhood density definitions, (ii) topology expansions, and (iii) property-based selections. The representations can be further grouped according to their combination rules, mapping basis or filtering functions, histograms, connectivity maps or graphs, and finally direct contributions from material properties. Within each category, there have been various developments of representations and many successful applications, which we review briefly below in Sec. II, before specializing on atomic neighborhood densities in Sec. III. Section IV lists the data sources used here for evaluating representations, and Sec. V describes our analysis methodology. Results and discussion follow in Sec. VI.

## II. REPRESENTATION CLASSES

### A. Property-based representations

The idea of classification of molecules based on the relationships between their structure and the resulting activities or properties is an underpinning approach of modern chemistry.^{33} However, using theoretical *descriptors* to identify the relations was first utilized for quantitative structure–activity/property relationship (QSAR/QSPR) modeling,^{33,34} starting with the work of Wiener and Platt in 1947, who used indices based on chemical graph theory.^{33} We use the name QSAR refer to both QSAR/QSPR models.

QSAR models are highly successful predictive modeling approaches in chemistry and biology in various applications.^{33,34} These models are determined using sets of descriptors that are defined through a selection of physical, chemical properties of materials and/or structural descriptors of molecules such as atoms and their molecular bonds from chemical *topology* information to graphs where the fractions of the bonding information are simplified to *indices*.^{33–35}

The success of QSAR mostly depends on the choice of descriptor sets and, in particular, determining the optimum number of descriptors in such sets.^{34} For example, compressed-sensing approaches have been developed to systematically select the optimal set of the descriptors.^{35,36}

Direct property-based approaches also employ reduction techniques to identify optimal models {*M*_{a}} in Eq. (1) by choosing mathematical operators for *g* that combine descriptors from a given set of {*h*_{i}} with *x*_{j}, {*α*_{m}}, or both. This optimization scheme, referred to as *sure independence screening and sparsifying operator* (SISSO), has been successfully applied as a classification technique to group materials (e.g., metals vs non-metals) and as a regression method to determine a target property of interest.^{37}

The connectivity information of each species in the chemical stoichiometry has also been introduced recently with the *atom2vec*^{13} method as a representation where a single atom-to-environment connectivity matrix can be constructed. This simple approximation enables the approach to be used to screen billions of materials and make predictions such as identifying possible candidates for Li-ion battery elements^{38} based only on chemical composition and stoichiometry.

Recent efforts also address the direct usage of electronic structure data to define descriptors. In these methods, the essence of the electronic structure information is extracted through histograms over density-of-states (DOS)^{39} or over the selected band structures^{40} into the descriptor vectors.

### B. Topology-based representations

In this group of descriptors, topological information is derived from the full connectivity graph of the atomic environment instead of simplified *indices*. Examples of this class of methods include the Coulomb and sine matrices,^{41} *n*-gram graphs,^{42,43} and graph-based Neural Network (NN) models such as DTNN,^{44} SchNET,^{12} and CGCNN.^{45} While Coulomb and sine matrices use atom-to-atom connection information directly and are thus based on the inferred chemical bonding between atoms, the input representations of graph neural network models are defined through neighborhood analysis of atomic connectivity graphs and can be constructed from the full *n*-body topology.

Another approach to use topology information such as bonds, angles, and higher-order many-body contributions is to group them into sets of building blocks and apply histograms on these blocks over the input datasets. Several successful implementations of these approach are Bag-of-Bonds (BoB)^{46} and its analogy to higher order contributions, for example, bag-of-angles and k-bags.

Topology information provides atomic connectivity data and is very useful for structural similarity comparison, since this information is automatically invariant under changes in the atomic positions. Topology is a blueprint of chemical bonding definitions for materials or molecules. However, it is not a physical observable, and therefore, it cannot be constructed uniquely. For novel material discovery, one needs to use electronic structure calculations to generate new data comprising the properties and construction of structural representations of uncharted material compositions. However, such database construction for new structures is extremely expensive.

### C. Atomic neighbourhood density representations

Modeling materials and molecules at the atomic scale is essential to identify novel materials for applications of interest or to determine candidate molecules with a specific function in a desired medium. When accuracy and reliability of concern, preference is usually given to *ab initio* calculations. However, these first-principles calculations are inaccessible for systems requiring long time scales and/or large numbers of atoms. To significantly reduce the computational cost, interatomic potentials or force fields are employed to define the interactions between atoms using parametrized forms of functions in place of the electronic interactions. Although these approaches allow larger numbers of atoms to be included in simulations and provide access to much longer time scales, their prediction accuracies are typically limited as a result of simplified models describing interatomic interactions.

A prominent recent trend has been to address this limitation with *machine learning interatomic potentials* (MLIPs), i.e., by replacing fixed functional forms for the atomic interactions with data-driven machine learning models that map from atomic positions to potential energy surface (PES), via descriptors based on the local neighborhood of each atom. This approach based on atomic-densities extraction has been applied in pioneering models using symmetry functions^{8} and, more recently, the so-called smooth-overlap of atomic positions (SOAP).^{9,47} Symmetry functions are widely used in neural network interatomic potentials (NNP)^{8,32,48,49} successfully for many molecules and materials, including water, crystals with various phases,^{50} and amorphous solids in various concentrations,^{51,52} while SOAP is typically incorporated in Gaussian Approximation Potentials (GAP).^{9,47} GAP has been applied to many molecules and materials.^{4,53–55}

In this final class of descriptors, there have been many developments in recent years. In addition to atom-centered symmetry functions (ACSF) and SOAP descriptors, different basis expansions such as bispectrum and Chebyshev polynomials have also been utilized in SNAP potentials^{56,57} and NNPs,^{58} respectively. Further advances are also provided in ML models based on atomic and many-body expansions with tensor representations such as Many-Body Tensor Representation (MBTR).^{10}

An alternative approach is to employ linear regression using a symmetric polynomial basis. This approach was pioneered by Bowman and Braams,^{59} while more recent symmetric bases are the Moment Tensor Potentials (MTP)^{15} and the Atomic Cluster Expansion (ACE).^{60} The MTP and ACE basis are also based on density projections and therefore closely related to the descriptors described above. In particular, ACE can also be seen as a direct generalization of SOAP and SNAP, providing features of arbitrary correlation order.

Recent works have demonstrated that the atomic-density representations can be unified in a common mathematical framework^{61} and that the descriptors can be decorated with additional properties to extend their capabilities, such as in *λ*-SOAP representations for learning from tensor properties.^{61}

For all these ML models, different combinations of descriptors and representations have been used in many studies and for a wide range of materials.^{55,62,63} As these descriptors and representations are general structural identifiers, they can also be utilized within different regression models other than the ML models for which they were primarily developed. A recent work also shows the comparison of different regressors with these representations.^{43} While different approaches have been utilized in these studies, comparative assessments of the representation approaches have thus far been limited. Studies for the property and topology based descriptors on the QSAR feature sets and the performance of graph based NN models show that selection of optimal descriptor sets is very important to ensure high predictive accuracy of the ML models.^{63} A number of studies of the atomic neighborhood density approaches provide analyses primarily based on trained models’ precision on available datasets,^{63} but so far, there is no performance assessment on descriptors and representations for materials and molecules focused solely on structure to hyper-dimensional encoding.

Following the works by Huo and Rupp^{10} and Jäger *et al.*,^{64} we identify the essential properties of descriptors and representations for encoding materials and molecules as follows:

*Invariance*: descriptors should be invariant under symmetry operations—permutation of atoms and translation and rotation of structure.*Sensitivity (local stability)*: small changes in the atomic positions should result in proportional changes in the descriptor, and vice versa.*Global uniqueness/faithfulness*: the mapping of the descriptor should be unique for a given input atomic environment (i.e., the mapping is injective).*Dimensionality*: relatedly, the dimension of the spanned hyper-dimensional space of the descriptor should be sufficient to ensure uniqueness, but not larger.*Differentiability*: having continuous functions that are differentiable.*Interpretability*: features of the encoding can be mapped directly to structural or material properties for easy interpretation of results.*Scalability*: ideally, descriptors should be easily generalized to any system or structure with a preference to have no limitations on number of elements, atoms, or properties.*Complexity*: to have a low computational cost so the method can be fast enough to scale to the required size of the simulations and to be used in high-throughput screening of big data.*Discrete mapping*: always map to the same hyper-dimensional space with constant size feature sets, regardless of the input atomic environment.

In this article, we concentrate our efforts on analyzing sensitivity (i.e., local stability) and compressibility (i.e., dimensionality) of the selected set of descriptors and representations and do not address uniqueness/faithfullness; for an interesting recent investigation of this important issue, see Ref. 65. *Scalability*, *complexity*, and *discrete mapping* are also not addressed here as these are mainly related to the implementation and cost of methods in ML models, such as the evaluation costs of MLIPs or scalability limits of ML models, which are outside the scope of this work. On the other hand, the hunt for the *interpretability* of ML models considering the construction of the descriptors is not a new concept and has been studied in the context of QSAR models^{66} and many-body descriptors.^{67} As also discussed in property-based representations, *interpretability* is automatically inherited by models that are built directly from properties. As the concept depends on the application of the ML models, we leave addressing this for future studies. While there have been many related efforts on the analysis of property and topology based descriptors,^{35–37} we choose to focus here on the atomic neighborhood density based descriptors along with representative descriptors from three groups of mapping basis functions, tensor representations, and polynomial representations as follows: ACSF, Chebyshev polynomials in SF (CHSF), SOAP, SOAPlite, MBTR, and ACE.^{60} We provide an analysis of the sensitivity and local stability of the descriptors as well as testing their invariance under symmetry operations. We further assess the descriptors information packing ability using CUR decomposition where the matrix is decomposed into columns (C), rows (R), and reconstruction (U) matrices,^{70} farthest point sampling (FPS),^{70} and principal component analysis (PCA)^{50,52} dimension reduction techniques. We also provide analyses using linear and kernel ridge regression methods on the selected features from CUR to showcase the effect of the dimensionality of representations on ML models.

## III. ATOMIC NEIGHBORHOOD DENSITY REPRESENTATIONS

The descriptors sharing neighborhood density extraction can be unified in a general atomic expansion following the works in Refs. 60 and 61 and can be defined by

where

Here, *x*_{i} is the position of atom *i* and *b*_{i,n} can be expressed with radial and angular basis functions or as a polynomial expansion. In the following, we will give a brief summary of selected atomic neighborhood density representations based on the above definition. These are Atom Centered Symmetry Functions (ACSF), Chebyshev polynomial representations, Smooth Overlap of Atomic Positions (SOAP), Many-Body Tensor Representation (MBTR), and Atomic Cluster Expansion (ACE).

### A. Atom centered symmetry functions (ACSF)

The descriptors of atom centered symmetry functions were introduced by Behler and Parrinello with their atomic neural network potentials (NNP)^{8,68} and further used in a wide variety of applications.^{8,48,50,52,64,68} The method extracts atomic environment information for each atom in the configuration using radial contributions given by

and angular contributions of the form

where the distances *R*_{ij} = |**R**_{j} − **R**_{i}| and angle dependent contributions are defined through their cosines via *A*_{ijk} = cos *θ*_{ijk} and cos *θ*_{ijk} = **R**_{ij} · **R**_{jk}/(|**R**_{ij}||**R**_{ik}|) and are invariant under symmetry operations of translations and rotations, hence the name *symmetry functions*. While many symmetry functions have been proposed,^{8} two choices of functions commonly used for *b*^{r} and *b*^{a} in many applications^{8,49,52,58,69,70} take the radial function to be centered on atom *i* to be defined as

and the angular function centered on atom *i* as

where *f*_{c} is a cutoff function given by

and *R*_{c} is the cutoff distance.

In this work, we used *G*^{2} and *G*^{4} with two parameter sets: one is the traditional parameter set taken from Refs. 8, 48, 49, and 68 that is used in many ACSF-based NN potential models and the second one is extracted from the automatic ACSF parameter generation proposed in Ref. 70. For the rest of this work, we label the representation of the original parameter set ASCF and the newer systematic extended parameter set ACSF-X, for which we take the same parameters as in Ref. 70.

### B. Chebyshev polynomial representation within symmetry functions (CHSF)

By introducing radial and angular basis that are invariant under translation and rotational symmetries, one can define different functions that provide invariance under symmetry operations. Another example that exploits this idea defines radial and angular functions with

where *N*_{α} is the expansion order and the basis functions *ϕ*_{α} and their duals $\varphi \xaf\alpha $ are defined in terms of the Chebyshev polynomials *T*_{α} via

In Eqs. (10)–(12), *α* is the order of the polynomials, *k* = 1 except for *α* = 0 where *k* = 2, and (*x*, *T*_{c}) are taken to be (*r*, *r*_{c}) for radial functions and (*A*_{ijk}, *π*) for angular functions, respectively.^{58} The atomic descriptors *V*_{b} are then defined using the set of coefficients $c\alpha (2)$ and $c\alpha (3)$ corresponding to $bir$ and $bia$ with

where *w*_{t} is the weight for species *t*. For single-species configurations *w*_{t} = 1, for multi-species configurations, both structural and compositional parts contribute to the final descriptor.

A practical advantage of using polynomial expansion for radial and angular functions is the reduced number of input parameters as the only parameter for the expansion is the expansion order *N*_{α} of the Chebyshev polynomials. In this work, we select *N*_{α} = 9 for both radial and angular polynomial functions.

In this representation, the radial and angular contributions *b*^{r} and *b*^{a} are separate functions of distances and angles, respectively. *b*^{r} is defined by Eq. (10) and provides a *histogram of distances* present in the atomic environment. However, angular contributions *b*^{a} are significantly different. While the Chebyshev polynomial variant defines *b*^{a} to be a *histogram of angles* using only *A*_{ijk} in Eq. (11), ACSF combines both distances and angles in Eq. (8) and thus defines *histogram of triangles*.

### C. Smooth overlap of atomic positions (SOAP)

Descriptors can be constructed for extracting neighboring atomic environments using the smooth overlap of atomic positions (SOAP) approach.^{9} In this method, atomic densities centered at atom positions are defined by a sum of atom-centered Gaussians with the overall atomic density of a structure *χ* given by

and one can build SOAP kernel *K*(*χ*, *χ*′) with

where the exponent *ζ* > 1 and the integral is calculated over all possible rotations $R^$ of the overlapping densities of *χ* and *χ*′ environments. In practice, as is elegantly shown in Ref. 9, an equivalent kernel can be rewritten in the form of $K(\chi ,\chi \u2032)=p^(\chi )\u22c5p^(\chi \u2032)$ by selecting a set of orthonormal radial basis functions *g*_{n}(*r*) and angular basis functions with the spherical harmonic functions *Y*_{lm}(*θ*, *ϕ*) to expand the atom centered density at atom *i* with

and using the power spectrum of the expansion coefficients $cnlmi$, given by

where *n* and *n*′ are indices for the radial basis functions and *l* and *m* are the angular momentum numbers for the spherical harmonics. In SOAP, as is defined in Ref. 9, the radial basis functions are given by

in terms of polynomials. The representation of atomic environment *χ* is then defined by $p^(\chi )$, where *p*^{i}(*χ*) can be identified as atomic descriptors for atom *i*. The SOAP descriptors and representation are specified by the expansion orders *n*_{max} for the radial basis and *l*_{max} for the angular basis. In this work, for compatibility with SOAPlite and with the polynomial expansion order of Chebyshev polynomials in SF, we select *n*_{max} = 9 and *l*_{max} = 9.

### D. Modified basis expansion for SOAP (SOAPlite)

Introducing a different radial basis function and treatment of spherical harmonics basis, a modified version of SOAP referred to as SOAPlite has been proposed recently.^{64} In this version of SOAP, radial basis functions are replaced by

where *α*_{nl} are decay parameters of non-orthonormal functions *ϕ*_{nl}(*r*) that determine the decay of *ϕ*_{nl} to 10^{−3} at a cutoff radius specified by (*R*_{c} − 1)/*n*_{max} steps between 1 Å and *R*_{c}. The method also selects the real (tesseral) spherical harmonics for the angular basis as described in Ref. 64. For fair comparison, we select *n*_{max} = 9 and *l*_{max} = 9 for SOAPlite with all other parameters taken as the defaults implemented in the GAP^{9,47} and QUIP^{71,72} codes.

### E. Many-body tensor representation (MBTR)

Many-body tensor representation (MBTR)^{10} constructs representations of structures by defining contributions from *k* atoms in *k*-body terms with *g*_{k} geometry functions. In MBTR, these contributions from *k* atoms are smoothed with the probability distribution function *D*, and the resulting contributions to the representation are given by

where *Z* are atomic numbers, *C*_{z} is an element–element correlation matrix consisting of Kronecker *δ* values, *w*_{k} are weighting functions, and *g*_{k} are scalars for *k* atoms, while *i* and *j* are neighboring atoms in *i* = (*i*_{1}, …, *i*_{k}). Common selections for the functions *g*_{k} are atomic number *g*_{1}(*Z*_{i}) = *Z*_{i}, inverse distances of *i*–*j* pairs with *g*_{2}(*i*, *j*) = 1/|*R*_{i} − *R*_{j}|, and angles with *g*_{3}(*i*, *j*, *k*) = *∠*(*R*_{i} − *R*_{j}, *R*_{i} − *R*_{k}). In this work, we select *g*^{r} = *g*_{2} and *g*^{a} = *g*_{3} for geometry functions with the exponential decay function

where *β* is taken to be 4.0 for *k* = 2 and 3.0 for *k* = 3. We select Gaussian distribution for *D* and the continuous broadened results are discretized to *N*_{x} = (*x*_{max} − *x*_{min})/Δ*x* values using Δ*x* steps where *N*_{x} is 100 and *x*_{min}, *x*_{max} with intervals of [0.1, 1.1] and [−0.15, *π* + 0.1*π*] for distances and angles, respectively. MBTR is the only global representation studied in this work, since it does not include any atomic descriptors but instead provides a representation for a given structure overall.

### F. Atomic cluster expansion (ACE)

The ACE^{60} method constructs a complete basis of invariant polynomials. Each basis function may be interpreted as an invariant feature, which can then be collected into a descriptor map. Similar to SOAP, the ACE starts with a density projection,

where *g*_{n} is a radial basis. The atomic positions are not smeared as in SOAP, SOAPlite, and MBTR. Isometry invariant features are then obtained by integrating the *N*-correlations over the symmetry group: for $n=(n\alpha )\alpha =1N,l=(l\alpha )\alpha =1N,m=(m\alpha )\alpha =1N$, we obtain

Finally, one selects a linearly independent subset of the *B*_{nlm} basis functions. A detailed description of this construction is provided in Ref. 73.

Aside from the lack of smearing and the choice of radial basis, the two-correlation functions are equivalent to SOAP, while the three-correlation functions are equivalent to SNAP. Since the ACE construction readily applies to higher order correlations, we will use up to five-body correlations in order to test the effect of introducing significantly higher correlations into the descriptor. To control the size of the feature set, we use an *a priori* chosen sparse selection, as described in Ref. 73.

To complete the specification of the ACE descriptors, we must define the radial basis: here, we choose

where (*P*_{n}) is a basis of orthogonal polynomials such that *g*_{n}(*R*_{c}) = $gn\u2032$(*R*_{c}) = 0. In this work, the ACE method is used as implemented in the ACE.jl package in Ref. 74.

### G. Modified Chebyshev polynomial symmetry functions (CHSF-mix)

Noting the *histogram of triangles* provided by ACSF, SOAP, SOAPlite, and ACE through Eqs. (8), (17), and (25), we examine the contributions from the angular terms with *A*_{ijk} and radial basis of CHSF. Combining both radial *b*^{r} and angular *b*^{a} basis expansion of Chebyshev polynomials, we introduced a new *b*^{a} with

where $cnn\u2032l(3)$ is defined as

and *α* index is substituted with *l* for angular basis and *n*, *n*′ for radial basis sets with the choice of (*x*, *T*_{c}) that are selected in *ϕ*_{α}(*R*_{ij}) and *ϕ*_{α}(*A*_{ijk}) as in Eqs. (13) and (14), respectively.

We analyze the benefit of these novel modifications over CHSF-mix in Sec. VI A using *N*_{n} = 9 and *N*_{l} = 9.

### H. Descriptor implementations

All our analyses are carried out using our DescriptorZoo code^{75} (github.com/DescriptorZoo) that includes implementations of the CUR, FPS, and PCA analyses and uses AMP,^{76} Dscribe,^{41,77} qmml-pack,^{78,86} QUIP,^{71,72} and GAP^{9,47} with its Python interface quippy, æpy^{79} a wrapper code for ÆNET^{49,58,80} Fortran code of the NN ML model based on ACSF and Chebyshev polynomial descriptors (CHSF), CHSF.jl for both CHSF and CHSF-mix,^{81} and ACE.jl^{73} code for ACE representation (labeled ACE in results below).

## IV. DATASETS OF MATERIALS AND MOLECULES FOR ANALYSIS

We used a wide range of materials and molecules databases to provide datasets to test the various representation methods. For diversity, we selected a range of materials and molecular systems: Si for single species tests, TiO_{2} for metal-oxides, AlNiCu for metals and metal alloys, and molecular configurations containing the elements C, H, O, and N.

### A. Si dataset

This dataset was constructed using the available GAP Si potential database from Ref. 82 plus Si molecular dynamics (MD) database from Ref. 50. While the overall dataset includes various crystalline phases of Si, it also includes MD data. This dataset includes 3583 structures with 242 139 atomic environments.

### B. TiO_{2} dataset

We used a TiO_{2} dataset that was designed to build atom neural network potentials (ANN) by Artrith *et al.*^{49,58} using the ÆNET package. This dataset includes various crystalline phases of TiO_{2} and MD data that are extracted from *ab inito* calculations. The dataset includes 7815 structures with 165 229 atomic environments in the stochiometric ratio of 66% O to 34% Ti.

### C. AlNiCu dataset

This dataset is formed from two parts: single species datasets for Al, Ni, and Cu from the NOMAD Encyclopedia and multi-species datasets that include Al, Ni and Cu from NOMAD Archive. All single-species data were fetched from the NOMAD Encyclopedia after removing duplicate records with degenerate atomic environments (e.g., equivalent structures from different *ab initio* calculations uploaded to NOMAD). For the multi-species data, we used only the last configuration steps for each NOMAD Archive record, since these records include all intermediate calculation cycles, with the last configuration entry typically corresponding to a fully relaxed configuration. In our dataset, the NOMAD unique reference access IDs are retained along with a subset of their meta information that includes whether the supplied configuration is from a converged calculation as well as the Density Functional Theory (DFT) code, version, and type of DFT functionals with the total potential energies. This dataset consists of 39.1% Al, 30.7% Ni, and 30.2% Cu and has 27 987 atomic environments in 3337 structures.

### D. CHON dataset

This dataset of molecular structures was extracted from all available structures in the NOMAD Archive that only include C, H, O, and N using the NOMAD API. The same procedure of selecting only the last entries in each record was applied. This dataset consists of 50.42% H, 30.41% C, 10.36% N, and 8.81% O and includes 96 804 atomic environments in 5217 structures.

## V. ANALYSIS METHODS

Our analyses for the representations are based on the desired features of encoding structural information of the materials that are listed above as *invariance*, *sensitivity*, *dimensionality*, *differentiability*, *interpretability*, *scalability*, *complexity*, and *discrete mapping*. While many of these features are important according to the application, we focus here on the *invariance*, *sensitivity*, and *dimensionality*.

*Invariance* of a representation or a descriptor under symmetry operations such as translation and rotation is of high concern in developing mapping methods since the properties of a material should be identical under these changes in the configuration description. The structural representations are constructed to follow these conditions; otherwise, every possible transformation of the material must be included in the training of the machine learning model, leading to unaffordable numbers of permutations. Even a successful construction of such a model can lead to undesired predictions for the uncharted permutations that are not in the training datasets.

*Sensitivity* is also an important property of a descriptor since any application needs distinguishable and unique values for the descriptors. How sensitive the descriptor is to changes in the structure of the material determines the outcome of similarity analysis or molecular dynamics simulations. For example, if a descriptor produces exactly the same values for any perturbation, the outcome will be indistinguishable. In MD simulations, such insensitivity will result in inaccurate dynamics due to inaccuracies in energy and force evaluations as a result of artifacts introduced by the descriptor.

*Dimensionality* on the other hand is more related to the under- or over-determination of the feature space for the ML application. While under-determination in the mapping of hyper-dimensional space can easily lead to inaccurate predictions of an ML model, over-determination may also lead to undesired predictions according to how the over-determined features are eliminated.

### A. Invariance and sensitivity analysis

Our analysis of invariance, local stability, and sensitivity are carried out using diamond cubic Si structures. All the selected descriptors in this work start from neighbor analysis based on atom-centered perspective, as described in Sec. III, and therefore, the descriptors generate values invariant under translational symmetries by definition. However, whether they can maintain invariance under rotation needs to be verified. To evaluate this, we perturbed the 4 × 4 × 4 cubic diamond crystal Si (c-Si) as follows: rotating the structure around the *y*-axis in a non-periodic unit cell, we calculated the difference of descriptors and representations *dV*_{all} from the reference non-rotated structure [see Fig. 2(a)].

For the sensitivity analysis, we perform three types of perturbations to a 4 × 4 × 4 c-Si unit cell as follows:

**Mixed Perturbation**: the central Si atom with dark blue color in Fig. 2(b) is moved along the [100] direction (i.e., line joining the light green atoms on the*x*-axis) within a periodic supercell by a distance*dx*ranging from 10^{−8}Å to 0.1 Å.**Radial Perturbations**: The atoms in the groups of first, second, third, and fourth neighbor atom shells at different distances from the central dark blue atom are perturbed along radial and tangential directions [see Fig. 2(c)]. For the radial perturbation, atoms in each shell are moved along the vector that separates the atom from the central atom. Position change*dx*ranges from 10^{−8}Å to 0.1 Å.**Tangential Perturbation**: Neighboring atoms in the same shells are perturbed on the sphere inscribed by the distance vector*R*_{ij}[see Fig. 2(c)]. This perturbation only changes angular contributions of the descriptors since the radial distances*R*_{ij}are kept fixed. Position change ranges from 10^{−5}Å to 0.1 Å.

For case 1, we look at the difference in the full structure representation *dV* (comprising all atomic descriptors) with respect to the unperturbed crystal to investigate whether the full representation is sensitive to small perturbations of a single atom in the structure. In this mixed perturbation test case, all the neighbor distances for the perturbed central atom change from the perspective of the central atom *i*, and the difference *dV* from the reference unperturbed structure depends on both radial and angular contributions.

For cases 2 and 3, we look at the difference of the descriptor *dV*_{i} of the central atom *i* with respect to an unperturbed neighborhood. This allows us to test the sensitivity of the individual radial and angular contributions.

#### 1. Sensitivity to perturbations

We consider the question of whether the representation $V=V({Rij}j)$ changes in a locally smooth and stable manner near some reference configuration $R^={R^ij}j$. Small changes in the configuration should lead to proportional changes in the representation, a property that we called *sensitivity*. This requirement is necessary to represent an arbitrary smooth function with the same symmetries and to inherit its regularity, which in turn is key to obtain accurate fits with few parameters or basis functions.

To illustrate this concept, consider a “feature map” $v:R\u2192R$, which is strictly increasing and hence invertible, but assume that *v*(*x*) ∼ *a*_{2}*x*^{2} as *x* → 0. It follows that $x(v)\u223cv/a2$, i.e., the inverse has a singularity. Suppose now that we wish to represent the linear function *f*(*x*) = *x* as *f*(*x*) = *g*(*v*(*x*)), then *g*(*v*) = *f*(*x*(*v*)) = *x*(*v*), i.e., it inherits the singularity which makes it challenging or even impossible to obtain an accurate fit.

In general, we consider paths $R(t)={Rij(t)}j$ with $R(0)=R^$ and expand the change in the descriptor to leading order,

for some *k* ≥ 1. We call a descriptor *V* linearly stable $R^$ if *k* = 1 for all possible perturbation paths, i.e., if the change in the descriptor is linear as the perturbation amplitude *t* → 0. If *k* > 1, then we call *V* linearly unstable at $R^$.

In our sensitivity analyses, we choose different perturbation paths *R*_{ij}(*t*) leading to different paths in descriptor space *V*_{t}. From Eq. (30), we then obtain

that is, we can observe the stability or instability of a descriptor by analyzing the slopes on a logarithmic scale. A linear slope, i.e., linear stability, is guaranteed to fulfill our *sensitivity* requirement. However, in certain high symmetry settings, this requirement must be relaxed, as we will see in Sec. VI A.

### B. Dimensionality analysis

The descriptors analyzed here are all constructed from feature sets extracted from the structural mapping of the atomic neighborhood density to hyper-dimensional spaces. Their central objective is the requirement to cover all possible perturbations of the structure to ensure a faithful representation to the ML model. However, strictly following this concern can lead to over-determination in the hyper-dimensional space. In other words, the representation may cover only a small subspace of the full hyper-space, with the subspace depending on the parameter set used. In the case of over-determination, feature sequences may contain many zero entries or, for multi-species systems, may need to be padded with zeros to account for species missing from individual environments. Both of these cases lead to high sparsity in the descriptors, which, in principle, could be eliminated by carefully selecting parameters to remove unnecessary features from the final descriptor sets and hence from the representations. Moreover, using an over-determined mapping is likely to induce *overfitting* and noise in subsequent ML training. In ML applications, such non-informative data should be eliminated before the actual training of the model to reduce the error in the training and increase the accuracy of the resulting models.

Due to the well-known *curse of the dimensionality*, the dimension of the parameter space of a global optimization problem is the key determiner of the difficulty of obtaining optimal solutions. When representations form the input data for an optimization problem, their *dimensionality* thus has a crucial role in determining complexity. As the dimensionality increases, the number of possibilities rises combinatorially, drastically hindering the task of optimization.

To keep the features at an affordable level for optimization while maintaining an accurate description, one can use dimensionality reduction techniques such as CUR decomposition,^{70} farthest point sampling (FPS),^{70} Pearson correlation coefficient (PC),^{70} and principal component analysis (PCA).^{50,52} While these techniques help identify the most informative features in the descriptors, they also help analyze how the features of the descriptors can provide sufficiently informative data through an analysis of its “compressibility” and whether the representation leads to an over-determined embedding. Here, we have used CUR and FPS, as implemented in Ref. 70, and PCA to select the optimum number of features for the descriptors.

### C. Analysis of dimensionality with regression

A key question for models based on representations is how the outcome of predictions depends on the dimensionality of the representation. To assess this, we analyzed the model prediction errors of representations on the structures and total potential energies, *E*_{DFT}, taken from the Si GAP fitting database in Ref. 82. This is a subset of the complete Si *dataset* used elsewhere in this work, chosen here to allow comparison with results for the same dataset in Refs. 73, 82, and 83. For the target values **t** in Eq. (1), we use the cohesive energies, i.e.,

where $Nkat$ and *E*_{0} are the number of atoms in each *χ*_{k} structure and the energy of an isolated Si atom, respectively. The cohesive energy of structure *χ*_{k} can then be estimated via the linear combination

where the index *v* runs over all *N*_{f} features within the representation.

For a global representation of a structure such as in MBTR, the **V**_{k} are taken directly as the representation vectors for each structure *χ*_{k}. For atom-centered descriptors, however, we first construct a global representation by summing over the descriptor vectors for each atom $Vk(i)$, i.e.,

In both cases, we normalized the **V**_{k} representations using the training dataset for each representation method.

The cohesive energies can now be estimated by minimizing the *L*_{2}-regularized quadratic loss function,

To estimate the coefficients **c**, we build two regression models based on (i) linear ridge regression with *L*_{2}-norm regularization, obtaining the least-squares solution with the QR method (RR) defined in Refs. 73 and 83 and (ii) kernel ridge regression (KRR) as detailed in Ref. 64.

For the linear ridge regression case, the regularized least squares problem becomes

where **t** is a vector comprising all the target cohesive energies, **Ψ** is an *N*_{k} × *N*_{f} matrix with representations along each rows and structures down each column, and *λ* is a regularization parameter.

For the kernel ridge regression case, we use the **V**_{k} representation to build an *N*_{k} × *N*_{k} kernel matrix with elements

where *γ* is a lengthscale hyperparameter. After constructing the kernel matrix, one can then predict the cohesive energy for a new configuration with representation **V** as

where **c** = **t**^{T}(**K** + *λ***I**)^{−1} and *κ*_{k} = *K*(**V**_{k}, **V**).

## VI. RESULTS AND DISCUSSIONS

### A. Sensitivity

#### 1. Sensitivity to rotations

In Fig. 3, we present the norm of the difference vector *dV* between the full structure representations of the rotated and the reference c-Si system for each approach considered.

Our analyses show that all descriptors maintain the rotational invariance with high precision, with all errors below 10^{−9} (above a machine precision *ϵ* of ∼10^{−16}). Together with built-in invariance with respect to translations and permutation of like atoms of all the representations based on density projections, our analysis indicates that all approaches considered in this work fulfill the properties of *invariance* in rotation, translation, and permutation for the structural representations.

Considering the wide range of outcomes on the rotation tests, we ascribe the error to the numerical precision differences, i.e., floating point roundoff error of the underlying codes. For the sake of comparison of the outputs from different approaches considering the numerical precision of underlying codes, we selected 10^{−8} as the lower bound in all our subsequent sensitivity analysis following the lowest precision observed in these rotation tests.

#### 2. Sensitivity to perturbations

Further analyses are carried out for the sensitivity of representations under the atomic motions in structures, as described in Sec. V A. Recall from Sec. V A 1 that a slope of one indicates local stability (also smooth invertibility of the descriptor), while a slope greater than one leads to singularities in the representation.

Figure 4 shows the results for the mixed perturbation with representation changes ∥*dV*∥ normalized so that all curves pass through the point (0.1, 0.1) to enable direct comparison. This lets us to provide an absolute comparison for ∥*dV*∥ metric between reference and perturbed structures. All representations have linear sensitivity within the entire range of the path, except for MBTR, which shows a mild preasymptotic sign of instability (change in slope from 1 to 2 above 0.01 Å), which is unlikely to cause any significant deterioration in the stability of the representation. This is apparent here since MBTR is a global representation method. Compared to other atomic descriptors, the results for MBTR show that its *histogram of angles* leads to a smooth change in the slope. In the atomic descriptors, this calculation includes contributions from all descriptors, and hence, the slope is dominated by radial contributions, which are discussed detailed in Sec. VI A 3.

#### 3. Radial perturbation of reference crystal

A deeper analysis of the sensitivity of representations can be made by analyzing the responses of the atomic descriptors to different perturbation modes. As described in Sec. V A, we calculated the change in the descriptor of the central atom *i* to a radial perturbation of a neighbor *j*, as shown in Fig. 2(c). The sensitivity curves corresponding to the radial perturbation of a neighboring atom in the fourth shell are given in Fig. 5(a). All descriptors have slope-1 sensitivity curves, indicating linear stability under this perturbation. This is unsurprising since all descriptors provide a relatively high resolution of the two-body histogram.

#### 4. Tangential perturbation of reference crystal

Next, we repeat the test of the foregoing section with a tangential perturbation of an atom in the first shell. The resulting sensitivity curves are given in Fig. 5(b), clearly showing slope 2 for all descriptors. Thus, according to Sec. V A 1, *all* descriptors are unstable with respect to tangential perturbations, raising concerns due to the resulting singularity in the inverse of the descriptor map. However, the origin of this instability is invariance with respect to reflections about a plane, and fitting any target function with the same reflection symmetry need not be affected by the singularity in the inverse descriptor map.

Concretely, let *i* denote the center atom and *k* denote the neighbor that is being perturbed in the tangential direction, i.e.,

where *dR*_{ij} = 0 for *j*≠*k* and ∥*dR*_{ik}∥ = 1 and $dRij\u22a5Rij0$. If *R*^{0} is symmetric under reflection through the plane that contains the origin and is orthogonal to *dR*_{ik} (this is the case here), then the configuration *R*^{−t} is the reflection of *R*^{t} to within *O*(*t*^{2}) accuracy. Since all descriptors *V* we consider are invariant with respect to reflections, they necessarily satisfy $ddtVt|t=0=0$ (this is true for any function of the distances *r*_{ij} and the cosines *A*_{ijk} = cos *θ*_{ijk}), and hence, *V*^{t} ∼ *a*_{2}*t*^{2} as *t* → 0. In particular, the inverse descriptor map *V* ↦ *R* must contain a square-root singularity along the path *V*^{t}.

On the other hand, assuming that we aim to represent a property, e.g., site potential, $\u03f5=\u03f5({Rij}j)$, then *ϵ* will also satisfy this reflection symmetry, which indicates that the square-root singularity is again removed. To illustrate this point further, we modify the one-dimensional example from Sec. V A 1: Assume that we wish to represent *f*(*x*) = *x*^{2} = *g*(*v*), and then, *g*(*v*) = *f*(*x*(*v*)) = *x*(*v*)^{2} ∼ *v*/*a*_{2} as *v* → 0, i.e., the singularity is removed in this case. More generally, this occurs whenever *f*(*x*) ∼ *b*_{2}*x*^{2} as *x* → 0, i.e., when *f* is symmetric about the origin to the leading order.

#### 5. Tangential perturbation of a perturbed crystal

The analysis of the previous paragraph suggests that any perturbation in the configuration of the structure that breaks the symmetry should lead to the linearly stable slope-1 cases. We therefore test which descriptors are capable of capturing this symmetry breaking. We break the symmetry in several ways: we perturb an atom in a random tangential direction, we perturb atoms in the second shell that does not exhibit the same reflection symmetry, and we perturb the reference crystalline structure in the radial direction from a chosen center atom *i* before applying these tangential perturbations. The results are shown in Figs. 5(b)–5(d).

As predicted, any such symmetry breaking leads to changes in the slopes of sensitivity curves of descriptors in the limit *t* → 0. However, there are differences across descriptors how well the symmetry breaking is captured. First, there are some variations across descriptors how significant the pre-asymptotic slope-2 regimes are, which indicate a reduced sensitivity. However, the most concerning effect is the “dip” in the CHSF descriptor in Fig. 5(c), highlighting a region of significantly reduced sensitivity (it can almost be thought of as a *blind spot*) for atomic displacements in the descriptors, where the perturbation does not change the output values of representations. To test whether adding additional features can remove this dip, we implemented an extended CHSF descriptor, labeled CHSF-mix, for which the radial and angular histograms are fully mixed, giving a similar description of the three-body histogram as SOAP and ACE do. This addition clearly removes the reduced sensitivity regions.

### B. Dimensionality of representations

In the second phase of our analyses, we consider four different datasets selected from those described in Sec. IV, namely, Si, CHON, AlNiCu, and TiO_{2}. Each dataset contains a diverse range of configurations with thousands of structures. To identify how the dimensionality of the representations changes with different datasets using the same parameters, we used CUR and FPS feature selection techniques and analyzed the reduced dimensions of the representations by comparing them with the outcomes of PCA calculations. This analysis can also be accounted as a measure of the *compressibility* of each representation.

#### 1. CUR decomposition

As a first step, we analyzed the representations including all element-wise descriptors. MBTR is considered as a representation-only approach as its output cannot be broken down to element-wise descriptors. In Fig. 6, the total error between the full feature sets of each dataset and the reduced feature sets that are extracted from CUR analysis is presented. As each approach provides a different number of features for the datasets at hand depending on the selection of (hyper)parameters, we provide a complete list of the number of non-zero features in each representation with the corresponding datasets in Table I and in the legends of each panel in Fig. 6.

Desc. . | Si . | CHON . | AlNiCu . | TiO_{2}
. |
---|---|---|---|---|

ACSF | 51 | 462 | 282 | 145 |

ACSF-X | 57(195) | 634(1644) | 544(1017) | 534 |

SOAP | 450 | 6660 | 3780 | 1710 |

SOAPlite | 450 | 4500 | 2700 | 1350 |

CHSF | 20 | 44 | 44 | 44 |

ACE | 30 | 30 | 30 | 30 |

MBTR | 182 | 5000 | 2400 | 900 |

Desc. . | Si . | CHON . | AlNiCu . | TiO_{2}
. |
---|---|---|---|---|

ACSF | 51 | 462 | 282 | 145 |

ACSF-X | 57(195) | 634(1644) | 544(1017) | 534 |

SOAP | 450 | 6660 | 3780 | 1710 |

SOAPlite | 450 | 4500 | 2700 | 1350 |

CHSF | 20 | 44 | 44 | 44 |

ACE | 30 | 30 | 30 | 30 |

MBTR | 182 | 5000 | 2400 | 900 |

After removing any features of ACSF-X that are all zero from Si and AlNiCu dataset [see Fig. 6(a)], the selection method CUR in Sec. V B is applied throughout the dataset and features from the full feature set are selected one-by-one and added to the new feature set by calculating the error with respect to the full feature representation. Using this method, the features that contribute most to the representation can be selected. As the lower contributions are added to the new feature set, the error cannot be reduced more and the overall error becomes constant, equal to zero within numerical precision. The number of features that are selected at the beginning of this plateau can be counted to determine the size of the compressed representation, and conversely, the number of remaining features can be thought of as the over-determination of the representation.

In Fig. 6, one can see three types of results: (i) those with error curves that gradually decrease to the point where the error plateaus as for the TiO_{2} results of ACE, ACSF-X, and MBTR, (ii) errors decrease step-wise to a plateau such as in SOAP results, and (iii) where errors drop rapidly as for the AlNiCu results in ACSF, ACSF-X, ACE, and CHSF.

SOAP and SOAPlite, in Figs. 6(c) and 6(d), respectively, can be directly compared since they differ only in the choice of radial basis function. The results for the Si, CHON, and TiO_{2} datasets can be examined for both approaches. While SOAPlite has close to exponential decay up to ∼70% of selected features, in SOAP, this regime extends only up to 30%. After this, the SOAP error has a more step-like character. Similar behavior can also be seen in type (iii) results such as ACSF, ACSF-X, CHSF, and ACE. For these methods, while the first ∼10% of features vary the error reduction, the rest of the features only slightly reduce the error. For the AlNiCu dataset, SOAP and SOAPlite have substantially different outcome from the rest of the datasets, which is an indication on how the choice of radial basis can be a key determiner for the relevance of features in the representations. One can also see the dominance of radial basis in the global representations when Fig. 4 is compared with Fig. 5. The optimal choice of basis functions for descriptor performance is another open question and is outside the scope of the present work. We leave further investigation of the basis functions’ role on the sparsification and descriptor performance for future work.

In most of our dimensionality reduction results, one can see distinct constant error regions associated with over-determination. These features should ideally be eliminated before using the representation in ML models. For example, consider the results for ACSF-X and ACSF in Figs. 6(a) and 6(b). As the aim of ACSF-X is to extend the number of features in set from the widely used and well-tested standard ACSF parameter set, some of the features are expected to be irrelevant for representing structures in our datasets. It is thus unsurprising that the dimension reduction analyses in Fig. 6(a) show that there, only fractions of features contribute significantly—around 14% for Si, 33% for CHON, 54% for AlNiCu, and 33% for TiO_{2}. A smoother feature reduction can be seen for the standard ACSF parameter set in Fig. 6(b), where the error decay is very similar with about 85% of the total features sufficient across all four datasets and thus around 15% of redundant features.

A similar result is seen for SOAP, where around 75% of the features of SOAP representations are sufficient to cover the structural variance across all datasets. This result is more striking than that for ACSF since SOAP has about an order of magnitude more features in its representation. We can conclude that both ACSF and SOAP are robust approaches that cover the hyper-dimensional space of structural representation for a wide-range of crystals and molecules.

The ACE AlNiCu results are significantly different than the other datasets. To identify whether the degree of the polynomial is the reason for this pattern, we carried out additional analyses with ACE, increasing the degree of polynomial from 6° to 15° in steps of 3°, as shown in Fig. 6(h). Increasing the polynomial degree significantly increases the number of features; however, we find that the percentage of selected features on the final representation does not change significantly and the pattern of error decay is still quite different from the rest of the datasets (e.g., Si, where 25% of the features can be removed from the representation although it has order of magnitude less features in the descriptor vectors than with 15° of polynomial expansion.

To further investigate the extensive redundancy identified for MBTR features across all four datasets, we consider whether the discretized smearing of positions and angles used by the MBTR representation leads to clustering of the features representation space. To identify any clustering of the data, we perform standardization of the features for all the representations of Si that are generated by MBTR and show the feature selection curve in Fig. 6(g) labeled Si(STD). When compared with Si curve, Si(STD) does reduce the error when less than ∼20% of features are selected by around an order of magnitude in comparison to the raw representation. However, standardization suggests an even smaller feature set selection of about only 20% of the full feature set. This may be due to the Gaussian smearing with *D* in MBTR that significantly increases the correlation between features.

#### 2. Principal component analysis

The CUR selection process is closely related to the *principal components* of the representation data for each dataset. To show whether these principal components are related to the final feature selections in CUR, we further analyzed the datasets using PCA. In Fig. 7, the fraction of variance explained by the principal components of the four datasets is given for each representation. Since the PCA variances decrease from one to very small values, we determine a lower bound after which we consider the variance to be zero, shown as the *zero level* in our figures. PCA variation results for principal components follow the same trends as the CUR curves discussed above. Although CUR results show directly the hyper-dimensional space of the representation, PCA results are not solely based on the selected features but are a collective property of all features based on the covariance matrix from which the principal components are extracted. As seen in Fig. 7, the outcome of this selection following the highest to lowest variances and extraction of the features with highest values in the covariance matrix gives similar results to the CUR decomposition.

#### 3. Farthest point sampling

CUR and PCA are both based on Singular Value Decomposition (SVD), selecting the features as orthogonal dimensions of the hyper-dimensional space and treating each feature as linearly independent since linearly dependent features cannot span the space. However, neither of these methods consider if the selected features have non-linear dependencies on other features. Another approach to select features without using SVD decomposition is Farthest Point Sampling (FPS). In FPS, the feature selected at each iteration is chosen as the farthest from those already selected. Hence, FPS does not provide information on whether the selected feature is indeed linearly independent of those already selected. This can be understood by considering each features’ distance from the others at a time late in the process when there are few remaining features to be selected. These remaining features either represent very small distances as minor additions to the previously selected and clustered features or repeat similar distances.

In Fig. 8, the FPS results for the feature selection show that there are significant differences in error reduction and hence dimension compression for SOAP, SOAPlite, ACE, and MBTR representations in comparison to our earlier CUR results. However, these FPS results do not allow insight into each features’ contribution as a dimension in the hyper-dimensional space. The relatively small reduction in errors in FPS selection or the constant regions are due to its selection criteria, which is based only on the hyper-spatial distance. This poses a limitation to the analysis if one would like to find the full extent of the representation and remove non-informative features from the descriptors. However, determining the cutoff for the features according to the error is not obvious since there may not be a plateau in the error—as is seen for in ACSF-X, where only 25% of features contribute—but instead a more gradual decrease as in the results for all other representations.

### C. Regression with representations

The RR and KRR methods are applied to all representations using a subset of the Si *dataset* (see Sec. V C), and root-mean-square error (RMSE) is calculated as a function of the number of features in each representation.

For each representation, the regularization parameter *λ* used in both RR and KRR methods and the *γ* hyperparameter of **K** in KRR were chosen to optimize the convergence of the RMSE in the predictions on an independent test set.

In Fig. 9, we present our regression results for the two ridge regression methods RR and KRR in the first and second rows of the figure, respectively. While RMSE of both training (thick lines) and test datasets (thin lines) is shown in Figs. 9(a) and 9(c) for each regression method and for the top 65 features that are selected by the CUR method, the RMSE of full feature extension of representations is provided for comparison in Figs. 9(b) and 9(d).

Our intention here is not to provide the best potential energy surface (PES) estimator for each representation method, since this has already been extensively studied in the literature but instead to provide a common metric to compare different representations. Here, by fitting those methods to same dataset, we can assess how each method performs under this task and, in particular, how the dimensionality of each method effects the outcomes.

The effect of the dimensionality of the feature space can be seen from the RR outcomes of representation for the full feature set. In Fig. 9(b), one can identify plateaus at three typical RMSE values: 0.9 eV/atom, 0.1 eV/atom, and 0.04 eV/atom. While RMSE for both training and test dataset of SOAP has the highest accuracy with 0.04 eV/atom, MBTR and CHSF have the lowest accuracy of around 0.9 eV/atom. The rest of the methods have final errors of around 0.1 eV/atom. As the ridge regression is a linear fit to representation features, the RMSE can be compared directly between methods. The results indicate that SOAP has enough features to cover the structural variation, but convergence is slow using a linear fit. However, using a non-linear fit as in Fig. 9(d) outperforms RR. While SOAP has still the highest accuracy, albeit with a significant split of RMSE between test and training datasets, ACSF, ACE, and SOAPlite result in predictions that are at least an order of magnitude higher in accuracy in KRR than in RR.

To identify the role of the most important features in each representation on the prediction RMSE, we analyzed the RMSE of methods while only using the top-ranked selected features from CUR up to 65 features. The RMSE of RR shows that ACE and ACSF have very similar results up to 50 features, where they reach an accuracy comparable to that of 65 features with the SOAP or SOAPlite models. When non-linearity is introduced in the KRR models in Fig. 9(c), the difference is even more stark, for example, CHSF with only 20 features has comparable accuracy with SOAP at 65 features.

These results raise the question: what is the best feature set for a given representation or, equivalently, what is the necessary dimensionality for a given representation for a specific dataset or material? In Fig. 9(d), we can see that RMSE of training and test datasets for SOAP and ACSF starts to diverge at about 50%, and 70% of features, respectively. These points coincide with the region in Fig. 7 where the variance of these representations drops below about 10^{−12}. Similar results can also be observed from the KRR results for MBTR, CHSF, and ACE where either RMSE does not reduce more with the addition of new features such as in ACE beyond 45% features or RMSE of the test dataset significantly increases as in MBTR after 40% features are selected.

These results show that there are clear inconsistencies between the selections of features in representation and even within the same representation as two critical points need to be considered before applying a feature set to MLIPs using dimension reduction: (1) the number of features may not be indicative of complete coverage of structural representation of a method, and (2) increasing the number of features may not result in gaining benefit and may also cause overfitting.

## VII. CONCLUSION

We have carried out a comprehensive assessment of the sensitivity of atomic environment representations using several methods to analyze the sensitivity under rotation and various perturbations. Our results show that although many representations provide an overall acceptable accuracy for sensitivity, there is still room to balance sensitivities to radial and angular perturbations. We thus conclude that further investigation of how insensitivities affect applications of interatomic potentials and hence observables in MD simulations is necessary to improve ML driven simulation approaches.

We also carried out extensive dimensionality analyses of various representations, which have identified significant opportunities to eliminate unnecessary information that may reduce the accuracy of predictions from ML models. We also conducted regression tests to provide a comparison between representations as their dimensionality varies. The results show clear differences in the number and fraction of important dimensions in the different representations. This is expected to become increasingly important as more complex representations are developed and especially when incorporating property-based descriptors alongside atomic environment representations.

## ACKNOWLEDGMENTS

We thank Gábor Csányi and Albert Bartok-Partay for useful discussions. This work was supported by grants from the Leverhulme Trust under Grant No. RPG-2017-191. This work received funding from the EU’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 676580), the NOMAD Laboratory CoE.

We used data from the Novel Materials Discovery (NOMAD) Laboratory (https://nomad-coe.eu/). We are grateful for computational support from the UK national high performance computing service, ARCHER, for which access was obtained via the UKCP consortium and funded by EPSRC (Grant Reference No. EP/P022065/1). Additional computing facilities were provided by the Scientific Computing Research Technology Platform of the University of Warwick.

## DATA AVAILABILITY

The data that support the findings of this study are openly available in github.com/DescriptorZoo/Materials-Datasets at http://doi.org/10.5281/zenodo.3871650, version v1.0 that are extracted from open-access NOMAD Archive (http://nomad-coe.eu) and in github.com/DescriptorZoo/sensitivity-dimensionality-results at http://doi.org/10.5281/zenodo.4055141, version v1.1.^{84} The details of all datasets are given at Sec. IV. The corresponding citations for other data that are used in this study are available from the following publications: GAP Si potential database from Ref. 82, Si molecular dynamics (MD) database from Ref. 50, and TiO_{2} dataset from Ref. 49 through web access from Ref. 85. The codes that are used to create representations are detailed in Sec. III H.

## REFERENCES

_{2}