Faithfully representing chemical environments is essential for describing materials and molecules with machine learning approaches. Here, we present a systematic classification of these representations and then investigate (i) the sensitivity to perturbations and (ii) the effective dimensionality of a variety of atomic environment representations and over a range of material datasets. Representations investigated include atom centered symmetry functions, Chebyshev Polynomial Symmetry Functions (CHSF), smooth overlap of atomic positions, many-body tensor representation, and atomic cluster expansion. In area (i), we show that none of the atomic environment representations are linearly stable under tangential perturbations and that for CHSF, there are instabilities for particular choices of perturbation, which we show can be removed with a slight redefinition of the representation. In area (ii), we find that most representations can be compressed significantly without loss of precision and, further, that selecting optimal subsets of a representation method improves the accuracy of regression models built for a given dataset.
I. INTRODUCTION
Machine Learning (ML) as a predictive modeling tool has gained much attention in recent years in fields ranging from biology1 and chemistry2 to materials science,3,4 building on decades of successful applications in image recognition, natural language processing, and artificial intelligence (AI).5,6 While machine learning has also been increasingly popular in many fields due to its relatively simplicity of application and powerful prediction features, a key driving force is the increasing availability of information through data repositories, archives, and databases of materials and molecules that include billions of structures.3,7 In recent years, a large number of ML models for materials and molecules have been developed and many novel representation methods have been proposed.8–15 Representations are the feature sets used as input data for data-driven machine learning models, which, after training, can be used to predict quantities of interest.
Some of these databases are focused on molecular configurations, such as PubChem,16 DrugBank,17 and ChEMBL,18 which are widely used in biological, chemical, and pharmacological applications driven by high-throughput screening for drug design and novel molecule discovery. Other databases such as the Cambridge Structural Database (CSD)19 include crystals and small molecules. While these databases contain mostly molecular structures with their chemical or biological properties, recent efforts such as the Materials Project,20,21 Automatic FLOW for Material Discovery Library (AFLOWLIB.org),22,23 Open Quantum Materials Database (OQMD),24,25 and the Novel Materials Discovery (NOMAD) archive26 (which collates data from many other databases and contains data that are not available at them) now provide extensive electronic structure results for molecules, bulk materials, surfaces, and nanostructures in a range of ordered and disordered phases, with chemical composition ranging from inorganic systems to metals, alloys, and semiconductors. With recent advancements in materials databases, access is available to billions of properties of materials and molecules and millions of high-accurate calculations through online archives. With such an extensive and diverse collection of molecular and materials data to hand, we can ask questions such as how to identify of the most informative subset of data for a particular class of materials and how best to interpret it for prediction through ML models. These questions are relevant for both classification and regression applications: for example, classifying materials as metallic or semiconducting or predicting the bandgap from structure, respectively. In both cases, the input data are constructed from information ranging from basic physical and chemical properties to precise geometrical information based on atomic structure and bonding topology. There have been many studies to determine these input information, which are named descriptors, fingerprints, or representations13,27–32
Here, we define the prediction problem as
where t is the target property of the material, χk represents the structure with index k within the database, Ma is the machine learning model with identifier a, and Vb is the input representation determined with identifier b that is used for optimizing f. We define representations as
where xj are the coordinates of atom with index j within structure k, αm is a physical or chemical property of the material and chemical environment, hi is the descriptor mapping from the input space to hyper-dimensional space with each dimension being an input feature of model M, and g is the encoding function that combines all of the hi descriptors to make an overall representation. In Eq. (1), the optimization of the parameters of f can be performed either for a single model M or for multiple models M1, M2, … in combination with potentially multiple different representations Vb.
From this general perspective, many approaches can be combined to form a predictive model, raising the critical open question of how to optimally choose which pieces of information are needed to represent a material through descriptors and how to combine these to form representations. These can vary from defining sets of physical and chemical properties and atomic geometries to specialized hyper-dimensional mapping algorithms (see Fig. 1). These representations and their building blocks and the descriptors can be classified into three broad classes based on their construction: (i) atomic neighborhood density definitions, (ii) topology expansions, and (iii) property-based selections. The representations can be further grouped according to their combination rules, mapping basis or filtering functions, histograms, connectivity maps or graphs, and finally direct contributions from material properties. Within each category, there have been various developments of representations and many successful applications, which we review briefly below in Sec. II, before specializing on atomic neighborhood densities in Sec. III. Section IV lists the data sources used here for evaluating representations, and Sec. V describes our analysis methodology. Results and discussion follow in Sec. VI.
Classification of representations based on their method of construction (horizontal axis) and when they were first proposed (vertical axis). QSAR and SISSO do not indicate representations but instead indicates the representations that are constructed or selected using these methods (see the text). a,b,c Representations that are classified with multiple methods: direct and connectivity, histogram and mapping functions, and connectivity and mapping functions.
Classification of representations based on their method of construction (horizontal axis) and when they were first proposed (vertical axis). QSAR and SISSO do not indicate representations but instead indicates the representations that are constructed or selected using these methods (see the text). a,b,c Representations that are classified with multiple methods: direct and connectivity, histogram and mapping functions, and connectivity and mapping functions.
II. REPRESENTATION CLASSES
A. Property-based representations
The idea of classification of molecules based on the relationships between their structure and the resulting activities or properties is an underpinning approach of modern chemistry.33 However, using theoretical descriptors to identify the relations was first utilized for quantitative structure–activity/property relationship (QSAR/QSPR) modeling,33,34 starting with the work of Wiener and Platt in 1947, who used indices based on chemical graph theory.33 We use the name QSAR refer to both QSAR/QSPR models.
QSAR models are highly successful predictive modeling approaches in chemistry and biology in various applications.33,34 These models are determined using sets of descriptors that are defined through a selection of physical, chemical properties of materials and/or structural descriptors of molecules such as atoms and their molecular bonds from chemical topology information to graphs where the fractions of the bonding information are simplified to indices.33–35
The success of QSAR mostly depends on the choice of descriptor sets and, in particular, determining the optimum number of descriptors in such sets.34 For example, compressed-sensing approaches have been developed to systematically select the optimal set of the descriptors.35,36
Direct property-based approaches also employ reduction techniques to identify optimal models {Ma} in Eq. (1) by choosing mathematical operators for g that combine descriptors from a given set of {hi} with xj, {αm}, or both. This optimization scheme, referred to as sure independence screening and sparsifying operator (SISSO), has been successfully applied as a classification technique to group materials (e.g., metals vs non-metals) and as a regression method to determine a target property of interest.37
The connectivity information of each species in the chemical stoichiometry has also been introduced recently with the atom2vec13 method as a representation where a single atom-to-environment connectivity matrix can be constructed. This simple approximation enables the approach to be used to screen billions of materials and make predictions such as identifying possible candidates for Li-ion battery elements38 based only on chemical composition and stoichiometry.
B. Topology-based representations
In this group of descriptors, topological information is derived from the full connectivity graph of the atomic environment instead of simplified indices. Examples of this class of methods include the Coulomb and sine matrices,41 n-gram graphs,42,43 and graph-based Neural Network (NN) models such as DTNN,44 SchNET,12 and CGCNN.45 While Coulomb and sine matrices use atom-to-atom connection information directly and are thus based on the inferred chemical bonding between atoms, the input representations of graph neural network models are defined through neighborhood analysis of atomic connectivity graphs and can be constructed from the full n-body topology.
Another approach to use topology information such as bonds, angles, and higher-order many-body contributions is to group them into sets of building blocks and apply histograms on these blocks over the input datasets. Several successful implementations of these approach are Bag-of-Bonds (BoB)46 and its analogy to higher order contributions, for example, bag-of-angles and k-bags.
Topology information provides atomic connectivity data and is very useful for structural similarity comparison, since this information is automatically invariant under changes in the atomic positions. Topology is a blueprint of chemical bonding definitions for materials or molecules. However, it is not a physical observable, and therefore, it cannot be constructed uniquely. For novel material discovery, one needs to use electronic structure calculations to generate new data comprising the properties and construction of structural representations of uncharted material compositions. However, such database construction for new structures is extremely expensive.
C. Atomic neighbourhood density representations
Modeling materials and molecules at the atomic scale is essential to identify novel materials for applications of interest or to determine candidate molecules with a specific function in a desired medium. When accuracy and reliability of concern, preference is usually given to ab initio calculations. However, these first-principles calculations are inaccessible for systems requiring long time scales and/or large numbers of atoms. To significantly reduce the computational cost, interatomic potentials or force fields are employed to define the interactions between atoms using parametrized forms of functions in place of the electronic interactions. Although these approaches allow larger numbers of atoms to be included in simulations and provide access to much longer time scales, their prediction accuracies are typically limited as a result of simplified models describing interatomic interactions.
A prominent recent trend has been to address this limitation with machine learning interatomic potentials (MLIPs), i.e., by replacing fixed functional forms for the atomic interactions with data-driven machine learning models that map from atomic positions to potential energy surface (PES), via descriptors based on the local neighborhood of each atom. This approach based on atomic-densities extraction has been applied in pioneering models using symmetry functions8 and, more recently, the so-called smooth-overlap of atomic positions (SOAP).9,47 Symmetry functions are widely used in neural network interatomic potentials (NNP)8,32,48,49 successfully for many molecules and materials, including water, crystals with various phases,50 and amorphous solids in various concentrations,51,52 while SOAP is typically incorporated in Gaussian Approximation Potentials (GAP).9,47 GAP has been applied to many molecules and materials.4,53–55
In this final class of descriptors, there have been many developments in recent years. In addition to atom-centered symmetry functions (ACSF) and SOAP descriptors, different basis expansions such as bispectrum and Chebyshev polynomials have also been utilized in SNAP potentials56,57 and NNPs,58 respectively. Further advances are also provided in ML models based on atomic and many-body expansions with tensor representations such as Many-Body Tensor Representation (MBTR).10
An alternative approach is to employ linear regression using a symmetric polynomial basis. This approach was pioneered by Bowman and Braams,59 while more recent symmetric bases are the Moment Tensor Potentials (MTP)15 and the Atomic Cluster Expansion (ACE).60 The MTP and ACE basis are also based on density projections and therefore closely related to the descriptors described above. In particular, ACE can also be seen as a direct generalization of SOAP and SNAP, providing features of arbitrary correlation order.
Recent works have demonstrated that the atomic-density representations can be unified in a common mathematical framework61 and that the descriptors can be decorated with additional properties to extend their capabilities, such as in λ-SOAP representations for learning from tensor properties.61
For all these ML models, different combinations of descriptors and representations have been used in many studies and for a wide range of materials.55,62,63 As these descriptors and representations are general structural identifiers, they can also be utilized within different regression models other than the ML models for which they were primarily developed. A recent work also shows the comparison of different regressors with these representations.43 While different approaches have been utilized in these studies, comparative assessments of the representation approaches have thus far been limited. Studies for the property and topology based descriptors on the QSAR feature sets and the performance of graph based NN models show that selection of optimal descriptor sets is very important to ensure high predictive accuracy of the ML models.63 A number of studies of the atomic neighborhood density approaches provide analyses primarily based on trained models’ precision on available datasets,63 but so far, there is no performance assessment on descriptors and representations for materials and molecules focused solely on structure to hyper-dimensional encoding.
Following the works by Huo and Rupp10 and Jäger et al.,64 we identify the essential properties of descriptors and representations for encoding materials and molecules as follows:
Invariance: descriptors should be invariant under symmetry operations—permutation of atoms and translation and rotation of structure.
Sensitivity (local stability): small changes in the atomic positions should result in proportional changes in the descriptor, and vice versa.
Global uniqueness/faithfulness: the mapping of the descriptor should be unique for a given input atomic environment (i.e., the mapping is injective).
Dimensionality: relatedly, the dimension of the spanned hyper-dimensional space of the descriptor should be sufficient to ensure uniqueness, but not larger.
Differentiability: having continuous functions that are differentiable.
Interpretability: features of the encoding can be mapped directly to structural or material properties for easy interpretation of results.
Scalability: ideally, descriptors should be easily generalized to any system or structure with a preference to have no limitations on number of elements, atoms, or properties.
Complexity: to have a low computational cost so the method can be fast enough to scale to the required size of the simulations and to be used in high-throughput screening of big data.
Discrete mapping: always map to the same hyper-dimensional space with constant size feature sets, regardless of the input atomic environment.
In this article, we concentrate our efforts on analyzing sensitivity (i.e., local stability) and compressibility (i.e., dimensionality) of the selected set of descriptors and representations and do not address uniqueness/faithfullness; for an interesting recent investigation of this important issue, see Ref. 65. Scalability, complexity, and discrete mapping are also not addressed here as these are mainly related to the implementation and cost of methods in ML models, such as the evaluation costs of MLIPs or scalability limits of ML models, which are outside the scope of this work. On the other hand, the hunt for the interpretability of ML models considering the construction of the descriptors is not a new concept and has been studied in the context of QSAR models66 and many-body descriptors.67 As also discussed in property-based representations, interpretability is automatically inherited by models that are built directly from properties. As the concept depends on the application of the ML models, we leave addressing this for future studies. While there have been many related efforts on the analysis of property and topology based descriptors,35–37 we choose to focus here on the atomic neighborhood density based descriptors along with representative descriptors from three groups of mapping basis functions, tensor representations, and polynomial representations as follows: ACSF, Chebyshev polynomials in SF (CHSF), SOAP, SOAPlite, MBTR, and ACE.60 We provide an analysis of the sensitivity and local stability of the descriptors as well as testing their invariance under symmetry operations. We further assess the descriptors information packing ability using CUR decomposition where the matrix is decomposed into columns (C), rows (R), and reconstruction (U) matrices,70 farthest point sampling (FPS),70 and principal component analysis (PCA)50,52 dimension reduction techniques. We also provide analyses using linear and kernel ridge regression methods on the selected features from CUR to showcase the effect of the dimensionality of representations on ML models.
III. ATOMIC NEIGHBORHOOD DENSITY REPRESENTATIONS
The descriptors sharing neighborhood density extraction can be unified in a general atomic expansion following the works in Refs. 60 and 61 and can be defined by
where
Here, xi is the position of atom i and bi,n can be expressed with radial and angular basis functions or as a polynomial expansion. In the following, we will give a brief summary of selected atomic neighborhood density representations based on the above definition. These are Atom Centered Symmetry Functions (ACSF), Chebyshev polynomial representations, Smooth Overlap of Atomic Positions (SOAP), Many-Body Tensor Representation (MBTR), and Atomic Cluster Expansion (ACE).
A. Atom centered symmetry functions (ACSF)
The descriptors of atom centered symmetry functions were introduced by Behler and Parrinello with their atomic neural network potentials (NNP)8,68 and further used in a wide variety of applications.8,48,50,52,64,68 The method extracts atomic environment information for each atom in the configuration using radial contributions given by
and angular contributions of the form
where the distances Rij = |Rj − Ri| and angle dependent contributions are defined through their cosines via Aijk = cos θijk and cos θijk = Rij · Rjk/(|Rij||Rik|) and are invariant under symmetry operations of translations and rotations, hence the name symmetry functions. While many symmetry functions have been proposed,8 two choices of functions commonly used for br and ba in many applications8,49,52,58,69,70 take the radial function to be centered on atom i to be defined as
and the angular function centered on atom i as
where fc is a cutoff function given by
and Rc is the cutoff distance.
In this work, we used G2 and G4 with two parameter sets: one is the traditional parameter set taken from Refs. 8, 48, 49, and 68 that is used in many ACSF-based NN potential models and the second one is extracted from the automatic ACSF parameter generation proposed in Ref. 70. For the rest of this work, we label the representation of the original parameter set ASCF and the newer systematic extended parameter set ACSF-X, for which we take the same parameters as in Ref. 70.
B. Chebyshev polynomial representation within symmetry functions (CHSF)
By introducing radial and angular basis that are invariant under translation and rotational symmetries, one can define different functions that provide invariance under symmetry operations. Another example that exploits this idea defines radial and angular functions with
where Nα is the expansion order and the basis functions ϕα and their duals are defined in terms of the Chebyshev polynomials Tα via
In Eqs. (10)–(12), α is the order of the polynomials, k = 1 except for α = 0 where k = 2, and (x, Tc) are taken to be (r, rc) for radial functions and (Aijk, π) for angular functions, respectively.58 The atomic descriptors Vb are then defined using the set of coefficients and corresponding to and with
where wt is the weight for species t. For single-species configurations wt = 1, for multi-species configurations, both structural and compositional parts contribute to the final descriptor.
A practical advantage of using polynomial expansion for radial and angular functions is the reduced number of input parameters as the only parameter for the expansion is the expansion order Nα of the Chebyshev polynomials. In this work, we select Nα = 9 for both radial and angular polynomial functions.
In this representation, the radial and angular contributions br and ba are separate functions of distances and angles, respectively. br is defined by Eq. (10) and provides a histogram of distances present in the atomic environment. However, angular contributions ba are significantly different. While the Chebyshev polynomial variant defines ba to be a histogram of angles using only Aijk in Eq. (11), ACSF combines both distances and angles in Eq. (8) and thus defines histogram of triangles.
C. Smooth overlap of atomic positions (SOAP)
Descriptors can be constructed for extracting neighboring atomic environments using the smooth overlap of atomic positions (SOAP) approach.9 In this method, atomic densities centered at atom positions are defined by a sum of atom-centered Gaussians with the overall atomic density of a structure χ given by
and one can build SOAP kernel K(χ, χ′) with
where the exponent ζ > 1 and the integral is calculated over all possible rotations of the overlapping densities of χ and χ′ environments. In practice, as is elegantly shown in Ref. 9, an equivalent kernel can be rewritten in the form of by selecting a set of orthonormal radial basis functions gn(r) and angular basis functions with the spherical harmonic functions Ylm(θ, ϕ) to expand the atom centered density at atom i with
and using the power spectrum of the expansion coefficients , given by
where n and n′ are indices for the radial basis functions and l and m are the angular momentum numbers for the spherical harmonics. In SOAP, as is defined in Ref. 9, the radial basis functions are given by
in terms of polynomials. The representation of atomic environment χ is then defined by , where pi(χ) can be identified as atomic descriptors for atom i. The SOAP descriptors and representation are specified by the expansion orders nmax for the radial basis and lmax for the angular basis. In this work, for compatibility with SOAPlite and with the polynomial expansion order of Chebyshev polynomials in SF, we select nmax = 9 and lmax = 9.
D. Modified basis expansion for SOAP (SOAPlite)
Introducing a different radial basis function and treatment of spherical harmonics basis, a modified version of SOAP referred to as SOAPlite has been proposed recently.64 In this version of SOAP, radial basis functions are replaced by
where αnl are decay parameters of non-orthonormal functions ϕnl(r) that determine the decay of ϕnl to 10−3 at a cutoff radius specified by (Rc − 1)/nmax steps between 1 Å and Rc. The method also selects the real (tesseral) spherical harmonics for the angular basis as described in Ref. 64. For fair comparison, we select nmax = 9 and lmax = 9 for SOAPlite with all other parameters taken as the defaults implemented in the GAP9,47 and QUIP71,72 codes.
E. Many-body tensor representation (MBTR)
Many-body tensor representation (MBTR)10 constructs representations of structures by defining contributions from k atoms in k-body terms with gk geometry functions. In MBTR, these contributions from k atoms are smoothed with the probability distribution function D, and the resulting contributions to the representation are given by
where Z are atomic numbers, Cz is an element–element correlation matrix consisting of Kronecker δ values, wk are weighting functions, and gk are scalars for k atoms, while i and j are neighboring atoms in i = (i1, …, ik). Common selections for the functions gk are atomic number g1(Zi) = Zi, inverse distances of i–j pairs with g2(i, j) = 1/|Ri − Rj|, and angles with g3(i, j, k) = ∠(Ri − Rj, Ri − Rk). In this work, we select gr = g2 and ga = g3 for geometry functions with the exponential decay function
where β is taken to be 4.0 for k = 2 and 3.0 for k = 3. We select Gaussian distribution for D and the continuous broadened results are discretized to Nx = (xmax − xmin)/Δx values using Δx steps where Nx is 100 and xmin, xmax with intervals of [0.1, 1.1] and [−0.15, π + 0.1π] for distances and angles, respectively. MBTR is the only global representation studied in this work, since it does not include any atomic descriptors but instead provides a representation for a given structure overall.
F. Atomic cluster expansion (ACE)
The ACE60 method constructs a complete basis of invariant polynomials. Each basis function may be interpreted as an invariant feature, which can then be collected into a descriptor map. Similar to SOAP, the ACE starts with a density projection,
where gn is a radial basis. The atomic positions are not smeared as in SOAP, SOAPlite, and MBTR. Isometry invariant features are then obtained by integrating the N-correlations over the symmetry group: for , we obtain
Finally, one selects a linearly independent subset of the Bnlm basis functions. A detailed description of this construction is provided in Ref. 73.
Aside from the lack of smearing and the choice of radial basis, the two-correlation functions are equivalent to SOAP, while the three-correlation functions are equivalent to SNAP. Since the ACE construction readily applies to higher order correlations, we will use up to five-body correlations in order to test the effect of introducing significantly higher correlations into the descriptor. To control the size of the feature set, we use an a priori chosen sparse selection, as described in Ref. 73.
To complete the specification of the ACE descriptors, we must define the radial basis: here, we choose
where (Pn) is a basis of orthogonal polynomials such that gn(Rc) = (Rc) = 0. In this work, the ACE method is used as implemented in the ACE.jl package in Ref. 74.
G. Modified Chebyshev polynomial symmetry functions (CHSF-mix)
Noting the histogram of triangles provided by ACSF, SOAP, SOAPlite, and ACE through Eqs. (8), (17), and (25), we examine the contributions from the angular terms with Aijk and radial basis of CHSF. Combining both radial br and angular ba basis expansion of Chebyshev polynomials, we introduced a new ba with
where is defined as
and α index is substituted with l for angular basis and n, n′ for radial basis sets with the choice of (x, Tc) that are selected in ϕα(Rij) and ϕα(Aijk) as in Eqs. (13) and (14), respectively.
We analyze the benefit of these novel modifications over CHSF-mix in Sec. VI A using Nn = 9 and Nl = 9.
H. Descriptor implementations
All our analyses are carried out using our DescriptorZoo code75 (github.com/DescriptorZoo) that includes implementations of the CUR, FPS, and PCA analyses and uses AMP,76 Dscribe,41,77 qmml-pack,78,86 QUIP,71,72 and GAP9,47 with its Python interface quippy, æpy79 a wrapper code for ÆNET49,58,80 Fortran code of the NN ML model based on ACSF and Chebyshev polynomial descriptors (CHSF), CHSF.jl for both CHSF and CHSF-mix,81 and ACE.jl73 code for ACE representation (labeled ACE in results below).
IV. DATASETS OF MATERIALS AND MOLECULES FOR ANALYSIS
We used a wide range of materials and molecules databases to provide datasets to test the various representation methods. For diversity, we selected a range of materials and molecular systems: Si for single species tests, TiO2 for metal-oxides, AlNiCu for metals and metal alloys, and molecular configurations containing the elements C, H, O, and N.
A. Si dataset
This dataset was constructed using the available GAP Si potential database from Ref. 82 plus Si molecular dynamics (MD) database from Ref. 50. While the overall dataset includes various crystalline phases of Si, it also includes MD data. This dataset includes 3583 structures with 242 139 atomic environments.
B. TiO2 dataset
We used a TiO2 dataset that was designed to build atom neural network potentials (ANN) by Artrith et al.49,58 using the ÆNET package. This dataset includes various crystalline phases of TiO2 and MD data that are extracted from ab inito calculations. The dataset includes 7815 structures with 165 229 atomic environments in the stochiometric ratio of 66% O to 34% Ti.
C. AlNiCu dataset
This dataset is formed from two parts: single species datasets for Al, Ni, and Cu from the NOMAD Encyclopedia and multi-species datasets that include Al, Ni and Cu from NOMAD Archive. All single-species data were fetched from the NOMAD Encyclopedia after removing duplicate records with degenerate atomic environments (e.g., equivalent structures from different ab initio calculations uploaded to NOMAD). For the multi-species data, we used only the last configuration steps for each NOMAD Archive record, since these records include all intermediate calculation cycles, with the last configuration entry typically corresponding to a fully relaxed configuration. In our dataset, the NOMAD unique reference access IDs are retained along with a subset of their meta information that includes whether the supplied configuration is from a converged calculation as well as the Density Functional Theory (DFT) code, version, and type of DFT functionals with the total potential energies. This dataset consists of 39.1% Al, 30.7% Ni, and 30.2% Cu and has 27 987 atomic environments in 3337 structures.
D. CHON dataset
This dataset of molecular structures was extracted from all available structures in the NOMAD Archive that only include C, H, O, and N using the NOMAD API. The same procedure of selecting only the last entries in each record was applied. This dataset consists of 50.42% H, 30.41% C, 10.36% N, and 8.81% O and includes 96 804 atomic environments in 5217 structures.
V. ANALYSIS METHODS
Our analyses for the representations are based on the desired features of encoding structural information of the materials that are listed above as invariance, sensitivity, dimensionality, differentiability, interpretability, scalability, complexity, and discrete mapping. While many of these features are important according to the application, we focus here on the invariance, sensitivity, and dimensionality.
Invariance of a representation or a descriptor under symmetry operations such as translation and rotation is of high concern in developing mapping methods since the properties of a material should be identical under these changes in the configuration description. The structural representations are constructed to follow these conditions; otherwise, every possible transformation of the material must be included in the training of the machine learning model, leading to unaffordable numbers of permutations. Even a successful construction of such a model can lead to undesired predictions for the uncharted permutations that are not in the training datasets.
Sensitivity is also an important property of a descriptor since any application needs distinguishable and unique values for the descriptors. How sensitive the descriptor is to changes in the structure of the material determines the outcome of similarity analysis or molecular dynamics simulations. For example, if a descriptor produces exactly the same values for any perturbation, the outcome will be indistinguishable. In MD simulations, such insensitivity will result in inaccurate dynamics due to inaccuracies in energy and force evaluations as a result of artifacts introduced by the descriptor.
Dimensionality on the other hand is more related to the under- or over-determination of the feature space for the ML application. While under-determination in the mapping of hyper-dimensional space can easily lead to inaccurate predictions of an ML model, over-determination may also lead to undesired predictions according to how the over-determined features are eliminated.
A. Invariance and sensitivity analysis
Our analysis of invariance, local stability, and sensitivity are carried out using diamond cubic Si structures. All the selected descriptors in this work start from neighbor analysis based on atom-centered perspective, as described in Sec. III, and therefore, the descriptors generate values invariant under translational symmetries by definition. However, whether they can maintain invariance under rotation needs to be verified. To evaluate this, we perturbed the 4 × 4 × 4 cubic diamond crystal Si (c-Si) as follows: rotating the structure around the y-axis in a non-periodic unit cell, we calculated the difference of descriptors and representations dVall from the reference non-rotated structure [see Fig. 2(a)].
Perturbations on the Si structure with 4 × 4 × 4 conventional diamond unit cells. (a) Rotation of the structure around y-axis with θ angle, (b) perturbation of the central atom (dark blue) along the +x direction (mixed perturbation), and (c) radial and tangential perturbations of atoms at neighboring shells of the central atom. In (c), the cross section of the structure along the x–y plane is shown with four conventional surrounding unit cells. Only the atoms within four layers along z-axis [up to light blue atom in (b)] can be seen in (c). While green and red atoms are fourth-shell neighbors along x- and y-axes, gray, orange, and purple atoms are neighbors at first, second, and third shells, respectively. dθijk shows the angle that corresponds to the tangential perturbation dx on the sphere of fourth-shell neighbors.
Perturbations on the Si structure with 4 × 4 × 4 conventional diamond unit cells. (a) Rotation of the structure around y-axis with θ angle, (b) perturbation of the central atom (dark blue) along the +x direction (mixed perturbation), and (c) radial and tangential perturbations of atoms at neighboring shells of the central atom. In (c), the cross section of the structure along the x–y plane is shown with four conventional surrounding unit cells. Only the atoms within four layers along z-axis [up to light blue atom in (b)] can be seen in (c). While green and red atoms are fourth-shell neighbors along x- and y-axes, gray, orange, and purple atoms are neighbors at first, second, and third shells, respectively. dθijk shows the angle that corresponds to the tangential perturbation dx on the sphere of fourth-shell neighbors.
For the sensitivity analysis, we perform three types of perturbations to a 4 × 4 × 4 c-Si unit cell as follows:
Mixed Perturbation: the central Si atom with dark blue color in Fig. 2(b) is moved along the [100] direction (i.e., line joining the light green atoms on the x-axis) within a periodic supercell by a distance dx ranging from 10−8 Å to 0.1 Å.
Radial Perturbations: The atoms in the groups of first, second, third, and fourth neighbor atom shells at different distances from the central dark blue atom are perturbed along radial and tangential directions [see Fig. 2(c)]. For the radial perturbation, atoms in each shell are moved along the vector that separates the atom from the central atom. Position change dx ranges from 10−8 Å to 0.1 Å.
Tangential Perturbation: Neighboring atoms in the same shells are perturbed on the sphere inscribed by the distance vector Rij [see Fig. 2(c)]. This perturbation only changes angular contributions of the descriptors since the radial distances Rij are kept fixed. Position change ranges from 10−5 Å to 0.1 Å.
For case 1, we look at the difference in the full structure representation dV (comprising all atomic descriptors) with respect to the unperturbed crystal to investigate whether the full representation is sensitive to small perturbations of a single atom in the structure. In this mixed perturbation test case, all the neighbor distances for the perturbed central atom change from the perspective of the central atom i, and the difference dV from the reference unperturbed structure depends on both radial and angular contributions.
For cases 2 and 3, we look at the difference of the descriptor dVi of the central atom i with respect to an unperturbed neighborhood. This allows us to test the sensitivity of the individual radial and angular contributions.
1. Sensitivity to perturbations
We consider the question of whether the representation changes in a locally smooth and stable manner near some reference configuration . Small changes in the configuration should lead to proportional changes in the representation, a property that we called sensitivity. This requirement is necessary to represent an arbitrary smooth function with the same symmetries and to inherit its regularity, which in turn is key to obtain accurate fits with few parameters or basis functions.
To illustrate this concept, consider a “feature map” , which is strictly increasing and hence invertible, but assume that v(x) ∼ a2x2 as x → 0. It follows that , i.e., the inverse has a singularity. Suppose now that we wish to represent the linear function f(x) = x as f(x) = g(v(x)), then g(v) = f(x(v)) = x(v), i.e., it inherits the singularity which makes it challenging or even impossible to obtain an accurate fit.
In general, we consider paths with and expand the change in the descriptor to leading order,
for some k ≥ 1. We call a descriptor V linearly stable if k = 1 for all possible perturbation paths, i.e., if the change in the descriptor is linear as the perturbation amplitude t → 0. If k > 1, then we call V linearly unstable at .
In our sensitivity analyses, we choose different perturbation paths Rij(t) leading to different paths in descriptor space Vt. From Eq. (30), we then obtain
that is, we can observe the stability or instability of a descriptor by analyzing the slopes on a logarithmic scale. A linear slope, i.e., linear stability, is guaranteed to fulfill our sensitivity requirement. However, in certain high symmetry settings, this requirement must be relaxed, as we will see in Sec. VI A.
B. Dimensionality analysis
The descriptors analyzed here are all constructed from feature sets extracted from the structural mapping of the atomic neighborhood density to hyper-dimensional spaces. Their central objective is the requirement to cover all possible perturbations of the structure to ensure a faithful representation to the ML model. However, strictly following this concern can lead to over-determination in the hyper-dimensional space. In other words, the representation may cover only a small subspace of the full hyper-space, with the subspace depending on the parameter set used. In the case of over-determination, feature sequences may contain many zero entries or, for multi-species systems, may need to be padded with zeros to account for species missing from individual environments. Both of these cases lead to high sparsity in the descriptors, which, in principle, could be eliminated by carefully selecting parameters to remove unnecessary features from the final descriptor sets and hence from the representations. Moreover, using an over-determined mapping is likely to induce overfitting and noise in subsequent ML training. In ML applications, such non-informative data should be eliminated before the actual training of the model to reduce the error in the training and increase the accuracy of the resulting models.
Due to the well-known curse of the dimensionality, the dimension of the parameter space of a global optimization problem is the key determiner of the difficulty of obtaining optimal solutions. When representations form the input data for an optimization problem, their dimensionality thus has a crucial role in determining complexity. As the dimensionality increases, the number of possibilities rises combinatorially, drastically hindering the task of optimization.
To keep the features at an affordable level for optimization while maintaining an accurate description, one can use dimensionality reduction techniques such as CUR decomposition,70 farthest point sampling (FPS),70 Pearson correlation coefficient (PC),70 and principal component analysis (PCA).50,52 While these techniques help identify the most informative features in the descriptors, they also help analyze how the features of the descriptors can provide sufficiently informative data through an analysis of its “compressibility” and whether the representation leads to an over-determined embedding. Here, we have used CUR and FPS, as implemented in Ref. 70, and PCA to select the optimum number of features for the descriptors.
C. Analysis of dimensionality with regression
A key question for models based on representations is how the outcome of predictions depends on the dimensionality of the representation. To assess this, we analyzed the model prediction errors of representations on the structures and total potential energies, EDFT, taken from the Si GAP fitting database in Ref. 82. This is a subset of the complete Si dataset used elsewhere in this work, chosen here to allow comparison with results for the same dataset in Refs. 73, 82, and 83. For the target values t in Eq. (1), we use the cohesive energies, i.e.,
where and E0 are the number of atoms in each χk structure and the energy of an isolated Si atom, respectively. The cohesive energy of structure χk can then be estimated via the linear combination
where the index v runs over all Nf features within the representation.
For a global representation of a structure such as in MBTR, the Vk are taken directly as the representation vectors for each structure χk. For atom-centered descriptors, however, we first construct a global representation by summing over the descriptor vectors for each atom , i.e.,
In both cases, we normalized the Vk representations using the training dataset for each representation method.
The cohesive energies can now be estimated by minimizing the L2-regularized quadratic loss function,
To estimate the coefficients c, we build two regression models based on (i) linear ridge regression with L2-norm regularization, obtaining the least-squares solution with the QR method (RR) defined in Refs. 73 and 83 and (ii) kernel ridge regression (KRR) as detailed in Ref. 64.
For the linear ridge regression case, the regularized least squares problem becomes
where t is a vector comprising all the target cohesive energies, Ψ is an Nk × Nf matrix with representations along each rows and structures down each column, and λ is a regularization parameter.
For the kernel ridge regression case, we use the Vk representation to build an Nk × Nk kernel matrix with elements
where γ is a lengthscale hyperparameter. After constructing the kernel matrix, one can then predict the cohesive energy for a new configuration with representation V as
where c = tT(K + λI)−1 and κk = K(Vk, V).
VI. RESULTS AND DISCUSSIONS
A. Sensitivity
1. Sensitivity to rotations
In Fig. 3, we present the norm of the difference vector dV between the full structure representations of the rotated and the reference c-Si system for each approach considered.
The norm of the difference in representation values between the reference structure and rotated whole structure by an angle θ.
The norm of the difference in representation values between the reference structure and rotated whole structure by an angle θ.
Our analyses show that all descriptors maintain the rotational invariance with high precision, with all errors below 10−9 (above a machine precision ϵ of ∼10−16). Together with built-in invariance with respect to translations and permutation of like atoms of all the representations based on density projections, our analysis indicates that all approaches considered in this work fulfill the properties of invariance in rotation, translation, and permutation for the structural representations.
Considering the wide range of outcomes on the rotation tests, we ascribe the error to the numerical precision differences, i.e., floating point roundoff error of the underlying codes. For the sake of comparison of the outputs from different approaches considering the numerical precision of underlying codes, we selected 10−8 as the lower bound in all our subsequent sensitivity analysis following the lowest precision observed in these rotation tests.
2. Sensitivity to perturbations
Further analyses are carried out for the sensitivity of representations under the atomic motions in structures, as described in Sec. V A. Recall from Sec. V A 1 that a slope of one indicates local stability (also smooth invertibility of the descriptor), while a slope greater than one leads to singularities in the representation.
Figure 4 shows the results for the mixed perturbation with representation changes ∥dV∥ normalized so that all curves pass through the point (0.1, 0.1) to enable direct comparison. This lets us to provide an absolute comparison for ∥dV∥ metric between reference and perturbed structures. All representations have linear sensitivity within the entire range of the path, except for MBTR, which shows a mild preasymptotic sign of instability (change in slope from 1 to 2 above 0.01 Å), which is unlikely to cause any significant deterioration in the stability of the representation. This is apparent here since MBTR is a global representation method. Compared to other atomic descriptors, the results for MBTR show that its histogram of angles leads to a smooth change in the slope. In the atomic descriptors, this calculation includes contributions from all descriptors, and hence, the slope is dominated by radial contributions, which are discussed detailed in Sec. VI A 3.
The norm of the difference in representation values between the reference structure and c-Si with perturbation of a single-atom by a distance dx corresponding to the mixed perturbation in Sec. V A.
The norm of the difference in representation values between the reference structure and c-Si with perturbation of a single-atom by a distance dx corresponding to the mixed perturbation in Sec. V A.
3. Radial perturbation of reference crystal
A deeper analysis of the sensitivity of representations can be made by analyzing the responses of the atomic descriptors to different perturbation modes. As described in Sec. V A, we calculated the change in the descriptor of the central atom i to a radial perturbation of a neighbor j, as shown in Fig. 2(c). The sensitivity curves corresponding to the radial perturbation of a neighboring atom in the fourth shell are given in Fig. 5(a). All descriptors have slope-1 sensitivity curves, indicating linear stability under this perturbation. This is unsurprising since all descriptors provide a relatively high resolution of the two-body histogram.
Norm of difference of atomic descriptors on atom i as a neighboring atom j is perturbed from its reference position. (a) Radial perturbation and (b)–(d) tangential perturbations, as shown in Fig. 2(c). The tangential perturbations (b) in a high symmetry direction for the first shell, (c) in a random direction for the first shell, and (d) in a random direction for the second shell with random radial perturbations on the same shell before applying tangential perturbation on one of the same shell atom. In (b) and (c), no radial perturbation is applied before the tangential perturbations.
Norm of difference of atomic descriptors on atom i as a neighboring atom j is perturbed from its reference position. (a) Radial perturbation and (b)–(d) tangential perturbations, as shown in Fig. 2(c). The tangential perturbations (b) in a high symmetry direction for the first shell, (c) in a random direction for the first shell, and (d) in a random direction for the second shell with random radial perturbations on the same shell before applying tangential perturbation on one of the same shell atom. In (b) and (c), no radial perturbation is applied before the tangential perturbations.
4. Tangential perturbation of reference crystal
Next, we repeat the test of the foregoing section with a tangential perturbation of an atom in the first shell. The resulting sensitivity curves are given in Fig. 5(b), clearly showing slope 2 for all descriptors. Thus, according to Sec. V A 1, all descriptors are unstable with respect to tangential perturbations, raising concerns due to the resulting singularity in the inverse of the descriptor map. However, the origin of this instability is invariance with respect to reflections about a plane, and fitting any target function with the same reflection symmetry need not be affected by the singularity in the inverse descriptor map.
Concretely, let i denote the center atom and k denote the neighbor that is being perturbed in the tangential direction, i.e.,
where dRij = 0 for j≠k and ∥dRik∥ = 1 and . If R0 is symmetric under reflection through the plane that contains the origin and is orthogonal to dRik (this is the case here), then the configuration R−t is the reflection of Rt to within O(t2) accuracy. Since all descriptors V we consider are invariant with respect to reflections, they necessarily satisfy (this is true for any function of the distances rij and the cosines Aijk = cos θijk), and hence, Vt ∼ a2t2 as t → 0. In particular, the inverse descriptor map V ↦ R must contain a square-root singularity along the path Vt.
On the other hand, assuming that we aim to represent a property, e.g., site potential, , then ϵ will also satisfy this reflection symmetry, which indicates that the square-root singularity is again removed. To illustrate this point further, we modify the one-dimensional example from Sec. V A 1: Assume that we wish to represent f(x) = x2 = g(v), and then, g(v) = f(x(v)) = x(v)2 ∼ v/a2 as v → 0, i.e., the singularity is removed in this case. More generally, this occurs whenever f(x) ∼ b2x2 as x → 0, i.e., when f is symmetric about the origin to the leading order.
5. Tangential perturbation of a perturbed crystal
The analysis of the previous paragraph suggests that any perturbation in the configuration of the structure that breaks the symmetry should lead to the linearly stable slope-1 cases. We therefore test which descriptors are capable of capturing this symmetry breaking. We break the symmetry in several ways: we perturb an atom in a random tangential direction, we perturb atoms in the second shell that does not exhibit the same reflection symmetry, and we perturb the reference crystalline structure in the radial direction from a chosen center atom i before applying these tangential perturbations. The results are shown in Figs. 5(b)–5(d).
As predicted, any such symmetry breaking leads to changes in the slopes of sensitivity curves of descriptors in the limit t → 0. However, there are differences across descriptors how well the symmetry breaking is captured. First, there are some variations across descriptors how significant the pre-asymptotic slope-2 regimes are, which indicate a reduced sensitivity. However, the most concerning effect is the “dip” in the CHSF descriptor in Fig. 5(c), highlighting a region of significantly reduced sensitivity (it can almost be thought of as a blind spot) for atomic displacements in the descriptors, where the perturbation does not change the output values of representations. To test whether adding additional features can remove this dip, we implemented an extended CHSF descriptor, labeled CHSF-mix, for which the radial and angular histograms are fully mixed, giving a similar description of the three-body histogram as SOAP and ACE do. This addition clearly removes the reduced sensitivity regions.
B. Dimensionality of representations
In the second phase of our analyses, we consider four different datasets selected from those described in Sec. IV, namely, Si, CHON, AlNiCu, and TiO2. Each dataset contains a diverse range of configurations with thousands of structures. To identify how the dimensionality of the representations changes with different datasets using the same parameters, we used CUR and FPS feature selection techniques and analyzed the reduced dimensions of the representations by comparing them with the outcomes of PCA calculations. This analysis can also be accounted as a measure of the compressibility of each representation.
1. CUR decomposition
As a first step, we analyzed the representations including all element-wise descriptors. MBTR is considered as a representation-only approach as its output cannot be broken down to element-wise descriptors. In Fig. 6, the total error between the full feature sets of each dataset and the reduced feature sets that are extracted from CUR analysis is presented. As each approach provides a different number of features for the datasets at hand depending on the selection of (hyper)parameters, we provide a complete list of the number of non-zero features in each representation with the corresponding datasets in Table I and in the legends of each panel in Fig. 6.
CUR decomposition based dimensionality reduction analysis of (a) ACSF-X, (b) ACSF, (c) SOAP, (d) SOAPlite, (e) CHSF, (f) ACE, and (g) MBTR representations with four datasets that are indicated in the legend with the number of features of each within parentheses. Figures show the error in the representation from the non-reduced feature set as a function of the percentage of features selected. Legends show the corresponding datasets of Si, CHON, AlNiCu, and TiO2, while in panel (g), Si(STD) shows the standardized representation output for the Si dataset, and in (h), the legend shows the results of ACE with different polynomial degrees from six [as used in panel (f)] to 15 for the AlNiCu dataset.
CUR decomposition based dimensionality reduction analysis of (a) ACSF-X, (b) ACSF, (c) SOAP, (d) SOAPlite, (e) CHSF, (f) ACE, and (g) MBTR representations with four datasets that are indicated in the legend with the number of features of each within parentheses. Figures show the error in the representation from the non-reduced feature set as a function of the percentage of features selected. Legends show the corresponding datasets of Si, CHON, AlNiCu, and TiO2, while in panel (g), Si(STD) shows the standardized representation output for the Si dataset, and in (h), the legend shows the results of ACE with different polynomial degrees from six [as used in panel (f)] to 15 for the AlNiCu dataset.
Number of features for each representation in different datasets. The numbers in parentheses show the full feature set before non-zero elements are selected.
Desc. . | Si . | CHON . | AlNiCu . | TiO2 . |
---|---|---|---|---|
ACSF | 51 | 462 | 282 | 145 |
ACSF-X | 57(195) | 634(1644) | 544(1017) | 534 |
SOAP | 450 | 6660 | 3780 | 1710 |
SOAPlite | 450 | 4500 | 2700 | 1350 |
CHSF | 20 | 44 | 44 | 44 |
ACE | 30 | 30 | 30 | 30 |
MBTR | 182 | 5000 | 2400 | 900 |
Desc. . | Si . | CHON . | AlNiCu . | TiO2 . |
---|---|---|---|---|
ACSF | 51 | 462 | 282 | 145 |
ACSF-X | 57(195) | 634(1644) | 544(1017) | 534 |
SOAP | 450 | 6660 | 3780 | 1710 |
SOAPlite | 450 | 4500 | 2700 | 1350 |
CHSF | 20 | 44 | 44 | 44 |
ACE | 30 | 30 | 30 | 30 |
MBTR | 182 | 5000 | 2400 | 900 |
After removing any features of ACSF-X that are all zero from Si and AlNiCu dataset [see Fig. 6(a)], the selection method CUR in Sec. V B is applied throughout the dataset and features from the full feature set are selected one-by-one and added to the new feature set by calculating the error with respect to the full feature representation. Using this method, the features that contribute most to the representation can be selected. As the lower contributions are added to the new feature set, the error cannot be reduced more and the overall error becomes constant, equal to zero within numerical precision. The number of features that are selected at the beginning of this plateau can be counted to determine the size of the compressed representation, and conversely, the number of remaining features can be thought of as the over-determination of the representation.
In Fig. 6, one can see three types of results: (i) those with error curves that gradually decrease to the point where the error plateaus as for the TiO2 results of ACE, ACSF-X, and MBTR, (ii) errors decrease step-wise to a plateau such as in SOAP results, and (iii) where errors drop rapidly as for the AlNiCu results in ACSF, ACSF-X, ACE, and CHSF.
SOAP and SOAPlite, in Figs. 6(c) and 6(d), respectively, can be directly compared since they differ only in the choice of radial basis function. The results for the Si, CHON, and TiO2 datasets can be examined for both approaches. While SOAPlite has close to exponential decay up to ∼70% of selected features, in SOAP, this regime extends only up to 30%. After this, the SOAP error has a more step-like character. Similar behavior can also be seen in type (iii) results such as ACSF, ACSF-X, CHSF, and ACE. For these methods, while the first ∼10% of features vary the error reduction, the rest of the features only slightly reduce the error. For the AlNiCu dataset, SOAP and SOAPlite have substantially different outcome from the rest of the datasets, which is an indication on how the choice of radial basis can be a key determiner for the relevance of features in the representations. One can also see the dominance of radial basis in the global representations when Fig. 4 is compared with Fig. 5. The optimal choice of basis functions for descriptor performance is another open question and is outside the scope of the present work. We leave further investigation of the basis functions’ role on the sparsification and descriptor performance for future work.
In most of our dimensionality reduction results, one can see distinct constant error regions associated with over-determination. These features should ideally be eliminated before using the representation in ML models. For example, consider the results for ACSF-X and ACSF in Figs. 6(a) and 6(b). As the aim of ACSF-X is to extend the number of features in set from the widely used and well-tested standard ACSF parameter set, some of the features are expected to be irrelevant for representing structures in our datasets. It is thus unsurprising that the dimension reduction analyses in Fig. 6(a) show that there, only fractions of features contribute significantly—around 14% for Si, 33% for CHON, 54% for AlNiCu, and 33% for TiO2. A smoother feature reduction can be seen for the standard ACSF parameter set in Fig. 6(b), where the error decay is very similar with about 85% of the total features sufficient across all four datasets and thus around 15% of redundant features.
A similar result is seen for SOAP, where around 75% of the features of SOAP representations are sufficient to cover the structural variance across all datasets. This result is more striking than that for ACSF since SOAP has about an order of magnitude more features in its representation. We can conclude that both ACSF and SOAP are robust approaches that cover the hyper-dimensional space of structural representation for a wide-range of crystals and molecules.
The ACE AlNiCu results are significantly different than the other datasets. To identify whether the degree of the polynomial is the reason for this pattern, we carried out additional analyses with ACE, increasing the degree of polynomial from 6° to 15° in steps of 3°, as shown in Fig. 6(h). Increasing the polynomial degree significantly increases the number of features; however, we find that the percentage of selected features on the final representation does not change significantly and the pattern of error decay is still quite different from the rest of the datasets (e.g., Si, where 25% of the features can be removed from the representation although it has order of magnitude less features in the descriptor vectors than with 15° of polynomial expansion.
To further investigate the extensive redundancy identified for MBTR features across all four datasets, we consider whether the discretized smearing of positions and angles used by the MBTR representation leads to clustering of the features representation space. To identify any clustering of the data, we perform standardization of the features for all the representations of Si that are generated by MBTR and show the feature selection curve in Fig. 6(g) labeled Si(STD). When compared with Si curve, Si(STD) does reduce the error when less than ∼20% of features are selected by around an order of magnitude in comparison to the raw representation. However, standardization suggests an even smaller feature set selection of about only 20% of the full feature set. This may be due to the Gaussian smearing with D in MBTR that significantly increases the correlation between features.
2. Principal component analysis
The CUR selection process is closely related to the principal components of the representation data for each dataset. To show whether these principal components are related to the final feature selections in CUR, we further analyzed the datasets using PCA. In Fig. 7, the fraction of variance explained by the principal components of the four datasets is given for each representation. Since the PCA variances decrease from one to very small values, we determine a lower bound after which we consider the variance to be zero, shown as the zero level in our figures. PCA variation results for principal components follow the same trends as the CUR curves discussed above. Although CUR results show directly the hyper-dimensional space of the representation, PCA results are not solely based on the selected features but are a collective property of all features based on the covariance matrix from which the principal components are extracted. As seen in Fig. 7, the outcome of this selection following the highest to lowest variances and extraction of the features with highest values in the covariance matrix gives similar results to the CUR decomposition.
The variance of features vs the percentage of the selected components of (a) ACSF-X, (b) ACSF, (c) SOAP, (d) SOAPlite, (e) CHSF, (f) ACE, and (g) MBTR representations using PCA analysis. Different colors show results for each representation with selected datasets. The black dashed and blue lines show the machine precision and the lower bound in our variance analyses where any value below this threshold is treated as zero.
The variance of features vs the percentage of the selected components of (a) ACSF-X, (b) ACSF, (c) SOAP, (d) SOAPlite, (e) CHSF, (f) ACE, and (g) MBTR representations using PCA analysis. Different colors show results for each representation with selected datasets. The black dashed and blue lines show the machine precision and the lower bound in our variance analyses where any value below this threshold is treated as zero.
3. Farthest point sampling
CUR and PCA are both based on Singular Value Decomposition (SVD), selecting the features as orthogonal dimensions of the hyper-dimensional space and treating each feature as linearly independent since linearly dependent features cannot span the space. However, neither of these methods consider if the selected features have non-linear dependencies on other features. Another approach to select features without using SVD decomposition is Farthest Point Sampling (FPS). In FPS, the feature selected at each iteration is chosen as the farthest from those already selected. Hence, FPS does not provide information on whether the selected feature is indeed linearly independent of those already selected. This can be understood by considering each features’ distance from the others at a time late in the process when there are few remaining features to be selected. These remaining features either represent very small distances as minor additions to the previously selected and clustered features or repeat similar distances.
In Fig. 8, the FPS results for the feature selection show that there are significant differences in error reduction and hence dimension compression for SOAP, SOAPlite, ACE, and MBTR representations in comparison to our earlier CUR results. However, these FPS results do not allow insight into each features’ contribution as a dimension in the hyper-dimensional space. The relatively small reduction in errors in FPS selection or the constant regions are due to its selection criteria, which is based only on the hyper-spatial distance. This poses a limitation to the analysis if one would like to find the full extent of the representation and remove non-informative features from the descriptors. However, determining the cutoff for the features according to the error is not obvious since there may not be a plateau in the error—as is seen for in ACSF-X, where only 25% of features contribute—but instead a more gradual decrease as in the results for all other representations.
Comparison of CUR and FPS based dimension reduction analysis on ACSF both with (a) extended and (b) standard parameter sets, (c) SOAP, (d) SOAPlite, (e) CHSF, (f) ACE, and (g) MBTR representations for the CHON dataset.
Comparison of CUR and FPS based dimension reduction analysis on ACSF both with (a) extended and (b) standard parameter sets, (c) SOAP, (d) SOAPlite, (e) CHSF, (f) ACE, and (g) MBTR representations for the CHON dataset.
C. Regression with representations
The RR and KRR methods are applied to all representations using a subset of the Si dataset (see Sec. V C), and root-mean-square error (RMSE) is calculated as a function of the number of features in each representation.
For each representation, the regularization parameter λ used in both RR and KRR methods and the γ hyperparameter of K in KRR were chosen to optimize the convergence of the RMSE in the predictions on an independent test set.
In Fig. 9, we present our regression results for the two ridge regression methods RR and KRR in the first and second rows of the figure, respectively. While RMSE of both training (thick lines) and test datasets (thin lines) is shown in Figs. 9(a) and 9(c) for each regression method and for the top 65 features that are selected by the CUR method, the RMSE of full feature extension of representations is provided for comparison in Figs. 9(b) and 9(d).
Comparison of RMSE using [(a) and (b)] RR and [(c) and (d)] KRR with the Si dataset on representations, while (a) and (c) showing up to the top 65 features that are selected with the CUR method. Thick lines represent training errors, and thin lines represent test errors.
Comparison of RMSE using [(a) and (b)] RR and [(c) and (d)] KRR with the Si dataset on representations, while (a) and (c) showing up to the top 65 features that are selected with the CUR method. Thick lines represent training errors, and thin lines represent test errors.
Our intention here is not to provide the best potential energy surface (PES) estimator for each representation method, since this has already been extensively studied in the literature but instead to provide a common metric to compare different representations. Here, by fitting those methods to same dataset, we can assess how each method performs under this task and, in particular, how the dimensionality of each method effects the outcomes.
The effect of the dimensionality of the feature space can be seen from the RR outcomes of representation for the full feature set. In Fig. 9(b), one can identify plateaus at three typical RMSE values: 0.9 eV/atom, 0.1 eV/atom, and 0.04 eV/atom. While RMSE for both training and test dataset of SOAP has the highest accuracy with 0.04 eV/atom, MBTR and CHSF have the lowest accuracy of around 0.9 eV/atom. The rest of the methods have final errors of around 0.1 eV/atom. As the ridge regression is a linear fit to representation features, the RMSE can be compared directly between methods. The results indicate that SOAP has enough features to cover the structural variation, but convergence is slow using a linear fit. However, using a non-linear fit as in Fig. 9(d) outperforms RR. While SOAP has still the highest accuracy, albeit with a significant split of RMSE between test and training datasets, ACSF, ACE, and SOAPlite result in predictions that are at least an order of magnitude higher in accuracy in KRR than in RR.
To identify the role of the most important features in each representation on the prediction RMSE, we analyzed the RMSE of methods while only using the top-ranked selected features from CUR up to 65 features. The RMSE of RR shows that ACE and ACSF have very similar results up to 50 features, where they reach an accuracy comparable to that of 65 features with the SOAP or SOAPlite models. When non-linearity is introduced in the KRR models in Fig. 9(c), the difference is even more stark, for example, CHSF with only 20 features has comparable accuracy with SOAP at 65 features.
These results raise the question: what is the best feature set for a given representation or, equivalently, what is the necessary dimensionality for a given representation for a specific dataset or material? In Fig. 9(d), we can see that RMSE of training and test datasets for SOAP and ACSF starts to diverge at about 50%, and 70% of features, respectively. These points coincide with the region in Fig. 7 where the variance of these representations drops below about 10−12. Similar results can also be observed from the KRR results for MBTR, CHSF, and ACE where either RMSE does not reduce more with the addition of new features such as in ACE beyond 45% features or RMSE of the test dataset significantly increases as in MBTR after 40% features are selected.
These results show that there are clear inconsistencies between the selections of features in representation and even within the same representation as two critical points need to be considered before applying a feature set to MLIPs using dimension reduction: (1) the number of features may not be indicative of complete coverage of structural representation of a method, and (2) increasing the number of features may not result in gaining benefit and may also cause overfitting.
VII. CONCLUSION
We have carried out a comprehensive assessment of the sensitivity of atomic environment representations using several methods to analyze the sensitivity under rotation and various perturbations. Our results show that although many representations provide an overall acceptable accuracy for sensitivity, there is still room to balance sensitivities to radial and angular perturbations. We thus conclude that further investigation of how insensitivities affect applications of interatomic potentials and hence observables in MD simulations is necessary to improve ML driven simulation approaches.
We also carried out extensive dimensionality analyses of various representations, which have identified significant opportunities to eliminate unnecessary information that may reduce the accuracy of predictions from ML models. We also conducted regression tests to provide a comparison between representations as their dimensionality varies. The results show clear differences in the number and fraction of important dimensions in the different representations. This is expected to become increasingly important as more complex representations are developed and especially when incorporating property-based descriptors alongside atomic environment representations.
ACKNOWLEDGMENTS
We thank Gábor Csányi and Albert Bartok-Partay for useful discussions. This work was supported by grants from the Leverhulme Trust under Grant No. RPG-2017-191. This work received funding from the EU’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 676580), the NOMAD Laboratory CoE.
We used data from the Novel Materials Discovery (NOMAD) Laboratory (https://nomad-coe.eu/). We are grateful for computational support from the UK national high performance computing service, ARCHER, for which access was obtained via the UKCP consortium and funded by EPSRC (Grant Reference No. EP/P022065/1). Additional computing facilities were provided by the Scientific Computing Research Technology Platform of the University of Warwick.
DATA AVAILABILITY
The data that support the findings of this study are openly available in github.com/DescriptorZoo/Materials-Datasets at http://doi.org/10.5281/zenodo.3871650, version v1.0 that are extracted from open-access NOMAD Archive (http://nomad-coe.eu) and in github.com/DescriptorZoo/sensitivity-dimensionality-results at http://doi.org/10.5281/zenodo.4055141, version v1.1.84 The details of all datasets are given at Sec. IV. The corresponding citations for other data that are used in this study are available from the following publications: GAP Si potential database from Ref. 82, Si molecular dynamics (MD) database from Ref. 50, and TiO2 dataset from Ref. 49 through web access from Ref. 85. The codes that are used to create representations are detailed in Sec. III H.