Measuring similarities/dissimilarities between atomic structures is important for the exploration of potential energy landscapes. However, the cell vectors together with the coordinates of the atoms, which are generally used to describe periodic systems, are quantities not directly suitable as fingerprints to distinguish structures. Based on a characterization of the local environment of all atoms in a cell, we introduce crystal fingerprints that can be calculated easily and define configurational distances between crystalline structures that satisfy the mathematical properties of a metric. This distance between two configurations is a measure of their similarity/dissimilarity and it allows in particular to distinguish structures. The new method can be a useful tool within various energy landscape exploration schemes, such as minima hopping, random search, swarm intelligence algorithms, and high-throughput screenings.
I. INTRODUCTION
Large data sets of crystalline structures are nowadays available in two major contexts. On one hand, databases of materials have been created containing structural information of both experimental and theoretical compounds from high-throughput calculations, which are the basis for data-mining techniques in materials discovery projects.1–7 On the other hand, ab initio structure predictions8–15 can produce a huge number of new structures that have either not yet been found experimentally or are metastable.16–21 In both cases, it is essential to quantify similarities and dissimilarities between structures in the data sets, requiring a configurational distance that satisfies the properties of a metric. Databases frequently contain duplicates and insufficiently characterized structures which need to be identified and filtered. In experimental data, the representation of identical structures as obtained from different experiments will always slightly differ due to noise in the measurements, such that the configurational distance is never exactly zero. Noise is also present in theoretical calculations where a geometry relaxation is, for instance, stopped once a certain, possibly insufficient convergence threshold is reached. In ab initio structure prediction schemes, it is typically necessary to maintain some structural diversity which can be quantified as a certain minimal configurational distance. All these examples clearly show the need for a metric that allows to measure configurational distances and local structures in a reliable and efficient way.
Crystalline structures are typically given in a dual representation. The first part specifies the cell and the second part the atomic positions within the cell. The former can, for instance, be given by the three lattice vectors a, b, and c, or by their lengths a, b and c, and the intermediate angles α, β, and γ. The atomic positions can either be specified by cartesian coordinates or by reduced coordinates with respect to the lattice vectors. However, such representations are not unique, since any choice of lattice points can serve as cell vectors of the same crystalline structure. Unique and preferably standardized cell parameters are required for comparison and analysis of different crystals.22 Algorithms to transform unit cells to a reduced form are frequently used in crystallography, such as the Niggli-reduction23–27 which produces cells with shortest possible vectors (|a + b + c| = minimal). Unfortunately, in the presence of noisy lattice vectors, cells can change discontinuously within the Niggli-reduction algorithm. Symmetry analysis and the corresponding classification in the 230 crystallographic space groups are another tool to compare crystal structures. However, the outcome of a symmetry analysis algorithm strongly depends on a tolerance parameter such that the introduction of some noise can change the resulting space group in a discontinuous manner. Because of the above described problems, it is difficult to quantify similarities based directly on dual representations.
Within the structure prediction community, fingerprints that are not directly based on such a dual representation have been proposed. Radial distribution functions have been used as a fingerprint to measure the distances between crystal structures,28,29 as well as methods based on comparison of the calculated powder diffraction patterns.30,31 In the same spirit, Oganov and Valle32 used element resolved radial distribution functions as a crystal fingerprint. For a crystal containing one element, only a single function is obtained for the entire system. The difference between the radial distribution functions of two crystals is then taken as the configurational distance. By definition, the radial distribution function contains only radial information, but no information about the angular distribution of the atoms. Such angular information has been added in the bond characterization matrix (BCM) fingerprint.13,33 In this fingerprint, spherical harmonic and exponential functions are used to set up modified bond-orientational order descriptors34 of the entire configuration. The distance between two configurations can be measured by the Euclidean distance between their BCMs. For molecular crystals, Chisholm et al.35 used intermolecular contact distances and a matching algorithm to characterize and compare structures. Atomic and molecular environment descriptors are also needed in the context of machine learning schemes for force fields,36–38 bonding pattern recognition,39 or to compare vacancy, interstitial, and intercalation sites.40 These descriptors could also be used to measure similarities between structures. Even though they have never been used in this context, we will present a comparison with such a descriptor.
When humans decide by visual inspection whether two structures are similar, they proceed typically in a different way. They try to find matching atoms which have the same structural environment. If all the atoms in one structure can be matched with the atoms of the other structure, the two structures are considered to be identical. Such a matching approach based on the Hungarian algorithm41 has already turned out to be useful for the distinction of clusters.42,43
In this paper, we will present a fingerprint for crystalline structures which is based on such a matching approach. The environment of each atom is described by an atomic fingerprint which is calculated in real space for an infinite crystal and represents some kind of environmental scattering properties observed from the central atom. Therefore, all the ambiguities of a dual representation do not enter into the fingerprint, allowing an efficient and precise comparison of structures.
II. FINGERPRINT DEFINITION
Recently, we have proposed a configurational fingerprint for clusters.43 In this approach, an overlap matrix is calculated for an atom centered Gaussian basis set. The vector formed by the eigenvalues of this matrix forms a global fingerprint that characterizes the entire structure. The Euclidean norm of the difference vector between two structures is the configurational distance between them and satisfies the properties of a metric.
Since there is no unique representation of a crystal by a group of atoms (e.g., the atoms in some unit cell), we will use atomic fingerprints instead of global fingerprints in the crystalline case. However, this atomic fingerprint is closely related to our global fingerprint for non-periodic systems. For each atom k in a crystal located at Rk, we obtain a cluster of atoms by considering only those contained in a sphere centered at Rk. For this cluster, we calculate the overlap matrix elements as described in Ref. 43 for a non-periodic system, i.e., we put on each atom one or several Gaussian type orbitals and calculate the resulting overlap integral. The orbitals are indexed by the letters i and j and the index w(i) gives the index of the atom on which the Gaussian Gi(r) is centered, i.e.,
In this first step, the amplitudes of the Gaussians cnorm are chosen such that the Gaussians are normalized to one, and the width of each Gaussian Gi(r) is given by the covalent radius of the atom w(i) on which it is centered. To avoid discontinuities in the eigenvalues when an atom enters into or leaves the sphere, we construct in a second step another matrix Tk such that
The cutoff function fc smoothly goes to zero on the surface of the sphere with radius ,
In the limit where n tends to infinity, the cutoff function converges to a Gaussian of width σc. The characteristic length scale σc is typically chosen to be the sum of the two largest covalent radii in the system.
The value n determines how many derivatives of the cutoff function are continuous on the surface of the sphere, and n = 3 was used in the following. One can consider the modified matrix Tk to be the overlap matrix of the cluster where the amplitude of the Gaussian at atom i is determined by cnormfc(|Ri − Rk|). In this way atoms close to the surface of the sphere give rise to very small eigenvalues of Tk and are thus weighted less than the atoms closer to the center. The eigenvalues of this matrix Tk are sorted in descending order and form the atomic fingerprint vector Vk. Since we cannot predict exactly how many atoms will be in the sphere, we estimate a maximum length for the atomic fingerprint vector. If the number of atoms is too small to generate enough eigenvalues to fill up the entire vector, the entries at the end of the fingerprint vector are filled up with zeros. This also guarantees that the fingerprint is a continuous function with respect to the motion of the atoms when atoms might enter or leave the sphere. If an atom enters into the sphere, some zeros towards the end of the fingerprint vector are transformed in a continuous way into some very small entries which only contribute little to the overall fingerprint. The Euclidean norm |Vk − Vl| measures the dissimilarity between the atomic environments of atoms k and l.
The atomic fingerprints and of all the Nat atoms in two crystalline configurations p and q can now be used to define a configurational distance d(p, q) between the two crystals,
where P is a permutation function which matches a certain atom k in crystal p with atom P(k) in crystal q. The optimal permutation function which minimizes d(p, q) can be found with the Hungarian algorithm41 in polynomial time. If the two crystals p and q are identical, the Hungarian algorithm will in this way assign corresponding atoms to each other. The Hungarian algorithm needs as its input only the cost matrix C given by
In the following, it will be shown that d(p, q) satisfies the properties of a metric, namely,
positiveness: d(p, q) ≥ 0,
symmetry: d(p, q) = d(q, p),
coincidence axiom: d(p, q) = 0 if and only if p = q,
triangle inequality: d(p, r) + d(r, q) ≥ d(p, q).
From the definition (Eq. (4)), it is obvious that the positiveness and symmetry conditions are fulfilled. The coincidence theorem is satisfied if the individual atomic fingerprints are unique, i.e., if there are not two different atomic environments that give rise to identical atomic fingerprints. In our work on fingerprints for clusters, we have shown that the fingerprints can be considered to be unique if they have a length larger or equal to 3 per atom. The triangle inequality can be established in this way,
where P, P′, and Q are assumed to be the permutations that minimize respectively the Euclidean vector norms associated to d(p, r), d(r, q), and d(p, q).
III. CONTRACTED FINGERPRINTS
Since the Rk-centered spheres contain typically about 50 atoms, an atomic fingerprint has at least length 50 if only s-type Gaussian orbitals or length 200 if both s and p orbitals are used. Since a configuration is characterized by the ensemble of all the atomic fingerprints of all the atoms in the cell, the amount of data needed to characterize a structure is quite large even though it is certainly manageable for crystals with a small number of atoms per unit cell. Storage requirements might however become too high in certain cases such as large molecular crystals. We will, therefore, introduce contraction schemes that allow to considerably reduce the amount of data necessary to characterize a crystalline structure. Two such schemes will briefly be discussed below.
A. Contractions by properties
Let us introduce a function τ(i) that designates a certain property of the Gaussian orbital i and encodes it in form of a contiguous integer index. In case of a multicomponent crystal, it can indicate on which kind of chemical element the Gaussians are centered and whether the orbital is of s or p type. The principal vector is thus chopped into pieces whose elements all carry the same value τ(i). In the following presentation of numerical results, we have always considered the central atom to be special, independent of its true chemical type. Having m atomic species in the unit cell and using atomic Gaussian orbitals with a maximum angular momentum lmax, τ(i) runs from 1 to (m + 1)(lmax + 1). Now we can construct a contracted matrix tk,
together with its metric tensor sk,
where uk is the principal vector of the matrix Tk of Eq. (2). The eigenvalues λ of the generalized eigenvalue problem,
form again an atomic fingerprint of length (m + 1)(lmax + 1) which is much shorter than the non-contracted fingerprint Vk.
B. Contractions to form molecular orbitals for molecular crystals
The fingerprints described so far can in principle also be used for molecular crystals. However, the amount of data needed to characterize such crystals can be quite large if the molecules forming the crystal contain many atoms. By creating molecular orbitals in analogy with standard methods in electronic structure calculations, the required amount of data can be considerably reduced. The eigenvalues arising from the overlap matrix in this molecular basis set will then form a fingerprint for the molecular crystal. The molecular orbitals can be obtained in the following way: for each molecule k in our unit cell, we cut out a cluster of molecules within a sphere of a certain radius. For each molecule α in this sphere, we set up the overlap matrix by putting Gaussian type orbitals on all its constituent atoms. Then, we calculate for this matrix the eigenvalues and eigenvectors. The principal vectors Wα,μ belonging to several of the largest eigenvalues are subsequently used for the contraction,
No metric tensor is required since the set of vectors used for the contraction is orthogonal. The molecular orbitals have characteristic patterns, such that the orbital corresponding to the first principal vector has no nodes, while the orbitals of the following principal vectors have increasing number of nodes. They are therefore similar to the atomic orbitals of s, p and higher angular momentum character, which were used for the fingerprints in the ordinary crystals. In Fig. 1, these orbitals are shown for the case of the paracetamol molecule.
By multiplying with some cutoff function as in Eq. (3), we can then obtain molecule centered overlap matrices in this molecular basis which is free of discontinuities with respect to the motion of the atoms. In the molecular case, the value of the cutoff function depends on some short range pseudo-interaction between the central and the surrounding molecules. This interaction Uk,α between the central molecule k and another molecule α is given by
where the sum over i runs over all the atoms in the central molecule k and the sum over j over all the atoms in the surrounding molecule α. di,j is the distance between the atoms i and j and is the sum of the van der Waals radii of the two atoms. The interaction is taken to vanish beyond its first zero. Because of the short range of the interaction, molecules sharing a large surface will be coupled strongly. The analytical form of the cutoff function is identical to the one for the atomic case (Eq. (3)). However, since a cartesian distance between molecules is ill defined, the argument in Eq. (3) is modified. The scaled distance rσc between the atoms is replaced by the normalized interaction between the molecules,
The eigenvalues of this final overlap matrix form now a fingerprint describing the environment of this molecule k with respect to the other molecules. To compare two structures, this procedure is done for all molecules contained in the corresponding unit cell. A configurational distance is calculated then as in Eq. (4) by using the Hungarian algorithm.41
IV. APPLICATION OF FINGERPRINT DISTANCES TO EXPERIMENTAL STRUCTURES
Structural data found in various material databases are frequently obtained from measurements at different temperatures which results in thermal expansion. Similarly, measurements at different pressures or low quality x-ray diffraction patterns can lead to slight cell distortions. Obviously, our fingerprint distances among such expanded or distorted but otherwise identical structures are different from zero. For these reasons, we have introduced a scheme where the six degrees of freedom associated to the cell are optimized while keeping the reduced atomic coordinates fixed such as to obtain the smallest possible distance to a reference configuration. The gradient of our fingerprint distance with respect to the lattice vectors can be calculated analytically using the Hellmann Feynman theorem. An application of the lattice optimization scheme was applied to a subset of ZrO2 structures taken from the Open Quantum Materials Database (OQMD),2,7 as will be discussed in further detail later in Sec. V.
V. NUMERICAL TESTS
Fig. 2 shows all possible pairwise configurational distances obtained with several fingerprints for various data sets. Different fingerprints are plotted along the x and y axis. Hence, for two different fingerprints which performed identically well all data points would lie on a diagonal line. Furthermore, an ideal fingerprint produces a clear gap in the fingerprint distances, separating structures that are either classified as identical or different. LFP stands for the uncontracted long fingerprint and in square parenthesis it is indicated whether only s or both s and p orbitals were used to set up the overlap matrix, SFP[s] stands for the short contracted fingerprint with s orbitals only where the properties used for the contraction are central atom and the element type of the neighboring atoms in the sphere. For materials that have only one type of element (Si in our case), the atomic fingerprint has only length two and the coincidence theorem is not satisfied. Even though there are hyperplanes in the configurational space where different configurations have identical fingerprints, it is very unlikely that different local minima lie on such hyperplanes and the fingerprint can therefore nevertheless well distinguish between identical and distinct structures. If both s and p orbitals are used (SFP[sp]), the atomic fingerprint has at least length 4 and no problem with the coincidence theorem arises. In addition, we also show the configurational distances arising from the Oganov and Valle and BCM13,33 fingerprints as well as from a fingerprint based on the amplitudes of symmetry functions.36 All our data sets contain both the global minimum (geometric ground state) as well as local minima (metastable) structures, obtained from minima hopping runs.9 Energies and forces were calculated with the Density Functional based Tight binding method (DFTB+)44 method for SiC and the molecular crystals, and the Lenosky tight-binding scheme was used for Si.45 For the CsPbI3 perovskite and the transparent conductive oxide Zn2SnO4, plane wave density functional theory (DFT) calculations were used as implemented in the Quantum Espresso code.46,47
The first test set consists of clathrate like structures of low density silicon allotropes.48 Low density silicon gives rise to a larger number of low energy crystalline structures than silicon at densities of diamond silicon and thus poses an ideal benchmark system. In the first line of the figure, we show the results of a relatively sloppy local geometry optimization, where the relaxation is stopped once the forces are smaller than 0.05 eV/Å. Gaps separating identical from distinct structures are hardly visible for all fingerprints. Once a very accurate geometry optimization with a force threshold of 5 meV/Å is performed, gaps become visible for all the fingerprints.
The second data set is silicon carbide, a material well known for its large number of polytypes. Our fingerprint gives rise to a small gap whereas the configurational distances based on all other fingerprints do not show any gap at all. The opening of a gap can again be observed once the geometry optimization is done with high accuracy. For this case, all fingerprints result in a gap, but like for all test sets, it is the least pronounced for the BCM fingerprint. Both the Oganov and BCM fingerprints are global ones and in the required averaging process over all the atoms more and more information is lost as the system gets larger. Therefore, it is not surprising that the gap again disappears even for the high quality geometry optimization once one goes to large cells.
The next two test sets consist of an oxide material and a perovskite with their characteristic building blocks of octahedra and tetrahedra which can be arranged in a very large number of different ways. All our fingerprints give rise to clear gaps separating identical from distinct structures. The Oganov fingerprint also gives rise to clear gaps whereas the BCM fingerprint only weakly indicates some gap. The Behler fingerprint gives a well pronounced gap for Zn2SnO4 but only a blurred gap for CsPbI3.
The last theoretical test system is a platinum surface. In this case, the energies were calculated with the Morse potential.49,50 The geometry optimization was done with high accuracy and therefore a big gap is visible in all cases.
Fig. 3 shows the correlation between the energy difference and the fingerprint distance for all the test cases of Fig. 2. Except for the very large 256 atoms system, there exists always a clear energy gap if the geometry optimization was done with high accuracy. Even though there is of course the possibility of nearly degenerate structures, this seems to happen rarely in practice and energy is thus a rather good and simple descriptor for small unit cells.
To test our molecular fingerprint (MFP), two test systems were employed, namely, crystalline formaldehyde and paracetamol. The formaldehyde system comprised 240 structures with 8 molecules per cell and the paracetamol system 300 structures with 4 molecules per cell. The two top panels of Fig. 4 show the molecular fingerprint distance versus the energy difference of different structures of paracetamol and formaldehyde, respectively. The two bottom panels show the correlation of the standard fingerprint against the molecular fingerprint for both systems. The existence of a gap in the pairwise distance distributions clearly indicates that identical and distinct structures can be identified by both fingerprints. However, the molecular fingerprint vector is considerably shorter because only six principal vectors were used (shown in Fig. 1). Since six is the number of degrees of freedom of a rigid rotator, it is expected that this fingerprint is long enough to satisfy the coincidence theorem.
COMPACK is a program developed by Chisholm et al.35 to compare molecular crystal structures. It is interfaced with the Cambridge Structural Database System through the CSD Python API.51 Both the formaldehyde and paracetamol test sets were used to compare the accuracy of the new and the COMPACK molecular fingerprint methods. Fig. 5 shows the COMPACK fingerprint distances plotted against our molecular fingerprint. In the panels (a) and (b), the default tolerance values of COMPACK was used, namely, distance tolerance = 20% and angle tolerance = 20° whereas for panels (c) and (d), we used larger values of 40% and 40°, respectively. In all 4 cases, there is a gap, showing that both methods are capable to distinguish between identical and different structures. Overall, the gaps for the distances computed by COMPACK are significantly smaller than with our new fingerprint. Especially for paracetamol with the default tolerances, the gap almost vanishes. Increasing the tolerances improves the gap (see panels (c) and (d)), and both fingerprints perform similarly well.
Next, we applied our fingerprint to ZrO2 structures contained in the OQMD. 115 different entries were available at this composition. The structures were either based on experimental data originating from the Inorganic Crystal Structure Database (ICSD) or on binary structural prototypes. When the OQMD was initially created, duplicate entries were identified with the structure comparison algorithm as implemented in the Materials Interface (MINT) software package52 which employs a 6-level test that includes cell reduction as well as an analysis of the lattice symmetry. Structures classified as identical to an existing entry in OQMD were mapped to that entry without performing a structural relaxation. Therefore, the structural data set contains both DFT optimized and experimental structures, resulting in noise on the atomic and cell coordinates arising from the numerical calculations as well as from the different experiments and thermal effects. In Fig. 6(a), we show the ordinary and the lattice vector optimized fingerprint distances for all 115 structures from the database. We can see that the optimization process can reduce the fingerprint distance down to about 10−7 for many structures. For some of them, the initial fingerprint distances were as large as 0.1. This allows to detect some identical structures whose initial large fingerprint distance was only due to thermal expansion. However, even with lattice vector optimization, it was not possible to decide for the whole data set in an inambiguous way which structures are identical and which were not. Therefore, local geometry optimizations were performed at the DFT level for all structures using the VASP code.53–55 A plane wave cutoff energy of 520 eV was used together with a dense k-point mesh. Both the atomic and cell variables were relaxed until the maximal force component was less than 2 meV/Å and the stress below 0.01 GPa. Panel (b) of Fig. 6 shows the DFT energy differences of the relaxed structures against the fingerprint distances, showing a clear gap that allows to distinguish between identical and different structures. Applying the lattice vector optimization scheme on these relaxed structures was not able to further lower the fingerprint distances of identical structures. The coloring in Fig. 6 indicates how the two structures belonging to a fingerprint distance were classified by MINT. Assuming that there are no different structures with degenerate DFT energies, one can conclude that MINT was not able to extract from the non-relaxed data set the information whether structures are identical or not and has erroneously assigned numerous identical structures as distinct and vice versa to a lesser extent.
Since both Oganov and BCM methods are global fingerprints that discard crucial information, they can fail to describe structural differences, a problem that becomes especially apparent when considering defect structures in complex materials. As an example, a 2 × 2 × 2 supercell was constructed of the cubic perovskite structure of LaAlO3.56 Half of the Al atoms on the B-sites were replaced by Mn. Then, single oxygen vacancies were introduced on symmetrically inequivalent X-sites. Obviously, the structural symmetry was reduced from the initial space group of LaAlO3 to the orthorhombic space group Amm2 of the supercell La(Al, Mn)O3, and the oxygen vacancies resulted in structures with Cm and Pm symmetry. Both MINT and our fingerprint confirm that the structures are clearly different, whereas the Oganov and BCM fingerprint erroneously classify both structures as identical.
VI. CONCLUSIONS
Atomic fingerprints that describe the scattering properties as obtained from an overlap matrix are well suited to characterize atomic environments. An ensemble of atomic fingerprints forms a global fingerprint that allows to identify crystalline structures and to define configurational distances satisfying the properties of a metric. The widely used Oganov and BCM fingerprints do not have these properties and do also in practice not allow a reliable way to distinguish identical from distinct structures. Symmetry function based fingerprints are of similar quality as our scattering fingerprints. However, they are much more costly to calculate. Both fingerprints have a cubic scaling with respect to the number of atoms within the cutoff range, but our prefactor of the matrix diagonalization is much smaller then the prefactor for the 3-body terms required for the calculation of the symmetry functions. In contrast to “true”–“false” schemes such as employed in MINT which rely on a threshold and affirm that two structures are either identical or distinct, our fingerprint gives a distance between configurations. The appearance of a gap in the distance distribution indicates that a reliable assignment of identical and distinct structures can be performed. In addition, strong reductions in the fingerprint distances upon lattice vector optimization can detect and eliminate thermal noise on the data set, rendering our fingerprint ideal to scan for duplicates in large structural databases. Furthermore, the fingerprint distance can be used as a simple measure of the similarity between two structures with a higher likelihood that they are identical if the fingerprint distance is small. In this context, machine learning methods (such as Bayesian techniques,57 support vector machines,58 and neural network59–62) could be trained with attributes derived from our fingerprint to better quantify crystal structure similarities. The new fingerprint can also accurately explore local environments to create atomic and structural attributes for machine learning techniques trained to predict material properties. Our scheme can easily be extended to molecular crystals by introducing quantities that are analogous to molecular orbitals. Even for large molecular crystals, the molecular fingerprint is short and still very accurate. They could thus be used in the context of crystal structure prediction, where the potential energy surface is explored and a large number of structures need to be compared to eliminate duplicates. In summary, we have demonstrated that our novel approach allows to characterize crystalline structures by rather short fingerprint vectors and to decide more reliably whether structures are identical or not than previously proposed methods.
Acknowledgments
We thank Vinay Hegde and Antoine Emery for valuable expert discussions. This work was done within the NCCR MARVEL project. M.A. gratefully acknowledges support from the Novartis Universität Basel Excellence Scholarship for Life Sciences and the Swiss National Science Foundation. C.W. acknowledges financial support from the U.S. Department of Energy under Grant No. DE-FG02-07ER46433. Computer resources were provided at the CSCS under project No. s499 and the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.