High-dimensional representations of the elements have become common within the field of materials informatics to build useful, structure-agnostic models for the chemistry of materials. However, the characteristics of elements change when they adopt a given oxidation state, with distinct structural preferences and physical properties. We explore several methods for developing embedding vectors of elements decorated with oxidation states. Graphs generated from 110 160 crystals are used to train representations of 84 elements that form 336 species. Clustering these learned representations of ionic species in low-dimensional space reproduces expected chemical heuristics, particularly the separation of cations from anions. We show that these representations have enhanced expressive power for property prediction tasks involving inorganic compounds. We expect that ionic representations, necessary for the description of mixed valence and complex magnetic systems, will support more powerful machine learning models for materials.
I. INTRODUCTION
Materials informatics has served as a powerful field of study for the discovery and optimization of functional materials with engineered properties. For machine learning applications, the representation of materials is an important subfield, where state-of-the-art performance has been achieved through graph representations that incorporate information about both composition and structure.1–9 This success is evident in the materials-science-specific benchmarking suite Matbench, where graph neural networks (GNNs) dominate the tasks that incorporate structure.10
In the large data regime, neural network-based models have been able to learn representations of the elements. In addition to these learned descriptors, hand-crafted representations of elements and compositions continue to play a role in materials informatics for property prediction.11–18 An often overlooked aspect of traditional element representations is the role of ions. Ions, and knowledge of oxidation states, play a significant role in the structural and electronic properties of materials, including electrical conductivity,19,20 chemical bonding, and magnetism,21 because their electronic configurations differ from those of the parent atoms.22,23 For example, Fe0 is the building block of ferromagnetic iron metal, Fe3+ is found in the antiferromagnetic insulator Fe2O3 (hematite), while a mixture of Fe2+ and Fe3+ is found in the ferrimagnetic crystals of Fe3O4 (magnetite).
For composition-only property prediction (also known as structure-agnostic learning), compositions can be represented as composition-based feature vectors (CBFVs), which are derived from element embeddings. Typically, the element embeddings are combined through pooling operations (commonly descriptive statistics) to make a CBFV, which can then serve as an input for machine learning. This enables materials informatics practitioners to make machine-learning property predictions in the absence of explicit structural information.
The choice of the underlying element embedding used to make the CBFV does impact the performance of the property prediction task of interest. Depending on the size of the training data, CBFVs that lack domain knowledge (such as random representations or one-hot encoded atomic representations) can perform comparably to CBFVs built from element properties.24
A common aspect of structure-agnostic approaches is that they treat compositions using element-level information alone. This raises the question of whether there is any utility in ionic representations, or in incorporating knowledge of oxidation states, for structure-agnostic learning. When screening inorganic compositions, oxidation states are typically used as a form of chemical heuristic: to achieve charge-balancing in enumerative algorithms,25,26 to suggest why one compound may form rather than another,27 or as a check on the validity of compositions suggested by generative models.28,29 These oxidation states often see little use beyond such screening applications, despite their central role in explaining the properties of inorganic compounds.
In this study, we develop high-dimensional representations of ions and assess their utility for materials informatics tasks. The SkipAtom30 approach for developing distributed representations of the chemical elements is adapted to develop distributed representations of ionic species. We call our adapted formalism SkipSpecies. The SkipSpecies representation is then benchmarked in a structure-agnostic setting on two regression tasks (formation energy and bandgap) and two classification tasks (metallic and magnetic classification) across a range of embedding dimensions and pooling operations. We find that ionic representations can perform better than standard element representations on tasks linked to the electronic structure of a material.
II. METHODS
A. Dataset construction
The materials data are taken from the Materials Project database (version: 2022.10.28).31 The Materials Project API implemented in pymatgen (version: 2023.7.14)33 is used to query the oxidation states route of the database to obtain 154 718 non-deprecated structures.
We query for the following properties: material_id, structure, formula_pretty, possible_species, and method. From this query, we filter out materials that cannot be assigned oxidation states by removing entries where the method field is None. The method field in this dataset has three possible values: “bond valence analysis,” None, and “oxidation state guess.” “Bond valence analysis” refers to materials where the oxidation states can be assigned using the BVAnalyzer class in pymatgen33 via the bond valence algorithm, which uses element-based parameters34 derived from the ICSD.35 “Oxidation state guess” refers to materials where the oxidation states are assigned using the oxi_state_guesses method implemented in the Composition class of the pymatgen.core.composition submodule. This filtering returns 116 363 structures with assigned oxidation states. A further filter is applied to remove structures with non-integer oxidation states, resulting in a final dataset of 110 160 oxidation-state decorated structures. Of these remaining structures, the majority (103 687/110 160) were assigned oxidation states by bond valence analysis, with the remainder assigned through the oxidation state guess method. The distribution of the oxidation states of the elements is shown in Fig. 1.
To construct the property datasets, the Materials Project IDs (material_id) are used to query the materials’ summary route for the bandgap, formation energy per atom, and the metallic and magnetic classification labels.
B. SkipSpecies training
Following the approach of the original SkipAtom paper,30 a Voronoi decomposition approach36 is used to convert the dataset of 110 160 oxidation-state-decorated structures into graphs; from these graphs, a dataset of co-occurring species pairs is then derived. Species that are connected in the graph representation of a structure are considered to co-occur and make up the training pairs for learning distributed representations.
In the SkipSpecies approach, representations of ionic species are learned through a “fake” task: predicting the species that co-occur with a target species in a given structure. The task is referred to as “fake” since the aim is not to build a classifier for which species will co-occur with a target species, but rather to use the learned matrix of parameters as an embedding table for the species within the dataset of materials.
The general model architecture for training SkipSpecies models is adapted from SkipAtom30 and consists of an input layer of 336 neurons, a hidden embedding layer with d neurons, and an output layer of 336 neurons with softmax activation. The main modification compared to SkipAtom is that the input and output layers contain 336 neurons instead of 86 to account for each unique species in the dataset. For the hidden dimension d, models with the following dimensions are trained: 30, 86, 100, 200, 300, and 400. The input to this model is the one-hot representation of the target species, and the “fake” target variable is the one-hot representation of a species that co-occurs with the target. The loss function is the cross-entropy between the one-hot representation of the co-occurring species and the probabilities produced by the softmax activation on the output layer. The distributed representations are obtained from the embedding layer.
1. Induction
C. Creating composition-based feature vectors
D. SkipSpecies evaluation
To evaluate the performance of the distributed representations of SkipAtom and SkipSpecies, the ElemNet38 architecture (as chosen in the original SkipAtom paper30), with CBFVs derived from the learned representations as inputs, is applied to two classification and two regression tasks: metallic and magnetic classification, and formation energy per atom and bandgap, respectively. This work focuses on the difference in performance between distributed representations of ionic species and of atoms, as SkipAtom has been shown to outperform one-hot, random, and Mat2Vec39 element representations.30 As this dataset contained polymorphs of the same composition, it was further filtered by keeping, for each composition, only the polymorph with the lowest energy above the convex hull. This reduced the dataset to 71 470 materials. The property datasets are shown in Fig. 2.
The ElemNet model is implemented in TensorFlow40,41 and is a 17-layer feed-forward neural network that consists of 4 × 1024 neuron layers, 3 × 512 neuron layers, 3 × 256 neuron layers, 3 × 128 neuron layers, 2 × 64 neuron layers, and 1 × 32 neuron layer. All the layers use ReLU activation. For the classification tasks, the output layer is a single neuron with sigmoid activation, and the loss function is the binary cross-entropy. The regression tasks use an output layer with linear activation, and the loss function is the mean absolute error (MAE). The models were trained with the following hyperparameters: a maximum of 100 epochs, a learning rate of 10−4, a batch size of 32, and an L2 lambda of 10−5. Twice-repeated fivefold cross-validation is performed to evaluate the performance of the compound representations on the property prediction tasks. The reported metrics are the average MAE for the regression tasks and the average AUC (area under the receiver-operating characteristic curve) for the classification tasks. The error in the metrics is the standard deviation across the twice-repeated fivefold cross-validation.
III. RESULTS AND DISCUSSION
A. SkipSpecies training dataset
Figure 3 shows the distribution of unique components (elements/species) per structure in the dataset curated for training the SkipSpecies vectors, as described in Sec. II B. In Figs. 3(a) and 3(b), the one-component materials come from polymorphs of H2 and N2, which the oxidation states route of the Materials Project31 automatically assigns oxidation states of 0+. Figure 3(a) shows that the mode number of unique elements in the dataset is three, reflecting that ternary materials are the most frequent, closely followed by quaternary materials. This order is reversed in Fig. 3(b) when we consider the unique species within the structures, where the most frequent number becomes four. The distribution shift arises from the presence of many mixed-valence compounds in the dataset. Further evidence is that the maximum number of components rises to ten when species are counted, compared to at most nine when only elements are considered.
Figure 3(c) further shows the breakdown between the number of components each structure has when we consider unique species versus unique elements. With the exception of the unary and nonary materials, mixed valency is observed across all component counts. The figure also illustrates that materials in this dataset can have one or more elements exhibiting mixed valency. An example is the elemental quaternary material Li10GeP2S12, a known lithium superionic conductor,42 which, depending on the crystal structure, can be an ionic quinary material (mp-696128), where S exhibits mixed valency as S− and S2−, or an ionic senary material (mp-696138), where P and S both exhibit mixed valency as P4+ and P5+, and S− and S2−, respectively.
B. Learned representations
Various techniques exist to reduce high-dimensional data to a low-dimensional space for easier visualization. In Fig. 4, we have chosen the uniform manifold approximation and projection (UMAP)43 and the t-distributed stochastic neighbor embedding (t-SNE).44 Visualizing the embeddings provides a qualitative understanding of the quality of the learned representations. We can recover expected chemical trends but also find patterns that we would not intuitively expect. For example, in Fig. 4(a), the halides, with the exception of the chloride ion, appear close to each other within this space, with Br− and I− closer to each other than to F−. To emphasize this, a red ellipse has been drawn around these halide ions, highlighting the absence of the chloride ion from this cluster in the UMAP reduction of the SkipSpecies vectors. All the halides cluster together when t-SNE is used for the dimension reduction, as shown in Fig. 4(b) (the iodide and bromide anions overlap in the reduced space). The anions can generally be separated from the cations in the reduced space, although in the t-SNE figure, they form a cluster with a few outliers, including Te2−, C2−, and C4−. In the UMAP figure, we still observe a cluster, although it is more spread out than in the t-SNE space.
C. Property prediction evaluation
1. Pooling effect
In Fig. 5, the error metrics of the property prediction tasks using the induced SkipSpecies representation are shown over a range of representation dimensions, with different curves for the different pooling choices used to make the CBFV. Independent of the task, the dimension, and the representation, creating a CBFV using max-pooling clearly leads to worse performance on the property prediction tasks. This likely occurs because max-pooling discards much of the information present in the constituent species vectors, whereas sum- and mean-pooling use information from all of them. The max-pooling operation can, in some cases, neglect the element or species that matters most within a particular composition for predicting a given property. Max-pooling is also sensitive to outliers, as species vectors with anomalously large components can dominate the components of the CBFV; mean-pooling is more robust because it averages over each component. As such, the resulting max-pooled CBFV may not fully describe a composition.
For a given composition, the mean-pooled vector is the sum-pooled vector divided by the total number of atoms, a constant; hence, the sum-pooled and mean-pooled CBFVs of a composition are linearly related.
2. Dimension effect
When creating SkipSpecies representations of the chemical species, the dimension of the resulting distributed representations must be chosen. The effect of this choice is shown in Fig. 5. Generally, as the number of dimensions increases, the performance of the models also increases. This trend is most pronounced for the max-pooled CBFVs. For both the sum-pooled and mean-pooled CBFVs, the gains in performance beyond 200 dimensions are marginal.
In the natural language processing (NLP) field, the dimension of word embeddings is often chosen arbitrarily. For word embeddings, a higher number of dimensions usually yields higher-quality embeddings up to a saturation point.45 We observe the same behavior in our property prediction tasks. Performance tends to improve with dimension because a larger embedding can capture more complex relationships between the co-occurring pairs. However, the effectiveness of increasing the dimension is ultimately constrained by the size of the available data, which limits the ability to learn meaningful patterns.
3. Representation effect
To better visualize the effect of the choice of representation on the four property prediction tasks, heatmaps of the sum-pooled representations are shown in Fig. 6. For the bandgap prediction task, the SkipSpecies representations outperform the induced SkipAtom representation across all dimensions. While applying induction to the SkipSpecies representation does appear to offer a slight improvement in the MAE, the uncertainty in these values makes it hard to discern whether induction significantly improves the performance of the SkipSpecies representation. For the formation energy per atom task, the induced SkipAtom representation outperforms both SkipSpecies representations across all dimensions. For each representation, there is at most a marginal improvement in the MAE beyond 200 dimensions. The induced SkipSpecies representation performs best on both the metallic and magnetic classification tasks, with the SkipAtom representation performing the worst.
To further highlight the effect of the representation on model performance, Fig. 7 shows how the validation metric (MAE for the regression tasks and AUC for the classification tasks) changes during training. Except for the formation energy task shown in Fig. 7(d), the elemental SkipAtom representation performs worse than the ionic SkipSpecies representations. For the other three tasks, the SkipSpecies representations achieve better results from the start of training, and this advantage is maintained throughout the 100 epochs.
For the bandgap task, an ionic representation may offer better performance than a representation based on the neutral atom due to knowledge of the oxidation states. This can be rationalized by considering that the oxidation state of an ion allows the model to distinguish between the properties of different materials containing the same element. The loss or gain of electrons affects both the effective radius of a species, impacting its local structural environment, and its electronic configuration, both of which can alter the bandgap. As the embeddings are learned such that species occurring within similar environments should be similar, ionic representations may provide the flexibility to describe different types of compounds containing the same element. For example, TiO2 containing Ti4+ has a wide bandgap above 3 eV, while Ti2O3 containing Ti3+ has a small bandgap closer to 0 eV. This flexibility may not be captured by atom-only representations.
An important caveat for the metallic classification task is that it is based on Materials Project31 data. The classification derives from bandgaps calculated with semi-local density functional theory, so some compounds labeled as metals in this dataset could, in fact, be semiconductors due to bandgap underestimation at this level of theory.46,47 In addition, many known metallic compounds have likely been excluded from the dataset, since the requirement to assign oxidation states typically leads to the exclusion of intermetallic compounds.
The magnetic classification task is less common in property prediction, as noted by its absence from MatBench.10 The boost in performance from ionic representations possibly stems from the fact that magnetism originates in the electronic structure of a material through the spins of unpaired electrons centered on atomic sites. The oxidation states of the ions implicitly encode whether particular ions possess paired or unpaired electrons, depending on the crystal environments and co-occurring species seen during the original training of the distributed representations, which could explain the difference in performance between the SkipSpecies and SkipAtom representations.
It is important to note that the reported performances are influenced both by the quality of the representation and by the chosen model architecture. Factors affecting the quality of the representations include how often a particular species appears in the dataset. In this work, we applied induction37 to the SkipSpecies representation to compensate for under-represented species. Induction appears to offer a small boost in performance based on the error metrics.
IV. CONCLUSIONS
We have explored the development of ionic species representations for crystals from chemical data using machine learning. This work builds upon SkipAtom30 and some simple ion featurizers in Matminer17 based on oxidation states and electronegativity.
The SkipSpecies ionic representations can be used to develop property prediction models with lower errors than comparable atomic representations for predicting properties such as the bandgap, or for classifying compositions as metals or non-metals and magnetic or non-magnetic, suggesting that ionic representations have utility for predicting the properties of compositions. One caveat is that ionic representations are more restrictive than element representations, as the oxidation states in a composition must be known to use them for property prediction in a structure-agnostic setting. In addition, the quality of the trained SkipSpecies vectors depends on the correct assignment of the charges used to build the dataset of pairs. While tools such as pymatgen33 and BERTOS48 can be used to assign oxidation states to compositions, this introduces an additional step into a property prediction workflow and may fail to assign physical charges for certain compounds.
These ionic representations may find use for property prediction in approaches that create or generate compositions alongside knowledge of the oxidation states of the constituent elements, rather than having to decorate an existing set of compositions where this information is not already known. One example is within SMACT-based workflows, as the chemical filters used to generate compositional spaces return both the constituent elements and the oxidation states of the compositions. These representations can then be used both for property prediction on these spaces and as an alternative means of visualizing the compositional space: compositions with the same formula but with elements in different oxidation states map to distinct points in the ionic space, whereas they would collapse onto the same point in the elemental space.
Distributed species representations may have applications for crystal structure assignment by analogy through ionic substitutions, as pairwise similarity values can be derived from the vectors using distance or similarity measures. Alternatively, similarity measures can be applied to compositional feature vectors derived from these representations to suggest what known materials are similar to hypothetical compositions as part of synthesizability models.49 Finally, we note that such representations are not limited to compositional (structure free) models, but could be used, for example, to initialize node vectors on graph-based models of materials structure and properties.
ACKNOWLEDGMENTS
We thank Luis M. Antunes and Ricardo Grau-Crespo for the insightful discussions. A.O. acknowledges EPSRC for a Ph.D. studentship (Grant No. EP/T51780X/1). We acknowledge the UK Materials and Molecular Modeling Hub for computational resources, which is partially funded by EPSRC (Grant Nos. EP/P020194/1 and EP/T022213/1). We also acknowledge the Imperial College Research Computing Service for its computational resources.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Anthony Onwuli: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Writing – original draft (equal). Keith T. Butler: Conceptualization (equal); Project administration (equal); Supervision (equal); Writing – review & editing (equal). Aron Walsh: Conceptualization (equal); Project administration (equal); Supervision (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are openly available as an archive at https://doi.org/10.5281/zenodo.12733915 and are maintained at https://github.com/WMD-group/skipspecies. The ionic representation schemes have also been included in the ElementEmbeddings package available from https://github.com/WMD-group/ElementEmbeddings.