One of the most appealing aspects of machine learning for material design is its high-throughput exploration of chemical spaces, but to reach the ceiling of machine-learning-aided exploration, more than current model architectures and processing algorithms are required. New architectures such as graph neural networks have seen significant research investment recently. For heterogeneous catalysis, defining substrate intramolecular bonds and adsorbate/substrate intermolecular bonds is a time-consuming and challenging process: before applying a model, dataset pre-processing, node/bond descriptor design, and specific model constraints have to be considered. In this work, a framework designed to solve these issues is presented in the form of an automatic graph representation algorithm (AGRA) tool that extracts the local chemical environment of metallic surface adsorption sites. This tool can gather multiple adsorption geometry datasets composed of different systems and combine them into a single model. To show AGRA's excellent transferability and reduced computational cost compared with other graph representation methods, it was applied to five different catalytic reaction datasets and benchmarked against the Open Catalyst Project's graph representation method. The two oxygen reduction reaction (ORR) datasets with O/OH adsorbates obtained a 0.053 eV root-mean-square deviation (RMSD) when combined, whereas the three carbon dioxide reduction reaction datasets with CHO/CO/COOH adsorbates obtained an average performance of 0.088 eV RMSD. To further display the algorithm's versatility and extrapolation ability, a model was trained on a subset combination of all five datasets, achieving an RMSD of 0.105 eV. This universal model was then used to predict a wide range of adsorption energies and an entirely new ORR catalyst system, which was verified through density functional theory calculations.

Catalyzed energy reactions have emerged as a promising long-term route to a sustainable closed energy loop; however, current catalysts do not meet the requirements for them to become a reality.1 Current catalysts are not selective enough, and the few that show sufficient selectivity lack activity.1,2 This is the case for almost every electrocatalytic material group except high entropy alloys (HEAs), which have been shown to outperform current benchmark catalyst systems by up to 100% in selectivity and 200% in overpotential cost.3,4 Due to the expansive configurational space of HEAs, their potential is enormous but largely untapped.1 This material space is so large that even with the most efficient high-throughput experimental methods, it would take years to find optimal geometries for a catalytic application. With the help of artificial intelligence (AI) and computational modeling, however, these configurational spaces can be thoroughly explored at a fraction of the time, monetary cost, and human effort.1,5 For example, Batchelor et al. explored binary catalyst configurations through a combination of density functional theory (DFT) and machine learning (ML) for the oxygen evolution reaction (OER) and, through optimization techniques, proposed the ideal catalytic composition of IrPt alloys.6 Ma et al. took this a step further and utilized a similar ML approach not only to optimize a catalytic surface but also to design it so that a specific reaction pathway in the carbon dioxide reduction reaction (CO2RR) was prioritized, producing long-carbon-chain products as a source of hydrocarbon fuel.7

Although Batchelor and Ma have already shown the merits of AI as a tool in catalyst design, there is still room for improvement. Current regression models' accuracy when predicting trajectories, adsorption energies, and other electronic properties lies in the range of 0.05–0.5 eV mean absolute error (MAE),6–10 and recent advancements in model complexity and spatial-representation descriptors have proven capable of pushing these accuracy levels even further.10,11 One particularly notable advancement was the transition from neural networks (NNs) to graph neural networks (GNNs), which can develop an understanding of a local chemical environment's configurational composition, not just its chemical composition.12 Additionally, traditional ML frameworks, such as tree models and Gaussian processes, are not robust, require exact formatting of input vectors, and cannot easily be cross-trained on multiple adsorbates. The flexible input of graph neural networks can bridge this gap between various incompatible models while simultaneously improving model performance. Back et al. demonstrated this by applying a GNN to predict binding energies of CO and H on diverse surfaces, reaching MAEs of 0.15 eV.13 Their graph representation of the crystals included information on the local environments of each atom computed using a statistical analysis of Voronoi polyhedra around each site, successfully encapsulating this information in a simple fashion using only the solid angle. Although this approach added flexibility to the input structure, the method still has clear limitations when multidentate adsorbates are brought into question.

Additionally, before applying a model in each of these studies, dataset pre-processing, descriptor design, and specific model constraints had to be considered. These processes demand large time commitments, are prone to human error, and are difficult to incorporate into future works. Although scientific insight can be gained from each individual study, concatenating these works to generate even more comprehensive models would be ideal, and there is currently no clear path toward achieving this. GNNs contribute flexibility and spatial descriptors, but the combination of datasets, the incorporation of unique descriptor representations such as solid angles, and the efficient comparison of model frameworks have yet to be addressed.

In this work, another evolution in ML-aided material design frameworks is proposed to solve major inefficiencies in current methods. We designed a unified graph representation that improves model performance while reducing node counts and computational cost, and that can be trained on numerous adsorbates with different coordination numbers. The framework is highly accurate and transferable and offers previously unavailable processing capabilities for combining multiple works, boosting extrapolation ranges beyond single-adsorbate, single-material-system prediction. Using this tool, the data-processing and model-application steps of computational ML catalyst studies are vastly accelerated with new surface-analysis functionality. This method also lays the foundation for predicting catalytic properties on the world's largest materials informatics databases, such as the Materials Project and OpenCatalyst, at a fraction of the computational cost and programming effort.

Our graph representation of a material system’s local chemical environment surrounding an adsorption site was built using a method inspired by Deshpande et al.14 The Python package Atomic Simulation Environment (ASE)15 was used to analyze a material system surface, and NetworkX16 was used to embed nodes and construct a graph representation of a given geometry file. A visual walkthrough of the algorithm can be seen in Fig. S1. The algorithm’s initial input is a geometry file of the adsorbate/catalyst system. Utilizing user input specifying the desired adsorbate to analyze, the specified molecule is identified, and its indices are extracted from the input structures.

For each atom, the nearest-neighbor atoms are defined using an atomic-radius-based neighbor list generated with the ase.neighborlist module, which uses a radial cutoff for every atom based on metallic radii. A radius multiplier of 1.1 was applied to all cutoffs for coarse-grained adjustment and full encapsulation of awkwardly located adsorbate sites. Periodic boundary conditions are taken into account by unfolding bonds along the edge of the cell into repeats of the cell. Two atoms of the adsorbate are considered connected if their interatomic distance is lower than 1.8 Å, and an adsorbate atom is considered connected to the catalyst (substrate) if the interatomic distance is lower than 2.3 Å. To be considered connected to a central atom, neighboring atoms must share a Voronoi face and have an interatomic distance lower than the sum of the Cordero covalent bond lengths. Once the nearest neighbors are extracted automatically, proximity-based edge connections are generated depending on whether the node is an adsorbate or substrate atom: for two adsorbate atoms, the cutoff is 2.3 Å; for substrate atoms, it is 2.8 Å. After identifying the catalyst atoms connected to the adsorbate, the algorithm selects their neighbors and removes redundant atoms, i.e., duplicate atoms generated by periodic boundary conditions, to generate a new structure of the local chemical environment surrounding the adsorbate. As seen in Figs. 1(a) and 1(b), four primary types of adsorption sites can be identified: if one atom is connected to the adsorbate, the on-top geometry is extracted; if two atoms are connected, the bridge geometry is extracted; and if three atoms are connected, either a hollow-fcc or hollow-hcp structure is returned, depending on the subsurface configuration of the adsorption site.
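The site-classification step described above can be sketched in a few lines. This is an illustrative, dependency-free approximation using the 2.3 Å adsorbate-substrate cutoff quoted in the text; the function names are hypothetical and the fcc/hcp disambiguation (which needs the subsurface layer) is omitted, so it is a sketch of the idea rather than the AGRA implementation:

```python
import math

# Adsorbate-substrate connection cutoff from the text (angstroms).
ADS_SUBSTRATE_CUTOFF = 2.3

def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_site(ads_pos, substrate_positions, cutoff=ADS_SUBSTRATE_CUTOFF):
    """Label the adsorption site by how many substrate atoms bind the adsorbate.

    1 bound atom -> on-top, 2 -> bridge, 3 -> hollow (fcc vs hcp would
    require inspecting the subsurface stacking, omitted here).
    """
    bound = [p for p in substrate_positions if distance(ads_pos, p) < cutoff]
    n = len(bound)
    if n == 1:
        return "on-top"
    if n == 2:
        return "bridge"
    if n == 3:
        return "hollow"  # fcc or hcp, depending on the subsurface configuration
    return "unclassified"

# A toy three-atom surface patch: the adsorbate sits directly above one atom.
substrate = [(0.0, 0.0, 0.0), (2.8, 0.0, 0.0), (1.4, 2.4, 0.0)]
print(classify_site((0.0, 0.0, 2.0), substrate))  # on-top
```

Moving the same adsorbate midway between two surface atoms, e.g. to (1.4, 0.0, 1.5), puts two atoms inside the cutoff and yields "bridge".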
Finally, the local chemical environment graph is generated from the extracted geometry, where nodes represent atoms and edges represent the relationships between neighboring atoms. This entire process only requires the user to input the adsorbate species of interest; compared with other graph generation methods, neighbor radius cutoffs and edge cutoffs are fully automated. At each node, a feature vector is embedded following the procedure described by Xie and Grossman.12 For the edges, the feature vector is constructed from the average Pauling electronegativity of the atoms the two nodes represent, to which the bond length, represented on a Gaussian basis, is appended. All Gaussian basis parameters were taken from the CGCNN model and verified for consistent edge generation across the tested databases.12 Through the use of basic JSON files, the node descriptors and edge attributes can easily be changed to test new configurations and spatial descriptors; this JSON-interpretable atom embedding allows easy exploration and concatenation of new and previously established descriptors. The graph generation methodology is also applicable to a wide range of lattice orientations extending beyond the fcc/hcp crystal structures highlighted in Sec. III B. This allows researchers to keep models basic for studying parametric sensitivity to descriptors, or rich enough to capture the intermingled phenomena influencing a catalytic site across a wide range of catalytic applications, such as co-adsorption and asymmetric substrates. By default, a 16- or 92-descriptor set is applied to each model, depending on the desired type of model.12,17
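The Gaussian-basis representation of a bond length mentioned above can be sketched as follows. The expansion smears a scalar distance onto a grid of Gaussian basis functions, in the style of CGCNN; the dmin/dmax/step defaults here are illustrative assumptions, not the exact parameters AGRA inherits from CGCNN:

```python
import math

def gaussian_expand(dist, dmin=0.0, dmax=8.0, step=0.2, var=None):
    """Expand a bond length (angstroms) onto a Gaussian basis.

    Returns one value per basis-function center; the center nearest the
    bond length responds most strongly, giving a smooth, fixed-length
    edge feature vector regardless of the raw distance.
    """
    var = var if var is not None else step
    centers = [dmin + i * step for i in range(int((dmax - dmin) / step) + 1)]
    return [math.exp(-((dist - mu) ** 2) / var ** 2) for mu in centers]

# A 2.3 A bond produces a 41-component feature peaked near the 2.2-2.4 A centers.
feat = gaussian_expand(2.3)
```

Appending this vector to the averaged Pauling electronegativity gives the full edge feature described in the text.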

FIG. 1.

(a) Four AGRA-recognized primary adsorption sites on an fcc (111) alloy surface. The colors red, green, blue, and light blue correspond to on-top, bridge, hollow-fcc, and hollow-hcp sites. (b) Generated local chemical environments extracted with AGRA's surface analysis functionality, with periodic boundary conditions applied to each adsorption site in panel (a): top sites (red), bridge sites (green), hollow-fcc (blue), and hollow-hcp (light blue). (c) Visualization of CO2RR and ORR HEA structures after adsorbate local chemical environment extraction. (d) Conversion of the OH top-site local chemical environment geometry from (c) into a GNN-interpretable graph with atom/edge embedded descriptors.


To further illustrate the surface analysis feature of AGRA, a visualization of each unique site geometry extracted from two publicly available databases is shown in Fig. 1(c). The GNN-interpretable graph representation of an extracted site is also illustrated in Fig. 1(d). This geometry analysis feature is consistent because the node count depends only on the adsorption site type, not on the initial size of the slab.

When compared to other graph representation methods, such as the Open Catalyst Project's (OCP) atoms2graph function, AGRA's benefits are highlighted.18 The OCP generates its graph nodes by extracting a user-specified number of nearest neighbors (200 by default) without consideration of crystal-structure orientation.18 Additionally, the node count of the OCP's graphing method can vary greatly if multiple databases with different simulation sizes are used. AGRA's framework extracts local chemical environments dependent on the surface crystal orientation around adsorbates, providing an additional layer of spatial description to the graph with greater node-count consistency across multiple material systems. AGRA also further separates substrate atoms into binding-site atoms and nearest-neighbor atoms, as opposed to OCP, which separates substrate atoms into fixed (core) atoms and free (surface) atoms. Based on the databases discussed in Sec. IV, the AGRA graph representation yields superior results to the OCP graph representation when combining multiple adsorbate datasets, which suggests that the node generation method and local chemical environment extraction improve model transferability and versatility. The key differences are highlighted in Table I. For the technical limitations of AGRA regarding variable surface coverage and other situations, see Sec. A of the supplementary material.

TABLE I.

Summarized differences between AGRA and OCP graph generation.

Component       | AGRA                                                              | OCP
Node generation | Nodes based on adsorption geometry and local chemical environment | Extracts N nearest neighbors from the input structure
Edge generation | Edge connections based on substrate-adsorbate proximities         | Fully connected nodes for all atoms within a specified distance of each other

Three graph neural networks were built on top of our representation to show its flexibility and its ability to augment existing frameworks. The first GNN, referred to as "NNConv," consists of three major components: convolutional layers, recurrent layers, and pooling layers, all implemented in PyTorch Geometric.19 The convolutional layers are continuous kernel-based convolutional operators,20 and a dynamic edge-conditioned filter21 was applied to handle graphs of varying sizes and connectivity. The recurrent layer is a gated recurrent unit (GRU) implemented in PyTorch, and the pooling layer is the global pooling operator based on iterative content-based attention.22 First, the nodes and edges are embedded; the convolutional and recurrent layers then iteratively update these features. After N iterations, the pooling layer produces the overall feature vector, and a fully connected layer and output layer predict the target property. The second GNN is a crystal graph convolutional neural network, referred to as "CGCNN."12 This model consists of two major components: convolutional layers and pooling layers. The convolutional layers iteratively update each atom's feature vector with surrounding atoms and bonds via a non-linear graph convolution function. After N convolutions, the network automatically learns the feature vector for each atom, and the pooling layer produces an overall feature vector. The third GNN, an improvement on the second model published by Choudhary and DeCost,23 is referred to as "ALIGNN." This model utilizes the CGCNN framework but also incorporates bond-angle representation. It updates nodes and edges via edge-gated graph convolutions and sigmoid linear unit activations; after average pooling, a final regression/classification layer produces a single-value prediction.
For the exact hyperparameters of each model used in training, see the supplementary material, Sec. C.
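The edge-conditioned update at the heart of NNConv can be shown in miniature: a small network maps each edge's feature vector to a weight matrix, which transforms the neighboring node's features before they are summed into the receiving node. The following is a toy, dependency-free sketch of that message-passing step (the data layout and the trivial `edge_net` are assumptions for illustration, not the PyTorch Geometric implementation):

```python
def nnconv_step(node_feats, edges, edge_feats, edge_net):
    """One edge-conditioned convolution update (NNConv-style).

    node_feats: {node_id: [float, ...]}
    edges:      list of (src, dst) pairs
    edge_feats: {(src, dst): [float, ...]}
    edge_net:   maps an edge feature vector to a weight matrix
                (list of rows, out_dim x in_dim)
    """
    updated = {v: list(f) for v, f in node_feats.items()}
    for (u, v) in edges:
        w = edge_net(edge_feats[(u, v)])          # edge-dependent filter
        msg = [sum(row[i] * node_feats[u][i]      # transform sender features
                   for i in range(len(node_feats[u])))
               for row in w]
        updated[v] = [a + b for a, b in zip(updated[v], msg)]  # aggregate
    return updated

# Toy example: 2 nodes, 1 edge; the edge network scales an identity matrix
# by the single edge feature -- purely illustrative.
edge_net = lambda e: [[e[0], 0.0], [0.0, e[0]]]
feats = {0: [1.0, 2.0], 1: [0.5, 0.5]}
out = nnconv_step(feats, [(0, 1)], {(0, 1): [2.0]}, edge_net)
# node 1 now holds its own state plus 2x node 0's features
```

In the full model, `edge_net` is a learned neural network, and this update is interleaved with the GRU and followed by attention-based pooling.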

1. ORR dataset

The first dataset we used to evaluate our graph representation was reported by Batchelor et al.6 This dataset deals with the oxygen reduction reaction (ORR) on RuIrRhPdPt HEAs. The calculations were performed with DFT through GPAW, an implementation of the projector-augmented wave method in ASE. The wave functions were expanded in plane waves with an energy cutoff of 400 eV, using the RPBE exchange-correlation functional. All slab calculations were performed with a minimum accuracy of 3 × 3 × 1 k-point Monkhorst–Pack sampling; HEA stability considerations used an 8 × 8 × 4 Monkhorst–Pack sampling. The training set corresponds to the adsorption energies of *OH and *O at 871 and 998 different 2 × 2 unit cells, respectively, whereas the test set was modeled on 3 × 4 unit cells. This is a case where our approach is particularly well suited, because the size of the graph fed to the model does not depend on the size of the simulated surface. Figures 1(c) and 1(d) illustrate examples of extracted atoms for *O adsorbed at hollow sites and *OH adsorbed at top and bridge sites, as well as their corresponding graph representations. Predicted adsorption energies plotted against DFT-calculated energies are displayed in Fig. 2. Our best model was both more efficient and more accurate than that reported by Batchelor et al.,6 who obtained test-set root-mean-square deviation (RMSD) values of 0.063 and 0.076 eV for *OH and *O, respectively. Here, we obtained RMSDs of 0.093 (*OH) and 0.172 (*O) eV for the NNConv model, 0.094 (*OH) and 0.149 (*O) eV for the CGCNN model, and 0.048 (*OH) and 0.059 (*O) eV for the ALIGNN model. Each reported RMSD is the average of five train/val/test shuffles with 10/10/80 splits. Using the AGRA pipeline, each GNN was tested with a fraction of the usual effort and time. When combining both datasets into a single model, an average RMSD of 0.053 eV was obtained, showing the transferability and flexibility of the model.
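The error metric and split protocol above are straightforward to reproduce; the following sketch shows the RMSD computation and one random shuffle-split with the 10/10/80 fractions stated in the text (seeds and data layout are illustrative assumptions):

```python
import math
import random

def rmsd(predicted, reference):
    """Root-mean-square deviation between equal-length sequences (eV)."""
    n = len(predicted)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)

def shuffle_split(data, fracs=(0.1, 0.1, 0.8), seed=0):
    """One random train/val/test shuffle with the text's 10/10/80 fractions."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(fracs[0] * len(data))
    n_val = int(fracs[1] * len(data))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Averaging rmsd over five such splits (seeds 0..4) gives the reported score.
err = rmsd([0.10, -0.20, 0.30], [0.12, -0.18, 0.26])
```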

FIG. 2.

Predicted adsorption energies plotted against DFT-calculated energies for the ORR test dataset using the top-performing AGRA/ALIGNN model from five re-trains. The MAE values highlighted were obtained from the most accurate saved model. The O model was trained on 998 datapoints, the OH model on 871, and the O/OH model on 1600 datapoints.


For the combined dataset, 800 datapoints were randomly selected from each of the O and OH datasets to create a single 1600-point combined ORR database. This experiment effectively demonstrates the pipeline's ability to generate highly accurate models that predict multiple adsorbates at high speed with minimal processing, although it does not probe AGRA's limitations. The speed advantage arises because the limited atom count of each datapoint's graph-converted local chemical environment dramatically reduces the computational cost of each GNN training epoch compared to training with a radius-cutoff-based graph representation such as OCP's.
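Building the combined database amounts to drawing equal-sized random subsets from each per-adsorbate dataset and concatenating them. A sketch with hypothetical record names (the actual AGRA data structures may differ):

```python
import random

def combine_datasets(datasets, n_per_set=800, seed=0):
    """Draw n_per_set random points from each adsorbate dataset and merge.

    `datasets` maps an adsorbate label to a list of (graph_label, energy)
    records; names and record layout here are illustrative assumptions.
    """
    rng = random.Random(seed)
    combined = []
    for label, records in sorted(datasets.items()):
        combined.extend(rng.sample(records, min(n_per_set, len(records))))
    rng.shuffle(combined)  # mix adsorbates before the train/val/test split
    return combined

# Toy stand-ins with the dataset sizes quoted in the text (998 O, 871 OH).
o_data = [("O-graph", -0.1 * i) for i in range(998)]
oh_data = [("OH-graph", -0.1 * i) for i in range(871)]
merged = combine_datasets({"O": o_data, "OH": oh_data})
# 800 + 800 = 1600 points, matching the combined ORR database
```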

2. CO2RR dataset

The CO2RR dataset, taken from Chen et al., contains 691 datapoints spanning CO, CHO, and COOH adsorption energies on equiatomic CoCuFeMoNi high entropy alloys designed with a neural generator that maximizes system entropy.24 The same graph representation and nodal descriptors were used as for the ORR dataset. The HEA systems were (111)-oriented, 64-atom slabs with adsorbates placed on every unique top site (16 top sites per slab). All DFT calculations for this database were conducted with the Vienna ab initio simulation package (VASP). Core electrons were described by projector-augmented wave pseudopotentials, and the generalized gradient approximation (GGA) with the Perdew–Burke–Ernzerhof functional was used. The wave-function kinetic energy cutoff was 550 eV, with 4 × 4 × 1 k-point sampling and a 15 Å vacuum gap. For symmetric adsorbates such as CO, which are equally influenced by each of the six nearest surface neighbors and three subsurface neighbors, each top site was considered a single datapoint. For asymmetric adsorbates such as CHO and COOH, each top site yielded six unique datapoints: although the auxiliary atoms are not directly bonded to the monodentate adsorption site, they are still influenced by van der Waals and electron-cloud interactions with their nearest surface neighbors. Because of this, each top site was calculated with six different orientations in which the auxiliary atom sat above each of the six nearest surface neighbors.
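One way to picture the six orientations per top site is as rotations of the auxiliary atom about the surface normal in 60° steps, one toward each of the six nearest surface neighbors of an fcc (111) top site. This is a geometric sketch of that idea (the function and parameters are hypothetical, not the dataset-generation code):

```python
import math

def auxiliary_orientations(site_xy, radius, n=6, phase=0.0):
    """In-plane positions for an asymmetric adsorbate's auxiliary atom.

    One position per (360/n)-degree rotation about the adsorption site
    (i.e., about the surface normal); n=6 matches the six nearest surface
    neighbors of an fcc (111) top site.
    """
    positions = []
    for k in range(n):
        theta = phase + 2.0 * math.pi * k / n
        positions.append((site_xy[0] + radius * math.cos(theta),
                          site_xy[1] + radius * math.sin(theta)))
    return positions

# Six distinct in-plane placements, 60 degrees apart, at an assumed 1.2 A offset.
orients = auxiliary_orientations((0.0, 0.0), radius=1.2)
```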

The authors reported MAE scores of 0.095, 0.095, and 0.068 eV for the CO, CHO, and COOH datasets, respectively. Using AGRA in combination with the ALIGNN and CGCNN models, we achieved equivalent accuracy for the CO dataset and superior accuracy for the CHO and COOH datasets. The ALIGNN model provided the lowest MAE scores for all three datasets: an average MAE of 0.095 eV for CO, 0.095 eV for CHO, and 0.065 eV for COOH (Fig. 3). The strong accuracy of AGRA with minimal dataset preparation can be attributed to the script's ability to accurately recognize bridge, hollow, and top sites [Fig. 1(c)] and to automatically translate the local chemical environment into a descriptive, GNN-readable graph given the proper node embedding descriptors.

FIG. 3.

Predicted adsorption energies plotted against DFT-calculated energies for the CO2RR test dataset using the top performing AGRA/ALIGNN model from five re-trains. The MAE values highlighted were obtained from the most accurate saved model. The CO, CHO, and COOH models were trained on 170, 204, and 267 datapoints. The combined CO2RR dataset was trained on 618 datapoints.


Combining the three adsorbate datasets into one model resulted in a highly accurate model with an average MAE of 0.067 eV. The application of AGRA to this work highlights its versatility when presented with multiple limited-size datasets. Through a combination of adsorbate bond-angle consideration and binding-site edge representation, a single model captured not only the complex catalytic surface of HEAs, which have been proven to break the linear scaling relation, but also the complexities of various adsorbate configurations. Chen's work utilized a multilayer perceptron neural network to analyze the influence of the surrounding environment on each adsorbate dataset, ultimately showing how HEAs break the linear scaling relation to create superior catalysts. With AGRA, these same complexities were captured more accurately while simultaneously combining all three adsorbates into a single database for even greater dynamic comprehension.

3. Combined CO2RR and ORR dataset

As discussed earlier, one of the benefits of this framework is the ability to combine multiple datasets in order to extrapolate performance metrics for systems not explicitly studied through DFT or experiment. As an example of this application, five DFT adsorption energy calculations were performed with VASP to study the CO2RR HEA system's performance for the ORR. The adsorption energies of O and OH on these HEAs were calculated using the same VASP parameters as for the CO2RR dataset, and the vacuum energies of O and OH were obtained from the energy differences between water, OH, and H2 in vacuum, similar to the methodology used for the ORR dataset calculations.6 The material systems were composed of the CO2RR dataset's HEA system and the ORR dataset's adsorbates (Fig. 4). Each HEA system studied was generated using the same neural-generator method that Chen utilized.24 The three GNNs were trained on a curated dataset composed of both ORR and CO2RR datapoints: for each adsorbate dataset (CO, CHO, COOH, O, OH), ∼200 randomly selected datapoints were combined to generate a 1000-datapoint dataset. The models were then tasked with predicting the five never-before-seen adsorption energies for comparison against DFT calculations. The ALIGNN model achieved the best average MAE and RMSE scores of 0.068 and 0.104 eV [Fig. 4(a)]. Although the model was trained on only a limited amount of each adsorbate dataset spanning two HEA systems, its average RMSE when predicting the datapoints excluded from the 1000-point dataset was within 0.02 eV MAE of AGRA models trained on each individual dataset [Fig. 4(b)]. This highlights AGRA's potential for dataset concatenation to generate accurate models spanning multiple material systems and chemical reactions.
Furthermore, when tasked with predicting the five DFT-calculated datapoints for a new ORR system, AGRA was able to predict the material system's performance trends despite having no prior datapoints on the HEA system's ORR performance. Although AGRA consistently predicted adsorption energies roughly 0.6 eV more negative than the DFT calculations, the same conclusion would still be reached: the newly studied ORR catalyst would not be ideal due to strong adsorption of O and OH. The five DFT-calculated datapoints are shown in Fig. 4(b) as the brown points labeled "New System." Three OH-adsorbate and three O-adsorbate datapoints were initially calculated, for a total of six DFT calculations; however, one of the O-adsorbate calculations saw considerable surface migration and was removed as a result. The final five datapoints primarily stabilized on hollow sites, with only one OH adsorbate stabilizing on a bridge site. Notably, the largest prediction errors came from the O-adsorbate datapoints. The drop in accuracy can be attributed to the large degree of extrapolation required for this type of novel prediction. Prior to AGRA, ML-aided catalyst design extrapolated at most to compositional exploration of material-system/adsorbate combinations already included in the training data; AGRA takes this one step further by extrapolating to material-system and adsorbate combinations never seen before. The benefits of studying pre-established HEA systems for different reactions were proven by Chen with the CO2RR dataset: that work took an existing ammonia-decomposition catalyst and applied it to the CO2RR, obtaining exceptional overpotentials.25 Since this study can be replicated on any combination of published databases, AGRA has the potential to unify the many separate catalyst databases available to the public and discover unknown applications of pre-established catalysts.
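The two error metrics quoted for the held-out predictions penalize errors differently: MAE weights all deviations equally, while RMSE emphasizes outliers such as the O-adsorbate points. A small sketch of both, on illustrative numbers only (not data from this study):

```python
import math

def mae(pred, true):
    """Mean absolute error (eV): equal weight to every deviation."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def rmse(pred, true):
    """Root-mean-square error (eV): quadratic weight, so outliers dominate."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

# Illustrative predictions vs DFT references (eV); the single large 0.2 eV
# miss inflates RMSE relative to MAE.
pred = [-1.20, -0.85, -1.05, -0.60]
dft = [-1.10, -0.90, -1.00, -0.80]
err_mae = mae(pred, dft)    # 0.100 eV
err_rmse = rmse(pred, dft)  # ~0.117 eV
```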

FIG. 4.

(a) Combined dataset parity plot with 950 datapoints split between train/validation/testing. (b) Prediction of excluded datapoints not included in the model training/val/test phase to confirm AGRA is capable of accurate prediction across multiple material systems and adsorbates. A visualization of the newly designed ORR HEA system is provided and highlighted in the parity plot.


4. Pipeline performance

As seen in Table II, the ALIGNN model performed best, likely due to its consideration of bond angles within a crystal system as well as the adsorption angle of asymmetric adsorbates such as COOH and OH. Although all three models approached DFT-level accuracy (0.1 eV),26 the ALIGNN model was clearly the strongest-performing GNN with AGRA.

TABLE II.

Model performance summary—Average mean absolute error of five model re-trains (in eV).

Dataset | Graphing type | CGCNN | NNConv | ALIGNN
CO      | AGRA          | 0.140 | 0.084  | 0.095
CO      | OCP           | 0.151 | 0.166  | 0.107
CHO     | AGRA          | 0.099 | 0.102  | 0.095
CHO     | OCP           | 0.200 | 0.159  | 0.096
COOH    | AGRA          | 0.065 | 0.100  | 0.065
COOH    | OCP           | 0.081 | 0.185  | 0.061
CO2RR   | AGRA          | 0.081 | 0.114  | 0.067
CO2RR   | OCP           | 0.177 | 0.165  | 0.091
O       | AGRA          | 0.123 | 0.136  | 0.047
O       | OCP           | 0.124 | 0.086  | 0.058
OH      | AGRA          | 0.074 | 0.072  | 0.034
OH      | OCP           | 0.062 | 0.052  | 0.038
ORR     | AGRA          | 0.086 | 0.097  | 0.042
ORR     | OCP           | 0.083 | 0.094  | 0.048
All     | AGRA          | 0.300 | 0.260  | 0.068
All     | OCP           | 0.329 | 0.270  | 0.104

Most notably, the accuracy loss associated with combining multiple adsorbate datasets was also least pronounced for the ALIGNN model. This shows great potential for extrapolative uses in predicting the performance of material systems for energy reactions without explicitly performing experimental or computational studies. Although composition exploration studies have been performed before, their extrapolation range has not extended as far as the study in Sec. III, which investigated a new adsorbate on a known HEA.27

When comparing AGRA to OCP graph representation approaches, it is evident that AGRA has similar or superior model performance at a reduced computational cost (Table II). For the individual adsorbate datasets, AGRA performed similarly to or superior to the OCP representation, but for the datasets that possessed multiple adsorbates, AGRA’s representation performed considerably better. This is likely due to the more consistent number of nodes and the non-fully connected edge representation AGRA provides.

In summary, this work has presented an algorithm to analyze and extract the chemical environment of an adsorption site on different metallic substrates. This automated graph representation adds an additional layer of descriptiveness to neural networks, which will ultimately allow them to develop a deeper understanding of the underlying physics of catalysis. This closed system improves model flexibility when combining databases at a reduced computational cost and can accelerate the optimization and discovery of new catalysts for crucial energy reactions. By removing almost all of the manual curation associated with developing DFT-powered datasets for materials design, this package increases the robustness, transferability, and explorative capabilities of researchers, allowing them to focus on theoretical mechanisms instead of the software and technical hurdles of curating datasets. To prove this claim, AGRA was applied to two different catalytic reactions to show the exceptional performance of the GNNs on metallic substrates. The ORR dataset obtained RMSDs of 0.048 (*OH) and 0.059 (*O) eV on 871- and 998-datapoint datasets with minimal preparation. To show the versatile learning capabilities of AGRA, a CO2RR dataset with 691 datapoints split between three adsorbates (CHO, CO, and COOH) was tested, obtaining RMSD values of 0.123, 0.125, and 0.093 eV. In each dataset where the OCP representation was benchmarked against AGRA, identical or similar accuracy was obtained for single-adsorbate datasets. For multiple-adsorbate datasets, however, AGRA performed better while also reducing the computational cost associated with training the GNNs. The dataset-combination functionality of AGRA was further highlighted by combining an ORR and a CO2RR database to evaluate a never-before-studied ORR system, showing AGRA's potential to harness the largest publicly available materials informatics databases for material design exploration.

The supplementary material accompanying this work includes the technical limitations of the framework together with strategies to address them, visualizations of AGRA's workflow, and the hyperparameters of the models discussed in the work.

The authors acknowledge financial support from collaborative R&D programs and initiatives at the National Research Council through the Artificial Intelligence for Design (AI4D) Challenge program and the University of Waterloo. They also acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), as well as the Digital Research Alliance of Canada, for providing computing resources at the SciNet, Calcul Québec, and WestGrid consortia.

The authors have no conflicts to disclose.

Conceptualization, Conrard Tetsassi; Implementation and testing, Conrard Tetsassi and Zachary Gariepy; Manuscript writing, Zachary Gariepy and Conrard Tetsassi; CO2RR dataset generation, Zhiwen Chen and Zachary Gariepy. All authors discussed the results and contributed to the final manuscript.

Zachary Gariepy: Conceptualization (equal); Data curation (equal); Methodology (equal); Writing – original draft (equal); Writing – review & editing (equal). Zhiwen Chen: Data curation (equal); Writing – review & editing (equal). Isaac Tamblyn: Writing – review & editing (equal). Chandra Veer Singh: Writing – review & editing (equal). Conrard Giresse Tetsassi Feugmo: Conceptualization (equal); Methodology (equal); Writing – original draft (equal); Writing – review & editing (equal).

The main datasets that support the findings of this study are openly available in the works of Chen et al.24 and Batchelor et al.6 at https://doi.org/10.1021/acscatal.2c03675 and https://doi.org/10.1016/j.joule.2018.12.015, respectively. The AGRA code, as well as the code used for the ML training and figure plotting, can be found at https://github.com/Feugmo-Group/AGRA.28

1. Y. Xin, S. Li, Y. Qian, W. Zhu, H. Yuan, P. Jiang, R. Guo, and L. Wang, ACS Catal. 10, 11280 (2020).
2. J. K. Pedersen, T. A. A. Batchelor, A. Bagger, and J. Rossmeisl, ACS Catal. 10, 2169 (2020).
3. T. Löffler, A. Ludwig, J. Rossmeisl, and W. Schuhmann, Angew. Chem., Int. Ed. 60, 26894 (2021).
4. D. Wu, K. Kusada, T. Yamamoto, T. Toriyama, S. Matsumura, I. Gueye, O. Seo, J. Kim, S. Hiroi, O. Sakata, S. Kawaguchi, Y. Kubota, and H. Kitagawa, Chem. Sci. 11, 12731 (2020).
5. Y. Yan, D. Lu, and K. Wang, Comput. Mater. Sci. 199, 110723 (2021).
6. T. A. A. Batchelor, J. K. Pedersen, S. H. Winther, I. E. Castelli, K. W. Jacobsen, and J. Rossmeisl, Joule 3, 834 (2019).
7. X. Ma, Z. Li, L. E. K. Achenie, and H. Xin, J. Phys. Chem. Lett. 6, 3528 (2015).
8. K. Li and W. Chen, Mater. Today Energy 20, 100638 (2021).
9. S. Nayak, S. Bhattacharjee, J.-H. Choi, and S. C. Lee, J. Phys. Chem. A 124, 247 (2020).
10. Z. Lu, Z. W. Chen, and C. V. Singh, Matter 3, 1318 (2020).
11. Z. W. Chen, Z. Lu, L. X. Chen, M. Jiang, D. Chen, and C. V. Singh, Chem. Catal. 1, 183 (2021).
12. T. Xie and J. C. Grossman, Phys. Rev. Lett. 120, 145301 (2018); arXiv:1710.10324.
13. S. Back, J. Yoon, N. Tian, W. Zhong, K. Tran, and Z. W. Ulissi, J. Phys. Chem. Lett. 10, 4401 (2019).
14. S. Deshpande, T. Maxson, and J. Greeley, npj Comput. Mater. 6, 79 (2020).
15. A. H. Larsen, J. Phys.: Condens. Matter 29, 273002 (2017).
16. A. Hagberg, P. Swart, and D. S. Chult, "Exploring network structure, dynamics, and function using NetworkX," Los Alamos National Laboratory Report No. LA-UR-08-5495 (2008).
17. C. Chen, W. Ye, Y. Zuo, C. Zheng, and S. P. Ong, Chem. Mater. 31(9), 3564–3572 (2019).
18. L. Chanussot, A. Das, S. Goyal, T. Lavril, M. Shuaibi, M. Riviere, K. Tran, J. Heras-Domingo, C. Ho, W. Hu, A. Palizhati, A. Sriram, B. Wood, J. Yoon, D. Parikh, C. L. Zitnick, and Z. Ulissi, ACS Catal. 11(10), 6059–6072 (2021).
19. M. Fey and J. E. Lenssen, in ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
20. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, in Proceedings of the 34th International Conference on Machine Learning (ICML'17) (JMLR.org, 2017), Vol. 70, pp. 1263–1272.
21. M. Simonovsky and N. Komodakis, in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2017); arXiv:1704.02901 (2017).
22. O. Vinyals, S. Bengio, and M. Kudlur, "Order matters: Sequence to sequence for sets," in Proceedings of the International Conference on Learning Representations (ICLR) 2016; arXiv:1511.06391 [stat.ML] (2016).
23. K. Choudhary and B. DeCost, npj Comput. Mater. 7(1), 185 (2021).
24. Z. W. Chen, Z. Gariepy, L. Chen, X. Yao, A. Anand, S.-J. Liu, C. G. Tetsassi Feugmo, I. Tamblyn, and C. V. Singh, ACS Catal. 12, 14864 (2022).
25. P. Xie, Y. Yao, Z. Huang, Z. Liu, J. Zhang, T. Li, G. Wang, R. Shahbazian-Yassar, L. Hu, and C. Wang, Nat. Commun. 10(1), 4011 (2019).
26. T. Zhong, Nature 581(7807), 178–183 (2020).
27. N. K. Katiyar, G. Goel, and S. Goel, Emergent Mater. 4, 1635 (2021).
28. Feugmo Research Group, "AGRA: Automatic graph representation algorithm for heterogeneous catalysis," https://github.com/Feugmo-Group/AGRA, 2023.

Supplementary Material