Metal–organic frameworks (MOFs), as novel porous crystalline materials with high porosity and a large specific surface area, have been increasingly utilized for CO2 adsorption. Machine learning (ML) combined with molecular simulations is used to identify MOFs with high CO2 adsorption capacity from millions of MOF structures. In this study, 23 structural and molecular features and 765 calculated features were proposed for the ML model and trained on a hypothetical MOF dataset for CO2 adsorption at different pressures. The calculated features improved the prediction accuracy of the ML model by 15%–20% and revealed its interpretability, consistent with the analysis of the interaction potential. Subsequently, the importance of the relevant features was ranked at different pressures. Regardless of the pressure, the molecular structure and pore size were the most critical factors. van der Waals force-related descriptors gained more competitive advantages at low pressures, whereas electrical-field-related descriptors gradually dominated at high pressures. Overall, this study provides a novel perspective to guide the initial high-throughput screening of MOFs as high-performance CO2 adsorption materials.
I. INTRODUCTION
Approximately 36 × 109 tons of carbon dioxide (CO2) are produced globally each year,1 making it the largest contributor to the global greenhouse effect.2 The largest source of CO2 emissions is the combustion of fossil fuels used for power generation and energy supply, accounting for ∼90% of the emissions.1,3,4 With the continuously increasing demand for energy, a method that effectively absorbs CO2 from exhaust gases and emission processes should be investigated to minimize the impact of CO2 on the atmosphere.5 Therefore, the development of new materials with high capacity, selectivity, and tunability for active CO2 adsorption is crucial.
Metal–organic frameworks (MOFs) are novel porous crystalline materials composed of building blocks that contain metal nodes and organic linkers.6 Compared with other frameworks, MOF materials possess high porosity and a large specific surface area,7,8 thereby prompting research interest in their applicability for gas adsorption and storage.9 MOFs with large surface areas and pore volumes are valued for CO2 adsorption10–12 because they provide abundant adsorption sites and void space to accommodate CO2 molecules.13 MOFs with different physical and chemical properties have emerged owing to the diversity of metal nodes and organic linkers and their complex topology. Thus, MOF molecules that are biased toward CO2 adsorption could be identified.14,15
Clarifying the pore chemistry and molecular structure of MOFs in relation to CO2 adsorption at different pressures is a key step in screening MOF molecules. However, screening for CO2 adsorption is preferable for the complex development of MOF materials, which is a lengthy and expensive process. Grand canonical Monte Carlo (GCMC) simulations, which mainly use van der Waals and Coulomb forces to simulate interatomic potentials, have proven effective in estimating the gas adsorption capacity of MOFs.16,17 Several studies have employed GCMC to predict the adsorption of gases such as CO2,18 H2,19 CH4,20 and I2,21 demonstrating promising performance. However, gas adsorption for simulation requires manual intervention and a large amount of computational resources.
To improve the method, the machine learning (ML) algorithm has become one of the mainstream methods to predict material properties and has been applied in chemistry and materials-related fields.22–24 Meanwhile, ML is also applied in the prediction of gas adsorption properties of MOFs25–27 and is extensively used in other framework materials, such as covalent organic frameworks (COFs).28–30 The main process of the ML model takes a large number of material descriptors as inputs and simulation or experimental results as labels for training, which makes predictions and assessments of material properties. Compared with traditional simulation methods, ML reduces computational time and cost and rarely requires advanced domain knowledge. However, training a ML model commonly requires a large-scale and high-quality dataset. Fortunately, over the past decade, numerous high-quality, large-scale MOF datasets have emerged, such as the Computation-Ready Experimental (CoRE) 2019 MOF database,31 Quantum MOF (QMOF) dataset,32 and hypothetical metal–organic framework (hMOF) dataset.33 Most related studies considered the structural information of MOF molecules as descriptors, such as void fraction and lowest cavity diameter.34–37 Some studies have added molecular descriptors to reveal the chemical reactions of the adsorption process, such as atomic species and functional groups.38,39 However, the related work only utilizes the ML model to predict the CO2 adsorption of MOFs, which also needs to provide a theoretical basis for the adsorption process of MOFs through the working mechanism of the ML model.
According to the “AI for Science”40 concept, the ML model, as a “black-box” model,41 is required to have predictive accuracy and model interpretability.42 The interpretability of the ML model was revealed through feature importance, which comprehensively evaluates the role of each feature in the ML model, including cumulative information gain, permutation importance,43 and Shapley additive explanation (SHAP) importance.44 Evaluating the importance of features using these methods and providing interpretations based on existing domain knowledge can improve the credibility of ML models.
Each MOF molecule is composed of metal nodes, organic linkers, and topologies. Bucior et al. proposed a generalization of each MOF molecule with a character sequence called MOF-id to fully express the properties of MOF molecules.45 In the MOF-id sequence, the Simplified Molecular Input Line Entry System (SMILES)46 was utilized for the building blocks that contain the chemical information of the atomic and interatomic connections. The structural information is presented in the MOF-id in terms of topologies and catenation, which are defined from the Reticular Chemistry Structure Resource database.47 Consequently, MOF-id retained most of the chemical and structural information for the prediction using the ML model while improving the processing speed of the RDkit.
In this work, we propose a machine learning model that builds on existing work by using additional calculated descriptors as feature inputs. In order to reveal the CO2 adsorption mechanism of MOFs, we predict the CO2 adsorption capacity of MOFs at five different pressures. We conduct comparative experiments by sequentially adding structural features, molecular features, and calculated features to the feature lists. The comparison revealed the effect of individual features on the CO2 adsorption of the MOFs and demonstrated the effectively improved prediction accuracy of the ML model with the addition of calculated descriptors. Finally, we combine feature importance with domain knowledge such as intermolecular force, secondary bond, and electric potential to reveal the reasons for the top-ranked features, reflecting the interpretability of the ML model.
II. METHODS
A. Data collection
In this work, we take the hMOF database as the dataset, including MOF molecules generated by Materials Studio. The hMOF database is created through a computational measure using a given chemical library of 102 building blocks, which calculates the pore-size distribution and surface area. The dataset, a crucial component of machine learning, undergoes data cleaning and preprocessing before being input to the ML model to ensure training and generalization capabilities. We primarily use the hMOF dataset as the original data, comprising 137 652 MOF structure files along with their CO2 adsorption capacities at different pressures. Building upon the hMOF dataset, we further collect the MOF-id for each MOF molecule. Among the 137 652 MOF molecules, 102 859 possess explicit MOF-id sequences. Then, MOFs containing carbon atoms with more than four bonds, oxygen atoms with more than two bonds, nitrogen atoms with more than three bonds, and halogen atoms with more than one bond are discarded by RDkit. After excluding anomalous molecules, we ultimately obtained 100 240 MOF molecules along with their MOF-id, illustrated in Fig. 1.
B. Feature expand for MOFs
In the research on materials science, the selection of relevant descriptors as feature inputs for the ML model can be affected by the diversity of materials as well as their large size, especially MOFs. To maximize the use of data from the hMOF dataset, we collect the MOF-id for each MOF molecule in the hMOF dataset, which contains the MOF molecule’s SMILES, topologies, structure, and catenation.
Simultaneously, the hMOF dataset provides structural information for each MOF molecule, including void fraction, gravimetric surface area, volumetric surface area, lowest cavity diameter (LCD), and pore limiting diameter (PLD). We directly use these five pieces of information as structural features for ML model input. By using a Python program, we systematically parse and extract information from each MOF-id, including the types and quantities of atoms, the types and quantities of common functional groups, topological structure, and catenation. For metal atoms in MOFs, we conduct additional processing, selecting atomic number, atomic weight, atomic radius, polarizability, electron affinity, and Mulliken electronegativity as features for each metal atom. These features are utilized as molecular features for ML model input.
To further characterize the properties of each MOF, we need to utilize additional features to express the physical and chemical properties of MOF molecules, which rely on organic linkers calculated by Mordred.50 After obtaining ∼1600 features calculated by Mordred, we perform data cleaning and sparse value removal. Finally, we obtain 765 features as calculated features for ML model input after the preprocessing.
C. ML models
Considering the ML model’s prediction accuracy, explainability, and complexity, the Xgboost51 (eXtreme gradient boosting) based on decision tree models is confirmed to predict the CO2 adsorption properties of MOFs in a supervised manner with various features. The models are trained on hMOF with the labels of CO2 adsorption in mol/kg at 0.01, 0.05, 0.1, 0.5, and 2.5 bars of pressure. These pressures can cover the range for CO2 adsorption using pressure-swing adsorption (PSA) and vacuum-swing adsorption (VSA) in applications52 such as natural gas purification, landfill gas separation, and flue gas separation.
To prevent the ML model from overfitting, the dataset is split into training and test sets with a regular ratio of 0.8/0.2. After the training, the model with the best test performance would be recorded, and the model parameters would be used to calculate SHAP importance. According to the splitting rule, the ML model had 86 122 hMOF data for training sets and 16 737 hMOF data for test sets.
In order to explain the operational mechanism of machine learning models and to expose the role played by individual features in the prediction process, we use the SHAP importance to calculate each feature’s importance, which calculates the marginal contribution of the features to the model output.
III. RESULTS AND DISCUSSION
A. Predictions of ML model for CO2 adsorption
Then, the XGBoost model is trained on the hMOF dataset with structure features, molecular features, and calculated features. To optimize and validate the models robustly, we tune the hyperparameters of the ML model by using cross-validation. Different hyperparameters are selected for each of the models using different feature lists, which contain max depth, learning rate, and number of trees.
While the simplest feature list is established by using only structure features, molecular features and calculated features are added to build more comprehensive feature lists such as “structure features and molecular features” and “structure features, molecular features, and calculated features.” The ML models are trained by using each of these three feature lists at five different pressures for comparison.
As shown in Table I, R2, MAE, and MSE of the test sets in various pressures, which contain three groups with different feature lists, a clear trend is demonstrated. For example, when structure features are used to predict the CO2 adsorption of MOFs, the R2 of the test set in 2.5 bars is computed as 0.737. When molecular features are added to the feature list, the R2 of the test set in 2.5 bars increases to 0.926. After adding the calculated features to the ML model, the R2 of the test set in 2.5 bars ends up at 0.948. Similarly, the addition of features can also be effective in increasing the R2 at lower pressures. At 0.01 bar, R2 of the test set can be improved from 0.429 to 0.721 by adding molecular features to the structural features. When the calculated features are further added to the ML model, the R2 metric is further improved to 0.765.
Structure . | 0.01 bar . | 0.05 bar . | 0.1 bar . | 0.5 bar . | 2.5 bars . |
---|---|---|---|---|---|
MAE | 0.066 736 901 | 0.168 290 27 | 0.245 032 675 | 0.530 866 008 | 1.001 906 015 |
MSE | 0.028 387 632 | 0.093 966 002 | 0.160 506 406 | 0.556 188 405 | 1.926 150 66 |
R2 | 0.429 084 809 | 0.618 243 386 | 0.668 305 589 | 0.722 719 795 | 0.737 439 525 |
Structure . | 0.01 bar . | 0.05 bar . | 0.1 bar . | 0.5 bar . | 2.5 bars . |
---|---|---|---|---|---|
MAE | 0.066 736 901 | 0.168 290 27 | 0.245 032 675 | 0.530 866 008 | 1.001 906 015 |
MSE | 0.028 387 632 | 0.093 966 002 | 0.160 506 406 | 0.556 188 405 | 1.926 150 66 |
R2 | 0.429 084 809 | 0.618 243 386 | 0.668 305 589 | 0.722 719 795 | 0.737 439 525 |
Molecule . | 0.01 bar . | 0.05 bar . | 0.1 bar . | 0.5 bar . | 2.5 bars . |
---|---|---|---|---|---|
MAE | 0.045 397 203 | 0.108 535 914 | 0.153 880 269 | 0.310 369 325 | 0.527 457 311 |
MSE | 0.013 893 511 | 0.042 723 885 | 0.069 751 42 | 0.206 038 189 | 0.541 793 487 |
R2 | 0.720 581 965 | 0.826 425 245 | 0.855 855 248 | 0.897 282 448 | 0.926 146 195 |
Molecule . | 0.01 bar . | 0.05 bar . | 0.1 bar . | 0.5 bar . | 2.5 bars . |
---|---|---|---|---|---|
MAE | 0.045 397 203 | 0.108 535 914 | 0.153 880 269 | 0.310 369 325 | 0.527 457 311 |
MSE | 0.013 893 511 | 0.042 723 885 | 0.069 751 42 | 0.206 038 189 | 0.541 793 487 |
R2 | 0.720 581 965 | 0.826 425 245 | 0.855 855 248 | 0.897 282 448 | 0.926 146 195 |
Calculate | 0.01 bar . | 0.05 bar . | 0.1 bar . | 0.5 bar . | 2.5 bars . |
---|---|---|---|---|---|
MAE | 0.043 003 041 | 0.099 597 64 | 0.139 757 256 | 0.266 536 593 | 0.433 903 647 |
MSE | 0.011 697 272 | 0.034 681 133 | 0.057 027 508 | 0.152 927 626 | 0.374 854 333 |
R2 | 0.764 751 412 | 0.859 100 615 | 0.882 149 84 | 0.923 760 001 | 0.948 902 267 |
Calculate | 0.01 bar . | 0.05 bar . | 0.1 bar . | 0.5 bar . | 2.5 bars . |
---|---|---|---|---|---|
MAE | 0.043 003 041 | 0.099 597 64 | 0.139 757 256 | 0.266 536 593 | 0.433 903 647 |
MSE | 0.011 697 272 | 0.034 681 133 | 0.057 027 508 | 0.152 927 626 | 0.374 854 333 |
R2 | 0.764 751 412 | 0.859 100 615 | 0.882 149 84 | 0.923 760 001 | 0.948 902 267 |
In terms of the mean error, the MAE and MSE of the ML model with only structural features were 1.001 and 1.926 at 2.50 bars. The addition of molecular features to the ML model effectively reduced the MAE and MSE to 0.527 and 0.542, respectively. Finally, adding the calculated features to the ML model decreased the MAE and MSE to 0.434 and 0.375. For low pressure, the changes in MAE and MSE were similar to those in R2. At 0.01 bar, the addition of the molecular features to the structural features reduced the MAE and MSE from 0.067 and 0.028 to 0.045 and 0.014, respectively. Similarly, the final addition of computational features to the feature list reduced the MAE and MSE to 0.043 and 0.011.
Figure 2 displays the changes in MAE, MSE, and R2 of the ML model after adding different lists of features, demonstrating a steady optimization trend. Overall, adding molecular features to structural features can effectively increase the predictive accuracy of the model by 50%, and further modification with calculated features can increase the accuracy by another 20%.
At different pressures, there is a significant variance in the distribution of CO2 adsorption data. Therefore, in addition to the analysis using the above metrics, this work further generates plots describing the distribution of model predictions vs actual values. As shown in Fig. 3, it is clear that the incorporation of structural, molecular, and calculated features into the ML model improves the prediction accuracy independent of the applied pressure.
B. Feature importance analysis
In the preceding chapter, our findings indicate that augmenting the ML process by incorporating molecular features and calculated features into structural features can effectively enhance predictive accuracy and generalization capability. This chapter ranks the feature importance of different feature lists at each pressure, using the SHAP importance method to better express the role of each feature in the prediction of the ML models. After the descriptors are determined, the relationship between these descriptors and the simulated CO2 adsorption data of MOFs is examined. The correlation coefficients are calculated between structural features and molecular features, which creates a heatmap (Fig. 4) to visually depict the obvious correlations among various features, providing a basis for ML model analysis. For instance, within the structural features, various features exhibit high positive correlations among themselves. However, catenation shows a negative correlation among these features.
As shown in Fig. 5, we rank and quantify the importance of the structural features at different pressures. As the pressure increased, the importance of the void fraction (VF) gradually increased from second to first place. Moreover, samples with high VF values exhibit a higher positive correlation with the output of the model. With an increase in the pressure and amount of CO2 adsorbed, the accommodation of CO2 becomes progressively more important. The VF features indicated that more CO2 molecules could be accommodated during the adsorption process. Meanwhile, the features representing the surface area ranked highly because the surface area affects the number of MOF adsorption sites in the CO2 adsorption process.
After incorporating the molecular features, Fig. 6 shows the top 20 most important structural and molecular features. Structural features occupy the top positions, indicating their predominant role during the CO2 adsorption of MOFs, with the elements, molecular structures, and functional groups playing crucial supporting roles. From the feature importance plots under various pressures, halogen elements (such as Br, F, and I) consistently maintain high importance, ranking just below the structural features. Halogens are highly electronegative elements, and their introduction may increase the overall electronegativity of MOFs, which would result in stronger intermolecular interactions, enhancing the affinity for CO2 and increasing the CO2 adsorption capacity.
The carbon atom occupies a dominant position in the list of molecular features. The number of carbon atoms indirectly indicates the size of the organic linkers, further influencing the structural features of MOFs. This is evident in the feature correlation analysis plot (Fig. 4), where the number of carbon atoms with VF and LCD shows correlation coefficients of 0.31 and 0.37, respectively, indicating a positive correlation between the number of carbon atoms and structural features, thereby affecting the CO2 adsorption in MOFs. However, the distribution of carbon atoms at specific sites, such as aromatics, can impact CO2 adsorption. Similarly, the importance of the oxygen atoms is mainly reflected in their high correlation with the number of carbon atoms, with a correlation coefficient of 0.59. In addition, oxygen atoms are the major components of functional groups, such as hydroxyl groups, which influence CO2 adsorption. Nitrogen atoms are crucial functional groups that constitute amines, amino acids, pyridines, and other alkaline functional groups, demonstrating their degree of importance. Among other molecular features, the amount of catenation exhibits a clear negative correlation with structural features, with correlation coefficients ranging from −0.34 to −0.51, thereby affecting the predictions.
With the addition of calculated features, the ML model achieves optimal performance on the test set. Figure 7 displays the changes in feature importance after adding calculated features at different pressures. Following the introduction of the new feature importance rankings, we summarize the frequency of occurrence and trend of importance changes with increasing pressure for calculated features, as shown in Fig. 8. To clearly illustrate the relationships between features, we also create a heatmap (Fig. 9) for the correlation analysis of the 24 significant calculated features and structural features.
Among the 24 calculated features, the importance of some features generally does not change with increasing pressure. For instance, the importance of the AATSC0p feature, which is calculated by employing the autocorrelation coefficients and atomic polarizability as weights, maintained a relatively stable ranking regardless of the pressure. In the GCMC simulation process, the van der Waals forces are a major component that significantly influences the overall adsorption performance when the strength of the van der Waals bonds increases with the polarizability of the participating atoms.
It is noteworthy that features appearing only once are mainly distributed at pressures of 0.01 and 2.50 bars. ATSC6m, ATSC6Z, and AATSC0m, which are calculated by employing autocorrelation coefficients using atomic mass and atomic number as weights to express the distribution of atomic masses in the MOFs, only appeared at 0.01 bar. Meanwhile, ATSC0i, GATS1dv, and AATS2pe, which represent the charge distribution in the MOFs using the atomic ionization potential, valence electrons, and Pauling electronegativity as weights, appeared only at a pressure of 2.5 bars. With increasing pressure, the quantity of adsorbed CO2 in the MOFs continues to increase. The increasing number of CO2 molecules in the pores leads to a greater average distance between CO2 molecules and adsorption sites. During CO2 adsorption, the interaction energies consisted of Coulomb and van der Waals forces. With increasing pressure, the proportion of Coulomb forces in the overall potential field gradually increases, whereas the proportion of van der Waals forces gradually decreases. Both the atomic mass and number, which are highly correlated features, significantly affected the van der Waals forces, making them particularly important at lower pressures. Conversely, features related to the electric field potential exhibit high importance at 2.5 bars, indicating the significance of Coulomb forces at higher pressures.
In addition, the importance of some features increases with the rise in pressure. The SsBr and SsF features represent the sums of Br and F atoms in MOF molecules, respectively. Because halogen elements generally have high electronegativity, SsBr and SsF features maintained high correlations of 0.54 and 0.41 with AATS2pe (the electronegativity feature) and exhibited a correlated trend with the increasing importance of Coulomb forces in the feature correlation analysis plot (Fig. 9). In particular, SsBr shows a high correlation of 0.84 with AATSC0p (polarizability feature), further emphasizing its importance. Moreover, the importance of the GATS1dv feature, which is a self-correlation feature calculated based on valence electrons, gradually increased, reflecting the distribution of molecular surface charges and its significant impact on intermolecular Coulomb forces.
Conversely, the importance of some features decreases with the increase in pressure. The importance of the metal atomic number as a crucial component relative to molecular mass decreases with the influence of van der Waals forces. Among other features, the nG12FaRing, NaaaC. and nFaRing features are highly correlated (0.64–0.77), all reflecting properties related to aromatic rings in the molecule. NaaaC represents the number of carbon atoms connecting multiple aromatic rings, providing further insights into the quantity of polycyclic aromatic hydrocarbons in the MOF molecule. We primarily focus on the role of aromatic rings in π–π stacking. Since the structure of polycyclic aromatic hydrocarbons can enhance the π–π stacking effect, the appearance of NaaaC in the importance ranking confirms this observation. Simultaneously, as the secondary bond, the importance of π–π stacking gradually decreases with the increase in molecular distance.
IV. CONCLUSION
To conclude, we propose a ML model for predicting the properties of CO2 adsorption in MOFs without calculation and simulation. We introduce a large number of calculated features and jointly train them with structural and molecular features on the hMOF dataset to enhance the accuracy of CO2 adsorption predictions at various pressures. Compared to using only structural and molecular descriptors as the feature lists, the addition of calculated features improves the accuracy of the ML model by 15%–20% among different pressures. Furthermore, the addition of calculated features provides a more explicit interpretation of the operational mechanism of the ML model. Visualizing the results of feature importance indicates that in the process of CO2 adsorption, the structure and pore size of MOF molecules play a predominant role, while calculated features such as molecular polarizability, electronegativity, and atomic weight also hold major importance. The ranking of calculated features with changing pressures aligns with our discussion on the variation between van der Waals forces and Coulomb forces during CO2 adsorption, demonstrating that this ML model can be interpreted through pore chemistry, molecular dynamics, and chemical reactions. Utilizing automatically extracted descriptors as input, the ML model can rapidly and accurately predict the CO2 adsorption of MOF materials on a large scale. Simultaneously, the interpretability of the ML model in the adsorption process elucidates the relative importance of factors such as molecular structure, atomic weight, and polarizability in the adsorption performance of materials with different pressures, laying the foundation for future MOF material searches.
SUPPLEMENTARY MATERIAL
The supplementary material contains additional information, including 15 plots of the distribution of results predicted by different feature sets at air pressures ranging from 0.01 to 2.5 bars. The names and specific descriptions of the features were included in the calculated feature set.
ACKNOWLEDGMENTS
This study was funded by the National Key R&D Program of China (Grant No. 2022YFB4703403) and also by the Academic Excellence Foundation of BUAA for Ph.D. students. The funder played no role in the study design, data collection, analysis and interpretation of data, or the writing of this manuscript.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
G.S. analyzed and interpreted the data and revised the manuscript. Y.T. performed the study, analyzed and interpreted the data and was a major contributor in writing the manuscript. All authors read and approved the final manuscript.
Yukun Teng: Data curation (equal); Software (equal); Writing – original draft (equal). Guangcun Shan: Funding acquisition (equal); Project administration (lead); Resources (equal); Software (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.