Metal–organic frameworks (MOFs), as novel porous crystalline materials with high porosity and a large specific surface area, have been increasingly utilized for CO2 adsorption. Machine learning (ML) combined with molecular simulations is used to identify MOFs with high CO2 adsorption capacity from millions of MOF structures. In this study, 23 structural and molecular features and 765 calculated features were proposed for the ML model and trained on a hypothetical MOF dataset for CO2 adsorption at different pressures. The calculated features improved the prediction accuracy of the ML model by 15%–20% and revealed its interpretability, consistent with the analysis of the interaction potential. Subsequently, the importance of the relevant features was ranked at different pressures. Regardless of the pressure, the molecular structure and pore size were the most critical factors. van der Waals force-related descriptors gained more competitive advantages at low pressures, whereas electrical-field-related descriptors gradually dominated at high pressures. Overall, this study provides a novel perspective to guide the initial high-throughput screening of MOFs as high-performance CO2 adsorption materials.

Approximately 36 × 109 tons of carbon dioxide (CO2) are produced globally each year,1 making it the largest contributor to the global greenhouse effect.2 The largest source of CO2 emissions is the combustion of fossil fuels used for power generation and energy supply, accounting for ∼90% of the emissions.1,3,4 With the continuously increasing demand for energy, a method that effectively absorbs CO2 from exhaust gases and emission processes should be investigated to minimize the impact of CO2 on the atmosphere.5 Therefore, the development of new materials with high capacity, selectivity, and tunability for active CO2 adsorption is crucial.

Metal–organic frameworks (MOFs) are novel porous crystalline materials composed of building blocks that contain metal nodes and organic linkers.6 Compared with other frameworks, MOF materials possess high porosity and a large specific surface area,7,8 thereby prompting research interest in their applicability for gas adsorption and storage.9 MOFs with large surface areas and pore volumes are valued for CO2 adsorption10–12 because they provide abundant adsorption sites and void space to accommodate CO2 molecules.13 MOFs with different physical and chemical properties have emerged owing to the diversity of metal nodes and organic linkers and their complex topology. Thus, MOF molecules that are biased toward CO2 adsorption could be identified.14,15

Clarifying the pore chemistry and molecular structure of MOFs in relation to CO2 adsorption at different pressures is a key step in screening MOF molecules. However, screening for CO2 adsorption is preferable for the complex development of MOF materials, which is a lengthy and expensive process. Grand canonical Monte Carlo (GCMC) simulations, which mainly use van der Waals and Coulomb forces to simulate interatomic potentials, have proven effective in estimating the gas adsorption capacity of MOFs.16,17 Several studies have employed GCMC to predict the adsorption of gases such as CO2,18 H2,19 CH4,20 and I2,21 demonstrating promising performance. However, gas adsorption for simulation requires manual intervention and a large amount of computational resources.

To improve the method, the machine learning (ML) algorithm has become one of the mainstream methods to predict material properties and has been applied in chemistry and materials-related fields.22–24 Meanwhile, ML is also applied in the prediction of gas adsorption properties of MOFs25–27 and is extensively used in other framework materials, such as covalent organic frameworks (COFs).28–30 The main process of the ML model takes a large number of material descriptors as inputs and simulation or experimental results as labels for training, which makes predictions and assessments of material properties. Compared with traditional simulation methods, ML reduces computational time and cost and rarely requires advanced domain knowledge. However, training a ML model commonly requires a large-scale and high-quality dataset. Fortunately, over the past decade, numerous high-quality, large-scale MOF datasets have emerged, such as the Computation-Ready Experimental (CoRE) 2019 MOF database,31 Quantum MOF (QMOF) dataset,32 and hypothetical metal–organic framework (hMOF) dataset.33 Most related studies considered the structural information of MOF molecules as descriptors, such as void fraction and lowest cavity diameter.34–37 Some studies have added molecular descriptors to reveal the chemical reactions of the adsorption process, such as atomic species and functional groups.38,39 However, the related work only utilizes the ML model to predict the CO2 adsorption of MOFs, which also needs to provide a theoretical basis for the adsorption process of MOFs through the working mechanism of the ML model.

According to the “AI for Science”40 concept, the ML model, as a “black-box” model,41 is required to have predictive accuracy and model interpretability.42 The interpretability of the ML model was revealed through feature importance, which comprehensively evaluates the role of each feature in the ML model, including cumulative information gain, permutation importance,43 and Shapley additive explanation (SHAP) importance.44 Evaluating the importance of features using these methods and providing interpretations based on existing domain knowledge can improve the credibility of ML models.

Each MOF molecule is composed of metal nodes, organic linkers, and topologies. Bucior et al. proposed a generalization of each MOF molecule with a character sequence called MOF-id to fully express the properties of MOF molecules.45 In the MOF-id sequence, the Simplified Molecular Input Line Entry System (SMILES)46 was utilized for the building blocks that contain the chemical information of the atomic and interatomic connections. The structural information is presented in the MOF-id in terms of topologies and catenation, which are defined from the Reticular Chemistry Structure Resource database.47 Consequently, MOF-id retained most of the chemical and structural information for the prediction using the ML model while improving the processing speed of the RDkit.

In this work, we propose a machine learning model that builds on existing work by using additional calculated descriptors as feature inputs. In order to reveal the CO2 adsorption mechanism of MOFs, we predict the CO2 adsorption capacity of MOFs at five different pressures. We conduct comparative experiments by sequentially adding structural features, molecular features, and calculated features to the feature lists. The comparison revealed the effect of individual features on the CO2 adsorption of the MOFs and demonstrated the effectively improved prediction accuracy of the ML model with the addition of calculated descriptors. Finally, we combine feature importance with domain knowledge such as intermolecular force, secondary bond, and electric potential to reveal the reasons for the top-ranked features, reflecting the interpretability of the ML model.

In this work, we take the hMOF database as the dataset, including MOF molecules generated by Materials Studio. The hMOF database is created through a computational measure using a given chemical library of 102 building blocks, which calculates the pore-size distribution and surface area. The dataset, a crucial component of machine learning, undergoes data cleaning and preprocessing before being input to the ML model to ensure training and generalization capabilities. We primarily use the hMOF dataset as the original data, comprising 137 652 MOF structure files along with their CO2 adsorption capacities at different pressures. Building upon the hMOF dataset, we further collect the MOF-id for each MOF molecule. Among the 137 652 MOF molecules, 102 859 possess explicit MOF-id sequences. Then, MOFs containing carbon atoms with more than four bonds, oxygen atoms with more than two bonds, nitrogen atoms with more than three bonds, and halogen atoms with more than one bond are discarded by RDkit. After excluding anomalous molecules, we ultimately obtained 100 240 MOF molecules along with their MOF-id, illustrated in Fig. 1.

FIG. 1.

Workflow of creating the feature lists in this study and using them to train and evaluate the ML model. First, the hMOF dataset is subjected to data processing and filtering, after which the 3D structure, MOF-id, and organic linker of each MOF are extracted and saved. Subsequently, we analyze the computed structural descriptors, molecular descriptors, and calculated descriptors for each MOF and use them as features for the input of the ML model. Simultaneously, the GCMC simulation results of CO2 are employed as labels for training the ML model. Finally, model evaluation is conducted based on the prediction results, and the interpretability of the model is revealed by considering feature correlations. Utilizing the ML model to assist in the screening of MOFs.

FIG. 1.

Workflow of creating the feature lists in this study and using them to train and evaluate the ML model. First, the hMOF dataset is subjected to data processing and filtering, after which the 3D structure, MOF-id, and organic linker of each MOF are extracted and saved. Subsequently, we analyze the computed structural descriptors, molecular descriptors, and calculated descriptors for each MOF and use them as features for the input of the ML model. Simultaneously, the GCMC simulation results of CO2 are employed as labels for training the ML model. Finally, model evaluation is conducted based on the prediction results, and the interpretability of the model is revealed by considering feature correlations. Utilizing the ML model to assist in the screening of MOFs.

Close modal
To obtain the label of the ML model, which represents the CO2 adsorption properties, GCMC simulations are used to calculate the adsorption of pure CO2 at 298 K at different pressures48 (0.01, 0.05, 0.1, 0.5, and 2.5 bars). The scale of the dataset is sufficient for training accurate ML models to predict CO2 adsorption properties. In GCMC simulations, interaction energies between non-bonded atoms are computed through the Lennard-Jones49 (LJ) and Coulomb potentials,
(1)
where i and j are the atoms’ number with interacting, rij is the distance between the atoms, òij is the well depth of Lennard-Jones potential, σij is the diameter of Lennard-Jones potential, qi is the partial charge of the atom i, and ò0 is the dielectric constant in Coulomb potentials.

In the research on materials science, the selection of relevant descriptors as feature inputs for the ML model can be affected by the diversity of materials as well as their large size, especially MOFs. To maximize the use of data from the hMOF dataset, we collect the MOF-id for each MOF molecule in the hMOF dataset, which contains the MOF molecule’s SMILES, topologies, structure, and catenation.

Simultaneously, the hMOF dataset provides structural information for each MOF molecule, including void fraction, gravimetric surface area, volumetric surface area, lowest cavity diameter (LCD), and pore limiting diameter (PLD). We directly use these five pieces of information as structural features for ML model input. By using a Python program, we systematically parse and extract information from each MOF-id, including the types and quantities of atoms, the types and quantities of common functional groups, topological structure, and catenation. For metal atoms in MOFs, we conduct additional processing, selecting atomic number, atomic weight, atomic radius, polarizability, electron affinity, and Mulliken electronegativity as features for each metal atom. These features are utilized as molecular features for ML model input.

To further characterize the properties of each MOF, we need to utilize additional features to express the physical and chemical properties of MOF molecules, which rely on organic linkers calculated by Mordred.50 After obtaining ∼1600 features calculated by Mordred, we perform data cleaning and sparse value removal. Finally, we obtain 765 features as calculated features for ML model input after the preprocessing.

Considering the ML model’s prediction accuracy, explainability, and complexity, the Xgboost51 (eXtreme gradient boosting) based on decision tree models is confirmed to predict the CO2 adsorption properties of MOFs in a supervised manner with various features. The models are trained on hMOF with the labels of CO2 adsorption in mol/kg at 0.01, 0.05, 0.1, 0.5, and 2.5 bars of pressure. These pressures can cover the range for CO2 adsorption using pressure-swing adsorption (PSA) and vacuum-swing adsorption (VSA) in applications52 such as natural gas purification, landfill gas separation, and flue gas separation.

To prevent the ML model from overfitting, the dataset is split into training and test sets with a regular ratio of 0.8/0.2. After the training, the model with the best test performance would be recorded, and the model parameters would be used to calculate SHAP importance. According to the splitting rule, the ML model had 86 122 hMOF data for training sets and 16 737 hMOF data for test sets.

In general, there are differences in the adsorption capacity of MOF for CO2 at different pressures and the adsorption mechanisms, so predictions using different feature lists are necessary to verify the validity of this work. To evaluate the ML model, we use the coefficient of determination (R2), mean absolute error (MAE), and mean square error (MSE), which could be calculated by the equations,
(2)
(3)
(4)
where n is the amount of training or test set, Y is the value of prediction from the ML model, Ŷ is the label of each data, and Ȳ is the average of predicted values.

In order to explain the operational mechanism of machine learning models and to expose the role played by individual features in the prediction process, we use the SHAP importance to calculate each feature’s importance, which calculates the marginal contribution of the features to the model output.

Then, the XGBoost model is trained on the hMOF dataset with structure features, molecular features, and calculated features. To optimize and validate the models robustly, we tune the hyperparameters of the ML model by using cross-validation. Different hyperparameters are selected for each of the models using different feature lists, which contain max depth, learning rate, and number of trees.

While the simplest feature list is established by using only structure features, molecular features and calculated features are added to build more comprehensive feature lists such as “structure features and molecular features” and “structure features, molecular features, and calculated features.” The ML models are trained by using each of these three feature lists at five different pressures for comparison.

As shown in Table I, R2, MAE, and MSE of the test sets in various pressures, which contain three groups with different feature lists, a clear trend is demonstrated. For example, when structure features are used to predict the CO2 adsorption of MOFs, the R2 of the test set in 2.5 bars is computed as 0.737. When molecular features are added to the feature list, the R2 of the test set in 2.5 bars increases to 0.926. After adding the calculated features to the ML model, the R2 of the test set in 2.5 bars ends up at 0.948. Similarly, the addition of features can also be effective in increasing the R2 at lower pressures. At 0.01 bar, R2 of the test set can be improved from 0.429 to 0.721 by adding molecular features to the structural features. When the calculated features are further added to the ML model, the R2 metric is further improved to 0.765.

TABLE I.

Performance of the ML model trained with different feature lists at different pressures.

Structure0.01 bar0.05 bar0.1 bar0.5 bar2.5 bars
MAE 0.066 736 901 0.168 290 27 0.245 032 675 0.530 866 008 1.001 906 015 
MSE 0.028 387 632 0.093 966 002 0.160 506 406 0.556 188 405 1.926 150 66 
R2 0.429 084 809 0.618 243 386 0.668 305 589 0.722 719 795 0.737 439 525 
Structure0.01 bar0.05 bar0.1 bar0.5 bar2.5 bars
MAE 0.066 736 901 0.168 290 27 0.245 032 675 0.530 866 008 1.001 906 015 
MSE 0.028 387 632 0.093 966 002 0.160 506 406 0.556 188 405 1.926 150 66 
R2 0.429 084 809 0.618 243 386 0.668 305 589 0.722 719 795 0.737 439 525 
Molecule0.01 bar0.05 bar0.1 bar0.5 bar2.5 bars
MAE 0.045 397 203 0.108 535 914 0.153 880 269 0.310 369 325 0.527 457 311 
MSE 0.013 893 511 0.042 723 885 0.069 751 42 0.206 038 189 0.541 793 487 
R2 0.720 581 965 0.826 425 245 0.855 855 248 0.897 282 448 0.926 146 195 
Molecule0.01 bar0.05 bar0.1 bar0.5 bar2.5 bars
MAE 0.045 397 203 0.108 535 914 0.153 880 269 0.310 369 325 0.527 457 311 
MSE 0.013 893 511 0.042 723 885 0.069 751 42 0.206 038 189 0.541 793 487 
R2 0.720 581 965 0.826 425 245 0.855 855 248 0.897 282 448 0.926 146 195 
Calculate 0.01 bar0.05 bar0.1 bar0.5 bar2.5 bars
MAE 0.043 003 041 0.099 597 64 0.139 757 256 0.266 536 593 0.433 903 647 
MSE 0.011 697 272 0.034 681 133 0.057 027 508 0.152 927 626 0.374 854 333 
R2 0.764 751 412 0.859 100 615 0.882 149 84 0.923 760 001 0.948 902 267 
Calculate 0.01 bar0.05 bar0.1 bar0.5 bar2.5 bars
MAE 0.043 003 041 0.099 597 64 0.139 757 256 0.266 536 593 0.433 903 647 
MSE 0.011 697 272 0.034 681 133 0.057 027 508 0.152 927 626 0.374 854 333 
R2 0.764 751 412 0.859 100 615 0.882 149 84 0.923 760 001 0.948 902 267 

In terms of the mean error, the MAE and MSE of the ML model with only structural features were 1.001 and 1.926 at 2.50 bars. The addition of molecular features to the ML model effectively reduced the MAE and MSE to 0.527 and 0.542, respectively. Finally, adding the calculated features to the ML model decreased the MAE and MSE to 0.434 and 0.375. For low pressure, the changes in MAE and MSE were similar to those in R2. At 0.01 bar, the addition of the molecular features to the structural features reduced the MAE and MSE from 0.067 and 0.028 to 0.045 and 0.014, respectively. Similarly, the final addition of computational features to the feature list reduced the MAE and MSE to 0.043 and 0.011.

Figure 2 displays the changes in MAE, MSE, and R2 of the ML model after adding different lists of features, demonstrating a steady optimization trend. Overall, adding molecular features to structural features can effectively increase the predictive accuracy of the model by 50%, and further modification with calculated features can increase the accuracy by another 20%.

FIG. 2.

The histogram represents the prediction accuracy of different feature sets at various pressures, from left to right, including MAE, MSE, and R2.

FIG. 2.

The histogram represents the prediction accuracy of different feature sets at various pressures, from left to right, including MAE, MSE, and R2.

Close modal

At different pressures, there is a significant variance in the distribution of CO2 adsorption data. Therefore, in addition to the analysis using the above metrics, this work further generates plots describing the distribution of model predictions vs actual values. As shown in Fig. 3, it is clear that the incorporation of structural, molecular, and calculated features into the ML model improves the prediction accuracy independent of the applied pressure.

FIG. 3.

The binned scatterplot shows the comparison between the three feature lists used by ML models at a pressure of 0.01 and 2.5 bars (a)–(f). The darker color of each hexagonal box means more data points in the box.

FIG. 3.

The binned scatterplot shows the comparison between the three feature lists used by ML models at a pressure of 0.01 and 2.5 bars (a)–(f). The darker color of each hexagonal box means more data points in the box.

Close modal

In the preceding chapter, our findings indicate that augmenting the ML process by incorporating molecular features and calculated features into structural features can effectively enhance predictive accuracy and generalization capability. This chapter ranks the feature importance of different feature lists at each pressure, using the SHAP importance method to better express the role of each feature in the prediction of the ML models. After the descriptors are determined, the relationship between these descriptors and the simulated CO2 adsorption data of MOFs is examined. The correlation coefficients are calculated between structural features and molecular features, which creates a heatmap (Fig. 4) to visually depict the obvious correlations among various features, providing a basis for ML model analysis. For instance, within the structural features, various features exhibit high positive correlations among themselves. However, catenation shows a negative correlation among these features.

FIG. 4.

Heatmap of correlation for molecular and structural feature sets. The color transitions from light to dark represent the shift from positive to negative correlations.

FIG. 4.

Heatmap of correlation for molecular and structural feature sets. The color transitions from light to dark represent the shift from positive to negative correlations.

Close modal

As shown in Fig. 5, we rank and quantify the importance of the structural features at different pressures. As the pressure increased, the importance of the void fraction (VF) gradually increased from second to first place. Moreover, samples with high VF values exhibit a higher positive correlation with the output of the model. With an increase in the pressure and amount of CO2 adsorbed, the accommodation of CO2 becomes progressively more important. The VF features indicated that more CO2 molecules could be accommodated during the adsorption process. Meanwhile, the features representing the surface area ranked highly because the surface area affects the number of MOF adsorption sites in the CO2 adsorption process.

FIG. 5.

Feature importance of structural features at different pressures (a)–(e). The feature’s importance increases gradually from bottom to top.

FIG. 5.

Feature importance of structural features at different pressures (a)–(e). The feature’s importance increases gradually from bottom to top.

Close modal

After incorporating the molecular features, Fig. 6 shows the top 20 most important structural and molecular features. Structural features occupy the top positions, indicating their predominant role during the CO2 adsorption of MOFs, with the elements, molecular structures, and functional groups playing crucial supporting roles. From the feature importance plots under various pressures, halogen elements (such as Br, F, and I) consistently maintain high importance, ranking just below the structural features. Halogens are highly electronegative elements, and their introduction may increase the overall electronegativity of MOFs, which would result in stronger intermolecular interactions, enhancing the affinity for CO2 and increasing the CO2 adsorption capacity.

FIG. 6.

Feature importance of the top 20 molecular and structural features at different pressures (a)–(e).

FIG. 6.

Feature importance of the top 20 molecular and structural features at different pressures (a)–(e).

Close modal

The carbon atom occupies a dominant position in the list of molecular features. The number of carbon atoms indirectly indicates the size of the organic linkers, further influencing the structural features of MOFs. This is evident in the feature correlation analysis plot (Fig. 4), where the number of carbon atoms with VF and LCD shows correlation coefficients of 0.31 and 0.37, respectively, indicating a positive correlation between the number of carbon atoms and structural features, thereby affecting the CO2 adsorption in MOFs. However, the distribution of carbon atoms at specific sites, such as aromatics, can impact CO2 adsorption. Similarly, the importance of the oxygen atoms is mainly reflected in their high correlation with the number of carbon atoms, with a correlation coefficient of 0.59. In addition, oxygen atoms are the major components of functional groups, such as hydroxyl groups, which influence CO2 adsorption. Nitrogen atoms are crucial functional groups that constitute amines, amino acids, pyridines, and other alkaline functional groups, demonstrating their degree of importance. Among other molecular features, the amount of catenation exhibits a clear negative correlation with structural features, with correlation coefficients ranging from −0.34 to −0.51, thereby affecting the predictions.

With the addition of calculated features, the ML model achieves optimal performance on the test set. Figure 7 displays the changes in feature importance after adding calculated features at different pressures. Following the introduction of the new feature importance rankings, we summarize the frequency of occurrence and trend of importance changes with increasing pressure for calculated features, as shown in Fig. 8. To clearly illustrate the relationships between features, we also create a heatmap (Fig. 9) for the correlation analysis of the 24 significant calculated features and structural features.

FIG. 7.

Importance of the top 20 calculated molecular and structural features at different pressures (a)–(e).

FIG. 7.

Importance of the top 20 calculated molecular and structural features at different pressures (a)–(e).

Close modal
FIG. 8.

Representation of histograms depicting changes in the importance of selected features. From light to dark, the pressures are 0.01, 0.05, 0.1, 0.5, and 2.5 bars.

FIG. 8.

Representation of histograms depicting changes in the importance of selected features. From light to dark, the pressures are 0.01, 0.05, 0.1, 0.5, and 2.5 bars.

Close modal
FIG. 9.

Heatmap of autocorrelation for calculated and structural feature sets. The color transitions from light to dark represent the shift from positive to negative correlations.

FIG. 9.

Heatmap of autocorrelation for calculated and structural feature sets. The color transitions from light to dark represent the shift from positive to negative correlations.

Close modal

Among the 24 calculated features, the importance of some features generally does not change with increasing pressure. For instance, the importance of the AATSC0p feature, which is calculated by employing the autocorrelation coefficients and atomic polarizability as weights, maintained a relatively stable ranking regardless of the pressure. In the GCMC simulation process, the van der Waals forces are a major component that significantly influences the overall adsorption performance when the strength of the van der Waals bonds increases with the polarizability of the participating atoms.

It is noteworthy that features appearing only once are mainly distributed at pressures of 0.01 and 2.50 bars. ATSC6m, ATSC6Z, and AATSC0m, which are calculated by employing autocorrelation coefficients using atomic mass and atomic number as weights to express the distribution of atomic masses in the MOFs, only appeared at 0.01 bar. Meanwhile, ATSC0i, GATS1dv, and AATS2pe, which represent the charge distribution in the MOFs using the atomic ionization potential, valence electrons, and Pauling electronegativity as weights, appeared only at a pressure of 2.5 bars. With increasing pressure, the quantity of adsorbed CO2 in the MOFs continues to increase. The increasing number of CO2 molecules in the pores leads to a greater average distance between CO2 molecules and adsorption sites. During CO2 adsorption, the interaction energies consisted of Coulomb and van der Waals forces. With increasing pressure, the proportion of Coulomb forces in the overall potential field gradually increases, whereas the proportion of van der Waals forces gradually decreases. Both the atomic mass and number, which are highly correlated features, significantly affected the van der Waals forces, making them particularly important at lower pressures. Conversely, features related to the electric field potential exhibit high importance at 2.5 bars, indicating the significance of Coulomb forces at higher pressures.

In addition, the importance of some features increases with the rise in pressure. The SsBr and SsF features represent the sums of Br and F atoms in MOF molecules, respectively. Because halogen elements generally have high electronegativity, SsBr and SsF features maintained high correlations of 0.54 and 0.41 with AATS2pe (the electronegativity feature) and exhibited a correlated trend with the increasing importance of Coulomb forces in the feature correlation analysis plot (Fig. 9). In particular, SsBr shows a high correlation of 0.84 with AATSC0p (polarizability feature), further emphasizing its importance. Moreover, the importance of the GATS1dv feature, which is a self-correlation feature calculated based on valence electrons, gradually increased, reflecting the distribution of molecular surface charges and its significant impact on intermolecular Coulomb forces.

Conversely, the importance of some features decreases with the increase in pressure. The importance of the metal atomic number as a crucial component relative to molecular mass decreases with the influence of van der Waals forces. Among other features, the nG12FaRing, NaaaC. and nFaRing features are highly correlated (0.64–0.77), all reflecting properties related to aromatic rings in the molecule. NaaaC represents the number of carbon atoms connecting multiple aromatic rings, providing further insights into the quantity of polycyclic aromatic hydrocarbons in the MOF molecule. We primarily focus on the role of aromatic rings in π–π stacking. Since the structure of polycyclic aromatic hydrocarbons can enhance the π–π stacking effect, the appearance of NaaaC in the importance ranking confirms this observation. Simultaneously, as the secondary bond, the importance of π–π stacking gradually decreases with the increase in molecular distance.

To conclude, we propose a ML model for predicting the properties of CO2 adsorption in MOFs without calculation and simulation. We introduce a large number of calculated features and jointly train them with structural and molecular features on the hMOF dataset to enhance the accuracy of CO2 adsorption predictions at various pressures. Compared to using only structural and molecular descriptors as the feature lists, the addition of calculated features improves the accuracy of the ML model by 15%–20% among different pressures. Furthermore, the addition of calculated features provides a more explicit interpretation of the operational mechanism of the ML model. Visualizing the results of feature importance indicates that in the process of CO2 adsorption, the structure and pore size of MOF molecules play a predominant role, while calculated features such as molecular polarizability, electronegativity, and atomic weight also hold major importance. The ranking of calculated features with changing pressures aligns with our discussion on the variation between van der Waals forces and Coulomb forces during CO2 adsorption, demonstrating that this ML model can be interpreted through pore chemistry, molecular dynamics, and chemical reactions. Utilizing automatically extracted descriptors as input, the ML model can rapidly and accurately predict the CO2 adsorption of MOF materials on a large scale. Simultaneously, the interpretability of the ML model in the adsorption process elucidates the relative importance of factors such as molecular structure, atomic weight, and polarizability in the adsorption performance of materials with different pressures, laying the foundation for future MOF material searches.

The supplementary material contains additional information, including 15 plots of the distribution of results predicted by different feature sets at air pressures ranging from 0.01 to 2.5 bars. The names and specific descriptions of the features were included in the calculated feature set.

This study was funded by the National Key R&D Program of China (Grant No. 2022YFB4703403) and also by the Academic Excellence Foundation of BUAA for Ph.D. students. The funder played no role in the study design, data collection, analysis and interpretation of data, or the writing of this manuscript.

The authors have no conflicts to disclose.

G.S. analyzed and interpreted the data and revised the manuscript. Y.T. performed the study, analyzed and interpreted the data and was a major contributor in writing the manuscript. All authors read and approved the final manuscript.

Yukun Teng: Data curation (equal); Software (equal); Writing – original draft (equal). Guangcun Shan: Funding acquisition (equal); Project administration (lead); Resources (equal); Software (equal); Writing – review & editing (equal).

The data that support the findings of this study are available from the corresponding author upon reasonable request.

1.
P.
Friedlingstein
et al, “
Global carbon budget 2023
,”
Earth Syst. Sci. Data
15
,
5301
(
2023
).
2.
P. M.
Cox
,
R. A.
Betts
,
C. D.
Jones
,
S. A.
Spall
, and
I. J.
Totterdell
, “
Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model
,”
Nature
408
,
184
(
2000
).
3.
P.
Friedlingstein
et al, “
Global carbon budget 2022
,”
Earth Syst. Sci. Data
14
,
4811
(
2022
).
4.
P.
Friedlingstein
et al, “
Global Carbon Budget 2021
,”
Earth Syst. Sci. Data
14
,
1917
(
2022
).
5.
J.
Fargione
,
J.
Hill
,
D.
Tilman
,
S.
Polasky
, and
P.
Hawthorne
, “
Land clearing and the biofuel carbon debt
,”
Science
319
,
1235
(
2008
).
6.
S. L.
James
, “
Metal-organic frameworks
,”
Chem. Soc. Rev.
32
,
276
(
2003
).
7.
H.
Deng
et al, “
Large-pore apertures in a series of metal-organic frameworks
,”
Science
336
,
1018
(
2012
).
8.
O. K.
Farha
,
I.
Eryazici
,
N. C.
Jeong
,
B. G.
Hauser
,
C. E.
Wilmer
,
A. A.
Sarjeant
,
R. Q.
Snurr
,
S. T.
Nguyen
,
A. Ö.
Yazaydın
, and
J. T.
Hupp
, “
Metal–organic framework materials with ultrahigh surface areas: Is the sky the limit?
,”
J. Am. Chem. Soc.
134
,
15016
(
2012
).
9.
H.
Furukawa
,
K. E.
Cordova
,
M.
O’Keeffe
, and
O. M.
Yaghi
, “
The chemistry and applications of metal-organic frameworks
,”
Science
341
,
1230444
(
2013
).
10.
S.
Xiang
,
Y.
He
,
Z.
Zhang
,
H.
Wu
,
W.
Zhou
,
R.
Krishna
, and
B.
Chen
, “
Microporous metal-organic framework with potential for carbon dioxide capture at ambient conditions
,”
Nat. Commun.
3
,
954
(
2012
).
11.
S.
Keskin
,
T. M.
van Heest
, and
D. S.
Sholl
, “
Can metal–organic framework materials play a useful role in large-scale carbon dioxide separations?
,”
ChemSusChem
3
,
879
(
2010
).
12.
H.
Demir
,
G. O.
Aksu
,
H. C.
Gulbalkan
, and
S.
Keskin
, “
MOF Membranes for CO2 capture: Past, present and future
,”
Carbon Capture Sci. Technol.
2
,
100026
(
2022
).
13.
K.
Sumida
,
D. L.
Rogow
,
J. A.
Mason
,
T. M.
McDonald
,
E. D.
Bloch
,
Z. R.
Herm
,
T.-H.
Bae
, and
J. R.
Long
, “
Carbon dioxide capture in metal–organic frameworks
,”
Chem. Rev.
112
,
724
(
2012
).
14.
R.
Banerjee
,
H.
Furukawa
,
D.
Britt
,
C.
Knobler
,
M.
O’Keeffe
, and
O. M.
Yaghi
, “
Control of pore size and functionality in isoreticular zeolitic imidazolate frameworks and their carbon dioxide selective capture properties
,”
J. Am. Chem. Soc.
131
,
3875
(
2009
).
15.
X. S.
Wang
et al, “
Metal–organic frameworks based on double-bond-coupled di-isophthalate linkers with high hydrogen and methane uptakes
,”
Chem. Mater.
20
,
3145
(
2008
).
16.
L.
Chen
,
C. A.
Morrison
, and
T.
Düren
, “
Improving predictions of gas adsorption in metal–organic frameworks with coordinatively unsaturated metal sites: Model potentials, ab initio parameterization, and GCMC simulations
,”
J. Phys. Chem. C
116
,
18899
(
2012
).
17.
H.
Daglar
and
S.
Keskin
, “
Recent advances, opportunities, and challenges in high-throughput computational screening of MOFs for gas separations
,”
Coord. Chem. Rev.
422
,
213470
(
2020
).
18.
Y. R.
Tao
,
G. H.
Zhang
, and
H. J.
Xu
, “
Grand canonical Monte Carlo (GCMC) study on adsorption performance of metal organic frameworks (MOFs) for carbon capture
,”
Sustainable Mater. Technol.
32
,
e00383
(
2022
).
19.
D. A.
Gómez-Gualdrón
,
Y. J.
Colón
,
X.
Zhang
,
T. C.
Wang
,
Y.-S.
Chen
,
J. T.
Hupp
,
T.
Yildirim
,
O. K.
Farha
,
J.
Zhang
, and
R. Q.
Snurr
, “
Evaluating topologically diverse metal–organic frameworks for cryo-adsorbed hydrogen storage
,”
Energy Environ. Sci.
9
,
3279
(
2016
).
20.
B.
Sun
,
S.
Kayal
, and
A.
Chakraborty
, “
Study of HKUST (Copper benzene-1,3,5-tricarboxylate, Cu-btc MOF)-1 metal organic Frameworks for CH4 adsorption: An experimental Investigation with GCMC (grand canonical Monte-Carlo) simulation
,”
Energy
76
,
419
(
2014
).
21.
H.
Tan
and
G.
Shan
, “
Computational screening and functional tuning of chemically stable metal organic frameworks for I2/CH3I capture in humid environments
,”
iScience
27
,
109096
(
2024
).
22.
C. L.
Bird
and
J. G.
Frey
, “
Chemical information matters: An e-Research perspective on information and data sharing in the chemical sciences
,”
Chem. Soc. Rev.
42
,
6754
(
2013
).
23.
D. A. C.
Beck
,
J. M.
Carothers
,
V. R.
Subramanian
, and
J.
Pfaendtner
, “
Data science: Accelerating innovation and discovery in chemical engineering
,”
AIChE J.
62
,
1402
(
2016
).
24.
C.
Chen
,
Y.
Zuo
,
W.
Ye
,
X.
Li
,
Z.
Deng
, and
S. P.
Ong
, “
A critical review of machine learning of energy materials
,”
Adv. Energy Mater.
10
,
1903242
(
2020
).
25.
H.
Daglar
and
S.
Keskin
, “
Combining machine learning and molecular simulations to unlock gas separation potentials of MOF membranes and MOF/polymer MMMs
,”
ACS Appl. Mater. Interfaces
14
,
32134
(
2022
).
26.
C.
Altintas
,
O. F.
Altundal
,
S.
Keskin
, and
R.
Yildirim
, “
Machine learning meets with metal organic frameworks for gas storage and separation
,”
J. Chem. Inf. Model.
61
,
2131
(
2021
).
27.
N. S.
Bobbitt
and
R. Q.
Snurr
, “
Molecular modelling and machine learning for high-throughput screening of metal-organic frameworks for hydrogen storage
,”
Mol. Simul.
45
,
1069
(
2019
).
28.
G. O.
Aksu
and
S.
Keskin
, “
Advancing CH4/H2 separation with covalent organic frameworks by combining molecular simulations and machine learning
,”
J. Mater. Chem. A
11
,
14788
(
2023
).
29.
P.
Yang
,
H.
Zhang
,
X.
Lai
,
K.
Wang
,
Q.
Yang
, and
D.
Yu
, “
Accelerating the selection of covalent organic frameworks with automated machine learning
,”
ACS Omega
6
,
17149
(
2021
).
30.
J. S.
De Vos
,
S.
Ravichandran
,
S.
Borgmans
,
L.
Vanduyfhuys
,
P.
Van Der Voort
,
S. M. J.
Rogge
, and
V.
Van Speybroeck
, “
High-throughput screening of covalent organic frameworks for carbon capture using machine learning
,”
Chem. Mater.
36
,
4315
(
2024
).
31.
Y. G.
Chung
et al, “
Advances, updates, and analytics for the computation-ready, experimental metal–organic framework database: CoRE MOF 2019
,”
J. Chem. Eng. Data
64
,
5985
(
2019
).
32.
A. S.
Rosen
,
S. M.
Iyer
,
D.
Ray
,
Z.
Yao
,
A.
Aspuru-Guzik
,
L.
Gagliardi
,
J. M.
Notestein
, and
R. Q.
Snurr
, “
Machine learning the quantum-chemical properties of metal–organic frameworks for accelerated materials discovery
,”
Matter
4
,
1578
(
2021
).
33.
C. E.
Wilmer
,
M.
Leaf
,
C. Y.
Lee
,
O. K.
Farha
,
B. G.
Hauser
,
J. T.
Hupp
, and
R. Q.
Snurr
, “
Large-scale screening of hypothetical metal–organic frameworks
,”
Nat. Chem.
4
,
83
(
2012
).
34.
P. Z.
Moghadam
et al, “
Structure-mechanical stability relations of metal-organic frameworks via machine learning
,”
Matter
1
,
219
(
2019
).
35.
Y.
Yan
,
Z.
Shi
,
H.
Li
,
L.
Li
,
X.
Yang
,
S.
Li
,
H.
Liang
, and
Z.
Qiao
, “
Machine learning and in-silico screening of metal–organic frameworks for O2/N2 dynamic adsorption and separation
,”
Chem. Eng. J.
427
,
131604
(
2022
).
36.
Z.
Shi
,
X.
Yuan
,
Y.
Yan
,
Y.
Tang
,
J.
Li
,
H.
Liang
,
L.
Tong
, and
Z.
Qiao
, “
Techno-economic analysis of metal–organic frameworks for adsorption heat pumps/chillers: From directional computational screening, machine learning to experiment
,”
J. Mater. Chem. A
9
,
7656
(
2021
).
37.
Z.
Qiao
,
Y.
Yan
,
Y.
Tang
,
H.
Liang
, and
J.
Jiang
, “
Metal–organic frameworks for xylene separation: From computational screening to machine learning
,”
J. Phys. Chem. C
125
,
7839
(
2021
).
38.
J.
Burner
,
L.
Schwiedrzik
,
M.
Krykunov
,
J.
Luo
,
P. G.
Boyd
, and
T. K.
Woo
, “
High-performing deep learning regression models for predicting low-pressure CO2 adsorption properties of metal–organic frameworks
,”
J. Phys. Chem. C
124
,
27996
(
2020
).
39.
T. D.
Burns
,
K. N.
Pai
,
S. G.
Subraveti
,
S. P.
Collins
,
M.
Krykunov
,
A.
Rajendran
, and
T. K.
Woo
, “
Prediction of MOF performance in vacuum swing adsorption systems for postcombustion CO2 capture based on integrated molecular simulations, process optimizations, and machine learning models
,”
Environ. Sci. Technol.
54
,
4536
(
2020
).
40.
H.
Wang
et al, “
Scientific discovery in the age of artificial intelligence
,”
Nature
620
,
47
(
2023
).
41.
D.
Castelvecchi
, “
Can we open the black box of AI?
,”
Nature
538
,
20
(
2016
).
42.
X.
Zhang
et al, “
Artificial intelligence for science in quantum, atomistic, and continuum systems
,” arXiv:2307.08423 (
2023
).
43.
A.
Altmann
,
L.
Toloşi
,
O.
Sander
, and
T.
Lengauer
, “
Permutation importance: A corrected feature importance measure
,”
Bioinformatics
26
,
1340
(
2010
).
44.
S. M.
Lundberg
and
S.-I.
Lee
, “
A unified approach to interpreting model predictions
,” in
Advances in Neural Information Processing Systems
(
Curran Associates, Inc.
,
2017
), Vol.
30
.
45.
B. J.
Bucior
,
A. S.
Rosen
,
M.
Haranczyk
,
Z.
Yao
,
M. E.
Ziebel
,
O. K.
Farha
,
J. T.
Hupp
,
J. I.
Siepmann
,
A.
Aspuru-Guzik
, and
R. Q.
Snurr
, “
Identification schemes for metal–organic frameworks to enable rapid search and cheminformatics analysis
,”
Cryst. Growth Des.
19
,
6682
(
2019
).
46.
D.
Weininger
, “
SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules
,”
J. Chem. Inf. Comput. Sci.
28
,
31
(
1988
).
47.
M.
O’Keeffe
,
M. A.
Peskov
,
S. J.
Ramsden
, and
O. M.
Yaghi
, “
The reticular chemistry structure resource (RCSR) database of, and symbols for, crystal nets
,”
Acc. Chem. Res.
41
,
1782
(
2008
).
48.
C. E.
Wilmer
,
O. K.
Farha
,
Y.-S.
Bae
,
J. T.
Hupp
, and
R. Q.
Snurr
, “
Structure–property relationships of porous materials for carbon dioxide separation and capture
,”
Energy Environ. Sci.
5
,
9849
(
2012
).
49.
J. E.
Jones
and
S.
Chapman
, “
On the determination of molecular fields. —II. From the equation of state of a gas
,”
Proc. R. Soc. London, Ser. A
106
,
463
(
1997
).
50.
H.
Moriwaki
,
Y.-S.
Tian
,
N.
Kawashita
, and
T.
Takagi
, “
Mordred: A molecular descriptor calculator
,”
J. Cheminf.
10
,
4
(
2018
).
51.
T.
Chen
and
C.
Guestrin
, “
XGBoost: A scalable tree boosting system
,” in
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(
Association for Computing Machinery
,
New York
,
2016
), pp.
785
794
.
52.
Y.-S.
Bae
and
R. Q.
Snurr
, “
Development and evaluation of porous materials for carbon dioxide separation and capture
,”
Angew. Chem., Int. Ed.
50
,
11586
(
2011
).