A machine learning (ML) methodology that uses a histogram of interaction energies has been applied to predict gas adsorption in metal–organic frameworks (MOFs) using results from atomistic grand canonical Monte Carlo (GCMC) simulations as training and test data. In this work, the method is first extended to binary mixtures of spherical species, in particular, Xe and Kr. In addition, it is shown that single-component adsorption of ethane and propane can be predicted in good agreement with GCMC simulation using a histogram of the adsorption energies felt by a methyl probe in conjunction with the random forest ML method. The results for propane can be improved by including a small number of MOF textural properties as descriptors. We also discuss the most significant features, which provides physical insight into the most beneficial adsorption energy sites for a given application.
INTRODUCTION
Machine learning (ML), which can help identify trends, provide insights, and make predictions from large amounts of input data, has emerged as a powerful tool in many research disciplines. In computational chemistry and materials research, where fully bottom-up atomistic simulations can be computationally rather expensive and where the number of materials to be explored can be quite large, an increasing number of researchers are employing ML methods1–4 to accelerate the discovery process.5–12 To access sufficient data for training and validation, ML models often use results from molecular simulation or quantum chemical calculations. In gas adsorption, which is the subject of this work, the adsorption properties are commonly obtained from grand canonical Monte Carlo (GCMC) simulations, which have been shown in prior work to yield adsorption uptakes in good agreement with experiment for many systems. Metal–organic frameworks (MOFs), a class of nanoporous materials assembled from inorganic nodes and organic linkers, are especially well suited to exploration by ML due to the combinatorial explosion of possible MOF structures.13 For example, using a library of 15 inorganic nodes, 14 organic “nodes,” and 50 ditopic organic connectors, Colón and Gomez-Gualdron computationally generated 13 512 MOFs using their Topologically Based Crystal Constructor (ToBaCCo).14 The number of MOFs synthesized in laboratories around the world is somewhere between 14 00015 and 88 000,16 depending on the criteria used to define a MOF, and databases of the synthesized MOFs now exist. Several groups have explored these and other databases of the proposed MOFs,17 looking at properties such as gas adsorption18–22 or bandgaps.23–25
An important aspect in the design of ML methods is the choice of features. A set of properly chosen features can make it much easier to generate accurate and predictive models. In applying ML to adsorption in MOFs, researchers have developed and tested a variety of different features in various ML algorithms.7,19–21,26–34 For example, Simon et al.7 recognized the role of strong binding sites for predicting low pressure adsorption of Xe and suggested a descriptor based on the interaction energy between Xe and such binding sites, which could be easily identified using computational geometry approaches involving Voronoi tessellation. Fanourgakis et al.19 used atom type properties as features in a random forest (RF) algorithm to predict methane and CO2 uptakes in the hypothetical MOFs of Wilmer et al.17 Atom types are defined for each atom, depending on its element and connectivity. They discovered that using atom types as a descriptor can provide better accuracy than traditional features such as the identity of the building blocks. They also reported that the methodology is transferable to other nanoporous materials by studying covalent organic frameworks (COFs). Anderson et al.33 used textural properties such as void fraction (VF), largest cavity diameter (LCD), pore limiting diameter (PLD), and density to form MOF fingerprints to train on and then predict the adsorption uptake of simple alchemical species, which are composed of 1–2 pseudo-atoms. They generalized the properties of real adsorbate molecules into features such as effective Lennard-Jones (LJ) parameters and end-to-end distance. Their model showed robustness for predicting isotherms of small molecules such as argon, krypton, xenon, nitrogen, methane, and ethane. They also tested the performance and transferability of the method in MOFs with pores that are inaccessible due to high energy barriers. Their results indicated that the model trained on a dataset without pore blocking can still yield results that match well with those from GCMC simulations, in which the inaccessible pores are blocked. Finally, they pushed the limit of their model to larger molecules, including propane, n-butane, isobutane, and benzene, using the same deep learning model and discovered empirical correlations between outliers of predictions and their percentage of saturation. They concluded that the model predicts low-fugacity points better than high-fugacity ones.
Recently, Bucior et al.34 used a histogram of the host/guest energy as the features for machine learning with the linear regression method known as least absolute shrinkage and selection operator (LASSO).1 They used this method to predict the top candidates for hydrogen and methane storage in MOFs. Since both hydrogen and methane were modeled as spherical pseudo-atoms, the host/guest energy histograms were generated using the pseudo-atoms for H2 or CH4. Here, we test the limits of this methodology for more complex problems. First, we extend it to mixtures and test if it can predict the selectivity for xenon over krypton in Xe/Kr mixtures using just a single energy histogram (that of Kr in this case). Second, we extend the method to short linear alkanes. For n-alkanes, the host/guest energy histogram is generated using the energies felt by a methyl group as the probe, and we investigate whether this is enough information for the method to be trained to predict the adsorption of chain molecules up to propane. To go beyond simple linear regression, we also tested the performance of RF, which is a non-linear ML method, and compared the results against LASSO.
METHODS
Molecular simulations
GCMC35 simulations were performed for Xe, Kr, Xe/Kr mixtures, ethane, and propane in randomly selected MOFs from the 13 512 structures in the ToBaCCo 1.0 database.14 For each set of simulations, we randomly picked 2000 MOFs. The GCMC simulations were performed using the RASPA-2.0 code.36 The pressures fed to RASPA36 were converted to fugacities using the Peng–Robinson equation of state.37 Non-bonded interactions were modeled using a Lennard-Jones (LJ) potential between every pair of non-bonded atoms or pseudo-atoms,
Here, ϵ and σ are the LJ well depth and size, and r is the distance between the LJ sites. The Lorentz–Berthelot mixing rules were used to obtain the Lennard-Jones parameters between unlike sites. The LJ parameters for the framework atoms were taken from the Universal Force Field (UFF).38 MOF atoms were held fixed during the simulations and the generation of the host/guest energy histograms. The LJ parameters for Xe and Kr were obtained from the work of Hirschfelder et al.39 and Talu and Myers,40 respectively. For the alkanes, we used the TraPPE united atom force field,41 but instead of using the TraPPE convention of rigid bond lengths, we used a harmonic bond potential,
where K/kB is the spring constant for the bond, b is the actual length, and b0 is the reference length. We took the force constant K/kB value of 96 500 provided by van Westen, Vlugt, and Gross.42 The reference bond length was 1.54 Å, which is the same as the original TraPPE parameter. We applied a 12.8 Å cutoff for the interaction between LJ particles for alkane simulations and a 12.0 Å cutoff for Xe/Kr simulations and truncated the interaction beyond the cutoff.
Simulations for binary mixtures of Xe and Kr were performed at a temperature of 273 K and pressures of 1 and 10 bars with 40 000 initialization cycles and 40 000 production cycles to allow for equilibration and gather enough statistics. A cycle consists of NA Monte Carlo steps, where NA is the number of adsorbate molecules in the system at the beginning of the cycle (or 20 if there are fewer than 20 molecules present). The bulk-phase composition was set to 20:80 for Xe:Kr.43 Single-component simulations for both xenon and krypton were also performed for the same temperature and pressures. For the alkane simulations, we equilibrated the system with 30 000 initialization cycles and calculated the properties of interest using 30 000 production cycles. For each step, the algorithm randomly tries a Monte Carlo move to perturb the system. All simulations used insertion, deletion, translation, rotation (if appropriate), identity change (for Xe and Kr mixture simulations), and reinsertion moves to sample the configuration space. Configurational bias Monte Carlo (CBMC)44 was used for insertion, deletion, and reinsertion moves of ethane and propane. Compared to traditional translation moves, which are limited to small local displacements, a reinsertion move attempts to displace the molecule to a random position in the simulation box. For the alkanes, reinsertion-in-place moves, which attempt to regrow the molecule from the starting bead, were also used to accelerate the equilibration. All move types were attempted with equal probability (insertion/deletion in RASPA is combined into one move called “swap,” with equal probability of either insertion or deletion).
To pick the most interesting pressures for testing our ML model for ethane and propane, we first simulated isotherms for 25 random MOFs from the ToBaCCo 1.0 database at 298 K.14 The results are shown in Fig. 1. The pressure ranges for the isotherms were chosen based on the experimental vapor pressures (P0) of the molecules at 298 K (Table S1). Based on these isotherms, we selected pressures at ∼10%, 50%, and 100% of P0 for our ML studies, namely, 4, 20, and 40 bars for ethane and 1, 5, and 10 bars for propane. Assessing the isotherms in Fig. 1, we noted significant variations in absolute loading at 20 bars for ethane and at 5 bars for propane. For these pressures, some of the structures have already reached their saturation loadings, while others have not. Thus, these are likely to be the most challenging conditions for the machine learning model to fit.
Machine learning
Following the work of Bucior et al.,34 we first computed the energies that a probe species feels on a three-dimensional (3D) grid of equally spaced points placed throughout the unit cell of a given structure. Next, we created a 1D histogram from the 3D energy grid data for each material. We started with a fine histogram with a small bin size of 0.1 kJ/mol. Then, instead of continuing with equally sized bins as done by Bucior et al.,34 the energy histograms were automatically aggregated without pre-determined bin sizes such that if a range of energy values has fewer counts, it is represented with fewer bins in the histogram. The histogram is normalized so that the sum of values in all bins for each MOF equals 1. A detailed description is provided in the supplementary material. For alkanes, the 3D grids were generated using a methyl probe with TraPPE Lennard-Jones parameters using a 1 spacing. For Xe/Kr systems, grids were generated using a Kr probe again with a 1 spacing. If the unit cell size was not divisible by the spacing, for example, using 1 Å spacing for a 37.6 Å unit cell, then the number of grid points was rounded up by 1.
Data analyses were performed with the R language45 in Rstudio integrated development environment (IDE).46 We first performed multiple linear regression (MLR),
where is the prediction, β is the coefficient vector trained from ML, and X is the feature space, assuming we have p features and N data points. In our case, yi is the amount adsorbed at a particular set of conditions and Xi is the energy histogram for material i. When performing linear regression, the sum of squared errors is minimized. We performed the MLR using LASSO, which adds a penalty or regularization term λΣj|βj| to the linear regression, so the cost function to be minimized is1
The penalty term λΣj|βj| helps reduce overfitting in the training. In R, LASSO is performed using the glmnet package.46 The hyper-parameter λ is determined with the help of the glmnet package by using ten-fold cross-validation and selecting the value of λ that minimizes the mean-squared error. A more detailed discussion on LASSO can be found in Bucior et al.34 We also performed non-linear regression using RF3 with 500 trees. Default settings were used in the random forest package3 in R.
When performing machine learning, the dataset was divided into training and test sets. If the dataset had more than 2000 adsorption data points, 1000 of them were randomly selected to be the training set and another 1000 for testing. The code is available on Github (https://github.com/snurr-group/energygrid). To summarize the results, we tabulated various metrics, including the coefficient of determination (R2), mean absolute error (MAE), and root-mean-square error (RMSE), for every dataset in the supplementary material (Excel file under the “Figure–Summary” datasheet).
RESULTS AND DISCUSSION
Adsorption of Xe/Kr mixtures
Bucior et al.34 developed their method for small, essentially spherical molecules and applied it to single-component adsorption. Here, we tested the method for mixtures of spherical species, in particular, Xe and Kr. Final purification of Xe/Kr mixtures from air separation can be accomplished via cryogenic distillation,43 but this process is energy intensive. Removal of 85Kr from spent nuclear fuel47 is another application where better Xe/Kr separation methods are needed, and separation via adsorption would be an attractive alternative.7,48 First, we performed pure-component GCMC simulations of xenon and krypton adsorption for 2000 randomly chosen ToBaCCo14 MOFs at 1 and 10 bars, both at 273 K. Before turning to mixtures, we tested whether we could train (separate) ML models for Xe and Kr but using only the histograms of the Kr energies as features. Note that we use the Kr histogram rather than the Xe histogram because there are locations in the MOFs where Kr can fit but Xe cannot due to its larger size. Figure 2 shows the parity plots for Kr (top row) and Xe (bottom row) utilizing LASSO and random forest models. Qualitatively, Fig. 2 shows promising fitting for both xenon and krypton using histograms of the Kr energies. It also shows that the non-linear random forest method is no better than the simpler, linear LASSO method, validating the choice of Bucior et al.34 to use LASSO for systems with small, spherical molecules.
A question rises as to why the fitting in Fig. 2 is better for Xe than for Kr (based on the R2 values) when using the Kr energy histograms. Figure S3 shows the distributions of Xe and Kr loadings at 273 K at 1 and 10 bars. It can be seen that the GCMC loadings at 10 bars in the 2000 MOFs generally have a flatter distribution than those at 1 bar, and distributions for Xe are flatter than those for Kr. This may explain why Xe loadings are easier to fit than Kr loadings: when the distribution of the dataset is uneven, there exist parts that are under-represented in the training data. The distribution for Kr at 1 bar is quite narrow, with most of the loadings below 10 cm3/cm3. Thus, at 1 bar for Kr, the high-loading points were under-represented in the training set, and Fig. 2 shows a large amount of scatter for Kr at 1 bar. We then conducted similar analysis for the heats of adsorption generated from GCMC simulations. The results are shown in Fig. S4, which shows that Xe generally has a higher heat of adsorption, Qst, than Kr, meaning that the structures prefer Xe over Kr from an enthalpic point of view. With more structures having higher loadings, the high-loading data points in the Xe dataset are more represented than those in the Kr dataset.
We analyzed Xe and Kr adsorption in some structures in more detail. An example for MOF No. 10 from the ToBaCCo 1.0 database14 is provided in Fig. S5 and Table S3. At 1 bar [Figs. S5(a) and S5(c)], most Xe and Kr atoms adsorb near the nodes. At 10 bars [Figs. S5(b) and S5(d)], Kr starts to adsorb in the void space away from the nodes, while Xe has reached its saturation loading and the pores are completely filled. Table S3 indicates that Xe has a higher Qst than Kr. We also plotted the GCMC loadings of Xe and Kr vs the largest cavity diameter of the structures (Fig. S6). Similar to previous results from Sikora et al.,48 we found that the loadings are highest for MOFs with a LCD slightly larger than 2σ, where σ is the LJ diameter of the guest molecule (note that, on average, the LJ diameters for the host atoms are smaller and the 2σ value should be sufficient for unobstructed adsorption of two guest molecules at opposing walls).
We then studied if energy histograms can be used as features to learn the selectivities for Xe/Kr mixtures. The selectivity was calculated based on the standard definition of ratios between compositions in the bulk phase and adsorbed phase,
where xi and yi are the mole fractions of species i in the adsorbed and bulk phases, respectively. We used the Kr-based energy histograms and applied them directly to learn the selectivities from GCMC simulations of Xe/Kr mixtures with yXe fixed at 0.2 and yKr fixed at 0.8. Here, we added the Spearman coefficient (ρ), which measures how similarly two lists of values are ranked (+1.0 means they are ranked identically, 0.0 means one list is randomly ordered with respect to the other, and −1.0 means one list is ordered in the reverse order of the other). As illustrated in Fig. S7, the majority of the 2000 structures yield selectivities less than 20 at both 1 and 10 bars, but there are some structures that exhibit very high selectivities—as high as 346—and a few others show inverse selectivity. Parity plots of the selectivities at 1 and 10 bars and 273 K are shown in Fig. S8 for both training and test data. Despite the high errors in terms of R2 and RMSE, the Spearman coefficients are all above 0.85, with only the predictions at 10 bars using LASSO below 0.9. This indicates that the ML predictions mimic the rankings of the GCMC selectivities very well. However, there are some systematic errors for the parity plots at 10 bars, where the results from machine learning significantly over-predict the selectivity in both training and testing plots.
Another strategy, instead of regressing the selectivities directly, is to first fit the loadings of the individual components from the mixture simulations and then use the predicted (mixture) loadings to calculate the selectivities. Compared to the pure-component fittings in Fig. 2, in the mixture case, the individual loadings are influenced by the bulk composition, and the individual loadings are correlated with each other due to competitive adsorption in the pores. We implemented this route, and the results of the fitting of the individual Xe and Kr loadings using RF are shown in Fig. 3. We observe only a slight decrease in the quality of the fits as measured by R2 in Fig. 3 compared to the corresponding single-component systems in Fig. 2, and the MAE and RMSE values are roughly similar, suggesting the potential of this route to fitting the selectivities.
From the predictions in Fig. 3, we can calculate the selectivities based on Eq. (5). The results are shown in Fig. 4. To directly compare the results in Figs. 4(a) and 4(b) with RF results in Figs. S8(f) and S8(h), we used the same training and test datasets. This guarantees the same data splitting. Figures 4(a) and 4(b) still contain points that are not predicted well by the ML approach. However, at 10 bars, the agreement between the ML selectivities and the results from GCMC is significantly improved in Fig. 4(b) compared to Fig. S8(h). In addition, compared to Fig. S8, the predictions in Fig. 4 eliminated the points that are significantly over-predicted at 10 bars. For completeness, the training data for the selectivities predicted by individual loadings for Xe and Kr are reported in Fig. S9.
The new strategy performs well for materials with selectivities below 10, as evidenced by visual inspection of the parity plots in Fig. 4, the high Spearman correlation coefficients, and the low MAE values. However, the approach of fitting the individual loadings first and then calculating the selectivity drastically under-predicts the selectivity for the materials with very high selectivity (points on the right boundary of plots in Fig. 4). These under-predictions, in turn, have a significant effect on the R2 and RMSE values. This is because R2 and RMSE are squared values, and the effects of outlying points are magnified. If we only consider MOFs with a selectivity of less than 20 in either the GCMC simulations or the ML predictions, then the R2 values increase to 0.88 and 0.83 at 1 and 10 bars, respectively, and similarly, the RMSEs are reduced to 0.82 and 0.61, respectively. It should also be kept in mind that in Eq. (5), the krypton loading is in the denominator. Thus, the selectivity is very sensitive to small changes in the Kr loading if the Kr loading approaches zero. Nevertheless, the high Spearman correlation coefficients indicate that this method can accurately rank the structures according to their selectivities, and ranking materials is a key requirement in materials screening. Once materials are ranked, those with high selectivity from ML can be further studied with GCMC simulations to get a more accurate estimate of the selectivity and working capacity. In this way, ML still saves considerable computing time, by identifying the top-performing materials, even if the selectivity from ML has a large uncertainty.
Similar to the analysis performed by Sikora et al.,48 we plotted the selectivities vs xenon capacity from GCMC simulations in Fig. S10 to see if there is a trade-off between selectivity and capacity. Interestingly, the selectivity generally increases with increasing Xe loading at a given pressure. However, as pressure increases from 1 to 10 bars, the capacity increases, but the selectivity decreases.
We also tried training a RF model to predict the selectivities at 273 K and 1 bar (directly) using the energy histograms plus four simple textural properties as descriptors: the volumetric surface area (VSA), gravimetric surface area (GSA), PLD, and LCD. The results are shown in Fig. S11. Comparing these results with those in Figs. S8(b) and S8(f), we see that adding these textural descriptors does not significantly change the quality of the predictions. We also used this model to test if our method is able to identify top-performing candidates from the literature by including three top-performing MOFs from the CoRE MOF database, identified by Simon et al.,7 into the testing data [Fig. S11(b)]. Table S5 shows that the three MOFs from Simon et al.7 are in the top five for our testing set, as predicted by the ML model, in agreement with GCMC simulation.
Simon et al.7 reported that the Voronoi energy, used as a descriptor in their ML model, is more effective for predicting the selectivity of Xe/Kr adsorption than textural properties. To test if a similar conclusion is valid for our energy histogram descriptor, we analyzed the feature importance for the model used to generate Fig. S11. In Fig. S12, we report the percentage increase in mean-squared error (%IncMSE)49 for the top ten features in the RF prediction. The %IncMSE value for a feature reflects the percent increase in MSE of the predictions when values of this feature are randomly shuffled. For example, a feature “F” has a 30 %IncMSE if the MSE increases by 30% when we shuffle the values in “F.” Our results show that the energy bins have a higher importance in the prediction of selectivities than the textural properties. As shown in Fig. S12, at 1 bar, the lowest energy bin, which ranges from −57 to −10.4 kJ/mol, has the highest %IncMSE among all features. It is followed by a weakly attractive energy bin that ranges from −0.1 to 0 kJ/mol. The only textural feature that is in the top ten is the gravimetric surface area. [Note that one of the energy features is closely related to the void fraction of the material, namely, the energy bin for regions where the Kr probe experiences positive energies (zero to infinity), which can be understood as one minus the void fraction.] Our examination of the feature importance agrees with the conclusion from Simon et al.7 that energetic descriptors are more important to the prediction of Xe/Kr selectivity in porous materials than simple textural descriptors.
Finally, we examined the effect of different bulk Xe/Kr compositions on the selectivity. In some nuclear waste applications,47 the bulk composition of Xe/Kr is ∼90/10. Thus, we simulated 100 structures at 1 and 10 bars at 273 K at this composition to compare with the results for a bulk composition of 20/80. Figure S13 compares the selectivities at these two compositions at 1 and 10 bars. Despite the large difference in bulk compositions, the calculated selectivities are remarkably similar for almost all of the selected structures. At each pressure, there was one MOF that showed a large difference in the selectivity upon changing the bulk composition, and the loadings and selectivities for these two MOFs are given in Table S6. It can be seen that as the composition changes from 20:80 to 90:10, the Xe loadings barely change in these Xe-selective MOFs, but the Kr loadings change dramatically. These drastic changes are magnified in the selectivity through Eq. (5), since the Kr loadings are in the denominator. Overall, from this analysis, we can infer that the majority of materials will have similar selectivities at different bulk compositions.
Prediction of single-component alkane adsorption
Next, we turned to predicting the uptake of non-spherical molecules, in particular, short alkanes. As the feature set, we used the histogram of energies felt by a methyl group. Given that molecules such as ethane and propane contain multiple methyl (or methylene) sites and are not spherical, it is much more challenging to predict the uptake of these molecules than the uptake of small spherical species using our descriptors. Figure 5 summarizes the training (first row) and test (second row) results for ethane using LASSO. The fits are promising, with R2 values above 0.9 and MAE and RMSE values ranging from 7 to 12 cm3/cm3 for loadings that range up to almost 300 cm3/cm3. Turning to a larger and non-linear molecule, Fig. 6 shows the results for propane. For propane, LASSO regression using the histogram of energies felt by a methyl probe still yields promising results at low to medium pressure, but the regression is not good at high pressure, with an R2 value of only 0.79 and some systematic deviations. In addition, for propane, the ML method predicts negative loadings for some cases, which are unphysical. One could force the negative predictions to go to zero, but the negative predictions suggest that the linear LASSO regression using only the energies felt by a single methyl group cannot accurately capture the adsorption behavior of this non-spherical molecule.
To go beyond a linear model, we tried random forest. Training and test results are shown in Fig. S15 for ethane and Fig. 7 and Fig. S16 for propane. For ethane, the R2 values are similar and the MAE and RMSE values are similar or slightly lower using RF (Fig. S15) than using LASSO (Fig. 5). However, for propane, the R2 values are notably higher and the MAE and RMSE values are uniformly lower using RF (Fig. 7) than using LASSO (Fig. 6). Thus, for propane, while LASSO yields fits with systematic deviations in the parity plots, RF yields better results, with good quality fits and no systematic deviations. While LASSO enforces monotonicity between specific energy bins and the loading due to linearity, RF allows for complicated relationships and has a higher predictive power. Nonetheless, in Fig. 7, one can observe that as the pressure increases, the fitting quality decreases. From the isotherms in Fig. 1, we suggested above that medium and higher pressures would be the most challenging to fit, since in this pressure region, different MOFs have drastically different loadings, with some MOFs reaching saturation while other MOFs have not. In this pressure region, different materials are in different regimes in their adsorption isotherms, making it more difficult to predict their adsorption uptake.
To obtain some physical insight into our model, we examined the importance of the features in the RF models for ethane and propane at different pressures using the %IncMSE. The results are displayed in Fig. 8. We highlighted the top ten energies for each pressure for ethane and propane with open circles. In addition, to show the top ten energy bins and try to group them into “energy ranges,” we included these data in Table S7 for easier display.
Let us start our analysis by the comparison between different pressures. For ethane and propane at 10% P0, which corresponds to the blue lines in Fig. 8, we see that the most important bins are the repulsive bin (above 0 kJ/mol) and the four most attractive bins. The repulsive bin provides important information on the volume that is not available for adsorption due to the presence of framework atoms. Based on Fig. 8, the strongest predictors for adsorption at low pressure are the most attractive bins. At 50% P0 (green lines), the most attractive bins are less important in the regression. This is because the most attractive regions in the structures are already filled up at this pressure. We also observe the rise in importance of the moderately attractive bins, mainly from −6 to −4 kJ/mol, which correspond to the next locations that become filled by adsorbate molecules. A similar trend is observed for the regression at 100% P0. For the repulsive bins, both ethane and propane display a trend of increasing importance as pressure increases. This agrees with our intuition that under saturated conditions, the loading or capacity of a structure is dictated by its total void volume. Some additional analysis and discussion are provided in the supplementary material.
Finally, we compared the use of histograms against other feature sets. We started with simple textural properties, in particular, the VSA, GSA, PLD, LCD, and VF. These properties were obtained from the literature for the MOFs in our dataset.9 The results using RF with only these five textural properties as features are shown in Figs. 9(a)–9(c). The results indicate that these five textural properties are enough for predicting propane adsorption at 100% P0 [Fig. 9(c)] with good accuracy, and the accuracy is improved compared to that from using the energy histograms [Fig. 7(c)]. This is largely because the saturation loading correlates well with some of the textural properties.50 However, at lower pressures [Figs. 9(a) and 9(b)], we see that the performance using the textural properties is worse than using energy histograms [Figs. 7(a) and 7(b)]. This is because adsorption at lower loadings is controlled more by the energetics of adsorption than by factors such as the surface area or void fraction,50 and the energy histograms include the host–guest interaction energies, which are the main component of the adsorption energy, especially at low loading.
Observing the merits and drawbacks of energy histogram and textural properties, we tried to combine them and form a minimal feature set that uses the smallest number of features for accurate predictions of propane adsorption at different pressures. Instead of the energy histograms, we tried representing each structure by the same five textural properties as above plus three characteristic adsorption energy values (energy statistics, ES), namely, the mean, the median, and the first quartile (Q1) energies from the energy histograms. The results are summarized in Figs. 9(d)–9(f). Comparing Figs. 9(d)–9(f) against Figs. 9(a)–9(c), we see that with the three additional features from the energy statistics, the points at 10% and 50% P0 are much better predicted [note the “thinner” distribution of points and the larger R2 values compared to Figs. 9(a) and 9(b)]. Despite the improvement, the accuracy for 10% P0 using textural properties and characteristic energy values is not quite as good as that using the energy histograms [Figs. 7(a) and 7(b)]. Finally, by comparing Figs. 9(c) and 9(f), we see that the additional energy statistics features further decrease the errors of the predictions at 100% P0.
CONCLUSIONS
We extended the application of host/guest energy histograms as features for machine learning to the adsorption of mixtures of simple spherical molecules and adsorption of short alkanes. For predicting the selectivity for xenon/krypton mixture adsorption in metal–organic frameworks, we found that predicting the individual loadings from ML first and then calculating the selectivities from those loadings is more accurate than predicting the selectivities directly from ML. For materials with high Xe selectivity, the value of the selectivity is very sensitive to small uncertainties in the Kr loading, and ML was not highly accurate in predicting the exact selectivity in such cases. However, the method was very good at ranking materials, suggesting that our ML method can be utilized to find promising candidates with high Xe/Kr selectivity, and then, more accurate estimates of the selectivities can be obtained from GCMC simulations on the top candidates. We also showed that adsorption of ethane and propane can be predicted using the energy histograms from a methyl probe. Linear regression worked well for ethane, whereas for propane, we needed to turn to the non-linear RF method. We also compared energy histograms against textural properties as features. At low pressures, energy histograms have a clear advantage. Finally, we formed a minimal feature set composed of five textural properties and three characteristic values from the energy histogram and demonstrated its accuracy at low, medium, and high pressures for propane.
SUPPLEMENTARY MATERIAL
See the supplementary material for additional details about the energy-histogram methodology, force field parameters, additional results (pdf), and GCMC loadings for xenon/krypton, ethane, and propane (Excel spreadsheet).
ACKNOWLEDGMENTS
This work was supported by the U.S. Department of Energy, Office of Basic Energy Sciences, Division of Chemical Sciences, Geosciences and Biosciences through the Nanoporous Materials Genome Center under Award No. DE-FG02-17ER16362. Z.L. acknowledges support from a Data Science Fellowship via the Northwestern Institute on Complex Systems (NICO). R.Q.S. has a financial interest in the startup company NuMat Technologies, which is seeking to commercialize metal–organic frameworks. This research was partly supported through the computational resources provided for the Quest high-performance computing facility at Northwestern University, which is jointly supported by the Office of the Provost, the Office for Research, and Northwestern University Information Technology. We also gratefully acknowledge the resources provided by the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
DATA AVAILABILITY
The data that support the findings of this study are openly available in Github at https://github.com/snurr-group/energygrid.51