Feature selection (FS) methods are often used to develop data-driven descriptors (i.e., features) for rapidly predicting the functional properties of a physical or chemical system from its composition and structure. FS algorithms identify descriptors from a candidate pool (i.e., feature space) built by feature engineering (FE) steps that construct complex features from the system’s fundamental physical properties. Recursive FE, which applies FE operations to the feature space repeatedly, is necessary to build features complex enough to capture the physical behavior of a system. However, this approach creates a highly correlated feature space that contains millions or billions of candidate features. Such feature spaces are computationally demanding to process with traditional FS approaches, which often struggle with strong collinearity. Herein, we address this shortcoming by developing a new method that interleaves the FE and FS steps to progressively build and select powerful descriptors with reduced computational demand. We call this method iterative Bayesian additive regression trees (iBART), as it iterates between FE with unary/binary operators and FS with Bayesian additive regression trees (BART). The capabilities of iBART are illustrated by extracting descriptors for predicting metal–support interactions in catalysis, which we compare to those identified in our previous work using other state-of-the-art FS methods (i.e., least absolute shrinkage and selection operator + ℓ0, sure independence screening and sparsifying operator, and Bayesian FS). iBART matches the performance of these methods yet uses a fraction of the computational resources because it generates a maximum feature space of size O(10²), as opposed to the O(10⁶) generated by one-shot FE/FS methods.

