Machine learning is quickly becoming an important tool in modern materials design. Whereas many of its successes are rooted in huge datasets, the most common applications in academic and industrial materials design deal with datasets of at best a few tens of data points. Harnessing the power of machine learning in this context is, therefore, of considerable importance. In this work, we investigate the intricacies introduced by these small datasets. We show that individual data points introduce a significant chance factor in both model training and quality measurement. This chance factor can be mitigated by the introduction of an ensemble-averaged model. This model presents the highest accuracy, while at the same time being robust with regard to changes in the dataset size. Furthermore, as only a single model instance needs to be stored and evaluated, it provides a highly efficient model for prediction purposes, ideally suited for the practical materials scientist.
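As a rough illustration of the approach described above, the following minimal sketch (assuming a linear model and the scikit-learn framework; the function name, split settings, and seeding are illustrative and not taken from the paper) averages the parameters of many model instances, each trained on a different random split of a small dataset, into a single ensemble-averaged model that is cheap to store and evaluate.

```python
# Minimal sketch (not the authors' code): build an ensemble-averaged linear model
# from many random splits of a small dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def ensemble_averaged_model(X, y, n_splits=1000, val_fraction=0.2, seed=0):
    """Train one linear model per random split and average the parameters."""
    rng = np.random.RandomState(seed)
    coefs, intercepts = [], []
    for _ in range(n_splits):
        # Each split defines one model instance of the ensemble.
        X_tr, _, y_tr, _ = train_test_split(
            X, y, test_size=val_fraction, random_state=rng.randint(2**31 - 1))
        model = LinearRegression().fit(X_tr, y_tr)
        coefs.append(model.coef_)
        intercepts.append(model.intercept_)
    # The averaged parameters define a single linear model: cheap to store and evaluate.
    return np.mean(coefs, axis=0), np.mean(intercepts)

# Prediction with the averaged model is a single matrix-vector product:
# y_pred = X_new @ coef_avg + intercept_avg
```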

1. O. K. Farha, A. O. Yazaydin, I. Eryazici, C. D. Malliakas, B. G. Hauser, M. G. Kanatzidis, S. T. Nguyen, R. Q. Snurr, and J. T. Hupp, Nat. Chem. 2, 944 (2010).
2. S. Curtarolo, G. L. W. Hart, M. B. Nardelli, N. Mingo, S. Sanvito, and O. Levy, Nat. Mater. 12, 191 (2013).
4. G. R. Schleder, A. C. M. Padilha, C. M. Acosta, M. Costa, and A. Fazzio, J. Phys. Mater. 2, 032001 (2019).
5. D. E. P. Vanpoucke, J. Phys. Condens. Matter 26, 133001 (2014).
6. D. E. P. Vanpoucke, S. S. Nicley, J. Raymakers, W. Maes, and K. Haenen, Diam. Relat. Mater. 94, 233 (2019).
7. R. G. Parr and W. Yang, Density-Functional Theory of Atoms and Molecules, International Series of Monographs on Chemistry Vol. 16 (Oxford Science Publications, Oxford, 1989).
8. K. Lejaeghere, G. Bihlmayer, T. Bjoerkman, P. Blaha, S. Bluegel, V. Blum, D. Caliste, I. E. Castelli, S. J. Clark, A. Dal Corso et al., Science 351, aad3000 (2016).
9. E. Ghafari, M. Bandarabadi, H. Costa, and E. Júlio, J. Mater. Civ. Eng. 27, 04015017 (2015).
10. Y. Liu, T. Zhao, W. Ju, and S. Shi, J. Mater. 3, 159 (2017).
11. K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh, Nature 559, 547 (2018).
12. M. Rupp, O. A. von Lilienfeld, and K. Burke, J. Chem. Phys. 148, 241401 (2018).
13. M. Haghighatlari and J. Hachmann, Curr. Opin. Chem. Eng. 23, 51 (2019).
14. Y. Zhang, X. He, Z. Chen, Q. Bai, A. M. Nolan, C. A. Roberts, D. Banerjee, T. Matsunaga, Y. Mo, and C. Ling, Nat. Commun. 10, 5260 (2019).
15. G. R. Schleder, A. C. M. Padilha, A. Reily Rocha, G. M. Dalpian, and A. Fazzio, J. Chem. Inf. Model. 60, 452 (2020).
16. H. Willems, S. De Cesco, and F. Svensson, J. Med. Chem. (published online 2020).
17. Y. Goldberg, J. Artif. Intell. Res. 57, 345 (2016).
18. J. N. Kutz, J. Fluid Mech. 814, 1 (2017).
19. S. Mehrkanoon, Y. A. Shardt, J. A. Suykens, and S. X. Ding, Eng. Appl. Artif. Intell. 55, 219 (2016).
20. S. Chmiela, H. E. Sauceda, K.-R. Müller, and A. Tkatchenko, Nat. Commun. 9, 3887 (2018).
21. A. Kamath, R. A. Vargas-Hernández, R. V. Krems, T. Carrington, and S. Manzhos, J. Chem. Phys. 148, 241702 (2018).
22. Y. A. W. Shardt, S. Mehrkanoon, K. Zhang, X. Yang, J. Suykens, S. X. Ding, and K. Peng, Can. J. Chem. Eng. 96, 171 (2018).
23. K. T. Schütt, M. Gastegger, A. Tkatchenko, K.-R. Müller, and R. J. Maurer, Nat. Commun. 10, 5024 (2019).
24. P. Z. Moghadam, S. M. J. Rogge, A. Li, C.-M. Chow, J. Wieme, N. Moharrami, M. Aragones-Anglada, G. Conduit, D. A. Gomez-Gualdron, V. Van Speybroeck et al., Matter 1, 219 (2019).
25. W. Yang, T. T. Fidelis, and W.-H. Sun, ACS Omega 5, 83 (2020).
26. K. Gubaev, E. V. Podryabinkin, and A. V. Shapeev, J. Chem. Phys. 148, 241727 (2018).
27. A. Y.-T. Wang, R. J. Murdock, S. K. Kauwe, A. O. Oliynyk, A. Gurlo, J. Brgoch, K. A. Persson, and T. D. Sparks, Chem. Mater. 32, 4954–4965 (2020).
28. J. R. Cendagorta, J. Tolpin, E. Schneider, R. Q. Topper, and M. E. Tuckerman, J. Phys. Chem. B 124, 3647 (2020).
29. S. Curtarolo, W. Setyawan, G. L. Hart, M. Jahnatek, R. V. Chepulskii, R. H. Taylor, S. Wang, J. Xue, K. Yang, O. Levy et al., Comput. Mater. Sci. 58, 218 (2012).
30. A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder et al., APL Mater. 1, 011002 (2013).
31. S. Kirklin, J. E. Saal, B. Meredig, A. Thompson, J. W. Doak, M. Aykol, S. Rühl, and C. Wolverton, npj Comput. Mater. 1, 15010 (2015).
32. N. Carson, Chem. Eur. J. 26, 3194 (2020).
33. J. P. Perdew and K. Schmidt, AIP Conf. Proc. 577, 1 (2001).
34. D. E. P. Vanpoucke and K. Haenen, Diam. Relat. Mater. 79, 60 (2017).
35. C. Houben, N. Peremezhney, A. Zubov, J. Kosek, and A. A. Lapkin, Org. Process Res. Dev. 19, 1049 (2015).
36. M. Rubens, J. Van Herck, and T. Junkers, ACS Macro Lett. 8, 1437 (2019).
37. A. D. Clayton, A. M. Schweidtmann, G. Clemens, J. A. Manson, C. J. Taylor, C. G. Niño, T. W. Chamberlain, N. Kapur, A. J. Blacker, A. A. Lapkin et al., Chem. Eng. J. 384, 123340 (2020).
38. Y. Zhang and C. Ling, npj Comput. Mater. 4, 25 (2018).
39. K. De Grave, J. Ramon, and L. De Raedt, in 11th International Conference on Discovery Science, Budapest, Hungary, October 13–16, 2008, Lecture Notes in Computer Science Vol. 5255, edited by J. F. Boulicaut, M. R. Berthold, and T. Horvath (Springer, 2008), pp. 185–196, ISBN 978-3-540-88410-1.
40. D. A. Cohn, Z. Ghahramani, and M. I. Jordan, J. Artif. Intell. Res. 4, 129 (1996).
41. N. Peremezhney, E. Hines, A. Lapkin, and C. Connaughton, Eng. Optim. 46, 1593 (2014).
42. A. M. Schweidtmann, A. D. Clayton, N. Holmes, E. Bradford, R. A. Bourne, and A. A. Lapkin, Chem. Eng. J. 352, 277 (2018).
43. M. Rubens, J. H. Vrijsen, J. Laun, and T. Junkers, Angew. Chem. Int. Ed. 58, 3183 (2019).
44. C. W. Coley, D. A. Thomas, J. A. M. Lummiss, J. N. Jaworski, C. P. Breen, V. Schultz, T. Hart, J. S. Fishman, L. Rogers, H. Gao et al., Science 365, 6453 (2019).
45. Z. Wang, Y. Su, W. Shen, S. Jin, J. H. Clark, J. Ren, and X. Zhang, Green Chem. 21, 4555 (2019).
46. A. Menon, C. Gupta, K. M. Perkins, B. L. DeCost, N. Budwal, R. T. Rios, K. Zhang, B. Póczos, and N. R. Washburn, Mol. Syst. Des. Eng. 2, 263 (2017).
47. A. Menon, J. A. Thompson-Colón, and N. R. Washburn, Frontiers Mater. 6, 87 (2019).
48. A. Géron, Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (O’Reilly Media, Sebastopol, CA, 2017), ISBN 978-1491962299.
49. We are assuming non-pathological datasets and data splittings. Furthermore, where the data requires it, we also assume that stratification is considered during the splitting of the dataset.
50. Within the context of this work, we assume that a (possible) test set is split off initially. This test set can be used to compare the quality of different models. The remaining dataset is used to create an ensemble of train-validation splits. For simplicity's sake, we refer to this combination of the validation and training sets as the full set.
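A minimal sketch of this splitting scheme, assuming scikit-learn; the helper name make_split_ensemble and the split fractions are hypothetical, not taken from this work.

```python
# Illustrative sketch: the test set is split off once; the remaining "full set"
# is then repeatedly divided into train/validation pairs to form the ensemble of splits.
from sklearn.model_selection import train_test_split

def make_split_ensemble(X, y, n_splits=1000, test_fraction=0.2, val_fraction=0.2, seed=0):
    # One-time test split, reserved only for comparing the final models.
    X_full, X_test, y_full, y_test = train_test_split(
        X, y, test_size=test_fraction, random_state=seed)
    # Ensemble of train/validation splits drawn from the remaining full set.
    splits = [
        train_test_split(X_full, y_full, test_size=val_fraction, random_state=seed + i)
        for i in range(n_splits)
    ]
    return (X_test, y_test), splits
```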
51. K. Hermans, B. Kranz, and O. van Knippenberg, “Pressure sensitive adhesive tape,” patent WO2018/002055A1 (1 April 2018).
52. For practical purposes, a polynomial regression is equivalent to a multiple linear regression where each of the linear descriptors is one of the polynomial terms.
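For illustration, this equivalence can be expressed directly in scikit-learn (a sketch; the degree of 2 is an assumed example value):

```python
# Polynomial regression written as a multiple linear regression whose
# descriptors are the polynomial terms, via a scikit-learn pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # x1, x2, x1^2, x1*x2, x2^2, ...
    LinearRegression(),                                 # linear fit in the expanded terms
)
# poly_model.fit(X, y) then behaves exactly like a multiple linear regression
# on the polynomial feature columns.
```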
53. For n=10, only 45 distinct 80/20 splits are possible, while for n=20, this number has already grown to 4845. In this work, we have used 1000 (random) splits regardless of the dataset size. Since the splits are drawn randomly from the collection of all possible splits, this means that for a dataset of size n=10, the ensemble of 1000 splits contains multiple copies of the same split realization; each split appears on average 22 times. For such cases, a computationally more efficient approach is to limit the ensemble of splits to the exhaustive sampling of all splits (to be implemented in our framework in the future). However, no significant differences are expected between this and the current implementation.
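The counting above and the suggested exhaustive sampling can be sketched as follows (illustrative only; this is not the implementation used in this work):

```python
# Count the distinct 80/20 splits of a small dataset and enumerate them exhaustively.
from itertools import combinations
from math import comb

n = 10
k = round(0.2 * n)                       # validation-set size for an 80/20 split
print(comb(n, k))                        # 45 distinct splits for n = 10 (4845 for n = 20)

all_splits = [
    (sorted(set(range(n)) - set(val_idx)), list(val_idx))
    for val_idx in combinations(range(n), k)
]                                        # exhaustive list of (train, validation) index pairs
```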
54. B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Monographs on Statistics and Applied Probability Vol. 57 (Chapman & Hall/CRC, Boca Raton, FL, 1993).
55. B. Efron and T. Hastie, Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, 1st ed. (Cambridge University Press, 2016).
56. J. Friedman, T. Hastie, and R. Tibshirani, J. Stat. Softw. 33, 1 (2010).
57. S. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, IEEE J. Sel. Top. Signal Process. 1, 606 (2007).
58. F. Santosa and W. W. Symes, SIAM J. Sci. Comput. 7, 1307 (1986).
59. R. Tibshirani, J. Royal Stat. Soc. B 58, 267 (1996).
60. A. E. Hoerl and R. W. Kennard, Technometrics 12, 55 (1970).
61. As we are using the scikit-learn framework, the hyperparameter tuning is performed using the ElasticNetCV model with at least 100 values of α.
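A minimal sketch of such a setup (the l1_ratio grid, cv, and max_iter values are assumptions, not taken from this work):

```python
# ElasticNetCV scans a path of at least 100 alpha values and selects the best by cross-validation.
from sklearn.linear_model import ElasticNetCV

enr = ElasticNetCV(
    n_alphas=100,                                 # at least 100 alpha values along the path
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],     # illustrative mixing-parameter grid
    cv=5,                                         # illustrative cross-validation setting
    max_iter=10000,
)
# enr.fit(X_train, y_train) selects alpha (and l1_ratio) by cross-validation.
```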
62.
63. The better quality observed for the polynomial ENR model in the case of the larger PSS datasets may be due to overfitting by the ENR model in those instances.
64. S. Wenmackers and D. E. P. Vanpoucke, Stat. Neerl. 66, 339 (2012).
65. In this context, the Best and Worst model instances in the figures should not be considered as absolute, but rather as representative.
66. S. Kullback and R. A. Leibler, Ann. Math. Statist. 22, 79 (1951).
67. H. Akaike, IEEE Trans. Automat. Contr. 19, 716 (1974).
68.
69. E. Wit, E. v. d. Heuvel, and J.-W. Romeijn, Stat. Neerl. 66, 217 (2012).
70. J. E. Cavanaugh, Stat. Probab. Lett. 33, 201 (1997).
71. S. Konishi and G. Kitagawa, Information Criteria and Statistical Modeling, 1st ed. (Springer Publishing Company, Incorporated, 2007), ISBN 0387718869.
72. L. Breiman, Mach. Learn. 24, 123 (1996).
73. T. K. Ho, IEEE Trans. Pattern Anal. 20, 832 (1998).
74. C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning) (The MIT Press, 2005), ISBN 026218253X.
75. D. K. Duvenaud, Ph.D. thesis, University of Cambridge, 2014.
76. L. S. Shapley, Notes on the n-Person Game—II: The Value of an n-Person Game (RAND Corporation, Santa Monica, CA, 1951).
77. S. M. Lundberg and S.-I. Lee, in Advances in Neural Information Processing Systems 30 (NIPS 2017), Advances in Neural Information Processing Systems Vol. 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (2017), ISSN 1049-5258.
