Three active learning schemes are used to generate training data for Gaussian process interpolation of intermolecular potential energy surfaces. These schemes aim to achieve the lowest predictive error with the fewest training points, and therefore offer an alternative to the standard approaches of grid-based sampling or space-filling designs such as Latin hypercubes (LHC). Results are presented for three molecular systems: CO₂–Ne, CO₂–H₂, and Ar₃. For each system, two of the proposed active learning schemes notably outperform LHC designs of comparable size and, in two of the systems, produce an error an order of magnitude lower than that of the LHC design. The procedures can be used to select a subset of points from a large pre-existing data set, to select points at which to generate data de novo, or to supplement an existing data set to improve accuracy.
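To illustrate the first use case, selecting a subset of points from a pre-existing pool, the following is a minimal sketch of one common active learning rule: greedily adding the pool point with the largest Gaussian process predictive variance. It is not one of the three schemes of this work, whose details are not given in the abstract. The squared-exponential kernel, its fixed hyperparameters, the one-dimensional Lennard-Jones-like `toy_energy` function, and all function names are assumptions made purely for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.3, variance=1.0):
    """Squared-exponential kernel between point sets a (n, d) and b (m, d)."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_posterior(x_train, y_train, x_query, jitter=1e-6, variance=1.0):
    """Zero-mean GP posterior mean and variance at x_query.

    Hyperparameters are held fixed (an assumption of this sketch).
    """
    k_tt = rbf_kernel(x_train, x_train, variance=variance)
    k_tt += jitter * np.eye(len(x_train))          # numerical stabilisation
    k_tq = rbf_kernel(x_train, x_query, variance=variance)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_tq)
    mean = k_tq.T @ alpha
    var = variance - np.sum(v**2, axis=0)          # prior minus explained variance
    return mean, np.maximum(var, 0.0)

def select_by_max_variance(pool_x, energy_fn, n_points, start_index=0):
    """Greedily pick pool indices where the GP is currently least certain."""
    chosen = [start_index]
    while len(chosen) < n_points:
        x_train = pool_x[chosen]
        y_train = energy_fn(x_train)               # stands in for an ab initio call
        _, var = gp_posterior(x_train, y_train, pool_x)
        var[chosen] = -np.inf                      # never re-select a point
        chosen.append(int(np.argmax(var)))
    return chosen

def toy_energy(x):
    """Hypothetical 1D Lennard-Jones-like curve, used only as a stand-in surface."""
    r = x[:, 0]
    return 4.0 * (r**-12 - r**-6)

# Candidate pool of separations (reduced units), e.g. a large pre-existing data set.
pool = np.linspace(0.9, 3.0, 200).reshape(-1, 1)
chosen = select_by_max_variance(pool, toy_energy, n_points=10)

mean, var = gp_posterior(pool[chosen], toy_energy(pool[chosen]), pool)
rmse = np.sqrt(np.mean((mean - toy_energy(pool))**2))
print("selected separations:", np.sort(pool[chosen, 0]).round(2))
print(f"pool RMSE with {len(chosen)} points: {rmse:.3f}")
```

With fixed hyperparameters the predictive variance depends only on the input locations, so this particular rule spreads points out much as a space-filling design would. In practice the hyperparameters would be refit as data arrive, and the selection criterion could also use the predicted energies.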
