The training set of atomic configurations is key to the performance of any Machine Learning Force Field (MLFF), and as such, training set selection determines the applicability of the MLFF model for predictive molecular simulations. However, most atomistic reference datasets are inhomogeneously distributed across configurational space (CS), and thus choosing the training set randomly or according to the probability distribution of the data leads to models whose accuracy is mainly determined by the most common, close-to-equilibrium configurations in the reference data. In this work, we combine unsupervised and supervised ML methods to bypass the inherent bias of the data toward common configurations, effectively widening the applicability range of the MLFF to the fullest capabilities of the dataset. To achieve this goal, we first cluster the CS into subregions that are similar in terms of geometry and energetics. We then iteratively test a given MLFF's performance on each subregion and refill the model's training set with representatives of the most inaccurately predicted parts of the CS. The proposed approach has been applied to a set of small organic molecules and alanine tetrapeptide, demonstrating up to a twofold decrease in the root mean squared errors for force predictions on non-equilibrium geometries of these molecules. Furthermore, our ML models demonstrate superior stability compared with default training approaches, allowing reliable study of processes involving highly out-of-equilibrium molecular configurations. These results hold for both kernel-based methods (sGDML and GAP/SOAP models) and deep neural networks (SchNet model).
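The cluster-test-refill loop described above can be sketched in a few lines. The following is an illustrative toy only, not the paper's actual pipeline: 1-D synthetic "descriptors" stand in for molecular geometries, scikit-learn's `MiniBatchKMeans` partitions the configurational space, and kernel ridge regression stands in for the MLFF. All hyperparameters (cluster count, batch size, kernel width) are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Toy stand-in for molecular descriptors and target forces:
# most samples sit near "equilibrium" (x ~ 0); the tails are rare.
X = rng.normal(0.0, 1.0, size=(2000, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * X[:, 0] ** 2

def cluster_error_sampling(X, y, n_clusters=8, n_init_train=50,
                           n_add=25, n_rounds=4, seed=0):
    """Grow the training set from the worst-predicted cluster each round."""
    rng = np.random.default_rng(seed)
    # 1) Partition configurational space into subregions.
    labels = MiniBatchKMeans(n_clusters=n_clusters, n_init=3,
                             random_state=seed).fit_predict(X)
    train_idx = rng.choice(len(X), n_init_train, replace=False).tolist()
    for _ in range(n_rounds):
        # 2) Train a surrogate model on the current training set.
        model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=1.0)
        model.fit(X[train_idx], y[train_idx])
        # 3) Score each subregion by its mean absolute error.
        resid = np.abs(model.predict(X) - y)
        worst = int(np.argmax([resid[labels == c].mean()
                               for c in range(n_clusters)]))
        # 4) Refill the training set from the worst subregion.
        pool = np.setdiff1d(np.flatnonzero(labels == worst), train_idx)
        if pool.size:
            train_idx += rng.choice(pool, min(n_add, pool.size),
                                    replace=False).tolist()
    # Final refit on the enlarged training set.
    model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=1.0)
    model.fit(X[train_idx], y[train_idx])
    return model, np.asarray(train_idx)

model, train_idx = cluster_error_sampling(X, y)
rmse = float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))
```

Because each round samples from the currently worst-predicted cluster rather than from the data distribution, rare high-error subregions receive training points they would almost never get under random selection.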

