Machine learning potentials (MLPs) have attracted significant attention in computational chemistry and materials science due to their high accuracy and computational efficiency. The proper selection of atomic structures is crucial for developing reliable MLPs: insufficient or redundant structures can impede the training process and result in a poor-quality MLP. Here, we propose a local-environment-guided screening algorithm for efficient dataset selection in MLP development. The algorithm maintains a local environment bank that stores the unique local environments of atoms encountered so far. The dissimilarity between a candidate local environment and those stored in the bank is evaluated using the Euclidean distance, and a new structure is selected only if its local environment differs significantly from all entries in the bank. The bank is then updated with all the new local environments found in the selected structure. To demonstrate the effectiveness of our algorithm, we applied it to select structures for a Ge system and a Pd13H2 particle system. The algorithm reduced the training data size by around 80% for both systems without compromising the performance of the MLP models. We verified that the results were independent of the selection and ordering of the initial structures. We also compared our method with the farthest point sampling algorithm, and the results show that our algorithm is superior in both robustness and computational efficiency. Furthermore, the generated local environment bank can be continuously updated and can potentially serve as a growing database of representative local environments, aiding in efficient dataset maintenance for constructing accurate MLPs.
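The screening procedure described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes each structure is already represented as an array of per-atom descriptor vectors (the descriptor choice and the distance threshold are left as user inputs), and it adds only the environments judged novel to the bank.

```python
import numpy as np

def screen_structures(structures, threshold):
    """Select structures whose atomic environments are novel.

    structures: list of (n_atoms, n_features) arrays of per-atom descriptors
                (a hypothetical representation; any fixed-length atomic
                fingerprint could play this role).
    threshold:  minimum Euclidean distance to the nearest bank entry for an
                environment to count as new.
    """
    bank = []          # unique local-environment descriptors seen so far
    selected = []      # indices of structures kept for training
    for i, envs in enumerate(structures):
        if not bank:
            # first structure always enters; it seeds the bank
            selected.append(i)
            bank.extend(envs)
            continue
        bank_arr = np.asarray(bank)
        # Euclidean distance from each environment to every bank entry
        dists = np.linalg.norm(envs[:, None, :] - bank_arr[None, :, :], axis=-1)
        min_dists = dists.min(axis=1)  # nearest-bank-entry distance per atom
        if (min_dists > threshold).any():
            # at least one sufficiently dissimilar environment: keep structure
            selected.append(i)
            # update the bank with the new local environments found here
            bank.extend(envs[min_dists > threshold])
    return selected, np.asarray(bank)
```

Because each structure is compared only against the compact bank rather than against every previously selected structure, the cost of screening grows with the number of distinct environments, which is one plausible source of the efficiency advantage over farthest point sampling reported in the abstract.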

