Instant machine learning predictions of molecular properties are desirable for materials design, but the predictive power of the methodology is mainly tested on well-known benchmark datasets. Here, we investigate the performance of machine learning with kernel ridge regression (KRR) for the prediction of molecular orbital energies on three large datasets: the standard QM9 small organic molecules set, amino acid and dipeptide conformers, and organic crystal-forming molecules extracted from the Cambridge Structural Database. We focus on the prediction of highest occupied molecular orbital (HOMO) energies, computed at the density-functional level of theory. Two different representations that encode the molecular structure are compared: the Coulomb matrix (CM) and the many-body tensor representation (MBTR). We find that KRR performance depends significantly on the chemistry of the underlying dataset and that the MBTR is superior to the CM, predicting HOMO energies with a mean absolute error as low as 0.09 eV. To demonstrate the power of our machine learning method, we apply our model to structures of 10k previously unseen molecules. We gain instant energy predictions that allow us to identify interesting molecules for future applications.
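
As a rough illustration of the workflow summarized above, the following minimal sketch builds a Coulomb matrix for a molecule and fits a kernel ridge regression model on such vectors. It is not the authors' implementation. The function names, the zero-padding scheme, the Gaussian kernel, and the hyperparameters `sigma` and `lam` are assumptions chosen for illustration; only the standard Coulomb matrix definition (diagonal 0.5 Z^2.4, off-diagonal Z_i Z_j / |R_i - R_j|) and the closed-form KRR solution are used.

```python
import numpy as np


def coulomb_matrix(charges, positions, size):
    """Return a zero-padded Coulomb matrix flattened to a vector.

    charges: nuclear charges Z_i; positions: (n, 3) coordinates.
    size: padded dimension (assumed >= number of atoms).
    """
    charges = np.asarray(charges, dtype=float)
    positions = np.asarray(positions, dtype=float)
    n = len(charges)
    # Pairwise interatomic distances |R_i - R_j|.
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder; the diagonal is overwritten below
    cm = np.outer(charges, charges) / dist
    np.fill_diagonal(cm, 0.5 * charges ** 2.4)
    # Zero-pad so molecules of different size share one vector length.
    padded = np.zeros((size, size))
    padded[:n, :n] = cm
    return padded.ravel()


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel between the rows of a and the rows of b (an assumed choice)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def krr_fit(x_train, y_train, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights alpha."""
    k = gaussian_kernel(x_train, x_train, sigma)
    return np.linalg.solve(k + lam * np.eye(len(x_train)), y_train)


def krr_predict(x_train, alpha, x_new, sigma):
    """Predict targets (e.g. HOMO energies) as kernel-weighted sums over training points."""
    return gaussian_kernel(x_new, x_train, sigma) @ alpha
```

In practice the training inputs would be the descriptor vectors of the molecules in a dataset and the targets their computed HOMO energies, with `sigma` and `lam` selected by cross-validation.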
