Chemical (molecular, quantum) machine learning relies on representing molecules in unique and informative ways. Here, we present the matrix of orthogonalized atomic orbital coefficients (MAOC) as a quantum-inspired molecular and atomic representation containing both structural (composition and geometry) and electronic (charge and spin multiplicity) information. MAOC is based on a cost-effective localization scheme that represents localized orbitals via a predefined set of atomic orbitals. The latter can be constructed from such small atom-centered basis sets as pcseg-0 and STO-3G in conjunction with a guess (non-optimized) electronic configuration of the molecule. Importantly, MAOC is suitable for representing monatomic, molecular, and periodic systems and can distinguish compounds with identical compositions and geometries but distinct charges and spin multiplicities. Using principal component analysis, we constructed a more compact but equally powerful version of MAOC—PCX-MAOC. To test the performance of full and reduced MAOC and several other representations (CM, SOAP, SLATM, and SPAHM), we used a kernel ridge regression machine learning model to predict frontier molecular orbital energy levels and ground state single-point energies for chemically diverse neutral and charged, closed- and open-shell molecules from an extended QM7b dataset, as well as two new datasets, N-HPC-1 (N-heteropolycycles) and REDOX (nitroxyl and phenoxyl radicals, carbonyl, and cyano compounds). MAOC affords accuracy that is either similar or superior to other representations for a range of chemical properties and systems.

Molecular (also often called chemical or quantum) machine learning (ML) has rapidly entered the domain of chemical science, facilitating both the development of simulation techniques and the design of improved reagents, catalysts, functional materials, etc.1–8 Molecular representations are the prerequisite to machine-learn chemical properties, as they uniquely encode information about molecular composition and structure into a numerical format.9–11 These representations can be based on simple molecular properties (such as molecular weight and van der Waals volume,12 number of heteroatoms,13 partial charges,14 lipophilicity,15 etc.; also known as fingerprints or descriptors), molecular coordinates, graphs,16 topologies,17 and images18 and can take the form of a string (e.g., SMILES19 and SELFIES20), a vector, or a matrix. Among them, the physics-inspired representations9 arguably aim to encode molecular geometries and compositions most comprehensively. Some of these representations are built from atom-centered continuous basis functions, such as the smooth overlap of atomic positions (SOAP)21 and the atomic cluster expansion (ACE).22 Others use potentials, such as the Coulomb Matrix (CM),9 the spectrum of London and Axilrod-Teller-Muto (SLATM) potential,23 and the Faber-Christensen-Huang-von Lilienfeld representations (FCHL18,19),24,25 or transform input structural data into an internal coordinate system, for example, in the many-body tensor representation (MBTR).26 However, these representations only consider the charges and positions of the nuclei in space, assuming charge neutrality. As a result, they violate the injectivity requirement of machine learning representations, i.e., they are unable to distinguish between compounds with distinct electronic configurations but identical atomic compositions and geometries (Fig. 1).
To satisfy the injectivity condition, several so-called quantum-inspired representations were developed by including additional electronic structure information. Molecular orbital basis machine learning (MOB-ML)27 and the F (Fock), J (Coulomb), and K (exchange) matrices (FJK) representation28 use post-Hartree-Fock molecular orbital properties, i.e., they require an a priori ab initio computation and thus operate in a ∆-ML fashion. In contrast, the spectrum of approximated Hamiltonian matrices (SPAHM)29 from Corminboeuf and colleagues employs a guess (i.e., pre-Hartree-Fock) electronic Hamiltonian, making it a simpler and quicker way to encode not only the composition and geometry but also the charge, spin multiplicity, and electronic state of a molecule.

FIG. 1.

Various representations for neutral and dicationic benzene in the same geometry.


In this work, we present the matrix of orthogonalized atomic orbital coefficients (MAOC), a quantum-inspired representation that uses meta-Löwdin orthogonalized atomic orbitals to generate localized molecular orbital coefficients. MAOC is an atomic and molecular representation that can describe charged and open-shell compounds and distinguish molecules with identical structures but distinct electronic states. We tested the performance of MAOC in comparison with other coordinate- and Hamiltonian-based representations using a kernel ridge regression machine learning model for predicting orbital and single-point energies for a broad range of conventional and redox-active species from known and newly constructed datasets.

In the Hartree–Fock theory, molecular orbitals are obtained by diagonalizing the Fock matrix to satisfy Brillouin’s theorem:

⟨ΨHF|Ĥ|Ψiα⟩ = 0, (1)
where ΨHF is a self-consistent optimized Hartree–Fock wavefunction of the ground state, Ĥ is the Hamiltonian operator, and Ψiα is a singly excited determinant. The occupied-occupied and virtual-virtual blocks of the Fock matrix are diagonalized to obtain canonical molecular orbitals, which are delocalized over the molecule. Occupied molecular orbitals can be localized by optimizing a cost function that measures their locality, as is done in the Pipek-Mezey (maximizing population charges on atoms),30,31 Boys and Foster (maximizing the sum of squares between orbital centroids),32 and Edmiston-Ruedenberg (maximizing the Coulomb self-repulsion)33 localization schemes. Alternatively, both the occupied and virtual orbitals can be localized by projecting them onto a set of predefined (atomic) orbitals,34 and a localization scheme of this type35 is adopted in this work. MAOC is based on the coefficients of non-optimized (guess) localized molecular orbitals (L-MOs), generated as a linear combination of predefined orthogonalized atomic orbitals (o-AOs). Each block of atomic orbitals (core, valence, and virtual) is independently orthogonalized in MAOC using the meta-Löwdin scheme, which has the added benefit of being applicable to periodic systems (see Fig. S2 in the supplementary material).36,37 In the case of core electrons and lone pairs, localized orbitals are atom-centered, whereas, in the case of bonds and π-conjugated systems, localized orbitals are dispersed throughout sets of molecular fragments or over an entire molecule. Further details on the assignment of charge and spin and the sorting of the square MAOC matrix are given in Sec. I in the supplementary material.
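The orthogonalization step at the heart of the meta-Löwdin scheme is the symmetric (Löwdin) transformation S⁻¹/², applied separately to the core, valence, and virtual blocks of atomic orbitals. A minimal numpy sketch of this underlying step, using a hypothetical two-function overlap matrix (the block partitioning and reference minimal basis of the full meta-Löwdin scheme are omitted):

```python
import numpy as np

def lowdin_orth(S):
    """Symmetric (Loewdin) orthogonalization: C = S^(-1/2),
    so that the transformed basis has unit overlap, C.T @ S @ C = I."""
    eigval, eigvec = np.linalg.eigh(S)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# Hypothetical overlap matrix of two normalized, overlapping basis functions
S = np.array([[1.0, 0.4],
              [0.4, 1.0]])
C = lowdin_orth(S)

# The orthogonalized functions now have an identity overlap matrix
assert np.allclose(C.T @ S @ C, np.eye(2))
```

In the actual MAOC workflow, this transformation is carried out block-wise by the lo module of PySCF, which is also what makes the scheme applicable to periodic systems.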

Because the size of MAOC is related to the number of orbitals used to represent an atom, representation size grows quickly with compound size. To address this issue, a dimensionality reduction technique—principal component analysis (PCA)—can be used to reduce the size of MAOC, making it more compact, as well as easier to generate and use in machine learning tasks. This version of MAOC is denoted as PCX-MAOC, where X is the number of principal components.
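The reduction from an M × M matrix to M × X columns can be sketched with numpy (an SVD-based PCA, equivalent in effect to the sklearn implementation used later in this work; the matrix here is random stand-in data for an M × M MAOC):

```python
import numpy as np

def pca_reduce(mat, n_components=3):
    """Project the rows of a matrix onto its leading principal
    components, shrinking an M x M array to M x n_components."""
    centered = mat - mat.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
M = 24                              # hypothetical number of orbitals
maoc = rng.normal(size=(M, M))      # stand-in for an M x M MAOC matrix

pc3 = pca_reduce(maoc)              # M x 3, a PC3-MAOC-style array
assert pc3.shape == (M, 3)
```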

The code for generating MAOC and PCX-MAOC is freely available from https://github.com/hits-ccc/MAOC/tree/main/Codes/MAOC_mol_rep.

Three sets of non-overlapping chemical data were used to test MAOC in supervised machine learning. The QM7b dataset38,39 was selected since it is among the most widely used datasets for machine learning in chemistry. It consists of 7211 relatively small neutral molecules with no conjugated bonds within cyclic moieties. To evaluate the performance of MAOC for radicals and ions, geometries of the anionic, cationic, and dicationic forms of the compounds in the dataset were optimized and their various vertical and adiabatic properties were computed in this work; strongly spin-contaminated species were eliminated from the dataset (see Computational procedures below). Therefore, in total 7197 geometry-optimized anion radicals, 6999 geometry-optimized cation radicals, 7198 geometry-optimized dications, as well as 7208 anion radicals and 7208 cation radicals in the geometry of the parent neutral molecule were added to the original QM7b dataset of neutral molecules. We denote this extended dataset as QM7bX.

The N-HPC-1 dataset, constructed in this work,40 consists of 3735 nitrogen-doped polycyclic compounds that were generated by introducing up to 11 nitrogen atoms into prototypical polyaromatic skeletons (Fig. 2). For each generated compound, canonical SMILES were produced, and any duplicates were removed from the dataset. To generate respective open-shell compounds, one, two, or three electrons were removed, or one electron was added to the neutral molecules to produce eight groups within the dataset: neutral singlets, neutral triplets, anionic doublets, cationic doublets, dicationic singlets, dicationic triplets, tricationic doublets, and tricationic quartets. Stable open-shell N-heteropolycycles display a range of practically useful electronic,41 magnetic,42 and optical43 properties, whose facile prediction with a tailor-made representation could provide valuable molecular design guidelines.

FIG. 2.

Datasets constructed and used in this work. N-HPC-1: composition of the dataset according to the polycyclic skeleton, doped with nitrogen atoms. REDOX: composition of the dataset according to the type of redox-active molecules, as well as schemes of the corresponding redox reactions.


The REDOX dataset, also constructed for this study,44 contains 4146 neutral redox-active molecules45 with 1–4 unpaired electrons (Fig. 2). The dataset consists of several classes of compounds carrying diverse functional groups: organic radicals (nitroxyl, phenoxyl, and galvinoxyl), carbonyl compounds (quinones, carboxylates, and phenazine-derived radicals), and cyanides. These systems are widely utilized across chemical sciences in synthesis, characterization, radical trapping, organic batteries, etc. Nitroxyls undergo one electron oxidation to form oxoammonium cations, while all other species in the dataset instead favor one electron reduction toward the corresponding anions and anion radicals. With the inclusion of these adducts, the entire REDOX dataset thus comprises 4146 neutral radicals, as well as 4018 anionic and 687 cationic open- and closed-shell molecules.

The QM7bX, N-HPC-1, and REDOX datasets can be freely obtained from GitHub (https://github.com/hits-ccc/MAOC/tree/main/Datasets).

Geometries of all systems in the N-HPC-1 and REDOX datasets and of all charged systems in the QM7bX dataset were optimized at the PBE0-D3/def2-TZVP level of theory using the ORCA 5.0 package.46 Geometries of neutral compounds in the QM7b dataset were used to compute vertical properties at the same level of theory. In all computations, wavefunction stability checks were performed. For all open-shell compounds, the expectation value of the spin-squared operator, ⟨Ŝ²⟩, was assessed, and species whose ⟨Ŝ²⟩ deviated from the exact value by more than 10% were excluded (see Sec. II in the supplementary material). Single-point energies (SPEs), the energies of the highest, lowest, and singly occupied molecular orbitals (HOMOs, LUMOs, and SOMOs, respectively), as well as HOMO-LUMO gaps for the closed-shell and SOMO-LUMO gaps for the open-shell species, were also computed at the PBE0-D3/def2-TZVP level of theory.

Global SLATM representations23 were generated using the QML code47 with the variables set to sigmas = [0.05, 0.05], dgrids = [0.03, 0.03], rcut = 4.8, and rpower = 6. Coulomb Matrix representations9 were generated using the DScribe package48 and sorted using the L2 norm to ensure permutation invariance. The SOAP representations21 were generated using the DScribe package48 with the cutoff for the local regions (rcut) set to 6 Å, the number of radial basis functions set to 8, the maximum degree of spherical harmonics set to 100, and the spherical Gaussian-type orbitals chosen as the type of radial basis functions. The SPAHM representations were generated using the code provided in Ref. 29 with the LB guess and the MINAO basis (see also Sec. VII in the supplementary material). The MAOC representations were generated using the localized orbital (lo) package of the Python-based Simulations of Chemistry Framework (PySCF) package.49 Unless specified otherwise, MAOC representations were generated using the pcseg-0 basis set and ANO50 as the reference basis. In this work, a sorted flattened version of MAOC is used. The PCX-MAOC representation is created by reducing the size of an unflattened MAOC array from M × M to M × 1, M × 2, …, M × X (where X < M) using the PCA dimensionality reduction technique. In this work, the number of principal components, X, is set to 3 (see Sec. III in the supplementary material), and the representation is denoted as PC3-MAOC.
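The norm-based row sorting used to make such matrix representations permutation invariant (applied above to the Coulomb Matrix; the exact sorting applied to MAOC is detailed in the supplementary material) can be sketched as follows, with a toy matrix:

```python
import numpy as np

def sort_and_flatten(mat):
    """Sort rows by descending L2 norm, then flatten to a vector,
    so that any row (atom) ordering yields the same representation."""
    order = np.argsort(-np.linalg.norm(mat, axis=1))
    return mat[order].ravel()

# Two row-permuted copies of the same toy matrix give identical vectors
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = A[::-1]                         # same rows, reversed order
assert np.allclose(sort_and_flatten(A), sort_and_flatten(B))
```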

A DELL XPS 15 9510 laptop with an 11th Gen Intel(R) Core(TM) i7-11800H processor @ 2.30 GHz, a 64-bit operating system (x64-based processor), and 32.0 GB of installed RAM was used to generate all representations. The CPU timings were measured using the datetime built-in Python library in a Jupyter notebook. The storage space requirements were evaluated as the amount of space needed to store the NumPy ndarrays.

A kernel ridge regression (KRR) model was used to test MAOC and other representations in predicting the properties of compounds in the QM7bX, N-HPC-1, and REDOX datasets. All learning curves were generated using KRR with a Laplacian kernel, and the sigma and lambda hyperparameters were optimized for each representation and training set size using a two-fold cross-validation split. The sigma values examined ranged from 1 to 8000. The grid step size was set to between 100 and 500, depending on the size of the dataset under consideration. The lambda values were chosen between 1 and 10⁻¹⁰ (i.e., 1, 10⁻³, 10⁻⁵, 10⁻⁶, 10⁻⁷, 10⁻⁸, and 10⁻¹⁰). The Laplacian kernel was chosen after evaluating its performance vs the Gaussian kernel (see Sec. IV in the supplementary material). Unless specified otherwise, the train-to-test ratio for all properties under investigation is 80/20, and all the results in this study are the mean of five measurements using random shuffling of the data. The Cholesky decomposition (in the QML code47) is used to solve the KRR equation α = (K + λI)⁻¹y. The step size (training set size increment) for the learning curves is chosen based on the dataset, ranging from 400 to 1500. All machine learning operations were performed on a MacBook Pro (2019) with a 1.5 GHz Quad-Core Intel Core i5 processor and 8 GB of 2133 MHz LPDDR3 memory.
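In essence, this procedure builds a Laplacian kernel matrix over the training representations and solves the regularized linear system by Cholesky factorization. A self-contained numpy sketch on synthetic data (the function names and toy regression task are illustrative, not part of the QML code):

```python
import numpy as np

def laplacian_kernel(X1, X2, sigma):
    """K[i, j] = exp(-||x_i - x_j||_1 / sigma)."""
    dists = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=-1)
    return np.exp(-dists / sigma)

def krr_fit(X, y, sigma, lam):
    """Solve (K + lambda*I) alpha = y via Cholesky factorization."""
    K = laplacian_kernel(X, X, sigma) + lam * np.eye(len(X))
    L = np.linalg.cholesky(K)
    return np.linalg.solve(L.T, np.linalg.solve(L, y))

def krr_predict(X_train, alpha, X_test, sigma):
    return laplacian_kernel(X_test, X_train, sigma) @ alpha

# Toy regression: learn a simple linear target from random features
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X.sum(axis=1)
alpha = krr_fit(X, y, sigma=1.0, lam=1e-8)
pred = krr_predict(X, alpha, X, sigma=1.0)
assert np.abs(pred - y).mean() < 1e-4   # near-exact fit on training data
```

In practice, sigma and lambda are scanned over the grids described above and selected by cross-validated error rather than fixed as here.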

Principal component analysis (PCA) was conducted utilizing the sklearn Python library and an “auto” SVD solver. In this work, the number of components was set to either two (when mapping the chemical space) or three (when constructing PC3-MAOC), and the explained variance ratio was recorded close to the axis of the PCA plots. t-distributed Stochastic Neighbor Embedding (t-SNE) was performed using the sklearn Python library. The number of components was set to two, and all the other values were left as defaults.

Since MAOC uses orthogonalized atomic orbitals to construct the guess localized molecular orbitals, the basis set used to generate these atomic orbitals affects the matrix generation time, the training time for machine learning models, the amount of space needed to store the representation, and the quality of the property predictions. To identify the optimal choice, we tested several basis sets of varying sizes and flexibilities [Fig. 3(a)]. Generating MAOC for the N-HPC-1 dataset with the Karlsruhe (def2-SVP) or Dunning (cc-pVDZ) basis sets requires significantly more time and storage space than with the Pople, Jensen (pc-0 and pcseg-0), and LANL2DZ basis sets. STO-3G produces the smallest representation files in a relatively short time. In comparison to other representations (Table I), MAOC requires the longest generation time; however, even for the largest studied dataset, QM7bX, this amounts to only ~2 CPU hours. Moreover, both MAOC and SPAHM are well parallelized, i.e., MAOC for QM7bX requires just over 15 min on an eight-CPU laptop. Executing principal component analysis on the generated MAOC is virtually instant. In terms of storage space, PC3-MAOC and SPAHM are by far the most compact representations, while full MAOC is roughly of the same order of magnitude as SLATM and SOAP.

FIG. 3.

MAOC representations for 4000 compounds from the N-HPC-1 dataset built with various basis sets: (a) CPU time and storage space requirements, (b) HOMO-LUMO gap learning curves (logarithmic scale) generated using a kernel ridge regression model with a Laplacian kernel and a train-to-test ratio of 70/30.

TABLE I.

CPU and wall timings (in seconds) and storage space (in MB) associated with the tested representations for all studied datasets.

            CM      SLATM   SOAP     SPAHM    MAOC     PC3-MAOC
QM7bX
CPU time    9.4     19.6    22.8     2442.5   7036.0   a
Wall time   18.5    21.8    25.4     333.6    961.0    a
Storage     31.6    523.0   582.0    3.7      496.0    5.2
N-HPC-1
CPU time    5.5     5.7     9.7      876.3    2529.0   a
Wall time   7.3     8.0     13.7     114.2    329.0    a
Storage     17.8    92.5    76.9     2.7      607.0    4.2
REDOX
CPU time    8.4     26.0    30.7     1589.2   4580.0   a
Wall time   14.5    27.7    32.7     206.9    596.0    a
Storage     444.0   779.4   1024.0   20.7     5632.0   40.5
a Time required to generate PC3-MAOC from full MAOC is negligible compared with the time required to initially generate the full MAOC.

The learning curves in Fig. 3(b) illustrate the accuracy of the KRR machine learning model for the prediction of HOMO- and SOMO-LUMO gaps for a random sample of 4000 compounds from the N-HPC-1 dataset using MAOC built from various basis sets in terms of mean absolute error (MAE). The train-to-test ratio is 70/30, and the sigma values for each set of training data are optimized. Surprisingly, the highest accuracy is achieved with relatively small basis sets (3-21G, 6-31G, and pcseg-0). However, results with all tested basis sets ultimately fall within a narrow range of 0.31–0.35 eV in MAE. We note that adding diffuse and polarization functions to the basis sets increases the computational costs but does not affect (polarization functions) or even reduces (diffuse functions) the accuracy of ML predictions and, therefore, is not advised [see Fig. 3(a) and Secs. V A and V B in the supplementary material]. The reason behind the counterintuitively superior performance of smaller basis sets is that MAOC constructed from larger basis sets includes many more virtual orbitals, which, as shown in Sec. III E, do not carry significant weight in machine learning of the studied properties. While this is the case for organic molecules in this work, it might not always be: polarization functions are likely important (and must thus be included when constructing MAOC) for molecules containing heavy metals, etc. Finally, using principal component analysis, we evaluated how well the “atomic” space of chemical elements across the Periodic Table is mapped with MAOC depending on the basis set (see Sec. V C in the supplementary material). Ultimately, pcseg-0 was chosen as an optimal basis set for constructing MAOC representations in this work, but a similarly good ratio of computational cost to accuracy can be achieved with 3-21G and 6-31G.

By construction from orthogonalized atomic orbitals, MAOC can serve not only as a molecular but also as an atomic representation, differentiating single-atom systems and enabling prediction of their properties. This contrasts with many conventional molecular representations, which either rely on many-atom environments or, for a single atom, produce a representation too small for machine learning. Currently, prediction of monatomic properties relies on such traditional descriptors as electronegativity and atomic radii,51 which, however, are not uniquely defined.52,53 MAOC, in contrast, offers a rigorously defined way to encode single atoms across the entire Periodic Table for machine learning tasks. With MAOC, the “atomic” space is well resolved, and the proximity between two atoms reflects the similarity in their electronic (orbital) configurations (Fig. 4).

FIG. 4.

Atomic space of the Periodic Table of elements generated using MAOC with the pcseg-0 (first–sixth period) and AHGBS-5 (seventh period, lanthanides, and actinides) basis sets.


Coordinate-based representations, for example, Coulomb Matrix, SOAP, and SLATM, distinguish compounds based on their compositions and geometries, while the quantum-inspired representations SPAHM and MAOC also account for electronic structure features. As a result, only the latter can resolve the chemical space of molecules with similar geometries but nonidentical charges and spins [Fig. 5(a)]. This is further illustrated in the t-SNE plot of the chemical space of anionic and cationic compounds in the geometries of the parent neutral species from the QM7b dataset [Fig. 5(b)].

FIG. 5.

(a) Space of electronic configurations of cyclopenta[de]cinnoline, generated with MAOC using principal component analysis. (b) t-SNE plot of the chemical space of the anionic and cationic compounds in the geometries of the parent neutral species from QM7b dataset, generated with MAOC and colored according to molecular charge.


In this work, we focus on the fundamental properties of charged and open-shell compounds: energies of the frontier molecular orbitals, energy gaps between SOMO or HOMO and LUMO, and ground state single-point electronic energies. We first discuss the results for the commonly available QM7b dataset,38,39 which consists of small neutral organic compounds composed of elements from the first to third rows of the Periodic Table and mostly lacking conjugated π-bonds in cyclic moieties. To create an extended QM7bX dataset, anion radicals, cation radicals, and dications were considered for each parent molecule in QM7b; for the open-shell compounds, both vertical and adiabatic properties were evaluated. Overall, the best predictive performance of kernel ridge regression in terms of mean absolute errors (MAEs) and root-mean-square errors (RMSEs) for the orbital energies is achieved with the SLATM, SOAP, and SPAHM representations, followed by MAOC, PC3-MAOC, and the Coulomb Matrix [see Table II and Figs. 6(a) and 6(c)]. Compared with geometry-based representations, their quantum-inspired counterparts perform worse in predicting the single-point energies [Fig. 6(b)] but are generally better in predicting the orbital energies of the charged open-shell compounds. For example, SPAHM and PC3-MAOC are the best representations for machine-learning the SOMO-LUMO gap of anion radicals [Fig. 6(d)] and cation radicals [Fig. 6(e)], respectively. We have also combined the data for non-optimized (vertical) anion and cation radicals into a single subset to test the representations on systems with identical geometries but distinct electronic states. In this case, only the quantum-inspired representations, SPAHM and MAOC, are able to learn from the data, with SPAHM delivering the lowest MAEs [Fig. 6(f)].

TABLE II.

Mean absolute errors (MAEs) and root-mean-square errors (RMSEs) for frontier molecular orbital energies (in eV), their differences (gaps, in eV), and single-point energies (SPEs, in a.u.) of compounds in the QM7bX dataset predicted with kernel ridge regression, trained on 5284 molecules, in conjunction with various molecular representations.a 

FIG. 6.

Selected learning curves (logarithmic mean absolute errors vs logarithmic training set size) for the kernel ridge regression prediction of frontier molecular orbital energies, their energy differences (gaps), and single-point energies for compounds in the QM7bX dataset with various molecular representations. Superscript vert denotes ion radicals in the geometries of the parent neutral compounds. The full set of computed learning curves is provided in Sec. IX A in the supplementary material.


For the N-HPC-1 dataset40 (Fig. 2) comprising neutral N-heteropolycycles, their open-shell (doublet) anions and cations, closed- and open-shell (triplet) dications, and open-shell (doublet and quartet) trications, CM generally performs worse than other representations (Table III and Fig. 7). The performance of the latter varies: quantum-inspired representations offer the best accuracy in predicting the single-point energies, while SOAP and SLATM generally produce better orbital energies. However, in several cases, such as singlet dications and doublet trications, PC3-MAOC outperforms other representations, including the full MAOC (see Sec. III E). Finally, the performance of MAOC for the N-HPC-1 dataset containing π-conjugated compounds is improved (i.e., is associated with lower MAEs) relative to QM7bX, which is comprised mostly of saturated systems. To explain this, we turn to the way MAOC is constructed. The localized orbitals of conjugated π-bonds are no longer atom-centered, but rather are distributed over multiple atoms and account for the long-range interactions between them. This reflects the global character of MAOC.

TABLE III.

Mean absolute errors (MAEs) and root-mean-square errors (RMSEs) for frontier molecular orbital energies (in eV), their differences (gaps, in eV), and single-point energies (SPEs, in a.u.) of compounds in the N-HPC-1 dataset predicted with kernel ridge regression, trained on 2900 (1700 in the case of trication doublets) molecules, in conjunction with various molecular representations.a 

FIG. 7.

Selected learning curves (logarithmic mean absolute errors vs logarithmic training set size) for the kernel ridge regression prediction of frontier molecular orbital energies, their energy differences (gaps), and single-point energies for compounds in the N-HPC-1 dataset with various molecular representations. The full set of computed learning curves is provided in Sec. IX B in the supplementary material.


Next, we assess the performance of the kernel ridge regression with various representations for the REDOX dataset44 (Table IV and Fig. 8), which includes redox-active molecules from diverse chemical classes and with a range of functional groups (Fig. 2). The errors in property predictions for this dataset are markedly higher than for QM7bX and N-HPC-1 due to the greater chemical diversity within and non-combinatorial generation of the REDOX dataset. Each class of redox-active species is represented by ∼1000 compounds formed from a variety of structural motifs and functional group combinations, resulting in a much broader range of values for a given property compared with the QM7bX and N-HPC-1 datasets (see Sec. VIII in the supplementary material) and limiting the model’s ability to “learn”. Nonetheless, the quantum-inspired representations—SPAHM, MAOC, and PC3-MAOC—as well as the geometry-based SLATM, outperform CM and SOAP in all cases. SPAHM predicts significantly better single-point energies, while MAOC and PC3-MAOC afford the best accuracy for frontier orbital energies. Functional groups, present in REDOX but absent in the N-HPC-1 dataset, enhance the global character of MAOC through resonance, which results in molecular orbitals that are centered not on a single bond but distributed over at least three atoms.

TABLE IV.

Mean absolute errors (MAEs) and root-mean-square errors (RMSEs) for frontier molecular orbital energies (in eV), their differences (gaps, in eV), and single-point energies (SPEs, in a.u.) of compounds in the REDOX dataset predicted with kernel ridge regression, trained on 7000 molecules, in conjunction with various molecular representations.a 

FIG. 8.

Learning curves (logarithmic mean absolute errors vs logarithmic training set size) for the kernel ridge regression prediction of frontier molecular orbital energies, their energy differences (gaps), and single-point energies for compounds in the REDOX dataset with various molecular representations.


Finally, to further test the generalizability of MAOC, we applied it to predict the HOMO-LUMO gaps in a dataset containing 5000 compounds in total, randomly selected from the QM7b, N-HPC-1, and REDOX datasets. We find that the performance of MAOC and PC3-MAOC for this “mixed” dataset averages between the corresponding errors for the individual datasets (see Sec. VI in the supplementary material). This further demonstrates that MAOC is a suitable representation for compositionally, structurally, and electronically diverse sets of chemical compounds.

For the N-HPC-1 dataset of N-heteropolycycles, the PCA-reduced version of MAOC, PC3-MAOC, outperforms the full representation in predicting all considered properties. In search of a reason behind this puzzling behavior, we have analyzed the orbital contributions to the three principal components of PC3-MAOC. Specifically, for all species in a given subset, localized molecular orbitals contributing the most to each principal component (above 0.2 variance) are listed. We then count how many times an orbital of interest, e.g., HOMO-2, HOMO-1, etc., occurs in this list. Finally, this occurrence count is normalized to the total number of L-MOs in the list. For example, six frontier and near-frontier localized MOs—from HOMO-2 to LUMO+2—constitute ∼11% of all orbitals with variance above 0.2 in PC1-3 of MAOC for the parent neutral compounds from the QM7b dataset (Fig. 9). Among the three studied datasets, N-HPC-1 stands out in that its MAOC principal component lists feature HOMO and LUMO much less often than the respective higher- and lower-lying orbitals (HOMO-2, HOMO-1, LUMO+1, and LUMO+2). In contrast, all six frontier and near-frontier MOs occur with similar frequency in the >0.2 variance lists of the QM7bX and REDOX datasets. An in-depth quantitative analysis of these contributions to the molecular representations and their effects on the quality of machine learning predictions is planned for a future study.
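The counting procedure can be sketched as follows, with hypothetical loadings (three principal components × eight orbitals) and frontier orbitals placed at illustrative indices; the 0.2 threshold follows the description above:

```python
import numpy as np

def frontier_fraction(loadings, frontier_idx, threshold=0.2):
    """Fraction of high-contribution entries (|loading| > threshold in
    the first three principal components) that belong to frontier MOs."""
    pcs = np.abs(loadings[:3])
    _, orb_idx = np.nonzero(pcs > threshold)   # orbitals above threshold
    if orb_idx.size == 0:
        return 0.0
    return np.isin(orb_idx, frontier_idx).sum() / orb_idx.size

# Hypothetical loadings; orbitals 3 and 4 stand in for HOMO and LUMO
loadings = np.array([[0.1, 0.05, 0.3, 0.5, 0.4, 0.1, 0.0, 0.0],
                     [0.0, 0.25, 0.1, 0.3, 0.1, 0.0, 0.0, 0.0],
                     [0.0, 0.0,  0.0, 0.1, 0.6, 0.0, 0.0, 0.0]])
frac = frontier_fraction(loadings, frontier_idx=[3, 4])  # 4 of 6 entries
```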

FIG. 9.

Normalized counts of frontier and near-frontier localized molecular orbitals in the lists of orbitals with variance above 0.2 in the three principal components of PC3-MAOC for each compound type in the three investigated datasets.


Principal component analysis of the MAOC representation allows one to examine how the electronic structure governs chemical properties. For example, in line with chemical intuition, predicting orbital energies of the anionic compounds requires information on the occupied molecular orbitals, while properties of cations are instead dominated by the unoccupied MOs (Fig. 9). While these relationships are straightforward when predicting the energy levels of molecular orbitals themselves, as is done in this study, less obvious effects can potentially be uncovered through such an analysis of more complex properties, such as reaction energies, EPR spectral characteristics, and excited state features.

In this work, we present a quantum-inspired molecular representation for chemical machine learning tasks—the matrix of orthogonalized atomic orbital coefficients (MAOC). It encodes not only the structures and compositions of the systems it represents but also their electronic information, i.e., charge and spin multiplicity. MAOC is constructed from the coefficients of the non-optimized (guess) localized molecular orbitals, which in turn are linear combinations of predefined orthogonalized atomic orbitals. In generating the latter, even very small atom-centered basis sets, such as pcseg-0 and STO-3G, afford the accuracy of larger basis sets at a fraction of the computational cost. MAOC is a representation with dual character: local when the core and lone pair orbitals of atoms are constructed, and global when the orthogonalized atomic orbitals are combined to form molecular orbitals, which can be either localized or distributed over a part of, or even an entire, molecule. MAOC is uniquely suited to represent not only molecules but also monatomic (by construction) and periodic (using the meta-Löwdin orthogonalization scheme) systems. Furthermore, MAOC allows learning and predicting the properties of systems with identical compositions and geometries but distinct electronic configurations, such as vertical properties or redox properties of rigid compounds. Finally, we also present a reduced version of MAOC based on the principal component analysis dimensionality reduction technique, PCX-MAOC, which possesses all of the above-mentioned features of full MAOC while being significantly more compact, enabling further analysis of the connection between electronic structure and chemical properties and, in certain cases, leading to higher prediction accuracy.
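To illustrate the orthogonalization step, the following is a minimal sketch of the classic Löwdin symmetric orthogonalization (Ref. 36), the ancestor of the meta-Löwdin scheme employed for periodic systems; the two-orbital overlap matrix is a hypothetical example, not one from this work:

```python
import numpy as np

def lowdin_x(S):
    """Symmetric (Loewdin) orthogonalization: X = S^(-1/2).
    The columns of X express orthonormal orbitals in the original
    non-orthogonal AO basis, so that X.T @ S @ X = I."""
    eigval, eigvec = np.linalg.eigh(S)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# Toy overlap matrix for two non-orthogonal atomic orbitals
# (hypothetical overlap of 0.3):
S = np.array([[1.0, 0.3],
              [0.3, 1.0]])
X = lowdin_x(S)
print(np.allclose(X.T @ S @ X, np.eye(2)))  # True
```

The same transformation applied to a guess density from a minimal basis set underlies the predefined orthogonalized atomic orbitals described above.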

The performance of the kernel ridge regression ML model, in conjunction with MAOC, PC3-MAOC, and several other coordinate- and Hamiltonian-based representations, was tested in predicting the frontier orbital energy levels and ground state single-point energies for an extended QM7b dataset and two new datasets, N-HPC-1 and REDOX. We observed that MAOC and PC3-MAOC outperform geometry-based representations for sets of compounds with identical geometries but distinct electronic configurations and for a chemically diverse set of redox-active compounds, while affording accuracy similar to that of SOAP, SLATM, and SPAHM for a set of N-heteropolycycles. Further improvements in MAOC’s performance are likely achievable with larger datasets (in cases where property values are spread over a broad range) and with other machine learning architectures, such as neural networks.
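The learning setup can be sketched with a generic kernel ridge regression model. The random feature vectors below are hypothetical stand-ins for flattened MAOC matrices, and the Laplacian kernel and hyperparameter values are illustrative assumptions rather than the settings used in this study:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
# Hypothetical stand-ins for flattened, zero-padded MAOC matrices:
X_train = rng.normal(size=(80, 64))
X_test = rng.normal(size=(20, 64))
y_train = X_train[:, :4].sum(axis=1)  # synthetic target "property"

# Kernel ridge regression with a Laplacian kernel
# (illustrative alpha and gamma, normally tuned by cross-validation).
model = KernelRidge(kernel="laplacian", alpha=1e-6, gamma=0.05)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

In practice, the regularization strength `alpha` and kernel width `gamma` would be optimized per dataset and representation before comparing learning curves.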

The supplementary material contains additional computational and data analysis details, additional data, and a full set of learning curves.

The authors gratefully acknowledge support from the Klaus Tschira Foundation, funding from the German Research Foundation (DFG) through the collaborative research center No. SFB1249 (Project No. 281029004-SFB 1249, Project C10), and funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 101042290 PATTERNCHEM), as well as access to computational resources from the state of Baden-Württemberg through bwHPC (bwForCluster JUSTUS2). The authors also thank Ksenia Briling and Professor Clemence Corminboeuf for their help with the SPAHM representation, Professor Olexandr Isayev and Professor Michele Ceriotti for fruitful discussions, and Dr. John Lindner for proofreading the manuscript.

The authors declare no conflicts to disclose.

Stiv Llenga: Conceptualization (equal); Formal analysis (lead); Investigation (lead); Methodology (lead); Visualization (equal); Writing – original draft (lead). Ganna Gryn’ova: Conceptualization (equal); Formal analysis (supporting); Funding acquisition (lead); Supervision (lead); Visualization (equal); Writing – review & editing (lead).

The data that support the findings of this study are available within the article and its supplementary material, as well as from GitHub at https://github.com/hits-ccc/MAOC/tree/main/Datasets.

1. M. Meuwly, “Machine learning for chemical reactions,” Chem. Rev. 121, 10218 (2021).
2. R. Pollice, G. dos Passos Gomes, M. Aldeghi, R. J. Hickman, M. Krenn, C. Lavigne, M. Lindner-D’Addario, A. Nigam, C. T. Ser, Z. Yao, and A. Aspuru-Guzik, “Data-driven strategies for accelerated materials design,” Acc. Chem. Res. 54, 849 (2021).
3. J. A. Keith, V. Vassilev-Galindo, B. Cheng, S. Chmiela, M. Gastegger, K.-R. Müller, and A. Tkatchenko, “Combining machine learning and computational chemistry for predictive insights into chemical systems,” Chem. Rev. 121, 9816 (2021).
4. A. C. Mater and M. L. Coote, “Deep learning in chemistry,” J. Chem. Inf. Model. 59, 2545 (2019).
5. M. Rupp, “Machine learning for quantum mechanics in a nutshell,” Int. J. Quantum Chem. 115, 1058 (2015).
6. F. Noé, A. Tkatchenko, K.-R. Müller, and C. Clementi, “Machine learning for molecular simulation,” Annu. Rev. Phys. Chem. 71, 361 (2020).
7. K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh, “Machine learning for molecular and materials science,” Nature 559, 547 (2018).
8. A. Glielmo, B. E. Husic, A. Rodriguez, C. Clementi, F. Noé, and A. Laio, “Unsupervised learning methods for molecular simulation data,” Chem. Rev. 121, 9722 (2021).
9. F. Musil, A. Grisafi, A. P. Bartók, C. Ortner, G. Csányi, and M. Ceriotti, “Physics-inspired structural representations for molecules and materials,” Chem. Rev. 121, 9759 (2021).
10. D. S. Wigh, J. M. Goodman, and A. A. Lapkin, “A review of molecular representation in the age of machine learning,” Wiley Interdiscip. Rev.: Comput. Mol. Sci. 12, e1603 (2022).
11. K. Atz, F. Grisoni, and G. Schneider, “Geometric deep learning on molecular representations,” Nat. Mach. Intell. 3, 1023 (2021).
12. H. Moriwaki, Y.-S. Tian, N. Kawashita, and T. Takagi, “Mordred: A molecular descriptor calculator,” J. Cheminform. 10, 4 (2018).
13. M. Floris, A. Manganaro, O. Nicolotti, R. Medda, G. F. Mangiatordi, and E. Benfenati, “A generalizable definition of chemical similarity for read-across,” J. Cheminform. 6, 39 (2014).
14. D. T. Stanton, S. Dimitrov, V. Grancharov, and O. G. Mekenyan, “Charged partial surface area (CPSA) descriptors QSAR applications,” SAR QSAR Environ. Res. 13, 341 (2002).
15. J. Kujawski, H. Popielarska, A. Myka, B. Drabińska, and M. K. Bernard, “The log P parameter as a molecular descriptor in the computer-aided drug design—an overview,” Comput. Methods Sci. Tech. 18, 81 (2012).
16. L. David, A. Thakkar, R. Mercado, and O. Engkvist, “Molecular representations in AI-driven drug discovery: A review and practical guide,” J. Cheminform. 12, 56 (2020).
17. J. Townsend, C. P. Micucci, J. H. Hymel, V. Maroulas, and K. D. Vogiatzis, “Representation of molecular structures with persistent homology for machine learning applications in chemistry,” Nat. Commun. 11, 3230 (2020).
18. M. R. Wilkinson, U. Martinez-Hernandez, C. C. Wilson, and B. Castro-Dominguez, “Images of chemical structures as molecular representations for deep learning,” J. Mater. Res. 37, 2293 (2022).
19. D. Weininger, “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules,” J. Chem. Inf. Comput. Sci. 28, 31 (1988).
20. M. Krenn, Q. Ai, S. Barthel, N. Carson, A. Frei, N. C. Frey, P. Friederich, T. Gaudin, A. A. Gayle, K. M. Jablonka, R. F. Lameiro, D. Lemm, A. Lo, S. M. Moosavi, J. M. Nápoles-Duarte, A. Nigam, R. Pollice, K. Rajan, U. Schatzschneider, P. Schwaller, M. Skreta, B. Smit, F. Strieth-Kalthoff, C. Sun, G. Tom, G. Falk von Rudorff, A. Wang, A. D. White, A. Young, R. Yu, and A. Aspuru-Guzik, “SELFIES and the future of molecular string representations,” Patterns 3, 100588 (2022).
21. A. P. Bartók, R. Kondor, and G. Csányi, “On representing chemical environments,” Phys. Rev. B 87, 184115 (2013).
22. R. Drautz, “Atomic cluster expansion for accurate and transferable interatomic potentials,” Phys. Rev. B 99, 014104 (2019).
23. B. Huang and O. A. von Lilienfeld, “Quantum machine learning using atom-in-molecule-based fragments selected on the fly,” Nat. Chem. 12, 945 (2020).
24. F. A. Faber, A. S. Christensen, and O. A. von Lilienfeld, “Quantum machine learning with response operators in chemical compound space,” in Machine Learning Meets Quantum Physics, Lecture Notes in Physics, Vol. 968, edited by K. Schütt, S. Chmiela, O. A. von Lilienfeld, A. Tkatchenko, K. Tsuda, and K.-R. Müller (Springer, Cham, 2020).
25. A. S. Christensen, L. A. Bratholm, F. A. Faber, and O. A. von Lilienfeld, “FCHL revisited: Faster and more accurate quantum machine learning,” J. Chem. Phys. 152, 044107 (2020).
26. H. Huo and M. Rupp, “Unified representation of molecules and crystals for machine learning,” Mach. Learn.: Sci. Technol. 3, 045017 (2022).
27. M. Welborn, L. Cheng, and T. F. Miller III, “Transferability in machine learning for electronic structure via the molecular orbital basis,” J. Chem. Theory Comput. 14, 4772 (2018).
28. K. Karandashev and O. A. von Lilienfeld, “An orbital-based representation for accurate quantum machine learning,” J. Chem. Phys. 156, 114101 (2022).
29. A. Fabrizio, K. R. Briling, and C. Corminboeuf, “SPAHM: The spectrum of approximated Hamiltonian matrices representations,” Digital Discovery 1, 286 (2022).
30. J. Pipek and P. G. Mezey, “A fast intrinsic localization procedure applicable for ab initio and semiempirical linear combination of atomic orbital wave functions,” J. Chem. Phys. 90, 4916 (1989).
31. S. Lehtola and H. Jónsson, “Pipek-Mezey orbital localization using various partial charge estimates,” J. Chem. Theory Comput. 10, 642 (2014).
32. J. M. Foster and S. F. Boys, “Canonical configurational interaction procedure,” Rev. Mod. Phys. 32, 300 (1960).
33. C. Edmiston and K. Ruedenberg, “Localized atomic and molecular orbitals,” Rev. Mod. Phys. 35, 457 (1963).
34. A. Heßelmann, “Local molecular orbitals from a projection onto localized centers,” J. Chem. Theory Comput. 12, 2720 (2016).
35. D. Maynau, S. Evangelisti, N. Guihéry, C. J. Calzado, and J.-P. Malrieu, “Direct generation of local orbitals for multireference treatment and subsequent uses for the calculation of the correlation energy,” J. Chem. Phys. 116, 10060 (2002).
36. P. O. Löwdin, “On the non-orthogonality problem connected with the use of atomic wave functions in the theory of molecules and crystals,” J. Chem. Phys. 18, 365 (1950).
37. Q. Sun and G. K.-L. Chan, “Exact and optimal quantum mechanics/molecular mechanics boundaries,” J. Chem. Theory Comput. 10, 3784 (2014).
38. L. C. Blum and J.-L. Reymond, “970 million druglike small molecules for virtual screening in the chemical universe database GDB-13,” J. Am. Chem. Soc. 131, 8732 (2009).
39. G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, “Machine learning of molecular electronic properties in chemical compound space,” New J. Phys. 15, 095003 (2013).
40. The N-HPC-1 dataset is freely available from https://github.com/hits-ccc/MAOC/tree/main/Datasets/NHPC1. This resource contains the full set of structures and their computed properties; only a portion of it is used and discussed in this work.
41. C. L. Donley, J. Zaumseil, J. W. Andreasen, M. M. Nielsen, H. Sirringhaus, R. H. Friend, and J.-S. Kim, “Effects of packing structure on the optoelectronic and charge transport properties in poly(9,9-di-N-octylfluorene-alt-benzothiadiazole),” J. Am. Chem. Soc. 127, 12890 (2005).
42. G. D. McManus, J. M. Rawson, N. Feeder, F. Palacio, and P. Oliete, “Structure and magnetic properties of a sulfur-nitrogen radical, methylbenzodithiazolyl,” J. Mater. Chem. 10, 2001 (2000).
43. Q. Zhang, J. Li, K. Shizu, S. Huang, S. Hirata, H. Miyazaki, and C. Adachi, “Design of efficient thermally activated delayed fluorescence materials for pure blue organic light emitting diodes,” J. Am. Chem. Soc. 134, 14706 (2012).
44. The REDOX dataset is freely available from https://github.com/hits-ccc/MAOC/tree/main/Datasets/REDOX.
45. Y. Lu, Q. Zhang, L. Li, Z. Niu, and J. Chen, “Design strategies toward enhancing the performance of organic electrode materials in metal-ion batteries,” Chem 4, 2786 (2018).
46. F. Neese, F. Wennmohs, U. Becker, and C. Riplinger, “The ORCA quantum chemistry program package,” J. Chem. Phys. 152, 224108 (2020).
47. A. S. Christensen, F. A. Faber, B. Huang, L. A. Bratholm, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, “QML: A Python toolkit for quantum machine learning,” 2017, https://github.com/qmlcode/qml.
48. L. Himanen, M. O. J. Jäger, E. V. Morooka, F. Federici Canova, Y. S. Ranawat, D. Z. Gao, P. Rinke, and A. S. Foster, “DScribe: Library of descriptors for machine learning in materials science,” Comput. Phys. Commun. 247, 106949 (2020).
49. Q. Sun, T. C. Berkelbach, N. S. Blunt, G. H. Booth, S. Guo, Z. Li, J. Liu, J. D. McClain, E. R. Sayfutyarova, S. Sharma, S. Wouters, and G. K.-L. Chan, “PySCF: The Python-based simulations of chemistry framework,” Wiley Interdiscip. Rev.: Comput. Mol. Sci. 8, e1340 (2018).
50. P. O. Widmark, P. Å. Malmqvist, and B. O. Roos, “Density matrix averaged atomic natural orbital (ANO) basis sets for correlated molecular wave functions,” Theoret. Chim. Acta 77, 291 (1990).
51. Z.-H. Liu, T.-T. Shi, and Z.-X. Chen, “Machine learning prediction of monatomic adsorption energies with non-first-principles calculated quantities,” Chem. Phys. Lett. 755, 137772 (2020).
52. G. D. Sproul, “Evaluation of electronegativity scales,” ACS Omega 5, 11585 (2020).
53. M. V. Putz, N. Russo, and E. Sicilia, “Atomic radii scale and related size properties from density functional electronegativity formulation,” J. Phys. Chem. A 107, 5461 (2003).
