Matrix of Orthogonalized Atomic Orbital Coefficients Representation for Radicals and Ions

Chemical (molecular, quantum) machine learning relies on representing molecules in unique and informative ways. Here, we present the matrix of orthogonalized atomic orbital coefficients (MAOC) as a quantum-inspired molecular and atomic representation containing both structural (composition and geometry) and electronic (charge and spin multiplicity) information. MAOC is based on a cost-effective localization scheme that represents localized orbitals via a predefined set of atomic orbitals. The latter can be constructed from such small atom-centered basis sets as pcseg-0 and STO-3G in conjunction with a guess (non-optimized) electronic configuration of the molecule. Importantly, MAOC is suitable for representing monatomic, molecular, and periodic systems, and can distinguish compounds with identical compositions and geometries but distinct charges and spin multiplicities. Using principal component analysis, we constructed a more compact but equally powerful version of MAOC, PCX-MAOC. To test the performance of full and reduced MAOC and several other representations (CM, SOAP, SLATM, and SPAHM), we used a kernel ridge regression machine learning model to predict frontier molecular orbital energy levels and ground state single-point energies for chemically diverse neutral and charged, closed- and open-shell molecules from an extended QM7b dataset, as well as two new datasets, N-HPC-1 (N-heteropolycycles) and REDOX (nitroxyl and phenoxyl radicals, carbonyl and cyano compounds). MAOC affords accuracy that is either similar or superior to that of other representations for a range of chemical properties and systems.

I. INTRODUCTION
These representations10,11 can be based on simple molecular properties (such as molecular weight and van der Waals volume,12 number of heteroatoms,13 partial charges,14 lipophilicity,15 etc.; also known as fingerprints or descriptors), molecular coordinates, graphs,16 topologies,17 or images,18 and can take the form of a string (e.g., SMILES19 and SELFIES20), a vector, or a matrix. Among them, the physics-inspired representations9 arguably aim to encode molecular geometries and compositions most comprehensively. Some of these representations are built from atom-centered continuous basis functions, such as the smooth overlap of atomic positions (SOAP)21 and the atomic cluster expansion (ACE).22 Others use potentials, such as the Coulomb Matrix (CM),9 the spectrum of London and Axilrod-Teller-Muto (SLATM) potential,23 and the Faber-Christensen-Huang-von Lilienfeld representations (FCHL18,19),24,25 or transform input structural data into an internal coordinate system, as in the many-body tensor representation (MBTR).26
However, these representations consider only the nuclear charges and the positions of the nuclei in space, assuming charge neutrality. As a result, they violate the injectivity requirement of machine learning representations, i.e., they are unable to distinguish between compounds with distinct electronic configurations but identical atomic compositions and geometries (Figure 1). To satisfy the injectivity condition, several so-called quantum-inspired representations were developed by including additional electronic structure information. Molecular orbital basis machine learning (MOB-ML)27 and the F (Fock), J (Coulomb), and K (exchange) matrices (FJK) representation28 use post-Hartree-Fock molecular orbital properties, i.e., they require an a priori ab initio computation and thus operate in a ∆-ML fashion. In contrast, the spectrum of approximated Hamiltonian matrices (SPAHM)29 from Corminboeuf and colleagues employs a guess (i.e., pre-Hartree-Fock) electronic Hamiltonian, making it a simpler and quicker way to encode not only the composition and geometry, but also the charge, spin multiplicity, and electronic state of a molecule. In this work, we present the matrix of orthogonalized atomic orbital coefficients (MAOC), a quantum-inspired representation which uses meta-Löwdin orthogonalized atomic orbitals to generate localized molecular orbital coefficients. MAOC is an atomic and molecular representation that can describe charged and open-shell compounds and distinguish molecules with identical structures but distinct electronic states. We tested the performance of MAOC in comparison with other coordinate- and Hamiltonian-based representations using a kernel ridge regression machine learning model for predicting orbital and single-point energies for a broad range of conventional and redox-active species from known and newly constructed datasets.

A. MAOC
In Hartree-Fock theory, molecular orbitals are obtained by diagonalizing the Fock matrix to satisfy Brillouin's theorem,

⟨Ψ₀|Ĥ|Ψᵢᵃ⟩ = 0,

where Ψ₀ is the self-consistent optimized Hartree-Fock wavefunction of the ground state, Ĥ is the Hamiltonian operator, and Ψᵢᵃ is a singly excited determinant. The occupied-occupied and virtual-virtual blocks of the Fock matrix are diagonalized to obtain canonical molecular orbitals, which are delocalized over the molecule. Occupied molecular orbitals can be localized by optimizing a cost function that measures their locality, as is done in the Pipek-Mezey (maximizing population charges on atoms),30,31 Boys-Foster (maximizing the sum of squared distances between orbital centroids),32 and Edmiston-Ruedenberg (maximizing the Coulomb self-repulsion)33 localization schemes.
Alternatively, both the occupied and virtual orbitals can be localized by projecting them onto a set of predefined (atomic) orbitals,34 and a localization scheme of this type35 is adopted in this work. The matrix of orthogonalized molecular orbitals is based on the coefficients of non-optimized (guess) localized molecular orbitals (L-MOs), generated as a linear combination of predefined orthogonalized atomic orbitals (o-AOs). Each block of atomic orbitals (core, valence, and virtual) is independently orthogonalized in MAOC using the meta-Löwdin scheme, which has the added benefit of being applicable to periodic systems (see Figure S2 in the Supplementary Material).36,37 In the case of core electrons and lone pairs, localized orbitals are atom-centered, whereas in the case of bonds and π-conjugated systems, localized orbitals are dispersed throughout sets of molecular fragments or over an entire molecule. Further details on the assignment of charge and spin and the sorting of the MAOC square matrix are given in Section I in the Supplementary Material.
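The block-wise orthogonalization idea can be sketched in plain NumPy. This is an illustration on a synthetic overlap matrix only, not the PySCF meta-Löwdin implementation used in this work (which additionally projects valence and virtual shells onto reference atomic orbitals):

```python
import numpy as np

def lowdin(S):
    """Symmetric (Loewdin) orthogonalization: returns X such that X.T @ S @ X = I."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def blockwise_lowdin(S, blocks):
    """Orthogonalize each AO block (e.g., core, valence, virtual) independently,
    in the spirit of the meta-Loewdin block-by-block treatment."""
    X = np.zeros_like(S)
    for idx in blocks:
        sub = np.ix_(idx, idx)
        X[sub] = lowdin(S[sub])
    return X

# Synthetic 4x4 AO overlap matrix: a "core" block (0, 1) and a "valence" block (2, 3).
S = np.array([[1.0, 0.2, 0.0, 0.0],
              [0.2, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.4],
              [0.0, 0.0, 0.4, 1.0]])
X = blockwise_lowdin(S, [[0, 1], [2, 3]])

# Because the inter-block overlaps are zero here, the transformed overlap is the identity.
assert np.allclose(X.T @ S @ X, np.eye(4))
```

In the real scheme the inter-block overlaps are non-zero and are handled by projection before the symmetric orthogonalization; the sketch only shows the within-block step.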
Because the size of MAOC is related to the number of orbitals used to represent an atom, the representation size grows quickly with compound size. To address this issue, a dimensionality reduction technique, principal component analysis (PCA), can be used to reduce the size of MAOC, making it more compact as well as easier to generate and use in machine learning tasks. This version of MAOC is denoted as PCX-MAOC, where X is the number of principal components.
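The reduction step can be sketched as follows, using a random matrix as a stand-in for one molecule's unflattened M×M MAOC (the dimensions and data here are illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for one molecule's unflattened MAOC: an M x M coefficient matrix.
M, n_components = 20, 3
maoc = rng.normal(size=(M, M))

# PCX-MAOC: compress the M x M matrix to M x X with PCA, then flatten
# the reduced matrix into the vector fed to the ML model.
pca = PCA(n_components=n_components, svd_solver="auto")
pcx_maoc = pca.fit_transform(maoc)   # shape (M, X)
flat = pcx_maoc.flatten()            # length M * X instead of M * M

assert pcx_maoc.shape == (M, n_components)
assert flat.size == M * n_components
```

For X = 3 this yields the PC3-MAOC variant used throughout the paper, shrinking the per-molecule feature count from M² to 3M.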

B. Datasets
Three sets of non-overlapping chemical data were used to test MAOC in supervised machine learning. The QM7b dataset38,39 was selected since it is among the most widely used datasets for machine learning in chemistry. It consists of 7,211 relatively small neutral molecules with no conjugated bonds within cyclic moieties. To evaluate the performance of MAOC for radicals and ions, geometries of the anionic, cationic, and dicationic forms of the compounds in the dataset were optimized and their various vertical and adiabatic properties were computed in this work; strongly spin-contaminated species were eliminated from the dataset (see Computational procedures below).
In total, 7,197 geometry-optimized anion radicals, 6,999 geometry-optimized cation radicals, and 7,198 geometry-optimized dications, as well as 7,208 anion radicals and 7,208 cation radicals in the geometry of the parent neutral molecule, were thus added to the original QM7b dataset of neutral molecules. We denote this extended dataset as QM7bX.
The N-HPC-1 dataset, constructed in this work,40 consists of 3,735 nitrogen-doped polycyclic compounds that were generated by introducing up to 11 nitrogen atoms into prototypical polyaromatic skeletons (Figure 2). For each generated compound, canonical SMILES were produced, and any duplicates were removed from the dataset. To generate the respective open-shell compounds, one, two, or three electrons were removed, or one electron was added, to the neutral molecules to produce eight groups within the dataset: neutral singlets, neutral triplets, anionic doublets, cationic doublets, dicationic singlets, dicationic triplets, tricationic doublets, and tricationic quartets. Stable open-shell N-heteropolycycles display a range of practically useful electronic,41 magnetic,42 and optical43 properties, whose facile prediction with a tailor-made representation could provide valuable molecular design guidelines. The REDOX dataset, also constructed for this study,44 comprises redox-active molecules: nitroxyl and phenoxyl radicals, as well as carbonyl and cyano compounds (Figure 2). The QM7bX, N-HPC-1, and REDOX datasets can be freely obtained from GitHub (https://github.com/hits-ccc/MAOC/tree/main/Datasets).

C. Computational procedures
Geometries of all systems in the N-HPC-1 and REDOX datasets and of all charged systems in the QM7bX dataset were optimized at the PBE0-D3/def2-TZVP level of theory using the ORCA 5.0 package.46 Geometries of neutral compounds in the QM7b dataset were used to compute vertical properties at the same level of theory. In all computations, wavefunction stability checks were performed. For all open-shell compounds, the expectation value of the spin-squared operator, ⟨Ŝ²⟩, was assessed, and species deviating by more than 10% from the ideal value were excluded (see Section II in the Supplementary Material). Single-point energies (SPEs), the energies of the highest, lowest, and singly occupied molecular orbitals (HOMOs, LUMOs, and SOMOs, respectively), as well as HOMO-LUMO gaps for the closed-shell and SOMO-LUMO gaps for the open-shell species were also computed at the PBE0-D3/def2-TZVP level of theory.
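The spin-contamination filter can be illustrated with a short sketch. The exact threshold handling is our reading of the criterion (exclude species whose ⟨Ŝ²⟩ deviates by more than 10% from the ideal s(s+1) value); the numeric examples are hypothetical:

```python
def spin_contaminated(s2_computed, multiplicity, tol=0.10):
    """Return True if the computed <S^2> deviates from the ideal
    s(s+1) value by more than the relative tolerance (here 10%)."""
    s = (multiplicity - 1) / 2.0       # total spin quantum number
    s2_ideal = s * (s + 1)
    return abs(s2_computed - s2_ideal) > tol * s2_ideal

# Doublet (multiplicity 2): ideal <S^2> = 0.75.
assert not spin_contaminated(0.76, 2)  # small deviation -> kept
assert spin_contaminated(1.05, 2)      # heavy contamination -> excluded

# Triplet (multiplicity 3): ideal <S^2> = 2.0.
assert not spin_contaminated(2.05, 3)
```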

D. Molecular representations
Global SLATM representations23 were generated using the QML code47 with the variables set to sigmas = [0.05, 0.05], dgrids = [0.03, 0.03], rcut = 4.8, and rpower = 6. Coulomb Matrix representations9 were generated using the DScribe package48 and sorted using the L2 norm to ensure permutation invariance. The SOAP representations21 were generated using DScribe48 with the cutoff for the local regions (rcut) set to 6 Å, the number of radial basis functions set to 8, the maximum degree of spherical harmonics set to 100, and spherical Gaussian-type orbitals chosen as the type of radial basis functions. The SPAHM representations were generated using the code provided in Ref. 29 with the LB guess and the MINAO basis (see also Section VII in the Supplementary Material).
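As an aside, the L2-sorted Coulomb Matrix can be written down in a few lines of NumPy. This illustrative version implements the standard CM definition with row/column sorting by descending L2 norm and zero padding; it is not the DScribe implementation itself, and the water coordinates are purely illustrative:

```python
import numpy as np

def coulomb_matrix(Z, R, n_max):
    """Coulomb Matrix with rows/columns sorted by descending L2 norm
    (for permutation invariance), zero-padded to n_max atoms."""
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, i] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    order = np.argsort(-np.linalg.norm(M, axis=1))
    M = M[np.ix_(order, order)]
    out = np.zeros((n_max, n_max))
    out[:n, :n] = M
    return out

# Water-like toy geometry (the CM convention uses atomic units; these
# values are for illustration only).
Z = np.array([8, 1, 1])
R = np.array([[0.00, 0.00, 0.0],
              [0.96, 0.00, 0.0],
              [-0.24, 0.93, 0.0]])
cm = coulomb_matrix(Z, R, n_max=5)

assert cm.shape == (5, 5)
assert np.allclose(cm, cm.T)         # symmetric
assert np.isclose(cm[0, 0], 0.5 * 8 ** 2.4)  # oxygen row sorted first
```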
The MAOC representations were generated using the localized orbital (lo) module of the Python-based Simulations of Chemistry Framework (PySCF) package.49 Unless specified otherwise, MAOC representations were generated using the pcseg-0 basis set and ANO50 as the reference basis. In this work, a sorted, flattened version of MAOC is used. The PCX-MAOC representation is created by reducing the size of an unflattened M×M MAOC array to M×X (where X < M) using the PCA dimensionality reduction technique. In this work, the number of principal components, X, is set to 3 (see Section III in the Supplementary Material), and the representation is denoted as PC3-MAOC.
A DELL XPS 15 9510 laptop with an 11th Gen Intel(R) Core(TM) i7-11800H processor @ 2.30 GHz, a 64-bit operating system (x64-based processor), and 32.0 GB of installed RAM was used to generate all representations. The CPU timings were measured using the datetime built-in Python library in a Jupyter notebook. The storage space requirements were evaluated as the amount of space needed to store the NumPy ndarrays.

E. Machine learning and data analysis
A kernel ridge regression (KRR) model was used to test MAOC and other representations in predicting the properties of compounds in the QM7bX, N-HPC-1, and REDOX datasets. All learning curves were generated using KRR with a Laplacian kernel, and the sigma and lambda hyperparameters were optimized for each representation and training set size using two-fold cross-validation splitting. The sigma values examined ranged from 1 to 8,000, with a grid step size between 100 and 500, depending on the size of the dataset under consideration. The lambda values were chosen between 1 and 10⁻¹⁰ (i.e., 1, 10⁻³, 10⁻⁵, 10⁻⁶, 10⁻⁷, 10⁻⁸, and 10⁻¹⁰). The Laplacian kernel was chosen after evaluating its performance vs. the Gaussian kernel (see Section IV in the Supplementary Material). Unless specified otherwise, the train-to-test ratio for all properties under investigation is 80/20, and all results in this study are the mean of five measurements using random shuffling of the data. Cholesky decomposition (in the QML code47) is used to solve the KRR equation α = (K + λI)⁻¹y. Principal component analysis (PCA) was conducted using the sklearn Python library and an "auto" SVD solver. In this work, the number of components was set to either two (when mapping the chemical space) or three (when constructing PC3-MAOC), and the explained variance ratio is reported next to the axes of the PCA plots. t-distributed Stochastic Neighbor Embedding (t-SNE) was performed using the sklearn Python library; the number of components was set to two and all other settings were left as defaults.
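The core of this workflow, Laplacian-kernel KRR solved by Cholesky decomposition, can be sketched as follows. This is a minimal NumPy illustration on synthetic data, not the QML implementation:

```python
import numpy as np

def laplacian_kernel(A, B, sigma):
    """K_ij = exp(-||a_i - b_j||_1 / sigma)."""
    d = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1)
    return np.exp(-d / sigma)

def krr_fit(X, y, sigma, lam):
    """Solve (K + lambda * I) alpha = y via Cholesky decomposition."""
    K = laplacian_kernel(X, X, sigma)
    L = np.linalg.cholesky(K + lam * np.eye(len(y)))
    return np.linalg.solve(L.T, np.linalg.solve(L, y))

def krr_predict(X_train, X_new, alpha, sigma):
    return laplacian_kernel(X_new, X_train, sigma) @ alpha

# Synthetic demonstration: with a tiny lambda, KRR interpolates the training targets.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
y = X.sum(axis=1)
alpha = krr_fit(X, y, sigma=5.0, lam=1e-8)
pred = krr_predict(X, X, alpha, sigma=5.0)
assert np.allclose(pred, y, atol=1e-4)
```

In the actual study, sigma and lambda were selected per representation and training-set size by the two-fold cross-validation described above.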

A. Basis set benchmarking
Since MAOC uses orthogonalized atomic orbitals to construct the guess localized molecular orbitals, the basis set used to generate these atomic orbitals affects the matrix generation time, the training time for machine learning models, the amount of space needed to store the representation, and the quality of the property predictions. To identify the optimal choice, we tested several basis sets of varying sizes and flexibilities (Figure 3a). Generating MAOC for the N-HPC-1 dataset with the Karlsruhe (def2-SVP) or Dunning (cc-pVDZ) basis sets requires significantly more time and storage space than with the Pople, Jensen (pc-0 and pcseg-0), and LANL2DZ basis sets. STO-3G produces the smallest representation files in a relatively short time. In comparison to other representations (Table 1), generation of MAOC requires the longest time; however, even for the largest studied dataset, QM7bX, this amounts to only ca. 2 CPU-hours. The learning curves in Figure 3b illustrate the accuracy of the KRR machine learning model for the prediction of HOMO- and SOMO-LUMO gaps for a random sample of 4,000 compounds from the N-HPC-1 dataset using MAOC built from various basis sets, in terms of mean absolute error (MAE). The train-to-test ratio is 70/30 and the sigma values for each set of training data are optimized.
Surprisingly, the highest accuracy is achieved with relatively small basis sets (3-21G, 6-31G, and pcseg-0). However, results with all tested basis sets ultimately fall within a narrow range of 0.31-0.35 eV in MAE. We note that adding diffuse and polarization functions to the basis sets increases the computational cost but does not affect (polarization functions) or even reduces (diffuse functions) the accuracy of ML predictions, and is therefore not advised (see Figure 3a and Sections V.A and V.B in the Supplementary Material). The reason behind the counterintuitively superior performance of smaller basis sets is that MAOC constructed from larger basis sets includes many more virtual orbitals, which, as shown in Section E, do not carry significant weight in machine learning of the studied properties. While this holds for the organic molecules in this work, it might not always be the case; for example, polarization functions are likely important (and must thus be included when constructing MAOC) for molecules containing heavy metals. Finally, using principal component analysis, we evaluated how well the "atomic" space of chemical elements across the periodic table is mapped with MAOC depending on the basis set (see Section V.C in the Supplementary Material). Ultimately, pcseg-0 was chosen as the optimal basis set for constructing MAOC representations in this work, but a similarly good ratio of computational cost to accuracy can be achieved with 3-21G and 6-31G.

B. MAOC as an atomic representation
By construction from orthogonalized atomic orbitals, MAOC can serve not only as a molecular but also as an atomic representation, differentiating single-atom systems and enabling prediction of their properties. This contrasts with many conventional molecular representations, which either rely on many-atom environments or, for a single atom, produce a representation unsuitably small for machine learning. Prediction of monatomic properties has traditionally employed such descriptors as electronegativity and atomic radii,51 which, however, are not uniquely defined.52,53 Instead, MAOC offers a rigorously defined method to encode single atoms across the entire periodic table for machine learning tasks. With MAOC, the "atomic" space is well resolved, and the proximity between two atoms reflects the similarity of their electronic (orbital) configurations (Figure 4).

C. Mapping chemical space with MAOC
Coordinate-based representations, for example Coulomb Matrix, SOAP, and SLATM, distinguish compounds based on their compositions and geometries, while the quantum-inspired representations SPAHM and MAOC also account for electronic structure features. As a result, only the latter can resolve the chemical space of molecules with similar geometries but nonidentical charges and spins (Figure 5a). This is further illustrated in the t-SNE plot of the chemical space of anionic and cationic compounds in the geometries of the parent neutral species from the QM7b dataset (Figure 5b).

D. Supervised machine learning
In this work, we focus on the fundamental properties of charged and open-shell compounds: the energies of the frontier molecular orbitals, the energy gaps between the SOMO or HOMO and the LUMO, and the ground state single-point electronic energies. We first discuss the results for the commonly available QM7b dataset,38,39 which consists of small neutral organic compounds composed of elements from the 1st-3rd rows of the periodic table and mostly lacking conjugated π-bonds in cyclic moieties. To create the extended QM7bX dataset, anion radicals, cation radicals, and dications were considered for each parent molecule in QM7b; for the open-shell compounds, both vertical and adiabatic properties were evaluated. Overall, the best predictive performance of kernel ridge regression in terms of mean absolute errors (MAEs) and root-mean-square errors (RMSEs) for the orbital energies is achieved with the SLATM, SOAP, and SPAHM representations, followed by MAOC, PC3-MAOC, and Coulomb Matrix (see Table 2 and Figure 6a,c). Compared to geometry-based representations, their quantum-inspired counterparts perform worse in predicting the single-point energies (Figure 6b) but are generally better in predicting the orbital energies of the charged open-shell compounds. For example, SPAHM and PC3-MAOC are the best representations for machine-learning the SOMO-LUMO gaps of anion radicals (Figure 6d) and cation radicals (Figure 6e), respectively. We have also combined the data for non-optimized (vertical) anion and cation radicals into a single subset to test the representations on systems with identical geometries but distinct electronic states. In this case, only the quantum-inspired representations, SPAHM and MAOC, are able to learn from the data, with SPAHM delivering the lowest MAEs (Figure 6f). In Table 2, values are colored from green to orange according to the ranges <0.2, <0.4, <0.6, and <0.8 for orbital energies and gaps, and <2, <4, <6, and <8 for single-point energies; superscript vert denotes ion radicals in the geometries of the parent neutral compounds. We next turn to the N-HPC-1 dataset (Table 3 and Figure 7).
For the N-HPC-1 dataset, performance varies: quantum-inspired representations offer the best accuracy in predicting the single-point energies, while SOAP and SLATM generally produce better orbital energies. However, in several cases, such as singlet dications and doublet trications, PC3-MAOC outperforms the other representations, including the full MAOC (see Section III.E). Finally, the performance of MAOC for the N-HPC-1 dataset, containing π-conjugated compounds, is improved (i.e., is associated with lower MAEs) relative to QM7bX, which is comprised mostly of saturated systems. To explain this, we turn to the way MAOC is constructed. The localized orbitals of conjugated π-bonds are no longer atom-centered, but rather are distributed over multiple atoms and account for the long-range interactions between them. This underlies the global character of MAOC. Next, we assess the performance of kernel ridge regression with various representations for the REDOX dataset44 (Table 4 and Figure 8), which includes redox-active molecules from diverse chemical classes and with a range of functional groups (Figure 2). The errors in property predictions for this dataset are markedly higher than for QM7bX and N-HPC-1 due to the greater chemical diversity within, and non-combinatorial generation of, the REDOX dataset. Each class of redox-active species is represented by approximately 1,000 compounds formed from a variety of structural motifs and functional group combinations, resulting in a much broader range of values for a given property compared to the QM7bX and N-HPC-1 datasets (see Section VIII in the Supplementary Material) and limiting the model's ability to "learn". Nonetheless, the quantum-inspired representations (SPAHM, MAOC, and PC3-MAOC), as well as the geometry-based SLATM, outperform CM and SOAP in all cases. SPAHM predicts significantly better single-point energies, while MAOC and PC3-MAOC afford the best accuracy for frontier orbital energies. Functional groups, present in REDOX but absent in the N-HPC-1 dataset, enhance the global character of MAOC through resonance, which results in molecular orbitals that are centered not on a single bond but distributed over at least three atoms. Finally, to further test the generalizability of MAOC, we applied it to predict the HOMO-LUMO gaps in a dataset containing in total 5,000 compounds randomly selected from the QM7b, N-HPC-1, and REDOX datasets. We find that the performance of MAOC and PC3-MAOC for this "mixed" dataset averages between the corresponding errors for the individual datasets (see Section VI in the Supplementary Material). This further demonstrates that MAOC is a suitable representation for compositionally, structurally, and electronically diverse sets of chemical compounds.

E. Principal components of MAOC
For the N-HPC-1 dataset of N-heteropolycycles, the PCA-reduced version of MAOC, PC3-MAOC, outperforms the full representation in predicting all considered properties. In search of a reason behind this puzzling behavior, we analyzed the orbital contributions to the three principal components of PC3-MAOC. Specifically, for all species in a given subset, the localized molecular orbitals contributing the most to each principal component (above 0.2 variance) are listed. We then count how many times an orbital of interest, e.g., HOMO-2, HOMO-1, etc., occurs in this list. Finally, this occurrence count is normalized to the total number of L-MOs in the list. For example, the six frontier and near-frontier localized MOs, from HOMO-2 to LUMO+2, constitute ca. 11% of all orbitals with variance above 0.2 in PC1-3 of MAOC for the parent neutral compounds from the QM7b dataset (Figure 9). Among the three studied datasets, N-HPC-1 stands out in that its MAOC principal component lists feature the HOMO and LUMO much less often than the respective higher- and lower-lying orbitals (HOMO-2, HOMO-1, LUMO+1, and LUMO+2). In contrast, all six frontier and near-frontier MOs occur with similar frequency in the >0.2 variance lists of the QM7bX and REDOX datasets. An in-depth quantitative analysis of these contributions to the molecular representations and their effects on the quality of machine learning predictions is planned for a future study. Principal component analysis of the MAOC representation allows elaborating how the electronic structure governs the chemical properties. For example, in line with chemical intuition, predicting the orbital energies of the anionic compounds requires information on the occupied molecular orbitals, while properties of cations are instead dominated by the unoccupied MOs (Figure 9). While these relationships are straightforward when predicting the energy levels of the molecular orbitals themselves, as is done in this study, less obvious effects can potentially be uncovered through such an analysis of more complex properties, such as reaction energies, EPR spectral characteristics, and excited state features.
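The counting-and-normalization procedure above can be sketched as follows; the loading matrix, the threshold handling, and the orbital labels are hypothetical stand-ins for illustration:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)

# Hypothetical PCA loadings: rows = the three principal components,
# columns = localized molecular orbitals (labels are stand-ins).
n_mos = 30
loadings = rng.uniform(-0.5, 0.5, size=(3, n_mos))
labels = [f"MO{i}" for i in range(n_mos)]

# List every orbital whose |loading| on any of PC1-3 exceeds the 0.2 threshold...
hits = [labels[j] for i in range(3) for j in range(n_mos)
        if abs(loadings[i, j]) > 0.2]

# ...then normalize the occurrence counts by the length of the list.
counts = Counter(hits)
freqs = {mo: c / len(hits) for mo, c in counts.items()}

assert len(hits) > 0
assert abs(sum(freqs.values()) - 1.0) < 1e-12
```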

IV. CONCLUSIONS
In this work, we present a quantum-inspired molecular representation for chemical machine learning tasks: the matrix of orthogonalized atomic orbital coefficients (MAOC). It utilizes not only the structures and compositions of the systems it represents, but also their electronic information, i.e., charge and spin multiplicity. The matrix of orthogonalized molecular orbitals is constructed from the coefficients of the non-optimized (guess) localized molecular orbitals, which in turn are linear combinations of predefined orthogonalized atomic orbitals. In generating the latter, even very small atom-centered basis sets, such as pcseg-0 and STO-3G, afford the accuracy of larger basis sets at a fraction of the computational cost. MAOC is a representation with dual character: local when the core and lone-pair orbitals of atoms are constructed, and global when the orthogonalized atomic orbitals are combined to form molecular orbitals, which can either be localized or distributed over a part of, or even an entire, molecule. MAOC is uniquely suited to represent not only molecules but also monatomic (by construction) and periodic (using the meta-Löwdin orthogonalization scheme) systems.
Furthermore, MAOC allows learning and predicting the properties of systems with identical compositions and geometries but distinct electronic configurations, such as vertical properties or redox properties of rigid compounds.Finally, we also present a reduced version of MAOC based on the principal component analysis dimensionality reduction technique, PCX-MAOC, which possesses all of the abovementioned features of full MAOC whilst being significantly more compact, enabling further analysis of the connection between electronic structure and chemical properties, and, in certain cases, leading to higher prediction accuracy.
The performance of the kernel ridge regression ML model in conjunction with MAOC, PC3-MAOC, and several other coordinate- and Hamiltonian-based representations was tested in predicting the frontier orbital energy levels and ground state single-point energies for an extended QM7b dataset and two new datasets, N-HPC-1 and REDOX. We observed that MAOC and PC3-MAOC outperform geometry-based representations for sets of compounds with identical geometries but distinct electronic configurations and for a chemically diverse set of redox-active compounds, whilst affording accuracy similar to that of SOAP, SLATM, and SPAHM for a set of N-heteropolycycles.
Further improvements in MAOC's performance are likely achievable with larger datasets (in cases where property values are spread over a broad range) and with other machine learning architectures, such as neural networks.

Figure 1. Various representations for neutral and dicationic benzene in the same geometry.

Figure 2. Datasets constructed and used in this work. N-HPC-1: composition of the dataset according to the polycyclic skeleton, doped with nitrogen atoms. REDOX: composition of the dataset according to the type of redox-active molecules, as well as schemes of the corresponding redox reactions.

The step size (training set size) for the learning curves is chosen based on the dataset, ranging from 400 to 1,500. All machine learning operations were performed on a MacBook Pro (2019) with a 1.5 GHz Quad-Core Intel Core i5 processor and 8 GB of 2133 MHz LPDDR3 memory.

Moreover, both MAOC and SPAHM are well parallelized; for example, MAOC for QM7bX requires just over 15 minutes on an 8-CPU laptop. Executing principal component analysis on the generated MAOC is virtually instant. In terms of storage space, PC3-MAOC and SPAHM are by far the most compact representations, while full MAOC is roughly on the same order of magnitude as SLATM and SOAP.

Figure 3. MAOC representations for 4,000 compounds from the N-HPC-1 dataset built with various basis sets: (a) CPU time and storage space requirements; (b) HOMO-LUMO gap learning curves (logarithmic scale) generated using a kernel ridge regression model with a Laplacian kernel and a train-to-test ratio of 70/30.

Figure 4. Atomic space of the periodic table of elements generated using MAOC with the pcseg-0 (1st to 6th periods) and AHGBS-5 (7th period, lanthanides, and actinides) basis sets.

Figure 5. (a) Space of electronic configurations of cyclopenta[de]cinnoline, generated with MAOC using principal component analysis. (b) t-SNE plot of the chemical space of the anionic and cationic compounds in the geometries of the parent neutral species from the QM7b dataset, generated with MAOC and colored according to molecular charge.

Figure 6. Selected learning curves (logarithmic mean absolute errors vs. logarithmic training set size) for the kernel ridge regression prediction of frontier molecular orbital energies, their energy differences (gaps), and single-point energies for compounds in the QM7bX dataset with various molecular representations. Superscript vert denotes ion radicals in the geometries of the parent neutral compounds. The full set of computed learning curves is provided in Section IX.A in the Supplementary Material.

Figure 7. Selected learning curves (logarithmic mean absolute errors vs. logarithmic training set size) for the kernel ridge regression prediction of frontier molecular orbital energies, their energy differences (gaps), and single-point energies for neutral and tricationic quartet radical compounds in the N-HPC-1 dataset with various molecular representations. The full set of computed learning curves is provided in Section IX.B in the Supplementary Material.

Figure 8. Learning curves (logarithmic mean absolute errors vs. logarithmic training set size) for the kernel ridge regression prediction of frontier molecular orbital energies, their energy differences (gaps), and single-point energies for compounds in the REDOX dataset with various molecular representations.

Figure 9. Normalized counts of frontier and near-frontier localized molecular orbitals in the lists of orbitals with variance above 0.2 in the three principal components of PC3-MAOC for each compound type in the three investigated datasets.

Table 1. CPU and wall timings (in seconds) and storage space (in MB) associated with the tested representations for all studied datasets.a
a The time required to generate PC3-MAOC from full MAOC is negligible compared to the time required to initially generate the full MAOC.

Table 2. Mean absolute errors (MAEs) and root-mean-square errors (RMSEs) for frontier molecular orbital energies (in eV), their differences (gaps, in eV), and single-point energies (SPEs, in a.u.) of compounds in the QM7bX dataset predicted with kernel ridge regression, trained on 5,284 molecules, in conjunction with various molecular representations.a

Table 3. Mean absolute errors (MAEs) and root-mean-square errors (RMSEs) for frontier molecular orbital energies (in eV), their differences (gaps, in eV), and single-point energies (SPEs, in a.u.) of compounds in the N-HPC-1 dataset predicted with kernel ridge regression, trained on 2,900 (1,700 in the case of trication doublets) molecules, in conjunction with various molecular representations.a
a Values are colored from green to orange according to the ranges <0.2, <0.4, <0.6, and <0.8 for orbital energies and gaps, and <2, <4, <6, and <8 for single-point energies.