The average energy curvature as a function of the particle number is a molecule-specific quantity, which measures the deviation of a given functional from the exact conditions of density functional theory. Related to the lack of derivative discontinuity in approximate exchange-correlation potentials, the information about the curvature has been successfully used to restore the physical meaning of Kohn–Sham orbital eigenvalues and to develop non-empirical tuning and correction schemes for density functional approximations. In this work, we propose the construction of a machine-learning framework targeting the average energy curvature between the neutral and the radical cation state of thousands of small organic molecules (QM7 database). The applicability of the model is demonstrated in the context of system-specific gamma-tuning of the LC-ωPBE functional and validated against the molecular first ionization potentials at equation-of-motion coupled-cluster references. In addition, we propose a local version of the non-linear regression model and demonstrate its transferability and predictive power by determining the optimal range-separation parameter for two large molecules relevant to the field of hole-transporting materials. Finally, we explore the underlying structure of the QM7 database with the t-SNE dimensionality-reduction algorithm and identify structural and compositional patterns that promote the deviation from the piecewise linearity condition.
I. INTRODUCTION
The extension of Hohenberg–Kohn density functional theory (HK-DFT)1 to non-integer particle numbers led to the determination of two fundamental properties of the exact density functional.2 The first is the piecewise linearity condition, which imposes that the total energy as a function of the (fractional) particle number [E(N)] must evolve as a series of straight-line segments.2–5 The second is the derivative discontinuity, which establishes that the exact exchange-correlation potential is characterized by sudden jumps while varying across integer particle numbers.2,6–10
Approximate density functionals do not fulfill these requirements. Instead, they are generally characterized by a convex E(N) curvature and by continuously derivable exchange-correlation potentials.3,11–16 As demonstrated by Kronik, Baer, and co-workers,4 these two quantities are related, and therefore, the knowledge of the first is sufficient to quantify the extent of the second. Using the same argument, the minimization of the energy curvature has the consequence of correcting the effects of the missing derivative discontinuity, restoring the compliance of approximate functionals to the exact conditions of DFT. Failure to comply with these requirements exacerbates the effects of the delocalization error,3,14,17–19 leads to an incorrect dissociation behavior of heterodimers,2,12,20 and causes the Kohn–Sham frontier orbital eigenvalues to deviate from the ionization potential and the electronic affinity, respectively.21–24
The existence of a relationship between the curvature and the derivative discontinuity is especially convenient, as the first can be readily evaluated for a given functional and chemical system according to the following expression:4
where represents the average curvature between two integer points with N and N − 1 electrons, while and are the eigenvalues of the frontier orbitals of the N and N − 1 particle states for a fixed molecular geometry. Equation (1) is exact and it is a direct consequence of Janak’s theorem.21,25
Straightforward accessibility to information about the energy curvature, as well as its relation to some fundamental pitfalls of approximate density functionals, has already proven to be fertile ground and is used in numerous applications. For instance, the minimization of serves as a formally motivated criterion for the compound-specific optimal tuning of range-separated hybrid density functionals.26,27 The accuracy of such functionals has been largely demonstrated in the computations of outer-valence spectra,28 optical rotations,29 and fundamental and optical gaps.30–32 The energy curvature has also been applied as a criterion to assess the extent of delocalization error in approximate functionals and to rationalize, on this basis, their relative accuracy.33–38 Within a different context, knowledge of the curvature played a central role in the validation of ensemble generalizations of standard density functionals,5,39–41 carefully designed to retrieve the correct piecewise-linearity behavior of E(N) and the derivative discontinuities in the exchange-correlation potential. Finally, information on the curvature is exploited to develop correction schemes for existing exchange-correlation density functionals.4,42–45
The relevance of the information encoded in the energy curvature, corroborated by the extent of possible applications, contrasts with the modest chemical complexity and relatively low number of molecules for which the curvature has been reported.4,35
Recently, machine-learning (ML) techniques have been redefining the scale and the complexity of achievable quantum chemical tasks.46–54 Supported by the construction of large molecular databases of quantum chemical benchmarks,46–48,55–58 supervised learning approaches promote the large-scale screening of any targeted molecular quantity ranging from simple ground-state properties59 to complex objects such as electron densities60–63 and the many-body-wavefunction.64 In addition, machine-learning techniques have been intensively used to promote the access to system-specific quantities such as atomic parameters for semi-empirical computations,65 atomic and molecular multipoles moments, polarizabilities, and overlap integrals in the context of intermolecular potentials.66,67 Tackling the scaling-up problem with artificial intelligence (AI) techniques is especially appealing. Specifically, AI results in a reduction in the computational cost of accessing molecular properties,59,68,69 facilitates the extrapolation of acquired information to larger and more complex chemical systems,70–72 and promotes the analysis and identification of non-trivial similarity patterns in otherwise unimaginably large amounts of data.73
Within this context, we here report the construction of a machine-learning model of the average energy curvature () of a set of 7165 organic molecules taken from the QM7 database.46,47 In this work, the focus is placed on the curvature between the neutral and the first radical cation state of each molecule, as its minimization leads to compliance with Koopmans’ theorem.26 The applicability of the regression framework is demonstrated by performing system-specific optimal tuning of the LC-ωPBE functional13,74,75 on the basis of the predicted curvatures. In addition, the transferability of the model is tested by predicting the optimal range–separation parameters and estimating the ionization potential of two larger molecules of practical use, relevant for the field of hole transport materials. Finally, we address the question whether specific chemical patterns are more prone to deviation from piecewise linearity using unsupervised dimensionality reduction algorithms to draw statistically robust relationships between the structure/composition of the molecules and their average energy curvature.
II. LEARNING CURVES
The training set for the non-linear regression of was selected by randomly choosing 6465 molecules out of the QM7 database, leaving the remaining 10% for out-of-sample (oos) predictions. The hyperparameters of the model have been tuned for each functional by 10-fold cross-validation on a randomly selected set containing 10% of the 6465 molecules. The parameters were optimized using a simplex algorithm and an array of 42 different initial conditions sampling a large scale of possible hyperparameter values. At each iteration of the simplex algorithm and for each initial condition, a new set for cross-validation containing 10% of the 6465 molecules was randomly selected. The performance of the model was then evaluated by training on five subsets of different sizes (100, 500, 1000, 2000, and 5000 molecules) while predicting on a test set of fixed size (645 molecules). The final learning curves (Fig. 1) are obtained by repeating this procedure 10 times, each time shuffling the training and the test set among the 6465 initial molecules. The regression model reported in the figure uses the spectrum of London and Axilrod–Teller–Muto (SLATM) molecular representation,76,77 as it was the best performing for the largest training set size (further details about the global and local framework used herein are given in the supplementary material).
As shown in Fig. 1, the difficulty of the learning exercise largely depends on the electronic structure level at which the energy curvature is computed. The learning of PBE078,79/def2-SVP and LC-ωPBE13,74,75/def2-SVP is the most straightforward, followed by PBE80,81/def2-SVP and finally Hartree–Fock (HF/def2-SVP). As already apparent in the upper panels of Fig. 1, this specific ordering is directly related to the amount of variation of the target quantity () within each functional and method. As shown in Fig. 2, the mean absolute error of the model trained on 5000 molecules correlates nearly perfectly with the standard deviation of for each level of theory.
Following Eq. (1), the energy curvature depends on the HOMO eigenvalue of the neutral molecule [N-HOMO] and on the LUMO eigenvalue of its radical cation [(N-1)-LUMO]. Therefore, the relative robustness of each functional in describing these two quantities could be invoked to rationalize the overall spread of its . However, as shown in the top panel of Fig. 3, the individual variations of the frontier orbital energies are not sufficient to explain the overall trend found for . All the functionals are characterized by standard deviations of similar orbital energies, whereas HF shows larger deviations. Importantly, the spread of the individual orbital eigenvalues within each method (Fig. 3 upper panel) is much larger than the standard deviation of their difference (, x-axis of Fig. 2), with the sole exception of HF for which the two are comparable. The narrower distribution of the curvature across the dataset effectively makes this quantity a simpler learning target when compared to previously published efforts to learn individual orbital energies.50,56,82,83
The ordering of Fig. 2 is retrieved only after combining the information about the variation of the orbital energies with the one about their correlation (Fig. 3, bottom panel). In particular, the frontier orbital eigenvalues correlate almost perfectly in LC-ωPBE and PBE0, while their correlation is lower in PBE and very poor within HF. Consequently, the difficulty of the learning exercise ultimately depends on the consistency of a method in describing the orbital energies both of the neutral and the radical cation state of a molecule. Both the spread in orbital eigenvalues and their covariance do not change while performing the computation with a larger basis set (see the supplementary material)
The poor covariance between the frontier orbital eigenvalues in Hartree–Fock is the consequence of the different way in which the occupied and the unoccupied manifolds are treated within the method. In particular, the orbital energies of the occupied manifold, hence the HOMO eigenvalue of the neutral molecule, are determined in Hartree–Fock by the effective potential of N − 1 particles, as the exchange cancels out the self-interaction contribution. This is not the case for the unoccupied manifold, where the effective potential originates from the totality of the particles. In contrast, the energies of both the occupied and the unoccupied orbitals in density functional theory are determined by a N − 1 particle effective potential, as the (approximate) exchange-correlation hole exclude a single electron from each and every orbital.
III. SYSTEM SPECIFIC -TUNING
The energy curvature predicted for each molecule by using the machine-learning model can be readily applied as a criterion for system-specific γ-tuning of range-separated hybrid density functionals. Usually, the tuning procedure consists in adjusting the range–separation parameter to satisfy as closely as possible Koopmans’ theorem for both the neutral and the anionic state of a targeted molecule.26 This method is by far the most commonly used and relies on the knowledge of the ionization potential either from a computational/experimental reference or from an on-the-fly estimation using ΔSCF procedures.26,27,84 As already demonstrated by Kronick, Baer, and co-workers,4 the minimization of in approximate functionals implies their compliance with Koopmans’ theorem. Therefore, the optimal range-separation parameter for a specific compound can be found by imposing the curvature to be identically zero.
Figure 4 schematically illustrates a modification of the regression framework as presented in Sec. II, which uses the curvature information to determine the optimal γ parameter for a given chemical system. In particular, the procedure consists of nine independent kernel ridge regression models, each targeting at different values of the range-separation parameter (learning curves in the supplementary material). In the last step, a cubic spline interpolation of the predicted curvatures leads to the optimal γ parameter (i.e., the γ value for which ) for a given molecule.
To avoid the introduction of unpredictable noise in the data, we considered here only those compounds for which all the computations converged. In consequence, the model was trained using the energy curvature of 5754 small organic molecules taken from the QM7 database and used to predict the system-specific optimal LC-ωPBE γ values for a test set of 640 molecules. Upon a single point computation using the tuned functional, the ionization potential of each molecule is evaluated as −εHOMO and compared to the corresponding value at IP-EOM-CCSD. For consistency, all computations are performed with the def2-SVP basis set. Figure 5 shows the accuracy of estimated IPs averaged over the test set for the standard LC-ωPBE and its γ-tuned variant. The error bars show the maximum and the minimum deviation from the IP-EOM-CCSD reference registered among the 640 molecules.
The tuning procedure on the basis of the predicted curvatures results in a fivefold decrease in the average ionization potential error compared to the standard functional. The robustness of the predictions is further demonstrated by the fact that the worst error made with the γ-tuned variant is only as high as the average error made with the standard LC-ωPBE.
Including several hundreds of different molecules, the test set represents a sufficiently large ensemble for a statistically relevant analysis of the optimal range-separation parameter in LC-ωPBE. By registering the frequency of appearance of the predicted γ values, it is shown that their distribution tends to a Gaussian function centered around 0.32 bohr−1 (Fig. 6). Out of the 640 molecules, only 12 are characterized by an optimal γ parameter close to the 0.4 bohr−1 of the standard functional. In all those cases where system-specific γ-tuning is not possible, for instance, in the computation of dimer binding energies,85 the distribution in Fig. 6 demonstrates that fixing the range-separation parameter of LC-ωPBE to 0.32 bohr−1 would reduce the curvature for the majority of molecules.
The discrepancy with the original parametrization of LC-ωPBE has to be interpreted as the results of a different optimization strategy. Here, the suggested 0.32 bohr−1 minimizes the energy curvature for the highest number of compounds in a comprehensive dataset of organic molecules. Following the works of Baer,4,26 fixing the γ parameter by minimization of the energy curvature is a formally motivated procedure, as it leads to compliance with Koopmans’ theorem and exact conditions of DFT. The original approach used for the parametrization of LC-ωPBE is more pragmatic and seeks to minimize the error of the functional against different energy-based benchmark databases.74 The formal issue associated with this second strategy is that the range-separation parameter inevitably compensates for unrelated deficiencies in the rest of the approximated exchange-correlation functional.
IV. EXTRAPOLATION
The machine-learning models presented in the previous paragraphs rely on a global molecular representation, i.e., each vector in the feature space characterizes one specific compound. As the energy curvature is a molecular property, this class of representations is highly suitable and easily applicable to the regression problem. On the other hand, a model based on a global representation is not transferable, i.e., it cannot be trained on smaller compounds and used to make predictions on larger molecules.86 This issue can be tackled using local representations, which encode the molecular information as a collection of atoms in their environments. By establishing similarity measures between local atomic environments, rather than between whole molecules, local representations lead to more transferable models, applicable to larger and more diverse molecules than those included in the training set (see Refs. 63, 87, and 88). The regression framework shown in Fig. 4 is general and can be readily extended to local, atom-centered molecular representations. More details about the modification of the learning framework to accommodate locality and transferability are given in the supplementary material.
Figure 7 shows the application of the local regression framework to predict the optimal γ values of two large molecules commonly used in hole-transporting materials:91,92 N,N′-Bis(3-methylphenyl)-N,N′-diphenylbenzidine (TPD) and 4,4′,4″-Tris[(3-methylphenyl)phenylamino]triphenylamine (m-MTDATA). The model was exclusively trained on the local environments of the small organic molecules of the QM7 database using the atomic spectrum of London and Axilrod–Teller–Muto (aSLATM)76,77 representation combined with an orthogonal matching pursuit93,94 algorithm for sparse regression (see the supplementary material).
The −εHOMO computed with standard LC-ωPBE is a rather poor approximation of the ionization potential of TPD and m-MTDATA with errors around 1 eV compared to the ab initio references (bt-PNO-IP-EOM-CCSD and ΔSCF at DLPNO-CCSD). Upon ML-based γ-tuning, the error with respect to the wavefunction based methods is reduced to 0.1–0.2 eV for both molecules. In addition, the optimal γs from machine learning and the corresponding IPs are in very close agreement with the optimal parameters obtained ab initio by minimizing the difference between −εHOMO and the neutral-cation ΔSCF energy (Δγ = 0.025 bohr−1 and 0.01 bohr−1). These results, obtained on compounds four times larger than the largest molecule in the training set, demonstrate the transferability of the local model and its applicability to targeted complex molecules. Interestingly, the optimal γ parameters for both TPD and m-MTDATA are much lower than any value obtained on the smaller molecules of the QM7 test set (Fig. 6). This behavior is consistent with the results of the existing literature95–98 and further supports the conclusion that γ can be interpreted as the inverse of an effective conjugation length dependent on the system size. Finally, the HOMO eigenvalue of PBE0 is the farthest from the ab initio reference, but the closest to the experimental values obtained by cyclic voltammetry in organic solution (TPD)89 or by ultraviolet photoemission spectroscopy in amorphous solid state (m-MTDATA).90 This result is not unexpected (see Refs. 99 and 100) and shows that the error made by the global hybrid mimics the effects of the condensed phase environment (e.g., solvent and crystal field).101,102
V. UNSUPERVISED LEARNING AND ANALYSIS OF THE QM7 DATASET
The large chemical diversity contained in the QM7 database promotes a thorough assessment of the relation between the energy curvature computed with a given functional and the system-specific structural and compositional patterns. However, drawing such a relationship for thousands of molecules inevitably leads to a high-dimensional problem, which is unsuitable for analysis and visualization. In this context, non-linear dimensionality reduction algorithms reveal the underlying structure of high-dimensional data by projecting complex vectors into lower dimensions.103–117 Figure 8 shows a two-dimensional representation of the chemical diversity of the database using t-distributed Stochastic Neighbor Embedding (t-SNE).118 This algorithm converts the similarity between molecules, which is defined herein as the Euclidean distance between their SLATM representation, to the probability of being each other’s neighbors. The embedding of high-dimensional data into lower dimensions is then performed by ensuring that the joint probability between molecules should not change upon projection. While the two axis (dimensions) obtained after a t-SNE transformation have no formal physical or chemical meaning, it is still possible to identify at least a qualitative correlation between chemical properties and the dimensions in Fig. 8 vide infra.
The application of t-SNE to QM7 reveals clusters of compounds with similar chemical patterns, mainly defined by the presence or the absence of heteroatoms and their connectivity. In particular, the vertical axis (Dim. 2) somehow correlates with the number of heteroatoms from zero (alkanes, bottom) to two or more non-carbon atoms (hydroxyamines and oxyamines, top). The horizontal axis follows instead a gradient of chemical composition going from the oxygen-based compounds (left) to nitrogen containing molecules (right), passing from mixed species. Each point is color-coded by its average energy curvature computed at PBE/def2-SVP to establish a global, qualitative connection between these macro-families of compounds and the degree of their deviation from piecewise linearity. The choice of PBE is motivated by the fact that the absence of Hartree–Fock exchange leads to a curvature that represents an upper limit for the other functionals.
Figure 8 highlights seven key families characterized by at least one region of high average energy curvature (in red). Out of those clusters, three contains only oxygen as heteroatom [alcohols (7), ethers (15), and acids/esters (8)], two includes sp-hybridized carbons [cyano groups (11) and alkynes (13)], one contains only nitrogen [amines (10)], and the last group includes the amides (9). In contrast, alkanes (with the exception of the smallest methane and ethane, see discussion below) (16), diamines separated by long carbon chains (14), all sulfur containing compounds (5), and amidines (6) are all characterized by lower curvatures. These trends suggest that the presence of increasingly electron-rich heteroatoms tends to increase the average energy curvature. In particular, the presence of oxygen atoms is especially sensitive as shown by the qualitative difference between amides and amidines. The low average energy curvature that characterizes all sulfur containing compounds suggests that the presence of heteroatoms beyond the second row of the periodic table does not have a critical impact on the deviation from piecewise linearity. In addition to heteroatoms, the hybridization of the carbon centers is also a relevant factor as illustrated by the contrast between alkane and alkyne groups. These conclusions are consistent with previous work on charge transfer complexes119,120 and delocalization error.37 In particular, the results presented here are comparable with the work of Kronik and Baer,4 who report the average energy curvature for a set of nine small molecules, whose order can be rationalized in terms of the presence of electron-rich heteroatoms, their hybridization, and the molecular size.
Although not explicitly evident in the mapping of Fig. 8, the molecular size is in fact crucial to determine the extent of the average energy curvature. To emphasize this point, we report in Fig. 9 the correlation of the curvature at PBE/def2-SVP and the size of the molecules, upon averaging over all the molecules with the same number of non-hydrogen atoms (Nheavy). Although the mean values for Nheavy = 1 and Nheavy = 2 are not statistically significant (i.e., these categories include only 1 and 3 molecules, respectively), the robust inverse size/curvature relationship justifies the high curvature of the smallest alkanes (cluster 16, Fig. 8).
The error bars in Fig. 9 show that within every Nheavy, there is a distribution of curvatures that reflect the chemical composition. The analysis of the subset with 3 non-hydrogen atoms is especially suitable as it contains sufficient compounds to reflect general trends but is simultaneously small enough to list all its molecules. The inset of Fig. 9 shows the energy curvature of all the compounds with Nheavy = 3 ordered from the highest to the lowest. This plot validates the conclusions drawn from the t-SNE map, as the curvature decreases with the electron-richness of the heteroatom (O > N > C). One exception due to the effects of hybridization is acetonitrile, which has a slightly higher but comparable curvature to methyl ether. Complementing the information of the t-SNE map, the inset of the figure reveals the high-energy curvature of 3-membered rings (oxirane, aziridine, and cyclopropane), which are generally considered to act as unsaturated systems.121 The previous arguments remain valid to explain the relative order of the electron rich 5-membered conjugated heterocycles [thiophene (6.3) < pyrrole (6.5) < furane (6.7)], whose curvatures are as expected higher than benzene (6.0).
VI. CONCLUSION
The average energy curvature with respect to the particle number is a crucial system-dependent property of density functionals, which quantify their deviation from the exact conditions of DFT and therefore affects their accuracy. Related to the lack of derivative discontinuity in the exchange-correlation potential and thus to the degree of severity of the delocalization error, the information about this quantity has been successfully used for optimal tuning of long-range corrected functionals and to correct Kohn–Sham orbital eigenvalues to match ionization potentials and electron affinities. In this work, we have proposed the construction of a machine-learning model of the average energy curvature and shown its applications for the system-specific tuning of the LC-ωPBE functional. In parallel, unsupervised learning techniques have been applied to obtain qualitative information about particular chemical patterns and molecular properties, which results in highly convex curvatures.
As the curvature is both a system-specific and a functional dependent quantity, we have first shown that the learning exercise is not equally difficult for any given functional, but it depends on its ability to describe on equal footing the neutral and the radical cation state of a molecule. This result implies that the possible spread of value for the average energy curvature is not equal for all methods. In particular, the largest standard deviation for the curvature is registered for Hartree–Fock due to the poor correlation between the neutral molecule HOMO eigenvalue and the LUMO of the radical cation.
Training several independent models to target the curvature at LC-ωPBE for different values of its range-separation parameter led to the construction of a second framework dedicated to the system-dependent optimal tuning of the functional. The use of the predicted γ parameters resulted in a fivefold increase in the accuracy when estimating the first ionization potential (IP) with −εHOMO with respect to the standard functional. The distribution of the predicted range-separation parameters on the QM7 database shows that the original 0.4 value of LC-ωPBE is far from optimal to minimize energy curvature. As a generalization of the framework, we use a local molecular representation for the training and demonstrate the transferability of the modified model by estimating the optimal γ-values and computing the ionization potentials of two larger molecules, relevant for the field of hole-transporting materials.
Finally, projecting the high dimensional SLATM representation of QM7 in two dimensions with a t-SNE algorithm revealed the underlying structure of the database. In particular, the mapping showed several distinct clusters enclosing molecules similar to each other in terms of their scaffold and the presence of heteroatoms. The curvature values across these clusters were found to assume the highest values for compounds with second row heteroatoms, most frequently oxygen, or for compounds with sp-hybridization. Additional analysis of the data supports the existence of an inverse correlation between the molecular size and the average energy curvature.
SUPPLEMENTARY MATERIAL
See the supplementary material for the learning curves of the energy curvature with different molecular representations, a comparison of the model performance between test and out-of-sample set, a more detailed description of the local model, the demonstration of the robustness of the results when changing the basis set, the learning curves for each range-separation parameter and for the numerical data associated to the results reported in the main text..
ACKNOWLEDGMENTS
The authors acknowledge Kun-Han Lin for helpful discussions and the National Centre of Competence in Research (NCCR) “Materials’ Revolution: Computational Design and Discovery of Novel Materials (MARVEL)” of the Swiss National Science Foundation (SNSF) and the EPFL for financial support.
DATA AVAILABILITY
The data and the model that support the findings of this study are openly available in Materials Cloud at https://doi.org/10.24435/materialscloud:2020.0031/v1.
APPENDIX: COMPUTATIONAL DETAILS
The molecular geometries for all species were taken as published in the QM7 database.46,47 The curvatures were computed according to Eq. (1) using the orbital eigenvalues of the neutral and the first radical cation state of each molecule. All the computations using PBE, PBE0, LC-ωPBE, and Hartree–Fock were performed in Gaussian16,122 in combination with the def2-SVP123 basis set. The first ionization potential energies at IP-EOM-CCSD124,125 and bt-PNO-IP-EOM-CCSD,126,127 as well as at ΔSCF (DLPNO-CCSD),128 were obtained with Orca 4.0129 using the def2-SVP basis set for consistency with the DFT values. The density fitting approximation was applied in the DLPNO-CCSD and bt-PNO-IP-EOM-CCSD computations. The machine-learning representations and similarity kernels were obtained using the Quantum Machine Learning toolkit QMLcode130 with the exception of SOAP131 (see the supplementary material), which was computed using DScribe 0.3.2.132 The mathematical form of the similarity kernels was chosen as the standard procedure according to the specific representation (see the supplementary material for more details). The two-dimensional map of the QM7 database was generated using the t-SNE118 algorithm as implemented in the scikit-learn package.133