Historical methods of functional development in density functional theory have often been guided by analytic conditions that constrain the exact functional one is trying to approximate. Recently, machine-learned functionals have been created by interpolating the results from a small number of exactly solved systems to unsolved systems that are similar in nature. For a simple one-dimensional system, using an exact condition, we find improvements in the learning curves of a machine learning approximation to the non-interacting kinetic energy functional. We also find that the significance of the improvement depends on the nature of the interpolation manifold of the machine-learned functional.

## I. INTRODUCTION

Density functional theory (DFT) is a popular approach to electronic structure calculations in both materials science and chemistry.^{1,2} By approximating various components of the energy of a quantum system with functionals of the density, DFT bypasses the need to solve the Schrödinger equation for the fully interacting problem. The utility of DFT calculations is, in turn, critically dependent upon the quality of the approximations to the energy of a system as a functional of the density. Almost all modern calculations employ the Kohn-Sham (KS) scheme,^{3} in which only the exchange-correlation (XC) energy needs to be approximated as a functional of the spin-densities.

Within the arena of density functional approximation development, there is tension between the desire to derive approximations from general principles of quantum mechanics (e.g., Perdew-Burke-Ernzerhof^{4}) and fitting to known molecular and materials data (e.g., Minnesota functionals^{5}). A key part of this debate is the role of exact conditions. Many general exact conditions are known,^{6} but which ones are relevant to functional performance remains an open question.

Recently, an entirely new type of approximate functional has been developed, using algorithms from the general area of machine learning.^{7,8} Machine-learned density functionals are empirical functionals that work by extrapolating the exact values of the functional on a handful of densities (training set) to predict the value of the functional on new densities (test set). We emphasize that this empiricism is very different from that previously used in DFT. In machine learning (ML), the functional is approximated in an intrinsically nonlocal fashion, requiring many parameters.^{7} This yields an approximation of the functional that achieves chemical accuracy on the interpolation manifold. This approximation is even accurate for processes that standard semilocal functionals fail at, such as stretched bonds. Thus an ML non-interacting kinetic energy remains accurate as a bond breaks^{9} (unlike its semilocal counterparts), while an ML XC energy can include strong correlation, even in the thermodynamic limit.^{8} (Besides finding density functionals, machine learning is more commonly used in chemical and materials science to make accurate predictions of chemical and physical properties.^{10–19})

This kind of DFT development raises a natural question: are exact conditions, which are often built into human-designed functionals, useful in the design of machine-learning approximations? By manually imposing exact conditions on machine-learned density functionals, one introduces prior knowledge that may foreseeably enable easier training. Among the simplest, and most powerful, exact conditions in KS DFT are those produced by coordinate scaling. For instance, the exchange energy functional scales in a simple way when the density is squeezed or stretched,

$$E_{\rm X}[n_\gamma] = \gamma\, E_{\rm X}[n], \qquad n_\gamma(x) \equiv \gamma\, n(\gamma x), \qquad (1)$$

where *γ* is a positive real number. All popular exchange functionals satisfy this condition, which determines the local density approximation up to a constant.^{20}

So far, ML functionals have been designed without the use of such conditions, in order to test principles and limitations of ML construction of functionals in their most primitive form. We can now phrase our question very precisely. If we can *impose* a condition like Eq. (1) on an ML functional, i.e., construct the functional so that the condition is automatically satisfied, do we get a more accurate approximation for a given amount of data? We shall see that the answer is typically yes, but whether that gain in accuracy is significant depends on the nature of the interpolation manifold.

To put the work in context, we note that our title question is very general, and we are unaware of any previous work on it. In such cases, it is vitally important to find the simplest possible non-trivial case where it can be addressed. The simpler the example, the easier it is to perform calculations and separate out competing effects. It would be neither practical nor even useful to attempt to answer this question for, e.g., the XC functional, because of its myriad applications and the many exact conditions that it is known to satisfy. But the non-interacting kinetic energy functional of KS DFT is extremely important in its own right and has been the subject of many previous studies in creating ML-based approximations because of its potential to greatly accelerate KS-DFT calculations.^{21} Finally, although the examples discussed here are one-dimensional, there is no reason to believe that three-dimensional calculations would yield qualitatively different conclusions. Even in this case, where fewer exact conditions are known, our choice of coordinate scaling is just one of many. Other examples of exact conditions on the kinetic energy are positive definiteness, size consistency, and reduction to the von Weizsäcker form for one or two (unpolarized) electrons. However, given the power and generality of coordinate scaling, it seems an entirely appropriate choice as a first pass on the title question. To put it another way, if exact conditions from coordinate scaling did *not* improve the systems discussed here, there would be little reason to try using exact conditions in more complex cases.

This paper is organized as follows. Section II reviews the formalism of KS DFT and machine learning. In this section, we also discuss recent progress in applying machine learning to DFT and standard systems which have been used to test novel ideas in DFT. In Sec. III, we present the theory behind our approach and motivate it with an example. In Sec. IV, we explain in detail the methods used to generate reference densities needed to use our ML formalism. In Sec. V, we compare the performance of a traditional ML functional to one that satisfies an exact scaling condition by construction. In Sec. VI, we interpret our results and offer an explanation in terms of a simple model system. In Sec. VII, we explain how these results may improve the applicability of machine learned functionals and discuss possible extensions of this work.

## II. BACKGROUND

### A. DFT

The main implementation of all modern DFT calculations is the Kohn-Sham (KS) scheme.^{3} To solve an *N*-electron system in an external potential *v*(**r**) using the KS scheme, one considers a system of *N* non-interacting electrons. This system is described by a set of orbitals, {*ϕ*_{i}}, that may be attained by solving the Kohn-Sham equations,

$$\left(-\frac{1}{2}\nabla^2 + v_{\rm S}[n](\mathbf{r})\right)\phi_i(\mathbf{r}) = \epsilon_i\,\phi_i(\mathbf{r}), \qquad (2)$$

where *v*_{S}[*n*](**r**) is the Kohn-Sham potential. We have written this equation in atomic units where *ℏ* = *m*_{e} = *e* = 1 and will continue to use these units throughout the rest of the paper. The Kohn-Sham potential is chosen so that the density of this non-interacting system,

$$n(\mathbf{r}) = \sum_{i=1}^{N} |\phi_i(\mathbf{r})|^2, \qquad (3)$$

matches that of the fully interacting system. For simplicity, we have assumed the system is spin unpolarized. This allows one to write the energy of the fully interacting system as

$$E[n] = T_{\rm S}[n] + \int d^3r\; v(\mathbf{r})\, n(\mathbf{r}) + U[n] + E_{\rm XC}[n], \qquad (4)$$

where we have defined

$$T_{\rm S}[n] = \frac{1}{2}\sum_{i=1}^{N} \int d^3r\; |\nabla\phi_i(\mathbf{r})|^2 \qquad (5)$$

and

$$U[n] = \frac{1}{2}\int d^3r \int d^3r'\; \frac{n(\mathbf{r})\, n(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}, \qquad (6)$$

and *E*_{XC}[*n*] is defined by Eq. (4). Many successful approximations to *E*_{XC}[*n*] have been constructed, enabling a self-consistent solution of Eq. (2) using the functional derivative of Eq. (4).^{4,22,23}

Orbital-free DFT is a methodology that predates KS (in fact, goes back to the Thomas-Fermi theory^{24,25} of 1927) that removes the need to solve for the eigenfunctions of the Kohn-Sham equation. To make KS DFT orbital free, one can construct an approximation to *T*_{s}[*n*], which removes the dependence of Eq. (4) on the Kohn-Sham orbitals. Orbital-free calculations hold the promise of by-passing the bottleneck of solving the KS equations. In the special case of one electron, or two spin-unpolarized electrons, the exact non-interacting kinetic energy functional is given by the von Weizsäcker form,

$$T^{\rm W}[n] = \int d^3r\; \frac{|\nabla n(\mathbf{r})|^2}{8\, n(\mathbf{r})}. \qquad (7)$$
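The von Weizsäcker form is simple to evaluate on a grid. Below is a minimal 1D sketch (our own illustration, not from this work; the function name `t_vw` and the grid parameters are invented), checked against the harmonic-oscillator ground state, whose exact non-interacting kinetic energy is *ω*/4:

```python
import numpy as np

def t_vw(n, dx):
    """von Weizsacker kinetic energy of a 1D density on a uniform grid:
    T_W[n] = \int dx n'(x)^2 / (8 n(x)), with centered differences for n'."""
    dn = np.gradient(n, dx)          # n'(x) by centered differences
    mask = n > 1e-12                 # avoid dividing by the (negligible) tails
    return np.sum(dn[mask]**2 / (8.0 * n[mask])) * dx

# check: harmonic ground-state density n(x) = sqrt(w/pi) exp(-w x^2) has T_s = w/4
w = 2.0
x = np.linspace(-10, 10, 4001)
n = np.sqrt(w / np.pi) * np.exp(-w * x**2)
print(t_vw(n, x[1] - x[0]))          # close to w/4 = 0.5
```

The cutoff `1e-12` only discards tail points whose contribution to the integral is negligible.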

Modern XC functional approximations mostly fall on Jacob’s ladder,^{26} where approximations with an increasing number of ingredients aim to produce higher accuracy. The most commonly used approximations begin with the local density approximation (LDA), which applies the exact XC energy of the uniform electron gas to all systems. LDA is a local approximation, meaning that it depends only on the density at each point. Higher up on Jacob’s ladder are generalized gradient approximations (GGAs) and meta-GGAs, which introduce a dependence on the gradient and Laplacian of the density, respectively. Above these are hybrid functionals, which incorporate some fraction of the exact exchange energy.

### B. Machine-learned functionals

The machine learning approximations are of a very different kind.^{7,8,27,28} Machine-learned functionals do not possess a convenient closed form and typically are not intended to accurately describe general systems. Instead, they fit a function to a dataset of densities for which the exact value of the functional has already been determined through other means, known as the training set.

Kernel ridge regression (KRR)^{29} is a non-linear form of regression that uses the kernel trick. For this work, we choose a Gaussian kernel function. To model a functional $T_{\rm S}^{\rm ML}(\sigma,\lambda)[n]$ using kernel ridge regression, we write

$$T_{\rm S}^{\rm ML}[n] = \sum_{i=1}^{m} \alpha_i\, k[n, n_i], \qquad (8)$$

where *m* is the number of training data and {*n*_{i}} is the set of densities for which *T*_{s}[*n*] is known. The kernel must be a measure of the similarity between densities and is defined here by

$$k[n, n'] = \exp\!\left(-\frac{1}{2\sigma^2}\int dx\, \big(n(x) - n'(x)\big)^2\right). \qquad (9)$$

The {*α*_{i}} are determined so that the loss function

$$C(\boldsymbol{\alpha}) = \sum_{j=1}^{m}\big(T_{\rm S}^{\rm ML}[n_j] - T_{\rm S}[n_j]\big)^2 + \lambda\, \boldsymbol{\alpha}^{\rm T}\mathbf{K}\,\boldsymbol{\alpha} \qquad (10)$$

is minimized. Here, ** α** is a vector of weights defined by

*α*_{i}=

*α*

_{i}and

**K**is the kernel matrix defined by

The minimization of Eq. (10) may be done analytically, yielding

$$\boldsymbol{\alpha} = \left(\mathbf{K} + \lambda\,\mathbf{I}\right)^{-1} \mathbf{T}_{\rm S}, \qquad (12)$$

where **I** is the *m* × *m* identity matrix and **T**_{s} is an *m* × 1 vector such that **T**_{si} = *T*_{s}[*n*_{i}].
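The analytic solution above amounts to a few lines of linear algebra. The sketch below (our own illustration, not code from this work; the names `kernel`, `krr_fit`, and `krr_predict` are invented) fits the weights by solving (**K** + *λ***I**)**α** = **T**_{s} on toy harmonic-oscillator densities whose exact kinetic energies, *ω*/4, are known analytically:

```python
import numpy as np

def kernel(A, B, sigma):
    """Gaussian kernel on densities: rows of A and B are densities sampled
    on a common grid; k = exp(-||n - n'||^2 / (2 sigma^2))."""
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def krr_fit(dens, Ts, sigma, lam):
    """Closed-form KRR weights: alpha = (K + lam*I)^(-1) T_s."""
    K = kernel(dens, dens, sigma)
    return np.linalg.solve(K + lam * np.eye(len(Ts)), Ts)

def krr_predict(alpha, dens_train, dens_new, sigma):
    """Model prediction: sum_i alpha_i k[n, n_i]."""
    return kernel(dens_new, dens_train, sigma) @ alpha

# toy training data: harmonic-oscillator densities with T_s = w/4
x = np.linspace(-8, 8, 401)
ws = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
dens = np.sqrt(ws[:, None] / np.pi) * np.exp(-ws[:, None] * x[None, :]**2)
alpha = krr_fit(dens, ws / 4.0, sigma=1.0, lam=1e-8)
```

With a small *λ*, the fitted model reproduces the training energies essentially exactly; generalization to new densities is what the hyperparameters control.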

In KRR, *λ* and *σ* are hyperparameters of the model. The optimal choice of hyperparameters prevents overfitting and helps the model generalize to out-of-sample data. They are determined by cross-validation, a process in which part of the training set, known as the validation set, is hidden from the model during training. In this work, we perform leave-one-out cross-validation, meaning that for a training set of *N*_{T} densities, we train on *N*_{T} − 1 densities, and the remaining density forms the validation set. We pick the pair of *λ* and *σ* that produces the lowest error on the validation set. We then rotate which data point is used for validation, eventually forming a set of *N*_{T} best values for *λ* and *σ*. We pick the median of each to parameterize our model and retrain it on the full training set.
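The leave-one-out procedure described above can be sketched as follows (our own illustration; the function name `loo_hyperparams` and the candidate grids are invented, and a real application would search much finer grids):

```python
import numpy as np
from itertools import product

def kernel(A, B, s):
    """Gaussian kernel between rows of A and B (densities on a common grid)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    return np.exp(-d2 / (2.0 * s**2))

def loo_hyperparams(dens, T, sigmas, lams):
    """For each held-out density, find the (sigma, lambda) pair with the lowest
    validation error; return the medians of the best pairs over all points."""
    m = len(T)
    best = []
    for v in range(m):                       # index of the validation density
        tr = np.array([i for i in range(m) if i != v])
        errs = {}
        for s, l in product(sigmas, lams):
            K = kernel(dens[tr], dens[tr], s)
            alpha = np.linalg.solve(K + l * np.eye(m - 1), T[tr])
            pred = kernel(dens[v:v + 1], dens[tr], s) @ alpha
            errs[(s, l)] = abs(pred[0] - T[v])
        best.append(min(errs, key=errs.get))
    best = np.array(best)
    return np.median(best[:, 0]), np.median(best[:, 1])

# toy data: harmonic densities with known T_s = w/4
x = np.linspace(-8, 8, 201)
ws = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
dens = np.sqrt(ws[:, None] / np.pi) * np.exp(-ws[:, None] * x[None, :]**2)
s_med, lam_med = loo_hyperparams(dens, ws / 4.0, [0.5, 1.0, 2.0], [1e-8, 1e-6, 1e-4])
```

The medians are then used to retrain the model on the full training set.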

KRR functionals are entirely non-local. Thus, they can describe effects of the non-local behavior of the exact functional that cannot be reproduced by semilocal functionals. For example, KRR approximations to *F*[*n*] are able to reproduce the dissociation curve of H_{2},^{8} whereas approximations to *F*[*n*] that use a semilocal *E*_{XC} approximation cannot.

Most work on ML density functionals, like that reported here, involves the non-interacting kinetic energy functional. This functional, defined in Eq. (5), typically requires one to solve the Kohn-Sham equations before evaluation. However, nearly every KS DFT calculation performed solves the Kohn-Sham equations and thus can serve as data to train a machine-learned kinetic energy functional, which would enable later calculations to be orbital-free.

### C. 1D model systems

In this work, we will use two model 2-electron 1D systems to test the basic idea of using exact conditions to improve machine-learned density functionals.

The first may be denoted as the 1D Hooke’s atom. The regular (3D) Hooke’s atom is two electrons in a harmonic well but repelling each other with a Coulomb repulsion. This has analytic solutions for an infinite discrete set of force constants and has been a popular toy model for testing DFT ideas.^{30–32} The same behavior is also seen in 1D, but now the interaction is replaced by a delta repulsion (as the Coulomb interaction is too singular in 1D, and the delta-repulsion produces an electron-electron cusp that exactly mimics that of 3D). Thus the Hamiltonian is

$$\hat{H} = -\frac{1}{2}\left(\frac{\partial^2}{\partial x_1^2} + \frac{\partial^2}{\partial x_2^2}\right) + \frac{\omega^2}{2}\left(x_1^2 + x_2^2\right) + g\,\delta(x_1 - x_2). \qquad (13)$$

This Hamiltonian separates into center-of-mass and relative coordinates. The first yields a harmonic oscillator, while the second can be solved numerically.^{33,34} As well as a model for electronic structure, these systems are also important in cold-atom physics.^{35}

More recently, an entirely new 1D mimic of electronic structure has been developed, with the principal aim being to test ideas of strong correlation. These systems consist of chains of 1D H atoms which can be accurately and efficiently solved via density-matrix renormalization group (DMRG) codes.^{36–38} They are also designed so that standard density functional approximations, such as LDA, perform quantitatively similar to those in 3D, yielding reasonably accurate equilibrium bond lengths and energies but also breaking spin symmetry at the Coulson-Fischer point as a bond is stretched. For this model, we use a simple exponential for both the attraction to the nuclei and the repulsion among electrons,

$$v_{\rm exp}(x) = A\, e^{-\kappa |x|}, \qquad (14)$$

where *A* = 1.071 295 and *κ*^{−1} = 2.385 345. Just as in 3D reality, the external potential for two “protons” separated by a distance *R* is given by

$$v(x) = -\,v_{\rm exp}\!\left(x - \tfrac{R}{2}\right) - v_{\rm exp}\!\left(x + \tfrac{R}{2}\right), \qquad (15)$$

and the interaction potential is given by

$$v_{\rm ee}(x_1 - x_2) = v_{\rm exp}(x_1 - x_2). \qquad (16)$$
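These model potentials are trivial to code up. A minimal sketch (our own; the function names `v_exp`, `v_ext`, and `v_ee` are invented for illustration):

```python
import numpy as np

# parameters of the 1D exponential mimic, A and kappa (kappa^-1 = 2.385345)
A, KAPPA = 1.071295, 1.0 / 2.385345

def v_exp(x):
    """Exponential stand-in for the 3D Coulomb interaction: A exp(-kappa |x|)."""
    return A * np.exp(-KAPPA * np.abs(x))

def v_ext(x, R):
    """External potential of two 'protons' separated by a distance R."""
    return -v_exp(x - R / 2.0) - v_exp(x + R / 2.0)

def v_ee(u):
    """Electron-electron repulsion as a function of the separation u = x1 - x2."""
    return v_exp(u)
```

For example, `v_ext` is symmetric about the bond midpoint, as the molecular geometry requires.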

## III. THEORY

We attempt to improve machine learning of the kinetic energy functional *T*_{S}[*n*] of non-interacting electrons by enforcing the exact condition

$$T_{\rm S}[n_\gamma] = \gamma^2\, T_{\rm S}[n], \qquad (17)$$

where *n*_{γ}(*x*) is defined as in Eq. (1). The effect of scaling on a density is illustrated in Fig. 1.

Note that we can always choose the origin so that

$$\int dx\; x\, n(x) = 0 \qquad (18)$$

and always apply our approximations only to densities shifted to satisfy this condition. In this way, they are automatically translationally invariant. One could further enforce the 1D equivalent of rotational invariance, namely, invariance under the transformation *x* → −*x* around this origin, but here we study only symmetric potentials.

To enforce the condition, define the square-width of a 1D density as

$$l^2[n] = \int dx\; x^2\, n(x). \qquad (19)$$

We next define a scaled density

$$\tilde{n}(x) = l\, n(l\,x), \qquad l = l[n], \qquad (20)$$

which satisfies $l^2[\tilde{n}] = 1$.

We can write the kinetic energy functional as

$$T_{\rm S}[n] = \frac{\tilde{T}_{\rm S}[\tilde{n}]}{l^2[n]}, \qquad \tilde{T}_{\rm S}[\tilde{n}] \equiv T_{\rm S}[\tilde{n}]. \qquad (21)$$

Thus $\tilde{T}_{\rm S}$ is a functional that applies only to densities whose square-width is 1, and any fit to that functional will, by construction, provide a kinetic energy functional that has the correct scaling.

To illustrate this in a trivial case, consider a single particle in a harmonic well of mass 1 and frequency *ω*,

$$v(x) = \frac{1}{2}\,\omega^2 x^2, \qquad (22)$$

whose exact kinetic energy is *ω*/4. Given a variety of densities and energies for different values of *ω*, one could imagine learning the kinetic energy using the techniques of Sec. II. But, for each *ω*, *l* = (2*ω*)^{−1/2}, and, independent of *ω*,

$$\tilde{n}(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad (23)$$

with kinetic energy 1/8. Insertion of these results into Eq. (21) yields the exact answer. Thus, any single data point would be sufficient to learn the entire curve.
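This toy example is easy to verify numerically. The sketch below (our own illustration; function names are invented) uses the von Weizsäcker form, which is exact for a single particle, to check that every harmonic density scaled to unit square-width has kinetic energy 1/8, and that dividing by the square-width recovers *ω*/4:

```python
import numpy as np

x = np.linspace(-12, 12, 9601)
dx = x[1] - x[0]

def t_vw(n):
    """von Weizsacker kinetic energy: exact T_s for a single particle."""
    dn = np.gradient(n, dx)
    m = n > 1e-12
    return np.sum(dn[m]**2 / (8.0 * n[m])) * dx

def square_width(n):
    """l^2[n] = \int dx x^2 n(x), for a density already centered at the origin."""
    return np.sum(x**2 * n) * dx

def scale_to_unit_width(n):
    """n_gamma(x) = gamma n(gamma x) with gamma = l[n], so that l^2 -> 1."""
    g = np.sqrt(square_width(n))
    return g * np.interp(g * x, x, n, left=0.0, right=0.0)

for w in [0.5, 1.0, 3.0]:
    n = np.sqrt(w / np.pi) * np.exp(-w * x**2)   # harmonic ground-state density
    nt = scale_to_unit_width(n)
    # t_vw(nt) is ~1/8 for every w; dividing by l^2[n] recovers T_s = w/4
    print(w, t_vw(nt), t_vw(nt) / square_width(n))
```

The scaled kinetic energy is the same for every *ω*, which is why a single training point suffices in this case.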

## IV. CALCULATIONS

For the 1D Hooke’s atom, the Hamiltonian becomes separable under the coordinate transformation *X* = *x*_{1} + *x*_{2}, *u* = (*x*_{1} − *x*_{2})/2, which yields the system of equations

$$\left(-\frac{\partial^2}{\partial X^2} + \frac{\omega^2}{4}\,X^2\right)\psi(X) = E_X\, \psi(X), \qquad (24)$$

$$\left(-\frac{1}{4}\frac{\partial^2}{\partial u^2} + \omega^2 u^2 + \frac{g}{2}\,\delta(u)\right)\varphi(u) = E_u\, \varphi(u), \qquad (25)$$

and Ψ(*u*, *X*) = *ψ*(*X*)*φ*(*u*), where we have set *g* = 1 for all calculations.^{33,34} The first equation is solved analytically, and the second equation is solved numerically. We then transform Ψ back to the coordinates *x*_{1}, *x*_{2}. We find the density by evaluating

$$n(x) = 2\int dx'\; |\Psi(x, x')|^2, \qquad (26)$$

and calculating the non-interacting kinetic energy using Eq. (7). We repeat this process for 200 values of *ω* linearly spaced in the interval [0.1, 5] to form a set of reference data. Finally, for convenience, we scale all densities to have the same square-width,

$$\tilde{n}(x) = \gamma\, n(\gamma x), \qquad \gamma = \frac{l[n]}{l[n_{\omega=5}]}, \qquad (27)$$

so that all scaled densities satisfy *l*^{2}[*ñ*] = *l*^{2}[*n*_{ω=5}]. The exact constant chosen for *l*^{2}[*ñ*] only sets the scale of the problem; here, it is chosen so that the scaled and unscaled densities lie on approximately the same scale, which is convenient for plotting.

To create the ML functional, we first choose 50 densities from the reference data which have equally spaced *ω* that span the interval to form a test set. We do not include the endpoints *ω* = 0.1, *ω* = 5 in this set. From the remaining densities, we randomly choose 17 densities and take the first *N*_{T} of these to serve as our training set. We then use KRR to machine-learn the non-interacting kinetic energy. We then repeat this procedure on the scaled densities using the same *N*_{T} densities as before. We then pick another random set of 17 densities from the densities that have not been included in the test set and repeat the procedure.

For the 1D H_{2} case, we perform the same procedure, but now we alter the separation between the centers, *R*, instead of *ω*. The densities were found via DMRG as described above, and the non-interacting kinetic energy was found using Eq. (7). We chose 50 test densities between *R* = 3 and *R* = 8 but do not include the end-points. We train on the same numbers of training data as before. The scale factor is chosen so that all scaled densities satisfy *l*^{2}[*ñ*] = *l*^{2}[*n*_{R=8}].

For both scaled and unscaled functionals, we calculate mean absolute errors in the kinetic energies of the test set for *N*_{T} = 3, …, 17 (odd) to form a learning curve. We repeat this process 200 times, redrawing the 17 random densities to include in the training sets over each trial.
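The learning-curve protocol above can be sketched in a few lines. The following stand-in (our own illustration; it substitutes analytic harmonic densities with *T*_{s} = *ω*/4 for the reference data and fixes the hyperparameters rather than cross-validating them) shows the structure of a single trial:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-8, 8, 401)

def kernel(A, B, s):
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    return np.exp(-d2 / (2.0 * s**2))

# stand-in reference data: 200 densities with known kinetic energies
ws = np.linspace(0.1, 5.0, 200)
dens = np.sqrt(ws[:, None] / np.pi) * np.exp(-ws[:, None] * x[None, :]**2)
Ts = ws / 4.0

test_idx = rng.choice(200, size=50, replace=False)     # fixed test set
pool = np.setdiff1d(np.arange(200), test_idx)          # candidates for training

def mae(train_idx, s=1.0, lam=1e-6):
    """Fit KRR on the chosen training densities; return the test-set MAE."""
    K = kernel(dens[train_idx], dens[train_idx], s)
    alpha = np.linalg.solve(K + lam * np.eye(len(train_idx)), Ts[train_idx])
    pred = kernel(dens[test_idx], dens[train_idx], s) @ alpha
    return np.mean(np.abs(pred - Ts[test_idx]))

# one trial of the learning curve: draw 17 random training densities and
# evaluate the error for N_T = 3, 5, ..., 17 (the paper takes medians over trials)
draw = rng.choice(pool, size=17, replace=False)
curve = {NT: mae(draw[:NT]) for NT in range(3, 18, 2)}
```

Repeating the last two lines over many redraws and taking the median at each *N*_{T} produces the learning curves of Sec. V.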

## V. RESULTS

In Fig. 2, we plot the median learning curves for the 1D Hooke’s atom with solid lines. Red indicates the standard ML approximation, while blue denotes the results with scaling included. Each of the many faint lines in the background indicates one realization of the randomly chosen training points. The exact condition has produced learning curves that are a spectacular order of magnitude better than their unscaled counterparts for a given amount of training data.

On the other hand, Fig. 3 is far less encouraging. In this figure, we show the same results for the functionals trained on H_{2} densities. The scaled and unscaled functionals perform almost identically, with at most a modest gain from imposing the exact condition. The rest of this paper is devoted to understanding why the improvement is so much smaller in this case.

Figure 4 shows the densities of the 1D Hooke’s atom in our set, both before and after scaling. It is apparent that the densities become much more similar once they have been scaled. The appearance of the double peak is a strong correlation effect, in which the two electrons are trying to avoid one another. But the decay of the scaled density at large |*x*| is the same for all cases. To reinforce this, we plot *l* as a function of *ω* for each density and *T*_{S} (both with and without scaling) as a function of *ω*. Clearly the scaled functional varies less and is easier to learn.

To quantify this, we plot in Fig. 5 the value of *γ* for each *ω*, compared with its non-interacting value $1/\sqrt{2\omega}$. The ratio is close to a constant (as it would be exactly in the non-interacting case) and only varies strongly from that as *ω* becomes small and the system becomes more strongly correlated. Finally, in Fig. 6, we plot both the unscaled and scaled KS kinetic energies for the 1D Hooke’s atom. If the density were not changing shape, i.e., truly non-interacting, the unscaled energy would be linear and the scaled kinetic energy would be constant, as in our toy example.

Now let us examine the same sequence of plots for the 1D H_{2} molecule as a function of *R*. Figure 7 shows the densities (both unscaled and scaled) for the sequence of H_{2} densities as a function of *R*. Now the reverse appears true: The unscaled densities appear to change less (or at least no more) than the scaled densities. As *R* increases, the scaled densities change considerably. Moreover, even when the densities are essentially indistinguishable from the sum of atomic densities, the scaled density continues to shrink in extent around the two centers as *R* increases, i.e., the scaled densities continue to change with *R*, even when *R* is very large.

Of course, appearances are not definitive proof, so we quantify this effect. Figure 8 shows *γ* for different values of *R*. Comparing with Fig. 5, we see that *γ* appears to vary linearly with *R* as *R* increases, instead of approaching a constant. When we plot the unscaled and scaled kinetic energies in Fig. 9, we see that, in sharp contrast to Fig. 6, now the scaled kinetic energy varies more over the range of densities than the unscaled one. Thus, it is not surprising that scaling is of little or no help in this case.

## VI. DISCUSSION

We see that, for the contact-interacting one-dimensional Hooke’s atom described above, it is advantageous to machine learn the scaled non-interacting kinetic energy and use the exact condition to relate it to the unscaled non-interacting kinetic energy, rather than machine learn the unscaled non-interacting kinetic energy directly. However, this result does not extend to the set of one-dimensional H_{2} densities, where both methods of machine learning the non-interacting kinetic energy have relatively similar performance.

To understand the system-dependence of the improved performance, we analyze the densities plotted in Figs. 4 and 7. We see that, for Hooke’s atom, the densities in the reference data set are approximately scaled versions of each other. Explicitly, as *ω* increases, the peaks in the density shift away from the origin and become broader. Scaling the densities reverses this trend, creating new densities that may be more meaningfully compared. By contrast, for H_{2}, the densities in the reference data set do not resemble scaled versions of each other. As *R* increases, the peaks in the density shift away from the origin, but they do not broaden. As such, scaling these densities will make them more similar by aligning their peaks but will artificially make them more dissimilar by broadening or narrowing the peaks.

We can understand the poor performance of the exact condition for stretched H_{2} as follows. Consider a model of two disjoint atomic densities. We model an atomic density as a Gaussian,

$$n_{\rm at}(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad (28)$$

and consider densities of the form

$$n(x) = n_{\rm at}\!\left(x - \tfrac{R}{2}\right) + n_{\rm at}\!\left(x + \tfrac{R}{2}\right), \qquad (29)$$

where *R* is large enough so that the overlap can be neglected. Solving for the square-width of this density yields

$$l^2[n] = 2 + \frac{R^2}{2}. \qquad (30)$$

Then, for *R* ≫ 1, Eq. (7) yields *T*_{S} → 1/4, independent of *R*. So, in principle, one could machine learn *T*_{s}[*n*] for all densities of sufficiently large *R* with a single training example. On the other hand, the scaled kinetic energy does not simplify in this limit. For large *R*, $l[n] \approx R/\sqrt{2}$. Thus, $\tilde{T}_{\rm S}[\tilde{n}] \approx R^2\, T_{\rm S}[n]/2 = R^2/8$ (see the large-*R* behavior in Figs. 8 and 9). This non-trivial dependence upon *R* makes it at least as difficult to learn the scaled kinetic energy as it is to learn the unscaled kinetic energy.
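This disjoint-Gaussian argument is easy to check numerically. The sketch below (our own; it assumes unit-variance atomic Gaussians and uses the von Weizsäcker form, which is exact for two unpolarized electrons, for *T*_{s}):

```python
import numpy as np

x = np.linspace(-60, 60, 48001)
dx = x[1] - x[0]

def n_at(y):
    """Unit-variance atomic Gaussian, normalized to one electron."""
    return np.exp(-y**2 / 2.0) / np.sqrt(2.0 * np.pi)

def t_vw(n):
    """von Weizsacker kinetic energy: exact T_s for 2 unpolarized electrons."""
    dn = np.gradient(n, dx)
    m = n > 1e-15
    return np.sum(dn[m]**2 / (8.0 * n[m])) * dx

for R in [10.0, 20.0, 40.0]:
    n = n_at(x - R / 2.0) + n_at(x + R / 2.0)   # two disjoint atoms
    l2 = np.sum(x**2 * n) * dx                  # square-width: ~ 2 + R^2/2
    Ts = t_vw(n)                                # ~ 1/4, independent of R
    print(R, Ts, l2 * Ts)                       # scaled energy grows ~ R^2/8
```

As *R* grows, *T*_{s} stays flat while the scaled energy *l*²*T*_{s} keeps growing, which is the conflict described in the text.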

We believe that this insight allows us to infer how our results generalize to more realistic systems. If the set of reference densities is made more similar by scaling, then we suspect that machine learning $T\u0303s[\xf1]$ and using the exact scaling condition will be more data efficient, as was seen in the case of Hooke’s atom. On the other hand, if the set of reference densities is not made more similar by scaling, then we suspect that either method of machine learning *T*_{s}[*n*] and $T\u0303s[\xf1]$ will yield similar performance. It is worth noting that, even when the functionals produce similar errors, $T\u0303s[\xf1]$ may still be considered superior, as using this functional to calculate *T*_{s}[*n*] will satisfy the exact scaling condition, Eq. (17), by construction. This is in contrast to directly machine learning the functional *T*_{s}[*n*], which will only approximately satisfy the exact scaling condition when both the scaled and unscaled densities lie within the training manifold and will otherwise violate this condition.

In fact, we can see the stretched case as a conflict between *two* exact conditions. The first is coordinate scaling, while the second is size-consistency: Once two densities stop overlapping, any functional should be equal to or tend toward the sum of its values on the two separated densities. Our imposition of the first condition made it more difficult to learn the second.

## VII. CONCLUSIONS

In this paper, for two simple model systems (each an interacting 2-electron system in 1D), we have imposed an exact condition (quadratic scaling with coordinates) on the machine-learned Kohn-Sham kinetic energy functional. We find that results are typically improved, but much more so in the case of a harmonic trap than in the case of a bond being stretched.

We can make intelligent guesses as to what is going on and how such findings might generalize in the real world. The Hooke’s atom example should behave very similarly to the sequence of two-electron ions as a function of nuclear charge. Thus we expect that imposing the exact condition would greatly improve accuracy. But one could reasonably ask: what use is an ML kinetic energy functional for just 2 electrons in such a limited situation?

A more interesting situation would be the sequence of neutral atoms, possibly also including anions and cations. Here the densities are confined, and they must (weakly) approach the Thomas-Fermi solution for large (non-relativistic) values of *Z*, the nuclear charge. We suspect that, once again, the energy would be much more accurately learned with the imposition of the exact condition.

But in the simple case of stretching a diatomic bond, the changes in the shape of the density continue as long as the bond is stretched, and little is gained in accuracy on the interpolation manifold by imposing the condition. This suggests that, in some sense, scaling any density on the manifold moves one in a direction away from the manifold, yielding no additional information about the manifold itself. This would therefore appear to be the most difficult case in which to improve learning, and a different exact condition (such as separation into isolated atoms) would probably be more beneficial to the learning process. We also note that all semilocal functionals fail in this limit. Finally, if we are interested in more than a simple stretch along a bond, such as vibrational and rotational motion in 3D, we can expect a better performance enhancement from imposing the exact condition. This difficulty is likely related to the difficulties encountered in stretching 1D H chains in Ref. 8.

## ACKNOWLEDGMENTS

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1321846. K.B. and L.L. gratefully acknowledge support from NSF Grant No. CHE-1240252. T.E.B. thanks Institut quantique for the support of a postdoctoral fellowship.