Construct exchange-correlation functional via machine learning 

Density functional theory has been widely used in quantum mechanical simulations, but the search for a universal exchange-correlation (XC) functional has been elusive. Over the last two decades, machine-learning techniques have been introduced to approximate the XC functional or potential.


I. INTRODUCTION
In 1964, Hohenberg and Kohn proved that the ground state electron density uniquely determines the local potential, up to an additive constant. 1 This insight led to the Kohn-Sham formulation of density functional theory (DFT) and the notion of the exchange-correlation (XC) energy functional, introduced by Kohn and Sham in 1965. 2 The Kohn-Sham approach transforms the many-electron problem into an equivalent one-electron problem with an effective potential. The search for the universal XC functional has since produced a variety of approximate XC functionals, but the universal functional itself has remained elusive. Although a universal analytical form for the XC functional is believed to be impractical, the search remains an active area of research. For the state of the art of DFT, we refer the reader to Ref. 3.
Machine learning (ML) has been applied to construct the XC functional in DFT since 1996, when Tozer et al. proposed a machine learning approach to map the local electron density to the local XC potential. 4 In 2004, Zheng et al. independently used a neural network to construct an improved XC energy functional based on the functional form of B3LYP. 5 With the success of deep learning in computer vision, 6 natural language processing, 7 and other fields, 8,9 there is growing interest in using deep learning architectures, such as convolutional neural networks (CNNs), 10 graph neural networks (GNNs), 11 and transformers, 12 to approximate the universal XC functional. 28-39
In addition to the XC functional or potential, other aspects of the DFT framework can also benefit from ML techniques. 42-48 Furthermore, ML has been extensively used to fit or construct potential energy surfaces, 49 where DFT is frequently used as a training target or benchmark for the ML algorithms. Besides the above-mentioned works on machine-learning-based (MLB) XC functionals, researchers have employed data-driven techniques other than deep learning (such as genetic algorithms) to seek accurate forms of XC functionals. 50,51 For example, the authors of Ref. 50 proposed a symbolic functional evolutionary search to construct accurate XC functionals in symbolic form.
The Journal of Chemical Physics PERSPECTIVE pubs.aip.org/aip/jcp
In this perspective, we focus on the construction of MLB XC functionals or potentials and review various methodologies, from the early approaches to the latest developments. 52 Our goal in this article is not to be exhaustive. Rather, we aim to explore specifically how machine learning techniques can be used to construct XC functionals or potentials and to answer the question of whether the universal functional can be accurately obtained via deep learning.

II. EXCHANGE-CORRELATION FUNCTIONAL AND POTENTIAL
Our discussion begins with an introduction to the fundamental concepts of the DFT framework that are referenced throughout the subsequent sections. The Hohenberg-Kohn theorem 1 forms the basis for predicting the quantum mechanical properties of a many-electron system from its electron density, implying that the ground state energy is a unique functional of the electron density (denoted as ρ(r)). By introducing a non-interacting reference system, Kohn and Sham 2 expressed the ground state energy functional E[ρ(r)] as follows:
$$E[\rho] = T_s[\rho] + E_{\mathrm{ext}}[\rho] + E_H[\rho] + E_{xc}[\rho] \tag{1}$$
and
$$E_{xc}[\rho] = \left(T[\rho] - T_s[\rho]\right) + \left(E_{ee}[\rho] - E_H[\rho]\right). \tag{2}$$
Here, Ts, Eext, EH, and Exc stand for the kinetic energy of a single Slater determinant built from a set of orbitals {ϕi}, the external energy, the Hartree energy, and the exchange-correlation energy, respectively. The terms T and Eee are the exact kinetic energy and the Coulomb energy of the interacting many-electron system, respectively. Minimizing the total energy under the constraint of normalized orbitals leads to the following Kohn-Sham (KS) equations (in atomic units):
$$\left[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{ext}}(\mathbf{r}) + v_H(\mathbf{r}) + v_{xc}(\mathbf{r})\right]\phi_i(\mathbf{r}) = \varepsilon_i\,\phi_i(\mathbf{r}), \tag{3}$$
where vext and vH are the external and Hartree potentials, εi are the orbital energies, and vxc is called the exchange-correlation potential, which is the functional derivative of the exchange-correlation energy with respect to the electron density,
$$v_{xc}(\mathbf{r}) = \frac{\delta E_{xc}[\rho]}{\delta \rho(\mathbf{r})}.$$
The left-hand side of the Kohn-Sham Eq. (3) includes the electron density ρ (which depends on the orbitals ϕi), and thus, it is a nonlinear eigenvalue problem. To solve this problem, an initial density ρ0 must be provided, and the solution must be updated until convergence is reached, a process known as the self-consistent field (SCF) calculation. 53 A key starting point for using ML techniques within DFT is to parameterize the XC energy functional (or potential, or even the corresponding energy density) using various ML architectures, such as neural networks, 54 and to train the model with carefully designed input descriptors and training data. This is referred to as the ML-DFT method, and the ML architecture is termed the ML-DFT model in this perspective. Descriptors should be functions or functionals of the electron density. Below, we review the existing ML-DFT methodologies based on the different types of descriptors used in modeling.
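The SCF procedure described above can be illustrated with a minimal sketch. It is not taken from any of the reviewed works: the 1D grid, the harmonic external potential, the hypothetical density-dependent potential `coupling * rho` (a stand-in for vH + vxc), and the linear mixing parameter are all assumptions made for illustration.

```python
import numpy as np

def scf_1d(n_grid=201, box=10.0, coupling=1.0, mix=0.4, tol=1e-8, max_iter=200):
    """Toy SCF loop: two electrons in one doubly occupied orbital on a 1D grid.

    The potential v[rho] = coupling * rho is a hypothetical stand-in for
    v_H + v_xc; the external potential is harmonic (assumption).
    """
    x = np.linspace(-box / 2, box / 2, n_grid)
    dx = x[1] - x[0]
    # Kinetic operator -1/2 d^2/dx^2 by second-order finite differences.
    off_diag = np.ones(n_grid - 1)
    laplacian = (np.diag(off_diag, -1) - 2.0 * np.eye(n_grid) + np.diag(off_diag, 1)) / dx**2
    kinetic = -0.5 * laplacian
    v_ext = 0.5 * x**2
    # Initial guess rho_0: uniform density normalized to two electrons.
    rho = np.full(n_grid, 2.0 / (n_grid * dx))
    for iteration in range(1, max_iter + 1):
        hamiltonian = kinetic + np.diag(v_ext + coupling * rho)
        eps, orbitals = np.linalg.eigh(hamiltonian)
        phi = orbitals[:, 0] / np.sqrt(dx)          # so that sum(phi**2) * dx == 1
        rho_new = 2.0 * phi**2                      # lowest orbital, doubly occupied
        change = np.sum(np.abs(rho_new - rho)) * dx
        rho = (1.0 - mix) * rho + mix * rho_new     # linear density mixing
        if change < tol:
            return x, rho, eps[0], iteration
    raise RuntimeError("SCF did not converge")

x, rho, eps0, n_iter = scf_1d()
```

In an ML-DFT calculation, the line that builds the density-dependent potential would instead call the trained model.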

III. EARLY WORKS ON XC MODELS
Prior to the recent research on constructing the XC functional or potential via ML, two research groups employed neural networks to search for the XC functional and potential; the two pioneering publications 4,5 appeared in 1996 and 2004, respectively. The electron density was used for the descriptors, and the output was the XC potential or functional.

A. Neural network-based B3LYP functional
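As one of the most popular hybrid functionals, the B3LYP functional 55 includes five pure functional terms: (i) the Slater exchange functional $E_X^{\mathrm{Slater}}[\rho]$; 56 (ii) the Hartree-Fock exchange functional $E_X^{\mathrm{HF}}[\rho]$; 57 (iii) the difference between the Becke88 exchange 58 and the Slater functionals, denoted as $\Delta E_X^{\mathrm{Becke}}[\rho]$; (iv) the Lee-Yang-Parr (LYP) correlation functional $E_C^{\mathrm{LYP}}[\rho]$; and (v) the Vosko-Wilk-Nusair (VWN) correlation functional $E_C^{\mathrm{VWN}}[\rho]$. 59 The B3LYP functional is tuned by three coefficients, a0, aX, and aC; it reads as follows:
$$E_{XC}^{\mathrm{B3LYP}}[\rho] = a_0 E_X^{\mathrm{Slater}}[\rho] + (1 - a_0) E_X^{\mathrm{HF}}[\rho] + a_X \Delta E_X^{\mathrm{Becke}}[\rho] + a_C E_C^{\mathrm{LYP}}[\rho] + (1 - a_C) E_C^{\mathrm{VWN}}[\rho].$$
In hybrid functionals like B3LYP, the coefficients are typically determined by fitting to experimental data or accurate calculations, and once obtained, they are treated as constants. In B3LYP, the values are a0 = 0.8, aX = 0.72, and aC = 0.81, based on fitting a set of atomization energies and ionization potentials. 58 See also Ref. 60 for the calibration and selection of hybrid density functionals using Bayesian optimization techniques.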
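In 2004, Zheng et al. 5 proposed to project the exact XC functional onto the B3LYP functional and pointed out that a0, aX, and aC should, in theory, be system-dependent, i.e., functionals of the electron density. By making these coefficients functionals of the density, the exact XC functional can be expressed as
$$E_{XC}[\rho] = a_0[\rho] E_X^{\mathrm{Slater}}[\rho] + (1 - a_0[\rho]) E_X^{\mathrm{HF}}[\rho] + a_X[\rho] \Delta E_X^{\mathrm{Becke}}[\rho] + a_C[\rho] E_C^{\mathrm{LYP}}[\rho] + (1 - a_C[\rho]) E_C^{\mathrm{VWN}}[\rho],$$
and the resulting coefficients become clearly system-dependent, with different values for different density inputs. Thus, learning the density functional coefficients is essential for determining the exact density functional. In an effort to learn such an XC functional, Zheng et al. used a neural network to determine these coefficients. The corresponding XC potential consists of terms containing functional derivatives of the energy functionals and terms containing functional derivatives of the coefficients. When one assumes that the coefficients depend only weakly on ρ, the latter terms are neglected, and the potential can be (approximately) written as follows:
$$v_{XC}(\mathbf{r}) \approx a_0 \frac{\delta E_X^{\mathrm{Slater}}}{\delta \rho(\mathbf{r})} + (1 - a_0) \frac{\delta E_X^{\mathrm{HF}}}{\delta \rho(\mathbf{r})} + a_X \frac{\delta \Delta E_X^{\mathrm{Becke}}}{\delta \rho(\mathbf{r})} + a_C \frac{\delta E_C^{\mathrm{LYP}}}{\delta \rho(\mathbf{r})} + (1 - a_C) \frac{\delta E_C^{\mathrm{VWN}}}{\delta \rho(\mathbf{r})}. \tag{7}$$
With the approximation in Eq. (7), the machine-learned XC potential was trained and tested in SCF calculations for 116 small molecules, yielding improvements over the original B3LYP functional. Using the 6-311+G(3df,2p) basis set, the RMS error in overall energies is 4.7 kcal mol−1 for the conventional B3LYP and 2.9 kcal mol−1 for the NN-based functional (see Table I below). However, the accuracy of the resulting MLB XC functional is limited by the above approximation that the functional derivatives of a0, aX, and aC are zero.
TABLE I. RMS errors (all data are in units of kcal mol−1).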
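As a minimal sketch of how the B3LYP-type combination with fixed or system-dependent coefficients can be evaluated, consider the following. The dictionary keys, the `coefficient_model` callable, and the component energies are hypothetical placeholders and are not taken from Ref. 5; the combination rule is the standard B3LYP form with a0 multiplying the Slater exchange.

```python
def hybrid_xc_energy(components, a0=0.80, ax=0.72, ac=0.81):
    """Combine the five B3LYP components with the coefficients (a0, aX, aC)."""
    return (a0 * components["slater_x"]
            + (1.0 - a0) * components["hf_x"]
            + ax * components["delta_becke_x"]
            + ac * components["lyp_c"]
            + (1.0 - ac) * components["vwn_c"])

def system_dependent_xc_energy(components, descriptors, coefficient_model):
    """Same combination, with coefficients predicted per system.

    coefficient_model is a hypothetical callable (e.g., a trained neural
    network) mapping system descriptors to the tuple (a0, aX, aC).
    """
    a0, ax, ac = coefficient_model(descriptors)
    return hybrid_xc_energy(components, a0, ax, ac)

# Placeholder component energies in hartree: illustrative numbers, not computed data.
example_components = {
    "slater_x": -8.0, "hf_x": -9.0, "delta_becke_x": -1.0,
    "lyp_c": -0.3, "vwn_c": -0.6,
}
e_fixed = hybrid_xc_energy(example_components)
e_learned = system_dependent_xc_energy(
    example_components, descriptors=None,
    coefficient_model=lambda d: (0.80, 0.72, 0.81),  # constant stand-in model
)
```

With a constant stand-in model, the system-dependent form reduces to conventional B3LYP, which makes the role of the learned coefficients explicit.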

B. Neural network-based XC potential model
In 1996, Tozer et al. 4 proposed a neural network architecture that mapped the local electron density to the corresponding local XC potential. The method is classified as local descriptor-based because its only input is the density at a single point. Improvements are expected if information from higher-order derivatives of the density is included. In Ref. 4, the input densities were calculated at the CCSD level (with the Brueckner coupled cluster method 61), and the model consisted of one fully connected layer with eight hidden neurons, while the target XC potentials were computed by the Zhao-Morrison-Parr (ZMP) method. 62 Training was performed on (ρ, vxc) pairs from either one molecule or multiple atoms/molecules. The ML-DFT model was used to perform KS-SCF calculations, resulting in significant improvements over LDA (see the column of CNN in Table II for the numerical performance of the method). These improvements can be enhanced further by including more information from the neighborhood of the local point, for instance, by adding first- and higher-order derivatives of the density.
To summarize, the approach developed in Ref. 5 pioneered the construction of the XC functional using machine learning, while the approach developed in Ref. 4 was the first to target the XC potential directly. The method in Ref. 4 uses information from the local electron density as the descriptor, and we term it the local descriptor-based method; 63 it is necessarily only an approximation. The numerical scheme constructed in Ref. 5 intends to use the entire electron density of a molecule as the descriptor, and we term it the global descriptor-based method. The global descriptor-based method can be, in principle, exact. In Sec. IV, we first review the recent studies on global descriptor-based methods utilizing more advanced machine-learning architectures.

IV. RECENT WORKS ON GLOBAL DESCRIPTOR-BASED XC MODELS

A. Deep neural networks for XC potential
The work by Nagai et al. 13 investigated the idea of incorporating a neural-network-trained XC potential model in the KS-SCF calculation. Specifically, this approach uses a fixed grid of 100 equally spaced points to feed the entire density as a vector to a fully connected neural network with two 300-neuron hidden layers, mapping the entire electron density to the target XC potential [see Fig. 1 (left) for the algorithmic procedure of the numerical scheme]. Once the XC potential model is trained, one can solve the Kohn-Sham equation, with the initial XC potential produced by the model from the initial electron density as the descriptor. The total energy can also be evaluated.
The proposed method was tested in a 1D model system consisting of two interacting spinless fermions with various random Gaussian external potentials. The target potential was set to be the total Coulomb potential vHxc = vH + vxc, with vH = −A exp(−x²/B²) being the Hartree potential with two parameters A and B; the corresponding density was calculated using exact diagonalization.
In Fig. 1 (right), the two columns show (as color maps) the out-of-training error in density and total energy derived from the KS scheme with the trained potentials. The horizontal and vertical axes represent the ranges of the parameters A and B, respectively. Overall, the trained neural network model demonstrated good generalizability in out-of-sample tests with unseen external potentials within the simple setup.
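The shape of such a global-descriptor network can be sketched as follows. Only the 100-point grid and the two 300-neuron hidden layers come from the description above; the ReLU activation, the 100-point output, the random (untrained) weights, and the Gaussian test density are assumptions for illustration.

```python
import numpy as np

def init_mlp(sizes, seed=0):
    """Random (untrained) weights and zero biases for a fully connected network."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, rho):
    """Map a density sampled on the grid to a potential on the same grid."""
    h = rho
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)   # ReLU hidden layers (assumed activation)
    w, b = params[-1]
    return h @ w + b                     # linear output layer

grid = np.linspace(-1.0, 1.0, 100)                  # fixed grid of 100 points
rho_in = np.exp(-grid**2 / 0.1)
rho_in *= 2.0 / (np.sum(rho_in) * (grid[1] - grid[0]))  # normalize to two particles
params = init_mlp([100, 300, 300, 100])
v_out = mlp_forward(params, rho_in)                 # potential values on the grid
```

Because the whole density vector is the input, the network is tied to one grid and one system size, which is the main limitation of global descriptors.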

B. Projection-based XC potential and energy model
To simplify and standardize the density descriptor in realistic systems, a projection method may be chosen. In a recent work, the electron density was expanded in a set of basis functions,
$$\rho[v](\mathbf{r}) = \sum_{\ell=1}^{L} u^{(\ell)}[v]\, \phi_\ell(\mathbf{r}).$$
Here, u(ℓ) and ϕℓ are the ℓth projection coefficient and ℓth basis function, respectively. The term v denotes the external nuclear potential, which was approximated by a sum of Gaussians as in Ref. 28. The projection coefficient vector $u = (u^{(\ell)})_{\ell=1}^{L}$ is then mapped to the target energy through a kernel ridge regression (KRR) model. 64 See Fig. 2(a) for its algorithmic procedure. The target energy was selected to be either the DFT energy obtained using the PBE functional, the CCSD(T) energy, or the difference between the two, which captures the exchange-correlation contribution at varying levels of accuracy.
The basic idea behind a KRR model is that, when presented with a new density, the algorithm interpolates between the known target energies of the densities in the training set. The output of the model is written as
$$E^{\mathrm{ML}}[v] = \sum_{i} \alpha_i\, k\big(u[v], u[v_i]\big), \qquad k(u, u') = \exp\left(-\frac{\lVert u - u' \rVert^2}{2\sigma^2}\right), \tag{8}$$
where k(⋅, ⋅) is a Gaussian kernel measuring the similarity between any two projected density descriptor vectors, with σ being a hyper-parameter determined by cross-validation. 65 In Eq. (8), E ML stands for the fitted energy; u[v] denotes the projection coefficient vector for the external potential v; vi is the ith external potential in the training set; and αi are the fitted regression parameters. Predictions for new densities are generated by a summation of the parameters αi weighted by the kernel similarity to each training example.
To broaden the scope of their approach, the authors also built a separate KRR model mapping the external potential to the density. By combining this with the model that maps density to energy, all the density functionals can be expressed as functionals of the external potential. This effectively blurs the line between machine learning methods based on density functional theory and those that learn directly from molecular geometry.
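A minimal KRR sketch with a Gaussian kernel is given below. The function names, the regularization strength `lam`, and the synthetic two-dimensional "descriptors" and target are assumptions for illustration; they do not reproduce the descriptors or data of the reviewed work.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """k(u, u') = exp(-|u - u'|^2 / (2 sigma^2)) for all pairs of rows of a and b."""
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def krr_fit(u_train, e_train, sigma, lam=1e-6):
    """Solve (K + lam * I) alpha = E for the regression parameters alpha."""
    k = gaussian_kernel(u_train, u_train, sigma)
    return np.linalg.solve(k + lam * np.eye(len(u_train)), e_train)

def krr_predict(u_new, u_train, alpha, sigma):
    """E_ML(u) = sum_i alpha_i k(u, u_i)."""
    return gaussian_kernel(u_new, u_train, sigma) @ alpha

# Synthetic stand-in data: 2D descriptor vectors and a smooth target "energy".
rng = np.random.default_rng(0)
u_train = rng.uniform(-1.0, 1.0, size=(80, 2))

def synthetic_energy(u):
    return np.sin(u[:, 0]) + u[:, 1] ** 2

alpha = krr_fit(u_train, synthetic_energy(u_train), sigma=1.0)
u_test = rng.uniform(-0.8, 0.8, size=(20, 2))
e_pred = krr_predict(u_test, u_train, alpha, sigma=1.0)
```

In practice, σ and the regularization strength would be selected by cross-validation, as stated above for σ.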

C. Kohn-Sham regularizer
Previous efforts have been made to construct XC potential models for SCF calculations. However, in those efforts, the training procedure and the SCF calculations were independent of each other. In contrast, in the work by Li et al., 21 the ML model was programmed in a fully differentiable way with the aid of automatic differentiation, 66-70 allowing the error to backpropagate through multiple iterations of the SCF calculation. In general, automatic differentiation efficiently computes the derivatives of any function in a computer program, and this technique can be used, for instance, to minimize the Hartree-Fock energy (or any other objective functional), thereby avoiding eigenvalue calculations in an orbital-free setting. 68
The scheme developed by Li et al. 21 effectively included more information about the functional mapping from the density to the XC energy, and the scheme was named the Kohn-Sham regularizer (KSR). The loss function includes both energy and density loss terms. All the energies Ek are used to form the energy loss function, while only the last term ρK in the charge density sequence is used to form the density loss function. In other words, the former loss has contributions from multiple iterations, with decaying weights for earlier iterations, while the latter contains only the root mean squared error between the last iteration's output and the target.
The KSR model has shown good generalizability for a simple one-dimensional H2 model system, with only two training examples needed to determine the whole dissociation curve reasonably well. However, as the work was developed for 1D model systems, it still falls under the category of proof of concept. Moreover, the energy loss term contains the contributions of all energies Ek produced in the preceding SCF iterations, and this training mechanism forces the output energy of the model to converge more or less exactly in the way the training labels did, which is generally not practical in conventional SCF calculations. Extending it to realistic 3D systems will require extra effort due to the computational complexity.
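The structure of such a loss can be sketched as follows. The geometric weighting `decay ** (K - 1 - k)`, the squared energy error, and the function name are assumptions; the reviewed work may use a different weighting or normalization.

```python
import numpy as np

def ksr_style_loss(energies, rho_final, e_target, rho_target, decay=0.9):
    """Energy loss over all SCF iterations plus density loss on the last one.

    energies   : sequence E_1, ..., E_K produced by the SCF iterations
    rho_final  : density rho_K of the last iteration
    decay      : hypothetical decay factor; earlier iterations get smaller weights
    """
    energies = np.asarray(energies, dtype=float)
    k_total = energies.size
    # Weight 1 for the last iteration, decay for the one before, and so on.
    weights = decay ** (k_total - 1 - np.arange(k_total))
    energy_loss = np.sum(weights * (energies - e_target) ** 2)
    # Root mean squared error between the final density and the target density.
    density_loss = np.sqrt(np.mean((np.asarray(rho_final) - np.asarray(rho_target)) ** 2))
    return energy_loss, density_loss

e_loss, rho_loss = ksr_style_loss(
    energies=[1.0, 0.0], rho_final=np.zeros(3),
    e_target=0.0, rho_target=np.zeros(3), decay=0.5,
)
```

In a differentiable implementation, both terms would be backpropagated through the SCF iterations that produced the energies and the final density.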

V. MODEL TRANSFERABILITY AND HOLOGRAPHIC ELECTRON DENSITY THEOREM
Although high-precision quantum chemistry methods, such as CCSD 72 and quantum Monte Carlo, 73,74 facilitate the acquisition of large amounts of data on small molecules, obtaining an equally accurate dataset for large molecules from ab initio methods is not practical. The lack of such data for larger molecules poses a key problem for the transferability of machine-learning-based XC functionals to complex molecules. Since most of the existing ML-DFT models are trained only on datasets of small molecules, the model's transferability from simple, small molecules to complicated, large ones may pose a challenge in constructing a universal XC functional. To address this issue, the density descriptors must be carefully designed to ensure the transferability of the ML-DFT model from small molecules to large ones.
Riess and Münch 75 posited in 1981 that the electron density distribution of a molecular system is determined by the ground state electron density in an arbitrary finite volume, based on the hypothesis that the electron density functions of atomic and molecular species are real analytic in real space excluding the nuclei. The validity of this hypothesis, however, was not rigorously proven until Fournais et al. demonstrated the real analyticity of the electron density of an arbitrary atomic or molecular eigenstate of the Schrödinger equation. 76,77 Another proof of the real analyticity of the electron density has been given by Jecko. 78 The ground state holographic electron density theorem (GS-HEDT), named by Mezey, 79 is thought to be linked to the concept of quantum similarity measures in DFT. 80,81
In the case of an atomic or molecular system, the external potential v(r) acting on each electron is real analytic (in the mathematical sense) except at the nuclei. The electron density is real analytic everywhere except at the isolated points where the nuclear point charges cause non-analyticities. Analytic functions, such as Gaussian orbitals and plane waves, are often used as basis sets for quantum mechanical calculations, resulting in real analytic electron densities. The values within a subregion are sufficient to determine the values everywhere in physical space, which can be shown by the analytic continuation of real analytic functions, as demonstrated in Ref. 82. 84-86 Moreover, the nearsightedness principle proposed by Kohn 87 (see also Ref. 88) suggests that local electronic properties, such as the electron density, depend mostly on the external potential in the nearby regions. This principle shares the same foundation with the GS-HEDT, which also highlights the local nature of ground state electrons.
Based on the GS-HEDT, the electron density within a finite volume is sufficient to determine the global density distribution of a real atomic or molecular system. While many modern density functional approximations utilize nonlocal information for improved accuracy, 89 it may be possible to achieve an accurate quasi-local KS mapping through the use of advanced machine learning techniques.
To create an ML-DFT model for quasi-local electron density, a direct mapping of the electron density to the XC potential for use in the SCF calculations can be used as a starting point. We may write
$$v^{\mathrm{ML\text{-}XC}}(\mathbf{r}) = M_\theta\big(\{\rho(\mathbf{r}') : \mathbf{r}' \in B(\mathbf{r})\}\big),$$
where Mθ denotes the ML-DFT model with its optimized parameters θ, and B(r) denotes a neighborhood of r. The ML XC potential at r thus depends on the electron density ρ at r and in its neighborhood. After training, the resulting ML-DFT model can be used in SCF calculations. As dictated by the GS-HEDT, the neighborhood could be arbitrarily small in principle. However, in practice, the quasi-local region surrounding the spatial point should be of a certain finite size to ensure a numerically feasible KS mapping. Following Ref. 14, a cube centered at each position, with sampling points arranged along the spatial directions, is a viable neighborhood choice. For instance, for a given window half-length h > 0, the sampling points are drawn from the cube
$$B(\mathbf{r}) = [r_x - h, r_x + h] \times [r_y - h, r_y + h] \times [r_z - h, r_z + h],$$
where r = (rx, ry, rz), with a certain step length (the smaller the step, the more points are sampled for a fixed h). A convolutional neural network (CNN) 90 architecture was employed, with the input being a cube of sampled density and the final output being a scalar value of the XC potential at the respective quadrature point. The resultant ML XC potential is integrated and used for SCF calculations later on.
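The construction of the quasi-local descriptor can be sketched as follows. The function names, the number of sampling points, and the hydrogen-like 1s test density are assumptions for illustration; the sketch only shows how a cube of density values around a point r could be assembled as input for a 3D CNN.

```python
import numpy as np

def sample_density_cube(rho, r, h, n_points):
    """Sample rho on an n_points^3 grid in the cube [r - h, r + h]^3.

    rho      : callable taking an array of shape (..., 3) and returning densities
    r        : center of the cube, array of shape (3,)
    h        : window half-length
    n_points : sampling points per direction (odd, so that r itself is sampled)
    """
    offsets = np.linspace(-h, h, n_points)
    dx, dy, dz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    points = np.stack([r[0] + dx, r[1] + dy, r[2] + dz], axis=-1)
    return rho(points)

def hydrogen_1s_density(points):
    """Analytic 1s density of the hydrogen atom (atomic units), used as a test input."""
    radius = np.linalg.norm(points, axis=-1)
    return np.exp(-2.0 * radius) / np.pi

center = np.array([0.3, 0.0, -0.2])
cube = sample_density_cube(hydrogen_1s_density, center, h=0.5, n_points=9)
```

One such cube is generated for every quadrature point, and the model returns one scalar potential value per cube.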
The ML-DFT model is a 3D CNN, as depicted in Fig. 4. It was trained on a dataset of 50 H2 molecules and 50 HeH+ ions (with bond lengths ranging from 0.504 to 0.896 Å) and tested on H2 and HeH+. The ground state electron density, calculated with CCSD(T), is used as the input or descriptor. The target or output is the XC potential, which was calculated using the Wu-Yang (WY) method 91,92 (see Appendix B for a brief introduction).
This ML-DFT model outperforms traditional DFT using B3LYP in terms of electron density accuracy by at least one order of magnitude, as demonstrated by benchmarking against the reference CCSD electron density. When integrated into the SCF procedure, the ML XC potential achieves impressive performance on the electron density.
In Fig. 5(a), the HeH+ electron density calculated with the ML-DFT method is compared with that of B3LYP, with the CCSD(T) electron density as the reference. With the predicted electron density, the atomic forces can be calculated using the Hellmann-Feynman theorem 93 with a basis set correction. 94 The accuracy is significantly better than that of B3LYP. Figure 5(b) shows that the same model was tested on HeH+ ions with He-H distances up to values much larger than those in the training set. The model's out-of-sample performance, as measured by the density difference to CCSD, remained much better than that of B3LYP even at bond distances around 3 Å for HeH+. Furthermore, the ML-DFT model outperformed B3LYP in density accuracy even in more complex systems (such as He-H-H-He2+) with different numbers of electrons and nuclei than the molecules in the training set; Figs. 5(c) and 5(d) show the comparison for two different structures. The use of the quasi-local electron density as input has yielded exceptional transferability of the ML-DFT model.

B. Quasi-local XC energy density model
An alternative approach is to build an ML-DFT model that directly targets the XC energy density εxc, defined as follows:
$$E_{xc}[\rho] = \int \rho(\mathbf{r})\, \varepsilon_{xc}(\mathbf{r})\, d\mathbf{r}.$$
Although targeting the energy density shares similarities with the previous ML-DFT models for the XC potential, the model output is different, and careful consideration is required. Similar to the previous ML-DFT models, this model requires data in the form of the XC potential or electron density at each grid point, and a sensible strategy is to train against targets on the entire grid. Unfortunately, unlike for the XC potential, there is no procedure like the WY method 91,92 to produce the energy density. Furthermore, the calculation of the parameters requires second-order derivatives, which can be computationally intensive. Nevertheless, automatic differentiation techniques and packages are now available to handle such calculations. Implementing the model involves saving the first-derivative graph and carrying additional numerical overhead in the backpropagation process to calculate the second-order derivatives. The XC energy and potential can be obtained from the XC energy density by numerical manipulation. To generate the total XC energy, the XC energy density is integrated, weighted by the electron density.
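The step from an energy density to the XC energy and potential can be sketched as follows. As assumptions, the spin-unpolarized LDA exchange energy density stands in for a learned model, the hydrogen 1s density on a radial midpoint grid serves as the test input, and the potential is obtained by differentiating the local integrand ρ εxc(ρ) numerically, which is valid only for a purely local functional.

```python
import numpy as np

C_X = 0.75 * (3.0 / np.pi) ** (1.0 / 3.0)

def lda_exchange_energy_density(rho):
    """Exchange energy per electron of the uniform electron gas (spin-unpolarized)."""
    return -C_X * np.cbrt(rho)

def xc_energy(rho, weights, eps_xc=lda_exchange_energy_density):
    """E_xc = sum_i w_i * rho_i * eps_xc(rho_i) on a quadrature grid."""
    return np.sum(weights * rho * eps_xc(rho))

def local_xc_potential(rho, eps_xc=lda_exchange_energy_density, rel_step=1e-4):
    """v_xc = d(rho * eps_xc)/d(rho) by central differences (local functionals only)."""
    def integrand(n):
        return n * eps_xc(n)
    return (integrand(rho * (1 + rel_step)) - integrand(rho * (1 - rel_step))) / (2 * rel_step * rho)

# Radial midpoint grid and hydrogen 1s density (atomic units).
dr = 0.005
r = (np.arange(4000) + 0.5) * dr
weights = 4.0 * np.pi * r**2 * dr
rho = np.exp(-2.0 * r) / np.pi

e_x = xc_energy(rho, weights)
v_x = local_xc_potential(rho)
```

For this test density, the integral evaluates to approximately −0.2127 hartree, and the numerical potential agrees with the analytic LDA result vx = (4/3) εx. For a neural-network energy density, the derivative would be taken by automatic differentiation instead of finite differences.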
Below, we present the ML-DFT model for the XC energy density developed by Nagai et al., 24 which employs a fully connected neural network trained with different electron density descriptors as inputs and the XC energy density as output. However, the XC energy density is not used directly in the training loss function (only losses in total energy and electron density are employed). The electron density descriptors used in the model include various combinations of five density-related quantities, defined in Eq. (12). Let g denote the overall input vector concatenating all necessary input descriptors. The XC energy density is then parameterized in terms of $G^{\mathrm{NN}}_{xc}$, the output of a four-layer fully connected neural network that takes g as input. The five descriptors in Eq. (12) form a near region approximation (NRA) in DFT. Depending on which terms are included, the formulation unifies various levels of detail about the local or quasi-local electron density. If all five descriptors are included, the ML-DFT model is referred to as an NRA-type functional. To compute the XC potential from the XC energy density, a Monte Carlo method was used instead of backpropagation, avoiding complications from both the backpropagation through the inverse KS problem and the second-order derivative problem.
The resulting ML XC energy density model with local density descriptors shows a reasonable performance (see Fig. 4 in Ref. 24). However, the performance only becomes comparable to traditional hybrid functionals when the coarse-grained quasi-local density is included through the fifth descriptor (the NRA shown in the original paper). CCSD(T) and G2 95 results are used as the reference data.

C. XC fragment energy model
The HEDT guarantees the representability of the XC potential and XC energy (or energy density) by the quasi-local density. This one-to-one mapping between the local XC potential and the quasi-local electron density can be utilized in several different ways. A slightly different approach from the previous models is to divide the XC energy into contributions from naturally meaningful fragments (e.g., atoms).
As shown in Fig. 6, the electron density of a system is divided into four fragments, each with a unique mapping to the system's properties. When the mapping ρfrag,i ↦ Ei, i ∈ {1, 2, 3, 4}, for each fragment's contribution to the XC energy Exc = ∑i Ei is specified, it uniquely determines a quasi-local XC functional Ei = Exc[ρfrag,i]. This mapping is relatively straightforward to find with atomic division. The total XC energy of a molecule can be equated to the sum of the XC energy contributions from its constituent atoms, and a machine learning model can read and interpret the quasi-local density around each nucleus to output the corresponding atomic XC energy contribution.
It should be noted that, even though the XC energy is expressed as the sum of the contributions from individual atoms, higher-order interactions among two or more atoms can still be partitioned into single-atom contributions because the quasi-local density around each nucleus contains information from all orders. However, it is left to the machine learning model to determine how the energy contribution is split among the participating atoms. For instance, for a C=O bond in a specific environment, the XC energy correction attributable to the bond can be apportioned to both the carbon and oxygen atoms.
Atomic contributions to molecular potential energy surfaces (PES) were constructed prior to the widespread use of deep learning models, as demonstrated in the work of Behler and Parrinello. 49 However, to construct a truly universal XC functional that requires no additional information beyond the density itself, higher complexity models are necessary. Every aspect of the XC energy or potential arises from the subtle variations in the shape of the quasi-local density. Recent advancements in deep learning have made it possible to construct such models.
Dick and Fernandez-Serra 16 demonstrated promising accuracy for small molecules using an XC energy fragment model based on atomic contributions. The model constructs specific neural networks for each atom type and samples the electron density surrounding each nucleus using Gaussian-orbital-like projectors. Symmetrized projected values serve as the input for the neural networks, with the output representing the energy contribution from each atom. The total XC energy is calculated by summing the outputs of all atomic neural networks. Functional derivatives with respect to the density are needed for the SCF calculation. By the chain rule, they follow from a rather simple transformation from the density descriptors to the density itself,
$$\frac{\delta E_{xc}}{\delta \rho(\mathbf{r})} = \sum_{\beta} \frac{\partial E_{xc}}{\partial c_\beta}\, \psi_\beta(\mathbf{r}),$$
where β is the index for different projectors, cβ is the value of the density projected onto projector β, and ψβ(r) is the shape of the projector.
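The chain rule from projected descriptors back to a potential on the grid can be sketched in one dimension as follows. The Gaussian projector shapes, the `tanh` toy energy model standing in for an atomic neural network, and all names are assumptions for illustration; symmetrization of the descriptors is omitted.

```python
import numpy as np

def projectors(x, widths):
    """Gaussian projector shapes psi_beta(x) centered at the nucleus (x = 0)."""
    return np.exp(-x[None, :] ** 2 / (2.0 * widths[:, None] ** 2))

def descriptors(rho, psi, weights):
    """c_beta = integral of psi_beta(x) * rho(x), evaluated by quadrature."""
    return psi @ (weights * rho)

def toy_energy(c, w):
    """Hypothetical stand-in for an atomic neural network E(c)."""
    return np.tanh(w @ c)

def toy_energy_grad(c, w):
    """Analytic gradient dE/dc of the toy model."""
    return (1.0 - np.tanh(w @ c) ** 2) * w

def ml_potential(rho, psi, weights, w):
    """v(x) = sum_beta dE/dc_beta * psi_beta(x), using dc_beta/drho(x) = psi_beta(x)."""
    c = descriptors(rho, psi, weights)
    return toy_energy_grad(c, w) @ psi

x = np.linspace(-5.0, 5.0, 101)
weights = np.full(x.size, x[1] - x[0])
rho = np.exp(-x**2) / np.sqrt(np.pi)
psi = projectors(x, np.array([0.5, 1.0, 2.0]))
w = np.array([0.3, -0.2, 0.1])
v_ml = ml_potential(rho, psi, weights, w)
```

With a real neural network, the gradient with respect to the descriptors would come from backpropagation rather than from an analytic formula.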
While the model developed by Dick et al. has shown promising accuracy for small molecules, it is not yet universal. The model relies on different neural networks and projectors for each type of atom, and different models were trained for different datasets. Specifically, the researchers developed three distinct models for three different datasets.

VII. IMPROVING ML-DFT MODEL PERFORMANCE
In this section, we review existing approaches to improve the performance of ML-DFT models. These include the use of different training strategies, the design of specific loss functions, and the imposition of physical constraints that density functionals should satisfy. By implementing these methods, one can improve the accuracy of DFT calculations.

A. Fully differentiable training with SCF calculations
To build an ML-DFT model that accurately represents the universal XC functional, the trained model with a fixed set of parameters should be applicable to any atoms, molecules, and materials. However, optimizing the parameters during the training phase can be highly complicated due to the tangled relationship between the ML model and the SCF calculations. The parameters in the model should be optimized in a way that aids the SCF procedure in converging to the correct density. If the same model is invoked during each SCF calculation, one may isolate the SCF procedure from the model training. This problem has been solved by implementing the KS equation with differentiable programming, 96-98 which is an emerging programming paradigm allowing one to take the derivative of an output of an arbitrary code snippet with respect to its input using automatic differentiation techniques. 66 One can combine the SCF calculation with the optimization procedure to better train an ML-DFT model. This idea was first demonstrated in a simple 1D system by Li et al. 21 Later on, Kasim and Vinko 19 and Dick and Fernandez-Serra 99 also implemented neural network models for three-dimensional molecules, where the derivatives can be computed by backpropagating through the SCF iterations. However, this approach requires a large amount of memory and may result in numerical instability when computing the derivatives, which makes it difficult to train on large datasets. One can apply the technique of implicit differentiation 69 to reduce the computational complexity and memory footprint of the actual implementation.
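The difference between differentiating through the iterations and implicit differentiation can be illustrated on a scalar fixed-point problem x = f(x, θ), which stands in for the SCF map. The map f below is a hypothetical toy, and the "unrolled" variant propagates sensitivities forward through the iterations rather than by reverse-mode backpropagation; both yield the same derivative at convergence.

```python
import math

def f(x, theta):
    """Toy contraction standing in for one SCF update (hypothetical)."""
    return 0.5 * math.cos(x) + theta

def df_dx(x, theta):
    return -0.5 * math.sin(x)

def df_dtheta(x, theta):
    return 1.0

def solve_fixed_point(theta, x0=0.0, tol=1e-12, max_iter=500):
    """Iterate x <- f(x, theta) until self-consistency."""
    x = x0
    for _ in range(max_iter):
        x_new = f(x, theta)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")

def grad_unrolled(theta, n_iter=60, x0=0.0):
    """Propagate dx/dtheta through every iteration (cost grows with n_iter)."""
    x, dx = x0, 0.0
    for _ in range(n_iter):
        # The right-hand side uses the old x for both entries of the tuple.
        x, dx = f(x, theta), df_dx(x, theta) * dx + df_dtheta(x, theta)
    return dx

def grad_implicit(theta):
    """Implicit function theorem at the converged point: dx/dtheta = f_theta / (1 - f_x)."""
    x_star = solve_fixed_point(theta)
    return df_dtheta(x_star, theta) / (1.0 - df_dx(x_star, theta))

theta0 = 0.3
g_unrolled = grad_unrolled(theta0)
g_implicit = grad_implicit(theta0)
```

The implicit variant needs only the converged solution, so its memory cost does not grow with the number of iterations; in the multidimensional case, the division becomes a linear solve.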

B. Designing loss functions
In supervised learning, the ML-DFT model is optimized by minimizing loss functions defined by the difference between the output values and the reference data. To train the model, it is common to use an electron density loss that measures the difference between ρML-KS, the electron density after the KS-SCF calculation with the ML-DFT model for the XC functional or potential, and ρtarget, the target density. If the model output is the XC potential, a potential loss can be used instead, which measures the difference between the ML XC potential and the target or reference XC potential. In this case, the target potential should be pre-computed in the data preparation phase, and the model does not involve any SCF calculation during training. If the model output is the XC energy Exc (or the XC energy density εxc), a loss function in energy can also be added, in addition to reproducing the electron density. This could be combined with other loss functions weighted by some hyper-parameters.
To construct an accurate ML-DFT model, it is important that the model reproduces not only the target energy but also the target electron density. The target electron density is often obtained from expensive ab initio methods. Gradient descent, or its variants, is commonly used for optimization during training. Automatic differentiation during backpropagation allows for effective computation of the gradient with respect to the model parameters. If the density loss is included and the model is coupled with the KS equations, backpropagation requires the inverse eigenvalue problem in the KS equations to be solved before parameter updates. This requires numerical techniques to access the network for parameter updates. Alternatively, reproducing the target density can be enforced by using the potential loss only, as shown in Ref. 14.
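The loss terms discussed in this subsection can be sketched as follows. The squared-error forms, the quadrature weights, and the weighting hyper-parameters `lam_rho` and `lam_e` are assumptions for illustration; the reviewed works differ in the exact norms and normalizations they use.

```python
import numpy as np

def density_loss(rho_ml, rho_target, weights):
    """Quadrature approximation of the integral of (rho_ML-KS - rho_target)^2."""
    return float(np.sum(weights * (rho_ml - rho_target) ** 2))

def potential_loss(v_ml, v_target, weights):
    """Quadrature approximation of the integral of (v_ML - v_target)^2."""
    return float(np.sum(weights * (v_ml - v_target) ** 2))

def energy_loss(e_ml, e_target):
    """Squared error of the predicted energy."""
    return float((e_ml - e_target) ** 2)

def total_loss(rho_ml, rho_target, e_ml, e_target, weights, lam_rho=1.0, lam_e=1.0):
    """Weighted combination of density and energy losses (hypothetical weights)."""
    return (lam_rho * density_loss(rho_ml, rho_target, weights)
            + lam_e * energy_loss(e_ml, e_target))

grid_weights = np.full(4, 0.5)
loss_value = total_loss(np.ones(4), np.zeros(4), 1.5, 1.0, grid_weights, lam_e=2.0)
```

The potential loss requires no SCF calculation during training, whereas the density and energy losses are evaluated on the output of the KS solver.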

C. Physical constraints for ML-DFT models
Although ML techniques have been widely employed in the search for the exact form of the universal XC functional, the resulting MLB XC functionals are seen as black boxes and may not satisfy the physical constraints that the XC functional should obey in principle. For instance, the exchange-energy density of any finite many-electron system satisfies the exact 1/r asymptotic behavior. 58 This theoretical insight may be useful when designing parameterizations of new MLB XC functionals. Moreover, other physical constraints, such as spin scaling 100 for the exchange energy and the Lieb-Oxford bound 101 for the exchange-correlation energy, are derived from fundamental principles of DFT and, thus, can also be used to guide the ML modeling.
Recent efforts have been made to address this issue by designing MLB XC functionals that integrate ML modeling with exact-constraint satisfaction. 102,103 This approach has shown promise in producing ML-constructed XC functionals that satisfy physical constraints and exhibit improved transferability and accuracy over traditional approximations.
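The two constraints named above can be checked numerically for a candidate functional. The sketch below does so for LDA exchange on a hydrogen-like density; it is a minimal illustration under stated assumptions and not a procedure from Refs. 100-103. The Lieb-Oxford constant 1.68 is the commonly quoted value, and the radial midpoint grid and all function names are choices made here.

```python
import numpy as np

C_LDA_X = 0.75 * (3.0 / np.pi) ** (1.0 / 3.0)  # LDA exchange prefactor (atomic units)
C_LO = 1.68                                    # commonly quoted Lieb-Oxford constant (assumption)


def lda_exchange(rho, weights):
    """Spin-unpolarized LDA exchange energy: -C_x * integral of rho^(4/3)."""
    return float(-C_LDA_X * np.sum(weights * rho ** (4.0 / 3.0)))


def lieb_oxford_lower_bound(rho, weights):
    """Lower bound on E_xc: -C_LO * integral of rho^(4/3)."""
    return float(-C_LO * np.sum(weights * rho ** (4.0 / 3.0)))


def satisfies_lieb_oxford(e_xc, rho, weights):
    """True if a candidate XC energy respects the Lieb-Oxford bound."""
    return e_xc >= lieb_oxford_lower_bound(rho, weights)


def spin_scaled_exchange(exchange_functional, rho_up, rho_down, weights):
    """Exact spin scaling: E_x[rho_up, rho_down] = (E_x[2 rho_up] + E_x[2 rho_down]) / 2."""
    return 0.5 * (exchange_functional(2.0 * rho_up, weights)
                  + exchange_functional(2.0 * rho_down, weights))


def hydrogen_like_grid(n=20000, r_max=30.0):
    """Radial midpoint grid with the 1s density rho(r) = exp(-2r) / pi."""
    dr = r_max / n
    r = (np.arange(n) + 0.5) * dr
    weights = 4.0 * np.pi * r**2 * dr
    rho = np.exp(-2.0 * r) / np.pi
    return rho, weights
```

A learned functional could be passed through `satisfies_lieb_oxford` during validation, or wrapped in `spin_scaled_exchange` so that the spin-scaling relation holds by construction for its exchange part.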

VIII. OUTLOOK

A. General quasi-local descriptor formalism
The quasi-local electron density, which contains enough intrinsic information about the molecular system as dictated by the HEDT, is clearly a more suitable descriptor for training an ML-DFT model than either the local electron density or the global one. With quasi-local electron density descriptors, one can parameterize the mapping from electron density to an XC quantity with sufficiently many features to capture the details of the mapping. Once the electron density is given, the XC quantities are uniquely determined.
The general workflow of a quasi-local ML-DFT model is depicted in Fig. 8. The quasi-local electron density distribution $\rho_{\text{in}}(\mathbf{r}; \mathbf{r}_0)$ around $\mathbf{r}_0$ is input as the descriptor to the ML-DFT model; the model outputs the intermediate XC potential $v_{xc}(\mathbf{r}_0)$ or XC energy density $\varepsilon_{xc}(\mathbf{r}_0)$ at $\mathbf{r}_0$, which can be used in the subsequent KS solver. The input electron density function may be obtained by CCSD, the quantum Monte Carlo method, or other high-precision quantum chemistry methods. After the KS solver, a new charge density function $\rho_{\text{new}}$ and other physical properties such as the total energy are obtained, and these can be used to form the loss function to train the ML model by comparing with the high-precision electron density and/or other quantities such as the high-precision energy. Once the training is complete, the resulting ML-DFT model can be employed in the SCF calculation to calculate highly accurate physical properties such as the electron density and total energy of the system.
Alternatively, $\rho_{\text{in}}$ can be calculated via conventional DFT methods, such as B3LYP, as the B3LYP version of $\rho_{\text{in}}$ has a one-to-one correspondence to the higher-precision $\rho_{\text{in}}$ (for instance, CCSD). The advantage is that, once the ML-DFT model is built, no SCF calculation is required to calculate the molecular properties, and the input $\rho_{\text{in}}$ can be obtained from a conventional DFT calculation.
One may extend the NN-based B3LYP functional developed in Ref. 5 into a quasi-local descriptor-based ML-DFT model. In this case, the ML-DFT model outputs a set of space-dependent coefficients $\{a_0, a_X, a_C\}$, which calibrate the original B3LYP functional. We remark that an additional correction term $\Delta E$ of the energy functional can also be added to enhance the model's capability to calculate the absolute energy. Besides the approach of outputting the space-dependent coefficients and the correction term, one may also directly target an energy density, 104 from which the XC potential (by automatic differentiation) or the XC functional (by numerical integration 105) can be obtained.
Recently, a new work based on the quasi-local electron density formulation of the ML-DFT model was published. 106 This is a quasi-local version of the electron density formulation of the NN-based ML-DFT model reported in Ref. 5. Instead of learning the mapping $\rho_{\text{quasi-local}} \to v_{XC}$ from scratch, the model learns space-dependent coefficients that combine three existing functionals as follows:

$$E_{XC}^{\text{MLP}} = \int f_\theta(\mathbf{r}) \cdot \left[\varepsilon_X^{\text{LDA}}(\mathbf{r}),\ \varepsilon^{\text{HF}}(\mathbf{r}),\ \varepsilon^{\omega\text{HF}}(\mathbf{r})\right]^{T} d\mathbf{r}. \quad (14)$$

In Eq. (14), $f_\theta$ is a row vector of 3 elements output by the machine learning model, while $\varepsilon_X^{\text{LDA}}(\mathbf{r})$, $\varepsilon^{\text{HF}}(\mathbf{r})$, and $\varepsilon^{\omega\text{HF}}(\mathbf{r})$ are the local LDA, 56 local Hartree-Fock, and local range-separated Hartree-Fock energy densities (see Ref. 107), respectively. An extra D3 108 correction was added to the ML functional $E_{XC}^{\text{MLP}}$ to produce the final XC energy prediction.
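The local mixing described around Eq. (14) can be sketched on a quadrature grid as follows. This is only an illustration of the bookkeeping: the coefficient array would come from the trained network in practice, and the function name, array shapes, and grid representation are assumptions made here.

```python
import numpy as np


def mixed_xc_energy(coeffs, e_lda_x, e_hf, e_whf, weights):
    """Integrate a locally mixed XC energy density.

    coeffs:  (n_grid, 3) local mixing coefficients, one row per grid point
    e_lda_x, e_hf, e_whf: (n_grid,) local energy densities
    weights: (n_grid,) quadrature weights
    """
    features = np.stack([e_lda_x, e_hf, e_whf], axis=1)   # (n_grid, 3)
    local_xc = np.sum(coeffs * features, axis=1)          # row-wise dot product
    return float(np.sum(weights * local_xc))
```

Any dispersion correction, such as the D3 term mentioned above, would be added to the returned value as a separate step.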
For the fragment energy model, the density-weighted interaction in Eq. (11) can be replaced by a summation over fragment contributions.
As the boundaries between fragments are not always clear-cut [for example, in Eq. (13), the projector $c_\beta$ has a kernel of Gaussian orbital shape 16], multiple fragment energies can contribute to the potential at any given position, allowing for a smooth transition between fragments.
Once the ML-DFT model is trained for a specific type of XC quantity, it can be incorporated into the SCF calculations and subsequently used for post-processing the molecular properties of interest, as in traditional DFT calculations. The quasi-local density descriptor approach has emerged as the mainstream approach to constructing the ML-DFT model. The remaining question is how best to design and represent the quasi-local electron density. Moreover, the electron density is also being used as the target, as it is the key entity in DFT and contains a wealth of information. More research is expected in this direction.

B. ML models for van der Waals interaction
An accurate description of the van der Waals (vdW) interaction is challenging for traditional DFT, as the interaction is weak and arises from the interaction of transient atomic dipoles. While some conventional DFT approximations have shown remarkable performance in certain systems, 109 they often rely on nonlocal quantities that make them difficult to apply in the quasi-local ML-DFT method.
Because the vdW interaction is caused by the interaction among transient atomic dipole moments, it can, in principle, be machine-learned from the electron density. As the vdW interaction is weak, it induces only a minute change in the electron density. The minor changes in density and their corresponding XC potential are both higher-order effects in a perturbative sense rather than the cause of the vdW interaction. It is thus possible to machine-learn the vdW interaction directly from the electron density while ignoring the high-order electron density changes.
Recall that the XC energy and potential can be expanded around a given $\rho_0$ for a small perturbation $\delta\rho$ [Eq. (15)]. It is evident from Eq. (15) that the minor density change resulting from the vdW interaction can be mostly ignored when calculating the XC potential during SCF, within reasonable accuracy requirements. The second term, which accounts for second-order variation in the electron density, is significantly smaller than the first term, as the change in density for the vdW interaction is minimal. However, the energy shift due to the vdW interaction is significant and cannot be neglected. Therefore, including an additional correction term for the vdW interaction after the SCF calculation is a reasonable approach. A separate vdW ML model can be trained using the quasi-local electron density and added to the current ML XC model as an extra correction term to the XC energy.
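For reference, a functional Taylor expansion consistent with this argument can be written as below. This is a sketch under the assumption that the expansion takes the standard second-order form; the kernel symbol $f_{xc}$ is notation introduced here for the second functional derivative and is not taken from the original text.

```latex
\begin{align}
E_{xc}[\rho_0+\delta\rho] &= E_{xc}[\rho_0]
  + \int v_{xc}[\rho_0](\mathbf{r})\,\delta\rho(\mathbf{r})\,d\mathbf{r}
  + \frac{1}{2}\iint f_{xc}[\rho_0](\mathbf{r},\mathbf{r}')\,
    \delta\rho(\mathbf{r})\,\delta\rho(\mathbf{r}')\,d\mathbf{r}\,d\mathbf{r}'
  + O(\delta\rho^{3}), \\
v_{xc}[\rho_0+\delta\rho](\mathbf{r}) &= v_{xc}[\rho_0](\mathbf{r})
  + \int f_{xc}[\rho_0](\mathbf{r},\mathbf{r}')\,\delta\rho(\mathbf{r}')\,d\mathbf{r}'
  + O(\delta\rho^{2}),
\end{align}
% with v_xc = \delta E_xc / \delta\rho and
% f_xc(r, r') = \delta^2 E_xc / \delta\rho(r) \delta\rho(r'), both evaluated at \rho_0.
```

In this form, the correction to $v_{xc}$ is linear in $\delta\rho$ and involves the second functional derivative of $E_{xc}$, so a minute $\delta\rho$ leaves the potential used during SCF nearly unchanged.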
Empirical correction approaches, like the widely-used DFT-D3 method, 108 are computationally efficient but limited in their effectiveness due to their reliance on a few empirical parameters and their sensitivity to specific systems.On the other hand, a customized ML model with a large number of tunable parameters and degrees of freedom may bring significant improvements.
Recently, Proppe et al. employed Gaussian process regression 110 to correct systematic errors in DFT calculations with D3-type dispersion corrections. 111 This model is referred to as D3-GP in the original work. The training data, consisting of 1248 samples of molecular dimers, are the differences between interaction energies obtained from PBE-D3(BJ) 108,112,113 /ma-def2-QZVPP 114,115 and DLPNO-CCSD(T) 116,117 /CBS 118 calculations. Once provided with reference data for new molecular systems, the underlying D3-GP model can learn to adapt to these and similar systems. The D3-GP model outperforms the existing PBE-specific correction schemes 113,119,120 with respect to three different validation sets. One may expect that, with sufficient training data, an ML model for vdW correction is likely to outperform existing empirical models for dispersion correction. Once the ML-vdW model is trained and validated, combining this ML model with the quasi-local ML-DFT model is straightforward.
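The idea of learning an energy difference with Gaussian process regression can be sketched in a few lines. The code below is not the D3-GP implementation: the one-dimensional feature, the synthetic error curve, the squared-exponential kernel, and the hyper-parameters `length_scale` and `noise` are all assumptions chosen to keep the example self-contained.

```python
import numpy as np


def rbf_kernel(xa, xb, length_scale):
    """Squared-exponential kernel between two sets of feature vectors."""
    d2 = np.sum((xa[:, None, :] - xb[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def fit_gp(x_train, delta_train, length_scale=0.5, noise=1e-6):
    """Return GP weights alpha solving (K + noise * I) alpha = delta."""
    k = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    chol = np.linalg.cholesky(k)
    return np.linalg.solve(chol.T, np.linalg.solve(chol, delta_train))


def predict_gp(x_new, x_train, alpha, length_scale=0.5):
    """Posterior mean of the learned correction at new feature vectors."""
    return rbf_kernel(x_new, x_train, length_scale) @ alpha


# Synthetic stand-in data (assumption): a 1D "geometry" feature and a made-up
# systematic error between a cheap and a reference interaction energy.
x_train = np.linspace(0.0, 5.0, 20)[:, None]
delta_train = 0.1 * np.cos(2.0 * x_train[:, 0])
alpha = fit_gp(x_train, delta_train)
```

A corrected energy would then be the cheap estimate plus `predict_gp(...)` evaluated at the features of the new system.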

C. Other future research directions
The full potential of the ML-DFT model can be explored by utilizing larger and more diverse datasets that can significantly benefit the modeling of ML-DFT calculations.By incorporating diverse molecules, chemical environments, and properties, the ML-DFT models can capture finer details of the exchange-correlation interaction and thus improve the model's generalizability.Expanding the dataset to include molecules with various sizes, complexities, and properties would enhance the training of ML-DFT models and enable more accurate representations for the XC quantities.

While maintaining the efficiency of model training would become challenging, larger models with a higher number of parameters may effectively capture intricate features and correlations in the data, leading to improved accuracy and reliability in ML-DFT models.
Recently, the notion of the neural operator 121 and the technique of operator learning 122 have gained much attention from different scientific communities. The goal of operator learning is to seek a direct functional relation that maps elements of one infinite-dimensional space to another. One of the great features of operator learning is that the parameterization of the mapping is discretization invariant, i.e., the resulting mapping from ML models is independent of the resolution of the input and output data, as the operator learning model aims to learn the intrinsic structure of the map between the abstract spaces. One may expect that this approach could benefit the exploration of XC functionals that map electron densities, which are smooth functions of spatial variables, to the energies of the underlying quantum systems. Moreover, by incorporating domain knowledge and physical constraints, ML-DFT models may have better representability for the exchange-correlation quantities, leading to the development of more accurate and physically meaningful XC functionals.

IX. CONCLUDING REMARKS
The explosive development in AI has catalyzed a quick turnover of machine-learning models for density functional theory. From an algorithmic perspective, most of the above-mentioned approaches have focused on applying ML architectures such as artificial or convolutional neural networks to learn the XC functionals. However, other promising candidates, such as graph neural networks (GNNs), recurrent neural networks (RNNs), 123 and transformers, 12 are also being explored for overhauling the design of XC functionals. GNNs extend CNNs toward irregular grids for the electron density or XC potential. RNNs are ideal for time-dependent data and may find profound applications in time-dependent DFT. On the other hand, transformers and other attention-based models allow the model to decide where to pay attention in the electron density or XC potential. Given the subtlety and sensitivity of electron density data in DFT problems, attention-based models may be a good fit.
Here, we have reviewed the machine learning approaches for constructing XC-related quantities (such as the energy functional or potential) in DFT. The review began with a discussion of two pioneering works, continued with ML-DFT models that use global descriptors, and progressed toward more intuitive and transferable quasi-local models, concluding with an additional ML term for the vdW interaction. For the quasi-local descriptor models, we introduced the holographic electron density theorem as the theoretical foundation and presented a series of successful implementation schemes. All quasi-local ML-DFT models (such as the ML XC potential model) share the same fundamental design elements and have deep physical connections. We have presented success stories for these variants, 14,16,21,24 and we encourage readers to read the respective original papers, as well as the open-source codes and examples provided. We hope that new generations of ML-DFT models will accurately construct the universal XC functional of DFT in the near future, revolutionizing the field of quantum chemistry, similar to how AlphaFold 124 has transformed the field of structural biology.

Looking forward, the eventual ML-DFT model for the XC functional should have the following features. First, the descriptors should be made of the quasi-local electron density. Second, the targets should include the high-precision electron density, either as the explicit target or as an implicit one; for instance, in Ref. 5, the explicit target is the XC potential, which, in turn, leads to the high-precision electron density by solving the KS equation. Finally, the XC potential and energy density can be the output or the intermediate target that leads to the target electron density. An additional machine-learning module for the vdW interaction may also be included in the workflow to deal with the weak interaction of transient atomic dipoles. Ultimately, the ML-DFT model combined with the vdW interaction module should be able to accurately reproduce the target energy and the target electron density for any molecular system.

evaluation on this example, one may walk through the following steps:
1. Before getting started, make sure all the prerequisites are installed and work properly.

The scripts in the repo (e.g., run_oep.py, gen_dataset.py, run_train.py, and run_xcnn.py) automate generating data from the WY calculation, collecting data, training the model with the data, and testing the model with the SCF procedure, respectively. Interested readers are advised to follow the README from the GitHub repository (in step 2) for re-compiling PySCF and for additional custom implementations of the codes.
The codes provided within this tutorial cover (i) data generation (with the WY method), (ii) training, and (iii) SCF computation. One can build their own codes based on this GitHub repo for molecules or ions other than H$_2$ in this simple example. Depending on the format of the dataset, one needs to write their own scripts, like run_oep.py, gen_dataset.py, run_train.py, and run_xcnn.py, to automate the whole algorithmic procedure.

APPENDIX B: OPTIMIZED EFFECTIVE POTENTIAL AND DATA GENERATION
The electron densities that are employed to train the ML models can be obtained using highly accurate ab initio methods, such as wave-function-based methods like CCSD. 72 Besides the electron density, the values of the XC potential are also needed. Given a density computed from CCSD, the corresponding XC potential can be calculated by various optimization procedures that effectively invert the KS equations (collectively referred to as inverse Kohn-Sham methods; see also Ref. 128). The optimization procedure employed in Ref. 14 to generate a training dataset is the so-called Wu-Yang (WY) method developed in Ref. 91, which will be briefly elaborated here.
Readers might wonder why, if a numerical optimization procedure can resolve the XC potential from the electron density, we bother training an ML model that does the exact same thing. The answer lies in the core concept of DFT itself. What we want to predict from the ML model is the universal XC functional that maps any density to its corresponding XC potential. On the other hand, the optimization procedure only solves for the system-specific XC potential that is associated with a particular known electron density. The procedure entails only the mathematics of inverting the KS equations, which does not include the physics of the many-particle system at all. In contrast, the ML model tries to learn the intrinsic physics behind it, which is by definition fundamental. The values of electron densities and XC potentials generated by inverse KS methods are fed to the ML model as training data.
The solution to the inverse KS problem is not as straightforward as it first appears. An analytical solution is mostly absent, and different kinds of numerical optimization techniques are usually employed. One of the popular potential optimization schemes was invented by Wu and Yang in Refs. 91 and 92. For a given input density $\rho_{\text{in}}$, one first constructs a Lagrangian, denoted as $W_s$, in terms of the total effective potential (denoted as $v$) and the single-particle wave functions (denoted as $\phi_i$). In practice, the potential is projected onto a set of Gaussian basis functions. 129 Once the effective potential is calculated, the XC potential $v_{xc}$ can be easily found by subtracting the external and the Hartree potentials. 130 With the pair of density and XC potential obtained, the training procedure is decoupled from the KS SCF procedure, and the resulting ML model converts its input $\rho$ into the output $v_{xc}$. Training proceeds with a typical backpropagation procedure with an optimizer using stochastic gradient descent (SGD) 131 or the Adam method. 132 Once large enough data are accessible for various types of molecules and quasi-local environments, the parameters in the ML XC potential model can be better trained and yield a more accurate and universal XC potential of real molecular systems.
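A toy version of the potential optimization can be written for two non-interacting electrons on a 1D grid. The sketch assumes the commonly quoted form of the Wu-Yang functional, $W_s = T_s + \int v(x)\,[\rho(x) - \rho_{\text{in}}(x)]\,dx$, whose gradient with respect to $v$ is $\rho - \rho_{\text{in}}$. It replaces the Gaussian-basis expansion and the optimizer of Refs. 91 and 92 with plain gradient ascent on a finite-difference grid; the function names, step size, and iteration count are choices made here.

```python
import numpy as np


def density_from_potential(v, dx, n_electrons=2):
    """Ground-state density of non-interacting electrons in potential v (1D grid)."""
    n = v.size
    # Second-order finite-difference Laplacian with zero boundary conditions.
    lap = (np.diag(-2.0 * np.ones(n))
           + np.diag(np.ones(n - 1), 1)
           + np.diag(np.ones(n - 1), -1)) / dx**2
    hamiltonian = -0.5 * lap + np.diag(v)
    _, vecs = np.linalg.eigh(hamiltonian)
    phi = vecs[:, 0] / np.sqrt(dx)        # normalize so that sum(phi**2) * dx = 1
    return n_electrons * phi**2           # doubly occupied lowest orbital


def wu_yang_invert(rho_in, dx, step=0.2, n_iter=300):
    """Gradient ascent on W_s: the gradient w.r.t. v is rho[v] - rho_in."""
    v = np.zeros_like(rho_in)
    for _ in range(n_iter):
        rho = density_from_potential(v, dx)
        v += step * (rho - rho_in)        # raise v where there is too much density
    return v
```

The recovered potential is defined only up to a constant and is poorly determined where the density is negligible, so the quality of the inversion is best judged by how well the resulting density matches the input density.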

(iv) the Lee-Yang-Parr correlation functional $E_C^{\text{LYP}}$; 55 and (v) the Vosko-Wilk-Nusair correlation functional $E_C^{\text{VWN}}$

FIG. 1. Left: structure of the ML-DFT model developed in Ref. 13. Right: prediction error. Δn and ΔE represent the errors of the SCF density and total energy with respect to the exact reference, respectively. Reproduced with permission from Nagai et al., J. Chem. Phys. 148, 241737 (2018). Copyright 2018 AIP Publishing LLC.

FIG. 2. (a) The KRR model constructed to represent the density functional, mapping the electron density to either the DFT/CCSD(T) energy or their energy difference; another KRR ML model (ML-HK) was used to map the external potential to the density. (b) Energies [dark blue for CCSD(T) and dark orange for DFT (PBE)] of different water geometries in the training set. (c) The test set MAE (for water geometries other than those in the training set) improves when the training set size increases. (d) The learned DFT (top), CCSD(T) (middle), and energy difference (bottom) surfaces, respectively. In (b) and (d), diamond markers represent minimum energy geometries. Reproduced with permission from Bogojeski et al., Nat. Commun. 11, 5223 (2020). Copyright 2020 Author(s), licensed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/. 28

FIG. 3. The structure of the ML XC energy model by Li et al. 21 that includes KS SCF in the training. The forward and backward propagation passes through the SCF during training are depicted in (a) as black solid and red dashed lines, respectively. (b) The details of one iteration of SCF with the parameterized neural network. The structure that utilizes the quasi-local information of the density to produce the XC energy density is depicted in (c). Instead of ρ, the symbol n is used for the density, which is consistent with the symbol used in the original work. Reproduced with permission from Li et al., Phys. Rev. Lett. 126, 036401 (2021). Copyright 2021 Author(s), licensed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/. 21

(KSR) due to its generalization (preventing overfitting) capability (see also Ref. 71 for a spin-adapted version of the KSR model).

Figure 3(a) depicts the computational procedure of the KSR model, which uses the electron density of the molecule as the model input. The model consists of a fixed number $K$ of SCF iterations [see Fig. 3(b) for the internal process of one SCF iteration], where each SCF iteration is parameterized by a neural network model, whose architecture is sketched in Fig. 3(c), outputting a series of energies $\{E_k\}_{k=1}^{K}$ and densities $\{\rho_k\}_{k=1}^{K}$.

The output of the ML-DFT model is the value of the XC potential at $\mathbf{r}$, and, therefore, once trained, the model predicts the XC potential at the position $\mathbf{r}$ at the center of the sampling neighborhood. The entire XC potential is obtained by sweeping the model across the grid, and the output is used in the KS equation within the SCF procedure to calculate a new density. The above ML-DFT model that uses the quasi-local electron density as the descriptor is termed the quasi-local descriptor-based XC model. In the next section, we review three different types of quasi-local descriptor-based ML-DFT models.

VI. QUASI-LOCAL DESCRIPTOR-BASED XC MODELS
Compared to the local descriptor-based ML-DFT model such as the one proposed in Ref. 5, the quasi-local descriptor-based model can, in principle, be exact and, in practice, is certainly more accurate. This is justified by the HEDT, which states that the ground state electron density uniquely determines the ground state properties of any subdomain and of the total domain of the system. The quasi-local descriptor-based ML-DFT methods are promising.

A. Quasi-local XC potential model
In Ref. 14, Zhou et al. established the rigorous foundation of the quasi-local descriptor-based ML-DFT method and, in addition, developed and implemented its ML-DFT and subsequent KS-SCF algorithm. Quasi-local densities (inputs or descriptors) and XC potentials (labeled data) were discretized on a grid whose points coincide with the set of quadrature points for potential integration.
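The sweep of a quasi-local model across a grid can be illustrated by extracting, for every grid point, the cube of density values centered on it. The sketch below assumes a uniform Cartesian grid with zero padding at the boundary, which is a simplification of the quadrature-grid sampling used in Ref. 14; the function name and the `half_width` parameter are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def quasi_local_descriptors(rho, half_width):
    """Return the cubic density neighborhood around every grid point.

    rho:        (nx, ny, nz) density on a uniform grid
    half_width: number of grid points included on each side of the center
    Output shape: (nx, ny, nz, w, w, w) with w = 2 * half_width + 1.
    """
    padded = np.pad(rho, half_width, mode="constant", constant_values=0.0)
    w = 2 * half_width + 1
    return sliding_window_view(padded, (w, w, w))
```

Each `(w, w, w)` block would be flattened and passed to the model, which returns the XC potential (or energy density) at the central point of that block.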

FIG. 6. The concept of the XC energy by fragments. In practice, the fragments are usually chemically meaningful parts, like the electron density around each nucleus (atom) in a molecule. From Wu et al., Quantum Chemistry in the Age of Machine Learning (pp. 531-558). Copyright (2023) Elsevier. Reprinted with permission from Elsevier.

FIG. 8. A general quasi-local ML-DFT modeling workflow. The ML-DFT model takes the quasi-local electron density (or other descriptors) $\rho_{\text{in}}(\mathbf{r}; \mathbf{r}_0)$ and outputs an intermediate XC quantity (either $v_{xc}$ or $\varepsilon_{xc}$); IM/EX simply indicates whether there are labels of the targeted quantity in the training steps. Then, the produced XC quantity is used in the KS solver to generate a new electron density $\rho_{\text{new}}$ and thus the total energy $E_{\text{tot}}$ (or any other quantities of interest).

FIG. 9. (a) A typical SCF run (with a reasonable bond distance) for the trained model of the H$_2$ example will produce density errors comparable to NN. (b) The I value will be comparable to the corresponding value on the NN curve, which is significantly lower than the error of B3LYP. From Wu et al., Quantum Chemistry in the Age of Machine Learning (pp. 531-558). Copyright (2023) Elsevier. Reprinted with permission from Elsevier.

higher-order derivatives of electron density in the descriptors, as pointed out by Tozer et al. in Ref. 4.

the target or reference electron density, and $E_{\text{train}}[\cdot]$ indicates the averaging operation over the training set. If the ML-DFT model is constructed to output the XC potential, one may skip solving the KS equations during training and impose the loss function in XC potential as

2. Create and enter a new folder; download the code and dataset by typing

Here, all training settings and hyper-parameters are defined in the .cfg file; to write a new .cfg file for a different configuration, please refer to the README file provided with the GitHub repo.
5. Training will start on the provided H$_2$ dataset; by default, the number of epochs is 1000.
6. Perform SCF calculations on the newly trained model by typing
$ python ../xcnn/main.py test.cfg
7. One can check the SCF performance of the model by examining the output file generated. A typical run for a small molecule like H$_2$ should result in an error at the level of $10^{-5}$ to $10^{-7}$ in terms of the I value. Since only one H$_2$ structure is included in the simple_H2 training set, the error could be larger. Here, the I value between two (possibly different) densities is defined to be

$$I = I[\rho_1, \rho_2] = \frac{\int |\rho_1(\mathbf{r}) - \rho_2(\mathbf{r})|^2 \, d\mathbf{r}}{\int |\rho_1(\mathbf{r})|^2 \, d\mathbf{r} + \int |\rho_2(\mathbf{r})|^2 \, d\mathbf{r}}.$$

This tutorial is centered on a pre-built dataset from one H$_2$ structure for both training and SCF. To reproduce the result from the original paper, 14 a modified and re-compiled version of PySCF is needed for generating WY target data from scratch with the codes in the folder oep-wy (while the SCF part only needs the vanilla version of PySCF). One can refer to the README in the GitHub repo for more details on installing a custom version of the PySCF package.
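The I value used to measure the density error in this tutorial can be evaluated on a quadrature grid in a few lines. The helper below is a sketch written for this text and is not part of the referenced repository; the function name and the use of explicit quadrature weights are assumptions.

```python
import numpy as np


def i_value(rho1, rho2, weights):
    """Normalized integrated squared difference between two densities.

    rho1, rho2: densities sampled on the same grid, shape (n_grid,)
    weights:    quadrature weights, shape (n_grid,)
    """
    numerator = np.sum(weights * np.abs(rho1 - rho2) ** 2)
    denominator = (np.sum(weights * np.abs(rho1) ** 2)
                   + np.sum(weights * np.abs(rho2) ** 2))
    return float(numerator / denominator)
```

Identical densities give an I value of zero, so smaller values indicate closer agreement between the SCF density and the reference.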