We present a local and transferable machine-learning approach capable of predicting the real-space density response of both molecules and periodic systems to homogeneous electric fields. The new method, Symmetry-Adapted Learning of Three-dimensional Electron Responses (SALTER), builds on the symmetry-adapted Gaussian process regression framework of its predecessor, SALTED (Symmetry-Adapted Learning of Three-dimensional Electron Densities). SALTER requires only a small, but necessary, modification to the descriptors used to represent the atomic environments. We demonstrate the performance of the method on isolated water molecules, bulk water, and a naphthalene crystal. Root mean square errors of the predicted density response lie at or below 10% with barely more than 100 training structures. Polarizability tensors derived from the predicted response, and Raman spectra derived in turn from these tensors, show good agreement with those calculated directly from quantum mechanical methods. SALTER therefore performs excellently when predicting derived quantities, while retaining all of the information contained in the full electronic response. This method is thus capable of predicting vector fields in a chemical context and serves as a landmark for further developments.

Machine-learning (ML) tools are now widely used in computational chemistry to reduce the cost of simulations and aid the interpretation of data.^{1–4} ML models are often employed to predict energies and forces^{5–10} and drive molecular dynamics simulations.^{11–14} This has enabled *ab initio* quality simulations on timescales and system sizes that are completely inaccessible to conventional *ab initio* molecular dynamics (AIMD) approaches.^{14–18} However, this approach comes at the cost of losing access to other properties directly related to the electronic density. These must be obtained by some independent method, a need which has triggered the development of many recent models,^{19–23} including a number capable of predicting the electronic polarizability.^{10,24–26} We have recently developed SALTED (Symmetry-Adapted Learning of Three-dimensional Electron Densities), a machine-learning method that predicts the real-space electronic density of molecular and condensed-phase systems,^{27,28} based on earlier work that predicted the electron densities of molecules.^{29,30} SALTED may be applied in tandem with machine-learned potentials, providing access to every ground-state electronic-structure property that would be available in a traditional AIMD calculation.

In this article, we present SALTER (Symmetry-Adapted Learning of Three-dimensional Electron Responses). SALTER builds on SALTED to predict the static real-space *response* of the electron density of a molecule or material to a homogeneous electric field. This response defines the global dielectric susceptibility of the system, which is needed to simulate many flavors of vibrational Raman and sum-frequency spectra,^{31–37} and defines the high-frequency electronic screening that impacts a wide range of static and transport properties of molecules and materials.^{38–42} The computational burden of calculating this electronic response is considerable—the most commonly employed method, density-functional perturbation theory (DFPT),^{43} is typically 4–10 times more expensive than the underlying density functional theory (DFT) calculation.^{42,44} As such, this calculation is a bottleneck when evaluating dielectric properties along long molecular dynamics trajectories. Machine-learning approaches have recently been used to predict the non-local density-density response function, from which the density response to an applied field can be derived.^{45,46} Here, we target the density response itself, developing a machine-learning method that provides more information than direct predictions of the polarizability while being less cumbersome than predictions of the density-density response function.

As in our previous work,^{27,28} SALTER is based on an atom-centered expansion of the density response, with the coefficients of the expansion predicted using the symmetry-adapted Gaussian process regression (SA-GPR) approach.^{47} This produces a method that is local, transferable, and capable of treating molecular and periodic systems on the same footing.^{27} The key modification introduced in this work is the inclusion of a “dummy” atom in the representation of each atomic environment, which encodes information about the direction of the applied perturbation in a simple way while retaining all of the symmetry properties of the representation.

We begin with a brief summary of the SALTED framework used to predict the electron density *ρ*,^{27,28} and highlight the key differences to the previous method. For a more detailed account, we refer the reader to Refs. 27 and 28. The density response $\rho^{(1)}_\beta$ to a field applied along a Cartesian axis *β* can be expanded using a set of auxiliary basis functions *ϕ*, such that

$$\tilde{\rho}^{(1)}_\beta(\mathbf{r}) = \sum_{i\sigma\mathbf{U}} c_{i\sigma\beta}\, \phi_{i\sigma}\big(\mathbf{r} - \mathbf{R}_i - \mathbf{T}(\mathbf{U})\big), \qquad (1)$$

where **R**_{i} is the position of atom *i*, and the basis function *ϕ*_{iσ} is centered on atom *i* and is the product of a radial part *R*_{n}(*r*) and a real spherical harmonic *Y*_{λμ}(*θ*, *ϕ*); for brevity, we use a composite index *σ* ≡ (*anλμ*), where *a* labels the atomic species. **T**(**U**) is a translation vector to a point removed from the central reference unit cell by an integer multiple **U** = (*U*_{x}, *U*_{y}, *U*_{z}) of the lattice vectors, present only when the system is periodic. The optimal coefficients for this expansion can be found by minimizing the loss function

$$l(\mathbf{c}_\beta) = \big|\mathbf{S}\mathbf{c}_\beta - \mathbf{w}^{(1)}_\beta\big|^2, \qquad (2)$$

where **S** is the overlap matrix of the basis functions, and $\mathbf{w}^{(1)}_\beta$ is a vector of the projections of the self-consistent density response $\rho^{(1),\mathrm{QM}}_\beta(\mathbf{r})$ onto the basis,

$$w^{(1)}_{i\sigma\beta} = \int \mathrm{d}\mathbf{r}\; \phi_{i\sigma}(\mathbf{r} - \mathbf{R}_i)\, \rho^{(1),\mathrm{QM}}_\beta(\mathbf{r}). \qquad (3)$$

Rather than solving Eq. (2) directly for every new structure, the expansion coefficients are predicted using a set of regression weights *b*_{jσβ} associated with *M* reference environments and a covariant kernel k_{σβ}(*A*_{i}, *M*_{j}) that describes the similarity between the reference environments $M_j$ and the atomic environments that comprise the target structure $A_i$,

$$c^{\mathrm{ML}}_{i\sigma\beta} = \sum_{j=1}^{M} \mathbf{k}_{\sigma\beta}(A_i, M_j)\, \mathbf{b}_{j\sigma\beta}, \qquad (4)$$

where the *b*_{jσβ} are vectors of length 2*λ* + 1 and k_{σβ}(*A*_{i}, *M*_{j}) is a square matrix of the same dimension. Defining and minimizing a loss function analogous to Eq. (2) allows us to determine the regression weights **b**_{β},

$$l(\mathbf{b}_\beta) = \sum_{A=1}^{N} \big|\mathbf{S}_A \mathbf{K}_{AM}\, \mathbf{b}_\beta - \mathbf{w}^{(1)}_{A\beta}\big|^2 + \eta\, \mathbf{b}_\beta^{\mathrm{T}} \mathbf{K}_{MM}\, \mathbf{b}_\beta, \qquad (5)$$

leading to the linear problem

$$\big(\mathbf{K}_{NM}^{\mathrm{T}}\, \mathbf{S}\, \mathbf{K}_{NM} + \eta\, \mathbf{K}_{MM}\big)\, \mathbf{b}_\beta = \mathbf{K}_{NM}^{\mathrm{T}}\, \mathbf{w}^{(1)}_\beta, \qquad (6)$$

where *A* runs over each of the *N* structures in the training set, the matrix **K**_{MM} couples the reference atomic environments to one another, and we introduce the hyperparameter *η* to control the regularization of the minimization. In all the examples presented, we use a regularization parameter of *η* = 10^{−8}, finding only a weak dependence of the observed error on *η*. Equation (6) is solved iteratively using the conjugate gradient algorithm; for a detailed description of the implementation, see Ref. 28.
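The linear algebra behind Eq. (6) can be sketched in a few lines of NumPy. Everything below is a toy stand-in: the dimensions and data are synthetic, and a dense direct solve replaces the conjugate gradient solver used in the actual implementation of Ref. 28.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: hypothetical sizes, far smaller than a real training set.
n_basis, n_ref, eta = 40, 10, 1e-8

# K_NM: kernel between training environments and the M reference environments;
# S: overlap matrix of the auxiliary basis; w: density-response projections.
K_NM = rng.normal(size=(n_basis, n_ref))
K_MM = K_NM.T @ K_NM / n_basis        # a positive-semidefinite surrogate
S = np.eye(n_basis)                   # overlap taken as identity for simplicity
c_true = rng.normal(size=n_basis)
w = S @ c_true                        # projections w = S c

# Normal equations of Eq. (6): (K^T S K + eta K_MM) b = K^T w
A = K_NM.T @ S @ K_NM + eta * K_MM
rhs = K_NM.T @ w
b = np.linalg.solve(A, rhs)           # conjugate gradients in practice

# Predicted expansion coefficients for a structure, in the spirit of Eq. (4)
c_pred = K_NM @ b
```

In production the kernels are block diagonal in *σ* and covariant in *μ*, so this solve is repeated over many small blocks rather than one dense system.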

Equation (6) contains no explicit constraint to ensure that $\u222b\rho \u0303\beta (1),ML(r;b\beta )dr=0$. Nevertheless, we find that the charge conservation error in the examples presented is negligible: applying an electric field of 0.1 atomic unit (5.14 V/Å) leads to charge conservation errors typically below 10^{−4} per electron. This is consistent with our previous observations using SALTED to predict the electron density,^{27} in which case there is also no enforcement of charge conservation.^{48}
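Such a check is a dot product once the integrals of the auxiliary basis functions are known (they vanish for all *λ* > 0), and a uniform *post hoc* correction amounts to projecting the spurious charge out. The function names and the specific projection used below are illustrative assumptions, not part of the SALTED/SALTER code.

```python
import numpy as np

def total_charge(coeffs, basis_integrals):
    """Integral of the expanded density response: sum_i c_i * int(phi_i dr).
    Only lambda = 0 auxiliary basis functions have nonzero integrals."""
    return float(np.dot(coeffs, basis_integrals))

def charge_correction(coeffs, basis_integrals):
    """Hypothetical uniform post hoc correction: shift the coefficients along
    the basis-integral vector so that the net charge vanishes exactly."""
    q = total_charge(coeffs, basis_integrals)
    norm2 = np.dot(basis_integrals, basis_integrals)
    return coeffs - q * basis_integrals / norm2
```

Applying `charge_correction` leaves all *λ* > 0 coefficients untouched, since their basis integrals are zero.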

The key difference between SALTER and SALTED lies in the similarity kernel k_{σβ}(*A*_{i}, *M*_{j}). In Refs. 27 and 28, kernels are constructed from the *λ*-SOAP representation of the atomic environments *A*_{i} and *M*_{j}.^{47} These kernels capture the covariant transformation of the spherical harmonics under rotation; the expansion coefficients must possess the same symmetry, since they multiply auxiliary basis functions whose angular parts are spherical harmonics. As such, they are independent of the laboratory frame. However, in the present case, the similarity kernel k_{σβ}(*A*_{i}, *M*_{j}) must retain all of the properties exploited in previous work, while also encoding the laboratory-frame direction along which the perturbing field is applied.

Our solution is simple. An additional “dummy” atom of atomic number zero is added to every atomic environment on the *β* axis, differentiating this axis from the others. The kernels are then calculated using *λ*-SOAP representations of these modified atomic environments in the normal way. This dummy atom is included in the kernels for all values of *λ*, ensuring that the kernels retain all of the required symmetry properties, while including (symmetry-adapted) information about the laboratory frame direction of the applied field. This idea is illustrated in Fig. 1 and defined mathematically below.
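In practice this amounts to appending one extra site to the neighbor list of each environment before the descriptors are computed. A minimal sketch (the function name and signature are hypothetical, not the TENSOAP API):

```python
import numpy as np

def add_dummy_atom(positions, species, beta, x=0.3, r_c=4.0):
    """Append a species-0 'dummy' atom at x * r_c along Cartesian axis beta
    (0 = x, 1 = y, 2 = z) to an atomic environment centred at the origin.
    The augmented environment is then fed to the descriptor code as usual."""
    dummy = np.zeros(3)
    dummy[beta] = x * r_c
    return np.vstack([positions, dummy]), np.append(species, 0)
```

Because the dummy atom carries atomic number zero, it never collides with a physical species label, and one augmented environment is built per field direction *β*.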

The *λ*-SOAP kernel is defined as^{47}

$$\mathbf{k}^{\lambda}(A_i, M_j) = \int \mathrm{d}\hat{R}\; \mathbf{D}^{\lambda}(\hat{R}) \left| \int \mathrm{d}\mathbf{r}\; n_i(\mathbf{r})\, n_j(\hat{R}\mathbf{r}) \right|^2, \qquad (7)$$

where the integral runs over all rotations $\hat{R}$ of order *λ*, and $\mathbf{D}^{\lambda}(\hat{R})$ is the Wigner matrix corresponding to rotation $\hat{R}$. The atomic environment around atom *i* is described as a sum of Gaussian functions of width *ν* centered at the position of every atom **R**_{j} that lies within some radius *r*_{c} of the central atom,

$$n_i(\mathbf{r}) = \sum_{j} \exp\!\left(-\frac{|\mathbf{r} - \mathbf{R}_j|^2}{2\nu^2}\right) \Theta\big(r_c - |\mathbf{R}_j|\big), \qquad (8)$$

where Θ(*x*) is the Heaviside step function. To include information about the direction of an applied field, we include a "dummy" atom in the atomic environment, adding an additional term to Eq. (8),

$$n_{i\beta}(\mathbf{r}) = n_i(\mathbf{r}) + \exp\!\left(-\frac{|\mathbf{r} - x\, r_c\, \hat{\mathbf{r}}_\beta|^2}{2\zeta^2}\right), \qquad (9)$$

where *x* is a parameter between 0 and 1 specifying the position of the dummy atom and $\hat{\mathbf{r}}_\beta$ is a unit vector in the Cartesian direction *β*. Note that the width of the Gaussian associated with the dummy atom is not required to be the same as that associated with the physical atoms; in general, *ζ* ≠ *ν*. The *λ*-SOAP kernel k_{σβ} that accounts for the direction of the applied field is then obtained by replacing *n*_{i}(**r**) with *n*_{iβ}(**r**) in Eq. (7).
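The symmetry-breaking role of the dummy atom can be made concrete by evaluating Eqs. (8) and (9) directly. In this sketch (the helper function is illustrative; parameter defaults follow values quoted in the text), an environment that is mirror-symmetric about the *xy* plane yields identical densities at ±*z* without the dummy atom, but distinct ones once it is included:

```python
import numpy as np

def env_density(r, atoms, beta=None, nu=0.3, zeta=1.0, x=0.3, r_c=4.0):
    """Evaluate the environment density of Eq. (8) at point r, for atoms given
    as positions relative to the central atom (assumed inside the cutoff),
    plus the dummy-atom term of Eq. (9) when a field direction beta is given."""
    n = sum(np.exp(-np.sum((r - R) ** 2) / (2 * nu ** 2)) for R in atoms)
    if beta is not None:
        dummy = np.zeros(3)
        dummy[beta] = x * r_c          # dummy atom at x * r_c along axis beta
        n += np.exp(-np.sum((r - dummy) ** 2) / (2 * zeta ** 2))
    return n
```

With `beta=2` the density is larger on the +*z* side than on the −*z* side, which is exactly the information the unmodified descriptor cannot carry.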

We quantify the accuracy of the predicted density response using the percentage root mean square error,

$$\%\ \mathrm{RMSE} = 100 \times \sqrt{\frac{\sum_A \big(\mathbf{c}^{\mathrm{ML}}_A - \mathbf{c}_A\big) \cdot \big(\mathbf{S}_A \mathbf{c}^{\mathrm{ML}}_A - \mathbf{w}^{(1)}_A\big)}{\sum_A \Delta\bar{\mathbf{c}}_A \cdot \Delta\bar{\mathbf{w}}^{(1)}_A}}, \qquad (10)$$

where the sums run over the *N*_{t} structures in the test set, with $\Delta\bar{\mathbf{c}}_A = \mathbf{c}_A - \bar{\mathbf{c}}$ and $\Delta\bar{\mathbf{w}}^{(1)}_A = \mathbf{w}^{(1)}_A - \bar{\mathbf{w}}^{(1)}$; $\bar{\mathbf{c}}$ and $\bar{\mathbf{w}}^{(1)}$ are the mean values for the vectors of coefficients and density projections in the dataset.
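Because the auxiliary basis is non-orthogonal, errors must be measured in the metric of the overlap matrix. A sketch of the % RMSE metric described above (the array shapes and the explicit S-metric form are assumptions of this illustration):

```python
import numpy as np

def pct_rmse(c_ml, c_qm, S):
    """% RMSE of predicted coefficients in the metric of the overlap matrix S,
    normalised by the spread of the test set about its mean.
    c_ml, c_qm: arrays of shape (N_t, n_basis); S: (n_basis, n_basis)."""
    num = sum(d @ S @ d for d in (c_ml - c_qm))
    c_bar = c_qm.mean(axis=0)
    den = sum(d @ S @ d for d in (c_qm - c_bar))
    return 100.0 * np.sqrt(num / den)
```

By construction, a perfect prediction scores 0%, while a model that always predicts the dataset mean scores 100%.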

We used the FHI-aims software package^{49} to obtain the training data **S** and $w\beta (1)$ and reference density responses $\rho \beta (1),QM$ using density functional theory^{50,51} employing the local-density approximation (LDA) functional.^{50} In all cases, “light” basis sets were used, and the auxiliary basis functions introduced in Eq. (1) are constructed by taking all possible products of these basis functions and then eliminating linear dependencies from the resulting basis set.^{52} The molecular dynamics simulations were performed using i-PI.^{53} We note that the “light” settings used are not the most accurate for the prediction of these properties but are sufficient to test the quality of the ML model. Increasing the model accuracy only requires calculating new training data, which we will show below is not very costly.

First, we tackle the simple example of an isolated water molecule. We learn the density response to an electric field for a dataset of 1000 distorted configurations taken from Ref. 47 using both the usual kernels k_{σ} and the modified kernels k_{σβ}. Each molecule in the dataset is aligned to a reference molecule, as shown in Fig. 2. We use the SOAP parameters *r*_{c} = 4.0 Å and *ν* = 0.3 Å, as is common for water,^{47} and choose *ζ* = *ν* and *x* = 0.9 to define the dummy atom. The choice of these parameters will be discussed in more detail below and makes little difference to the results in this simple example. The dataset was split into a training set of 250 and a test set of 750 structures; *M* = 300 reference environments were used.

The accuracy of SALTER density responses for this simple dataset is summarized in the table in Fig. 2. The first row shows that while aligning the molecules is sufficient to allow reasonably accurate learning of the density response to a field applied along the *y* axis, the learning entirely fails for fields applied along the *x* or *z* axes. This can be explained by symmetry: since the molecules lie in the *xy* plane, the SOAP descriptors are invariant to reflections in that plane; there is no differentiation between a field applied in the +*z* and −*z* directions, resulting in errors approaching 100%. By contrast, once the dummy atom is introduced, this ambiguity is removed. As a result, using the modified kernels, we see errors of around 1% regardless of the field direction.^{54}

This example also illustrates a trade-off inherent to SALTER. Since the molecules are aligned, the Cartesian axes are nonequivalent, so a separate machine-learning model must be trained for each nonequivalent direction. This additional cost is offset by the conceptual simplicity of the approach, by the computational simplicity of calculating the descriptors, and by the fact that this drawback arises only in the specific case of nonequivalent axes. It is not necessary to align systems for this approach to succeed; the alignment here serves a purely pedagogical purpose.

Having established that our approach works in principle, we turn to a more realistic example: a cubic water cell of side 9.67 Å containing 32 water molecules. Our dataset consists of 500 configurations, divided into a training set of 400 structures and a test set of 100 structures; this is the same dataset as used in Ref. 28. Because there are now physical atoms distributed throughout each spherical atomic environment, it is not clear *a priori* what the optimal values for *x* and *ζ* in Eq. (9) would be. We performed a search of the parameter space using a subset of 50 structures from the full training set, finding the lowest errors when *x* = 0.3 and *ζ* = 1.0 Å. The results show only a weak dependence on *x*, since the role of the dummy atom is simply to break the symmetry of the atomic environment—the precise position of the atom in the environment, as long as it lies along the relevant Cartesian direction, is unimportant. Full details of this optimization are provided in the supplementary material.

The learning curves for the density response to a field applied along each Cartesian axis are shown in Fig. 3. Learning improves monotonically with increasing training set size, with the error reaching a plateau of around 12% for *N* ≥ 100. We have used *M* = 3000 reference environments for these calculations (see convergence with respect to *M* in the supplementary material). As the Cartesian directions are all equivalent, the machine-learning model trained on the density response to a field applied along the *x* axis is also used to predict the response to a field along the *y* and *z* axes; the errors observed are very similar regardless of the field direction, as would be expected by symmetry.

The learning curve for this dataset saturates at a relatively large % RMSE: around three times larger than the error in the predicted densities we reported in our previous work.^{28} The reason for this could be a deficiency of the descriptor or a limitation of the training data. Our descriptors are short ranged: any correlations in the density response that extend beyond the 4 Å cutoff radius of the atomic environment will be lost, limiting the accuracy of predictions. Alternatively, more accurate predictions may be possible if the training data were obtained using a larger basis set, since this allows a more accurate expansion of the density response and reduces the error in the training data, which in turn reduces the noise in the machine-learning model.

The most direct test of the predicted density response is to use it to compute the polarizability tensor (for condensed phases, the dielectric susceptibility),^{42}

$$\alpha_{\gamma\beta} = \int \mathrm{d}\mathbf{r}\; r_\gamma\, \rho^{(1)}_\beta(\mathbf{r}), \qquad (11)$$

which may alternatively be evaluated non-self-consistently within DFPT,^{42,43} having obtained the unperturbed Kohn–Sham orbitals from a ground-state calculation or an ML prediction.^{55} We denote the tensor defined by the density response found by a self-consistent DFPT calculation as *α*^{QM}, while *α*^{ML} is derived from the predicted density response; the error in *α*^{ML} relative to *α*^{QM} provides a measure of the accuracy of the predicted density responses.
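Given a predicted response on a real-space grid, the integral defining the polarizability reduces to a simple quadrature. As a self-contained numerical check (the model density, grid, and sign convention are illustrative; the analytic value of this particular integral is π^{3/2}/2):

```python
import numpy as np

# Quadrature of alpha_zz = int z * rho_z(r) dr, cf. Eq. (11), for a model
# density response rho_z(r) = z * exp(-r^2).
g = np.linspace(-6.0, 6.0, 121)          # uniform grid, spacing 0.1
dx = g[1] - g[0]
X, Y, Z = np.meshgrid(g, g, g, indexing="ij")
rho_z = Z * np.exp(-(X**2 + Y**2 + Z**2))
alpha_zz = np.sum(Z * rho_z) * dx**3     # Riemann sum; tails are negligible
```

The sum agrees with the analytic result π^{3/2}/2 ≈ 2.784 to well below the accuracy relevant here, since the integrand decays to zero long before the grid boundary.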

Table I gives the errors in different elements of *α*^{ML} derived from the density response predicted by the machine-learning model with *N* = 400, *M* = 3000. In all cases, errors are around 5%; this compares favorably to previous direct machine-learning predictions of *α*^{ML}, which typically find errors above 10% for models trained on a few hundred structures, reducing to 5% only when trained on a few thousand structures.^{24–26} Indeed, a direct prediction of the dielectric susceptibilities of this dataset yielded errors between 20% and 30% using the same training set and learning parameters. Therefore, for a given accuracy, our indirect method of predicting *α*^{ML} reduces the computational effort required to generate training data by an order of magnitude relative to a direct prediction of *α*^{ML}.

**TABLE I.** % RMSE in the components of *α*^{ML} for the bulk water test set.

| Component | xx | yy | zz | xy | xz | yz |
|---|---|---|---|---|---|---|
| % RMSE | 5.03 | 4.88 | 4.39 | 5.11 | 4.91 | 4.31 |


The smaller errors in *α*^{ML} than in *ρ*^{(1),ML} can be understood from Eq. (11). If errors in the density response accumulate in regions of space that do not contribute significantly to the integral, one would expect lower errors in *α*^{ML} than in the density response from which it is derived. In Ref. 28, we discussed a similar dependence of the error in properties derived from the electron density on the spatial distribution of errors in the density itself.
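This spatial-distribution effect can be reproduced with a toy model: two Gaussian error blobs of opposite sign and fixed magnitude, separated along *z*, contribute an error to *α*_{zz} through the integral of Eq. (11) that grows linearly with their separation, even though the norm of the error itself is unchanged. (The blob model and grid are illustrative assumptions, not derived from the paper's data.)

```python
import numpy as np

def alpha_zz_error(separation, n=101, L=6.0):
    """alpha_zz error induced by a +/- pair of unit Gaussian error blobs in
    the density response, separated along z. Their L2 norm is independent of
    the separation, but the dipole-weighted integral is not."""
    g = np.linspace(-L, L, n)
    dx = g[1] - g[0]
    X, Y, Z = np.meshgrid(g, g, g, indexing="ij")
    blob = lambda z0: np.exp(-(X**2 + Y**2 + (Z - z0)**2))
    err = blob(separation / 2) - blob(-separation / 2)
    return np.sum(Z * err) * dx**3

# Doubling the separation between the blobs doubles the error in alpha_zz.
```

Analytically the error equals (separation) × π^{3/2}, so widely separated error regions of opposite sign are far more damaging to the derived susceptibility than the same error concentrated in one place.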

Finally, we use SALTER to predict a quantity that is “twice removed” from the target prediction of the model: the vibrational Raman spectrum of the P2_{1}/a naphthalene crystal at 80 K. We used five 10 ps NVE *ab initio* molecular dynamics trajectories of a 2 × 2 × 1 crystal supercell (eight molecules) using the Perdew–Burke–Ernzerhof (PBE) functional^{56} with many-body dispersion corrections.^{57,58} To train the machine-learning model, we selected 100 structures from each of these trajectories, dividing these structures into a training set of 400 configurations and a test set of the remaining 100. The learning curves for the density response to a field applied along each Cartesian direction are shown in Fig. 4. These curves are converged with respect to *M* (see the supplementary material) and we see no significant difference in performance when predicting the response to a field applied along each Cartesian axis. The dataset is relatively homogeneous: a single training structure is sufficient to produce an ML model accurate to within 12%, and no improvement is seen when the training set size is increased beyond 100.

We then used the predicted density responses to calculate *α*^{ML} for each structure in the test set. Curiously, the errors given in Table II are significantly larger than those found in bulk water (Table I), despite the % RMSE of the density response itself being significantly lower for naphthalene. In contrast to the water dataset, these indirect predictions also underperform direct predictions of the dielectric susceptibility, which show errors of around 7% using the same training data and machine-learning parameters. Furthermore, the errors in the components of *α*^{ML} derived from a response to a field applied in the *z* direction are the largest, despite the errors in this response being the lowest (see crystal cell orientation in Fig. 4). This counter-intuitive result is again due to the fact that the errors in *α*^{ML} depend not just on the magnitude of the error but also its spatial distribution. Specifically, for a given magnitude of error, the greater the distance between the regions of positive and negative error, the greater the error in the dielectric susceptibility derived from the density response—as we observe along the *z* direction of naphthalene. This is illustrated in more detail in the supplementary material. This also suggests a possible solution: adding auxiliary basis functions with larger values of *λ* will likely improve the description of the electron density response far from the nuclei, resulting in a reduction in the error in the dielectric susceptibility.

**TABLE II.** % RMSE in the components of *α*^{ML} for the naphthalene test set.

| Component | xx | yy | zz | xy | xz | yz |
|---|---|---|---|---|---|---|
| % RMSE | 9.13 | 14.40 | 19.74 | 9.38 | 12.81 | 20.11 |


We compared the Raman spectrum obtained using *α*^{QM} with one obtained from *α*^{ML}, derived in turn from SALTER predictions of *ρ*^{(1),ML}. The anharmonic Raman spectrum of a disordered naphthalene sample (the powder spectrum) was calculated using ensemble-averaged time-correlation functions, as in Ref. 32. The spectra are shown in Fig. 5. Good agreement is seen for the polymorph-sensitive low-frequency modes, with small discrepancies in the peak intensities appearing at higher frequencies. Using SALTER, we obtain Raman spectra of comparable quality to those obtained using *ab initio* methods and to those obtained from a direct learning of *α*^{QM},^{24} while dramatically reducing the simulation cost and retaining access to all the information included in the real-space distribution of *ρ*^{(1)}(**r**).
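The final step from a polarizability time series to a spectrum can be sketched via the Wiener–Khinchin theorem. This bare-bones version omits the quantum correction factors and the isotropic/anisotropic decomposition used for the powder average in Ref. 32, and the sampling parameters are arbitrary:

```python
import numpy as np

def power_spectrum(a_t, dt):
    """Power spectrum of a mean-removed polarizability-component time series;
    by the Wiener-Khinchin theorem, |FFT|^2 equals the Fourier transform of
    the series' autocorrelation function."""
    a = a_t - a_t.mean()
    spec = np.abs(np.fft.rfft(a)) ** 2 / len(a)
    freq = np.fft.rfftfreq(len(a), dt)
    return freq, spec

# A polarizability component oscillating at 5 THz yields a peak at 5 THz.
dt = 0.001                       # time step in ps -> frequencies in THz
t = np.arange(4000) * dt         # a 4 ps trajectory
freq, spec = power_spectrum(np.cos(2 * np.pi * 5.0 * t), dt)
```

In a real calculation `a_t` would be one component of *α*^{ML} evaluated along the molecular dynamics trajectory, and the spectra from several trajectories would be ensemble averaged.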

In conclusion, we presented a simple modification to the widely used *λ*-SOAP descriptors, which allows us to use machine learning to predict a vector field; specifically, the response of the electron density to a static electric field. By adding a “dummy” atom to each atomic environment that indicates the direction of the applied field, we retain all of the beneficial properties of the *λ*-SOAP kernels, which allowed us to predict the scalar electron density field in previous work,^{27,28} while including (symmetry-adapted) information about the laboratory frame direction of the applied field. Therefore, we expect the method developed here to be straightforward to generalize to the prediction of other vector fields. To the best of our knowledge, this is the first reported machine-learning model capable of predicting vector fields in a chemical context. We note that, after the initial submission of this paper, a very similar model was developed independently and used to predict the electron density of systems under a finite field.^{59}

We applied SALTER to two condensed phase systems: liquid water and the P2_{1}/a naphthalene crystal. A single machine-learning model successfully predicted the density response of water to a field applied along each of the Cartesian axes; we derived the dielectric susceptibility from these predictions, finding values within 5% of the reference DFPT calculations. For naphthalene, we constructed machine-learning models of similar accuracy regardless of the direction of the applied field. We found a counter-intuitive relationship between the errors in the density response, which were smallest when the field was applied along the *z* axis, and the errors in the corresponding dielectric susceptibilities, which were the largest for the *yz* and *zz* components of the tensor. This apparent contradiction can be explained by analyzing the spatial distribution of the error in the predicted density response. Nevertheless, the Raman powder spectrum obtained from these indirectly predicted dielectric susceptibilities was very similar to that obtained using DFPT, at a fraction of the computational cost. In general, this method produces accurate derived quantities while retaining the full information contained in the electron density response to an applied field.

## SUPPLEMENTARY MATERIAL

The supplementary material contains details of the optimization of the dummy atom parameters, the convergence of the machine-learning models with respect to the number of reference environments, a description of additional machine-learning parameters, details of the auxiliary basis employed, and an analysis of the origin of the error in the polarizability or dielectric susceptibility derived from a predicted density response.

A.M.L. acknowledges partial support from the Alexander von Humboldt Foundation. P.L. and M.R. acknowledge support through the Lise Meitner Program of the Max Planck Society and the UFAST International Max Planck Research School.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Alan M. Lewis**: Investigation (lead); Methodology (lead); Software (lead); Writing – original draft (lead); Writing – review & editing (equal). **Paolo Lazzaroni**: Data curation (lead); Investigation (equal); Software (equal); Writing – review & editing (equal). **Mariana Rossi**: Conceptualization (lead); Methodology (equal); Supervision (lead); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The machine-learning datasets and MD trajectories can be found at dx.doi.org/10.5281/zenodo.7940570. The naphthalene training data can also be found in the NOMAD repository dx.doi.org/10.17172/NOMAD/2023.04.13-1. The code to build similarity kernels using the modified *λ*-SOAP descriptors can be found at github.com/alanmlewis/TENSOAP, with the release for this paper at dx.doi.org/10.5281/zenodo.7941362.

## REFERENCES

48. Should strict charge conservation be required, this could be achieved either by applying a uniform *post hoc* correction to the predicted density response or by excluding *λ* = 0 auxiliary basis functions from the expansion in Eq. (1).

49. V. Blum, R. Gehrke, F. Hanke, P. Havu, V. Havu, X. Ren, K. Reuter, and M. Scheffler, "*Ab initio* molecular simulations with numeric atom-centered orbitals," Comput. Phys. Commun. **180**, 2175–2196 (2009).

54. The learning of the response to a field in the *y* direction also improves when using the modified kernels, since the applied field direction is no longer simply inferred from the aligned geometries.

55. The KS orbitals can be derived from a machine-learning model such as SALTED; see Ref. 28.