The recently introduced Gaussian Process State (GPS) provides a highly flexible, compact, and physically insightful representation of quantum many-body states based on ideas from the zoo of machine learning approaches. In this work, we give a comprehensive description of how such a state can be learned from given samples of a potentially unknown target state and show how regression approaches based on Bayesian inference can be used to compress a target state into a highly compact and accurate GPS representation. By application of a type II maximum likelihood method based on relevance vector machines, we are able to extract many-body configurations from the underlying Hilbert space, which are particularly relevant for the description of the target state, as support points to define the GPS. Together with an introduced optimization scheme for the hyperparameters of the model characterizing the weighting of modeled correlation features, this makes it possible to easily extract physical characteristics of the state such as the relative importance of particular correlation properties. We apply the Bayesian learning scheme to the problem of modeling ground states of small Fermi–Hubbard chains and show that the found solutions represent a systematically improvable trade-off between sparsity and accuracy of the model. Moreover, we show how the learned hyperparameters and the extracted relevant configurations, characterizing the correlation of the wave function, depend on the interaction strength of the Hubbard model and the target accuracy of the representation.

## I. INTRODUCTION

The difficulty in compactly and accurately describing quantum many-body systems is a key factor limiting numerical studies of realistic physical problems found in condensed matter and chemical systems. This means that, in practice, numerical algorithms tackling the many-body problem need to efficiently compress the information encoded in the system to be numerically tractable. Of particular interest is a compact description of the many-body wave function, which fully characterizes the physical properties of the system. A common approach is to constrain the state to be represented by a compact functional form defining an approximate wave function ansatz for which expectation values can still be efficiently evaluated. These ansätze are often built from physical characteristics that are expected (or known) to dominate the description of the target state. Such ansätze, including the Jastrow wave function,^{1} Correlator Product State (CPS),^{2} coupled-cluster wave functions,^{3} or tensor networks,^{4–6} are usually designed to explicitly model selected correlations (such as those based on low rank, locality, or low entanglement) of the target system, which makes them highly successful for specific applications.

Recent approaches to describe many-body wave functions have also used the idea of deriving their functional forms from the field of machine learning (ML), with the aim of avoiding high prior constraints on the emergent physics. Such ML-based approaches often represent a direct conceptual counterpart to other numerical techniques; instead of incorporating as much physical understanding as possible in order to reduce the complexity of the problem, algorithms are introduced to automatically find optimal approximations for the given problem based on data or some quality metrics in a non-parametric fashion. This includes the representation of states based on artificial neural networks, typically denoted as Neural Quantum States (NQSs),^{7–12} which have successfully modeled the ground states of lattice^{13–23} and *ab initio* ^{24–27} many-body systems, as well as in quantum state tomography^{28} and in the description of dynamical properties of many-body systems.^{29,30} Complementary to the representations based on different neural network architectures, an additional class of many-body wave functions motivated by machine learning with kernel methods, in particular Gaussian process regression, has been introduced recently. These states, named Gaussian Process States (GPSs),^{31} are explicitly data-driven and have a number of appealing properties. In particular, it is easy to incorporate anticipated physical intuition into the method to improve its efficiency, while the final description remains automatically learned and can, in practice, be constructed to be systematically improvable to exactness. The GPSs have already shown promise in the description of many-body ground states, which we substantiate and elaborate on in this work.
In particular, we describe how the form of the GPS makes it well suited to the application of regression techniques based on Bayesian inference, allowing highly accurate and compact many-body representations to be obtained from given wave function data, which can then be used to compute physical quantities of interest.

In this work, we benchmark the accuracy of the Bayesian procedure to efficiently compress a given many-body wave function to the GPS form, by learning ground states of one-dimensional Fermi–Hubbard models. This defines a clear series of correlated and controllable test systems with known exact comparison results, defined by the Hamiltonian

$$\hat{H} = -t \sum_{\langle i,j \rangle, \sigma} \left( \hat{c}^{\dagger}_{i\sigma} \hat{c}_{j\sigma} + \mathrm{h.c.} \right) + U \sum_{i} \hat{n}_{i\uparrow} \hat{n}_{i\downarrow}, \tag{1}$$

where Fermionic creation and annihilation operators *c*^{†} and *c* are associated with discrete sites of a one-dimensional lattice, labeled by indices *i* and *j*, and with spins labeled by *σ*. The model, characterized by a hopping parameter *t* and the interaction strength *U*, provides a generic prototype for the description of strongly correlated electrons in the context of condensed matter physics and phenomena emerging in such systems. Numerical studies of this system face the many-body problem, as they must find a compact (polynomial) description of the system while still being able to correctly describe the emergent correlated effects. We thus aim to represent the ground state as a compact map in the form of a GPS, which associates a wave function amplitude to any many-body configuration of the underlying Hilbert space.

The presented work expands on the GPS framework introduced in Ref. 31, extending it to gain further insight into this novel Bayesian approach to wave function representation. We show that this framework also allows interpretable physical information, such as correlation length scales, to be extracted from the state, and we demonstrate the potential of the GPS representation to tackle the many-body problem.

## II. GAUSSIAN PROCESS STATES

### A. The GPS representation

The GPS models the logarithm of the wave function amplitude as the mean of a Gaussian process regression estimate.^{32} This defines the amplitude for a many-body configuration **x** as

$$\Psi(\mathbf{x}) = \exp\left(\sum_b w_b\, k(\mathbf{x}, \mathbf{x}'_b)\right), \tag{2}$$

where the sum in the exponential is taken with respect to a set of “basis” configurations $\mathbf{x}'_b$, which define the support points of the model. This model is, therefore, explicitly data-driven, with each basis configuration in the dataset also having an associated weight *w*_{b}. The other quantity required is the kernel function $k(\mathbf{x}, \mathbf{x}'_b)$, which compares the test configuration **x** with the basis configurations. Since the ground state of the Hubbard chains considered in this work can be made entirely positive in the site basis if appropriate boundary conditions are chosen, we do not need to learn the sign structure of the wave function. Learning the sign structure of the wave function amplitudes, a challenging task for both the GPS and NQS representations,^{18,20} is a topic we defer to a later study, after consideration here of the ability to learn the exponential space of positive-definite amplitudes in this form. We can, therefore, restrict the weights of the model to be real in this instance.

The estimator $\sum_b w_b\, k(\mathbf{x}, \mathbf{x}'_b)$ can be considered a linear regression model in some (typically high dimensional) space of features of the many-body configurations. However, the feature map, transforming many-body configurations into this feature space, is only defined implicitly through the kernel function $k(\mathbf{x}, \mathbf{x}'_b)$, which corresponds to a scalar product between the feature vectors associated with **x** and $\mathbf{x}'_b$. An important property of this representation is that the transformation of the configurations into the feature space does not need to be evaluated explicitly. This means that, in the context of many-body wave functions, we can define kernel functions that represent an exponentially scaling number of correlation features emerging between local degrees of freedom of a system, while still maintaining a functional form that can be evaluated efficiently for any many-body configuration. This ensures that if an appropriate kernel function is chosen, a GPS can represent any wave function to arbitrary accuracy, making it a representation that is not limited in its expressibility. Such a general exact representation of arbitrary states usually requires a number of basis configurations which grows with the dimensionality of the underlying Hilbert space. Nonetheless, we aim to obtain compact but still highly accurate GPS representations by efficiently modeling the underlying physical structure of relevant states, mirroring the great success of function approximators based on kernel functions used in different machine learning applications, e.g., with Gaussian process regression,^{33} kernel ridge regression,^{34} or support vector machines.^{35} The main task discussed in the following is to learn a GPS representation of a given physical state that is as compact (in terms of a small number of basis configurations) and accurate as possible.
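The structure of this estimator can be sketched in a few lines. The kernel below is a deliberately simple placeholder that counts matching local occupancies (it is not the kernel of this work), and the encoding of configurations as strings over `{'0', 'u', 'd', '2'}` is an assumption for illustration only:

```python
import numpy as np

def matching_sites(x, x_prime):
    """Placeholder kernel: number of sites with identical local occupancy."""
    return float(sum(a == b for a, b in zip(x, x_prime)))

def gps_log_amplitude(x, basis_configs, weights, kernel):
    """Linear regression estimate of ln Psi(x) in the implicit feature space."""
    return sum(w * kernel(x, xb) for w, xb in zip(weights, basis_configs))

def gps_amplitude(x, basis_configs, weights, kernel):
    """GPS amplitude: exponential of the kernel expansion over basis configs."""
    return np.exp(gps_log_amplitude(x, basis_configs, weights, kernel))

# Local occupancies per site encoded as one of {'0', 'u', 'd', '2'}.
basis = ["0u2d", "ud0u"]   # two "support point" basis configurations x'_b
weights = [0.3, -0.1]      # associated weights w_b
amp = gps_amplitude("0u2u", basis, weights, matching_sites)
```

Note that evaluating the amplitude costs only one kernel evaluation per basis configuration, regardless of how large the implicit feature space is.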

### B. The kernel function

A central element of the success of the method is the choice of the kernel function, defining the feature space in which the linear regression of the log-wave function amplitudes takes place. The features that are described with the ansatz are, therefore, defined before the actual representation is learned from the given data. This is conceptually different from how typical representations based on artificial neural networks are used in practice. Although the features learned for such states could also be designed by choosing particular input representations or specific network architectures, usually their features are, in principle, learned for a specific state via Monte Carlo sampling and variational optimization of the chosen network architecture. However, being able to define the kernel function allows for a high level of direct control over the underlying physics that is modeled, and it can be explicitly tailored to enforce desired physical constraints on the wave function. For example, as shown in Ref. 31, many-body states such as the W-state or the Laughlin state can be represented very efficiently as a GPS using only a single basis configuration by introducing a specifically designed kernel function. Moreover, it was found that general kernel functions (not tailored for a specific state) can learn the ground state wave functions efficiently. This can be attributed to the fact that the kernel transforms the configurational input into a very high (or even infinite) dimensional feature space, where there always exists a solution that is of a linear form. Furthermore, similar to an NQS representation, it is possible to include a degree of “kernel learning,” in order to optimize hyperparameters of the kernel, resulting in a weighting of some more relevant features over others, with their values giving insight into the nature of the state.

In the context of many-body wave functions, the key challenge is that of finding regularities between the wave function amplitude of a given many-body configuration and the local occupancies of its sites. If there are correlations between the amplitude and occupancy patterns, then we expect to be able to exploit this information and obtain a compressed representation of the state. An example of the use of just such a correlation feature can be found in the Gutzwiller representation,^{36} which explicitly models the exponential suppression (or enhancement) of the wave function amplitude depending on the number of doubly occupied sites in the configuration. With the GPS, we are not limited to the extraction of such single-site features, and we can, in principle, represent and model correlations between an arbitrary number and range of lattice sites. In this way, the kernel can represent a sum over exponentially many features, with the exponential form of the final wave function of Eq. (2) ensuring overall product separability of these features and a resulting state which preserves size extensivity of appropriate properties (such as the total energy).

With this aim, in this work, we choose to use a squared exponential kernel within a GPS framework. The base kernel is defined as

where *h*(**x**, **x**′) is the “Hamming distance”^{37} between the many-body configurations **x** and **x**′. Furthermore, *θ* is an adjustable hyperparameter controlling the weighting of the modeled features at different ranks. For the Fermi–Hubbard model, we denote the local occupancy of configuration **x** at site *i* as *x*_{i}, which can take one of four possibilities, {·, ↑, ↓, ↑↓}. The Hamming distance between two configurations can then be constructed as

$$h(\mathbf{x}, \mathbf{x}') = \sum_i \left(1 - \delta_{x_i, x'_i}\right),$$

where the sum ranges over all sites of the lattice and the delta function $\delta_{x_i, x'_i}$ is equal to one if the local occupancies at site *i* of configurations **x** and **x**′ are the same and zero otherwise.
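As a minimal sketch, the Hamming distance on local occupancies can be written as follows, with configurations again encoded as strings over `{'0', 'u', 'd', '2'}` (an illustrative encoding, not one prescribed by the text):

```python
def hamming_distance(x, x_prime):
    """Number of lattice sites whose local occupancy (empty, up, down, or
    doubly occupied) differs between the two configurations, i.e. the sum
    over sites of (1 - delta_{x_i, x'_i})."""
    assert len(x) == len(x_prime), "configurations must live on the same lattice"
    return sum(a != b for a, b in zip(x, x_prime))
```

Identical configurations give a distance of zero, and the distance grows by one for every site at which the occupancies disagree.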

Such a kernel gives rise to a feature space that is a linear combination of all possible occupancy configurations across arbitrarily many sites (derived in more detail in Ref. 31), which guarantees that an exact representation can always be obtained in the limit of a complete set of basis configurations. This property can be easily seen by considering the Taylor expansion of the kernel function,

In the second term, only single-site correlation features in the two configurations are represented so that these terms alone would give rise to a Gutzwiller-type parameterization of single-site features. The third term also contains terms of the form $\delta_{x_i, x'_i}\delta_{x_j, x'_j}$, thus also capturing two-site correlation features between different sites, and similarly, higher order correlation features are contained in the following terms. The *n*th order of correlations between different sites captured in the kernel is suppressed by a factor of *θ*^{n} so that the hyperparameter *θ* effectively controls the weighting of higher order correlation features with respect to the lower order features.

In addition to the weighting of the correlation order in the kernel function, we can introduce another hyperparameter, controlling the relative weighting of correlation features based on the distance between sites, since we expect short-range correlations to be dominant over the long-range correlations. We parameterize this effect through an explicit polynomial suppression of long-range features in the Hamming distance as

where |*r*_{i} − *r*_{0}| represents the distance of the site with index *i* from a chosen reference site with index zero. For positive values of *γ*, this construction explicitly introduces a stronger weighting of features resulting from the occupancy patterns around the chosen reference site. However, if the basis configurations of the ansatz are chosen from the set of all possible configurations (including translationally equivalent ones), then the chosen reference site will not bias or affect the accuracy of the ansatz since the model can simply shift the dataset of basis configurations to compensate for any change in the reference choice. Therefore, for translationally symmetric systems and a free choice for the model to pick desired basis configurations, the model is independent of the arbitrary choice of the reference site.

The choice of basis configurations for the model will thus most sensitively depend on their occupancies around this arbitrary reference site, where the hyperparameter *γ* explicitly controls the rate of decay in the relative importance of long range compared to short-range features from this reference site. It should be stressed that the adjustment of these hyperparameters (*γ* and *θ*) does not restrict the dimensionality of the feature space or bias the exact limit of the ansatz in the limit of complete and exact training data and basis size. However, the convergence to this exact limit with increasing datasets can be improved if the hyperparameters are set such that they preferentially weight the fitting of these more important features. This is because decreasing these hyperparameters leads to a more complex model, which requires a larger basis set to appropriately fit to a desired accuracy. This will be elaborated on in Sec. III E.

Using this modified, distance-regularized Hamming distance defined in Eq. (6) within the squared exponential form for the kernel (taking care with the limit of *r*_{i} → *r*_{0}), we obtain the final form of our kernel as

An important property of this kernel function is that the basis configurations **x**′ do not necessarily need to correspond to physical configurations associated with the target Hamiltonian. By building basis configurations associated with a lattice smaller than the target system (and restricting the sum in the exponential to site indices of the smaller lattice), we can explicitly restrict the range of the modeled correlation features, as is typically done in the framework of correlator product states.^{2} However, contrary to a CPS ansatz, the kernel in Eq. (7) allows for modeling the exponentially growing number of features across the considered range in a polynomially compact form. This is due to the fact that the kernel, which can be evaluated in polynomial time, implicitly defines the map into the space of described features independent of the number of basis configurations used in the GPS.

Finally, physical symmetries of the system can easily be incorporated into the GPS representation by explicitly encoding them into the kernel function. A direct way to do this is to include a sum over all symmetry operations in the kernel definition, giving the symmetrized kernel

$$k_{\mathrm{sym}}(\mathbf{x}, \mathbf{x}') = \sum_{\mathcal{S}} k(\mathcal{S}[\mathbf{x}], \mathbf{x}'),$$

where the operations $\mathcal{S}$ represent the symmetry operations we want to respect and $\mathcal{S}[\mathbf{x}]$ denotes the transformed configuration that is obtained after applying $\mathcal{S}$ to the configuration **x**. In this work, we always include the translational symmetry of the lattice in the kernel through this construction, ensuring that the GPS preserves translational symmetry at all times. We note that this kernel is then equivalent to the “complete” kernel introduced in Ref. 31 in the *p* → ∞ limit [Eq. (7) of that work], modeling all correlations, with a specific choice (|*r*_{i} − *r*_{0}|^{−1}) for the distance weighting of the features.
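The symmetrization can be sketched directly: the base kernel is summed over all symmetry operations applied to one of its arguments. Here the operations are cyclic lattice translations under periodic boundary conditions, and the base kernel is an assumed illustrative form (not the kernel of Eq. (7)); the resulting kernel, and hence the GPS built from it, is invariant under translations of the test configuration:

```python
import math

def translations(x):
    """All cyclic translations of a 1D configuration (periodic boundaries)."""
    return [x[i:] + x[:i] for i in range(len(x))]

def base_kernel(x, x_prime, theta=2.0):
    """Placeholder kernel: exponential in the number of matching local
    occupancies (an assumed illustrative form, not the paper's kernel)."""
    matches = sum(a == b for a, b in zip(x, x_prime))
    return math.exp(matches / theta)

def symmetrized_kernel(x, x_prime):
    """k_sym(x, x') = sum over symmetry operations S of k(S[x], x')."""
    return sum(base_kernel(t, x_prime) for t in translations(x))
```

Since translating **x** merely permutes the terms of the sum, the symmetrized kernel value, and therefore every GPS amplitude built from it, is unchanged under translations of the test configuration.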

## III. BAYESIAN COMPRESSION OF QUANTUM STATES

### A. Bayesian learning with GPS

In this section, we address the problem of finding a suitable GPS representation from a given set of wave function data corresponding to a target state. This is essentially a task of supervised learning: given a finite set of input–output pairs, we want to infer a good general model describing the underlying relationship between the inputs and the outputs. Specifically here, we want to learn a good GPS representation of the ground state of the Hubbard model from a given set of wave function amplitudes, e.g., generated by exact diagonalization. We tackle this problem by applying well known machine learning techniques based on Bayesian inference,^{32,38} which have been used successfully for a variety of regression and classification tasks.

The essential paradigm of such learning schemes is to explicitly model a probability distribution over all possible models, which is statistically inferred from the given data. Applying such Bayesian learning schemes, we, therefore, not only obtain a GPS that describes the given amplitudes well (e.g., by adopting the most probable or the mean of all models) but also gain insight into underlying statistical properties, such as the uncertainty of new predictions associated with a chosen model. This additional interpretability of the results is a key advantage over non-statistical regression schemes such as a simple minimization of the squared errors on the training data.

In order to model the GPS as defined in Eq. (2), we require a set of basis configurations (which is hopefully optimally compact) along with associated weights, {($\mathbf{x}'_b$, *w*_{b})}, and the kernel function, which is in this case parameterized by the hyperparameters *θ* and *γ*. The objective is, therefore, to describe a probability distribution over all possible choices of basis configurations, associated weights, and kernel hyperparameters defining the model, given (some or all) “training” wave function amplitudes {*ψ*_{T}(**x**)} of a target state *ψ*_{T}. The aim is for this GPS model to use the training data in order to faithfully represent this target state in a compact form. In general, this target state will be approximate, or even unknown. However, in this work, we will consider an exact *ψ*_{T}, to allow for exploration of optimal algorithms and unambiguous conclusions about the underlying efficiency and accuracy of the model. The development of practical algorithms for unknown states based on this approach is explored in Ref. 31 and will be the subject of future work.

Instead of fitting directly to the set of training wave function amplitudes {*ψ*_{T}(**x**)}, we apply the Bayesian regression scheme to the logarithm of the amplitudes, *ϕ*(**x**) = ln *ψ*_{T}(**x**), with the linear model $\mathcal{M} \equiv \sum_b w_b\, k(\mathbf{x}, \mathbf{x}'_b)$. The central modeling assumption is that the linear model $\mathcal{M}$ (which depends on the basis configurations, associated weights, and kernel hyperparameters) approximates the log-wave function amplitudes well and that the probability distribution for the error in each log amplitude follows an independent normal distribution with a zero mean and a variance parameterized by $\sigma_{\mathbf{x}}^2$. This assumption results in a multivariate normal distribution with a diagonal covariance matrix for the likelihood of the training data, $p(\boldsymbol{\phi}|\mathcal{M}, \boldsymbol{\sigma}^2)$. This is the probability of observing the given training log-wave function values (here denoted by the vector **ϕ**) from a particular set of model variables and associated variances (denoted by the vector **σ**^{2}). In a typical Bayesian regression, one would use a single noise parameter *σ*^{2}, describing the variance of the likelihood, when it is expected that the distribution of errors for this quantity is independent of the specific configuration. In our case, however, we are modeling the logarithm of the wave function and are instead interested in a fixed variance for the description of the *actual* amplitudes, not necessarily for their logarithms. We can achieve this by choosing $\sigma_{\mathbf{x}}^2$ in order to unbias the action of the logarithm on the variances and to instead fix an (approximately) constant variance in the likelihood of modeling the true wave function amplitudes of *ψ*_{T}(**x**).

To do this, we note that the chosen assumptions result in a log-normal distribution for the likelihood of the actual wave function amplitudes, with variance

$$\mathrm{Var}\left[\psi_T(\mathbf{x})\right] = \left(e^{\sigma_{\mathbf{x}}^2} - 1\right) \langle\psi_T(\mathbf{x})\rangle^2,$$

where ⟨*ψ*_{T}(**x**)⟩ represents the mean of the likelihood for the wave function amplitude from the model. This means that the variance of the distribution generally increases with an increase in the mean if we use a fixed value of $\sigma_{\mathbf{x}}^2$ for all configurations. In order to obtain a fixed variance, we can solve this equation for $\sigma_{\mathbf{x}}^2$ and introduce a single parameter $\tilde{\sigma}^2$, which defines the target variance we want to obtain for the likelihood of the actual wave function amplitudes. This gives the rescaled noise parameter for the variance of the likelihood as

$$\sigma_{\mathbf{x}}^2 = \ln\left(1 + \frac{\tilde{\sigma}^2}{\langle\psi_T(\mathbf{x})\rangle^2}\right).$$

Under the assumption that the model fit describes the training data reasonably well, ⟨*ψ*_{T}(**x**)⟩ can be approximated by the value of the training amplitude, giving the final rescaled noise as $\sigma_{\mathbf{x}}^2 = \ln\left(1 + \tilde{\sigma}^2/\psi_T(\mathbf{x})^2\right)$. The target variance of the likelihood for the actual wave function amplitudes, $\tilde{\sigma}^2$, can then be used as an input parameter defining the desired accuracy, which should be achieved by the fit. A larger value of $\tilde{\sigma}^2$ corresponds to a larger variance of the likelihood, so that the data might be described by a less accurate model, and vice versa. This, therefore, represents the key value determining the accuracy of the final model.
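The rescaling can be verified numerically: a log-normal distribution with mean *ψ* and log-space variance ln(1 + *σ̃*²/*ψ*²) has variance exactly *σ̃*². A minimal sketch, with illustrative numerical values:

```python
import math

def rescaled_noise(psi_x, sigma_tilde_sq):
    """Per-configuration log-space noise sigma_x^2 = ln(1 + sigma_tilde^2 / psi^2),
    chosen so that the implied log-normal likelihood of the *actual* amplitude
    has (approximately) the fixed target variance sigma_tilde^2."""
    return math.log(1.0 + sigma_tilde_sq / psi_x**2)

# A log-normal distribution with mean m and log-space variance s2 has
# variance (exp(s2) - 1) * m^2; with s2 from rescaled_noise and m = psi_x,
# this recovers the target variance sigma_tilde^2.
psi_x = 0.37            # illustrative training amplitude
sigma_tilde_sq = 1e-4   # illustrative target variance
s2 = rescaled_noise(psi_x, sigma_tilde_sq)
recovered = (math.exp(s2) - 1.0) * psi_x**2
```

Note that small amplitudes receive a larger log-space noise, so the fit tolerates larger *relative* errors on configurations with small amplitudes, consistent with a fixed absolute target variance.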

### B. Bayesian optimization of the model weights

Using the modeling assumptions above, we can now address the problem of finding the *posterior* distribution, $p(\mathcal{M}|\boldsymbol{\phi}, \tilde{\sigma}^2)$, describing the probability distribution over all possible models, given the training data and a target accuracy defined by $\tilde{\sigma}^2$. The mean of this distribution will define the final GPS model parameters required for the wave function in Eq. (2). In order to obtain the posterior distribution, we apply Bayes’ theorem, giving the posterior as

$$p(\mathcal{M}|\boldsymbol{\phi}, \tilde{\sigma}^2) = \frac{p(\boldsymbol{\phi}|\mathcal{M}, \tilde{\sigma}^2)\, p(\mathcal{M}|\tilde{\sigma}^2)}{p(\boldsymbol{\phi}|\tilde{\sigma}^2)},$$

with likelihood $p(\boldsymbol{\phi}|\mathcal{M}, \tilde{\sigma}^2)$, where $p(\mathcal{M}|\tilde{\sigma}^2)$ denotes the *prior*, our assumed probability distribution over the model variables in the absence of any data. The denominator of the expression, known as the *marginal likelihood*, normalizes the probability distribution.

As a first application of Bayesian inference, we can find the posterior distribution for the weights of the model (written as a vector $\mathbf{w}$) for a fixed basis set and fixed hyperparameters *θ* and *γ*. Without including any prior belief about the distribution of the weights [i.e., using a uniform prior, $p(\mathbf{w})$], the posterior follows a normal distribution, with a mean equivalent to a direct minimization of the squared errors weighted by the associated values of $\sigma_{\mathbf{x}}^2$ over the training set. This minimization can be performed in closed form, with the most probable weights then given by

$$\mathbf{w} = \left(K^T S^2 K\right)^{-1} K^T S^2 \boldsymbol{\phi},$$

where *K* is the kernel matrix evaluated between all basis and training configurations and $S^2$ is a diagonal matrix with elements $1/\sigma_{\mathbf{x}}^2$.

In addition to the problem of potentially obtaining (numerically) singular matrices *K*^{T}*S*^{2}*K*, it is also known that such least squares solutions are prone to overfitting the training data, so that the obtained model does not necessarily generalize well to the prediction of configurations not contained in the training dataset. Both of these problems can be addressed by introducing an appropriate prior distribution to regularize the optimal weights. We employ the common choice of a Gaussian prior with a zero mean and a variance *α*^{−1}. This form ensures that the posterior distribution remains analytically tractable and introduces a bias toward learned models with weights distributed around zero. With this choice, the posterior distribution for the weights is also Gaussian, with a mean for the distribution given by

$$\boldsymbol{\mu} = \Sigma\, K^T S^2 \boldsymbol{\phi} \tag{13}$$

and the covariance matrix

$$\Sigma = \left(K^T S^2 K + \alpha \mathbb{1}\right)^{-1}. \tag{14}$$

The mean weights, which are adopted as the weights of the model, are equivalent to those obtained with a “regularized” least squares fit, where the sum over the squared errors is optimized subject to a regularization constraint on the sum of all squared weights, a procedure known as ridge regression^{39} or Tikhonov regularization. In order to allow for models satisfying the imposed zero-mean prior distribution of the weights, we shift our training data before the fit so that the mean over all training amplitudes *ϕ*(**x**) is zero. This only affects the normalization of the final GPS wave function and can be reverted either by shifting back the log-amplitudes when predictions are made or by simply ignoring this shift if the final quantities are independent of the overall wave function norm.
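The closed-form posterior can be sketched as follows. This is the standard Bayesian ridge-regression result under the stated Gaussian assumptions, assumed here to match Eqs. (13) and (14); the toy kernel matrix and data are purely illustrative:

```python
import numpy as np

def posterior_weights(K, phi, sigma_x_sq, alpha):
    """Posterior over weights for a Gaussian likelihood with per-configuration
    variances sigma_x^2 and a zero-mean Gaussian prior with variance 1/alpha:
        Sigma = (K^T S^2 K + alpha * I)^{-1},  mu = Sigma K^T S^2 phi,
    where S^2 = diag(1 / sigma_x^2)."""
    K = np.asarray(K, float)
    S2 = np.diag(1.0 / np.asarray(sigma_x_sq, float))
    n_basis = K.shape[1]
    Sigma = np.linalg.inv(K.T @ S2 @ K + alpha * np.eye(n_basis))
    mu = Sigma @ K.T @ S2 @ np.asarray(phi, float)
    return mu, Sigma

# Toy problem: 3 training log-amplitudes, 2 basis configurations.
K = np.array([[1.0, 0.5], [0.2, 1.0], [0.7, 0.3]])
phi = np.array([0.4, -0.1, 0.2])
mu, Sigma = posterior_weights(K, phi, sigma_x_sq=[0.01, 0.01, 0.01], alpha=1.0)
```

In the limit *α* → 0 this reduces to the weighted least squares solution above, while increasing *α* biases the weights toward zero and regularizes ill-conditioned *K*^{T}*S*^{2}*K*.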

Within this Bayesian approach, we are, therefore, able to obtain the optimal weights for a fit to the training data of a target state, given some basis configurations, kernel hyperparameters, the variance for the likelihood of the wave function amplitudes, $\tilde{\sigma}^2$, and the prior for the weights, characterized by *α*. An illustration of the quality of this resulting model, corresponding to a regularized least squares fit, is visualized in Fig. 1. This shows the error in the overall modeled wave function amplitudes compared to the exact target ground state wave function, for the highly correlated *U*/*t* = 8 one-dimensional Hubbard model with eight sites, as the number of basis configurations defining the model increases. Each point corresponds to a model where a desired number of basis configurations are chosen entirely at random from all possible configurations of the underlying configurational Hilbert space. The weights are then obtained as the mean of the posterior distribution as defined in Eq. (13), with the training data consisting of the entire set of symmetrically inequivalent amplitudes of the target state.

The first aspect that can be seen from Fig. 1 is that the quality of the fit (represented by the mean squared error obtained with the GPS across the Hilbert space) varies significantly for different random basis sets, giving mean squared errors ranging between ≈6 × 10^{−7} and ≈10^{3}. As expected, we achieve the smallest errors for the fit of the target state in the limit of large basis sets, which underlines a very general intuition about the GPS: the larger the number of basis configurations defining the model, the better its expressibility, allowing for a higher accuracy in the fit to the target state. However, we are usually not interested in the limit where almost all the configurations from the Hilbert space are used as basis configurations for our ansatz but aim to achieve a more compact representation of the target state. Whereas the error barely fluctuates for larger basis sets, where a significant fraction of all possible configurations are included in the chosen basis set, the mean squared errors fluctuate heavily for small basis sets corresponding to compact models. For example, a model defined with a single randomly chosen basis configuration achieves an error ranging between ≈2 × 10^{−4} and ≈100, indicating that not all basis set choices are equally well suited to achieve a compact but still accurate representation of the target state. It is clear (as expected) that an appropriate selection of basis configurations for a compact model is of paramount importance to its success. Learning a GPS from given data, therefore, also involves finding a suitable set of basis configurations (and potentially also kernel hyperparameters), which is the topic we turn to in Secs. III C–III E.

### C. Type II maximum likelihood

Ideally, we would infer the full model, including choice of basis configurations and kernel hyperparameters, in the same Bayesian approach as the weights, i.e., by defining a sensible prior for the full model and finding the model, which maximizes the posterior distribution $p(M|\varphi ,\sigma \u03032)$, given the training data. However, this approach would not be tractable in practice, requiring the introduction of additional modeling assumptions. Separating the known posterior distribution for the weights, the full posterior for the model can be written as

where we introduce the posterior for the basis set and the kernel hyperparameters as $p({xb},\theta ,\gamma |\varphi ,\sigma \u03032)$. Instead of maximizing the full posterior of the overall model, $p(M|\varphi ,\sigma \u03032)$, we apply the common type II maximum likelihood procedure. This means that we find the basis set and hyper parameters, which maximize $p({xb},\theta ,\gamma |\varphi ,\sigma \u03032)$, which is independent of the weights, and then use the found maximum of this distribution for inference of the resulting weights. If we assume uniform priors for the basis sets and kernel hyperparameters, then the posterior $p({xb},\theta ,\gamma |\varphi ,\sigma \u03032)$ is proportional to the marginal likelihood, which appears in the application of Bayes’ theorem for the inference of the weights,
$$p(\{\mathbf{x}_b\},\theta,\gamma|\varphi,\tilde{\sigma}^2) \propto p(\varphi|\{\mathbf{x}_b\},\theta,\gamma,\tilde{\sigma}^2,\alpha). \tag{16}$$
We, therefore, aim to find an optimal and compact basis set and kernel hyperparameters by maximizing the marginal likelihood $p_{\mathrm{ML}}=p(\varphi|\{\mathbf{x}_b\},\theta,\gamma,\tilde{\sigma}^2,\alpha)$ for the weights. Assuming a Gaussian likelihood and prior for $\mathbf{w}$, as done in Sec. III B, the marginal likelihood can be obtained in a closed form for a chosen basis set. Its logarithm is then given by^{40}
$$\log p_{\mathrm{ML}} = -\frac{1}{2}\left(N\log(2\pi) + \log|C| + \varphi^{T} C^{-1} \varphi\right), \tag{17}$$
where *N* denotes the number of training amplitudes and
$$C = \tilde{\sigma}^{2}\,\mathbb{1} + \frac{1}{\alpha} K K^{T}. \tag{18}$$
An alternative form of the log marginal likelihood that can, in practice, be faster to evaluate is given by^{41}
$$\log p_{\mathrm{ML}} = -\frac{1}{2}\left(N\log(2\pi\tilde{\sigma}^{2}) - N_b\log\alpha - \log|\Sigma| + \frac{\|\varphi - K\boldsymbol{\mu}\|^{2}}{\tilde{\sigma}^{2}} + \alpha\,\boldsymbol{\mu}^{T}\boldsymbol{\mu}\right), \tag{19}$$
Here, *N*_{b} denotes the number of basis configurations, {**x**} denotes the set of all training configurations with associated log-amplitudes **ϕ**, and
$$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_{b,j}), \tag{20}$$
where **μ** and **Σ** are the mean and covariance matrix as given in Eqs. (13) and (14).

The ability of the marginal likelihood to distinguish the quality of different basis sets can also be seen in Fig. 1 for the randomly selected basis sets of varying sizes. The marginal likelihood obtained for each basis set is indicated by the coloring of the scatter points, with warmer colors representing a larger marginal likelihood. As can be seen in the left inset, in the regime of smaller basis sets, the computed value of the marginal likelihood is typically larger for those basis sets that reach a higher accuracy than for poorer quality basis sets of the same size. The marginal likelihood thus provides a good figure of merit to gauge the quality of a basis set of a given size. Overall, the computed value of the marginal likelihood for small basis sets increases with the accuracy of the model, reaching a maximum at ≈360 configurations. For basis sets significantly larger than this, the marginal likelihood decreases again. This shows that the marginal likelihood does not simply correspond inversely to the error obtained for the fit but also rewards the sparsity of the model, i.e., a small number of chosen basis configurations. This well known property of the marginal likelihood under our chosen modeling assumptions^{38} makes statistical inference of the weights, together with optimization of the marginal likelihood, the ideal candidate for compressing a target state into a compact and representative GPS. If a higher accuracy (and correspondingly larger basis set) is desired, then this can be controlled with the $\tilde{\sigma}^2$ parameter. Decreasing this value specifies that we desire a more accurate fit to the training data (reducing the variance in the likelihood of the model reproducing the data). Consequently, the maximization of the marginal likelihood will result in the selection of larger basis sets with higher accuracy in the fitting of the state.
The effect of varying $\tilde{\sigma}^2$ will be considered in more detail in Secs. III E and III F.
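The agreement between the two closed forms of the log marginal likelihood can be checked numerically. The following sketch uses the standard modeling assumptions of Sec. III B (Gaussian likelihood with variance σ̃² and a zero-mean Gaussian prior on the weights with shared precision α); the design matrix `K` and data `phi` are mock values, not actual GPS kernel evaluations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Mock regression setup standing in for the GPS fit: N training
# log-amplitudes phi and a design matrix K of kernel values against
# Nb basis configurations.
N, Nb = 40, 6
K = rng.normal(size=(N, Nb))
phi = rng.normal(size=N)
alpha, sigma2 = 2.0, 0.5   # prior precision and noise variance

# Direct form: log N(phi | 0, C) with C = sigma2*I + K K^T / alpha.
C = sigma2 * np.eye(N) + (K @ K.T) / alpha
_, logdetC = np.linalg.slogdet(C)
logml_direct = -0.5 * (N * np.log(2 * np.pi) + logdetC
                       + phi @ np.linalg.solve(C, phi))

# Alternative form via the posterior mean mu and covariance Sigma of the
# weights, which is cheaper when Nb << N.
Sigma = np.linalg.inv(alpha * np.eye(Nb) + (K.T @ K) / sigma2)
mu = Sigma @ K.T @ phi / sigma2
_, logdetSigma = np.linalg.slogdet(Sigma)
resid = phi - K @ mu
logml_alt = -0.5 * (N * np.log(2 * np.pi * sigma2) - Nb * np.log(alpha)
                    - logdetSigma + (resid @ resid) / sigma2
                    + alpha * (mu @ mu))
```

The two expressions agree to numerical precision by the matrix determinant lemma and the Woodbury identity, which is why the second form can be faster to evaluate for sparse models.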

### D. The relevance vector machine for basis set selection

Finding the maximum of the marginal likelihood directly with respect to all possible basis sets is computationally intractable due to the combinatorial number of possible basis sets. However, it is possible to identify optimally relevant basis configurations from the set of all possible configurations by application of the *Relevance Vector Machine* (RVM).^{42} Within the RVM, the prior of the weights is not modeled by a single variance parameter for all weights, *α*, but rather is allowed to vary between configurations. The prior distribution for each weight is then a normal distribution with zero mean but an individual variance of 1/*α*_{x′} that depends on the configuration, **x**′. The full prior for the weights is, therefore, given by a multivariate normal distribution with zero mean and covariance matrix *A*^{−1}, where *A* is a diagonal matrix with elements *α*_{x′}. This results in the same posterior distribution for the weights, with the mean, covariance matrix, and log marginal likelihood as defined in Eqs. (13), (14), (17), and (19), but where the matrix $\alpha\mathbb{1}$ is replaced by the diagonal matrix *A*. The type II maximum likelihood procedure then involves finding the maximum of the marginal likelihood with respect to the parameters *α*_{x′} for all the potential basis configurations {**x**′}. If the optimal parameter *α*_{x′} for a basis configuration **x**′ goes to infinity, the prior for the weight associated with this configuration becomes a delta function located at zero. This implies that the optimal weight for this candidate configuration tends to zero, so that this basis configuration is not relevant for the model and can be removed. The final basis set obtained by the RVM from a set of (potentially all) candidate configurations then consists of those configurations with a finite optimized parameter *α*_{x′}.
In practice, it can be observed that indeed many of the *α*_{x′} parameters go to infinity during the optimization of the marginal likelihood, which is another manifestation of the observation that a large marginal likelihood represents an optimization of not just the accuracy of the model but also the sparsity of the chosen configurational basis set, as dictated by the sparse form of the prior on the weights.
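This pruning mechanism can be made concrete with a small sketch using the standard RVM posterior expressions (mock data, not the paper's kernel): with a diagonal prior precision matrix *A*, sending one *α*_{x′} to a numerically huge value collapses the posterior mean of the corresponding weight to zero, effectively removing that configuration from the model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Mock regression problem: N training log-amplitudes and Nb candidate basis
# configurations with kernel columns K; each weight carries its own prior
# precision alpha_b, collected in the diagonal matrix A.
N, Nb = 30, 5
K = rng.normal(size=(N, Nb))
phi = rng.normal(size=N)
sigma2 = 0.1

def posterior_mean(alphas):
    """Posterior mean of the weights, mu = Sigma K^T phi / sigma2, with
    Sigma = (A + K^T K / sigma2)^{-1} and A = diag(alphas)."""
    Sigma = np.linalg.inv(np.diag(alphas) + (K.T @ K) / sigma2)
    return Sigma @ K.T @ phi / sigma2

mu_finite = posterior_mean(np.full(Nb, 1.0))

# Driving one alpha_b to (numerical) infinity collapses that weight's prior
# onto zero, so the corresponding posterior mean vanishes: the configuration
# is effectively pruned from the basis.
alphas_pruned = np.full(Nb, 1.0)
alphas_pruned[2] = 1e12
mu_pruned = posterior_mean(alphas_pruned)
```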

In this work, we apply the fast marginal likelihood optimization algorithm presented in Ref. 40. This starts by setting all but one of the *α*_{x′} to infinity, before the marginal likelihood is optimized iteratively by updating a single value *α*_{x′} at each iteration step until the algorithm converges. Due to the closed analytic form of the marginal likelihood, the optimal update for a selected *α*_{x′} at each step can be found analytically, while it is also possible to identify the configuration that will give the largest improvement to the overall marginal likelihood under the update of its *α*_{x′} value. This allows for a very efficient algorithm to find the optimal basis set. If *α*_{x′} is updated from infinity to a finite value, then this configuration is added to the set of active basis configurations, and if the value is updated from a finite value to infinity, then the configuration is removed from the active basis set. A key advantage of this fast iterative optimization scheme is that the set of active basis configurations is typically kept small at all times throughout the optimization. This reduces the computational cost of the algorithm because it involves only the inversion of an *N*_{active} × *N*_{active} matrix at each iterative step, where *N*_{active} is the number of active basis configurations (i.e., basis configurations with finite *α*_{x′}).
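To illustrate the selection loop without reproducing the analytic updates of Ref. 40, here is a deliberately simplified greedy variant on mock data: each candidate configuration is scored by directly re-evaluating the log marginal likelihood with that configuration included (all included weights sharing one finite prior precision `alpha`, all others pruned at infinity), and candidates are added while the marginal likelihood improves. The fast algorithm of Ref. 40 achieves the same goal far more efficiently via analytic single-*α* updates, additions, and removals.

```python
import numpy as np

rng = np.random.default_rng(3)

# Mock data with a sparse underlying model: only candidates 3 and 11 carry weight.
N, Ncand = 50, 20
K_full = rng.normal(size=(N, Ncand))        # kernel columns, one per candidate
w_true = np.zeros(Ncand)
w_true[[3, 11]] = 1.5
phi = K_full @ w_true + 0.05 * rng.normal(size=N)
alpha, sigma2 = 1.0, 0.01                   # shared prior precision, noise variance

def log_ml(active):
    """Log marginal likelihood with the given set of active candidates."""
    K = K_full[:, sorted(active)]
    C = sigma2 * np.eye(N) + (K @ K.T) / alpha
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + phi @ np.linalg.solve(C, phi))

# Greedy forward selection: add the candidate that most improves the
# marginal likelihood; stop when no candidate improves it any further.
active, current = set(), log_ml(set())
while True:
    scores = {j: log_ml(active | {j}) for j in range(Ncand) if j not in active}
    if not scores:
        break
    best = max(scores, key=scores.get)
    if scores[best] <= current:
        break
    active.add(best)
    current = scores[best]
```

In this sketch, the configurations that actually carry weight in the generating model are picked up first, while uninformative candidates fail to improve the marginal likelihood and stay pruned.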

The success of the RVM in obtaining a sparse set of relevant basis configurations is exemplified by returning to the scatter plot of Fig. 1. This also includes the result obtained with the RVM using the same kernel and accuracy hyperparameters as for the values obtained with randomly selected basis sets. The RVM, selecting the most relevant of all possible basis configurations, gives a GPS with a mean squared error on the target state of ≈2 × 10^{−6} using only 16 selected basis configurations, which is far higher in accuracy than any of the randomly selected basis sets of the same size. This roughly corresponds to the accuracy achieved with the randomly selected basis sets giving the maximal marginal likelihood, which are, however, an order of magnitude larger than the basis set selected by the RVM. Selecting the basis configurations by application of the RVM, as an optimization of the marginal likelihood with respect to the variances of the prior distributions for the weights, thus makes it possible to learn a sparse and optimal GPS form from given target state data in a deterministic and efficient manner.

### E. Optimization of hyperparameters

With the type II maximum likelihood scheme resulting in models representing an optimal balance between sparsity and accuracy, we also aim to optimize the remaining dependence of the model on the kernel hyperparameters, following the same principle of maximizing the marginal likelihood via the RVM. Figure 2 shows the model mean squared error with respect to the target state, the number of basis configurations selected, and the final logarithm of the marginal likelihood obtained, as the two kernel hyperparameters are varied, with the RVM procedure used to select the basis configurations for each point. The hyperparameter controlling the weighting of the order or rank of the features (*θ*) varies between 0.1 and 50, while the distance weighting hyperparameter (*γ*) is varied between 0.1 and 7. As in the previous example, the target state from which the representation is learned with the RVM corresponds to a ground state of an eight site Hubbard chain in the strongly correlated regime at *U*/*t* = 8, and we fix the variance parameter to $\tilde{\sigma}^2=10$.

The learned models reach mean squared errors on the target state ranging between ≈2 × 10^{−7} and ≈10^{−5}, with selected basis sets varying in size between 10 and 273. For the chosen example, the highest accuracy is achieved in intermediate parameter regimes of the kernel hyperparameters. In the limit *θ* → 0 and *γ* → 0, the values of the kernel function vanish for all but symmetrically related pairs of configurations. As a consequence, approaching this limit results in a large fraction of all potential basis configurations being selected by the RVM. In contrast, with increasing values of the correlation order weighting parameter *θ* (suppressing the fitting of higher rank correlations) and increasing values of the correlation range weighting parameter *γ* (suppressing the fitting of longer-ranged correlations), the model becomes less complex, requiring fewer configurations to be selected by the RVM and resulting in more compact representations.

Going to very large values of *θ* and *γ* reduces the complexity of the underlying model by preferentially weighting the fit to simpler features. This decreases the accuracy of the final GPS with respect to the target state. However, there is a minimum in the error at a particular choice of hyperparameters, which also roughly corresponds to the maximum in the log marginal likelihood. This intermediate hyperparameter regime corresponds to *θ* and *γ* between one and two and represents the optimal balance between the accuracy and compactness of the resulting GPS state, as enforced by the prior distribution on the weights. For this system, this corresponds to a fit of around 25 basis configurations and reaches a mean squared error of ≈3.4 × 10^{−7}, as denoted by the cross in Fig. 2.

More detailed insight into the relationship between fit accuracy, model complexity, and marginal likelihood can be gained from the results of Fig. 3. Here, these quantities are shown for the same target state as the kernel complexity is varied via the *θ* hyperparameter for a fixed *γ* = 1. These quantities are also shown for two different choices of the accuracy parameter, $\tilde{\sigma}^2=1$ and $\tilde{\sigma}^2=10$, which formally controls the desired variance of the likelihood of modeling the wave function amplitudes of the target state. The smaller value of $\tilde{\sigma}^2$ leads to a more accurate but less sparse model across the whole displayed range of *θ*, allowing $\tilde{\sigma}^2$ to effectively control the desired accuracy-sparsity trade-off. In agreement with the results of Fig. 2, the number of basis configurations overall decreases with increasing values of *θ*, as the model becomes simpler. However, for both choices of $\tilde{\sigma}^2$, the obtained mean squared error is minimal in a regime of small *θ* values between ≈0.3 and ≈0.8.

The fact that the error does not decrease further for more complex models is due to the modeling assumptions in the Bayesian learning, where a larger value of $\tilde{\sigma}^2$ (approximately) corresponds to a larger variance of the modeled likelihood of the data, i.e., the accuracy to which we attempt to reproduce the training data. The inferred weights for a given basis set do not then necessarily give the minimal least-squares error with respect to the training data, due to the Gaussian noise model assumed for the fitted data, which is governed by $\tilde{\sigma}^2$. Depending on the chosen noise parameter, the error of the model does not, therefore, automatically approach zero in the limit where (almost) all basis configurations are selected by the RVM, due to this finite noise model controlling the desired accuracy. Nonetheless, for both choices of $\tilde{\sigma}^2$, the marginal likelihood takes its maximal value in an intermediate regime of *θ* where a good accuracy is achieved with a compact basis set. The value of *θ* associated with the maximum value of the marginal likelihood is indicated by vertical lines in Fig. 3. For both displayed values of $\tilde{\sigma}^2$, the marginal likelihood is maximal in the regime of *θ* ≈ 2, resulting in a basis set of 27 basis configurations with an error of ≈4 × 10^{−7} for $\tilde{\sigma}^2=10$. Choosing a smaller error parameter, $\tilde{\sigma}^2=1$, the GPS model giving the maximal marginal likelihood is less sparse but also more accurate, underlining the significance of the noise parameter $\tilde{\sigma}^2$, which can be used to tune the target accuracy that we want to achieve with the learned GPS obtained by the maximization of the marginal likelihood.

In order to obtain a final compressed GPS form of a given quantum state, we, therefore, also optimize the marginal likelihood with respect to the kernel hyperparameters in the application of the RVM. The only input parameter for this algorithm is $\tilde{\sigma}^2$, implicitly controlling the target accuracy we aim to achieve with the fit. We optimize the kernel hyperparameters by running the RVM multiple times using different hyperparameters and use the model giving the largest marginal likelihood as the final representation learned from the data. In order to find the maximum of the marginal likelihood (or a good approximation thereof) with as few repeated applications of the RVM as possible, we apply the *sequential model based optimization* scheme (also known as Bayesian optimization) based on the *tree of Parzen estimators*^{43,44} as implemented in the *hyperopt* python package.^{45} In this scheme, the space of possible hyperparameters is sampled according to an initial assumed distribution of the parameter space while also taking into account the evaluation history from previously sampled values of the kernel hyperparameters. Similar to the introduced Bayesian learning approach, this is done by modeling underlying probability distributions for the kernel hyperparameters and the obtained log marginal likelihood values based on the observations and chosen prior distributions for the hyperparameters. New hyperparameter values are then sampled according to their associated expected improvement of the log marginal likelihood with respect to a chosen threshold value. Based on the general scaling behavior shown in Fig. 2, we choose log-uniform priors for *θ* and *γ*.
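The outer hyperparameter loop can be sketched as follows. To keep the example dependency-free, plain random search over the log-uniform priors stands in for hyperopt's TPE sampler, and a mock objective (peaked near *θ* ≈ 2, *γ* ≈ 1, roughly where Figs. 2 and 3 locate the maximum) stands in for a full RVM run returning the log marginal likelihood.

```python
import numpy as np

rng = np.random.default_rng(4)

def rvm_log_ml(theta, gamma):
    """Mock objective standing in for a full RVM fit: a smooth surface
    peaked near theta ~ 2 and gamma ~ 1."""
    return -(np.log(theta / 2.0) ** 2 + np.log(gamma / 1.0) ** 2)

def log_uniform(lo, hi):
    """One draw from a log-uniform prior on [lo, hi]."""
    return float(np.exp(rng.uniform(np.log(lo), np.log(hi))))

# Random search over the log-uniform priors of the main text; TPE would
# additionally model the evaluation history to bias new draws toward
# promising regions of hyperparameter space.
trials = []
for _ in range(200):
    theta, gamma = log_uniform(0.1, 50.0), log_uniform(0.1, 7.0)
    trials.append({"theta": theta, "gamma": gamma,
                   "log_ml": rvm_log_ml(theta, gamma)})
best = max(trials, key=lambda t: t["log_ml"])
```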

### F. Full optimization of the GPS for correlated systems

We can now apply Bayesian optimization of the weights, basis configurations, and kernel hyperparameters to lattices at a variety of correlation strengths to consider the accuracy and compactness of the resulting GPS representation. Figure 4 shows the mean squared error with respect to the exact target state, the number of selected basis configurations, and the optimized kernel hyperparameters, for different values of the variance $\tilde{\sigma}^2$. The target state learned is the Hubbard model ground state of the eight site system at *U*/*t* = 2, 4, 6, and 8. As expected, for all values of *U*, the error decreases monotonically with decreasing values of $\tilde{\sigma}^2$, with the number of configurations required in the basis set correspondingly increasing.

For the most strongly correlated system at *U* = 8*t*, we obtain a worst-case mean squared error of ≈10^{−6} with only 18 basis configurations when $\tilde{\sigma}^2=100$. This error in the wave function prediction across all configurations can be systematically improved by almost six orders of magnitude (down to an error of only 10^{−12}, close to numerical precision) by decreasing the variance/noise down to $\tilde{\sigma}^2=10^{-5}$, at which point the GPS requires 446 basis configurations. This is compared to the full complexity of the wave function, which for this small system with eight fermionic degrees of freedom requires 4900 configurational amplitudes (reducing to 618 if only symmetry-inequivalent configurations are considered). However, while this compression is not in itself particularly noteworthy, it should be stressed that due to the size extensive nature of the state, it is not anticipated that the size of the basis set for a desired accuracy will increase with the size of the system. This observation is key to the development of numerical algorithms based on the GPS construction in Ref. 31, reaching well beyond the system sizes of conventional calculations.
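The quoted Hilbert space size follows from elementary counting, assuming (as is standard for these benchmarks) half filling with fixed particle number and spin projection, i.e., four up and four down electrons on eight sites:

```python
from math import comb

# Eight-site Hubbard chain at half filling with S_z = 0: choose the four
# up-spin occupied sites and the four down-spin occupied sites independently.
n_sites, n_up, n_down = 8, 4, 4
hilbert_size = comb(n_sites, n_up) * comb(n_sites, n_down)  # 70 * 70
```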

Furthermore, while the specific accuracy and basis set sizes vary with *U*/*t* in Fig. 4, these differences are relatively minor, and the overall trend is largely independent of the interaction strength. This bodes well for the GPS within the Bayesian learning framework to be applicable to significantly different physical regimes with similar accuracy. The optimized kernel hyperparameters are also largely independent of the specified noise $\tilde{\sigma}^2$, potentially rendering the optimization of the hyperparameters an optional step. The optimized values of *γ* range between ≈0.9 and ≈2.8, resulting in a suppression of long-range features, which is similar to the generic distance weighting proposed in Ref. 31 of 1/|*r*_{i} − *r*_{0}|. Of course, further flexibility can also be included into the kernel function with additional hyperparameters, such as a fully flexible distance weighting.^{31} The scope to improve the GPS form with more general “kernel learning” of the model complexity will be explored in future work.

## IV. EXTRACTION OF PHYSICAL INFORMATION FROM A GAUSSIAN PROCESS STATE

Having defined a general scheme to learn a compact GPS representation including its kernel hyperparameters from given wave function data, we then in theory have access to the wave function and, therefore, all physical (static) observables. However, by learning the kernel hyperparameters, we also potentially have access to a characterization of the state, which does not correspond simply to a physical observable but can still provide insight into the system. For the hyperparameters used in this work, the value of *θ* can provide information on the order or rank of correlations, and *γ* can provide a physical length scale for these correlations required to reach the target accuracy with the model. The optimized hyperparameters, therefore, provide meaningful insight about the underlying physics of the modeled target state.

The top panel of Fig. 5 shows the trends in the optimized hyperparameters obtained from learning the ground state of the Hubbard chain for different values of the interaction strength *U*/*t*. The optimized *γ* defines an (inverse) length scale for the correlations and shows an overall increase across the range of positive *U*/*t*, increasing from a value of ≈0.6 at *U*/*t* = 0 to a value of ≈1.9 at *U*/*t* = 10. In contrast, the optimal *θ* hyperparameter, defining an (inverse) correlation order weighting, decreases as the correlation strength increases, from *θ* ≈ 72 at *U*/*t* = 0 to *θ* ≈ 0.8 at *U*/*t* = 10. The optimized kernel hyperparameters, therefore, reflect the change in the correlations encoded in the wave function, transitioning from the low-order, long-range correlations characteristic of the mean field limit, where kinetic energy considerations dominate, to high-order but short-ranged correlation properties in the regime of strong interaction, where strong but short-ranged magnetic and charge quantum fluctuations dominate.

In the center panel of Fig. 5, we report the relative error in the energy of the modeled state (rather than the wave function amplitude error shown previously), as well as the number of selected basis configurations obtained in the GPS representations, as the correlation strength is changed. In the repulsive regime, the size of the basis set increases largely monotonically with interaction strength, ranging from 13 configurations at *U*/*t* = 0 to a total of 101 selected configurations at *U*/*t* = 10. While this increase in the basis set size is also reflected by a decrease in the relative energy error in the strongly correlated regime of *U*/*t* ⪆ 3.5, this is not true for smaller values. The maximum of the relative energy error at around *U*/*t* ≈ 3.5 points to the significant challenge of modeling wave functions in intermediately correlated regimes with a highly sparse GPS, where there is competition between the long-range and short-range correlations. While it is not surprising that this intermediate regime, where no single physical feature dominates, is the one where the error in the GPS is highest, this may be addressed in the future by finding alternative kernel parameterizations that are more flexible and better suited for a description of this physics.

Insight about specific correlation patterns, beyond just the values of the hyperparameters, can also be extracted from the unique GPS model via an analysis of the basis set selected by the RVM with the optimized hyperparameters. To visualize the space of all many-body configurations of the underlying Hilbert space, we find a two-dimensional representation based on the *t-distributed stochastic neighbor embedding* (t-SNE)^{46} with the chosen kernel hyperparameters. In order to apply the t-SNE, we divide the Hilbert space into four different classes of configurations based on the occupancy of the central reference site used to define the (non-symmetrized) kernel of Eq. (7). The t-SNE algorithm then maps all configurations of each class into a two-dimensional space according to the scaled Hamming distance metric, *h*(**x**, **x**′)/*θ*, used in the kernel function. If two configurations have a small scaled Hamming distance, resulting from a large overlap in the feature space, then they are likely mapped to geometrically close points in the t-SNE representation.
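The distance matrix fed to the t-SNE can be sketched as follows; the elementwise comparison of occupancies and the weight-1 convention on the reference site are illustrative assumptions, not the exact definition of Eq. (7).

```python
import itertools
import numpy as np

theta, gamma = 2.0, 1.0
L = 6                              # small chain for illustration
sites = np.arange(L) - L // 2      # signed distance from the central reference site
d_ref = np.abs(sites).astype(float)
weights = np.ones(L)               # weight 1 on the reference site itself
weights[d_ref > 0] = d_ref[d_ref > 0] ** -gamma

# Fermionic site occupancies: 0 empty, 1 spin-up, 2 spin-down, 3 doubly occupied.
configs = np.array(list(itertools.product(range(4), repeat=L)))[:500]

# Scaled Hamming distance h(x, x')/theta: mismatching occupancies are
# down-weighted as |d|^-gamma with the distance d from the reference site.
diff = configs[:, None, :] != configs[None, :, :]
dist = (diff * weights).sum(axis=-1) / theta
```

The resulting symmetric matrix could then be handed to an embedding such as scikit-learn's `TSNE` with `metric='precomputed'` to produce plots in the spirit of Fig. 6.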

The t-SNE distributions for the Hilbert space basis configurations of the eight site Hubbard chain used in this work at *U*/*t* = 8 are shown in Fig. 6. Each scatter point represents one many-body configuration mapped to a position on the visualized 2D plane by the t-SNE algorithm. As can be seen, the form of the kernel results in a hierarchical clustering of the configurations, with a self-similar structure of the clusters on different levels. Each cluster of configurations splits into a further 16 clusters when zooming in to the next level. This form arises due to the 16 different additional correlation features that emerge when considering all possible occupancies of the sites one step further from the reference (16 = 4^{2} for all Fock states of the two additional sites at the next distance from the reference site). This structure arises mathematically from the distance weighting of the features, which scales the contribution of a site at distance *d* as |*d*|^{−γ}, weighting non-matching occupancies less for sites further away from the central reference site. Consequently, this results in a fractal structure where each cluster can again be separated into different clusters based on the occupancies of the sites directly adjacent to the sites characterizing the higher level cluster, as shown in Fig. 6.

While this pictorial representation of the kernel function between configurations is interesting in its own right, we are also then able to use this representation to analyze the position of the specific basis configurations selected by the RVM to characterize the GPS. These configurations will likely exhibit the most relevant correlation features present in the state. In Fig. 6, these are marked in red for the *U*/*t* = 8 GPS. However, of more interest is how this selection of the basis configurations changes with the character of the wave function and correlation strength. This is shown in Fig. 7 for representations learned from ground states at different values of the interaction strength *U*/*t*.

In addition to indicating the selected basis configurations in Fig. 7, the choice of color reflects the magnitude of the optimized weights associated with these basis configurations. The largest positive weights are denoted with warmer (red) colors, the largest negative weights are denoted with colder (blue) colors, and unselected basis configurations are shown in gray. We find that in the *attractive* Hubbard model, characterized by *U*/*t* < 0, the selected basis configurations predominantly reside in the sectors of the Hilbert space corresponding to singly occupied reference sites. As *U*/*t* increases and the system transitions into the more common *repulsive* Hubbard regime, the chosen basis configurations with dominant weights are instead found in the sectors of the Hilbert space with unoccupied or doubly occupied central sites. This indicates a change in the dominant character of the GPS, whereby it goes from suppressing correlation features with single occupancy in the attractive case (thereby promoting pairing order of the attractive electrons) to suppressing the double occupancy/holon character (thereby promoting single occupancy and magnetic order) in the repulsive Hubbard model.

The observation that, with increasing values of *U*/*t*, predominantly basis configurations with empty or doubly occupied reference sites are selected can be quantified by analyzing the fraction of selected basis configurations with such reference site occupancies. This is shown in the bottom panel of Fig. 5 and confirms that in the strongly attractive regime with *U*/*t* ⪅ −4, only ∼10%–20% of the ten most relevant selected basis configurations (i.e., the configurations with the smallest optimized *α* values) have a doublon or holon occupancy on the reference site. For increasingly repulsive effective interaction strengths, the fraction of basis configurations with such doublon or holon occupancies among the most relevant basis configurations increases, reaching ≈90% for *U*/*t* ⪆ 8 and confirming the hypothesis from Fig. 7. Further insight into longer-ranged orderings, including a striped or inhomogeneous character in the dominant correlation features, can be extracted by considering the prevalence and weighting of longer range features present in the chosen basis configurations. This analysis of the learned GPS underlines how the physically motivated definition of the model makes it possible to easily interpret the character of the correlations that emerge from this Bayesian learning scheme and to gain physical insight into the target state beyond just the extraction of accurate physical observables.
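The bookkeeping behind the bottom panel of Fig. 5 reduces to counting reference-site occupancies among the most relevant selected configurations. A sketch with mock data (the 0/1/2/3 encoding for empty/up/down/doubly occupied sites is an assumption for illustration):

```python
import numpy as np

# Mock "selected basis configurations" of an eight-site chain; in practice,
# these would be the configurations with the smallest optimized alpha values.
L = 8
ref = L // 2                                  # central reference site
mock_selected = np.array([
    [1, 2, 1, 2, 3, 2, 1, 2],                 # doublon on the reference site
    [1, 2, 1, 2, 0, 2, 1, 2],                 # holon on the reference site
    [1, 2, 1, 2, 1, 2, 1, 2],                 # singly occupied reference site
    [2, 1, 2, 1, 3, 1, 2, 1],                 # doublon on the reference site
])

# Fraction of selected configurations whose reference site is empty (holon)
# or doubly occupied (doublon).
on_ref = mock_selected[:, ref]
frac_doublon_holon = np.mean((on_ref == 0) | (on_ref == 3))
```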

## V. CONCLUSION

In this work, we have given an overview of how a supervised learning scheme can be applied to learn a compact and accurate representation of correlated many-body states in the form of a Gaussian process state from given data of the target state. Due to its particular form, corresponding to a linear model in some higher dimensional feature space, the GPS is particularly well suited for regression with statistical methods based on Bayesian inference. A central conclusion of this work is that the optimization of the marginal likelihood for basis selection yields representations balancing the sparsity and accuracy of the model without the need for additional cross-validation of the model. The introduced supervised learning scheme based on the relevance vector machine, therefore, provides an efficient tool to compress given wave function data to a compact but highly expressive functional model in the form of a GPS. Because the central building block of the GPS, the kernel function, can be motivated from physical insight into the feature space it implicitly represents, it is possible to easily extract meaningful characteristics of the modeled state from the learned representation and obtain additional insight into the underlying physical properties of the state. In contrast to ansätze based on explicit physical representations, such as the Jastrow ansatz, the GPS is not limited or biased in its expressibility, making it possible to model arbitrary correlation properties of the state with the chosen kernel function, which can still be evaluated in polynomial time.

The compression scheme presented in this work can easily be applied to a range of different applications where it is desired to obtain a compact functional form of a wave function from some limited observed data. In addition to the previously developed numerical schemes for ground states of larger lattice systems that do not allow for an exact numerical treatment, the compact and interpretable functional forms of the GPS could similarly be used for problems such as finding higher excited states, modeling many-body dynamics,^{29,30,47} describing *ab initio* chemical systems,^{24–27,48} or quantum state tomography.^{28,49,50} Although the range of potential applications is the same as for the NQS, the particular form of the GPS model could make it a favorable and more interpretable ansatz in some instances, so that the GPS complements the set of machine learning inspired many-body representations.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## ACKNOWLEDGMENTS

The authors would like to thank Gábor Csányi for useful discussions in the development of this work. G.H.B. gratefully acknowledges support from the Royal Society via a University Research Fellowship and funding from the Air Force Office of Scientific Research via Grant No. FA9550-18-1-0515. This project has also received support from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 759063. The authors are grateful to the UK Materials and Molecular Modeling Hub for computational resources, which is partially funded by the EPSRC (Grant No. EP/P020194/1).