The Gaussian process (GP) is a popular statistical technique for stochastic function approximation and uncertainty quantification from data. GPs have been adopted into the realm of machine learning (ML) in the last two decades because of their superior prediction abilities, especially in data-sparse scenarios, and their inherent ability to provide robust uncertainty estimates. Even so, their performance highly depends on intricate customizations of the core methodology, which often leads to dissatisfaction among practitioners when standard setups and off-the-shelf software tools are being deployed. Arguably, the most important building block of a GP is the kernel function, which assumes the role of a covariance operator. Stationary kernels of the Matérn class are used in the vast majority of applied studies; poor prediction performance and unrealistic uncertainty quantification are often the consequences. Non-stationary kernels show improved performance but are rarely used due to their more complicated functional form and the associated effort and expertise needed to define and tune them optimally. In this perspective, we want to help ML practitioners make sense of some of the most common forms of non-stationarity for Gaussian processes. We show a variety of kernels in action using representative datasets, carefully study their properties, and compare their performances. Based on our findings, we propose a new kernel that combines some of the identified advantages of existing kernels.

## I. INTRODUCTION

The Gaussian process (GP) is arguably the most popular member of the large family of stochastic processes and provides a powerful and flexible framework for stochastic function approximation in the form of Gaussian process regression (GPR). A GP is defined as a collection of random variables such that any finite subset has a joint normal (Gaussian) distribution.^{1} In the scope of GPR, the random variables are considered (model or latent) function values {*f*(**x**_{1}), *f*(**x**_{2}), *f*(**x**_{3}), …}. Equivalently, we can understand a GP as a description of distributions over function spaces, in which case $GP(m(x),K)$ is entirely specified by a mean function $m(x)=E[f(x)]$ and a covariance $K=k(xj,xi;h)=E[(f(xi)\u2212m(xi))(f(xj)\u2212m(xj))]$, where *k*(**x**, **x**_{i}; *h*) is the kernel or covariance function, *h* ∈ Θ is a vector of hyperparameters, and $X$ is the input space. This role of the kernel as a covariance operator makes it arguably the most important building block when it comes to optimizing the flexibility, prediction accuracy, and expressiveness of a GP.

In a recent review,^{2} it was pointed out that the vast majority of applied studies using GPs employ the radial basis function (RBF) kernel (also referred to as the squared exponential or Gaussian kernel). The fraction is even higher when we include other stationary kernels. Stationary kernels are characterized by *k*(**x**_{i}, **x**_{j}) = *k*(|**x**_{i} − **x**_{j}|), i.e., covariance matrix entries only depend on some distance between data points in the input domain, not on their respective location. Stationary kernels are popular because they carry little risk—in terms of model misspecification—and come with only a few hyperparameters that are easy to optimize or train. However, it has been shown that the stationarity assumption can lead to poor prediction performance and unrealistic uncertainty quantification that is affected mostly by data-point geometry.^{3} In other words, uncertainty will increase when moving away from data points at a constant rate across the input space. To overcome these limitations, significant research attention has been drawn to non-stationary Gaussian process regression.^{4–6} Non-stationary kernels depend on the respective locations in the input space explicitly, i.e., *k*(**x**_{i}, **x**_{j}) ≠ *k*(|**x**_{i} − **x**_{j}|). This makes them significantly more flexible and expressive, leading to higher-quality estimates of uncertainty across the domain. Even so, non-stationary kernels are rarely used in applied machine learning (ML) due to the inherent difficulty of customization, optimizing hyperparameters, and the associated risks of model misspecification (wrong hyperparameters), overfitting, and computational costs due to the need to find many hyperparameters.^{7,8} When applied correctly, non-stationary GPs have been shown to provide significant advantages over their stationary counterparts, especially in scenarios where the data exhibit non-stationary behavior.^{9}

This paper aims to bring some structure to the use of non-stationary kernels and related techniques to make it more feasible for the practitioner to use these kernels effectively. Throughout this paper—in the hope of covering the most practical set of available options—we focus on and compare the following four ways to handle non-stationarity in a dataset:

Ignore it: Most datasets and underlying processes exhibit some level of non-stationarity, which is often simply ignored. This leads to the use of stationary kernels of the form

*k*_{stat}=*k*(|**x**_{i}−**x**_{j}|); |·| is a norm appropriate to the properties of the input space. Given the key takeaways in Ref. 2, this option is chosen by many, which serves as the main motivation for the authors to write this perspective.Parametric non-stationarity: Kernels of the form $k(xi,xj)=\u2211d=1Ngd(xi)gd(xj)kstat(|xi\u2212xj|)$, where

*g*_{d}(**x**) can be any parametric function over the input space and*N*is some positive integer.Deep kernels: Kernels of the form

*k*_{stat}(|(*ϕ***x**_{i}) −(*ϕ***x**_{j})|), whereis a neural network (NN), and*ϕ**k*_{stat}is any stationary kernel. This kernel was introduced by Wilson*et al.*^{10}and was quickly established as a powerful technique, even though potential pitfalls related to overfitting were also discovered.^{11}Deep GPs: Achieving non-stationarity not by using a non-stationary kernel but by stacking stationary GPs—meaning the output of one GP becomes the input to the next, similar to how neural networks are made up of stacked layers—introduces a non-linear transformation of the input space and thereby non-stationarity.

^{12}

Non-stationarity can also be modeled through the prior-mean function of a Gaussian process, but for the purpose of this paper, we use a constant prior mean and focus on the four methods described earlier. From the brief explanations of the three types of non-stationarity, it is immediately obvious that deriving, implementing, and deploying a non-stationary kernel and deep GPs (DGPs) can be an intimidating endeavor. The function *g*_{d} (in No. 2 above), for instance, can be chosen to be any sum of local basis functions—linear, quadratic, cubic, piecewise constant, etc.—any polynomial or sums thereof, and splines, just to name a few. Given the vast variety of neural networks (in No. 3), choosing ** ϕ** is no easier. DGPs involve stacking GPs, which increases flexibility but also leads to a very large number of hyperparameters, or latent variables, making optimization and inference a big challenge. In addition, a more complex kernel generally results in reduced interpretability—a highly valued property of GPs.

This paper is concerned with shining a light on the characteristics and performance of different kernels and experiencing them in action. In a very pragmatic approach, we study the performance of the three methodologies above to tackle non-stationarity and compare them to a stationary Matérn kernel to create a baseline. Our contributions can be summarized as follows:

We strategically choose a representative set of test kernels and investigate their properties.

We choose test datasets to study the performance of the chosen kernels.

We derive performance and (non)-stationarity metrics that allow fair comparisons of the non-stationary kernels and the selected deep GP implementation.

We compare the performance of the candidate kernels tasked with modeling the test datasets and present the uncensored results.

We use the gathered insights to draw some conclusions and suggest a new kernel design for further investigation.

We identify and exploit two primary pathways to achieve non-stationary properties: through the application of deep architectures—deep kernels or deep GPs—and the utilization of parametric non-stationary kernels. Our work delves into the role of parametric non-stationary kernels, arguing that these kernels offer similar behavior to a deep-kernel architecture without any direct notion of deepness, hinting at the possibility that deepness and flexibility are practically synonymous. This provides a valuable computational perspective to the broader dialog surrounding deep architectures and non-stationarity, offering insights that may be instrumental in future research.

The remainder of this paper is organized as follows: The next section introduces some minimal but necessary theory. There, we also introduce some measures that will help us later when it comes to investigating the performance of different kernels. Then, we give an overview of the relevant literature and some background. Subsequently, we introduce a range of different kernels and some test datasets, which position the reader to see each kernel in action in different situations. We then discuss the key takeaways; we give some pointers toward improved kernel designs, make remarks about the connection between non-stationary kernels and multi-task learning, and conclude with a summary of the findings.

## II. PRELIMINARIES

The purpose of this section is to give the reader the necessary tools to follow along with our computational experiments. It is not intended to be a complete or comprehensive overview of GP theory or related methodologies. For the remainder of this paper, we consider some input space $X\u2282Rn$ with elements **x**. While we often think of the input space as a subspace of the Euclidean space, this is not a necessary assumption of the framework generally. We assume data $D$ is a set of tuples ${xi,yi}\u2200i={1,2,3,\u2026,|D|}$. In this work, except for a comment on multivariate GPs at the end, we will assume scalar *y*_{i}.

### A. GPs in a nutshell

*y*(

**x**) =

*f*(

**x**) +

*ϵ*(

**x**), where

*f*(

**x**) is the unknown latent function,

*y*(

**x**) is the noisy function evaluation (the measurement),

*ϵ*(

**x**) is the noise term, and

**x**is an element of the input space $X$. For a fixed, finite set of inputs $D={x1,\u2026,x|D|}$, a Gaussian process implies that the vector $f=(f(x1),\u2026,f(x|D|))$ has a multivariate Gaussian distribution

**K**is the covariance matrix defined by the kernel function

*K*

_{ij}=

*k*(

**x**

_{i},

**x**

_{j}), and

**m**is the prior mean obtained by evaluating the prior-mean function at points

**x**

_{i}. Conditioning on

**f**, we can define a likelihood for the corresponding vector of data, $y=(y(x1),\u2026,y(x|D|))$, denoted

*p*(

**y**|

**f**), which allows us to perform Bayesian inference, typically (but not necessarily) using a multivariate Gaussian distribution

**V**is used to describe measurement errors arising from

*ϵ*(

**x**). Training the hyperparameters, extending the prior over latent function values at points of interest, marginalizing over

**f**, and conditioning on the data

**y**yield the posterior probability density function (PDF) for the requested points.

^{1}For example, for a Gaussian process with Gaussian likelihood, marginalization over

**f**can be obtained in closed form

*h*when the data

**y**are given,

*h*. After the hyperparameters are found, the posterior is defined as

**=**

*κ**k*(

**x**

_{0},

**x**

_{j}), $K=k(x0,x0)$, and

**m**

_{0}is the prior mean function evaluated at the prediction points

**x**

_{0}. If, in a Bayesian setting, we want to further incorporate uncertainties in the hyperparameters

*h*∈ Θ, Eq. (5) becomes

*p*(

**f**

_{0}|

**y**,

*h*) is available in closed form (5). This basic framework can be extended by more flexible mean, noise, and kernel functions.

^{2,13–15}Much of the GP framework depends on the ability to extend the original prior

*p*(

**f**) to the joint distribution

*p*(

**f**,

**f**

_{0}), which is made possible through the definition of the kernel as a covariance operator that can be evaluated on pairs of arbitrary inputs (

**x**

_{i},

**x**

_{j}).

### B. The covariance operator of a GP: The kernel or covariance function

**K**in Eq. (1) are defined by

*K*

_{ij}=

*k*(

**x**

_{i},

**x**

_{j}). Because of that, kernel functions have to be symmetric and positive semi-definite (psd); a complete characterization of the class of valid kernel functions is given by Bochner’s theorem.

^{16,17}Kernels are called “stationary” if the function only depends on the distance between the two inputs, not their respective locations, i.e.,

*k*(

**x**

_{i},

**x**

_{j}) =

*k*(|

**x**

_{i}−

**x**

_{j}|). The arguably most prominent members are kernels of the Matérn class (see, for instance, Ref. 18), which includes the exponential

^{2}here, ‖·‖

_{2}is the Euclidean norm, which makes the kernel function both stationary and isotropic. These two special cases of the Matérn kernel class have two hyperparameters:

*σ*

_{s}, the signal variance, and

*l*, the length scale, both of which are constant scalars applied to the entire domain $X$. Stationary and isotropic kernels can be advanced by allowing anisotropy in the norm, i.e., $((xi\u2212xj)TM(xi\u2212xj))1/2$, where

**M**is some symmetric positive definite matrix whose entries are typically included in the vector of hyperparameters that need to be found. Incorporating anisotropy allows the kernel function to stretch distances such that the implied covariances have ellipsoidal patterns (as opposed to spherical patterns for isotropic kernels). When $M=1lIn$, the isotropic kernels in Eqs. (7) and (8) are recovered; more generally, when

**M**is diagonal with different elements along the diagonal, automatic relevance detection is recovered.

^{19}Other stationary kernel designs include the spectral kernel, the periodic kernel, and the rational quadratic kernel.

Non-stationary kernels do not have the restriction of only depending on the distance between the input points but depend on their explicit positions **x**_{i} and **x**_{j}, i.e., *k*(**x**_{i}, **x**_{j}) ≠ *k*(|**x**_{i} − **x**_{j}|). Formulating new non-stationary kernels comes with the difficulty of proving positive semi-definiteness, which is often challenging. However, the statistics and machine learning literature has a variety of general approaches for applying valid non-stationary kernels; this is the primary topic of this paper, which is discussed in Sec. III. For now, it is important to keep in mind that the essence of stationary vs non-stationary kernels—due to the way they deal with the locations in the input space—manifests itself in the covariance matrix which, in the stationary case, has to be constant along all bands parallel to the diagonal for sorted and equidistant input points, a property that is not observed in the non-stationary case (see Fig. 1). Therefore, non-stationary kernels lead to more descriptive, expressive, and flexible encodings of covariances and, therefore, uncertainties. Of course, in addition, the function space contains a much broader class of functions when non-stationary kernels are used.

### C. A note on scalability

Computational complexity is a significant challenge for GPs for both stationary and non-stationary kernels. The need to calculate the log-determinant and invert the covariance matrix—or solve a linear system instead—results in a computational complexity of $O(|D|3)$, where $|D|$ is the number of data points.^{1,20} This complexity limits the application of GPs to large-scale datasets. Several methods have been proposed to overcome this issue. Sparse methods^{20,21} and scalable GPR approaches^{22} have been developed for stationary GPs. For non-stationary GPs, methods such as local GPs^{23} and the Bayesian treed GP^{24} have been proposed to tackle this issue. These methods have an origin in divide-and-conquer methods attempting to break down the problem into smaller, manageable pieces that can be solved independently,^{25} thereby reducing the computational complexity for each piece. However, these approaches can lead to a loss of global information, so finding the right balance between computational efficiency and model accuracy remains a key challenge. A recent approach can scale GPs to millions of data points without using approximations by allowing a compactly supported, non-stationary kernel to discover naturally occurring sparsity.^{8}

## III. NON-STATIONARY KERNELS

Stationary kernels are widely used primarily because they are easy and convenient to implement, even though the implied assumption of translation-invariant covariances is almost never exactly true for real-world datasets. As mentioned in Sec. II, there are serious challenges associated with both deriving non-stationary kernels and choosing an appropriate and practical non-stationary kernel from the valid options for any given implementation of a Gaussian process. We now provide a brief overview of the literature on non-stationary kernels, including both a historical perspective of early developments followed by greater detail on three modern frameworks for non-stationary kernels as well as metrics for quantifying non-stationarity in datasets.

### A. Historical perspective

It has now been over three decades since the first paper on non-stationary kernels via “deformations” or warping of the input space appeared.^{4} Since then, the statistics literature has developed a number of approaches for non-stationary kernels, mostly in the context of modeling spatially referenced data. These methods can broadly be categorized as basis function expansions and kernel convolutions, in addition to the aforementioned deformation approach. We now briefly summarize each method, focusing on aspects that apply directly to kernel functions for Gaussian processes.

^{4}is that instead of deriving new classes of non-stationary kernels, one can keep isotropic kernels but obtain non-stationarity implicitly by rescaling interpoint distances in a systematic way over the input space. In other words, this approach transforms $X$ to a new domain, say $X*$, wherein stationarity holds. The transformation, say $\varphi :Rn\u2192Rn*$, is a (possibly non-linear) mapping applied to elements of $X$ to yield a non-stationary kernel via

*k*

_{stat}is an arbitrary stationary kernel function. Two extensions were later proposed to this approach

^{26,27}that supposed the mapping

**(·) was itself a stochastic process. For example,**

*ϕ*^{27}place a Gaussian process prior to

**(·)—essentially coming up with the idea of deep kernels more than a decade before related ideas appeared in the machine learning literature. In some cases,**

*ϕ**n** >

*n*, i.e., the mapping involves dimension expansion.

^{28}Ultimately, early approaches to warping the input space were largely unused due to a lack of computational tools for optimizing the mapping function

**(·) in a reliable and robust manner.**

*ϕ*^{29,30}of a (mean-zero) stochastic process in terms of orthogonal eigenfunctions

*E*

_{m}(·) and weights

*w*

_{m},

*v*

_{m}come from the Fredholm integral equation

*M*terms, the finite sum approximation to the kernel can be used instead and is optimal in the sense that it minimizes the variance of the truncation error for all sets of

*M*basis functions when the

*E*

_{m}(·) are the exact solutions to the Fredholm equation.

^{31}The main task is then to model the weight-eigenfunction pairs {

*w*

_{m},

*E*

_{m}(·)}, which can be performed empirically using singular value decomposition

^{32}or parametrically using, e.g., wavelets.

^{33}

*W*(·) is an

*n*-dimensional stochastic process and

*κ*

_{x}(·) is a kernel function that depends on input location

**x**,

^{36}summarizes the extremely flexible class of stochastic models defined using Eq. (12): (see, for example, Refs. 37–39). The popularity of this approach is due largely to the fact that it is much easier to specify (possibly parametric) kernel functions than a covariance function directly since the kernel functions only require $\u222bRd\kappa x(u)du<\u221e$ and $\u222bRd\kappa x2(u)du<\u221e$. The process

*f*(·) in Eq. (12) is a Gaussian process when

*W*(·) is chosen to be Gaussian, and the associated covariance function is

**x**

_{i}−

**x**

_{j}‖ and is hence non-stationary. Various choices can be made for using this general framework in practice: replace the integral in Eq. (12) with a discrete sum approximation

^{40}or choose specific

*κ*

_{x}(·) such that the integral in Eq. (13) can be evaluated in closed form.

^{41}The latter choice can be generalized to yield a closed-form kernel function that allows all aspects of the resulting covariances to be input-dependent: the length-scale,

^{5,6}the signal variance,

^{42}and even the differentiability of realizations from the resulting stochastic process when the Matérn kernel is leveraged.

^{43}This approach is often referred to as “parametric” non-stationarity since a non-stationary kernel function is obtained by allowing its hyperparameters to depend on the input location. In practice, some care needs to be taken to ensure that the kernel function is not too flexible and can be accurately optimized.

^{6,44}We return to a version of this approach in Sec. III B.

In conclusion of this section, the statistics literature contains a broad set of techniques (only some of which are summarized here) for developing non-stationary kernel functions. However, historically speaking, these techniques were not widely adopted because, in most of the cases described here, the number of hyperparameters is in the same order as the number of data points. This property makes it very difficult to apply the kernels to real-world datasets due to the complex algorithms required to fit or optimize such models.

Nonetheless, the potential benefits of applying non-stationary kernels far outweigh the risks, in our opinion, and this perspective is all about managing this trade-off. To do so, we now introduce three modern approaches to handle non-stationarity in datasets in order to later test them and compare their performance (Sec. IV).

### B. Non-stationarity via a parametric signal variance

*g*(

**x**

_{1})

*g*(

**x**

_{2}) is always symmetric and psd

^{14}and is, therefore, a valid kernel function. In addition, any product of kernels is a valid kernel, which gives rise to kernels of the form

^{42}in particular, consider parametric signal variance and (anisotropic) length scale. In an extension of (14), any sum of kernels is a valid kernel, which allows us to write

*g*can be any function defined on the input domain, but we will restrict ourselves to functions of the form

*c*

_{k}are some coefficients (or parameters), and

*β*(

**x**

_{k},

**x**) are basis functions centered at

**x**

_{k}. For our computational experiments, we use radial basis functions of the form

*w*is the width parameter.

### C. Deep kernels

**due to their flexible approximation properties, which give rise to so-called deep kernels (see Algorithm 1 for an implementation example). For our tests, we define a 2-layer-deep network with varying layer widths. While it is possible through deep kernels to perform dimensionality reduction, in this work, we map the original input space into a linear space of the same dimensionality, i.e.,**

*ϕ**n*=

*n**. Care must be taken not to use neural networks whose weights and biases, given the dataset set size, are underdetermined. That is why comparatively small networks are commonly preferred. We use ReLu as an activation function. The neural network weights and biases are treated as hyperparameters and are trained accordingly. We set

*k*

_{stat}in (18) to be the Matérn kernel with

*ν*= 3/2.

Input Set of points as arrays x_{1}, x_{2}, hyperparameters h ∈ Θ |

Output Deep kernel covariance matrix of shape [len(x_{1}) × len(x_{2})] |

Step 1: Define the neural network architecture with input dimension n, hidden layers, |

and appropriate activation functions |

Step 2: Initialize or set the weights and biases of the neural network using h |

Step 3: Transform x_{1} and x_{2} through the neural network |

• Apply the forward pass of the neural network to all points in sets x_{1} and x_{2} |

• Store the transformed values as x_{1nn} and x_{2nn} |

Step 4: Calculate the pairwise distance matrix between x_{1nn} and x_{2nn} |

Step 5: Compute the deep kernel value using the distance matrix |

• Apply a kernel function (e.g., exponential) to the distance matrix |

• Scale and combine with other kernel functions if needed |

Step 6: Return the deep kernel value |

Input Set of points as arrays x_{1}, x_{2}, hyperparameters h ∈ Θ |

Output Deep kernel covariance matrix of shape [len(x_{1}) × len(x_{2})] |

Step 1: Define the neural network architecture with input dimension n, hidden layers, |

and appropriate activation functions |

Step 2: Initialize or set the weights and biases of the neural network using h |

Step 3: Transform x_{1} and x_{2} through the neural network |

• Apply the forward pass of the neural network to all points in sets x_{1} and x_{2} |

• Store the transformed values as x_{1nn} and x_{2nn} |

Step 4: Calculate the pairwise distance matrix between x_{1nn} and x_{2nn} |

Step 5: Compute the deep kernel value using the distance matrix |

• Apply a kernel function (e.g., exponential) to the distance matrix |

• Scale and combine with other kernel functions if needed |

Step 6: Return the deep kernel value |

Our deep kernel construction shares the perspective of using a neural network to estimate a warping function.^{4} In Ref. 45, the authors propose an approach to deep compositional spatial models that differs from traditional warping methods in that it models an injective warping function through a composition of multiple elemental injective functions in a deep-learning framework. This allows for greater flexibility in capturing non-stationary and anisotropic spatial data and is able to provide better predictions and uncertainty quantification than other deep stochastic models of similar complexity. This uncertainty quantification is point-wise, similar to the deep GPs we introduce next.

### D. Deep GPs

Deep Gaussian process (DGP) models are hierarchical extensions of Gaussian processes where GP layers are stacked—similar to a neural network—enhancing modeling flexibility and accuracy^{12,46,47} (more details can be found in Appendix D). The DGP model is one of the deep hierarchical models^{48–50} and consists of a number of variational Gaussian process layers defined by a mean function and a stationary covariance (kernel) function.

Let us first consider a simple 2-layer DGP taking *n*-dimensional inputs **x**. The first layer uses a constant zero-mean *GP*^{(1)} for a lower-dimensional representation of the input data, with its coordinate components distributed as $fi(1)\u2223x\u223cNn(0,\Sigma 1(x))$ and $f(1)=(f1(1),\u2026,fn1(1))$, where *n*_{1} is the number of nodes in the hidden layer where each component $fi(1)$ is mutually independent.

The second layer uses a constant zero mean and takes the first layer’s output. Its output *f*^{(2)} is distributed as *f*^{(2)}∣*f*^{(1)} ∼ *N*_{n}(0, Σ_{2}(*f*^{(1)})) and is fed into an independent GP *GP*^{(2)} as the second layer, generating the model output. We can impose more than two hidden layers for a single DGP according to our needs, where $f(1)\u2208Rn\xd7n1,f(2)\u2208Rn1\xd7n2,\cdots f(L)\u2208RnL\u22121\xd7n$, with layouts as random matrices. This is analogous to the connecting matrices in the architecture of a neural network. However, we match the two-layer structure used in our deep kernel and point out that the complexity of the neural architecture may not always lead to better performance.

For *L* > 2 layers in a DGP, we iterate this procedure of using the previous layer’s output as the next layer’s GP input until reaching the last layer. For an *L*-layer DGP architecture, this creates a sequence of *L* independent GPs, *GP*^{(1)}, *GP*^{(2)}, …, *GP*^{(L)}, corresponding to each layer. Each layer’s forward method applies the mean and covariance functions to input data and returns a multivariate normal distribution as output, which serves as the next layer’s input.

*L*layers, we can represent the model as follows using unique mean functions

*m*

^{(1)}, …,

*m*

^{(L)}and covariance kernels

*k*

^{(1)}, …,

*k*

^{(L)}that generate Σ

_{1}, …, Σ

_{L}used in the probabilistic distributions,

^{51}The deep Gaussian Processes (GPs) can be perceived as hierarchical models whose kernel does not admit a closed form. Crucially, this “hierarchy of means” is constructed via the means of the layer-distributions in the deep GP, but not higher moments.

^{52–54}Only the mean functions at each layer of the deep GP are contingent upon computations from preceding layers, signifying hierarchies that rely on the first-order structure at every layer of the model.

Using a slightly different framework based on the Vecchia approximation,^{55} we introduced a “deep Vecchia ensemble,” a hybrid method combining Gaussian processes (GPs) and deep neural networks (DNNs) for regression. This model builds an ensemble of GPs on DNN hidden layers, utilizing Vecchia approximations for GP scalability. Mathematically, the joint distribution of variables is decomposed using Vecchia’s approximation, and predictions are combined using a generalized product of experts. This approach offers both representation learning and uncertainty quantification. As described in the last section, a GP model can utilize a deep kernel, constructively combining the neural network’s and the GP’s strengths, leading to a model that benefits from the GP’s interpretability and the NN’s flexibility.^{10}

Our Bayesian DGP (BDGP) architecture follows^{56} and includes a two-layer neural network applied as a transformation to the input data. The first layer uses a rectified linear unit (ReLU) activation function, and the second employs a sigmoid activation. This non-linear feature mapping expresses complex input space patterns (see Appendix C).

Contrasting DGPs with deep-kernel GPs, DGPs use multiple GP layers to capture intricate dependencies, whereas deep-kernel GPs employ a NN for input data transformation before GP application. Essentially, while DGPs exploit GP layering to manage complex dependencies, deep kernel learning leverages NNs for non-linear input data transformation, enhancing the GP’s high-dimensional function representation ability.

### E. Measuring non-stationarity of datasets

When it comes to characterizing non-stationarity, some methods focus on non-stationarity in the mean function (e.g., polynomial regression), while others concentrate on non-stationarity in the variance (e.g., geographically weighted regression^{57}). Non-stationarity is typically characterized by a change in statistical properties over the input space, e.g., changes in the dataset’s mean, variance, or other higher moments. Quantifying non-stationarity is an active area of research, and in this paper, we introduce a particular kind of non-stationarity measure for the purpose of judging our test kernels when applied to our test datasets. Overall, measuring a given dataset’s non-stationarity properties is an important ingredient in understanding the performance of a particular kernel. For the reader’s convenience, we offer our non-stationarity measure as a pseudocode (see Algorithm 2). For a detailed theoretical motivation for the non-stationarity measure we use in this work, please refer to Appendix A.

1: procedure MEASURE NONSTATIONARITY (X, y, m, size, c) |

⊳ X is the set of data points, y is the collected data, m is the |

numberof iterations, size is the size of the subdomain, |

c is the number of selected points in each iteration |

2: length scale list = [] |

3: signal variance list = [] |

4: for i in 0 to m do |

5: a ∼ U([0,1 − size]^{n}) |

6: b = a + size |

7: select c test data points X_{t} from [a,b]^{n}: |

8: select associated y_{t} in the dataset or evaluate synthetic |

function at X_{t} |

9: initialize a stationary GP |

10: signal variance, length scale = run_MLE(X_{t}, y_{t}) |

11: append new signal variance to the signal variance list |

12: append new length scale to the length scale list |

13: end for |

14: return signal variance list, length scale list |

15: end procedure |

1: procedure MEASURE NONSTATIONARITY (X, y, m, size, c) |

⊳ X is the set of data points, y is the collected data, m is the |

numberof iterations, size is the size of the subdomain, |

c is the number of selected points in each iteration |

2: length scale list = [] |

3: signal variance list = [] |

4: for i in 0 to m do |

5: a ∼ U([0,1 − size]^{n}) |

6: b = a + size |

7: select c test data points X_{t} from [a,b]^{n}: |

8: select associated y_{t} in the dataset or evaluate synthetic |

function at X_{t} |

9: initialize a stationary GP |

10: signal variance, length scale = run_MLE(X_{t}, y_{t}) |

11: append new signal variance to the signal variance list |

12: append new length scale to the length scale list |

13: end for |

14: return signal variance list, length scale list |

15: end procedure |

To avoid bias through user-based subset selection, we draw the location of a subdomain from a uniform distribution over the input domain (in this case [0, 1]^{n}). We then select a set of data points randomly from within this subdomain 100 times and use MLE [maximum likelihood estimation, maximizing Eq. (4)] to get a stationary length scale and signal variance in each iteration. In the case of a synthetic function, the locations of the data points are drawn from a uniform distribution over the subdomain; in the case of a fixed dataset, the points are randomly selected. The final distribution of all signal variances and length scales is then assessed to measure non-stationarity (see Algorithm 2).

To test our non-stationary measure, we applied it to three synthetic functions (see Fig. 2). In each case, we indicate the final variance of the length scale and signal variance distributions. In the first scenario (first row in Fig. 2), where the signal is a high-frequency sine function, the algorithm’s behavior leads to a high concentration of points in a single compact cluster when plotting the estimated length scale vs the signal variance. This concentration reflects the inherent stationarity of a signal, where the statistical properties do not change over the input space. The unimodal and concentrated distributions of both estimated parameters in the violin plots further corroborate this observation. The consistency in the length scale and signal variance across multiple iterations of MLE indicates that the underlying data structure is stationary.

In the second case (second row in Fig. 2), the signal is a trigonometric curve with local oscillations. The algorithm’s response to this function results in a wider spread of both hyperparameters compared to the first test function. This behavior is reflected in the violin plots on the right.

The third scenario (third row in Fig. 2) presents a more complex signal structure, a trigonometric curve with varying amplitude and frequency. The algorithm’s reaction to this non-stationary signal is manifested in a broad distribution of length scales and some spread in the signal variance. These distributions are reflected in the violin plots.

The three cases in Fig. 2 demonstrate the algorithm’s capability to measure non-stationarity through the local, stationary GP-hyperparameter distributions. The varying behaviors in clustering and distribution of the estimated parameters across the three cases provide insights into the underlying stationarity or non-stationarity of the signals. The algorithm’s sensitivity to these variations underscores its potential as a valuable tool for understanding and characterizing non-stationarity in different contexts.

### F. Performance measures

*y*

_{i}are the data values of the test dataset and $f0i$ are the posterior mean predictions. The RMSE metric provides a measure of how closely the model’s predictions align with the actual values—approaching zero as fit quality improves—while the log marginal likelihood evaluates the fit of the Gaussian Process model given the observed data. The log marginal likelihood will increase as the model fits the data more accurately. The CRPS is defined as

*ψ*is the probability density function of a standard normal distribution and Ψ is the associated cumulative distribution function. For a GP,

*f*

_{0}is Gaussian with mean

*μ*and variance

*σ*

^{2}. The CRPS is negative and approaches zero as fit quality improves. In our computational experiments, we use the CRPS as a performance metric on a set of points by averaging (23) overall predictions. The CRPS is arguably the more important score compared to the RMSE because it is

*uncertainty aware*. In other words, if the prediction accuracy is low but uncertainty in those regions is high—the algorithm is aware of its inaccuracy—the score improves.

Computational cost is becoming a main research topic in recent studies of large-scale non-parametric models,^{22,25,58} especially GPs.^{20,59–61} In our analysis, we sought to examine the progression of the optimization process. To achieve this, we established a callback function during the optimization phase, tracking the RMSE, the log marginal likelihood, and the CRPS as a function of compute time. In terms of interpretation, ideally, we expect the RMSE and the CRPS to increasingly approach zero over time, suggesting that the model’s predictive accuracy and estimation of uncertainties are improving. On the other hand, the log marginal likelihood should increase, indicating a better fit of the model to the observed data. This analysis gives us a summary of the model’s learning process and helps us understand the progression of the optimization, thus providing valuable insights into the efficacy of our model and the optimization strategy employed.

## IV. COMPUTATIONAL EXPERIMENTS

The purpose of this section is to see how different kernels and a deep GP deal with non-stationarity in several datasets and to compare the characteristics and properties of the solutions. To make this comparison fair and easier, we ran all tests on the same Intel i9 CPU (Intel Core i9-9900KF CPU @ 3.60 GHz × 8) and used the total compute time as the cost. As a performance metric, we calculate and observe the RMSE [Root Mean Squared Error, Eq. (22)], the CRPS [continuous rank probability score, Eq. (23)] of the prediction, both applied to a test dataset, and the log marginal likelihood of the observational data [Eq. (4)]. We attempted to run fair tests in good faith; this means the effort spent to set up each kernel or methodology was roughly proportional to a method’s perceived complexity, within reasonable bounds, similar to the effort expended by an ML practitioner—this meant minutes of effort for stationary kernels and hours to days for non-stationary kernels and DGPs. The optimizer to reach the final model was *SciPy*’s differential evolution. We used an in-house MCMC (Markov Chain Monte Carlo) algorithm to create the plots showing the evolution of the performance metrics over time. In cases where our efforts did not lead to satisfactory performance, we chose to present the result “as is” to give the reader the ability to judge for themselves. We always start our tests with a standard stationary kernel to create a baseline—stationary kernels with few hyperparameters are currently the most widely used kernel class. To further the hands-on aspect of this section, we also included the algorithms used in the Appendix and on a specifically designed website, along with links to download the data. The performance-measure-over-time plots were created without considering deep GPs due to incompatible differences in the implementations (see Appendix D). All computational experiments were run multiple times to make sure we showed representative results.

This section first introduces three datasets we will use later to evaluate the performance of the test methodologies. Second, we present the unredacted, uncensored results of the test runs. The purpose is not to judge some kernels or methodologies as better or worse universally, but to evaluate how these techniques perform when tested under certain well-defined conditions and under the described constraints. We encourage the reader to follow our tests, to rerun them if desired, and to judge the performance of the methods for themselves.

### A. Introducing the test datasets

We will consider three test datasets. All datasets are normalized such that the range and the image—the set of all measured function values—are in [0, 1].

Second, we consider a three-dimensional climate dataset that is available online (https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/), consisting of *in situ* measurements of daily maximum surface air temperature (°C) collected from weather stations across the contiguous United States (geospatial locations defined by longitude and latitude) over time (the third dimension). The data and its non-stationarity measures are presented in Fig. 4.

Third, we consider a dataset that was collected during an x-ray scattering experiment at the CMS beamline at NSLSII, Brookhaven National Laboratory. The dataset originated from an autonomous exploration of the multidimensional material state-spaces underlying the self-assembly of copolymer mixtures. Because the scientific outcome of this experiment has not been published yet, all scientific insights have been obscured by normalization and the removal of units. The dataset is presented in Fig. 5.

To evaluate the predictive performance metrics, we divided our datasets into training datasets and test datasets by random selection and applied the metrics only to the latter.

### B. Results

In this section, we present quantitative results of how the different kernels and deep GPs performed when tasked to learn the underlying data-generating latent functions that produced the test datasets introduced in the previous section. For each test, we show the model and its uncertainties across the domain or a subdomain after convergence of a global evolutionary optimization of the log marginal likelihood and the performance measures as a function of the compute time of an MCMC algorithm. Code snippets can be found in the Appendix and on our website (see the Data Availability paragraph at the end). To reiterate, for all tests, we put ourselves in the position of an ML practitioner setting up each algorithm under a *reasonable-effort* constraint—we did not optimize each method to its full extent because this would not lead to a fair comparison. However, we used our best judgment and followed online documentation closely. The same thought process was put into the exact design of the kernels; of course, one could always argue that a particular model would have performed better using more hyperparameters. For the fairness of the comparison, we kept the number of hyperparameters similar across the kernels for a particular computational experiment and increased the number of kernel parameters in a near-proportional fashion for the non-stationary kernels as we moved to higher dimensions. See Table I for the number of hyperparameters for the different experiments. All non-stationary kernels were implemented in the open-source GP package gpCAM (https://github.com/lbl-camera/gpCAM). We tried two different deep GPs, the GPflux package (https://github.com/secondmind-labs/GPflux) and the Bayesian deep GP (BDGP) by Refs. 56 and 62 (https://cran.r-project.org/web/packages/deepgp/index.html). We selected the latter in its two-layer version for our final comparisons because of performance issues with the GPflux package (see Fig. 7).

. | Number of hyperparameters . | ||
---|---|---|---|

. | per experiment . | ||

Kernel functions . | 1D synthetic . | 3D climate . | 3D x-ray . |

Stationary, k_{stat} | 2 | 3 | 3 |

Parametric non-stationary, k_{para} | 15 | 58 | 58 |

Deep kernel, k_{nn} | 48 | 186 | 186 |

. | Number of hyperparameters . | ||
---|---|---|---|

. | per experiment . | ||

Kernel functions . | 1D synthetic . | 3D climate . | 3D x-ray . |

Stationary, k_{stat} | 2 | 3 | 3 |

Parametric non-stationary, k_{para} | 15 | 58 | 58 |

Deep kernel, k_{nn} | 48 | 186 | 186 |

#### 1. One-dimensional synthetic function

*k*

_{stat}) is a Matérn kernel with

*ν*= 3/2

*l*is the length scale and

*d*= ‖

*x*

_{i}−

*x*

_{j}‖

_{2}= |

*x*

_{i}−

*x*

_{j}|. The parametric non-stationary kernel in this experiment was defined as

*g*

_{a}functions in the sum, a constant width of the radial basis functions for each

*g*

_{a}, and a constant length scale. The deep kernel is

*ϕ*is a fully connected neural network mapping $R\u2192R$, with ReLu activation functions and two hidden layers of width five, which led to a total of 48 hyperparameters (weights, biases, and one constant signal variance). The results, presented in Fig. 6, show a gradual improvement in approximation performance as more flexible kernels are used. The stationary kernel (

*k*

_{stat}) stands out through its fast computation time. However, the keen observer notices similar uncertainties independent of the local properties of the latent function; only point spacing is considered in the uncertainty estimate. That is in stark contrast to the parametric non-stationary kernel (

*k*

_{para}) and the deep kernel (

*k*

_{nn}), which both predict lower uncertainties in the well-behaved center region of the domain. This is a very desirable characteristic of non-stationary kernels. The deep kernel and parametric non-stationary kernel reached very similar approximation performance, but the deep kernel was by far the most costly to train. The BDGP predicts a very smooth model with subpar accuracy compared to the other methods. We repeated the experiment with different values for the nugget and let the algorithm choose the nugget without further success. This is not to say the method cannot perform better, but we remind the reader that we are working under the assumption of reasonable effort, which, in this case, was insufficient to reach a better performance. We share the code with the reader in the Appendix (see Appendix C) for reproducibility purposes. We also tested another DGP implementation without success (see Fig. 7). The MCMC sampling runs revealed what was expected, the stationary kernel converges most robustly; however, all kernels led to stable convergence within a reasonable compute time.

#### 2. The climate model

Our three-dimensional climate dataset was introduced in Sec. IV A. The stationary reference kernel (*k*_{stat}) is a Matérn kernel with *ν* = 3/2 [Eq. (25)]. In all cases, stationary and non-stationary, we added to the kernel matrix the noise covariance matrix $V=\sigma n2I$, where $\sigma n2$ is the nugget variance, treated as an additional hyperparameter. The parametric non-stationary kernel is similar to Eq. (26); however, we place radial basis functions (27) at {0,0.5,1}^{3} and added a nugget variance, leading to a total of 58 hyperparameters. The deep kernel $knn(\varphi (xi),\varphi (xj)),\varphi :R3\u2192R3$ has two hidden layers of width 10 but is otherwise identical to (28), yielding 186 hyperparameters. The results, two-dimensional slices through the three-dimensional input space, are presented in Fig. 8. For completeness, we included the deep GP result in Fig. 9, which, however, in our run, was not competitive. Once again, the stationary kernel (*k*_{stat}) delivers fast and robust results; however, it lacks accuracy compared to the parametric non-stationary kernel (*k*_{para}) and the deep kernel (*k*_{nn}). The performance of the two non-stationary kernels is on par, with a slight advantage in CRPS for the deep kernel and a significant advantage in compute time for the parametric non-stationary kernel. The MCMC (Fig. 10) sample runs revealed stable convergence, however, at significantly different time scales.

#### 3. The x-ray scattering model

Our three-dimensional x-ray scattering dataset was introduced in Sec. IV A. The kernel setup is identical to our climate example (see Sec. IV B 2). The results are presented in Fig. 11. This experiment shows improving performance as kernel complexity increases; however, the improvements are moderate. As before, if accuracy is the priority, non-stationary kernels should be considered. Again, the deep kernel (*k*_{nn}) performed very competitively—with a significant advantage in terms of RMSE and CRPS—but did not reach the same log marginal likelihood as the competitors, which can be traced back to solving a high-dimensional optimization problem due to a large number of hyperparameters. This opens the door to even better performance if more effort and time are spent training the model. What stands out in Fig. 11 is the smaller and more detailed predicted uncertainties for the deep kernel, which would affect decision-making in an online data acquisition context. In addition, the posterior mean has more intricate details compared to the stationary and parametric non-stationary kernels. Figure 12 reveals stable convergence of the MCMC sampling for all kernels at similar time scales, supporting the deep kernel as the superior choice for this model. We again included the BDGP, which, however, was not competitive (Fig. 13).

## V. DISCUSSION

### A. Interpretations test-by-test

The one-dimensional synthetic test function (24) exhibits a complex behavior that is captured by our stationarity analysis through a broad distribution in the length scale and signal variance. The scattering patterns reveal a multifaceted behavior, with different regions of the data exhibiting distinct statistical properties. The linear correlation within one cluster (left, scatter plot, Fig. 3) and the dispersion in the other cluster capture the interplay between the sinusoidal and quadratic components of the function. The linear cluster indicates a linear correlation between these two hyperparameters within a specific region of input data. This pattern may correspond to the sinusoidal components of the function, where local oscillations exhibit a consistent relationship between the length scale and signal variance. The additional dispersed points in the scatter plot likely correspond to the regions influenced by the quadratic term in the function, where the statistical properties vary, leading to a broader distribution of the hyperparameters. The bimodal distributions observed in the violin plot for the length scale further corroborate the non-stationarity. The results provide insights into the intricate interplay between the sinusoidal and quadratic components of the function. Due to the strong non-stationarity in the data, non-stationary kernels performed well for this test function. Due to the low dimensionality, the number of hyperparameters is low in all cases, leading to robust training. It is, in our opinion, safe to conclude that in one-dimensional cases with moderate dataset sizes and suspected non-stationarity in the data, non-stationary kernels are to be preferred. Both our parametric and deep non-stationary kernels performed well, with a slight edge in accuracy for the parametric kernel (see Fig. 6). Our two tested deep GP setups (Figs. 6 and 7) were not competitive given our reasonable-constraint effort, which in this case was in the order of days.

Moving on to the climate data example, Fig. 4 shows a quite concentrated cluster near (0,0), indicating weak non-stationarity. This concentration suggests that the statistical properties are consistent across this region, possibly reflecting a dominant pattern or behavior in the data. The computational experiments (see Fig. 11) reveal a trade-off between accuracy and compute time. One has to put much more effort—number of hyperparameters and time—into the computation for a relatively small gain in accuracy. Both the parametric non-stationary kernel and the deep kernel achieve higher accuracy than the stationary kernel but are costly. In time-sensitive situations, the stationary kernel is likely to be preferred; for best accuracy, the parametric non-stationary kernel or the deep kernel is the superior choice. Looking at the CRPS, the deep kernel has a slight edge in accurately estimating uncertainties over the parametric kernel; however, the parametric non-stationary kernel reaches the highest log marginal likelihood. For this dataset, we tested the BDGP without much success under our reasonable-effort constraint (see Fig. 9).

Finally, for the x-ray scattering data, Fig. 5 shows a high level of dispersion in the scatter and violin plots, indicating strong non-stationarity, which is further corroborated in the violin plots. This strong non-stationarity—particularly in the length scale—clearly benefits the performance of the deep kernel (see Fig. 11). The stationary kernel shows quite poor performance both in terms of RMSE and CRPS. Since the stationary kernels only allow us to estimate uncertainties based on the global properties of the data and local geometry, it is expected that non-stationarity kernels estimate uncertainties more accurately, which manifests itself in a higher CRPS. The parametric non-stationary kernel leads the field in log marginal likelihood. Surprisingly, among the three tested kernels, the deep kernel leads to by far the lowest log marginal likelihood, which, again, suggests a better optimizer might lead to much-improved performance. Once again, our setup of the BDGP was not competitive (see Fig. 13).

### B. Key takeaways from the computational experiments

While we included as much information in the computational experiments and the Appendix as possible to give the reader a chance to make up their own minds, here we summarize the following key takeaways:

Stationary kernels proved to be surprisingly competitive in terms of the RMSE and are unbeatable when evaluating accuracy per time. It seems worth it in most cases to run a stationary GP first for comparison.

Uncertainty is estimated more accurately by non-stationary kernels; if UQ-driven decision-making is the goal, non-stationary kernels are to be preferred. This is not a surprise since, given a constant (possibly anisotropic) length scale and a signal variance, the posterior variance only depends on data-point geometry.

The parametric non-stationary kernel encodes a flexible non-stationarity while maintaining interpretability. The involved parametric functions over the input space can be visualized and interpreted.

Deep kernels are some of the most flexible kernels, but interpretation is lost in all but the simplest cases, which can easily lead to model misspecifications (wrong model class and hyperparameters). Our models took a fair amount of experimenting before an acceptable performance was achieved. In online applications, in which testing beforehand is not possible, deep kernels should be used with caution.

Non-stationarity in the covariance structure appears in signal variance and length scale; ideally, a kernel addresses both [see Eq. (29)].

While not included in the tests, experimenting with prior mean functions has shown that non-stationarity properties highly depend on the prior mean function of the GP. This is especially true for the non-stationary signal variance.

Extrapolation and non-stationary kernels are difficult to combine. While the parametric non-stationary kernel can be set up for extrapolation, traditional neural networks are poorly equipped for that task.

We should think of the number of hyperparameters conservatively; too many bear the risk of model misspecification (through local minima) and overfitting.

The parametric non-stationary kernel achieved overall better RMSE; the deep kernel led to better uncertainty quantification, as indicated by the CRPS.

### C. A parametric deep kernel

^{6,42}generally in the same flavor as the parametric non-stationary signal variance. The deep kernel, on the other hand, only acts on the length scale by warping the input space. It seems logical to ask what happens if we combine the two concepts. The kernel

### D. Connection between multi-task learning and non-stationary kernels

Multi-task learning offers a powerful paradigm to leverage shared information across multiple related tasks, thereby enhancing the predictive performance of each individual task.^{63–65} This is particularly beneficial when data for some tasks are sparse, as information from data-rich tasks can be used to inform predictions for data-scarce tasks. Flexible non-stationary kernels offer an interesting benefit for multi-task learning: instead of employing specialized techniques, such as coregionalization, one can reformulate the problem to a single task problem and let a flexible non-stationary kernel learn intricate correlations between input $(Xi)$ and output $(Xo)$ space locations. By transforming the multi-task learning problem over $Xi$ to a single-task learning problem over $Xi\xd7Xo$, no further changes to the core algorithm are required. This has been known for a long time and is referred to as problem-transformation methods.^{66} These methods were originally dismissed as not being able to capture intricate correlations between the tasks; however, this is only true if stationary, separable kernels are used. A flexible non-stationary kernel is able to flexibly encode covariances across the input and output domains, independent of the indexing of the tasks. This makes it possible to transfer all the complexities of multi-task learning to the kernel design and use the rest of the GP framework as-is, inheriting the superior robustness and interpretability properties of single-task GPs.

## VI. SUMMARY AND CONCLUSION

In this paper, we put on the hat of a machine learning practitioner trying to find the best kernel or methodology within the scope of a Gaussian process (GP) to address non-stationarity in various datasets. We introduced three different datasets—one synthetic, one climate dataset, and one originating from an x-ray scatting experiment at the CMS beamline at NSLSII, Brookhaven National Laboratory. We introduced a non-stationarity measure and studied each dataset to be able to judge their non-stationarity properties quantitatively. We then presented four different methodologies to address the non-stationarity: ignoring it by using a stationary kernel, a parametric non-stationary kernel that uses a flexible non-constant signal variance, a deep kernel that uses a neural network to warp the input space, and a deep GP. We set all methodologies up under reasonable effort constraints to allow for a fair comparison, just like a practitioner might encounter. In our case, that meant minutes of setup time for the stationary kernels and hours to days for the non-stationary kernels and the deep GPs. After the methodologies were set up, we ran our computational tests and presented the results unredacted and uncensored. This way, we hope, the reader gets the best value out of the comparisons. This is also to ensure that the reader has the chance to come up with conclusions different from ours. To further the readers’ ability to double-check and learn, we are publishing all our codes online.

Our tests have shown that even weak non-stationarity in a dataset motivates the use of non-stationary kernels if training time is not an issue of concern. If training time is very limited, stationary kernels are still the preferred choice. We have discovered that non-stationarity in the covariance comes in two flavors, in the signal variance and the length scale and, ideally, both should be addressed through novel kernel designs; we drew attention to one such kernel design. However, non-stationary kernels come with a great risk of model misspecification. If a new model should be trusted out of the box, a stationary kernel might be the preferred choice.

We hope that this paper motivates more practitioners to deploy and experiment with non-stationary kernels while also being aware of some of the risks.

## ACKNOWLEDGMENTS

The work was supported by the Laboratory Directed Research and Development Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy Contract No. DE-AC02-05CH11231. This work was further supported by the Regional and Global Model Analysis Program of the Office of Biological and Environmental Research in the Department of Energy Office of Science under Contract No. DE-AC02-05CH11231. This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain the correct information, neither the United States Government nor any agency thereof, nor the Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof or the Regents of the University of California. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or the Regents of the University of California.

We want to acknowledge Kevin G. Yager from Brookhaven National Laboratory for providing the x-ray scattering dataset. This data collection used resources from the Center for Functional Nanomaterials (CFN) and the National Synchrotron Light Source II (NSLS-II), which are the U.S. Department of Energy Office of Science User Facilities at Brookhaven National Laboratory, funded under Contract No. DE-SC0012704.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

M.M.N. originally decided to write this paper, led the project, developed the test scripts and software with help from H.L. and M.D.R., and ran the computational experiments. H.L. suggested the non-stationarity measure and improved its practicality with help from M.M.N. and M.D.R. H.L. also did the majority of the work regarding deep GPs, both regarding the algorithms and the article. M.D.R. oversaw all developments, especially regarding parametric non-stationary kernels, and further assisted with writing and editing the article. All authors iteratively refined the core ideas, algorithms, and methods. All decisions regarding the work were made in agreement with all authors. All authors iteratively revised the article.

**Marcus M. Noack**: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). **Hengrui Luo**: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Resources (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). **Mark D. Risser**: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Software (equal); Supervision (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

All data and the Jupyter notebook that runs all experiments can be found on https://gpcam.lbl.gov/examples/a-unifying-perspective-on-non-stationary-kernels-for-deeper-gps.

All experiments (except deep GPs) were run using the open-source Python package gpCAM (https://github.com/lbl-camera/gpCAM, gpcam.lbl.gov).

### APPENDIX A: THEORETICAL MOTIVATION FOR OUR NON-STATIONARITY MEASURE

The simple technique used in this paper is to measure non-stationarity in the covariance structure of a dataset based on the idea of model misspecification. Unlike traditional non-stationary measures (e.g., histograms, periodograms), our non-stationary measure is model-specific and can generate visual checks in wider scenarios. We followed White^{67} and Schervish^{68} in the following arguments.

*q*(

**y**∣

**X**) of size

*n*has pairs of input and responses generated independently and identically from a data-generating mechanism

*q*(i.e., a possibly non-stationary GP or even completely non-Gaussian models), and we can consider a parametric stationary GP model

*p*(

**y**∣

**X**, (

*σ*

_{s},

*l*)) which may or may not contain the “true” data generating mechanism

*q*. The MLE estimate of the stationary kernel parameters (

*σ*

_{s},

*l*) maximizes $1n\u2211i=1n\u2061log\u2061p(y(i)\u2223X(i),(\sigma s,l))$. By the law of large numbers for the independent identically distributed log

*p*(

**y**

_{(i)}∣

**X**

_{(i)}, (

*σ*

_{s},

*l*)) summands, this quantity converges in joint probability to the expectation of $Eq\u2061log\u2061p(y\u2223X,(\sigma s,l))$ with respect to the data-generating mechanism

*q*. When

*q*(

**y**∣

**X**) =

*p*(

**y**∣

**x**, (

*σ*

_{s,0},

*l*

_{0})) for some kernel parameters (

*σ*

_{s,0},

*l*

_{0}), that is, the input-response data are actually generated by the GP with a parametric stationary kernel with true parameter (

*σ*

_{s,0},

*l*

_{0}). The following argument shows this equivalence between MLE and the KL-divergence minimizer,

*σ*

_{s,0},

*l*

_{0}). This explains the consistency of the MLE, and we will expect that the MLE $(\sigma s,l)*$ of kernel parameters (

*σ*

_{s},

*l*) will concentrate around the true parameter value (

*σ*

_{s,0},

*l*

_{0}). With regard to our procedure, if all the data are generated by a GP

*p*(

**y**|

**X**, (

*σ*

_{s,0},

*l*

_{0})), then each subsample needs to produce an MLE of (

*σ*

_{s},

*l*) that is close to (

*σ*

_{s,0},

*l*

_{0}). Therefore, we will consider a

*stably concentrated*(

*σ*

_{s},

*l*)-plot as strong evidence that the data are from a stationary GP. When

*q*(

**y**∣

**X**) does not necessarily belong to the GP with parametric stationary kernels, we may write similarly

*σ*

_{s},

*l*), the value of $(\sigma s,l)*$ obtained by maximizing $Eq\u2061log\u2061p(y|X,(\sigma s,l))$ under a misspecified data-generating mechanism

*q*is the minimizer of $DKLq(y|X)p(y|X,(\sigma s,l))$, often referred to as pseudo-true parameters, which is observed to be sensitive to the selection of the subsample. In other words, the minimizer of $DKLq(y|X)p(y|X,(\sigma s,l))$ exhibits higher sensitivity than $DKLp(y|X,(\sigma s,0,l0))p(y|X,(\sigma s,l))$. If we assume the joint distribution of

*p*(

**y**|

**X**, (

*σ*

_{s,0},

*l*

_{0})) has mean $\mu (\sigma s,0,l0)$ and covariance $\Sigma (\sigma s,0,l0)$, the joint distribution of

*p*(

**y**|

**X**, (

*σ*

_{s},

*l*)) has mean $\mu (\sigma s,l)$ and covariance $\Sigma (\sigma s,l)$,

*p*and

*q*are from a GP will only affect this quantity at $O(1/n)$ by convergence rates of sample means and variances, while if

*q*is not a stationary GP model, the magnitude could be large. We have to point out that this observation—that the KL-divergence becomes sensitive to subsampling when the data-generating mechanism and the model do not belong to the same family—is empirical. However, it motivates our algorithm and works well in practice. Although the MLE kernel parameters $(\sigma s,l)*$ under misspecification will converge to the pseudo-true parameters rather than the true parameters,

^{69,70}these pseudo-true parameters are sensitive to subsampling. This creates a much more scattered (

*σ*

_{s},

*l*)-plot without a specific concentration pattern. By looking into the behavior of MLE parameters of stationary kernels, we can quantify how far it is from “stationarity,” which is the rationale of our technique. As aforementioned, a stationary kernel will assume that the pairwise covariance does not depend on the location in space but only on the distance of the input points. This leads us (and others) to deduce that non-stationarity will manifest itself by varying hyperparameters of a stationary kernel—one isotropic length scale and a constant signal variance—when they are trained on localized subsets of the data. In our visualization (

*σ*

_{s},

*l*)-plot, we can observe whether the length scale

*l*or variance parameter

*σ*

_{s}is the source of non-stationarity.

### APPENDIX B: GPflux DGP SCRIPT

### APPENDIX C: 1d BAYESIAN DEEP GP SCRIPT

### APPENDIX D: THE DEEP GP ALGORITHM

*Y*are the observations,

*X*are the inputs,

*F*represents the latent functions (outputs of the GP layers),

*p*(

*Y*|

*F*) is the likelihood, and

*p*(

*F*|

*X*) is the Gaussian process prior. However, in DGPs, the optimization of this marginal likelihood becomes more intricate due to the cascading nature of layers. Each layer’s latent functions become the inputs for the subsequent layer, resulting in a composite model where the inference is not straightforward. The depth introduces additional non-linearity and complexity. From a frequentist viewpoint, the intricacies of DGP are typically managed through variational inference. This method centers on approximating the authentic posterior using a variational distribution characterized by specific variational parameters. The primary goal becomes the maximization of the evidence lower bound (ELBO) in Algorithm 3, effectively providing a lower threshold on the log marginal likelihood. This transformation changes a previously intractable inference conundrum into a solvable optimization challenge, wherein the variational parameters are systematically tuned to align the variational distribution more closely with the authentic posterior. On the other hand, from a Bayesian perspective,

^{62}it offers a profound exploration into the nuances of DGPs. This comparative analysis of an array of DGP models sheds light on their distinctive performance metrics and characteristics. A notable highlight from the paper is the unveiling of an innovative Bayesian approach to classification problems using DGPs, emphasizing their prowess in adeptly managing multifaceted structured data and in discerning intricate patterns and associations.

Require: x_{data}, y_{data}, x_{pred} |

1: Create inducing points Z spanning the range of X |

2: Define two GP layers: |

• Each layer is defined with an RBF kernel |

• Using a random collection of inputs Z as inducing points |

• The last layer has a zero mean function |

3: Define a Gaussian likelihood per layer with noise variance 1e − 4 |

4: Construct a 2-layer deep GP model with the GP layers and likelihood |

5: Compile the model with the stochastic Adam optimizer.^{71,72} |

(Alternatively, one can perform Bayesian variational inference.^{56}) |

6: Train the model for 10^{4} epochs: |

• Optimize the parameters to maximize the marginal likelihood p(Y|X) |

• Where p(Y|X) = ∫p(Y|F)p(F|X) dF |

7: Predict on x_{pred} and extract mean and variance |

Require: x_{data}, y_{data}, x_{pred} |

1: Create inducing points Z spanning the range of X |

2: Define two GP layers: |

• Each layer is defined with an RBF kernel |

• Using a random collection of inputs Z as inducing points |

• The last layer has a zero mean function |

3: Define a Gaussian likelihood per layer with noise variance 1e − 4 |

4: Construct a 2-layer deep GP model with the GP layers and likelihood |

5: Compile the model with the stochastic Adam optimizer.^{71,72} |

(Alternatively, one can perform Bayesian variational inference.^{56}) |

6: Train the model for 10^{4} epochs: |

• Optimize the parameters to maximize the marginal likelihood p(Y|X) |

• Where p(Y|X) = ∫p(Y|F)p(F|X) dF |

7: Predict on x_{pred} and extract mean and variance |

### APPENDIX E: AVAILABLE DEEP GP IMPLEMENTATIONS

Table II offers a comprehensive overview of various implementations of Deep Gaussian Processes (DGPs), a powerful tool in machine learning. It is fascinating to see the diversity of implementations across different programming languages and frameworks, such as R, TensorFlow, and PyTorch. This wide variety reflects the growing interest and applicability of DGPs in various domains, from scientific research to industrial applications.

Repository/package . | Language . | Referencea . | Fittingb . |
---|---|---|---|

deepGP | R | Sauer et al.^{62} | MCMC (elliptic sampler) |

SDSP | R/TensorFlow | Zammit-Mangion et al.^{45} | SVI |

SDSP-MCMC | R/stan | Zammit-Mangion et al.^{45} | MCMC |

SIWGP | R/TensorFlow | Zammit-Mangion et al.^{45} | MLE |

GPFlux | TensorFlow | Dutordoir et al.^{71} | MLE/SVI (variational) |

DeepCGP | TensorFlow | Blomqvist et al.^{73} | SVI (doubly stochastic) |

DGPs_with_IWVI | TensorFlow | Salimbeni et al.^{74} | SVI |

sghmc_dgp | TensorFlow | Havasi et al.^{75} | MCMC (HMC) |

Deep-Gaussian-Process | TensorFlow2 | Salimbeni and Deisenroth^{72} | SVIc |

GPyTorch/DeepGP | PyTorch | Salimbeni and Deisenroth,^{72} Gardner et al.^{76} | SVI (doubly stochastic) |

DGPRFF | PyTorch | Cutajar et al.^{77} | SVI |

Doubly stochastic-DGP | PyTorch | Salimbeni and Deisenroth^{72} | SVI (doubly stochastic) |

PyDeepGP | Python | Dai et al.,^{78} Damianou and Lawrence^{12} | MLE/SVI (variational) |

DGPsparse | TensorFlow | Damianou and Lawrence^{12} | MLE |

DGPfull | Python | Schmidt and O’Hagan^{27} | MCMC (elliptic sampler) |

Repository/package . | Language . | Referencea . | Fittingb . |
---|---|---|---|

deepGP | R | Sauer et al.^{62} | MCMC (elliptic sampler) |

SDSP | R/TensorFlow | Zammit-Mangion et al.^{45} | SVI |

SDSP-MCMC | R/stan | Zammit-Mangion et al.^{45} | MCMC |

SIWGP | R/TensorFlow | Zammit-Mangion et al.^{45} | MLE |

GPFlux | TensorFlow | Dutordoir et al.^{71} | MLE/SVI (variational) |

DeepCGP | TensorFlow | Blomqvist et al.^{73} | SVI (doubly stochastic) |

DGPs_with_IWVI | TensorFlow | Salimbeni et al.^{74} | SVI |

sghmc_dgp | TensorFlow | Havasi et al.^{75} | MCMC (HMC) |

Deep-Gaussian-Process | TensorFlow2 | Salimbeni and Deisenroth^{72} | SVIc |

GPyTorch/DeepGP | PyTorch | Salimbeni and Deisenroth,^{72} Gardner et al.^{76} | SVI (doubly stochastic) |

DGPRFF | PyTorch | Cutajar et al.^{77} | SVI |

Doubly stochastic-DGP | PyTorch | Salimbeni and Deisenroth^{72} | SVI (doubly stochastic) |

PyDeepGP | Python | Dai et al.,^{78} Damianou and Lawrence^{12} | MLE/SVI (variational) |

DGPsparse | TensorFlow | Damianou and Lawrence^{12} | MLE |

DGPfull | Python | Schmidt and O’Hagan^{27} | MCMC (elliptic sampler) |

^{a}

We sort the table by the publication date of the associated paper.

^{b}

We follow the main fitting method featured by the package; some packages may support more fitting methods than listed here.

^{c}

Stochastic variational inference.

One of the key aspects of the table is the different fitting methods used, such as MCMC (Markov Chain Monte Carlo), SVI (Stochastic Variational Inference), and MLE (Maximum Likelihood Estimation). The choice of fitting method can significantly impact the efficiency and accuracy of the model. It is noteworthy that some packages may support more fitting methods than those listed, indicating flexibility and adaptability in the approach to modeling with DGPs.

Recent developments in the field are evident from the inclusion of implementations associated with recent publications. This ongoing research and development signal a vibrant and evolving field where new methodologies and techniques are continually being explored and refined.

Accessibility is another important aspect highlighted in the table. Many of the implementations are openly accessible through platforms like GitHub. This open-source approach fosters collaboration and innovation, allowing researchers and practitioners to not only utilize these resources but also contribute to their development.

The table’s caption mentions that it extends and compares information from other sources, suggesting a comprehensive effort to collect and compare various DGP implementations. This comparative approach adds value to the table by placing the different implementations in context with one another, providing a broader perspective on the landscape of DGPs.

While the table provides a valuable summary, it also hints at potential challenges. Users may need to delve into the individual papers or repositories to understand the specific features, advantages, and limitations of each implementation. This deeper exploration may reveal nuances and details that are beyond the scope of the table but are essential for a complete understanding of each DGP implementation.

## REFERENCES

*Gaussian Processes for Machine Learning*

*Advances in Neural Information Processing Systems 16*

*Advances in Neural Information Processing Systems 25*

*Statistical Methods for Spatial Data Analysis*

*Artificial Intelligence and Statistics*

*Uncertainty in Artificial Intelligence*

*Artificial Intelligence and Statistics*

*Lectures on Fourier Integrals*

*Interpolation of Spatial Data: Some Theory for Kriging*

*Advances in Neural Information Processing Systems 20*

*Advances in Neural Information Processing Systems 18*

*Artificial Intelligence and Statistics*

*Handbook of Spatial Statistics*, Chapman & Hall/CRC Handbooks of Modern Statistical Methods

*GeoENV II: Geostatistics for Environmental Applications*

*Spatial Objective Analysis: With Applications in Atmospheric Science*

*Quantitative Methods for Current Environmental Issues*

*Advances in Neural Information Processing Systems 17*

*The Handbook of Marketing Research: Uses, Misuses, and Future Advances*

*Geographically Weighted Regression: The Analysis of Spatially Varying Relationships*

*Advances in Neural Information Processing Systems 30*

*Advances in Neural Information Processing Systems*

*Advances in Neural Information Processing Systems*