A Unifying Perspective on Non-Stationary Kernels for Deeper Gaussian Processes

The Gaussian process (GP) is a popular statistical technique for stochastic function approximation and uncertainty quantification from data. GPs have been adopted into the realm of machine learning in the last two decades because of their superior prediction abilities, especially in data-sparse scenarios, and their inherent ability to provide robust uncertainty estimates. Even so, their performance highly depends on intricate customizations of the core methodology, which often leads to dissatisfaction among practitioners when standard setups and off-the-shelf software tools are being deployed. Arguably the most important building block of a GP is the kernel function which assumes the role of a covariance operator. Stationary kernels of the Mat\'ern class are used in the vast majority of applied studies; poor prediction performance and unrealistic uncertainty quantification are often the consequences. Non-stationary kernels show improved performance but are rarely used due to their more complicated functional form and the associated effort and expertise needed to define and tune them optimally. In this perspective, we want to help ML practitioners make sense of some of the most common forms of non-stationarity for Gaussian processes. We show a variety of kernels in action using representative datasets, carefully study their properties, and compare their performances. Based on our findings, we propose a new kernel that combines some of the identified advantages of existing kernels.


INTRODUCTION
The Gaussian process (GP) is arguably the most popular member of the large family of stochastic processes and provides a powerful and flexible framework for stochastic function approximation in the form of Gaussian process regression (GPR).A GP is defined as a collection of random variables, such that any finite subset has a joint normal (Gaussian) distribution [1].In the scope of GPR, the random variables are considered (model or latent) function values {f (x 1 ), f (x 2 ), f (x 3 ), ...}.Equivalently, we can understand a GP as a description of distributions over function spaces, in which case GP(m(x), K) is entirely specified by a mean function m(x) = E[f (x)] and a covariance K = k(x j , where k(x, x i ; h) is the kernel or covariance function, h ∈ Θ is a vector of hyperparameters, and X is the input space.This role of the kernel as a covariance operator makes it arguably the most important building block when it comes to optimizing the flexibility, prediction accuracy, and expressiveness of a GP.
In a recent review [2], it was pointed out that the vast majority of applied studies using GPs employ the radial basis function (RBF) kernel (also referred to as the squared exponential or Gaussian kernel).The fraction is even higher when we include other stationary kernels.Stationary kernels are characterized by k(x i , x j ) = k(|x i − x j |), i.e., covariance matrix entries only depend on some distance between data points in the input domain, not on their respective location.Stationary kernels are popular because they carry little risk -in terms of model misspecification -and come with only a few hyperparameters that are easy to optimize or train.However, it has been shown that the stationarity assumption can lead to poor prediction performance and unrealistic uncertainty quantification that is affected mostly by data-point geometry [3].
In other words, uncertainty will increase when moving away from data points at a constant rate across the input space.To overcome these limitations, significant research attention has been drawn to non-stationary Gaussian process regression [4,5,6].Non-stationary kernels depend on the respective locations in the input space explicitly, i.e., k(x i , x j ) ̸ = k(|x i − x j |).This makes them significantly more flexible and expressive leading to higher-quality estimates of uncertainty across the domain.Even so, non-stationary kernels are rarely used in applied machine learning (ML) due to the inherent difficulty of customization, optimizing hyperparameters, and the associated risks of model misspecification (wrong hyperparameters), overfitting, and computational costs due to the need to find many hyperparameters [7,8].When applied correctly, non-stationary GPs have been shown to provide significant advantages over their stationary counterparts, especially in scenarios where the data exhibit non-stationary behavior [9].
This paper aims to bring some structure to the use of non-stationary kernels and related techniques to make it more feasible for the practitioner to use these kernels effectively.Throughout this paper -in the hope of covering the most practical set of available options -we focus on and compare four ways to handle non-stationarity in a dataset: 1. Ignore it: Most datasets and underlying processes exhibit some level of non-stationarity which is often simply ignored.This leads to the use of stationary kernels of the form k stat = k(|x i − x j |); | • | is a norm appropriate to the properties of the input space.Given the key takeaways in [2], this option is chosen by many which serves as the main motivation for the authors to write this perspective.
2. Parametric non-stationarity: Kernels of the form k(x i , x j ) = N d=1 g d (x i )g d (x j )k stat (|x i − x j |), where g d (x) can be any parametric function over the input space and N is some positive integer.
3. Deep kernels: Kernels of the form k stat (|ϕ(x i ) − ϕ(x j )|), where ϕ is a neural network (NN), and k stat is any stationary kernel.This kernel was introduced by Wilson et al. [10] and was quickly established as a powerful technique, even though potential pitfalls related to overfitting were also discovered [11].
4. Deep GPs: Achieving non-stationarity not by using a non-stationary kernel, but by stacking stationary GPs -meaning the output of one GP becomes the input to the next, similar to how neural networks are made up of stacked layers -which introduces a non-linear transformation of the input space and thereby non-stationarity [12].
Non-stationarity can also be modeled through the prior-mean function of a Gaussian process but, for the purpose of this paper, we use a constant prior mean and focus on the four methods described above.From the brief explanations of the three types of non-stationarity, it is immediately obvious that deriving, implementing, and deploying a non-stationary kernel and deep GPs (DGPs) can be an intimidating endeavor.The function g d (in #2 above), for instance, can be chosen to be any sum of local basis functions -linear, quadratic, cubic, piecewise constant, etc. -any polynomial or sums thereof, and splines, just to name a few.Given the vast variety of neural networks (in #3), choosing ϕ is no easier.DGPs involve stacking GPs which increases flexibility but also leads to a very large number of hyperparameters, or latent variables, making optimization and inference a big challenge.In addition, a more complex kernel generally results in reduced interpretability -a highly valued property of GPs.
This paper is concerned with shining a light on the characteristics and performance of different kernels and experiencing them in action.In a very pragmatic approach, we study the performance of the three methodologies above to tackle non-stationarity and compare them to a stationary Matérn kernel to create a baseline.Our contributions can be summarized as follows: We identify and exploit two primary pathways to achieve non-stationary properties: through the application of deep architectures -deep kernels or deep GPs -and the utilization of parametric non-stationary kernels.Our work delves into the role of parametric non-stationary kernels arguing that these kernels offer a similar behavior to a deep-kernel architecture without any direct notion of deepness, hinting at the possibility that deepness and flexibility are practically synonymous.This provides a valuable computational perspective to the broader dialogue surrounding deep architectures and non-stationarity, offering insights that may be instrumental in future research.
The remainder of this paper is organized as follows.The next section introduces some minimal but necessary theory.There, we also introduce some measures that will help us later when it comes to investigating the performance of different kernels.Then we give an overview of the relevant literature and some background.Subsequently, we introduce a range of different kernels and some test datasets, which positions the reader to see each kernel in action in different situations.We then discuss the key takeaways; we give some pointers toward improved kernel designs, make remarks about the connection between non-stationary kernels and multi-task learning, and conclude with a summary of the findings.

PRELIMINARIES
The purpose of this section is to give the reader the necessary tools to follow along with our computational experiments.It is not intended to be a complete or comprehensive overview of the GP theory or related methodologies.For the remainder of this paper, we consider some input space X ⊂ R n with elements x.
While we often think of the input space as a subspace of the Euclidean space, this is not a necessary assumption of the framework in general.We assume data D as a set of tuples {x i , In this work, except for a remark on multivariate GPs at the end, we will assume scalar y i .

GPs in a Nutshell
The GP is a versatile statistical learning tool that can be applied to any black-box function.The ability of GPs to estimate both the mean and the variance of the latent function makes it an ideal ML tool for quantifying uncertainties in the function approximation.GPs assume a model of the form y(x) = f (x) + ϵ(x), where f (x) is the unknown latent function, y(x) is the noisy function evaluation (the measurement), ϵ(x) is the noise term, and x is an element of the input space X .For a fixed, finite set of inputs D = {x 1 , . . ., x |D| }, a Gaussian process implies that the vector f = (f (x 1 ), . . ., f (x |D| )) has a multivariate Gaussian distribution (here taken to be the prior distribution over function values), where K is the covariance matrix defined by the kernel function K ij = k(x i , x j ), and m is the prior mean obtained by evaluating the prior-mean function at points x i .Conditioning on f , we can define a likelihood for the corresponding vector of data, y = (y(x 1 ), . . ., y(x |D| )), denoted p(y|f ), which allows us to perform Bayesian inference, typically (but not necessarily) using a multivariate Gaussian distribution where the covariance matrix V is used to describe measurement errors arising from ϵ(x).Training the hyperparameters, extending the prior over latent function values at points of interest, marginalizing over f , and conditioning on the data y yield the posterior probability density function (PDF) for the requested points [1].For example, for a Gaussian process with Gaussian likelihood, marginalization over f can be obtained in closed form where we suppress the implicit conditioning on the hyperparameters.Training the GP can be done by maximizing the log marginal likelihood -i.e., the log of Equation ( 3) taken as a function of the hyperparameters h when the data y is given - Figure 1: The key concept and essence of non-stationary kernels.A synthetic function -that is also later used for our computational experiments -was sampled at 40 equidistant points.The function is comprised of high-frequency regions (far left and right), and near-constant-gradient regions (center, green circle).A Gaussian process (GP) is tasked with interpolating the data using a stationary (top, a) and a non-stationary (bottom, b) kernel.For each case, the function approximation and the prior covariance matrix are presented.While the posterior mean is similar in both cases, the posterior variance differs substantially.Focusing on the central region, the uncertainty increases between data points, even though the function is very wellbehaved there.The covariance matrix can deliver clues as to why this might happen.The matrix is constant along diagonals, which translates into uncertainties that depend on the distance from surrounding data points only, independent of where in the domain they are located.The non-stationary kernel has no such restriction and provides more realistic estimates of the uncertainty.The covariance entries are not constant along diagonals but correspond to different regions of the function (blue line connections).
with respect to the hyperparameters h.After the hyperparameters are found, the posterior is defined as where κ κ κ = k(x 0 , x j ), K K K = k(x 0 , x 0 ), and m 0 is the prior mean function evaluated at the prediction points x 0 .If, in a Bayesian setting, we want to further incorporate uncertainties in the hyperparameters h ∈ Θ, Equation (5) becomes where p(f 0 |y, h) is available in closed form (5).This basic framework can be extended by more flexible mean, noise, and kernel functions [13,2,14,14,15].Much of the GP framework depends on the ability to extend the original prior p(f ) to the joint distribution p(f , f 0 ), which is made possible through the definition of the kernel as a covariance operator that can be evaluated on pairs of arbitrary inputs (x i , x j ).

The Covariance Operator of a GP: The Kernel or Covariance Function
In this work, the focus is on the kernel function -also called covariance function -of a GP, denoted k : X × X → R. In the GP framework, the kernel function assumes the role of a covariance operator, i.e., the elements of the covariance matrix K in Equation ( 1) are defined by K ij = k(x i , x j ).Because of that, kernel functions have to be symmetric and positive semi-definite (psd); a complete characterization of the class of valid kernel functions is given by Bochner's theorem [16,17].
Kernels are called "stationary" if the function only depends on the distance between the two inputs, not their respective locations, i.e., k(x i , The arguably most prominent members are kernels of the Matérn class (see for instance [18]), which includes the exponential and the squared exponential (RBF) kernel the by-far most used kernel by practitioners [2]; here, || • || 2 is the Euclidean norm, which makes the kernel function both stationary and isotropic.These two special cases of the Matérn kernel class have two hyperparameters: σ s , the signal variance, and l, the length scale, both of which are constant scalars applied to the entire domain X .Stationary and isotropic kernels can be advanced by allowing anisotropy in the norm, i.e., (( where M is some symmetric positive definite matrix whose entries are typically included in the vector of hyperparameters that need to be found.Incorporating anisotropy allows the kernel function to stretch distances such that the implied covariances have ellipsoidal patterns (as opposed to spherical patterns for isotropic kernels).When M = 1 l I n the isotropic kernels in Equations ( 7) and ( 8) are recovered; more generally, when M is diagonal with different elements along the diagonal, automatic relevance detection is recovered [19].Other stationary kernel designs include the spectral kernel, the periodic kernel, and the rational quadratic kernel.
Non-stationary kernels do not have the restriction of only depending on the distance between the input points, but depend on their explicit positions x i and x j , i.e., k(x i , x j ) ̸ = k(|x i − x j |).Formulating new non-stationary kernels comes with the difficulty of proving positive semi-definiteness which is often challenging.However, the statistics and machine learning literature has a variety of general approaches for applying valid non-stationary kernels; this is the primary topic of this paper which is discussed in Section 3.For now, it is important to keep in mind that the essence of stationary versus non-stationary kernels -due to the way they deal with the locations in the input space -manifests itself in the covariance matrix which, in the stationary case, has to be constant along all bands parallel to the diagonal for sorted and equidistant input points, a property that is not observed in the non-stationary case (see Figure 1).Therefore, nonstationary kernels lead to more descriptive, expressive, and flexible encodings of covariances and therefore uncertainties.Of course, in addition, the function space contains a much broader class of functions as well when non-stationary kernels are used.

A Note on Scalability
Computational complexity is a significant challenge for GPs, for both stationary and non-stationary kernels.The need to calculate the log-determinant and invert the covariance matrix -or solve a linear system instead -results in computational complexity of O(|D| 3 ), where |D| is the number of data points [1,20].This complexity limits the application of GPs for large-scale datasets.Several methods have been proposed to overcome this issue.Sparse methods [21,20] and scalable GPR approaches [22] have been developed for stationary GPs.For non-stationary GPs, methods such as local GPs [23] and the Bayesian treed GP [24] have been proposed to tackle this issue.These methods have an origin in divide-and-conquer methods attempting to break down the problem into smaller, manageable pieces that can be solved independently [25], thereby reducing the computational complexity for each piece.However, these approaches can lead to a loss of global information, so finding the right balance between computational efficiency and model accuracy remains a key challenge.A recent approach can scale GPs to millions of data points without using approximations by allowing a compactly supported, non-stationary kernel to discover naturally occurring sparsity [8].

NON-STATIONARY KERNELS
Stationary kernels are widely used primarily because they are easy and convenient to implement, even though the implied assumption of translation-invariant covariances is almost never exactly true for realworld data sets.As mentioned in Section 2, there are serious challenges associated with both deriving nonstationary kernels and choosing an appropriate and practical non-stationary kernel from the valid options for any given implementation of a Gaussian process.We now provide a brief overview of the literature on nonstationary kernels, including both a historical perspective of early developments followed by greater detail on three modern frameworks for non-stationary kernels as well as metrics for quantifying non-stationarity in data sets.

Historical Perspective
It has now been over three decades since the first paper on non-stationary kernels via "deformations" or warping of the input space appeared [4].Since then, the statistics literature has developed a number of approaches for non-stationary kernels, mostly in the context of modeling spatially-referenced data.These methods can broadly be categorized as basis function expansions and kernel convolutions, in addition to the aforementioned deformation approach.We now briefly summarize each method, focusing on aspects that apply directly to kernel functions for Gaussian processes.
The fundamental idea underpinning the deformation or warping approach [4] is that instead of deriving new classes of non-stationary kernels one can keep isotropic kernels but obtain non-stationarity implicitly by rescaling interpoint distances in a systematic way over the input space.In other words, this approach transforms X to a new domain, say X * , wherein stationarity holds.The transformation, say ϕ : R n → R n * , is a (possibly nonlinear) mapping applied to elements of X to yield a non-stationary kernel via where k stat is an arbitrary stationary kernel function.Two extensions were later proposed to this approach [26,27] that supposed the mapping ϕ(•) was itself a stochastic process.For example, [27] placed a Gaussian process prior on ϕ(•) -essentially coming up with the idea of deep kernels more than a decade before related ideas appeared in the machine learning literature.In some cases n * > n, i.e., the mapping involves dimension expansion [28].Ultimately, early approaches to warping the input space were largely unused due to a lack of computational tools for optimizing the mapping function ϕ(•) in a reliable and robust manner.
In contrast to deformations, basis function expansion methods provide constructive approaches for developing non-stationary kernel functions.The main idea for this approach arises from the Karhunen-Loève Expansion [29,30] of a (mean-zero) stochastic process in terms of orthogonal eigenfunctions E m (•) and weights w m : This framework defines a Gaussian process if the weights have a Gaussian distribution; the implied kernel function is where the eigenfunctions and weight variances v m come from the Fredholm integral equation If the infinite series in Equation ( 10) is truncated to the leading M terms, the finite sum approximation to the kernel can be used instead and is optimal in the sense that it minimizes the variance of the truncation error for all sets of M basis functions when the E m (•) are the exact solutions to the Fredholm equation [31].The main task is then to model the weight-eigenfunction pairs {w m , E m (•)}, which can be done empirically using singular value decomposition [32] or parametrically using, e.g., wavelets [33].
Like basis function expansions, the kernel convolution approach is useful in that it provides a constructive approach to specifying both stochastic models and their covariance functions.The main result is that a stochastic process can be defined by the kernel convolution [ 34,35], where W (•) is a n-dimensional stochastic process and κ x (•) is a kernel function that depends on input location x. [36] summarizes the extremely flexible class of stochastic models defined using Equation ( 12): see, for example, [37], [38], and [39].The popularity of this approach is due largely to the fact that it is much easier to specify (possibly parametric) kernel functions than a covariance function directly since the kernel functions only require 12) is a Gaussian process when W (•) is chosen to be Gaussian, and the associated covariance function is which cannot be written in terms of ||x i − x j || and is hence non-stationary.Various choices can be made for using this general framework in practice: replace the integral in Equation ( 12) with a discrete sum approximation [40] or choose specific κ x (•) such that the integral in Equation ( 13) can be evaluated in closed form [41].The latter choice can be generalized to yield a closed-form kernel function that allows all aspects of the resulting covariances to be input-dependent: the length-scale [5,6], the signal variance [42], and even the differentiability of realizations from the resulting stochastic process when the Matérn kernel is leveraged [43].This approach is often referred to as "parametric" non-stationarity since a non-stationary kernel function is obtained by allowing its hyperparameters to depend on input location.In practice, some care needs to be taken to ensure that the kernel function is not too flexible and can be accurately optimized [6,44].We return to a version of this approach in the next section.
In conclusion of this section, the statistics literature contains a broad set of techniques (only some of which are summarized here) for developing non-stationary kernel functions.However, historically speaking, these techniques were not widely adopted because, in most of the cases described here, the number of hyperparameters is on the same order as the number of data points.This property makes it very difficult to apply the kernels to real-world data sets due to the complex algorithms required to fit or optimize such models.
Nonetheless, the potential benefits of applying non-stationary kernels far outweigh the risks in our opinion, and this perspective is all about managing this trade-off.To do so, we now introduce three modern approaches to handle non-stationarity in datasets in order to later test them and compare their performance (Section 4).

Non-Stationarity via a Parametric Signal Variance
This class of non-stationary kernels uses a parametric function as the signal variance.The term g(x 1 )g(x 2 ) is always symmetric and psd [14] and is, therefore, a valid kernel function.Also, any product of kernels is a valid kernel, which gives rise to kernels of the form Algorithm 1 Deep Kernel with Neural Network Input Set of points as arrays x 1 , x 2 , Hyperparameters h ∈ Θ Output Deep kernel covariance matrix of shape (len(x 1 ) × len(x 2 )) Step 1: Define the neural network architecture with input dimension n, hidden layers, and appropriate activation functions Step 2: Initialize or set the weights and biases of the neural network using h Step 3: Transform x 1 and x 2 through the neural network • Apply the forward pass of the neural network to all points in sets x 1 and x 2 • Store the transformed values as x 1nn and x 2nn Step 4: Calculate the pairwise distance matrix between x 1nn and x 2nn Step 5: Compute the deep kernel value using the distance matrix • Apply a kernel function (e.g., exponential) to the distance matrix • Scale and combine with other kernel functions if needed Step 6: Return the deep kernel value This can be seen as a special case of the non-stationary kernel derived in [6] and [42] wherein the lengthscale is taken to be a constant.[42], in particular, consider parametric signal variance and (anisotropic) length scale.In an extension of ( 14), any sum of kernels is a valid kernel which allows us to write The function g can be any function defined on the input domain, but we will restrict ourselves to functions of the form where c k are some coefficients (or parameters), and β(x k , x) are basis functions centered at x k .For our computational experiments, we use radial basis functions of the form where w is the width parameter.

Deep Kernels
Our version of parametric non-stationarity operates on the signal variance only.This is by design so that we can separate the effects of the different kernels later in our tests.This next approach uses a constant signal variance but warps the input space to yield flexible non-constant length scales.The set of valid kernels is closed under non-linear transformation of the input space as long as this transformation is constant across the domain and the resulting space is considered a linear space.This motivates the definition of kernels of the form where ϕ : R n → R n * can again be any scalar or vector function on the input space.Deep neural networks have been established as a preferred choice for ϕ due to their flexible approximation properties which gives rise to, so-called, deep kernels (see Algorithm 1 for an implementation example).For our tests, we define a 2-layer-deep network with varying layer widths.While it is possible through deep kernels to perform dimensionality reduction, in this work we map the original input space into a linear space of the same dimensionality, i.e., n = n * .Care must be taken not to use neural networks whose weights and biases, given the dataset set size, are underdetermined.That is why comparatively small networks are commonly preferred.We use ReLu as an activation function.The neural network weights and biases are treated as hyperparameters and are trained accordingly.We set k stat in (18) to be the Matérn kernel with ν = 3/2.
Our deep kernel construction shares the perspective of using a neural network to estimate a warping function [4].In [45], the authors propose an approach of deep compositional spatial models that differs from traditional warping methods in that it models an injective warping function through a composition of multiple elemental injective functions in a deep-learning framework.This allows for greater flexibility in capturing non-stationary and anisotropic spatial data and is able to provide better predictions and uncertainty quantification than other deep stochastic models of similar complexity.This uncertainty quantification is point-wise, similar to the deep GPs we introduce next.

Deep GPs
Deep Gaussian process (DGP) models are hierarchical extensions of Gaussian processes where GP layers are stacked -similar to a neural network -enhancing modeling flexibility and accuracy [12,46,47] (more details can be found in the Appendix D).The DGP model is one of the deep hierarchical models [48,49,50] and consists of a number of variational Gaussian process layers defined by a mean function and a stationary covariance (kernel) function.Let us first consider a simple 2-layer DGP taking n-dimensional inputs x.The first layer uses a constant zero-mean GP (1) for a lower-dimensional representation of the input data having its coordinate components distributed as f n1 ) where n 1 is the number of nodes in the hidden layer where each component f (1) i is mutually independent.The second layer uses a constant zero mean and takes the first layer's output.Its output f (2) is distributed as )) and is fed into an independent GP GP (2) as the second layer, generating the model output.We can impose more than two hidden layers for a single DGP according to our needs, where with layouts as random matrices.This is analogous to the connecting matrices in the architecture of a neural network.However, we match the 2-layer structure used in our deep kernel and point out that the complexity of the neural architecture may not always lead to better performance.For L > 2 layers in a DGP, we iterate this procedure of using the previous layer's output as the next layer's GP input until reaching the last layer.For a L-layer DGP architecture, this creates a sequence of L independent GPs GP (1) , GP (2) , • • • , GP (L) corresponding to each layer.Each layer's forward method applies the mean and covariance functions to input data and returns a multivariate normal distribution as output, which serves as the next layer's input.
Although each layer of a DGP is equipped with stationary kernels, the output of one GP layer becomes the input to the next GP layer, hence the final output will not be stationary.For a DGP with L layers, we can represent the model as follows using unique mean functions m (1) , • • • , m (L) and covariance kernels k (1) Then, an optimizer and the variational Evidence Lower Bound (ELBO) are used for training the DGP model.Using a variational approximation in ELBO, instead of exact inference, leads to manageable computational complexity of deeper GP models [51].The deep Gaussian Processes (GPs) can be perceived as hierarchical models whose kernel does not admit a closed form.Crucially, this "hierarchy of means" is constructed via the means of the layer-distributions in the deep GP, but not higher moments like [52,53,54].Only the mean functions at each layer of the deep GP are contingent upon computations from preceding layers, signifying hierarchies that rely on the first-order structure at every layer of the model.
Using a slightly different framework based on the Vecchia approximation, [55] introduced a "deep Vecchia ensemble," a hybrid method combining Gaussian processes (GPs) and deep neural networks (DNNs) for regression.This model builds an ensemble of GPs on DNN hidden layers, utilizing Vecchia approximations for GP scalability.Mathematically, the joint distribution of variables is decomposed using Vecchia's approximation, and predictions are combined using a generalized product of experts.This approach offers both representation learning and uncertainty quantification.As described in the last section, a GP model can utilize a deep kernel, constructively combining the neural network's and the GP's strength, leading to a model that benefits from the GP's interpretability and the NN's flexibility [10].
Our Bayesian DGP (BDGP) architecture follows [56] and includes a two-layer neural network, applied as a transformation to the input data.The first layer uses a rectified linear unit (ReLU) activation function and the second employs a sigmoid activation.This non-linear feature mapping expresses complex input space patterns (See Appendix C).
Contrasting DGPs with deep-kernel GPs, DGPs use multiple GP layers to capture intricate dependencies, whereas deep-kernel GPs employ a NN for input data transformation before GP application.Essentially, while DGPs exploit GP layering to manage complex dependencies, deep kernel learning leverages NNs for non-linear input data transformation, enhancing the GP's high-dimensional function representation ability.

Measuring Non-Stationarity of Datasets
When it comes to characterizing non-stationarity, some methods focus on non-stationarity in the mean function (e.g., polynomial regression), while others concentrate on the non-stationarity in the variance (e.g., geographically weighted regression [57]).Non-stationarity is typically characterized by a change in statistical properties over the input space, e.g., changes in the dataset's mean, variance, or other higher moments.
Quantifying non-stationarity is an active area of research and in this paper, we introduce a particular kind of non-stationarity measure for the purpose of judging our test kernels when applied to our test datasets Overall, measuring a given dataset's non-stationarity properties is an important ingredient in understanding the performance of a particular kernel.For the reader's convenience, we offer our non-stationarity measure as a pseudocode (see Algorithm 2).For detailed theoretical motivation of the non-stationarity measure we use in this work, please refer to Appendix A.
To avoid bias through user-based subset selection we draw the location of a subdomain from a uniform distribution over the input domain (in this case [0, 1] n ).We then select a set of data points randomly from within this subdomain 100 times and use MLE (maximum likelihood estimation, maximizing Equation ( 4)) to get a stationary length scale and signal variance in each iteration.In the case of a synthetic function, the locations of the data points are drawn from a uniform distribution over the subdomain; in the case of a fixed dataset, the points are randomly selected.The final distribution of all signal variances and length scales is then assessed to measure non-stationarity (See Algorithm 2).
1: procedure MEASURENONSTATIONARITY(X, y, m, size, c) ▷ X is the set of data points, y is the collected data, m is the number of iterations, size is the size of the subdomain, c is the number of selected points in each iteration return signal variance list, length scale list 15: end procedure To test our non-stationary measure, we applied it to three synthetic functions (see Figure 2).In each case, we indicate the final variance of the length scale and signal variance distributions.In the first scenario (first row in Figure 2), where the signal is a high-frequency sine function, the algorithm's behavior leads to a high concentration of points in a single compact cluster when plotting the estimated length scale versus the signal variance.This concentration reflects the inherent stationarity of a signal, where the statistical properties do not change over the input space.The unimodal and concentrated distributions of both estimated parameters in the violin plots further corroborate this observation.The consistency in the length scale and signal variance across multiple iterations of MLE indicates that the underlying data structure is stationary.When a set of data points is drawn randomly from within the input domain and a GP, using a stationary kernel, is trained via log marginal likelihood maximization after each draw (MLE), the final distribution of the hyperparameters -here a signal variance and an isotropic length scale -can be used to measure non-stationarity.This distribution can be visualized via scatter (middle) and violin (right) plots.We assign the variance of these distributions as a numerical measure of stationarity.The stationary function (top) leads to a very narrow distribution of the length scale and the signal variance.For the non-stationary functions (second and third row), the distributions for both hyperparameters are broader and the associated variances are larger.In this figure, the axis scales are kept constant to allow easy comparison of the stationarity properties of the test functions.
In the second case (second row in Figure 2), the signal is a trigonometric curve with local oscillations.The algorithm's response to this function results in a wider spread of both hyperparameters compared to the first test function.This behavior is reflected in the violin plots on the right.
The third scenario (third row in Figure 2) presents a more complex signal structure, a trigonometric curve with varying amplitude and frequency.The algorithm's reaction to this non-stationary signal is manifested in a broad distribution of length scales and some spread in the signal variance.These distributions are reflected in the violin plots.
The three cases in Figure 2 demonstrate the algorithm's capability to measure non-stationarity through the local, stationary GP-hyperparameter distributions.The varying behaviors in clustering and distribution of the estimated parameters across the three cases provide insights into the underlying stationarity or nonstationarity of the signals.The algorithm's sensitivity to these variations underscores its potential as a valuable tool for understanding and characterizing non-stationarity in different contexts.

Performance Measures
Throughout our computational experiments, we will measure the performance via three different error metrics as a function of training time.We argue that this allows us to compare methodologies across different implementations as long as all tests are run on the same computing architecture with similar hardware utilization.As for error metrics, we utilize the log marginal likelihood (LML, Equation 4), the root mean square error (RMSE), and the Continuous Ranked Probability Score (CRPS).The RMSE and CRPS are evaluated on test data points that are not part of the training data set.The RMSE is defined as where y i are the data values of the test dataset and f i 0 are the posterior mean predictions.The RMSE metric provides a measure of how closely the model's predictions align with the actual values -approaching zero as fit quality improves -while the log marginal likelihood evaluates the fit of the Gaussian Process model given the observed data.The log marginal likelihood will increase as the model fits the data more accurately.The CRPS is defined as where ψ is the probability density function of a standard normal distribution and Ψ is the associated cumulative distribution function.For a GP, f 0 is Gaussian with mean µ and variance σ 2 .The CRPS is negative and approaches zero as fit quality improves.In our computational experiments, we use the CRPS as a performance metric on a set of points by averaging ( 23) over all predictions.The CRPS is arguably the more important score compared to the RMSE because it is uncertainty aware.In other words, if the prediction accuracy is low, but uncertainty in those regions is high -the algorithm is aware of its inaccuracy -the score improves.
Computational cost is becoming a main research topic in recent studies in large-scale non-parametric models [25,58,22], especially GPs [59,60,61,20].In our analysis, we sought to examine the progression of the optimization process.To achieve this, we established a callback function during the optimization phase, tracking the RMSE, the log marginal likelihood, and the CRPS as a function of compute time.In terms of interpretation, ideally, we expect the RMSE and the CRPS to increasingly approach zero over time, suggesting that the model's predictive accuracy and estimation of uncertainties are improving.On the other hand, the log marginal likelihood should increase, indicating a better fit of the model to the observed data.This analysis gives us a summary of the model's learning process and helps us understand the progression of the optimization, thus providing valuable insights into the efficacy of our model and the optimization strategy employed.

COMPUTATIONAL EXPERIMENTS
The purpose of this section is to see how different kernels and a deep GP deal with non-stationarity in several datasets and to compare the characteristics and properties of the solutions.To make this comparison fair and easier, we ran all tests on the same Intel i9 CPU (Intel Core i9-9900KF CPU @ 3.60GHz × 8) and used the total compute time as the cost.As a performance metric, we calculate and observe the RMSE (Root Mean Squared Error, Equation ( 22)), the CRPS (continuous rank probability score, Equation ( 23)) of the prediction, both applied to a test dataset, and the log marginal likelihood of the observational data (Equation ( 4)).We attempted to run fair tests in good faith; this means, the effort spent to set up each kernel or methodology was roughly proportional to a method's perceived complexity, within reasonable bounds, similar to the effort expended by an ML practitioner -this meant minutes of effort for stationary kernels and hours to days for non-stationary kernels and DGPs.The optimizer to reach the final model was scipy's differential evolution.We used an in-house MCMC (Markov Chain Monte Carlo) algorithm to create the plots showing the evolution of the performance metrics over time.In cases when our efforts did not lead to satisfactory performance, we chose to present the result "as is" to give the reader the ability to judge for themselves.We always start our tests with a standard stationary kernel to create a baseline -stationary kernels with few hyperparameters are currently the most widely used kernel class.To further the hands-on aspect of this section, we also included the used algorithms in the Appendix and on a specifically designed website together with links to download the data.The performance-measure-over-time plots were created without considering deep GPs due to incompatible differences in the implementations (see Appendix D).All computational experiments were run multiple times to make sure we showed representative results.
This section, first, introduces three datasets we use later to evaluate the performance of the test methodologies.Second, we present the unredacted, uncensored results of the test runs.The purpose is not to judge some kernels or methodologies as better or worse universally, but to evaluate how these techniques perform when tested under certain well-defined conditions and under the described constraints.We encourage the reader to follow our tests, to rerun them if desired, and to judge the performance of the methods for themselves.

Introducing the Test Datasets
We will consider three test data sets.All datasets are normalized such that the range and the image -the set of all measured function values -are in [0, 1].
For the first dataset, we define a one-dimensional synthetic function data is drawn from.50 data points are drawn randomly and the noise ϵ ∼ N (0, 0.001) is added.Figure 3 presents the function and its non-stationary measures.
Second, we consider a three-dimensional climate dataset that is available online (https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/),consisting of in situ measurements of daily maximum surface air temperature ( • C) collected from weather stations across the contiguous United States (geospatial locations defined by longitude and latitude) over time (the third dimension).The data and its non-stationarity measures are presented in Figure 4.
Third, we consider a dataset that was collected during an X-ray scattering experiment at the CMS beamline at NSLSII, Brookhaven National Laboratory.The dataset originated from an autonomous exploration of multidimensional material state-spaces underlying the self-assembly of copolymer mixtures.Because the scientific outcome of this experiment has not been published yet, all scientific insights have been obscured by normalization and the removal of units.The dataset is presented in Figure 5.
To evaluate the predictive performance metrics, we divided our datasets into training datasets and test datasets by random selection and applied the metrics only to the latter.

Results
In this section, we present quantitative results of how the different kernels and deep GPs performed when tasked to learn the underlying data-generating latent functions that produced the test datasets introduced in the previous section.For each test, we show the model and its uncertainties across the domain or a subdomain after convergence of a global evolutionary optimization of the log marginal likelihood, and the performance measures as a function of the compute time of an MCMC algorithm.Code snippets can be found in the Appendix and on our website (see the Code Availability paragraph at the end).To reiterate, for all tests, we put ourselves in the position of an ML practitioner setting up each algorithm under a reasonableeffort constraint -we did not optimize each method to its full extent because this would not lead to a fair comparison.However, we used our best judgment and followed online documentation closely.The same thought process was put into the exact design of the kernels; of course, one could always argue that a particular model would have performed better using more hyperparameters.For the fairness of the comparison, we kept the number of hyperparameters similar across the kernels for a particular computational experiment and increased the number of kernel parameters in a near-proportional fashion for the non-stationary kernels as we moved to higher dimensions.See Table 1 24)).Non-stationarity appears to be clearly present in the length scale and the signal variance.[56,62] (https://cran.r-project.org/web/packages/deepgp/index.html).We selected the latter in its two-layer version for our final comparisons because of performance issues with the gpflux package (see Figure 7).
where l is the length scale, and The parametric non-stationary kernel in this experiment was defined as where x = {0, 0.2, 0.4, 0.6, 0.8, 1}, leading to a total of 15 hyperparameters -counting two g a functions in the sum, a constant width of the radial basis functions for each g a , and a constant length scale.The deep kernel is where ϕ is a fully connected neural network mapping R → R, with ReLu activation functions and two hidden layers of width five, which led to a total of 48 hyperparameters (weights, biases, and one constant signal variance).The results, presented in Figure 6, show a gradual improvement in approximation performance as more flexible kernels are used.The stationary kernel (k stat ) stands out through its fast computation time.However, the keen observer notices similar uncertainties independent of local properties of the latent function; only point spacing is considered in the uncertainty estimate.That is in stark contrast to the parametric non-stationary kernel (k para ) and the deep kernel (k nn ) which both predict lower uncertainties in the wellbehaved center region of the domain.This is a very desirable characteristic of non-stationary kernels.The deep kernel and parametric non-stationary kernel reached very similar approximation performance but the deep kernel was by far the most costly to train.The BDGP predicts a very smooth model with subpar accuracy compared to the other methods.We repeated the experiment with different values for the nugget, and let the algorithm choose the nugget, without further success.This is not to say the method cannot perform better, but we remind the reader that we are working under the assumption of reasonable effort, which, in this case, was insufficient to reach a better performance.We share the code with the reader in the Appendix (see C) for reproducibility purposes.We also tested another DGP implementation without success (see Figure 7).The MCMC sampling runs revealed what was expected, the stationary kernel converges most robustly; however, all kernels led to a stable convergence within a reasonable compute time.We note that the deep-GP result highly depended on the specified nugget -too small, and the algorithms produced NaNs, just a little larger, and the presented smoothing was observed.The result was the same when we allowed the algorithm to choose its own nugget.Note the asterisk next to the deep-GP compute time, denoting that this is a separate software written in a different language (R), and compute times can therefore not be directly compared.The calculations of the performance measures as a function of MCMC training time were run five times and displayed are the mean and confidence bounds (one standard deviation).
Figure 7: Approximation result for the one-dimensional synthetic dataset using the gpflux DGP.Despite following the online documentation closely, the model appears to show artifacts.This is the reason why we did not use the gpflux DGP for our higher-dimensional test cases.For reproducibility purposes, we publish our exact run script in the Appendix (see B).

The Climate Model
Our three-dimensional climate dataset was introduced in Section 4.1.The stationary reference kernel (k stat ) is a Matérn kernel with ν = 3/2 (Equation 25).In all cases, stationary and non-stationary, we added to the kernel matrix the noise covariance matrix V = σ 2 n I, where σ 2 n is the nugget variance, treated as an additional hyperparameter.The parametric non-stationary kernel is similar to Equation ( 26); however, we place radial basis functions ( 27) at {0, 0.5, 1} 3 and added a nugget variance, leading to a total of 58 hyperparameters.The deep kernel k nn (ϕ(x i ), ϕ(x j )), ϕ : R 3 → R 3 has two hidden layers of width 10, but is otherwise identical to (28), yielding 186 hyperparameters.The results, two-dimensional slices through the three-dimensional input space, are presented in Figure 8.For completeness, we included the deep GP result in Figure 10, which, however, in our run was not competitive.Once again, the stationary kernel (k stat ) delivers fast and robust results; however, lacks accuracy compared to the parametric nonstationary kernel (k para ) and the deep kernel (k nn ).The performance of the two non-stationary kernels is on par with a slight advantage in CRPS for the deep kernel and a significant advantage in compute time for the parametric non-stationary kernel.The MCMC (Figure 9) sample runs revealed stable convergence, however, at significantly different time scales.Notable is the fast computing time of the stationary kernel due to the fact that only three hyperparameters have to be found: a signal variance, one isotropic length scale, and the nugget (i.i.d.noise).This parametric non-stationary kernel led to a total number of 58 hyperparameters, which increases the computing time significantly compared to the stationary kernel.The accuracy, in this case, is slightly higher.It is debatable and highly application-driven whether the parametric non-stationary kernel should be used for this dataset.If time is not an issue, the increased accuracy can pay off in downstream operations.The deep kernel contains 186 hyperparameters which are identified robustly.The accuracy of the prediction and the UQ is similar to the parametric non-stationary kernel, with a slight advantage in the CRPS but approximately twice the compute time.As with all our test runs, this run was repeated several times with the same result.

The X-Ray Scattering Model
Our three-dimensional X-ray scattering dataset was introduced in Section 4.1.The kernel setup is identical to our climate example (see Section 4.2.2).The results are presented in Figure 11.This experiment shows improving performance as kernel complexity increases, however, the improvements are moderate.As before, if accuracy is the priority, non-stationary kernels should be considered.Again, the deep kernel (k nn ) performed very competitively -with a significant advantage in terms of RMSE and CRPS -but did not reach the same log marginal likelihood as the competitors, which can be traced back to solving a high-dimensional optimization problem due to a large number of hyperparameters.This opens the door to an even better performance if more effort and time is spent in training the model.What stands out in Figure 11 is the smaller and more detailed predicted uncertainties for the deep kernel which would affect decision-making in an online data acquisition context.Also, the posterior mean has more intricate details compared to the stationary and the parametric non-stationary kernel.Figure 12 reveals stable convergence of the MCMC sampling for all kernels at similar time scales supporting the deep kernel as the superior choice for this model.We again included the BDGP, which however was not competitive (Figure 13).Once again, the stationary kernel stands out due to fast computing speeds.The compute time increases significantly for the parametric non-stationary kernel, due to the 58 hyperparameters that have to be found.Since the CRPS is our most relevant performance metric, the approximation and the UQ are better than for the stationary kernel.The deep kernel, for this dataset, performs best with respect to the RMSE and CRPS while achieving lower log marginal likelihood.The training time increases again because 186 hyperparameters have to be found.×{0.24}.This result is, again, included for completeness; it is not competitive compared to the tested kernels.Our script is presented in Appendix C.

Interpretations Test-by-Test
The one-dimensional synthetic test function (24) exhibits a complex behavior that is captured by our stationarity analysis through a broad distribution in the length scale and signal variance.The scattering patterns reveal a multifaceted behavior, with different regions of the data exhibiting distinct statistical properties.The linear correlation within one cluster (left, scatter plot, Figure 3) and the dispersion in the other cluster captures the interplay between the sinusoidal and quadratic components of the function.The linear cluster indicates a linear correlation between these two hyperparameters within a specific region of input data.This pattern may correspond to the sinusoidal components of the function, where local oscillations exhibit a consistent relationship between the length scale and signal variance.The additional dispersed points in the scatter plot likely correspond to the regions influenced by the quadratic term in the function, where the statistical properties vary, leading to a broader distribution of the hyperparameters.The bimodal distributions observed in the violin plot for the length scale further corroborate the non-stationarity.The results provide insights into the intricate interplay between the sinusoidal and quadratic components of the function.Due to the strong non-stationarity in the data, non-stationary kernels performed well for this test function.Due to the low dimensionality, the number of hyperparameters is low in all cases, leading to robust training.It is in our opinion safe to conclude, that in one-dimensional cases, with moderate dataset sizes, and suspected non-stationarity in the data, non-stationary kernels are to be preferred.Both our parametric and deep non-stationary kernels performed well with a slight edge in accuracy for the parametric kernel (see Figure 6).Our two tested deep GP setups (Figures 6 and 7) were not competitive given our reasonable-constraint effort, which in this case, was in the order of days.
Moving on to the climate data example, Figure 4 shows a quite concentrated cluster near (0, 0) indicating weak non-stationarity.This concentration suggests that the statistical properties are consistent across this region, possibly reflecting a dominant pattern or behavior in the data.The computational experiments (see Figures 11) reveal a trade-off between accuracy and compute time.One has to put much more effortnumber of hyperparameters and time -into the computation for a relatively small gain in accuracy.Both the parametric non-stationary kernel and the deep kernel achieve higher accuracy than the stationary kernel but are costly.In time-sensitive situations, the stationary kernel is likely to be preferred in this situation; for best accuracy, the parametric non-stationary kernel or the deep kernel is the superior choice.Looking at the CRPS, the deep kernel has a slight edge in accurately estimating uncertainties over the parametric kernel; however, the parametric non-stationary kernel reaches the highest log marginal likelihood.For this dataset, we tested the BDGP without much success under our reasonable-effort constraint (see Figure 10).
Finally, for the X-ray scattering data, Figure 5 shows a high level of dispersion in the scatter and violin plots, indicating strong non-stationarity, which is further corroborated in the violin plots.This strong nonstationarity -particularly in the length scale -clearly benefits the performance of the deep kernel (see Figure 11).The stationary kernel shows quite poor performance both in terms of RMSE and CRPSsince stationary kernels only allow us to estimate uncertainties based on global properties of the data and local geometry it is expected that non-stationarity kernels estimate uncertainties more accurately which manifests itself in a higher CRPS.The parametric non-stationary kernel leads the field in log marginal likelihood.Surprisingly, among the three tested kernels, the deep kernel leads to by far the lowest log marginal likelihood, which, again, suggests a better optimizer might lead to a much-improved performance.Once again, our setup of the BDGP was not competitive (see Figure 13).

Key Takeaways from the Computational Experiments
While we included as much information in the computational experiments and the Appendix as possible to give the reader a chance to make up their own minds, here we summarize some key takeaways.
1. Stationary kernels proved to be surprisingly competitive in terms of the RMSE and are unbeatable when evaluating accuracy per time.It seems worth it in most cases to run a stationary GP first for comparison.
2. Uncertainty is estimated more accurately by non-stationary kernels; if UQ-driven decision-making is the goal, non-stationary kernels are to be preferred.This is not a surprise since, given a constant (possibly anisotropic) length scale and a signal variance, the posterior variance only depends on datapoint geometry.
3. The parametric non-stationary kernel encodes a flexible non-stationarity while maintaining interpretability.The involved parametric functions over the input space can be visualized and interpreted.
4. Deep kernels are some of the most flexible kernels but interpretation is lost in all but the simplest cases, which can easily lead to model misspecifications (wrong model class and hyperparameters).
Our models took a fair amount of experimenting before an acceptable performance was achieved.In online applications, in which testing beforehand is not possible, deep kernels should be used with caution.
5. Non-stationarity in the covariance structure appears in signal variance and length scale; ideally a kernel addresses both (see Equation 29).
6.While not included in the tests, experimenting with prior mean functions has shown that non-stationarity properties highly depend on the prior mean function of the GP.This is especially true for the nonstationary signal variance.
7. Extrapolation and non-stationary kernels are difficult to combine.While the parametric non-stationary kernel can be set up for extrapolation, traditional neural networks are poorly equipped for that task.
8. We should think of the number of hyperparameters conservatively; too many bear the risk of model misspecification (through local minima) and overfitting.
9. The parametric non-stationary kernel achieved overall better RMSE; the deep kernel led to better uncertainty quantification as indicated by the CRPS.

A Parametric Deep Kernel
Given the observation that non-stationarity in the covariance structure of a dataset originates from a nonconstant signal variance and length scales, one might argue that both should be addressed in the kernel design.Our parametric non-stationary kernel attempts to account for all non-stationary purely through the signal variance; it leaves the length scale constant -however, implementations exist that allow nonconstant and anisotropic length scales [6,42], generally in the same flavor as the parametric non-stationary signal variance.The deep kernel, on the other hand, only acts on the length scale by warping the input space.It seems logical to ask what happens if we combine the two concepts.The kernel achieves just that; it is a combination of our parametric non-stationary kernel and the deep kernel.Modeling our synthetic dataset, see Figure 14, we see a good performance with moderate improvements compared to our earlier tests (Figure 6).The kernel might constitute somewhat of an overkill for such a simple problem but might lead to more significant gains in real applications.We encourage the reader to give this kernel a try in their next application.
RM SE = 0.038 L = 95.53CRP S = −0.015T ime = 472.17s Figure 14: Approximation result for the one-dimensional synthetic dataset using the parametric deep nonstationary kernel (29).The kernel combines non-stationarity in the signal variance and the length scale by leveraging both the parametric non-stationarity kernel design and a deep kernel.The approximation is on par with our previously tested methods but the reached likelihood is higher.This kernel might perform very well when strong non-stationarity is present in length scale and signal variance

Connection between Multi-Task Learning and Non-Stationary Kernels
Multi-task learning offers a powerful paradigm to leverage shared information across multiple related tasks, thereby enhancing the predictive performance of each individual task [63,64,65].This is particularly beneficial when data for some tasks are sparse, as information from data-rich tasks can be used to inform predictions for data-scarce tasks.Flexible non-stationary kernels offer an interesting benefit for multi-task learning: instead of employing specialized techniques, such as coregionalization, one can reformulate the problem to a single task problem and let a flexible non-stationary kernel learn intricate correlations between input (X i ) and output (X o ) space locations.By transforming the multi-task learning problem over X i to a single-task learning problem over X i × X o , no further changes to the core algorithm are required.This has been known for a long time and is referred to as problem-transformation methods [66].These methods were originally dismissed as not being able to capture intricate correlations between the tasks; however, this is only true if stationary, separable kernels are used.A flexible non-stationary kernel is able to flexibly encode covariances across the input and the output domain, independent of the indexing of the tasks.This makes it possible to transfer all complexities of multi-task learning to the kernel design and use the rest of the GP framework as-is, inheriting the superior robustness and interpretability properties of single-task GPs.

SUMMARY AND CONCLUSION
In this paper, we put on the hat of a machine learning practitioner trying to find the best kernel or methodology within the scope of a Gaussian process (GP) to address non-stationarity in various datasets.We introduced three different datasets -one synthetic, one climate dataset, and one originating from an Xray scatting experiment at the CMS beamline at NSLSII, Brookhaven National Laboratory.We introduced a non-stationarity measure and studied each dataset to be able to judge their non-stationarity properties quantitatively.We then presented four different methodologies to address the non-stationarity: Ignoring it by using a stationary kernel, a parametric non-stationary kernel, that uses a flexible non-constant signal variance, a deep kernel that uses a neural network to warp the input space, and a deep GP.We set all methodologies up under reasonable effort constraints to allow for a fair comparison; just like a a practitioner might encounter.In our case, that meant minutes of setup time for the stationary kernels and hours to days for the non-stationary kernels and the deep GPs.After the methodologies were set up, we ran our computational tests and presented the results unredacted and uncensored.This way, we hope, the reader gets the best value out of the comparisons.This is also to ensure that the reader has the chance to come up with conclusions different from ours.To further the readers' ability to double-check and learn, we are publishing all our codes online.
Our tests have shown that even weak non-stationarity in a dataset motivates the use of non-stationary kernels if training time is not an issue of concern.If training time is very limited, stationary kernels are still the preferred choice.We have discovered that non-stationarity in the covariance comes in two flavors, in the signal variance and the length scale and, ideally, both should be addressed through novel kernel designs; we drew attention to one such kernel design.However, non-stationary kernels come with a great risk of model misspecification.If a new model should be trusted out of the box, a stationary kernel might be the preferred choice.
We hope that this paper motivates more practitioners to deploy and experiment with non-stationary kernels but to also be aware of some of the risks.

D THE DEEP GP ALGORITHM
DGPs provide a fusion of the non-parametric flexibility of GPs with the hierarchical structure typical of deep neural networks.In essence, DGPs stack multiple layers of Gaussian processes on top of each other, with the output of one layer serving as the input for the next, much like the architecture of deep neural networks.Fitting a DGPs is a nuanced procedure that marries the principles of GP with the hierarchical structure of deep networks.Through techniques like variational inference, the challenges posed by the depth and complexity of the model are addressed, paving the way for powerful, flexible, and interpretable modeling of data.When fitting a DGP, the primary objective is to optimize the marginal likelihood of the observed data, given the parameters of the Gaussian processes in each layer and the likelihood function.
The marginal likelihood, in the context of GPs, is the likelihood of the observations after integrating out the latent functions.Mathematically, this is expressed as Here, Y are the observations, X are the inputs, F represents the latent functions (outputs of the GP layers), p(Y |F ) is the likelihood, and p(F |X) is the Gaussian process prior.However, in DGPs, the optimization of this marginal likelihood becomes more intricate due to the cascading nature of layers.Each layer's latent functions become the inputs for the subsequent layer, resulting in a composite model where the inference is not straightforward.The depth introduces additional non-linearity and complexity.From a frequentist viewpoint, the intricacies of DGP are typically managed through variational inference.This method centers on approximating the authentic posterior using a variational distribution, characterized by specific variational parameters.The primary goal becomes the maximization of the evidence lower bound (ELBO) in Algorithm 3, effectively providing a lower threshold on the log marginal likelihood.This transformation changes a previously intractable inference conundrum into a solvable optimization challenge, wherein the variational parameters are systematically tuned to align the variational distribution more closely with the authentic posterior.On the other hand, from a Bayesian perspective, [62] offers a profound exploration into the nuances of DGPs.This comparative analysis on an array of DGP models sheds light on their distinctive performance metrics and characteristics.A notable highlight from the paper is the unveiling of an innovative Bayesian approach to classification problems using DGPs, emphasizing their prowess in adeptly managing multifaceted structured data and in discerning intricate patterns and associations.
for i in 0 to m do 5:a ∼ U ([0, 1 − size] n ) 6: b = a + size 7: select c test data points X t from [a, b] n : 8:select associated y t in dataset or evaluate synthetic function at X t length scale = run_MLE(X t , y t )

Figure 2 :
Figure2: Our way of measuring the non-stationarity of a dataset or a synthetic function.When a set of data points is drawn randomly from within the input domain and a GP, using a stationary kernel, is trained via log marginal likelihood maximization after each draw (MLE), the final distribution of the hyperparameters -here a signal variance and an isotropic length scale -can be used to measure non-stationarity.This distribution can be visualized via scatter (middle) and violin (right) plots.We assign the variance of these distributions as a numerical measure of stationarity.The stationary function (top) leads to a very narrow distribution of the length scale and the signal variance.For the non-stationary functions (second and third row), the distributions for both hyperparameters are broader and the associated variances are larger.In this figure, the axis scales are kept constant to allow easy comparison of the stationarity properties of the test functions.

Figure 3 :
Figure 3: Test dataset 1 (a) and its non-stationarity measures visualized as distributions (b, c, d) in the hyperparameters of a stationary kernel trained on local subsets of that data.The dataset is derived from a one-dimensional synthetic function (see Equation (24)).Non-stationarity appears to be clearly present in the length scale and the signal variance.

Figure 4 :Figure 5 :
Figure 4: Test dataset 2 (a, b) and its non-stationarity measures visualized as distributions in the hyperparameters (c, d, e) of a stationary kernel trained on local subsets of that data.The dataset consists of recorded temperatures across the United States and a period of time.Weak non-stationarity appears to be present in the length scale and the signal variance.

Figure 6 :
Figure 6: Performance overview for dataset 1.From the top-left: stationary reference kernel (k stat ), parametric non-stationary kernel (k para ), deep kernel (k nn ), and deep GP.In this test, the parametric nonstationary kernel and the deep kernel reached similar approximation performances, while the former used significantly fewer hyperparameters, which is why it can be trained significantly faster.The deep GP (see Appendix D for the algorithm) over-smoothed the model and took a long time to train.We note that the deep-GP result highly depended on the specified nugget -too small, and the algorithms produced NaNs, just a little larger, and the presented smoothing was observed.The result was the same when we allowed the algorithm to choose its own nugget.Note the asterisk next to the deep-GP compute time, denoting that this is a separate software written in a different language (R), and compute times can therefore not be directly compared.The calculations of the performance measures as a function of MCMC training time were run five times and displayed are the mean and confidence bounds (one standard deviation).

Figure 8 :
Figure 8: Performance overview for the climate model.From the top: k stat , k para , and k nn .The posterior mean is displayed in the left column.The posterior variance is on the right.The model function is defined over [0, 1] 2 × {0.5}.This model is obtained after the convergence of a global evolutionary optimizer.Notable is the fast computing time of the stationary kernel due to the fact that only three hyperparameters have to be found: a signal variance, one isotropic length scale, and the nugget (i.i.d.noise).This parametric non-stationary kernel led to a total number of 58 hyperparameters, which increases the computing time significantly compared to the stationary kernel.The accuracy, in this case, is slightly higher.It is debatable and highly application-driven whether the parametric non-stationary kernel should be used for this dataset.If time is not an issue, the increased accuracy can pay off in downstream operations.The deep kernel contains 186 hyperparameters which are identified robustly.The accuracy of the prediction and the UQ is similar to the parametric non-stationary kernel, with a slight advantage in the CRPS but approximately twice the compute time.As with all our test runs, this run was repeated several times with the same result.

Figure 9 :Figure 10 :
Figure 9: MCMC sampling convergence for the climate model and all kernels.From the top: k stat , k para , and k nn .The MCMC is converging robustly for the stationary and parametric non-stationary kernels; the deep kernel is by far the slowest to converge and the error metrics are barely improving.This might be due to the weak non-stationarity in this dataset (see Figure4).All runs were repeated five times and displayed are the mean and confidence bounds (one standard deviation).

Figure 11 :
Figure 11: Performance overview for the X-ray model.From the top: k stat , k para , and k nn .The posterior mean is displayed in the left column.The posterior variance is on the right.The model function is defined over [0, 1] 2 × {0.24}.This model is obtained after the convergence of a global evolutionary optimizer.Once again, the stationary kernel stands out due to fast computing speeds.The compute time increases significantly for the parametric non-stationary kernel, due to the 58 hyperparameters that have to be found.Since the CRPS is our most relevant performance metric, the approximation and the UQ are better than for the stationary kernel.The deep kernel, for this dataset, performs best with respect to the RMSE and CRPS while achieving lower log marginal likelihood.The training time increases again because 186 hyperparameters have to be found.

Figure 12 :Figure 13 :
Figure12: MCMC sampling convergence for the X-ray model and all kernels.From the top: k stat , k para , and k nn .The MCMC is converging robustly and, in this case, at similar time scales, making the case for the deep kernel that reached the best CRPS (in the optimization).All runs were repeated five times and displayed are the mean and confidence bounds (one standard deviation).

2
Figure 13: Performance overview for the X-ray scattering model, computed with the BDGP.The model function -posterior mean (left) and variance (right) -is defined over [0, 1] 2 ×{0.24}.This result is, again, included for completeness; it is not competitive compared to the tested kernels.Our script is presented in Appendix C.

Algorithm 3 4 :
Detailed Deep Gaussian Process Model Training Require: x data , y data , x pred 1: Create inducing points Z spanning the range of X 2: Define two GP layers: • Each layer is defined with an RBF kernel • Using a random collection of inputs Z as inducing points • The last layer has a zero mean function 3: Define a Gaussian likelihood per layer with noise variance 1e − 4 Construct a 2-layer Deep GP model with the GP layers and likelihood 5: Compile the model with the stochastic Adam optimizer [71, 72] (Alternatively, one can perform Bayesian variational inference [56].)6: Train the model for 10 4 epochs: • Optimize the parameters to maximize the marginal likelihood p(Y |X) • Where p(Y |X) = p(Y |F )p(F |X) dF 7: Predict on x pred and extract mean and variance E AVAILABLE DEEP GP IMPLEMENTATIONS

Table 1 :
Number of hyperparameters for our computational experiments.

Table 2
offers a comprehensive overview of various implementations of Deep Gaussian Processes (DGPs), a powerful tool in machine learning.It's fascinating to see the diversity of implementations across different programming languages and frameworks, such as R, TensorFlow, and PyTorch.This wide variety reflects the growing interest and applicability of DGPs in various domains, from scientific research to industrial applications.One of the key aspects of the table is the different fitting methods used, such as MCMC (Markov Chain Monte Carlo), SVI (Stochastic Variational Inference), and MLE (Maximum Likelihood Estimation).The choice of fitting method can significantly impact the efficiency and accuracy of the model.It's noteworthy that some packages may support more fitting methods than those listed, indicating flexibility and adaptability in the approach to modeling with DGPs.Recent developments in the field are evident from the inclusion of implementations associated with recent publications.This ongoing research and development signal a vibrant and evolving field, where new methodologies and techniques are continually being explored and refined.Accessibility is another important aspect highlighted in the table.Many of the implementations are openly accessible through platforms like GitHub.This open-source approach fosters collaboration and innovation, allowing researchers and practitioners to not only utilize these resources but also contribute to their development.The table's caption mentions that it extends and compares information from other sources, suggesting a comprehensive effort to collate and compare various DGP implementations.This comparative approach adds value to the table by placing the different implementations in context with one another, providing a broader perspective on the landscape of DGPs.While the table provides a valuable summary, it also hints at potential challenges.Users may need to delve into the individual papers or repositories to understand the specific features, advantages, and limitations of each implementation.This deeper exploration may reveal nuances and details that are beyond the scope of the table but are essential for a complete understanding of each DGP implementation.