The Gaussian process (GP) is a popular statistical technique for stochastic function approximation and uncertainty quantification from data. GPs have been adopted into the realm of machine learning (ML) in the last two decades because of their superior prediction abilities, especially in data-sparse scenarios, and their inherent ability to provide robust uncertainty estimates. Even so, their performance highly depends on intricate customizations of the core methodology, which often leads to dissatisfaction among practitioners when standard setups and off-the-shelf software tools are being deployed. Arguably, the most important building block of a GP is the kernel function, which assumes the role of a covariance operator. Stationary kernels of the Matérn class are used in the vast majority of applied studies; poor prediction performance and unrealistic uncertainty quantification are often the consequences. Non-stationary kernels show improved performance but are rarely used due to their more complicated functional form and the associated effort and expertise needed to define and tune them optimally. In this perspective, we want to help ML practitioners make sense of some of the most common forms of non-stationarity for Gaussian processes. We show a variety of kernels in action using representative datasets, carefully study their properties, and compare their performances. Based on our findings, we propose a new kernel that combines some of the identified advantages of existing kernels.

The Gaussian process (GP) is arguably the most popular member of the large family of stochastic processes and provides a powerful and flexible framework for stochastic function approximation in the form of Gaussian process regression (GPR). A GP is defined as a collection of random variables such that any finite subset has a joint normal (Gaussian) distribution.1 In the scope of GPR, the random variables are considered (model or latent) function values {f(x1), f(x2), f(x3), …}. Equivalently, we can understand a GP as a description of distributions over function spaces, in which case GP(m(x), K) is entirely specified by a mean function m(x) = E[f(x)] and a covariance K = k(xi, xj; h) = E[(f(xi) − m(xi))(f(xj) − m(xj))], where k(xi, xj; h) is the kernel or covariance function, h ∈ Θ is a vector of hyperparameters, and X is the input space. This role of the kernel as a covariance operator makes it arguably the most important building block when it comes to optimizing the flexibility, prediction accuracy, and expressiveness of a GP.

In a recent review,2 it was pointed out that the vast majority of applied studies using GPs employ the radial basis function (RBF) kernel (also referred to as the squared exponential or Gaussian kernel). The fraction is even higher when we include other stationary kernels. Stationary kernels are characterized by k(xi, xj) = k(|xi − xj|), i.e., covariance matrix entries only depend on some distance between data points in the input domain, not on their respective location. Stationary kernels are popular because they carry little risk—in terms of model misspecification—and come with only a few hyperparameters that are easy to optimize or train. However, it has been shown that the stationarity assumption can lead to poor prediction performance and unrealistic uncertainty quantification that is affected mostly by data-point geometry.3 In other words, uncertainty increases at a constant rate when moving away from data points, no matter where in the input space those data points are located. To overcome these limitations, significant research attention has been drawn to non-stationary Gaussian process regression.4–6 Non-stationary kernels depend on the respective locations in the input space explicitly, i.e., k(xi, xj) ≠ k(|xi − xj|). This makes them significantly more flexible and expressive, leading to higher-quality estimates of uncertainty across the domain. Even so, non-stationary kernels are rarely used in applied machine learning (ML) due to the inherent difficulty of customization, optimizing hyperparameters, and the associated risks of model misspecification (wrong hyperparameters), overfitting, and computational costs due to the need to find many hyperparameters.7,8 When applied correctly, non-stationary GPs have been shown to provide significant advantages over their stationary counterparts, especially in scenarios where the data exhibit non-stationary behavior.9

This paper aims to bring some structure to the use of non-stationary kernels and related techniques to make it more feasible for the practitioner to use these kernels effectively. Throughout this paper—in the hope of covering the most practical set of available options—we focus on and compare the following four ways to handle non-stationarity in a dataset:

  1. Ignore it: Most datasets and underlying processes exhibit some level of non-stationarity, which is often simply ignored. This leads to the use of stationary kernels of the form kstat = k(|xi − xj|); |·| is a norm appropriate to the properties of the input space. Given the key takeaways in Ref. 2, this option is chosen by many, which serves as the main motivation for the authors to write this perspective.

  2. Parametric non-stationarity: Kernels of the form k(xi, xj) = Σ_{d=1}^{N} gd(xi) gd(xj) kstat(|xi − xj|), where gd(x) can be any parametric function over the input space and N is some positive integer.

  3. Deep kernels: Kernels of the form kstat(|ϕ(xi) − ϕ(xj)|), where ϕ is a neural network (NN), and kstat is any stationary kernel. This kernel was introduced by Wilson et al.10 and was quickly established as a powerful technique, even though potential pitfalls related to overfitting were also discovered.11 

  4. Deep GPs: Achieving non-stationarity not by using a non-stationary kernel but by stacking stationary GPs—meaning the output of one GP becomes the input to the next, similar to how neural networks are made up of stacked layers—introduces a non-linear transformation of the input space and thereby non-stationarity.12 

Non-stationarity can also be modeled through the prior-mean function of a Gaussian process, but for the purpose of this paper, we use a constant prior mean and focus on the four methods described earlier. From the brief explanations of the three types of non-stationarity, it is immediately obvious that deriving, implementing, and deploying a non-stationary kernel and deep GPs (DGPs) can be an intimidating endeavor. The function gd (in No. 2 above), for instance, can be chosen to be any sum of local basis functions—linear, quadratic, cubic, piecewise constant, etc.—any polynomial or sums thereof, and splines, just to name a few. Given the vast variety of neural networks (in No. 3), choosing ϕ is no easier. DGPs involve stacking GPs, which increases flexibility but also leads to a very large number of hyperparameters, or latent variables, making optimization and inference a big challenge. In addition, a more complex kernel generally results in reduced interpretability—a highly valued property of GPs.

This paper is concerned with shining a light on the characteristics and performance of different kernels and experiencing them in action. In a very pragmatic approach, we study the performance of the three methodologies above to tackle non-stationarity and compare them to a stationary Matérn kernel to create a baseline. Our contributions can be summarized as follows:

  1. We strategically choose a representative set of test kernels and investigate their properties.

  2. We choose test datasets to study the performance of the chosen kernels.

  3. We derive performance and (non)-stationarity metrics that allow fair comparisons of the non-stationary kernels and the selected deep GP implementation.

  4. We compare the performance of the candidate kernels tasked with modeling the test datasets and present the uncensored results.

  5. We use the gathered insights to draw some conclusions and suggest a new kernel design for further investigation.

We identify and exploit two primary pathways to achieve non-stationary properties: through the application of deep architectures—deep kernels or deep GPs—and the utilization of parametric non-stationary kernels. Our work delves into the role of parametric non-stationary kernels, arguing that these kernels offer similar behavior to a deep-kernel architecture without any direct notion of deepness, hinting at the possibility that deepness and flexibility are practically synonymous. This provides a valuable computational perspective to the broader dialog surrounding deep architectures and non-stationarity, offering insights that may be instrumental in future research.

The remainder of this paper is organized as follows: The next section introduces some minimal but necessary theory. There, we also introduce some measures that will help us later when it comes to investigating the performance of different kernels. Then, we give an overview of the relevant literature and some background. Subsequently, we introduce a range of different kernels and some test datasets, which position the reader to see each kernel in action in different situations. We then discuss the key takeaways; we give some pointers toward improved kernel designs, make remarks about the connection between non-stationary kernels and multi-task learning, and conclude with a summary of the findings.

The purpose of this section is to give the reader the necessary tools to follow along with our computational experiments. It is not intended to be a complete or comprehensive overview of GP theory or related methodologies. For the remainder of this paper, we consider some input space X ⊆ R^n with elements x. While we often think of the input space as a subspace of the Euclidean space, this is not a necessary assumption of the framework generally. We assume data D is a set of tuples {(xi, yi)}, i ∈ {1, 2, 3, …, |D|}. In this work, except for a comment on multivariate GPs at the end, we will assume scalar yi.

The GP is a versatile statistical learning tool that can be applied to any black-box function. The ability of GPs to estimate both the mean and the variance of the latent function makes it an ideal ML tool for quantifying uncertainties in the function approximation. GPs assume a model of the form y(x) = f(x) + ϵ(x), where f(x) is the unknown latent function, y(x) is the noisy function evaluation (the measurement), ϵ(x) is the noise term, and x is an element of the input space X. For a fixed, finite set of inputs D = {x1, …, x|D|}, a Gaussian process implies that the vector f = (f(x1), …, f(x|D|)) has a multivariate Gaussian distribution
(1)   p(f) = N(m, K)
(here taken to be the prior distribution over function values), where K is the covariance matrix defined by the kernel function Kij = k(xi, xj), and m is the prior mean obtained by evaluating the prior-mean function at points xi. Conditioning on f, we can define a likelihood for the corresponding vector of data, y = (y(x1), …, y(x|D|)), denoted p(y|f), which allows us to perform Bayesian inference, typically (but not necessarily) using a multivariate Gaussian distribution
(2)   p(y|f) = N(f, V),
where the covariance matrix V is used to describe measurement errors arising from ϵ(x). Training the hyperparameters, extending the prior over latent function values at points of interest, marginalizing over f, and conditioning on the data y yield the posterior probability density function (PDF) for the requested points.1 For example, for a Gaussian process with Gaussian likelihood, marginalization over f can be obtained in closed form
(3)   p(y) = ∫ p(y|f) p(f) df = N(m, K + V),
where we suppress the implicit conditioning on the hyperparameters. Training the GP can be performed by maximizing the log marginal likelihood—i.e., the log of Eq. (3) taken as a function of the hyperparameters h when the data y are given,
(4)   ln p(y|h) = −(1/2)(y − m)^T (K(h) + V)^{−1} (y − m) − (1/2) ln det(K(h) + V) − (|D|/2) ln(2π),
with respect to the hyperparameters h. After the hyperparameters are found, the posterior is defined as
(5)   p(f0|y) = N(m0 + κ^T (K + V)^{−1} (y − m), K0 − κ^T (K + V)^{−1} κ),
where κ = k(x0, xj), K0 = k(x0, x0), and m0 is the prior mean function evaluated at the prediction points x0. If, in a Bayesian setting, we want to further incorporate uncertainties in the hyperparameters h ∈ Θ, Eq. (5) becomes
(6)   p(f0|y) = ∫_Θ p(f0|y, h) p(h|y) dh,
where p(f0|y, h) is available in closed form (5). This basic framework can be extended by more flexible mean, noise, and kernel functions.2,13–15 Much of the GP framework depends on the ability to extend the original prior p(f) to the joint distribution p(f, f0), which is made possible through the definition of the kernel as a covariance operator that can be evaluated on pairs of arbitrary inputs (xi, xj).
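For readers who prefer code to equations, the following minimal NumPy sketch evaluates Eqs. (4) and (5) for a Gaussian likelihood with V = noise_var · I. It assumes a user-supplied kernel(X1, X2) function and a constant prior mean; the function names are ours and are not taken from any particular GP package.

import numpy as np

def log_marginal_likelihood(X, y, kernel, noise_var, prior_mean=0.0):
    # Eq. (4): log N(y; m, K + V) with V = noise_var * I
    K = kernel(X, X) + noise_var * np.eye(len(X))
    r = y - prior_mean
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))    # (K + V)^{-1} (y - m)
    return -0.5 * r @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2.0 * np.pi)

def gp_posterior(X, y, X0, kernel, noise_var, prior_mean=0.0):
    # Eq. (5): posterior mean and covariance at the prediction points X0
    K = kernel(X, X) + noise_var * np.eye(len(X))
    kappa = kernel(X, X0)                                   # cross-covariances k(x0, xj)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - prior_mean))
    v = np.linalg.solve(L, kappa)
    return prior_mean + kappa.T @ alpha, kernel(X0, X0) - v.T @ v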
In this work, the focus is on the kernel function—also called the covariance function—of a GP, denoted k: X × X → R. In the GP framework, the kernel function assumes the role of a covariance operator, i.e., the elements of the covariance matrix K in Eq. (1) are defined by Kij = k(xi, xj). Because of that, kernel functions have to be symmetric and positive semi-definite (psd); a complete characterization of the class of valid kernel functions is given by Bochner’s theorem.16,17 Kernels are called “stationary” if the function only depends on the distance between the two inputs, not their respective locations, i.e., k(xi, xj) = k(|xi − xj|). Arguably the most prominent members are kernels of the Matérn class (see, for instance, Ref. 18), which includes the exponential
(7)   k(xi, xj) = σs^2 exp(−‖xi − xj‖2 / l)
and the squared exponential (RBF) kernel
(8)   k(xi, xj) = σs^2 exp(−‖xi − xj‖2^2 / (2l^2)),
by far the most-used kernel among practitioners;2 here, ‖·‖2 is the Euclidean norm, which makes the kernel function both stationary and isotropic. These two special cases of the Matérn kernel class have two hyperparameters: σs, the signal variance, and l, the length scale, both of which are constant scalars applied to the entire domain X. Stationary and isotropic kernels can be advanced by allowing anisotropy in the norm, i.e., ((xi − xj)^T M (xi − xj))^{1/2}, where M is some symmetric positive definite matrix whose entries are typically included in the vector of hyperparameters that need to be found. Incorporating anisotropy allows the kernel function to stretch distances such that the implied covariances have ellipsoidal patterns (as opposed to spherical patterns for isotropic kernels). When M = (1/l^2) In, the isotropic kernels in Eqs. (7) and (8) are recovered; more generally, when M is diagonal with different elements along the diagonal, automatic relevance determination is recovered.19 Other stationary kernel designs include the spectral kernel, the periodic kernel, and the rational quadratic kernel.
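As a concrete reference, here is a short NumPy/SciPy sketch of the isotropic kernels in Eqs. (7) and (8) and of an anisotropic variant. It assumes inputs are stored row-wise in two-dimensional arrays; the parameter names (sigma_s, l, M) simply mirror the symbols above and are not tied to any specific software package.

import numpy as np
from scipy.spatial.distance import cdist

def exponential_kernel(X1, X2, sigma_s, l):
    # Eq. (7): exponential kernel (Matern with nu = 1/2)
    d = cdist(X1, X2, metric="euclidean")
    return sigma_s**2 * np.exp(-d / l)

def rbf_kernel(X1, X2, sigma_s, l):
    # Eq. (8): squared exponential (RBF) kernel
    d = cdist(X1, X2, metric="euclidean")
    return sigma_s**2 * np.exp(-d**2 / (2.0 * l**2))

def anisotropic_rbf_kernel(X1, X2, sigma_s, M):
    # anisotropic distance ((xi - xj)^T M (xi - xj))^(1/2), M symmetric positive definite
    d2 = cdist(X1, X2, metric="mahalanobis", VI=M)**2
    return sigma_s**2 * np.exp(-0.5 * d2)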

Non-stationary kernels do not have the restriction of only depending on the distance between the input points but depend on their explicit positions xi and xj, i.e., k(xi, xj) ≠ k(|xi − xj|). Formulating new non-stationary kernels comes with the often-challenging task of proving positive semi-definiteness. However, the statistics and machine learning literature has a variety of general approaches for applying valid non-stationary kernels; this is the primary topic of this paper, which is discussed in Sec. III. For now, it is important to keep in mind that the essence of stationary vs non-stationary kernels—due to the way they deal with the locations in the input space—manifests itself in the covariance matrix, which, in the stationary case, has to be constant along all bands parallel to the diagonal for sorted and equidistant input points, a property that is not observed in the non-stationary case (see Fig. 1). Non-stationary kernels therefore lead to more descriptive, expressive, and flexible encodings of covariances and, thus, uncertainties. Of course, in addition, the function space contains a much broader class of functions when non-stationary kernels are used.
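The banding property described above is easy to verify numerically. The short sketch below builds both covariance matrices on sorted, equidistant inputs and checks whether the first off-diagonal band is constant; the kernels and the location-dependent signal-variance function g are purely illustrative choices of ours.

import numpy as np

x = np.linspace(0, 1, 8).reshape(-1, 1)                       # sorted, equidistant inputs
d = np.abs(x - x.T)

K_stat = np.exp(-d / 0.2)                                      # stationary: depends on |xi - xj| only
g = 1.0 + 3.0 * x.ravel()**2                                   # illustrative location-dependent signal variance
K_nonstat = np.outer(g, g) * K_stat                            # simple non-stationary construction

# stationary => constant along bands parallel to the diagonal (Toeplitz structure)
print(np.allclose(np.diag(K_stat, k=1), np.diag(K_stat, k=1)[0]))        # True
print(np.allclose(np.diag(K_nonstat, k=1), np.diag(K_nonstat, k=1)[0]))  # False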

FIG. 1.

The key concept and essence of non-stationary kernels. A synthetic function—which is also later used for our computational experiments—was sampled at 40 equidistant points. The function comprises high-frequency regions (far left and right) and near-constant-gradient regions (center, green circle). A Gaussian process (GP) is tasked with interpolating the data using a stationary (top, a) and a non-stationary (bottom, b) kernel. For each case, the function approximation and the prior covariance matrix are presented. While the posterior mean is similar in both cases, the posterior variance differs substantially. Focusing on the central region, the uncertainty increases between data points, even though the function is very well-behaved there. The covariance matrix can provide clues as to why this might happen. The matrix is constant along diagonals, which translates into uncertainties that depend on the distance from surrounding data points only, independent of where in the domain they are located. The non-stationary kernel has no such restriction and provides more realistic estimates of the uncertainty. The covariance entries are not constant along diagonals but correspond to different regions of the function (blue line connections).


Computational complexity is a significant challenge for GPs for both stationary and non-stationary kernels. The need to calculate the log-determinant and invert the covariance matrix—or solve a linear system instead—results in a computational complexity of O(|D|^3), where |D| is the number of data points.1,20 This complexity limits the application of GPs to large-scale datasets. Several methods have been proposed to overcome this issue. Sparse methods20,21 and scalable GPR approaches22 have been developed for stationary GPs. For non-stationary GPs, methods such as local GPs23 and the Bayesian treed GP24 have been proposed to tackle this issue. These methods have an origin in divide-and-conquer methods attempting to break down the problem into smaller, manageable pieces that can be solved independently,25 thereby reducing the computational complexity for each piece. However, these approaches can lead to a loss of global information, so finding the right balance between computational efficiency and model accuracy remains a key challenge. A recent approach can scale GPs to millions of data points without using approximations by allowing a compactly supported, non-stationary kernel to discover naturally occurring sparsity.8

Stationary kernels are widely used primarily because they are easy and convenient to implement, even though the implied assumption of translation-invariant covariances is almost never exactly true for real-world datasets. As mentioned in Sec. II, there are serious challenges associated with both deriving non-stationary kernels and choosing an appropriate and practical non-stationary kernel from the valid options for any given implementation of a Gaussian process. We now provide a brief overview of the literature on non-stationary kernels, including both a historical perspective of early developments followed by greater detail on three modern frameworks for non-stationary kernels as well as metrics for quantifying non-stationarity in datasets.

It has now been over three decades since the first paper on non-stationary kernels via “deformations” or warping of the input space appeared.4 Since then, the statistics literature has developed a number of approaches for non-stationary kernels, mostly in the context of modeling spatially referenced data. These methods can broadly be categorized as basis function expansions and kernel convolutions, in addition to the aforementioned deformation approach. We now briefly summarize each method, focusing on aspects that apply directly to kernel functions for Gaussian processes.

The fundamental idea underpinning the deformation or warping approach4 is that instead of deriving new classes of non-stationary kernels, one can keep isotropic kernels but obtain non-stationarity implicitly by rescaling interpoint distances in a systematic way over the input space. In other words, this approach transforms X to a new domain, say X*, wherein stationarity holds. The transformation, say ϕ: R^n → R^{n*}, is a (possibly non-linear) mapping applied to elements of X to yield a non-stationary kernel via
(9)   k(xi, xj) = kstat(|ϕ(xi) − ϕ(xj)|),
where kstat is an arbitrary stationary kernel function. Two extensions to this approach were later proposed26,27 that supposed the mapping ϕ(·) was itself a stochastic process. For example, Ref. 27 places a Gaussian process prior on ϕ(·)—essentially arriving at the idea of deep kernels more than a decade before related ideas appeared in the machine learning literature. In some cases, n* > n, i.e., the mapping involves dimension expansion.28 Ultimately, early approaches to warping the input space were largely unused due to a lack of computational tools for optimizing the mapping function ϕ(·) in a reliable and robust manner.
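As a minimal illustration of Eq. (9), the sketch below warps a one-dimensional input space with a hypothetical smooth mapping ϕ before evaluating any stationary kernel; the specific ϕ is our own toy choice and is not one proposed in the cited works.

import numpy as np

def phi(x):
    # hypothetical warping of [0, 1]: compresses one part of the domain and stretches another
    return x + 0.3 * np.sin(2.0 * np.pi * x)

def warped_kernel(X1, X2, kstat):
    # Eq. (9): evaluate a stationary kernel on warped inputs
    return kstat(phi(X1), phi(X2))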
In contrast to deformations, basis function expansion methods provide constructive approaches for developing non-stationary kernel functions. The main idea for this approach arises from the Karhunen–Loève expansion29,30 of a (mean-zero) stochastic process in terms of orthogonal eigenfunctions Em(·) and weights wm,
(10)   f(x) = Σ_{m=1}^{∞} wm Em(x).
This framework defines a Gaussian process if the weights have a Gaussian distribution; the implied kernel function is
k(xi, xj) = Σ_{m=1}^{∞} vm Em(xi) Em(xj),
where the eigenfunctions Em(·) and weight variances vm come from the Fredholm integral equation
(11)   ∫_X k(x, x′) Em(x′) dx′ = vm Em(x).
If the infinite series in Eq. (10) is truncated to the leading M terms, the finite sum approximation to the kernel can be used instead and is optimal in the sense that it minimizes the variance of the truncation error for all sets of M basis functions when the Em(·) are the exact solutions to the Fredholm equation.31 The main task is then to model the weight-eigenfunction pairs {wm, Em(·)}, which can be performed empirically using singular value decomposition32 or parametrically using, e.g., wavelets.33 
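A minimal sketch of the empirical (singular value decomposition) route mentioned above: given replicated observations on a fixed grid, the leading eigenpairs of the empirical covariance serve as the weight variances vm and eigenfunctions Em, and the truncated sum approximates the kernel on that grid. The function name and interface are ours, not those of a published implementation.

import numpy as np

def truncated_kl_kernel(Y, M):
    # Y: replicated observations of the process on a fixed grid, shape (n_samples, n_grid)
    C = np.cov(Y, rowvar=False)                      # empirical covariance on the grid
    v, E = np.linalg.eigh(C)                         # eigenpairs in ascending order
    v, E = v[::-1][:M], E[:, ::-1][:, :M]            # keep the M leading terms
    return E @ np.diag(v) @ E.T                      # k(xi, xj) ~ sum_m vm Em(xi) Em(xj)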
Like basis function expansions, the kernel convolution approach is useful in that it provides a constructive approach to specifying both stochastic models and their covariance functions. The main result is that a stochastic process can be defined by the kernel convolution
(12)   f(x) = ∫_{R^n} κx(u) dW(u)
(see Refs. 34 and 35), where W(·) is an n-dimensional stochastic process and κx(·) is a kernel function that depends on the input location x. Reference 36 summarizes the extremely flexible class of stochastic models defined using Eq. (12) (see, for example, Refs. 37–39). The popularity of this approach is due largely to the fact that it is much easier to specify (possibly parametric) kernel functions than a covariance function directly, since the kernel functions only require ∫_{R^d} κx(u) du < ∞ and ∫_{R^d} κx(u)^2 du < ∞. The process f(·) in Eq. (12) is a Gaussian process when W(·) is chosen to be Gaussian, and the associated covariance function is
(13)   k(xi, xj) = ∫_{R^n} κ_{xi}(u) κ_{xj}(u) du,
which cannot be written in terms of ‖xi − xj‖ and is hence non-stationary. Various choices can be made for using this general framework in practice: replace the integral in Eq. (12) with a discrete sum approximation40 or choose specific κx(·) such that the integral in Eq. (13) can be evaluated in closed form.41 The latter choice can be generalized to yield a closed-form kernel function that allows all aspects of the resulting covariances to be input-dependent: the length-scale,5,6 the signal variance,42 and even the differentiability of realizations from the resulting stochastic process when the Matérn kernel is leveraged.43 This approach is often referred to as “parametric” non-stationarity since a non-stationary kernel function is obtained by allowing its hyperparameters to depend on the input location. In practice, some care needs to be taken to ensure that the kernel function is not too flexible and can be accurately optimized.6,44 We return to a version of this approach in Sec. III B.
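To make Eq. (13) concrete, the following sketch uses the discrete-sum route mentioned above in one dimension: Gaussian basis kernels κx(·) are evaluated on a fixed grid of support points, and an input-dependent width makes the resulting covariance non-stationary. The grid, the width function, and the names are illustrative assumptions of ours.

import numpy as np

def convolution_kernel(X1, X2, U, width):
    # discrete-sum approximation of Eq. (13): k(xi, xj) ~ sum_u kappa_xi(u) kappa_xj(u)
    # U: fixed grid of support points u; width(x): input-dependent kernel width
    def kappa(X):
        w = width(X)[:, None]
        return np.exp(-0.5 * ((X[:, None] - U[None, :]) / w) ** 2)
    return kappa(X1) @ kappa(X2).T

# example: widths that grow across the domain induce longer correlations on the right
U = np.linspace(0, 1, 200)
K = convolution_kernel(np.linspace(0, 1, 5), np.linspace(0, 1, 5), U, lambda x: 0.05 + 0.2 * x)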

To conclude this section, the statistics literature contains a broad set of techniques (only some of which are summarized here) for developing non-stationary kernel functions. However, historically speaking, these techniques were not widely adopted because, in most of the cases described here, the number of hyperparameters is of the same order as the number of data points. This property makes it very difficult to apply the kernels to real-world datasets due to the complex algorithms required to fit or optimize such models.

Nonetheless, the potential benefits of applying non-stationary kernels far outweigh the risks, in our opinion, and this perspective is all about managing this trade-off. To do so, we now introduce three modern approaches to handle non-stationarity in datasets in order to later test them and compare their performance (Sec. IV).

This class of non-stationary kernels uses a parametric function as the signal variance. The product g(xi)g(xj) is always symmetric and psd14 and is, therefore, a valid kernel function. In addition, any product of kernels is a valid kernel, which gives rise to kernels of the form
(14)   k(xi, xj) = g(xi) g(xj) kstat(|xi − xj|).
This can be seen as a special case of the non-stationary kernel derived in Refs. 6 and 42, which, in particular, consider a parametric signal variance and an (anisotropic) parametric length scale; here, the length scale is taken to be a constant.42 In an extension of (14), any sum of kernels is a valid kernel, which allows us to write
(15)   k(xi, xj) = Σ_{d=1}^{N} gd(xi) gd(xj) kstat(|xi − xj|).
The function g can be any function defined on the input domain, but we will restrict ourselves to functions of the form
(16)   g(x) = Σ_k ck β(xk, x),
where ck are some coefficients (or parameters), and β(xk, x) are basis functions centered at xk. For our computational experiments, we use radial basis functions of the form
(17)   β(xk, x) = exp(−‖xk − x‖2^2 / (2w^2)),
where w is the width parameter.
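In one dimension, Eqs. (15)–(17) can be written down in a few lines; the sketch below combines the parametric signal-variance functions with a stationary Matérn (ν = 3/2) part. The basis-function centers, the shared width w, and the function names are our illustrative choices and do not reproduce the gpCAM implementation.

import numpy as np

def matern32(X1, X2, l=0.1):
    # stationary Matern kernel with nu = 3/2
    d = np.abs(X1[:, None] - X2[None, :])
    return (1.0 + np.sqrt(3.0) * d / l) * np.exp(-np.sqrt(3.0) * d / l)

def g(x, centers, coeffs, w):
    # Eqs. (16) and (17): parametric function built from radial basis functions
    B = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2 / w**2)
    return B @ coeffs

def k_para(X1, X2, centers, coeff_list, w, l=0.1):
    # Eq. (15): sum_d g_d(xi) g_d(xj) * kstat(|xi - xj|)
    S = np.zeros((len(X1), len(X2)))
    for coeffs in coeff_list:
        S += np.outer(g(X1, centers, coeffs, w), g(X2, centers, coeffs, w))
    return S * matern32(X1, X2, l)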
Our version of parametric non-stationarity operates on the signal variance only. This is by design so that we can separate the effects of the different kernels later in our tests. This next approach uses a constant signal variance but warps the input space to yield flexible non-constant length scales. The set of valid kernels is closed under a non-linear transformation of the input space as long as this transformation is constant across the domain, and the resulting space is considered a linear space. This motivates the definition of kernels of the form
(18)   k(xi, xj) = kstat(|ϕ(xi) − ϕ(xj)|),
where ϕ: R^n → R^{n*} can again be any scalar or vector function on the input space. Deep neural networks have been established as a preferred choice for ϕ due to their flexible approximation properties, which give rise to so-called deep kernels (see Algorithm 1 for an implementation example). For our tests, we define a 2-layer-deep network with varying layer widths. While it is possible through deep kernels to perform dimensionality reduction, in this work, we map the original input space into a linear space of the same dimensionality, i.e., n = n*. Care must be taken not to use neural networks whose weights and biases, given the dataset size, are underdetermined. That is why comparatively small networks are commonly preferred. We use ReLU as the activation function. The neural network weights and biases are treated as hyperparameters and are trained accordingly. We set kstat in (18) to be the Matérn kernel with ν = 3/2.
ALGORITHM 1.

Deep kernel with neural network.

Input: Set of points as arrays x1, x2; hyperparameters h ∈ Θ
Output: Deep kernel covariance matrix of shape [len(x1) × len(x2)]
Step 1: Define the neural network architecture with input dimension n, hidden layers, and appropriate activation functions
Step 2: Initialize or set the weights and biases of the neural network using h
Step 3: Transform x1 and x2 through the neural network
  • Apply the forward pass of the neural network to all points in sets x1 and x2
  • Store the transformed values as x1nn and x2nn
Step 4: Calculate the pairwise distance matrix between x1nn and x2nn
Step 5: Compute the deep kernel value using the distance matrix
  • Apply a kernel function (e.g., exponential) to the distance matrix
  • Scale and combine with other kernel functions if needed
Step 6: Return the deep kernel value
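The pseudocode above translates almost line by line into NumPy. The following sketch is one possible reading of Algorithm 1 and Eq. (18), with a two-hidden-layer ReLU network whose weights and biases are unpacked from the hyperparameter vector h; the layer width, the packing order, and the fixed length scale are our assumptions rather than prescriptions from the text.

import numpy as np
from scipy.spatial.distance import cdist

def deep_kernel(x1, x2, h, n, width=5):
    # Steps 1-2: define the network R^n -> R^n and set its weights/biases from h
    shapes = [(n, width), (width,), (width, width), (width,), (width, n), (n,)]
    params, i = [], 0
    for s in shapes:
        size = int(np.prod(s))
        params.append(h[i:i + size].reshape(s))
        i += size
    W1, b1, W2, b2, W3, b3 = params
    sigma_s = h[i]                                   # one constant signal variance

    def phi(X):                                      # Step 3: forward pass with ReLU activations
        a = np.maximum(X @ W1 + b1, 0.0)
        a = np.maximum(a @ W2 + b2, 0.0)
        return a @ W3 + b3

    d = cdist(phi(x1), phi(x2))                      # Step 4: pairwise distances in the warped space
    l = 1.0                                          # Step 5: stationary Matern (nu = 3/2) on those distances
    return sigma_s**2 * (1.0 + np.sqrt(3.0) * d / l) * np.exp(-np.sqrt(3.0) * d / l)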

Our deep kernel construction shares the perspective of using a neural network to estimate a warping function.4 In Ref. 45, the authors propose an approach to deep compositional spatial models that differs from traditional warping methods in that it models an injective warping function through a composition of multiple elemental injective functions in a deep-learning framework. This allows for greater flexibility in capturing non-stationary and anisotropic spatial data and is able to provide better predictions and uncertainty quantification than other deep stochastic models of similar complexity. This uncertainty quantification is point-wise, similar to the deep GPs we introduce next.

Deep Gaussian process (DGP) models are hierarchical extensions of Gaussian processes where GP layers are stacked—similar to a neural network—enhancing modeling flexibility and accuracy12,46,47 (more details can be found in  Appendix D). The DGP model is one of the deep hierarchical models48–50 and consists of a number of variational Gaussian process layers defined by a mean function and a stationary covariance (kernel) function.

Let us first consider a simple 2-layer DGP taking n-dimensional inputs x. The first layer uses a constant zero-mean GP(1) for a lower-dimensional representation of the input data, with its coordinate components distributed as fi(1) | x ∼ Nn(0, Σ1(x)) and f(1) = (f1(1), …, fn1(1)), where n1 is the number of nodes in the hidden layer and the components fi(1) are mutually independent.

The second layer uses a constant zero mean and takes the first layer’s output. Its output f(2) is distributed as f(2) | f(1) ∼ Nn(0, Σ2(f(1))) and is fed into an independent GP GP(2) as the second layer, generating the model output. We can impose more than two hidden layers for a single DGP according to our needs, where f(1) ∈ R^{n×n1}, f(2) ∈ R^{n1×n2}, …, f(L) ∈ R^{nL−1×n}, with layouts as random matrices. This is analogous to the connecting matrices in the architecture of a neural network. However, we match the two-layer structure used in our deep kernel and point out that the complexity of the neural architecture may not always lead to better performance.

For L > 2 layers in a DGP, we iterate this procedure of using the previous layer’s output as the next layer’s GP input until reaching the last layer. For an L-layer DGP architecture, this creates a sequence of L independent GPs, GP(1), GP(2), …, GP(L), corresponding to each layer. Each layer’s forward method applies the mean and covariance functions to input data and returns a multivariate normal distribution as output, which serves as the next layer’s input.

Although each layer of a DGP is equipped with stationary kernels, the output of one GP layer becomes the input to the next GP layer; hence, the final output will not be stationary. For a DGP with L layers, we can represent the model as follows using unique mean functions m(1), …, m(L) and covariance kernels k(1), …, k(L) that generate Σ1, …, ΣL used in the probabilistic distributions,
(19)   f(1) | x ∼ N(m(1)(x), Σ1(x)),
(20)   f(2) | f(1) ∼ N(m(2)(f(1)), Σ2(f(1))),
        ⋮
(21)   f(L) | f(L−1) ∼ N(m(L)(f(L−1)), ΣL(f(L−1))).
Then, an optimizer and the variational evidence lower bound (ELBO) are used for training the DGP model. Using a variational approximation (the ELBO) instead of exact inference leads to a manageable computational complexity for deeper GP models.51 Deep GPs can be perceived as hierarchical models whose kernels do not admit a closed form. Crucially, this “hierarchy of means” is constructed via the means of the layer distributions in the deep GP, but not higher moments.52–54 Only the mean functions at each layer of the deep GP are contingent upon computations from preceding layers, signifying hierarchies that rely on the first-order structure at every layer of the model.
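Although the text above describes variational training, the non-stationary effect of stacking can be seen already by sampling from a two-layer DGP prior. In the sketch below, a draw from a stationary GP is used as the input to a second stationary GP, and the composite sample exhibits location-dependent length scales; the kernels, length scales, and jitter are illustrative choices of ours, not the architecture used in our experiments.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)

def k_rbf(a, b, l):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l**2)

# layer 1: sample a warping function f1 ~ GP(0, k) on the grid
K1 = k_rbf(x, x, 0.2) + 1e-6 * np.eye(len(x))
f1 = np.linalg.cholesky(K1) @ rng.standard_normal(len(x))

# layer 2: a stationary GP evaluated on the warped inputs f1(x)
K2 = k_rbf(f1, f1, 0.2) + 1e-6 * np.eye(len(x))
f2 = np.linalg.cholesky(K2) @ rng.standard_normal(len(x))
# f2, viewed as a function of x, shows regions of rapid and slow variation, i.e., non-stationarity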

Using a slightly different framework based on the Vecchia approximation,55 we introduced a “deep Vecchia ensemble,” a hybrid method combining Gaussian processes (GPs) and deep neural networks (DNNs) for regression. This model builds an ensemble of GPs on DNN hidden layers, utilizing Vecchia approximations for GP scalability. Mathematically, the joint distribution of variables is decomposed using Vecchia’s approximation, and predictions are combined using a generalized product of experts. This approach offers both representation learning and uncertainty quantification. As described in the last section, a GP model can utilize a deep kernel, constructively combining the neural network’s and the GP’s strengths, leading to a model that benefits from the GP’s interpretability and the NN’s flexibility.10 

Our Bayesian DGP (BDGP) architecture follows Ref. 56 and includes a two-layer neural network applied as a transformation to the input data. The first layer uses a rectified linear unit (ReLU) activation function, and the second employs a sigmoid activation. This non-linear feature mapping expresses complex input space patterns (see Appendix C).

Contrasting DGPs with deep-kernel GPs, DGPs use multiple GP layers to capture intricate dependencies, whereas deep-kernel GPs employ a NN for input data transformation before GP application. Essentially, while DGPs exploit GP layering to manage complex dependencies, deep kernel learning leverages NNs for non-linear input data transformation, enhancing the GP’s high-dimensional function representation ability.

When it comes to characterizing non-stationarity, some methods focus on non-stationarity in the mean function (e.g., polynomial regression), while others concentrate on non-stationarity in the variance (e.g., geographically weighted regression57). Non-stationarity is typically characterized by a change in statistical properties over the input space, e.g., changes in the dataset’s mean, variance, or other higher moments. Quantifying non-stationarity is an active area of research, and in this paper, we introduce a particular kind of non-stationarity measure for the purpose of judging our test kernels when applied to our test datasets. Overall, measuring a given dataset’s non-stationarity properties is an important ingredient in understanding the performance of a particular kernel. For the reader’s convenience, we offer our non-stationarity measure as a pseudocode (see Algorithm 2). For a detailed theoretical motivation for the non-stationarity measure we use in this work, please refer to  Appendix A.

ALGORITHM 2.

Measuring non-stationarity via local, stationary-GP hyperparameter distributions.

1: procedure MEASURE NONSTATIONARITY(X, y, m, size, c)
   X is the set of data points, y is the collected data, m is the number of iterations, size is the size of the subdomain, and c is the number of selected points in each iteration
2:   length scale list = []
3:   signal variance list = []
4:   for i in 0 to m do
5:     a ∼ U([0, 1 − size]^n)
6:     b = a + size
7:     select c test data points Xt from [a, b]^n
8:     select associated yt in the dataset or evaluate the synthetic function at Xt
9:     initialize a stationary GP
10:    signal variance, length scale = run_MLE(Xt, yt)
11:    append the new signal variance to the signal variance list
12:    append the new length scale to the length scale list
13:  end for
14:  return signal variance list, length scale list
15: end procedure

To avoid bias through user-based subset selection, we draw the location of a subdomain from a uniform distribution over the input domain (in this case, [0, 1]^n). We then select a set of data points randomly from within this subdomain 100 times and use MLE [maximum likelihood estimation, maximizing Eq. (4)] to get a stationary length scale and signal variance in each iteration. In the case of a synthetic function, the locations of the data points are drawn from a uniform distribution over the subdomain; in the case of a fixed dataset, the points are randomly selected. The final distribution of all signal variances and length scales is then assessed to measure non-stationarity (see Algorithm 2).
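For a one-dimensional dataset, Algorithm 2 can be condensed into a few lines. The sketch below uses a squared exponential kernel and scipy.optimize.minimize as a stand-in for the run_MLE routine; the bounds, initial values, noise level, and local mean-centering are our assumptions.

import numpy as np
from scipy.optimize import minimize

def measure_nonstationarity(X, y, m=100, size=0.2, c=15, noise_var=1e-4, seed=1):
    rng = np.random.default_rng(seed)
    hyperparameters = []
    for _ in range(m):
        a = rng.uniform(0.0, 1.0 - size)                      # random subdomain [a, a + size]
        inside = np.where((X >= a) & (X <= a + size))[0]
        if len(inside) < c:
            continue
        idx = rng.choice(inside, c, replace=False)
        Xt, yt = X[idx], y[idx] - y[idx].mean()               # constant (here: local mean) prior mean

        def neg_lml(h):                                       # h = [signal variance, length scale]
            d = np.abs(Xt[:, None] - Xt[None, :])
            K = h[0] * np.exp(-0.5 * d**2 / h[1]**2) + noise_var * np.eye(c)
            _, logdet = np.linalg.slogdet(K)
            return 0.5 * yt @ np.linalg.solve(K, yt) + 0.5 * logdet

        res = minimize(neg_lml, x0=[1.0, 0.1], bounds=[(1e-3, 10.0), (1e-3, 1.0)])
        hyperparameters.append(res.x)
    return np.array(hyperparameters)   # spread of the two columns measures non-stationarity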

To test our non-stationary measure, we applied it to three synthetic functions (see Fig. 2). In each case, we indicate the final variance of the length scale and signal variance distributions. In the first scenario (first row in Fig. 2), where the signal is a high-frequency sine function, the algorithm’s behavior leads to a high concentration of points in a single compact cluster when plotting the estimated length scale vs the signal variance. This concentration reflects the inherent stationarity of a signal, where the statistical properties do not change over the input space. The unimodal and concentrated distributions of both estimated parameters in the violin plots further corroborate this observation. The consistency in the length scale and signal variance across multiple iterations of MLE indicates that the underlying data structure is stationary.

FIG. 2.

Our way of measuring the non-stationarity of a dataset or a synthetic function. When a set of data points is drawn randomly from within the input domain and a GP, using a stationary kernel, is trained via log marginal likelihood maximization after each draw (MLE), the final distribution of the hyperparameters—here a signal variance and an isotropic length scale—can be used to measure non-stationarity. This distribution can be visualized via scatter (middle) and violin (right) plots. We assign the variance of these distributions as a numerical measure of stationarity. The stationary function (top) leads to a very narrow distribution of the length scale and the signal variance. For the non-stationary functions (second and third rows), the distributions for both hyperparameters are broader, and the associated variances are larger. In this figure, the axis scales are kept constant to allow easy comparison of the stationarity properties of the test functions.


In the second case (second row in Fig. 2), the signal is a trigonometric curve with local oscillations. The algorithm’s response to this function results in a wider spread of both hyperparameters compared to the first test function. This behavior is reflected in the violin plots on the right.

The third scenario (third row in Fig. 2) presents a more complex signal structure, a trigonometric curve with varying amplitude and frequency. The algorithm’s reaction to this non-stationary signal is manifested in a broad distribution of length scales and some spread in the signal variance. These distributions are reflected in the violin plots.

The three cases in Fig. 2 demonstrate the algorithm’s capability to measure non-stationarity through the local, stationary GP-hyperparameter distributions. The varying behaviors in clustering and distribution of the estimated parameters across the three cases provide insights into the underlying stationarity or non-stationarity of the signals. The algorithm’s sensitivity to these variations underscores its potential as a valuable tool for understanding and characterizing non-stationarity in different contexts.

Throughout our computational experiments, we will measure performance via three different error metrics as a function of training time. We argue that this allows us to compare methodologies across different implementations as long as all tests are run on the same computing architecture with similar hardware utilization. As for error metrics, we utilize the log marginal likelihood [LML, Eq. (4)], the root mean square error (RMSE), and the Continuous Ranked Probability Score (CRPS). The RMSE and CRPS are evaluated on test data points that are not part of the training dataset. The RMSE is defined as
(22)   RMSE = sqrt((1/N) Σ_{i=1}^{N} (yi − f0i)^2),
where N is the number of points in the test dataset, yi are the data values of the test dataset, and f0i are the posterior mean predictions. The RMSE metric provides a measure of how closely the model’s predictions align with the actual values—approaching zero as fit quality improves—while the log marginal likelihood evaluates the fit of the Gaussian process model given the observed data. The log marginal likelihood will increase as the model fits the data more accurately. The CRPS is defined as
(23)   CRPS(y, μ, σ) = σ (1/sqrt(π) − 2ψ(z) − z(2Ψ(z) − 1)), with z = (y − μ)/σ,
where ψ is the probability density function of a standard normal distribution and Ψ is the associated cumulative distribution function. For a GP, f0 is Gaussian with mean μ and variance σ^2. The CRPS is negative and approaches zero as fit quality improves. In our computational experiments, we use the CRPS as a performance metric on a set of points by averaging (23) over all predictions. The CRPS is arguably the more important score compared to the RMSE because it is uncertainty aware. In other words, if the prediction accuracy is low but the uncertainty in those regions is high—the algorithm is aware of its inaccuracy—the score improves.
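The two scores can be evaluated directly from the posterior means and variances. The sketch below implements Eqs. (22) and (23) using scipy.stats.norm, with the sign convention stated in the text (CRPS ≤ 0, approaching zero as the fit improves); the function names are ours.

import numpy as np
from scipy.stats import norm

def rmse(y, mu):
    # Eq. (22)
    return np.sqrt(np.mean((y - mu) ** 2))

def crps_gaussian(y, mu, sigma):
    # Eq. (23), averaged over all test predictions
    z = (y - mu) / sigma
    crps = sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))
    return -np.mean(crps)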

Computational cost is becoming a main research topic in recent studies of large-scale non-parametric models,22,25,58 especially GPs.20,59–61 In our analysis, we sought to examine the progression of the optimization process. To achieve this, we established a callback function during the optimization phase, tracking the RMSE, the log marginal likelihood, and the CRPS as a function of compute time. In terms of interpretation, ideally, we expect the RMSE and the CRPS to increasingly approach zero over time, suggesting that the model’s predictive accuracy and estimation of uncertainties are improving. On the other hand, the log marginal likelihood should increase, indicating a better fit of the model to the observed data. This analysis gives us a summary of the model’s learning process and helps us understand the progression of the optimization, thus providing valuable insights into the efficacy of our model and the optimization strategy employed.

The purpose of this section is to see how different kernels and a deep GP deal with non-stationarity in several datasets and to compare the characteristics and properties of the solutions. To make this comparison fair and easier, we ran all tests on the same Intel i9 CPU (Intel Core i9-9900KF CPU @ 3.60 GHz × 8) and used the total compute time as the cost. As a performance metric, we calculate and observe the RMSE [Root Mean Squared Error, Eq. (22)], the CRPS [continuous rank probability score, Eq. (23)] of the prediction, both applied to a test dataset, and the log marginal likelihood of the observational data [Eq. (4)]. We attempted to run fair tests in good faith; this means the effort spent to set up each kernel or methodology was roughly proportional to a method’s perceived complexity, within reasonable bounds, similar to the effort expended by an ML practitioner—this meant minutes of effort for stationary kernels and hours to days for non-stationary kernels and DGPs. The optimizer to reach the final model was SciPy’s differential evolution. We used an in-house MCMC (Markov Chain Monte Carlo) algorithm to create the plots showing the evolution of the performance metrics over time. In cases where our efforts did not lead to satisfactory performance, we chose to present the result “as is” to give the reader the ability to judge for themselves. We always start our tests with a standard stationary kernel to create a baseline—stationary kernels with few hyperparameters are currently the most widely used kernel class. To further the hands-on aspect of this section, we also included the algorithms used in the Appendix and on a specifically designed website, along with links to download the data. The performance-measure-over-time plots were created without considering deep GPs due to incompatible differences in the implementations (see  Appendix D). All computational experiments were run multiple times to make sure we showed representative results.

This section first introduces three datasets we will use later to evaluate the performance of the test methodologies. Second, we present the unredacted, uncensored results of the test runs. The purpose is not to judge some kernels or methodologies as better or worse universally, but to evaluate how these techniques perform when tested under certain well-defined conditions and under the described constraints. We encourage the reader to follow our tests, to rerun them if desired, and to judge the performance of the methods for themselves.

We will consider three test datasets. All datasets are normalized such that the range and the image—the set of all measured function values—are in [0, 1].

For the first dataset, we define a one-dimensional synthetic function
(24)
from which data are drawn. Fifty data points are drawn randomly, and noise ϵ ∼ N(0, 0.001) is added. Figure 3 presents the function and its non-stationarity measures.
FIG. 3.

Test dataset 1 (a) and its non-stationarity measures visualized as distributions (b)–(d) in the hyperparameters of a stationary kernel trained on local subsets of that data. The dataset is derived from a one-dimensional synthetic function [see Eq. (24)]. Non-stationarity appears to be clearly present in the length scale and the signal variance.


Second, we consider a three-dimensional climate dataset that is available online (https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/), consisting of in situ measurements of daily maximum surface air temperature (°C) collected from weather stations across the contiguous United States (geospatial locations defined by longitude and latitude) over time (the third dimension). The data and its non-stationarity measures are presented in Fig. 4.

FIG. 4.

Test dataset 2 [(a) and (b)] and its non-stationarity measures visualized as distributions in the hyperparameters [(c)–(e)] of a stationary kernel trained on local subsets of that data. The dataset consists of recorded temperatures across the United States over a period of time. Weak non-stationarity appears to be present in the length scale and the signal variance.


Third, we consider a dataset that was collected during an x-ray scattering experiment at the CMS beamline at NSLS-II, Brookhaven National Laboratory. The dataset originated from an autonomous exploration of the multidimensional material state-spaces underlying the self-assembly of copolymer mixtures. Because the scientific outcome of this experiment has not been published yet, all scientific insights have been obscured by normalization and the removal of units. The dataset is presented in Fig. 5.

FIG. 5.

Test dataset 3 (a) and its non-stationarity measures visualized as distributions in the hyperparameters [(b)–(d)] of a stationary kernel trained on local subsets of that data. The dataset consists of analyzed x-ray scattering signals over [0, 1]^3. Non-stationarity is apparent in both the length scale and the signal variance.


To evaluate the predictive performance metrics, we divided our datasets into training datasets and test datasets by random selection and applied the metrics only to the latter.

In this section, we present quantitative results of how the different kernels and deep GPs performed when tasked to learn the underlying data-generating latent functions that produced the test datasets introduced in the previous section. For each test, we show the model and its uncertainties across the domain or a subdomain after convergence of a global evolutionary optimization of the log marginal likelihood and the performance measures as a function of the compute time of an MCMC algorithm. Code snippets can be found in the Appendix and on our website (see the Data Availability paragraph at the end). To reiterate, for all tests, we put ourselves in the position of an ML practitioner setting up each algorithm under a reasonable-effort constraint—we did not optimize each method to its full extent because this would not lead to a fair comparison. However, we used our best judgment and followed online documentation closely. The same thought process was put into the exact design of the kernels; of course, one could always argue that a particular model would have performed better using more hyperparameters. For the fairness of the comparison, we kept the number of hyperparameters similar across the kernels for a particular computational experiment and increased the number of kernel parameters in a near-proportional fashion for the non-stationary kernels as we moved to higher dimensions. See Table I for the number of hyperparameters for the different experiments. All non-stationary kernels were implemented in the open-source GP package gpCAM (https://github.com/lbl-camera/gpCAM). We tried two different deep GPs, the GPflux package (https://github.com/secondmind-labs/GPflux) and the Bayesian deep GP (BDGP) by Refs. 56 and 62 (https://cran.r-project.org/web/packages/deepgp/index.html). We selected the latter in its two-layer version for our final comparisons because of performance issues with the GPflux package (see Fig. 7).

TABLE I.

Number of hyperparameters for our computational experiments.

Number of hyperparameters per experiment

Kernel functions                        1D synthetic    3D climate    3D x-ray
Stationary, kstat
Parametric non-stationary, kpara        15              58            58
Deep kernel, knn                        48              186           186

1. One-dimensional synthetic function

Our one-dimensional synthetic test function was introduced in Sec. IV A. The stationary reference kernel (kstat) is a Matérn kernel with ν = 3/2
(25)   kstat(xi, xj) = σs^2 (1 + sqrt(3) d / l) exp(−sqrt(3) d / l),
where l is the length scale and d = ‖xi − xj‖2 = |xi − xj|. The parametric non-stationary kernel in this experiment was defined as
(26)   kpara(xi, xj) = Σ_{a=1}^{2} ga(xi) ga(xj) kstat(|xi − xj|),
where
(27)   ga(x) = Σ_{k=1}^{6} ck^(a) exp(−(x̃k − x)^2 / (2 wa^2)),
with centers x̃k ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}, leading to a total of 15 hyperparameters—counting the coefficients of the two ga functions in the sum, a constant width of the radial basis functions for each ga, and a constant length scale. The deep kernel is
(28)   knn(xi, xj) = σs^2 kstat(|ϕ(xi) − ϕ(xj)|),
where ϕ is a fully connected neural network mapping R → R, with ReLU activation functions and two hidden layers of width five, which led to a total of 48 hyperparameters (weights, biases, and one constant signal variance). The results, presented in Fig. 6, show a gradual improvement in approximation performance as more flexible kernels are used. The stationary kernel (kstat) stands out through its fast computation time. However, the keen observer notices similar uncertainties independent of the local properties of the latent function; only point spacing is considered in the uncertainty estimate. That is in stark contrast to the parametric non-stationary kernel (kpara) and the deep kernel (knn), which both predict lower uncertainties in the well-behaved center region of the domain. This is a very desirable characteristic of non-stationary kernels. The deep kernel and parametric non-stationary kernel reached very similar approximation performance, but the deep kernel was by far the most costly to train. The BDGP predicts a very smooth model with subpar accuracy compared to the other methods. We repeated the experiment with different values for the nugget and also let the algorithm choose the nugget itself, without further success. This is not to say the method cannot perform better, but we remind the reader that we are working under the assumption of reasonable effort, which, in this case, was insufficient to reach a better performance. We share the code with the reader in the Appendix (see Appendix C) for reproducibility purposes. We also tested another DGP implementation without success (see Fig. 7). The MCMC sampling runs revealed what was expected: the stationary kernel converges most robustly; however, all kernels led to stable convergence within a reasonable compute time.
FIG. 6.

Performance overview for dataset 1. From the top-left: stationary reference kernel (kstat), parametric non-stationary kernel (kpara), deep kernel (knn), and deep GP. In this test, the parametric non-stationary kernel and the deep kernel reached similar approximation performances, while the former used significantly fewer hyperparameters, which is why it can be trained significantly faster. The deep GP (see Appendix D for the algorithm) over-smoothed the model and took a long time to train. We note that the deep-GP result depended highly on the specified nugget—too small, and the algorithm produced NaNs; just a little larger, and the presented smoothing was observed. The result was the same when we allowed the algorithm to choose its own nugget. Note the asterisk next to the deep-GP compute time, denoting that this is separate software written in a different language (R), and compute times can, therefore, not be directly compared. The calculations of the performance measures as a function of MCMC training time were run five times, and displayed are the mean and confidence bounds (one standard deviation).

FIG. 7.

Approximation result for the one-dimensional synthetic dataset using the GPflux DGP. Despite following the online documentation closely, the model appears to show artifacts. This is the reason why we did not use the GPflux DGP for our higher-dimensional test cases. For reproducibility purposes, we publish our exact run script in the Appendix (see  Appendix B).


2. The climate model

Our three-dimensional climate dataset was introduced in Sec. IV A. The stationary reference kernel (kstat) is a Matérn kernel with ν = 3/2 [Eq. (25)]. In all cases, stationary and non-stationary, we added to the kernel matrix the noise covariance matrix V=σn2I, where σn2 is the nugget variance, treated as an additional hyperparameter. The parametric non-stationary kernel is similar to Eq. (26); however, we placed the radial basis functions (27) at {0,0.5,1}3 and added a nugget variance, leading to a total of 58 hyperparameters. The deep kernel knn(ϕ(xi),ϕ(xj)), ϕ:R3→R3, has two hidden layers of width 10 but is otherwise identical to (28), yielding 186 hyperparameters. The results, two-dimensional slices through the three-dimensional input space, are presented in Fig. 8. For completeness, we included the deep GP result in Fig. 9, which, however, in our run, was not competitive. Once again, the stationary kernel (kstat) delivers fast and robust results; however, it lacks accuracy compared to the parametric non-stationary kernel (kpara) and the deep kernel (knn). The performance of the two non-stationary kernels is on par, with a slight advantage in CRPS for the deep kernel and a significant advantage in compute time for the parametric non-stationary kernel. The MCMC sampling runs (Fig. 10) revealed stable convergence, albeit at significantly different time scales.
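A minimal sketch of the noise model used for every kernel in this example: an i.i.d. nugget σn2 I is added to the kernel matrix, and σn2 is treated as one more hyperparameter during training. K_xx below stands for the kernel matrix produced by any of the kernels discussed above; names are illustrative.

import numpy as np

def noisy_covariance(K_xx, noise_var):
    # K_xx: (n, n) kernel matrix over the training inputs; noise_var: sigma_n^2.
    return K_xx + noise_var * np.eye(K_xx.shape[0])

# Example: the Cholesky factor entering the log marginal likelihood,
# L = np.linalg.cholesky(noisy_covariance(K_xx, sigma_n_squared))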

FIG. 8.

Performance overview for the climate model. From the top: kstat, kpara, and knn. The posterior mean is displayed in the left column. The posterior variance is on the right. The model function is defined over [0,1]2 ×{0.5}. This model is obtained after the convergence of a global evolutionary optimizer. Notable is the fast computing time of the stationary kernel due to the fact that only three hyperparameters have to be found: a signal variance, one isotropic length scale, and the nugget (i.i.d. noise). This parametric non-stationary kernel led to a total of 58 hyperparameters, which increases the computing time significantly compared to the stationary kernel. The accuracy, in this case, is slightly higher. It is debatable and highly application-driven whether the parametric non-stationary kernel should be used for this dataset. If time is not an issue, the increased accuracy can pay off in downstream operations. The deep kernel contains 186 hyperparameters, which are identified robustly. The accuracy of the prediction and the UQ is similar to the parametric non-stationary kernel, with a slight advantage in the CRPS but approximately twice the compute time. As with all our test runs, this one was repeated several times with the same result.

FIG. 9.

Performance overview for the climate model, computed with the BDGP. The model function—posterior mean (left) and variance (right)—is defined over [0,1]2 × {0.5}. This result is only presented for completeness. Clearly, our BDGP setup is not competitive compared to the tested kernels. It is presented in  Appendix C.

FIG. 10.

MCMC sampling convergence for the climate model and all kernels. From the top: kstat, kpara, and knn. The MCMC is converging robustly for the stationary and parametric non-stationary kernels; the deep kernel is by far the slowest to converge, and the error metrics are barely improving. This might be due to the weak non-stationarity in this dataset (see Fig. 4). All runs were repeated five times, and displayed are the mean and confidence bounds (one standard deviation).


3. The x-ray scattering model

Our three-dimensional x-ray scattering dataset was introduced in Sec. IV A. The kernel setup is identical to our climate example (see Sec. IV B 2). The results are presented in Fig. 11. This experiment shows improving performance as kernel complexity increases; however, the improvements are moderate. As before, if accuracy is the priority, non-stationary kernels should be considered. Again, the deep kernel (knn) performed very competitively—with a significant advantage in terms of RMSE and CRPS—but did not reach the same log marginal likelihood as the competitors, which can be traced back to solving a high-dimensional optimization problem due to a large number of hyperparameters. This opens the door to even better performance if more effort and time are spent training the model. What stands out in Fig. 11 is the smaller and more detailed predicted uncertainties for the deep kernel, which would affect decision-making in an online data acquisition context. In addition, the posterior mean has more intricate details compared to the stationary and parametric non-stationary kernels. Figure 12 reveals stable convergence of the MCMC sampling for all kernels at similar time scales, supporting the deep kernel as the superior choice for this model. We again included the BDGP, which, however, was not competitive (Fig. 13).

FIG. 11.

Performance overview for the x-ray model. From the top: kstat, kpara, and knn. The posterior mean is displayed in the left column. The posterior variance is on the right. The model function is defined over [0,1]2 ×{0.24}. This model is obtained after the convergence of a global evolutionary optimizer. Once again, the stationary kernel stands out due to fast computing speeds. The compute time increases significantly for the parametric non-stationary kernel due to the 58 hyperparameters that have to be found. Judging by the CRPS, which is our most relevant performance metric, the approximation and the UQ are better than for the stationary kernel. The deep kernel, for this dataset, performs best with respect to the RMSE and CRPS while achieving a lower log marginal likelihood. The training time increases again because 186 hyperparameters have to be found.

FIG. 12.

MCMC sampling convergence for the x-ray model and all kernels. From the top: kstat, kpara, and knn. The MCMC is converging robustly and, in this case, at similar time scales, making the case for the deep kernel that reached the best CRPS (in the optimization). All runs were repeated five times, and displayed are the mean and confidence bounds (one standard deviation).

FIG. 13.

Performance overview for the x-ray scattering model, computed with the BDGP. The model function—posterior mean (left) and variance (right)—is defined over [0,1]2 ×{0.24}. This result is again included for completeness; it is not competitive compared to the tested kernels. Our script is presented in  Appendix C.


The one-dimensional synthetic test function (24) exhibits a complex behavior that is captured by our stationarity analysis through a broad distribution in the length scale and signal variance. The scattering patterns reveal a multifaceted behavior, with different regions of the data exhibiting distinct statistical properties. The linear correlation within one cluster (left, scatter plot, Fig. 3) and the dispersion in the other cluster capture the interplay between the sinusoidal and quadratic components of the function. The linear cluster indicates a linear correlation between these two hyperparameters within a specific region of input data. This pattern may correspond to the sinusoidal components of the function, where local oscillations exhibit a consistent relationship between the length scale and signal variance. The additional dispersed points in the scatter plot likely correspond to the regions influenced by the quadratic term in the function, where the statistical properties vary, leading to a broader distribution of the hyperparameters. The bimodal distributions observed in the violin plot for the length scale further corroborate the non-stationarity. The results provide insights into the intricate interplay between the sinusoidal and quadratic components of the function. Due to the strong non-stationarity in the data, non-stationary kernels performed well for this test function. Due to the low dimensionality, the number of hyperparameters is low in all cases, leading to robust training. It is, in our opinion, safe to conclude that in one-dimensional cases with moderate dataset sizes and suspected non-stationarity in the data, non-stationary kernels are to be preferred. Both our parametric and deep non-stationary kernels performed well, with a slight edge in accuracy for the parametric kernel (see Fig. 6). Our two tested deep GP setups (Figs. 6 and 7) were not competitive given our reasonable-effort constraint, which in this case was on the order of days.

Moving on to the climate data example, Fig. 4 shows a quite concentrated cluster near (0,0), indicating weak non-stationarity. This concentration suggests that the statistical properties are consistent across this region, possibly reflecting a dominant pattern or behavior in the data. The computational experiments (see Fig. 8) reveal a trade-off between accuracy and compute time. One has to put much more effort (number of hyperparameters and time) into the computation for a relatively small gain in accuracy. Both the parametric non-stationary kernel and the deep kernel achieve higher accuracy than the stationary kernel but are costly. In time-sensitive situations, the stationary kernel is likely to be preferred; for best accuracy, the parametric non-stationary kernel or the deep kernel is the superior choice. Looking at the CRPS, the deep kernel has a slight edge in accurately estimating uncertainties over the parametric kernel; however, the parametric non-stationary kernel reaches the highest log marginal likelihood. For this dataset, we tested the BDGP without much success under our reasonable-effort constraint (see Fig. 9).

Finally, for the x-ray scattering data, Fig. 5 shows a high level of dispersion in both the scatter and violin plots, indicating strong non-stationarity. This strong non-stationarity, particularly in the length scale, clearly benefits the performance of the deep kernel (see Fig. 11). The stationary kernel shows quite poor performance both in terms of RMSE and CRPS. Since stationary kernels only allow us to estimate uncertainties based on the global properties of the data and the local data-point geometry, it is expected that non-stationary kernels estimate uncertainties more accurately, which manifests itself in a better CRPS. The parametric non-stationary kernel leads the field in log marginal likelihood. Surprisingly, among the three tested kernels, the deep kernel leads to by far the lowest log marginal likelihood, which, again, suggests a better optimizer might lead to much-improved performance. Once again, our setup of the BDGP was not competitive (see Fig. 13).

While we included as much information in the computational experiments and the Appendix as possible to give the reader a chance to make up their own minds, here we summarize the following key takeaways:

  1. Stationary kernels proved to be surprisingly competitive in terms of the RMSE and are unbeatable when evaluating accuracy per time. It seems worth it in most cases to run a stationary GP first for comparison.

  2. Uncertainty is estimated more accurately by non-stationary kernels; if UQ-driven decision-making is the goal, non-stationary kernels are to be preferred. This is not a surprise since, given a constant (possibly anisotropic) length scale and a signal variance, the posterior variance only depends on data-point geometry (see the short sketch after this list).

  3. The parametric non-stationary kernel encodes a flexible non-stationarity while maintaining interpretability. The involved parametric functions over the input space can be visualized and interpreted.

  4. Deep kernels are some of the most flexible kernels, but interpretation is lost in all but the simplest cases, which can easily lead to model misspecifications (wrong model class and hyperparameters). Our models took a fair amount of experimenting before an acceptable performance was achieved. In online applications, in which testing beforehand is not possible, deep kernels should be used with caution.

  5. Non-stationarity in the covariance structure appears in signal variance and length scale; ideally, a kernel addresses both [see Eq. (29)].

  6. While not included in the tests, experimenting with prior mean functions has shown that non-stationarity properties highly depend on the prior mean function of the GP. This is especially true for the non-stationary signal variance.

  7. Extrapolation and non-stationary kernels are difficult to combine. While the parametric non-stationary kernel can be set up for extrapolation, traditional neural networks are poorly equipped for that task.

  8. We should think of the number of hyperparameters conservatively; too many bear the risk of model misspecification (through local minima) and overfitting.

  9. The parametric non-stationary kernel achieved overall better RMSE; the deep kernel led to better uncertainty quantification, as indicated by the CRPS.
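A quick way to see the point made in takeaway 2 is to compute the GP posterior variance directly: for a fixed stationary kernel it involves only kernel evaluations at the input locations and never the observed values. A minimal sketch, assuming an RBF kernel with unit signal variance for brevity:

import numpy as np

def rbf(A, B, l=0.2):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / l ** 2)

X = np.sort(np.random.rand(30))            # training input locations
x_star = np.linspace(0, 1, 200)            # prediction locations
K_inv = np.linalg.inv(rbf(X, X) + 1e-6 * np.eye(len(X)))
k_s = rbf(x_star, X)
post_var = 1.0 - np.einsum("ij,jk,ik->i", k_s, K_inv, k_s)
# post_var never touches y: with constant hyperparameters, the uncertainty
# pattern is determined purely by the geometry of the data points.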

Given the observation that non-stationarity in the covariance structure of a dataset originates from non-constant signal variances and length scales, one might argue that both should be addressed in the kernel design. Our parametric non-stationary kernel attempts to account for all non-stationarity purely through the signal variance and leaves the length scale constant; however, implementations exist that allow non-constant and anisotropic length scales,6,42 generally in the same flavor as the parametric non-stationary signal variance. The deep kernel, on the other hand, only acts on the length scale by warping the input space. It seems logical to ask what happens if we combine the two concepts. The kernel
(29)
achieves just that; it is a combination of our parametric non-stationary kernel and the deep kernel. Modeling our synthetic dataset (see Fig. 14), we see good performance with moderate improvements compared to our earlier tests (Fig. 6). The kernel might constitute somewhat of an overkill for such a simple problem but might lead to more significant gains in real applications. We encourage the reader to give this kernel a try in their next application.
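Read as a product of the two constructions, the combined kernel can be sketched in a few lines. The exact expression is Eq. (29) in the text; the product form below is our reading of it, and the specific g and ϕ are placeholders standing in for the trained parametric signal-variance function and neural-network warping.

import numpy as np

def matern32(X1, X2, l=1.0):
    d = np.abs(X1[:, None] - X2[None, :])
    s = np.sqrt(3.0) * d / l
    return (1.0 + s) * np.exp(-s)

def k_combined(X1, X2, g, phi, l=1.0):
    # Non-stationary signal variance (parametric, via g) on top of a warped
    # stationary kernel (deep, via phi).
    return g(X1)[:, None] * g(X2)[None, :] * matern32(phi(X1), phi(X2), l)

# Toy stand-ins for a trained g and phi (hypothetical values):
g = lambda x: 1.0 + 0.5 * np.exp(-(x - 0.5) ** 2 / 0.02)
phi = lambda x: np.tanh(3.0 * (x - 0.5))
K = k_combined(np.linspace(0, 1, 5), np.linspace(0, 1, 5), g, phi)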
FIG. 14.

Approximation result for the one-dimensional synthetic dataset using the parametric deep non-stationary kernel (29). The kernel combines non-stationarity in the signal variance and the length scale by leveraging both the parametric non-stationary kernel design and a deep kernel. The approximation is on par with our previously tested methods, but the likelihood reached is higher. This kernel might perform very well when strong non-stationarity is present in the length scale and signal variance.


Multi-task learning offers a powerful paradigm to leverage shared information across multiple related tasks, thereby enhancing the predictive performance of each individual task.63–65 This is particularly beneficial when data for some tasks are sparse, as information from data-rich tasks can be used to inform predictions for data-scarce tasks. Flexible non-stationary kernels offer an interesting benefit for multi-task learning: instead of employing specialized techniques, such as coregionalization, one can reformulate the problem to a single task problem and let a flexible non-stationary kernel learn intricate correlations between input (Xi) and output (Xo) space locations. By transforming the multi-task learning problem over Xi to a single-task learning problem over Xi×Xo, no further changes to the core algorithm are required. This has been known for a long time and is referred to as problem-transformation methods.66 These methods were originally dismissed as not being able to capture intricate correlations between the tasks; however, this is only true if stationary, separable kernels are used. A flexible non-stationary kernel is able to flexibly encode covariances across the input and output domains, independent of the indexing of the tasks. This makes it possible to transfer all the complexities of multi-task learning to the kernel design and use the rest of the GP framework as-is, inheriting the superior robustness and interpretability properties of single-task GPs.
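A minimal sketch of this problem transformation: a multi-task dataset over inputs Xi with T tasks is flattened to a single-task dataset over Xi×Xo by appending the task index (or any task descriptor) as an extra input dimension, after which a flexible non-stationary kernel can learn cross-task correlations directly. Function and variable names below are illustrative.

import numpy as np

def flatten_multitask(X, Y):
    """X: (n, d) inputs; Y: (n, T) responses for T tasks.
    Returns augmented inputs (n*T, d+1) and a single response vector (n*T,)."""
    n, T = Y.shape
    task_ids = np.repeat(np.arange(T, dtype=float)[None, :], n, axis=0)  # (n, T)
    X_aug = np.column_stack([np.repeat(X, T, axis=0), task_ids.ravel()])
    return X_aug, Y.ravel()

# Example: three tasks observed at the same 100 one-dimensional inputs.
X = np.random.rand(100, 1)
Y = np.column_stack([np.sin(6 * X[:, 0]),
                     np.sin(6 * X[:, 0] + 0.3),
                     np.cos(6 * X[:, 0])])
X_aug, y_aug = flatten_multitask(X, Y)   # now a single-task GP problem over R^2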

In this paper, we put on the hat of a machine learning practitioner trying to find the best kernel or methodology within the scope of a Gaussian process (GP) to address non-stationarity in various datasets. We introduced three different datasets: one synthetic, one climate dataset, and one originating from an x-ray scattering experiment at the CMS beamline at NSLS-II, Brookhaven National Laboratory. We introduced a non-stationarity measure and studied each dataset to be able to judge its non-stationarity properties quantitatively. We then presented four different methodologies to address the non-stationarity: ignoring it by using a stationary kernel, a parametric non-stationary kernel that uses a flexible non-constant signal variance, a deep kernel that uses a neural network to warp the input space, and a deep GP. We set all methodologies up under reasonable effort constraints to allow for a fair comparison, just as a practitioner might encounter. In our case, that meant minutes of setup time for the stationary kernels and hours to days for the non-stationary kernels and the deep GPs. After the methodologies were set up, we ran our computational tests and presented the results unredacted and uncensored. This way, we hope, the reader gets the best value out of the comparisons. This is also to ensure that the reader has the chance to come up with conclusions different from ours. To further the readers' ability to double-check and learn, we are publishing all our codes online.

Our tests have shown that even weak non-stationarity in a dataset motivates the use of non-stationary kernels if training time is not a concern. If training time is very limited, stationary kernels are still the preferred choice. We have discovered that non-stationarity in the covariance comes in two flavors, in the signal variance and in the length scale, and, ideally, both should be addressed through novel kernel designs; we drew attention to one such kernel design. However, non-stationary kernels come with a greater risk of model misspecification. If a model has to be trusted out of the box, a stationary kernel might be the preferred choice.

We hope that this paper motivates more practitioners to deploy and experiment with non-stationary kernels while also being aware of some of the risks.

The work was supported by the Laboratory Directed Research and Development Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy Contract No. DE-AC02-05CH11231. This work was further supported by the Regional and Global Model Analysis Program of the Office of Biological and Environmental Research in the Department of Energy Office of Science under Contract No. DE-AC02-05CH11231. This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain the correct information, neither the United States Government nor any agency thereof, nor the Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof or the Regents of the University of California. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or the Regents of the University of California.

We want to acknowledge Kevin G. Yager from Brookhaven National Laboratory for providing the x-ray scattering dataset. This data collection used resources from the Center for Functional Nanomaterials (CFN) and the National Synchrotron Light Source II (NSLS-II), which are the U.S. Department of Energy Office of Science User Facilities at Brookhaven National Laboratory, funded under Contract No. DE-SC0012704.

The authors have no conflicts to disclose.

M.M.N. originally decided to write this paper, led the project, developed the test scripts and software with help from H.L. and M.D.R., and ran the computational experiments. H.L. suggested the non-stationarity measure and improved its practicality with help from M.M.N. and M.D.R. H.L. also did the majority of the work regarding deep GPs, both regarding the algorithms and the article. M.D.R. oversaw all developments, especially regarding parametric non-stationary kernels, and further assisted with writing and editing the article. All authors iteratively refined the core ideas, algorithms, and methods. All decisions regarding the work were made in agreement with all authors. All authors iteratively revised the article.

Marcus M. Noack: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Hengrui Luo: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Resources (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). Mark D. Risser: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Software (equal); Supervision (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal).

All data and the Jupyter notebook that runs all experiments can be found on https://gpcam.lbl.gov/examples/a-unifying-perspective-on-non-stationary-kernels-for-deeper-gps.

All experiments (except deep GPs) were run using the open-source Python package gpCAM (https://github.com/lbl-camera/gpCAM, gpcam.lbl.gov).

The simple technique used in this paper measures non-stationarity in the covariance structure of a dataset based on the idea of model misspecification. Unlike traditional non-stationarity measures (e.g., histograms, periodograms), our measure is model-specific and can generate visual checks in a wider range of scenarios. We followed White67 and Schervish68 in the following arguments.

Suppose q(y|X) of size n has pairs of inputs and responses generated independently and identically from a data-generating mechanism q (i.e., a possibly non-stationary GP or even a completely non-Gaussian model), and consider a parametric stationary GP model p(y|X, (σs, l)), which may or may not contain the "true" data-generating mechanism q. The MLE of the stationary kernel parameters (σs, l) maximizes (1/n) ∑_{i=1}^{n} log p(y^(i)|X^(i), (σs, l)). By the law of large numbers for the independent, identically distributed summands log p(y^(i)|X^(i), (σs, l)), this quantity converges in probability to the expectation E_q[log p(y|X, (σs, l))] with respect to the data-generating mechanism q. Suppose q(y|X) = p(y|X, (σs,0, l0)) for some kernel parameters (σs,0, l0), that is, the input-response data are actually generated by the GP with a parametric stationary kernel with true parameters (σs,0, l0). The following argument shows the equivalence between the MLE and the KL-divergence minimizer,
E_q[log p(y|X, (σs, l))] = E_q[log q(y|X)] − E_q[log q(y|X) − log p(y|X, (σs, l))]   (A1)
= E_q[log q(y|X)] − D_KL(q(y|X) ‖ p(y|X, (σs, l)))   (A2)
≤ E_q[log q(y|X)] = E_q[log p(y|X, (σs,0, l0))],   (A3)
and then E_q[log p(y|X, (σs, l))] is maximized at (σs,0, l0). This explains the consistency of the MLE, and we expect that the MLE (σs, l)* of the kernel parameters (σs, l) will concentrate around the true parameter value (σs,0, l0). With regard to our procedure, if all the data are generated by a GP p(y|X, (σs,0, l0)), then each subsample needs to produce an MLE of (σs, l) that is close to (σs,0, l0). Therefore, we will consider a stably concentrated (σs, l)-plot as strong evidence that the data are from a stationary GP. When q(y|X) does not necessarily belong to the family of GPs with parametric stationary kernels, we may write similarly
E_q[log p(y|X, (σs, l))] = E_q[log q(y|X)] − D_KL(q(y|X) ‖ p(y|X, (σs, l))).
Since the term E_q[log q(y|X)] does not involve the kernel parameters (σs, l), the value (σs, l)* obtained by maximizing E_q[log p(y|X, (σs, l))] under a misspecified data-generating mechanism q is the minimizer of D_KL(q(y|X) ‖ p(y|X, (σs, l))), often referred to as the pseudo-true parameters, which is observed to be sensitive to the selection of the subsample. In other words, the minimizer of D_KL(q(y|X) ‖ p(y|X, (σs, l))) exhibits higher sensitivity than the minimizer of D_KL(p(y|X, (σs,0, l0)) ‖ p(y|X, (σs, l))). If we assume the joint distribution p(y|X, (σs,0, l0)) has mean μ(σs,0, l0) and covariance Σ(σs,0, l0), and the joint distribution p(y|X, (σs, l)) has mean μ(σs, l) and covariance Σ(σs, l), then
D_KL(p(y|X, (σs,0, l0)) ‖ p(y|X, (σs, l))) = (1/2)[tr(Σ(σs, l)^{-1} Σ(σs,0, l0)) + (μ(σs, l) − μ(σs,0, l0))^T Σ(σs, l)^{-1} (μ(σs, l) − μ(σs,0, l0)) − n + log(det Σ(σs, l)/det Σ(σs,0, l0))].
We can discover that subsampling when both p and q are from a GP will only affect this quantity at O(1/n) by convergence rates of sample means and variances, while if q is not a stationary GP model, the magnitude could be large. We have to point out that this observation—that the KL-divergence becomes sensitive to subsampling when the data-generating mechanism and the model do not belong to the same family—is empirical. However, it motivates our algorithm and works well in practice. Although the MLE kernel parameters (σs,l)* under misspecification will converge to the pseudo-true parameters rather than the true parameters,69,70 these pseudo-true parameters are sensitive to subsampling. This creates a much more scattered (σs, l)-plot without a specific concentration pattern. By looking into the behavior of MLE parameters of stationary kernels, we can quantify how far it is from “stationarity,” which is the rationale of our technique. As aforementioned, a stationary kernel will assume that the pairwise covariance does not depend on the location in space but only on the distance of the input points. This leads us (and others) to deduce that non-stationarity will manifest itself by varying hyperparameters of a stationary kernel—one isotropic length scale and a constant signal variance—when they are trained on localized subsets of the data. In our visualization (σs, l)-plot, we can observe whether the length scale l or variance parameter σs is the source of non-stationarity.
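The diagnostic described above can be sketched in a few lines: fit a stationary GP (one isotropic length scale l, one signal variance σs, plus a small fixed nugget) to many localized random subsets and scatter-plot the resulting MLEs. Concentrated (σs, l) pairs suggest stationarity; strong dispersion suggests non-stationarity. This is a from-scratch illustration with illustrative defaults, not the exact script used for the figures.

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def neg_log_marginal_likelihood(log_params, X, y, nugget=1e-4):
    sigma_s, l = np.exp(log_params)                       # positivity via log-parameters
    d = cdist(X, X)
    s = np.sqrt(3.0) * d / l
    K = sigma_s**2 * (1.0 + s) * np.exp(-s) + nugget * np.eye(len(y))   # Matérn 3/2
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

def subsample_mles(X, y, n_subsets=50, subset_size=40, seed=0):
    rng = np.random.default_rng(seed)
    mles = []
    for _ in range(n_subsets):
        center = X[rng.integers(len(X))]                  # localized subset around a random point
        idx = np.argsort(np.linalg.norm(X - center, axis=1))[:subset_size]
        res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 0.3]),
                       args=(X[idx], y[idx]), method="L-BFGS-B")
        mles.append(np.exp(res.x))                        # (sigma_s, l) for this subset
    return np.array(mles)                                 # scatter/violin-plot these columns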

DGPs provide a fusion of the non-parametric flexibility of GPs with the hierarchical structure typical of deep neural networks. In essence, DGPs stack multiple layers of Gaussian processes on top of each other, with the output of one layer serving as the input for the next, much like the architecture of deep neural networks. Fitting a DGP is a nuanced procedure that marries the principles of GP with the hierarchical structure of deep networks. Through techniques like variational inference, the challenges posed by the depth and complexity of the model are addressed, paving the way for powerful, flexible, and interpretable modeling of data. When fitting a DGP, the primary objective is to optimize the marginal likelihood of the observed data, given the parameters of the Gaussian processes in each layer and the likelihood function. The marginal likelihood, in the context of GPs, is the likelihood of the observations after integrating out the latent functions. Mathematically, this is expressed as
p(Y|X) = ∫ p(Y|F) p(F|X) dF.
Here, Y are the observations, X are the inputs, F represents the latent functions (outputs of the GP layers), p(Y|F) is the likelihood, and p(F|X) is the Gaussian process prior. However, in DGPs, the optimization of this marginal likelihood becomes more intricate due to the cascading nature of the layers. Each layer's latent functions become the inputs for the subsequent layer, resulting in a composite model where the inference is not straightforward. The depth introduces additional non-linearity and complexity. From a frequentist viewpoint, the intricacies of DGPs are typically managed through variational inference. This method approximates the true posterior with a variational distribution characterized by a set of variational parameters. The primary goal becomes the maximization of the evidence lower bound (ELBO) in Algorithm 3, which provides a lower bound on the log marginal likelihood. This turns a previously intractable inference problem into a solvable optimization problem, in which the variational parameters are systematically tuned to align the variational distribution more closely with the true posterior. From a Bayesian perspective, Sauer et al.62 offer a detailed exploration of the nuances of DGPs; their comparative analysis of an array of DGP models sheds light on the models' performance and characteristics, and they present a Bayesian approach to classification with DGPs that handles structured data and intricate patterns well.
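The compositional structure can be illustrated without any of the packages in Table II: a draw from a two-layer DGP prior is obtained by sampling a GP over the inputs and then sampling a second GP evaluated at the warped locations produced by the first. This is a conceptual NumPy sketch with illustrative hyperparameters, not a fitting procedure.

import numpy as np

def rbf(X1, X2, l=0.2, var=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return var * np.exp(-0.5 * d2 / l**2)

def sample_gp(X, l, var, rng, jitter=1e-6):
    K = rbf(X, X, l, var) + jitter * np.eye(len(X))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(X))

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 200)
f1 = sample_gp(X, l=0.3, var=1.0, rng=rng)    # layer 1: warps the input space
f2 = sample_gp(f1, l=0.3, var=1.0, rng=rng)   # layer 2: GP evaluated at the warped inputs
# f2 is a draw from the two-layer DGP prior; its effective length scale varies
# wherever the first layer stretches or compresses the inputs.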
ALGORITHM 3.

Detailed deep Gaussian process model training.

Require: xdata, ydata, xpred 
1: Create inducing points Z spanning the range of X 
2: Define two GP layers: 
• Each layer is defined with an RBF kernel 
• Using a random collection of inputs Z as inducing points 
• The last layer has a zero mean function 
3: Define a Gaussian likelihood per layer with noise variance 1e − 4 
4: Construct a 2-layer deep GP model with the GP layers and likelihood 
5: Compile the model with the stochastic Adam optimizer.71,72 
(Alternatively, one can perform Bayesian variational inference.56) 
6: Train the model for 10^4 epochs: 
• Optimize the parameters to maximize the marginal likelihood p(Y|X) 
• Where p(Y|X) = ∫p(Y|F)p(F|X) dF 
7: Predict on xpred and extract mean and variance 

Table II offers a comprehensive overview of various implementations of Deep Gaussian Processes (DGPs), a powerful tool in machine learning. It is fascinating to see the diversity of implementations across different programming languages and frameworks, such as R, TensorFlow, and PyTorch. This wide variety reflects the growing interest and applicability of DGPs in various domains, from scientific research to industrial applications.

TABLE II.

Comparison of different DGP implementations. This table extends Table I in Zammit-Mangion et al.45 and tables in Dutordoir et al.71 

Repository/package | Language | Reference^a | Fitting^b
deepGP | R | Sauer et al.62 | MCMC (elliptic sampler)
SDSP | R/TensorFlow | Zammit-Mangion et al.45 | SVI
SDSP-MCMC | R/stan | Zammit-Mangion et al.45 | MCMC
SIWGP | R/TensorFlow | Zammit-Mangion et al.45 | MLE
GPFlux | TensorFlow | Dutordoir et al.71 | MLE/SVI (variational)
DeepCGP | TensorFlow | Blomqvist et al.73 | SVI (doubly stochastic)
DGPs_with_IWVI | TensorFlow | Salimbeni et al.74 | SVI
sghmc_dgp | TensorFlow | Havasi et al.75 | MCMC (HMC)
Deep-Gaussian-Process | TensorFlow2 | Salimbeni and Deisenroth72 | SVI^c
GPyTorch/DeepGP | PyTorch | Salimbeni and Deisenroth,72 Gardner et al.76 | SVI (doubly stochastic)
DGPRFF | PyTorch | Cutajar et al.77 | SVI
Doubly stochastic-DGP | PyTorch | Salimbeni and Deisenroth72 | SVI (doubly stochastic)
PyDeepGP | Python | Dai et al.,78 Damianou and Lawrence12 | MLE/SVI (variational)
DGPsparse | TensorFlow | Damianou and Lawrence12 | MLE
DGPfull | Python | Schmidt and O'Hagan27 | MCMC (elliptic sampler)
^a We sort the table by the publication date of the associated paper.
^b We follow the main fitting method featured by the package; some packages may support more fitting methods than listed here.
^c Stochastic variational inference.

One of the key aspects of the table is the different fitting methods used, such as MCMC (Markov Chain Monte Carlo), SVI (Stochastic Variational Inference), and MLE (Maximum Likelihood Estimation). The choice of fitting method can significantly impact the efficiency and accuracy of the model. It is noteworthy that some packages may support more fitting methods than those listed, indicating flexibility and adaptability in the approach to modeling with DGPs.

Recent developments in the field are evident from the inclusion of implementations associated with recent publications. This ongoing research and development signal a vibrant and evolving field where new methodologies and techniques are continually being explored and refined.

Accessibility is another important aspect highlighted in the table. Many of the implementations are openly accessible through platforms like GitHub. This open-source approach fosters collaboration and innovation, allowing researchers and practitioners to not only utilize these resources but also contribute to their development.

The table’s caption mentions that it extends and compares information from other sources, suggesting a comprehensive effort to collect and compare various DGP implementations. This comparative approach adds value to the table by placing the different implementations in context with one another, providing a broader perspective on the landscape of DGPs.

While the table provides a valuable summary, it also hints at potential challenges. Users may need to delve into the individual papers or repositories to understand the specific features, advantages, and limitations of each implementation. This deeper exploration may reveal nuances and details that are beyond the scope of the table but are essential for a complete understanding of each DGP implementation.

1. C. K. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning (MIT Press, Cambridge, MA, 2006), Vol. 2.
2. K. E. Pilario, M. Shafiee, Y. Cao, L. Lao, and S.-H. Yang, "A review of kernel methods for feature extraction in nonlinear process monitoring," Processes 8(1), 24 (2019).
3. M. M. Noack and K. G. Reyes, "Mathematical nuances of Gaussian process-driven autonomous experimentation," MRS Bull. 48, 153–163 (2023).
4. P. D. Sampson and P. Guttorp, "Nonparametric estimation of nonstationary spatial covariance structure," J. Am. Stat. Assoc. 87(417), 108–119 (1992).
5. C. Paciorek and M. Schervish, "Nonstationary covariance functions for Gaussian process regression," in Advances in Neural Information Processing Systems 16 (NeurIPS, 2003).
6. C. J. Paciorek and M. J. Schervish, "Spatial modelling using a new class of nonstationary covariance functions," Environmetrics 17(5), 483–506 (2006).
7. J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems 25 (NeurIPS, 2012).
8. M. M. Noack, H. Krishnan, M. D. Risser, and K. G. Reyes, "Exact Gaussian processes for massive datasets via non-stationary sparsity-discovering kernels," Sci. Rep. 13(1), 3155 (2023).
9. S. Oliver and C. A. Gotway, Statistical Methods for Spatial Data Analysis (CRC Press, 2017).
10. A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing, "Deep kernel learning," in Artificial Intelligence and Statistics (PMLR, 2016), pp. 370–378.
11. S. W. Ober, C. E. Rasmussen, and M. van der Wilk, "The promises and pitfalls of deep kernel learning," in Uncertainty in Artificial Intelligence (PMLR, 2021), pp. 1206–1216.
12. A. Damianou and N. D. Lawrence, "Deep Gaussian processes," in Artificial Intelligence and Statistics (PMLR, 2013), pp. 207–215.
13. V. Picheny and D. Ginsbourger, "A nonstationary space-time Gaussian process model for partially converged simulations," SIAM/ASA J. Uncertainty Quantif. 1(1), 57–78 (2013).
14. M. M. Noack and J. A. Sethian, "Advanced stationary and nonstationary kernel designs for domain-aware Gaussian processes," Commun. Appl. Math. Comput. Sci. 17(1), 131–156 (2022).
15. M. M. Noack, P. H. Zwart, D. M. Ushizima, M. Fukuto, K. G. Yager, K. C. Elbert, C. B. Murray, A. Stein, G. S. Doerk, E. H. R. Tsai et al., "Gaussian processes for autonomous data acquisition at large-scale synchrotron and neutron facilities," Nat. Rev. Phys. 3(10), 685–697 (2021).
16. S. Bochner et al., Lectures on Fourier Integrals (Princeton University Press, 1959), Vol. 42.
17. R. J. Adler, The Geometry of Random Fields (John Wiley & Sons, 1981).
18. M. L. Stein, Interpolation of Spatial Data: Some Theory for Kriging (Springer Science & Business Media, 1999).
19. D. Wipf and S. Nagarajan, "A new view of automatic relevance determination," in Advances in Neural Information Processing Systems 20 (NeurIPS, 2007).
20. H. Luo, G. Nattino, and M. T. Pratola, "Sparse additive Gaussian process regression," J. Mach. Learn. Res. 23(61), 1–34 (2022).
21. E. Snelson and Z. Ghahramani, "Sparse Gaussian processes using pseudo-inputs," in Advances in Neural Information Processing Systems 18 (NeurIPS, 2005).
22. J. Hensman, A. Matthews, and Z. Ghahramani, "Scalable variational Gaussian process classification," in Artificial Intelligence and Statistics (PMLR, 2015), pp. 351–360.
23. R. B. Gramacy and D. W. Apley, "Local Gaussian process approximation for large computer experiments," J. Comput. Graphical Stat. 24(2), 561–578 (2015).
24. R. B. Gramacy and H. K. H. Lee, "Bayesian treed Gaussian process models with an application to computer modeling," J. Am. Stat. Assoc. 103(483), 1119–1130 (2008).
25. H. Luo and M. T. Pratola, "Sharded Bayesian additive regression trees" (2023).
26. D. Damian, P. D. Sampson, and P. Guttorp, "Bayesian estimation of semi-parametric non-stationary spatial covariance structures," Environmetrics 12(2), 161–178 (2001).
27. A. M. Schmidt and A. O'Hagan, "Bayesian inference for non-stationary spatial covariance structure via spatial deformations," J. R. Stat. Soc. Ser. B: Stat. Methodol. 65(3), 743–758 (2003).
28. L. Bornn, G. Shaddick, and J. V. Zidek, "Modeling nonstationary processes through dimension expansion," J. Am. Stat. Assoc. 107(497), 281–289 (2012).
29. K. Karhunen, "Zur Spektraltheorie stochastischer Prozesse," Ann. Acad. Sci. Fennicae, Ser. A 1, 34 (1946).
30. M. Loeve, Probability Theory: Foundations, Random Sequences (Van Nostrand, 1955).
31. C. K. Wikle, "Low-rank representations for spatial processes," in Handbook of Spatial Statistics, Chapman & Hall/CRC Handbooks of Modern Statistical Methods, edited by A. E. Gelfand, M. Fuentes, P. Guttorp, and P. Diggle (Taylor & Francis, 2010).
32. D. M. Holland, N. Saltzman, L. H. Cox, and D. Nychka, "Spatial prediction of sulfur dioxide in the eastern United States," in GeoENV II: Geostatistics for Environmental Applications, edited by J. Gomez-Hernandez, A. Soares, and R. Friodevaux (Kluwer Academic Publishers, 1998), pp. 65–75.
33. D. Nychka, C. Wikle, and J. A. Royle, "Multiresolution models for nonstationary spatial covariance functions," Stat. Modell. 2(4), 315–331 (2002).
34. H. J. Thiebaux, "Anisotropic correlation functions for objective analysis," Mon. Weather Rev. 104, 994–1002 (1976).
35. H. J. Thiébaux and M. A. Pedder, Spatial Objective Analysis: With Applications in Atmospheric Science (Academic Press, 1987).
36. D. Higdon, "Space and space-time modeling using process convolutions," in Quantitative Methods for Current Environmental Issues, edited by C. W. Anderson, V. Barnett, P. C. Chatwin, and A. H. El-Shaarawi (Springer London, 2002), pp. 37–56.
37. R. P. Barry and J. M. V. Hoef, "Blackbox kriging: Spatial prediction without specifying variogram models," J. Agric. Biol. Environ. Stat. 1(3), 297–322 (1996).
38. J. Bernardo, J. Berger, A. Dawid, K. Ickstadt, and R. Wolpert, "Spatial regression for marked point processes," in Bayesian Statistics 6: Proceedings of the Sixth Valencia Meeting (1998), pp. 323–341.
39. J. M. Ver Hoef, N. Cressie, and R. P. Barry, "Flexible spatial models for kriging and cokriging using moving averages and the fast Fourier transform (FFT)," J. Comput. Graphical Stat. 13(2), 265–282 (2004).
40. D. Higdon, "A process-convolution approach to modelling temperatures in the North Atlantic Ocean," Environ. Ecol. Stat. 5(2), 173–190 (1998).
41. D. Higdon, J. Swall, and J. Kern, "Non-stationary spatial modeling," arXiv:2212.08043 (2022).
42. M. D. Risser and C. A. Calder, "Regression-based covariance functions for nonstationary spatial modeling," Environmetrics 26(4), 284–297 (2015).
43. M. L. Stein, "Nonstationary spatial covariance functions," unpublished technical report, 2005.
44. E. B. Anderes and M. L. Stein, "Estimating deformations of isotropic Gaussian random fields on the plane," Ann. Stat. 36(2), 719–741 (2008).
45. A. Zammit-Mangion, T. L. J. Ng, Q. Vu, and M. Filippone, "Deep compositional spatial models," J. Am. Stat. Assoc. 117(540), 1787–1808 (2022).
46. M. M. Dunlop, M. A. Girolami, A. M. Stuart, and A. L. Teckentrup, "How deep are deep Gaussian processes?," J. Mach. Learn. Res. 19(54), 1–46 (2018).
47. A. Jones, F. W. Townes, D. Li, and B. E. Engelhardt, "Alignment of spatial genomics data using deep Gaussian processes," Nat. Methods 20, 1379–1387 (2023).
48. R. Ranganath, L. Tang, L. Charlin, and D. M. Blei, "Deep exponential families," arXiv:1411.2581 (2014).
49. R. Salakhutdinov and H. Larochelle, "Efficient learning of deep Boltzmann machines," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (JMLR Workshop and Conference Proceedings, 2010), pp. 693–700.
50. Y. W. Teh, M. Jordan, M. Beal, and D. Blei, "Sharing clusters among related groups: Hierarchical Dirichlet processes," in Advances in Neural Information Processing Systems 17 (NeurIPS, 2004).
51. M. Titsias and N. D. Lawrence, "Bayesian Gaussian process latent variable model," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (JMLR Workshop and Conference Proceedings, 2010), pp. 844–851.
52. G. M. Allenby and P. E. Rossi, "Hierarchical Bayes models," in The Handbook of Marketing Research: Uses, Misuses, and Future Advances (2006), pp. 418–440.
53. M. J. Daniels and R. E. Kass, "Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models," J. Am. Stat. Assoc. 94(448), 1254–1263 (1999).
54. P. L. Bartlett, A. Montanari, and A. Rakhlin, "Deep learning: A statistical viewpoint" (2021).
55. F. Jimenez and M. Katzfuss, "Vecchia Gaussian process ensembles on internal representations of deep neural networks," arXiv:2305.17063 (2023).
56. A. Sauer, A. Cooper, and R. B. Gramacy, "Vecchia-approximated deep Gaussian processes for computer experiments," J. Comput. Graphical Stat. 32(3), 824 (2022).
57. A. Stewart Fotheringham, C. Brunsdon, and M. Charlton, Geographically Weighted Regression: The Analysis of Spatially Varying Relationships (John Wiley & Sons, 2003).
58. S. L. Scott, A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George, and R. E. McCulloch, "Bayes and big data: The consensus Monte Carlo algorithm," Int. J. Manage. Sci. Eng. Manage. 11, 78–88 (2016).
59. A. Litvinenko, Y. Sun, M. G. Genton, and D. E. Keyes, "Likelihood approximation with hierarchical matrices for large spatial datasets," Comput. Stat. Data Anal. 137, 115–132 (2019).
60. C. J. Geoga, M. Anitescu, and M. L. Stein, "Scalable Gaussian process computations using hierarchical matrices," J. Comput. Graphical Stat. 29(2), 227–237 (2020).
61. J. A. Lin, J. Antorán, S. Padhy, D. Janz, J. M. Hernández-Lobato, and A. Terenin, "Sampling from Gaussian process posteriors using stochastic gradient descent," arXiv:2306.11589 (2023).
62. A. Sauer, A. Cooper, and R. B. Gramacy, "Non-stationary Gaussian process surrogates," arXiv:2305.19242 (2023).
63. H. Luo and J. D. Strait, "Nonparametric multi-shape modeling with uncertainty quantification," arXiv:2206.09127 (2022), pp. 1–52.
64. W. M. Sid-Lakhdar, Y. Cho, J. W. Demmel, H. Luo, X. S. Li, Y. Liu, and O. Marques, GPTune User Guide, 2020.
65. X. Zhu, Y. Liu, P. Ghysels, D. Bindel, and X. S. Li, "GPTuneBand: Multi-task and multi-fidelity autotuning for large-scale high performance computing applications," in Proceedings of the 2022 SIAM Conference on Parallel Processing for Scientific Computing (SIAM, 2022), pp. 1–13.
66. H. Borchani, G. Varando, C. Bielza, and P. Larranaga, "A survey on multi-output regression," WIREs Data Min. Knowl. Discovery 5(5), 216–233 (2015).
67. H. White, "Maximum likelihood estimation of misspecified models," Econometrica 50, 1–25 (1982).
68. M. J. Schervish, Theory of Statistics (Springer Science & Business Media, 2012).
69. P. Grünwald and T. van Ommen, "Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it" (2017).
70. B. J. K. Kleijn and A. W. van der Vaart, "Misspecification in infinite-dimensional Bayesian statistics," 2006.
71. V. Dutordoir, H. Salimbeni, E. Hambro, J. McLeod, F. Leibfried, A. Artemev, M. van der Wilk, J. Hensman, M. P. Deisenroth, and S. T. John, "GPflux: A library for deep Gaussian processes," arXiv:2104.05674 (2021).
72. H. Salimbeni and M. Deisenroth, "Doubly stochastic variational inference for deep Gaussian processes," in Advances in Neural Information Processing Systems 30 (NeurIPS, 2017).
73. K. Blomqvist, S. Kaski, and M. Heinonen, "Deep convolutional Gaussian processes," in Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part II (Springer, 2020), pp. 582–597.
74. H. Salimbeni, V. Dutordoir, J. Hensman, and M. Deisenroth, "Deep Gaussian processes with importance-weighted variational inference," in International Conference on Machine Learning (PMLR, 2019), pp. 5589–5598.
75. M. Havasi, J. M. Hernández-Lobato, and J. J. Murillo-Fuentes, "Inference in deep Gaussian processes using stochastic gradient Hamiltonian Monte Carlo," in Advances in Neural Information Processing Systems (NeurIPS, 2018).
76. J. R. Gardner, G. Pleiss, D. Bindel, K. Q. Weinberger, and A. G. Wilson, "GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration," in Advances in Neural Information Processing Systems (NeurIPS, 2018).
77. K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone, "Random feature expansions for deep Gaussian processes," in International Conference on Machine Learning (PMLR, 2017), pp. 884–893.
78. Z. Dai, A. Damianou, J. González, and N. Lawrence, "Variational auto-encoded deep Gaussian processes," arXiv:1511.06455 (2015).