The Gaussian process (GP) is a popular statistical technique for stochastic function approximation and uncertainty quantification from data. GPs have been adopted into the realm of machine learning (ML) in the last two decades because of their superior prediction abilities, especially in data-sparse scenarios, and their inherent ability to provide robust uncertainty estimates. Even so, their performance highly depends on intricate customizations of the core methodology, which often leads to dissatisfaction among practitioners when standard setups and off-the-shelf software tools are being deployed. Arguably, the most important building block of a GP is the kernel function, which assumes the role of a covariance operator. Stationary kernels of the Matérn class are used in the vast majority of applied studies; poor prediction performance and unrealistic uncertainty quantification are often the consequences. Non-stationary kernels show improved performance but are rarely used due to their more complicated functional form and the associated effort and expertise needed to define and tune them optimally. In this perspective, we want to help ML practitioners make sense of some of the most common forms of non-stationarity for Gaussian processes. We show a variety of kernels in action using representative datasets, carefully study their properties, and compare their performances. Based on our findings, we propose a new kernel that combines some of the identified advantages of existing kernels.
I. INTRODUCTION
The Gaussian process (GP) is arguably the most popular member of the large family of stochastic processes and provides a powerful and flexible framework for stochastic function approximation in the form of Gaussian process regression (GPR). A GP is defined as a collection of random variables such that any finite subset has a joint normal (Gaussian) distribution.1 In the scope of GPR, the random variables are considered (model or latent) function values {f(x1), f(x2), f(x3), …}. Equivalently, we can understand a GP as a description of distributions over function spaces, in which case it is entirely specified by a mean function m(x) and a covariance cov(f(x), f(xi)) = k(x, xi; h), where k(x, xi; h) is the kernel or covariance function, h ∈ Θ is a vector of hyperparameters, and x, xi are elements of the input space 𝒳. This role of the kernel as a covariance operator makes it arguably the most important building block when it comes to optimizing the flexibility, prediction accuracy, and expressiveness of a GP.
In a recent review,2 it was pointed out that the vast majority of applied studies using GPs employ the radial basis function (RBF) kernel (also referred to as the squared exponential or Gaussian kernel). The fraction is even higher when we include other stationary kernels. Stationary kernels are characterized by k(xi, xj) = k(|xi − xj|), i.e., covariance matrix entries only depend on some distance between data points in the input domain, not on their respective locations. Stationary kernels are popular because they carry little risk—in terms of model misspecification—and come with only a few hyperparameters that are easy to optimize or train. However, it has been shown that the stationarity assumption can lead to poor prediction performance and unrealistic uncertainty quantification that is driven mostly by data-point geometry.3 In other words, uncertainty increases at a constant rate when moving away from data points, uniformly across the input space. To overcome these limitations, significant research attention has been drawn to non-stationary Gaussian process regression.4–6 Non-stationary kernels depend explicitly on the respective locations in the input space, i.e., k(xi, xj) ≠ k(|xi − xj|). This makes them significantly more flexible and expressive, leading to higher-quality estimates of uncertainty across the domain. Even so, non-stationary kernels are rarely used in applied machine learning (ML) because customization and hyperparameter optimization are inherently difficult, and because of the associated risks of model misspecification (wrong hyperparameters) and overfitting, as well as the computational cost of estimating many hyperparameters.7,8 When applied correctly, non-stationary GPs have been shown to provide significant advantages over their stationary counterparts, especially in scenarios where the data exhibit non-stationary behavior.9
This paper aims to bring some structure to the use of non-stationary kernels and related techniques to make it more feasible for the practitioner to use these kernels effectively. Throughout this paper—in the hope of covering the most practical set of available options—we focus on and compare the following four ways to handle non-stationarity in a dataset:
Ignore it: Most datasets and underlying processes exhibit some level of non-stationarity, which is often simply ignored. This leads to the use of stationary kernels of the form kstat = k(|xi − xj|); |·| is a norm appropriate to the properties of the input space. Given the key takeaways in Ref. 2, this option is chosen by many, which serves as the main motivation for the authors to write this perspective.
Parametric non-stationarity: Kernels of the form kpara(xi, xj) = ∑d=1N gd(xi) gd(xj) kstat(|xi − xj|), where gd(x) can be any parametric function over the input space and N is some positive integer (a minimal code sketch follows this list).
Deep kernels: Kernels of the form kstat(|ϕ(xi) − ϕ(xj)|), where ϕ is a neural network (NN), and kstat is any stationary kernel. This kernel was introduced by Wilson et al.10 and was quickly established as a powerful technique, even though potential pitfalls related to overfitting were also discovered.11
Deep GPs: Achieving non-stationarity not by using a non-stationary kernel but by stacking stationary GPs—meaning the output of one GP becomes the input to the next, similar to how neural networks are made up of stacked layers—introduces a non-linear transformation of the input space and thereby non-stationarity.12
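To make option No. 2 concrete, here is a minimal NumPy sketch of a kernel of this form, using Gaussian bumps as the parametric functions gd and a squared-exponential base kernel; all names and parameter values are illustrative and do not correspond to the exact construction used in the experiments later in this paper.

```python
import numpy as np

def gaussian_bump(X, center, width):
    """A parametric function g_d(x): a Gaussian bump centered in the input space."""
    return np.exp(-0.5 * np.sum((X - center) ** 2, axis=-1) / width ** 2)

def k_stat(X1, X2, length_scale=0.1):
    """Stationary squared-exponential base kernel."""
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    return np.exp(-0.5 * (d / length_scale) ** 2)

def k_para(X1, X2, centers, widths, amplitudes, length_scale=0.1):
    """Parametric non-stationary kernel: sum_d g_d(x_i) g_d(x_j) * k_stat(|x_i - x_j|)."""
    S = np.zeros((len(X1), len(X2)))
    for c, w, a in zip(centers, widths, amplitudes):
        S += np.outer(a * gaussian_bump(X1, c, w), a * gaussian_bump(X2, c, w))
    return S * k_stat(X1, X2, length_scale)

# toy usage on 1D inputs with three bumps
X = np.linspace(0.0, 1.0, 50)[:, None]
K = k_para(X, X,
           centers=[np.array([0.0]), np.array([0.5]), np.array([1.0])],
           widths=[0.3, 0.3, 0.3],
           amplitudes=[1.0, 0.5, 2.0])
```

Positive semi-definiteness follows because the sum of outer products gd(xi)gd(xj) is positive semi-definite and the elementwise (Schur) product with a valid stationary kernel preserves positive semi-definiteness.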
Non-stationarity can also be modeled through the prior-mean function of a Gaussian process, but for the purpose of this paper, we use a constant prior mean and focus on the four methods described earlier. From the brief explanations of the three non-stationary approaches, it is immediately obvious that deriving, implementing, and deploying non-stationary kernels and deep GPs (DGPs) can be an intimidating endeavor. The function gd (in No. 2 above), for instance, can be chosen to be any sum of local basis functions—linear, quadratic, cubic, piecewise constant, etc.—any polynomial or sums thereof, and splines, just to name a few. Given the vast variety of neural networks (in No. 3), choosing ϕ is no easier. DGPs involve stacking GPs, which increases flexibility but also leads to a very large number of hyperparameters, or latent variables, making optimization and inference a big challenge. In addition, a more complex kernel generally results in reduced interpretability—a highly valued property of GPs.
This paper is concerned with shining a light on the characteristics and performance of different kernels and experiencing them in action. In a very pragmatic approach, we study the performance of the three methodologies above to tackle non-stationarity and compare them to a stationary Matérn kernel to create a baseline. Our contributions can be summarized as follows:
We strategically choose a representative set of test kernels and investigate their properties.
We choose test datasets to study the performance of the chosen kernels.
We derive performance and (non)-stationarity metrics that allow fair comparisons of the non-stationary kernels and the selected deep GP implementation.
We compare the performance of the candidate kernels tasked with modeling the test datasets and present the uncensored results.
We use the gathered insights to draw some conclusions and suggest a new kernel design for further investigation.
We identify and exploit two primary pathways to achieve non-stationary properties: through the application of deep architectures—deep kernels or deep GPs—and the utilization of parametric non-stationary kernels. Our work delves into the role of parametric non-stationary kernels, arguing that these kernels offer similar behavior to a deep-kernel architecture without any direct notion of deepness, hinting at the possibility that deepness and flexibility are practically synonymous. This provides a valuable computational perspective to the broader dialog surrounding deep architectures and non-stationarity, offering insights that may be instrumental in future research.
The remainder of this paper is organized as follows: The next section introduces some minimal but necessary theory. There, we also introduce some measures that will help us later when it comes to investigating the performance of different kernels. Then, we give an overview of the relevant literature and some background. Subsequently, we introduce a range of different kernels and some test datasets, which position the reader to see each kernel in action in different situations. We then discuss the key takeaways; we give some pointers toward improved kernel designs, make remarks about the connection between non-stationary kernels and multi-task learning, and conclude with a summary of the findings.
II. PRELIMINARIES
The purpose of this section is to give the reader the necessary tools to follow along with our computational experiments. It is not intended to be a complete or comprehensive overview of GP theory or related methodologies. For the remainder of this paper, we consider some input space 𝒳 with elements x. While we often think of the input space as a subspace of the Euclidean space ℝn, this is not a necessary assumption of the framework generally. We assume the data are a set of tuples D = {(xi, yi)}, i = 1, …, N. In this work, except for a comment on multivariate GPs at the end, we will assume scalar yi.
A. GPs in a nutshell
B. The covariance operator of a GP: The kernel or covariance function
Non-stationary kernels do not have the restriction of only depending on the distance between the input points but depend on their explicit positions xi and xj, i.e., k(xi, xj) ≠ k(|xi − xj|). Formulating new non-stationary kernels comes with the difficulty of proving positive semi-definiteness, which is often challenging. However, the statistics and machine learning literature has a variety of general approaches for applying valid non-stationary kernels; this is the primary topic of this paper, which is discussed in Sec. III. For now, it is important to keep in mind that the essence of stationary vs non-stationary kernels—due to the way they deal with the locations in the input space—manifests itself in the covariance matrix which, in the stationary case, has to be constant along all bands parallel to the diagonal for sorted and equidistant input points, a property that is not observed in the non-stationary case (see Fig. 1). Therefore, non-stationary kernels lead to more descriptive, expressive, and flexible encodings of covariances and, therefore, uncertainties. Of course, in addition, the function space contains a much broader class of functions when non-stationary kernels are used.
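As a quick numerical illustration of this structural difference (a minimal sketch; the two kernels are generic examples, not the ones used in our experiments), the covariance matrix of a stationary kernel on sorted, equidistant inputs is constant along each band parallel to the diagonal, whereas a location-dependent signal variance breaks this property:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 8)                                  # sorted, equidistant 1D inputs

# stationary RBF kernel: entries depend only on |x_i - x_j|
K_stat = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)

# simple non-stationary kernel: signal variance grows with location
sig = 1.0 + x
K_nonstat = np.outer(sig, sig) * K_stat

# bands parallel to the diagonal are constant only in the stationary case
band = lambda K: np.diag(K, k=1)
print(np.allclose(band(K_stat), band(K_stat)[0]))             # True
print(np.allclose(band(K_nonstat), band(K_nonstat)[0]))       # False
```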
C. A note on scalability
Computational complexity is a significant challenge for GPs for both stationary and non-stationary kernels. The need to calculate the log-determinant and invert the covariance matrix—or solve a linear system instead—results in a computational complexity of O(N³), where N is the number of data points.1,20 This complexity limits the application of GPs to large-scale datasets. Several methods have been proposed to overcome this issue. Sparse methods20,21 and scalable GPR approaches22 have been developed for stationary GPs. For non-stationary GPs, methods such as local GPs23 and the Bayesian treed GP24 have been proposed to tackle this issue. These methods have an origin in divide-and-conquer methods attempting to break down the problem into smaller, manageable pieces that can be solved independently,25 thereby reducing the computational complexity for each piece. However, these approaches can lead to a loss of global information, so finding the right balance between computational efficiency and model accuracy remains a key challenge. A recent approach can scale GPs to millions of data points without using approximations by allowing a compactly supported, non-stationary kernel to discover naturally occurring sparsity.8
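For orientation, the O(N³) bottleneck sits in the matrix factorization used for both the log-determinant and the linear solve of the log marginal likelihood; a minimal sketch with a generic squared-exponential kernel (illustrative values only):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood(K, y, noise=1e-4):
    """GP log marginal likelihood; the Cholesky factorization is the O(N^3) step."""
    N = len(y)
    L = cho_factor(K + noise * np.eye(N), lower=True)    # O(N^3)
    alpha = cho_solve(L, y)                              # O(N^2) once the factor is available
    log_det = 2.0 * np.sum(np.log(np.diag(L[0])))
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * N * np.log(2.0 * np.pi)

# toy usage with a stationary RBF kernel on 200 points
x = np.linspace(0.0, 1.0, 200)[:, None]
K = np.exp(-0.5 * ((x - x.T) / 0.1) ** 2)
y = np.sin(6.0 * x).ravel()
print(log_marginal_likelihood(K, y))
```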
III. NON-STATIONARY KERNELS
Stationary kernels are widely used primarily because they are easy and convenient to implement, even though the implied assumption of translation-invariant covariances is almost never exactly true for real-world datasets. As mentioned in Sec. II, there are serious challenges associated with both deriving non-stationary kernels and choosing an appropriate and practical non-stationary kernel from the valid options for any given implementation of a Gaussian process. We now provide a brief overview of the literature on non-stationary kernels, including both a historical perspective of early developments followed by greater detail on three modern frameworks for non-stationary kernels as well as metrics for quantifying non-stationarity in datasets.
A. Historical perspective
It has now been over three decades since the first paper on non-stationary kernels via “deformations” or warping of the input space appeared.4 Since then, the statistics literature has developed a number of approaches for non-stationary kernels, mostly in the context of modeling spatially referenced data. These methods can broadly be categorized as basis function expansions and kernel convolutions, in addition to the aforementioned deformation approach. We now briefly summarize each method, focusing on aspects that apply directly to kernel functions for Gaussian processes.
To conclude this section, the statistics literature contains a broad set of techniques (only some of which are summarized here) for developing non-stationary kernel functions. However, historically speaking, these techniques were not widely adopted because, in most of the cases described here, the number of hyperparameters is of the same order as the number of data points. This property makes it very difficult to apply the kernels to real-world datasets due to the complex algorithms required to fit or optimize such models.
Nonetheless, the potential benefits of applying non-stationary kernels far outweigh the risks, in our opinion, and this perspective is all about managing this trade-off. To do so, we now introduce three modern approaches to handle non-stationarity in datasets in order to later test them and compare their performance (Sec. IV).
B. Non-stationarity via a parametric signal variance
C. Deep kernels
ALGORITHM 1. Construction of a deep kernel.
Input: Set of points as arrays x1, x2; hyperparameters h ∈ Θ
Output: Deep kernel covariance matrix of shape [len(x1) × len(x2)]
Step 1: Define the neural network architecture with input dimension n, hidden layers, and appropriate activation functions
Step 2: Initialize or set the weights and biases of the neural network using h
Step 3: Transform x1 and x2 through the neural network
  • Apply the forward pass of the neural network to all points in sets x1 and x2
  • Store the transformed values as x1nn and x2nn
Step 4: Calculate the pairwise distance matrix between x1nn and x2nn
Step 5: Compute the deep kernel value using the distance matrix
  • Apply a kernel function (e.g., exponential) to the distance matrix
  • Scale and combine with other kernel functions if needed
Step 6: Return the deep kernel value
Our deep kernel construction shares the perspective of using a neural network to estimate a warping function.4 In Ref. 45, the authors propose an approach to deep compositional spatial models that differs from traditional warping methods in that it models an injective warping function through a composition of multiple elemental injective functions in a deep-learning framework. This allows for greater flexibility in capturing non-stationary and anisotropic spatial data and is able to provide better predictions and uncertainty quantification than other deep stochastic models of similar complexity. This uncertainty quantification is point-wise, similar to the deep GPs we introduce next.
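To make the steps of Algorithm 1 concrete, below is a minimal NumPy sketch assuming a small fixed two-layer network with a tanh hidden layer as the warping ϕ and a squared-exponential base kernel; the architecture, activation choices, and parameter packing are illustrative assumptions and do not correspond to the exact configuration used in our experiments.

```python
import numpy as np

def mlp_warp(X, W1, b1, W2, b2):
    """Two-layer neural-network warping phi(x) (Steps 1-3): tanh hidden layer, linear output."""
    return np.tanh(X @ W1 + b1) @ W2 + b2

def deep_kernel(X1, X2, params, length_scale=0.2):
    """Deep kernel k_stat(|phi(x_i) - phi(x_j)|) (Steps 4-6) with a squared-exponential k_stat."""
    Z1, Z2 = mlp_warp(X1, *params), mlp_warp(X2, *params)
    d = np.linalg.norm(Z1[:, None, :] - Z2[None, :, :], axis=-1)   # pairwise distances in warped space
    return np.exp(-0.5 * (d / length_scale) ** 2)

# toy usage: 1D inputs warped through a width-5 hidden layer into a 2D feature space
rng = np.random.default_rng(0)
params = (rng.normal(size=(1, 5)), np.zeros(5),   # W1, b1
          rng.normal(size=(5, 2)), np.zeros(2))   # W2, b2
X = np.linspace(0.0, 1.0, 30)[:, None]
K = deep_kernel(X, X, params)
```

In a full implementation, the entries of W1, b1, W2, and b2 would be packed into the hyperparameter vector h and trained jointly with the base-kernel parameters by maximizing the log marginal likelihood.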
D. Deep GPs
Deep Gaussian process (DGP) models are hierarchical extensions of Gaussian processes where GP layers are stacked—similar to a neural network—enhancing modeling flexibility and accuracy12,46,47 (more details can be found in Appendix D). The DGP model is one of the deep hierarchical models48–50 and consists of a number of variational Gaussian process layers defined by a mean function and a stationary covariance (kernel) function.
Let us first consider a simple 2-layer DGP taking n-dimensional inputs x. The first layer uses a constant zero-mean GP(1) for a lower-dimensional representation of the input data, with its coordinate components distributed as fd(1) ∣ x ∼ N(0, Σ1(x)) and collected as f(1) = (f1(1), …, fn1(1)), where n1 is the number of nodes in the hidden layer and each component is mutually independent.
The second layer uses a constant zero mean and takes the first layer’s output. Its output f(2) is distributed as f(2) ∣ f(1) ∼ Nn(0, Σ2(f(1))) and is generated by an independent GP, GP(2), acting as the second layer and producing the model output. We can impose more than two hidden layers for a single DGP according to our needs, with the connections between consecutive layers laid out as random matrices, analogous to the connecting matrices in the architecture of a neural network. However, we match the two-layer structure used in our deep kernel and point out that added complexity of the neural architecture may not always lead to better performance.
For L > 2 layers in a DGP, we iterate this procedure of using the previous layer’s output as the next layer’s GP input until reaching the last layer. For an L-layer DGP architecture, this creates a sequence of L independent GPs, GP(1), GP(2), …, GP(L), corresponding to each layer. Each layer’s forward method applies the mean and covariance functions to input data and returns a multivariate normal distribution as output, which serves as the next layer’s input.
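To illustrate the stacking, the following minimal sketch draws one sample path from a two-layer DGP prior on a one-dimensional grid; the squared-exponential kernels, length scale, and jitter value are illustrative assumptions only.

```python
import numpy as np

def se_kernel(A, B, length_scale=0.2):
    """Stationary squared-exponential kernel."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-0.5 * (d / length_scale) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)[:, None]
jitter = 1e-8 * np.eye(len(x))

# layer 1: zero-mean GP evaluated on the original inputs
f1 = rng.multivariate_normal(np.zeros(len(x)), se_kernel(x, x) + jitter)

# layer 2: zero-mean GP whose covariance is evaluated on the *output* of layer 1
f2 = rng.multivariate_normal(np.zeros(len(x)), se_kernel(f1[:, None], f1[:, None]) + jitter)
```

Although both layers use stationary kernels, passing the inputs through the first layer's random sample warps the input space, so the composite function f2 exhibits location-dependent correlation lengths; this is the source of the DGP's non-stationarity.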
Using a slightly different framework based on the Vecchia approximation,55 we introduced a “deep Vecchia ensemble,” a hybrid method combining Gaussian processes (GPs) and deep neural networks (DNNs) for regression. This model builds an ensemble of GPs on DNN hidden layers, utilizing Vecchia approximations for GP scalability. Mathematically, the joint distribution of variables is decomposed using Vecchia’s approximation, and predictions are combined using a generalized product of experts. This approach offers both representation learning and uncertainty quantification. As described in the last section, a GP model can utilize a deep kernel, constructively combining the neural network’s and the GP’s strengths, leading to a model that benefits from the GP’s interpretability and the NN’s flexibility.10
Our Bayesian DGP (BDGP) architecture follows Ref. 56 and includes a two-layer neural network applied as a transformation to the input data. The first layer uses a rectified linear unit (ReLU) activation function, and the second employs a sigmoid activation. This non-linear feature mapping expresses complex input space patterns (see Appendix C).
Contrasting DGPs with deep-kernel GPs, DGPs use multiple GP layers to capture intricate dependencies, whereas deep-kernel GPs employ a NN for input data transformation before GP application. Essentially, while DGPs exploit GP layering to manage complex dependencies, deep kernel learning leverages NNs for non-linear input data transformation, enhancing the GP’s high-dimensional function representation ability.
E. Measuring non-stationarity of datasets
When it comes to characterizing non-stationarity, some methods focus on non-stationarity in the mean function (e.g., polynomial regression), while others concentrate on non-stationarity in the variance (e.g., geographically weighted regression57). Non-stationarity is typically characterized by a change in statistical properties over the input space, e.g., changes in the dataset’s mean, variance, or other higher moments. Quantifying non-stationarity is an active area of research, and in this paper, we introduce a particular kind of non-stationarity measure for the purpose of judging our test kernels when applied to our test datasets. Overall, measuring a given dataset’s non-stationarity properties is an important ingredient in understanding the performance of a particular kernel. For the reader’s convenience, we offer our non-stationarity measure as a pseudocode (see Algorithm 2). For a detailed theoretical motivation for the non-stationarity measure we use in this work, please refer to Appendix A.
ALGORITHM 2. Measuring the non-stationarity of a dataset.
1: procedure MEASURE NONSTATIONARITY(X, y, m, size, c)
   ⊳ X is the set of data points, y is the collected data, m is the number of iterations, size is the size of the subdomain, c is the number of selected points in each iteration
2:   length scale list = []
3:   signal variance list = []
4:   for i in 0 to m do
5:     a ∼ U([0, 1 − size]^n)
6:     b = a + size
7:     select c test data points Xt from [a, b]^n
8:     select associated yt in the dataset or evaluate the synthetic function at Xt
9:     initialize a stationary GP
10:    signal variance, length scale = run_MLE(Xt, yt)
11:    append the new signal variance to the signal variance list
12:    append the new length scale to the length scale list
13:  end for
14:  return signal variance list, length scale list
15: end procedure
To avoid bias through user-based subset selection, we draw the location of a subdomain from a uniform distribution over the input domain (in this case [0, 1]^n). We repeat this m = 100 times; in each iteration, we select a set of data points randomly from within the current subdomain and use MLE [maximum likelihood estimation, maximizing Eq. (4)] to estimate a stationary length scale and signal variance. In the case of a synthetic function, the locations of the data points are drawn from a uniform distribution over the subdomain; in the case of a fixed dataset, the points are randomly selected from those that fall inside it. The final distribution of all signal variances and length scales is then assessed to measure non-stationarity (see Algorithm 2).
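A compact Python sketch of Algorithm 2 is shown below; it is a minimal illustration that fits the two stationary hyperparameters with SciPy on each subdomain rather than calling a dedicated GP package, and all default values and function names are assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import cho_factor, cho_solve

def neg_log_likelihood(theta, X, y):
    """Negative log marginal likelihood of a zero-mean GP with a squared-exponential kernel."""
    signal_variance, length_scale = np.exp(theta)            # optimize in log space for positivity
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = signal_variance * np.exp(-0.5 * (d / length_scale) ** 2) + 1e-6 * np.eye(len(y))
    L = cho_factor(K, lower=True)
    return 0.5 * y @ cho_solve(L, y) + np.sum(np.log(np.diag(L[0])))

def measure_nonstationarity(X, y, m=100, sub_size=0.2, c=20, seed=0):
    """Algorithm 2: distributions of locally estimated signal variances and length scales."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    signal_variances, length_scales = [], []
    for _ in range(m):
        a = rng.uniform(0.0, 1.0 - sub_size, size=n)          # random subdomain [a, a + sub_size]^n
        inside = np.flatnonzero(np.all((X >= a) & (X <= a + sub_size), axis=1))
        if len(inside) < c:
            continue                                          # too few points in this subdomain
        sel = rng.choice(inside, size=c, replace=False)
        res = minimize(neg_log_likelihood, x0=np.zeros(2), args=(X[sel], y[sel]))
        sv, ls = np.exp(res.x)
        signal_variances.append(sv)
        length_scales.append(ls)
    return np.array(signal_variances), np.array(length_scales)

# example: apply to a 1D synthetic signal with growing amplitude
Xs = np.random.default_rng(1).uniform(size=(400, 1))
ys = np.sin(15.0 * Xs[:, 0]) * Xs[:, 0]
sv, ls = measure_nonstationarity(Xs, ys)
```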
To test our non-stationarity measure, we applied it to three synthetic functions (see Fig. 2). In each case, we indicate the final variance of the length scale and signal variance distributions. In the first scenario (first row in Fig. 2), where the signal is a high-frequency sine function, the algorithm produces a high concentration of points in a single compact cluster when the estimated length scale is plotted against the signal variance. This concentration reflects the inherent stationarity of the signal, whose statistical properties do not change over the input space. The unimodal and concentrated distributions of both estimated parameters in the violin plots further corroborate this observation. The consistency in the length scale and signal variance across multiple iterations of MLE indicates that the underlying data structure is stationary.
In the second case (second row in Fig. 2), the signal is a trigonometric curve with local oscillations. The algorithm’s response to this function results in a wider spread of both hyperparameters compared to the first test function. This behavior is reflected in the violin plots on the right.
The third scenario (third row in Fig. 2) presents a more complex signal structure, a trigonometric curve with varying amplitude and frequency. The algorithm’s reaction to this non-stationary signal is manifested in a broad distribution of length scales and some spread in the signal variance. These distributions are reflected in the violin plots.
The three cases in Fig. 2 demonstrate the algorithm’s capability to measure non-stationarity through the local, stationary GP-hyperparameter distributions. The varying behaviors in clustering and distribution of the estimated parameters across the three cases provide insights into the underlying stationarity or non-stationarity of the signals. The algorithm’s sensitivity to these variations underscores its potential as a valuable tool for understanding and characterizing non-stationarity in different contexts.
F. Performance measures
Computational cost has become a main research topic in recent studies of large-scale non-parametric models,22,25,58 especially GPs.20,59–61 In our analysis, we sought to examine the progression of the optimization process. To achieve this, we established a callback function during the optimization phase, tracking the RMSE, the log marginal likelihood, and the CRPS as a function of compute time. In terms of interpretation, we ideally expect the RMSE and the CRPS to decrease toward zero over time, indicating that the model’s predictive accuracy and its uncertainty estimates are improving. On the other hand, the log marginal likelihood should increase, indicating a better fit of the model to the observed data. This analysis gives us a summary of the model’s learning process and helps us understand the progression of the optimization, thus providing valuable insights into the efficacy of our model and the optimization strategy employed.
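Both scores [RMSE, Eq. (22); CRPS, Eq. (23)] have simple closed forms for a Gaussian predictive distribution; the small sketch below computes them (the Gaussian CRPS expression is the standard closed form, and the function names are ours).

```python
import numpy as np
from scipy.stats import norm

def rmse(y_true, y_pred):
    """Root mean squared error of the posterior-mean prediction."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def crps_gaussian(y_true, mean, std):
    """Mean CRPS of Gaussian predictive distributions N(mean, std^2); lower is better."""
    z = (y_true - mean) / std
    return np.mean(std * (z * (2.0 * norm.cdf(z) - 1.0)
                          + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi)))
```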
IV. COMPUTATIONAL EXPERIMENTS
The purpose of this section is to see how different kernels and a deep GP deal with non-stationarity in several datasets and to compare the characteristics and properties of the solutions. To make this comparison fair and easier, we ran all tests on the same Intel i9 CPU (Intel Core i9-9900KF CPU @ 3.60 GHz × 8) and used the total compute time as the cost. As a performance metric, we calculate and observe the RMSE [Root Mean Squared Error, Eq. (22)], the CRPS [continuous rank probability score, Eq. (23)] of the prediction, both applied to a test dataset, and the log marginal likelihood of the observational data [Eq. (4)]. We attempted to run fair tests in good faith; this means the effort spent to set up each kernel or methodology was roughly proportional to a method’s perceived complexity, within reasonable bounds, similar to the effort expended by an ML practitioner—this meant minutes of effort for stationary kernels and hours to days for non-stationary kernels and DGPs. The optimizer to reach the final model was SciPy’s differential evolution. We used an in-house MCMC (Markov Chain Monte Carlo) algorithm to create the plots showing the evolution of the performance metrics over time. In cases where our efforts did not lead to satisfactory performance, we chose to present the result “as is” to give the reader the ability to judge for themselves. We always start our tests with a standard stationary kernel to create a baseline—stationary kernels with few hyperparameters are currently the most widely used kernel class. To further the hands-on aspect of this section, we also included the algorithms used in the Appendix and on a specifically designed website, along with links to download the data. The performance-measure-over-time plots were created without considering deep GPs due to incompatible differences in the implementations (see Appendix D). All computational experiments were run multiple times to make sure we showed representative results.
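As a rough illustration of this setup, the snippet below runs SciPy's differential evolution with a progress-recording callback; the objective here is only a placeholder standing in for the negative log marginal likelihood of Eq. (4), and the bounds are illustrative.

```python
import time
import numpy as np
from scipy.optimize import differential_evolution

def neg_log_marginal_likelihood(h):
    """Placeholder objective; in practice this evaluates the negative of Eq. (4) for hyperparameters h."""
    return float(np.sum((h - 0.5) ** 2))

history, start = [], time.time()

def record_progress(xk, convergence):
    # track the objective value over wall-clock time, as in the performance-over-time plots
    history.append((time.time() - start, neg_log_marginal_likelihood(xk)))

bounds = [(1e-3, 10.0)] * 3                  # illustrative box bounds for three hyperparameters
result = differential_evolution(neg_log_marginal_likelihood, bounds,
                                callback=record_progress, tol=1e-6, seed=42)
print(result.x, result.fun)
```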
This section first introduces three datasets we will use later to evaluate the performance of the test methodologies. Second, we present the unredacted, uncensored results of the test runs. The purpose is not to judge some kernels or methodologies as better or worse universally, but to evaluate how these techniques perform when tested under certain well-defined conditions and under the described constraints. We encourage the reader to follow our tests, to rerun them if desired, and to judge the performance of the methods for themselves.
A. Introducing the test datasets
We will consider three test datasets. All datasets are normalized such that the range and the image—the set of all measured function values—are in [0, 1].
First, we consider a one-dimensional synthetic test function [Eq. (24)]; the function and its non-stationarity measures are presented in Fig. 3. Second, we consider a three-dimensional climate dataset that is available online (https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/), consisting of in situ measurements of daily maximum surface air temperature (°C) collected from weather stations across the contiguous United States (geospatial locations defined by longitude and latitude) over time (the third dimension). The data and its non-stationarity measures are presented in Fig. 4.
Third, we consider a dataset that was collected during an x-ray scattering experiment at the CMS beamline at NSLS-II, Brookhaven National Laboratory. The dataset originated from an autonomous exploration of the multidimensional material state-spaces underlying the self-assembly of copolymer mixtures. Because the scientific outcome of this experiment has not been published yet, all scientific insights have been obscured by normalization and the removal of units. The dataset is presented in Fig. 5.
To evaluate the predictive performance metrics, we divided our datasets into training datasets and test datasets by random selection and applied the metrics only to the latter.
B. Results
In this section, we present quantitative results of how the different kernels and deep GPs performed when tasked to learn the underlying data-generating latent functions that produced the test datasets introduced in the previous section. For each test, we show the model and its uncertainties across the domain or a subdomain after convergence of a global evolutionary optimization of the log marginal likelihood and the performance measures as a function of the compute time of an MCMC algorithm. Code snippets can be found in the Appendix and on our website (see the Data Availability paragraph at the end). To reiterate, for all tests, we put ourselves in the position of an ML practitioner setting up each algorithm under a reasonable-effort constraint—we did not optimize each method to its full extent because this would not lead to a fair comparison. However, we used our best judgment and followed online documentation closely. The same thought process was put into the exact design of the kernels; of course, one could always argue that a particular model would have performed better using more hyperparameters. For the fairness of the comparison, we kept the number of hyperparameters similar across the kernels for a particular computational experiment and increased the number of kernel parameters in a near-proportional fashion for the non-stationary kernels as we moved to higher dimensions. See Table I for the number of hyperparameters for the different experiments. All non-stationary kernels were implemented in the open-source GP package gpCAM (https://github.com/lbl-camera/gpCAM). We tried two different deep GPs, the GPflux package (https://github.com/secondmind-labs/GPflux) and the Bayesian deep GP (BDGP) by Refs. 56 and 62 (https://cran.r-project.org/web/packages/deepgp/index.html). We selected the latter in its two-layer version for our final comparisons because of performance issues with the GPflux package (see Fig. 7).
TABLE I. Number of hyperparameters per experiment.

Kernel functions                  | 1D synthetic | 3D climate | 3D x-ray
Stationary, kstat                 | 2            | 3          | 3
Parametric non-stationary, kpara  | 15           | 58         | 58
Deep kernel, knn                  | 48           | 186        | 186
1. One-dimensional synthetic function
2. The climate model
Our three-dimensional climate dataset was introduced in Sec. IV A. The stationary reference kernel (kstat) is a Matérn kernel with ν = 3/2 [Eq. (25)]. In all cases, stationary and non-stationary, we added to the kernel matrix the noise covariance matrix σn2 I, where σn2 is the nugget variance, treated as an additional hyperparameter. The parametric non-stationary kernel is similar to Eq. (26); however, we place radial basis functions (27) at {0, 0.5, 1}3 and added a nugget variance, leading to a total of 58 hyperparameters. The deep kernel has two hidden layers of width 10 but is otherwise identical to (28), yielding 186 hyperparameters. The results, two-dimensional slices through the three-dimensional input space, are presented in Fig. 8. For completeness, we included the deep GP result in Fig. 9, which, however, in our run, was not competitive. Once again, the stationary kernel (kstat) delivers fast and robust results; however, it lacks accuracy compared to the parametric non-stationary kernel (kpara) and the deep kernel (knn). The performance of the two non-stationary kernels is on par, with a slight advantage in CRPS for the deep kernel and a significant advantage in compute time for the parametric non-stationary kernel. The MCMC (Fig. 10) sample runs revealed stable convergence, however, at significantly different time scales.
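For reference, a short sketch of how such a grid of radial-basis-function centers and the nugget term can be set up (the widths and amplitudes of the basis functions are left out here; they are additional hyperparameters):

```python
import itertools
import numpy as np

# the 3^3 = 27 radial-basis-function centers on the grid {0, 0.5, 1}^3
centers = np.array(list(itertools.product([0.0, 0.5, 1.0], repeat=3)))
print(centers.shape)   # (27, 3)

def add_nugget(K, nugget_variance):
    """Add the noise covariance (nugget) term to a kernel matrix."""
    return K + nugget_variance * np.eye(K.shape[0])
```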
3. The x-ray scattering model
Our three-dimensional x-ray scattering dataset was introduced in Sec. IV A. The kernel setup is identical to our climate example (see Sec. IV B 2). The results are presented in Fig. 11. This experiment shows improving performance as kernel complexity increases; however, the improvements are moderate. As before, if accuracy is the priority, non-stationary kernels should be considered. Again, the deep kernel (knn) performed very competitively—with a significant advantage in terms of RMSE and CRPS—but did not reach the same log marginal likelihood as the competitors, which can be traced back to solving a high-dimensional optimization problem due to a large number of hyperparameters. This opens the door to even better performance if more effort and time are spent training the model. What stands out in Fig. 11 is the smaller and more detailed predicted uncertainties for the deep kernel, which would affect decision-making in an online data acquisition context. In addition, the posterior mean has more intricate details compared to the stationary and parametric non-stationary kernels. Figure 12 reveals stable convergence of the MCMC sampling for all kernels at similar time scales, supporting the deep kernel as the superior choice for this model. We again included the BDGP, which, however, was not competitive (Fig. 13).
V. DISCUSSION
A. Interpretations test-by-test
The one-dimensional synthetic test function (24) exhibits a complex behavior that is captured by our stationarity analysis through a broad distribution in the length scale and signal variance. The scatter plots reveal a multifaceted behavior, with different regions of the data exhibiting distinct statistical properties. The linear correlation within one cluster (left, scatter plot, Fig. 3) and the dispersion in the other cluster capture the interplay between the sinusoidal and quadratic components of the function. The linear cluster indicates a linear correlation between these two hyperparameters within a specific region of input data. This pattern may correspond to the sinusoidal components of the function, where local oscillations exhibit a consistent relationship between the length scale and signal variance. The additional dispersed points in the scatter plot likely correspond to the regions influenced by the quadratic term in the function, where the statistical properties vary, leading to a broader distribution of the hyperparameters. The bimodal distributions observed in the violin plot for the length scale further corroborate the non-stationarity. The results provide insights into the intricate interplay between the sinusoidal and quadratic components of the function. Due to the strong non-stationarity in the data, non-stationary kernels performed well for this test function. Due to the low dimensionality, the number of hyperparameters is low in all cases, leading to robust training. It is, in our opinion, safe to conclude that in one-dimensional cases with moderate dataset sizes and suspected non-stationarity in the data, non-stationary kernels are to be preferred. Both our parametric and deep non-stationary kernels performed well, with a slight edge in accuracy for the parametric kernel (see Fig. 6). Our two tested deep GP setups (Figs. 6 and 7) were not competitive given our reasonable-effort constraint, which in this case was on the order of days.
Moving on to the climate data example, Fig. 4 shows a quite concentrated cluster near (0,0), indicating weak non-stationarity. This concentration suggests that the statistical properties are consistent across this region, possibly reflecting a dominant pattern or behavior in the data. The computational experiments (see Fig. 8) reveal a trade-off between accuracy and compute time. One has to put much more effort—number of hyperparameters and time—into the computation for a relatively small gain in accuracy. Both the parametric non-stationary kernel and the deep kernel achieve higher accuracy than the stationary kernel but are costly. In time-sensitive situations, the stationary kernel is likely to be preferred; for best accuracy, the parametric non-stationary kernel or the deep kernel is the superior choice. Looking at the CRPS, the deep kernel has a slight edge in accurately estimating uncertainties over the parametric kernel; however, the parametric non-stationary kernel reaches the highest log marginal likelihood. For this dataset, we tested the BDGP without much success under our reasonable-effort constraint (see Fig. 9).
Finally, for the x-ray scattering data, Fig. 5 shows a high level of dispersion in the scatter and violin plots, indicating strong non-stationarity. This strong non-stationarity—particularly in the length scale—clearly benefits the performance of the deep kernel (see Fig. 11). The stationary kernel shows quite poor performance both in terms of RMSE and CRPS. Since stationary kernels only allow us to estimate uncertainties based on the global properties of the data and the local data-point geometry, it is expected that non-stationary kernels estimate uncertainties more accurately, which manifests itself in a lower (better) CRPS. The parametric non-stationary kernel leads the field in log marginal likelihood. Surprisingly, among the three tested kernels, the deep kernel leads to by far the lowest log marginal likelihood, which, again, suggests a better optimizer might lead to much-improved performance. Once again, our setup of the BDGP was not competitive (see Fig. 13).
B. Key takeaways from the computational experiments
While we included as much information in the computational experiments and the Appendix as possible to give the reader a chance to make up their own minds, here we summarize the following key takeaways:
Stationary kernels proved to be surprisingly competitive in terms of the RMSE and are unbeatable when evaluating accuracy per time. It seems worth it in most cases to run a stationary GP first for comparison.
Uncertainty is estimated more accurately by non-stationary kernels; if UQ-driven decision-making is the goal, non-stationary kernels are to be preferred. This is not a surprise since, given a constant (possibly anisotropic) length scale and a signal variance, the posterior variance only depends on data-point geometry.
The parametric non-stationary kernel encodes a flexible non-stationarity while maintaining interpretability. The involved parametric functions over the input space can be visualized and interpreted.
Deep kernels are some of the most flexible kernels, but interpretation is lost in all but the simplest cases, which can easily lead to model misspecifications (wrong model class and hyperparameters). Our models took a fair amount of experimenting before an acceptable performance was achieved. In online applications, in which testing beforehand is not possible, deep kernels should be used with caution.
Non-stationarity in the covariance structure appears in signal variance and length scale; ideally, a kernel addresses both [see Eq. (29)].
While not included in the tests, experimenting with prior mean functions has shown that non-stationarity properties highly depend on the prior mean function of the GP. This is especially true for the non-stationary signal variance.
Extrapolation and non-stationary kernels are difficult to combine. While the parametric non-stationary kernel can be set up for extrapolation, traditional neural networks are poorly equipped for that task.
We should think of the number of hyperparameters conservatively; too many bear the risk of model misspecification (through local minima) and overfitting.
The parametric non-stationary kernel achieved overall better RMSE; the deep kernel led to better uncertainty quantification, as indicated by the CRPS.
C. A parametric deep kernel
D. Connection between multi-task learning and non-stationary kernels
Multi-task learning offers a powerful paradigm to leverage shared information across multiple related tasks, thereby enhancing the predictive performance of each individual task.63–65 This is particularly beneficial when data for some tasks are sparse, as information from data-rich tasks can be used to inform predictions for data-scarce tasks. Flexible non-stationary kernels offer an interesting benefit for multi-task learning: instead of employing specialized techniques, such as coregionalization, one can reformulate the problem to a single-task problem and let a flexible non-stationary kernel learn intricate correlations between input and output space locations. By transforming the multi-task learning problem over the input space 𝒳 into a single-task learning problem over the augmented space 𝒳 × {1, …, T}, where T is the number of tasks and the task index becomes an additional input dimension, no further changes to the core algorithm are required. This has been known for a long time and is referred to as problem-transformation methods.66 These methods were originally dismissed as not being able to capture intricate correlations between the tasks; however, this is only true if stationary, separable kernels are used. A flexible non-stationary kernel is able to flexibly encode covariances across the input and output domains, independent of the indexing of the tasks. This makes it possible to transfer all the complexities of multi-task learning to the kernel design and use the rest of the GP framework as-is, inheriting the superior robustness and interpretability properties of single-task GPs.
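A minimal sketch of this problem transformation is shown below; the data shapes and values are hypothetical, and any single-output GP with a sufficiently flexible kernel could then be trained on the stacked arrays.

```python
import numpy as np

# hypothetical multi-task data: N points in R^n, T tasks, outputs Y of shape (N, T)
N, n, T = 100, 3, 4
rng = np.random.default_rng(2)
X = rng.uniform(size=(N, n))
Y = rng.normal(size=(N, T))

# problem transformation: append the task index as an additional input coordinate
X_aug = np.vstack([np.hstack([X, np.full((N, 1), t)]) for t in range(T)])   # shape (N*T, n+1)
y_aug = Y.T.reshape(-1)                                                      # shape (N*T,)
```

A flexible non-stationary kernel over the (n + 1)-dimensional augmented space can then learn cross-task covariances directly, without a separate coregionalization model.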
VI. SUMMARY AND CONCLUSION
In this paper, we put on the hat of a machine learning practitioner trying to find the best kernel or methodology within the scope of a Gaussian process (GP) to address non-stationarity in various datasets. We introduced three different datasets—one synthetic, one climate dataset, and one originating from an x-ray scattering experiment at the CMS beamline at NSLS-II, Brookhaven National Laboratory. We introduced a non-stationarity measure and studied each dataset to be able to judge their non-stationarity properties quantitatively. We then presented four different methodologies to address the non-stationarity: ignoring it by using a stationary kernel, a parametric non-stationary kernel that uses a flexible non-constant signal variance, a deep kernel that uses a neural network to warp the input space, and a deep GP. We set all methodologies up under reasonable effort constraints to allow for a fair comparison, just like a practitioner might encounter. In our case, that meant minutes of setup time for the stationary kernels and hours to days for the non-stationary kernels and the deep GPs. After the methodologies were set up, we ran our computational tests and presented the results unredacted and uncensored. This way, we hope, the reader gets the best value out of the comparisons. This is also to ensure that the reader has the chance to come up with conclusions different from ours. To further the readers’ ability to double-check and learn, we are publishing all our codes online.
Our tests have shown that even weak non-stationarity in a dataset motivates the use of non-stationary kernels if training time is not an issue of concern. If training time is very limited, stationary kernels are still the preferred choice. We have discovered that non-stationarity in the covariance comes in two flavors, in the signal variance and the length scale and, ideally, both should be addressed through novel kernel designs; we drew attention to one such kernel design. However, non-stationary kernels come with a great risk of model misspecification. If a new model should be trusted out of the box, a stationary kernel might be the preferred choice.
We hope that this paper motivates more practitioners to deploy and experiment with non-stationary kernels while also being aware of some of the risks.
ACKNOWLEDGMENTS
The work was supported by the Laboratory Directed Research and Development Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy Contract No. DE-AC02-05CH11231. This work was further supported by the Regional and Global Model Analysis Program of the Office of Biological and Environmental Research in the Department of Energy Office of Science under Contract No. DE-AC02-05CH11231. This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain the correct information, neither the United States Government nor any agency thereof, nor the Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof or the Regents of the University of California. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or the Regents of the University of California.
We want to acknowledge Kevin G. Yager from Brookhaven National Laboratory for providing the x-ray scattering dataset. This data collection used resources from the Center for Functional Nanomaterials (CFN) and the National Synchrotron Light Source II (NSLS-II), which are the U.S. Department of Energy Office of Science User Facilities at Brookhaven National Laboratory, funded under Contract No. DE-SC0012704.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
M.M.N. originally decided to write this paper, led the project, developed the test scripts and software with help from H.L. and M.D.R., and ran the computational experiments. H.L. suggested the non-stationarity measure and improved its practicality with help from M.M.N. and M.D.R. H.L. also did the majority of the work regarding deep GPs, both regarding the algorithms and the article. M.D.R. oversaw all developments, especially regarding parametric non-stationary kernels, and further assisted with writing and editing the article. All authors iteratively refined the core ideas, algorithms, and methods. All decisions regarding the work were made in agreement with all authors. All authors iteratively revised the article.
Marcus M. Noack: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Hengrui Luo: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Resources (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). Mark D. Risser: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Software (equal); Supervision (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
All data and the Jupyter notebook that runs all experiments can be found on https://gpcam.lbl.gov/examples/a-unifying-perspective-on-non-stationary-kernels-for-deeper-gps.
All experiments (except deep GPs) were run using the open-source Python package gpCAM (https://github.com/lbl-camera/gpCAM, gpcam.lbl.gov).
APPENDIX A: THEORETICAL MOTIVATION FOR OUR NON-STATIONARITY MEASURE
The simple technique used in this paper is to measure non-stationarity in the covariance structure of a dataset based on the idea of model misspecification. Unlike traditional non-stationary measures (e.g., histograms, periodograms), our non-stationary measure is model-specific and can generate visual checks in wider scenarios. We followed White67 and Schervish68 in the following arguments.
APPENDIX B: GPflux DGP SCRIPT
APPENDIX C: 1d BAYESIAN DEEP GP SCRIPT
APPENDIX D: THE DEEP GP ALGORITHM
Require: xdata, ydata, xpred
1: Create inducing points Z spanning the range of X
2: Define two GP layers:
  • Each layer is defined with an RBF kernel
  • Using a random collection of inputs Z as inducing points
  • The last layer has a zero mean function
3: Define a Gaussian likelihood per layer with noise variance 1e−4
4: Construct a 2-layer deep GP model with the GP layers and likelihood
5: Compile the model with the stochastic Adam optimizer71,72 (alternatively, one can perform Bayesian variational inference56)
6: Train the model for 10^4 epochs:
  • Optimize the parameters to maximize the marginal likelihood p(Y|X)
  • Where p(Y|X) = ∫ p(Y|F) p(F|X) dF
7: Predict on xpred and extract mean and variance
APPENDIX E: AVAILABLE DEEP GP IMPLEMENTATIONS
Table II offers a comprehensive overview of various implementations of Deep Gaussian Processes (DGPs), a powerful tool in machine learning. It is fascinating to see the diversity of implementations across different programming languages and frameworks, such as R, TensorFlow, and PyTorch. This wide variety reflects the growing interest and applicability of DGPs in various domains, from scientific research to industrial applications.
Repository/package      | Language      | Reference^a                                  | Fitting^b
deepGP                  | R             | Sauer et al.62                               | MCMC (elliptic sampler)
SDSP                    | R/TensorFlow  | Zammit-Mangion et al.45                      | SVI
SDSP-MCMC               | R/stan        | Zammit-Mangion et al.45                      | MCMC
SIWGP                   | R/TensorFlow  | Zammit-Mangion et al.45                      | MLE
GPFlux                  | TensorFlow    | Dutordoir et al.71                           | MLE/SVI (variational)
DeepCGP                 | TensorFlow    | Blomqvist et al.73                           | SVI (doubly stochastic)
DGPs_with_IWVI          | TensorFlow    | Salimbeni et al.74                           | SVI
sghmc_dgp               | TensorFlow    | Havasi et al.75                              | MCMC (HMC)
Deep-Gaussian-Process   | TensorFlow2   | Salimbeni and Deisenroth72                   | SVI^c
GPyTorch/DeepGP         | PyTorch       | Salimbeni and Deisenroth,72 Gardner et al.76 | SVI (doubly stochastic)
DGPRFF                  | PyTorch       | Cutajar et al.77                             | SVI
Doubly stochastic-DGP   | PyTorch       | Salimbeni and Deisenroth72                   | SVI (doubly stochastic)
PyDeepGP                | Python        | Dai et al.,78 Damianou and Lawrence12        | MLE/SVI (variational)
DGPsparse               | TensorFlow    | Damianou and Lawrence12                      | MLE
DGPfull                 | Python        | Schmidt and O’Hagan27                        | MCMC (elliptic sampler)
^a We sort the table by the publication date of the associated paper.
^b We follow the main fitting method featured by the package; some packages may support more fitting methods than listed here.
^c Stochastic variational inference.
One of the key aspects of the table is the different fitting methods used, such as MCMC (Markov Chain Monte Carlo), SVI (Stochastic Variational Inference), and MLE (Maximum Likelihood Estimation). The choice of fitting method can significantly impact the efficiency and accuracy of the model. It is noteworthy that some packages may support more fitting methods than those listed, indicating flexibility and adaptability in the approach to modeling with DGPs.
Recent developments in the field are evident from the inclusion of implementations associated with recent publications. This ongoing research and development signal a vibrant and evolving field where new methodologies and techniques are continually being explored and refined.
Accessibility is another important aspect highlighted in the table. Many of the implementations are openly accessible through platforms like GitHub. This open-source approach fosters collaboration and innovation, allowing researchers and practitioners to not only utilize these resources but also contribute to their development.
The table’s caption mentions that it extends and compares information from other sources, suggesting a comprehensive effort to collect and compare various DGP implementations. This comparative approach adds value to the table by placing the different implementations in context with one another, providing a broader perspective on the landscape of DGPs.
While the table provides a valuable summary, it also hints at potential challenges. Users may need to delve into the individual papers or repositories to understand the specific features, advantages, and limitations of each implementation. This deeper exploration may reveal nuances and details that are beyond the scope of the table but are essential for a complete understanding of each DGP implementation.