We present a new data-driven model to reconstruct nonlinear flow from spatially sparse observations. The proposed model is a version of a Conditional Variational Auto-Encoder (CVAE), which allows for probabilistic reconstruction and thus uncertainty quantification of the prediction. We show that in our model, conditioning on measurements from the complete flow data leads to a CVAE where only the decoder depends on the measurements. For this reason, we call the model semi-conditional variational autoencoder. The method, reconstructions, and associated uncertainty estimates are illustrated on the velocity data from simulations of 2D flow around a cylinder and bottom currents from a simulation of the southern North Sea by the Bergen Ocean Model. The reconstruction errors are compared to those of the Gappy proper orthogonal decomposition method.

## I. INTRODUCTION

Reconstruction of non-linear dynamic processes from sparse observations traditionally requires knowledge of the processes and the governing equations to be able to generalize to a wider area around, in-between, and beyond the measurements. Alternatively, it is possible to learn the underlying processes or equations based on the data itself, the so-called data-driven methods. In geophysics and environmental monitoring, measurements are often only available at sparse locations. For instance, within the field of meteorology, atmospheric pressures, temperatures, and wind are only measured at a limited number of stations. Producing accurate and general weather predictions requires methods that both forecast for the future and also reconstruct where no data are available. Within oceanography, one faces the same problem that *in situ* information about the ocean dynamics is only available at sparse locations such as buoys or sub-sea sensors.

Both the weather and ocean currents can be approximated with models that are governed by physical laws, e.g., the Navier–Stokes equation. However, it is of crucial importance to incorporate observations in order to obtain accurate reliable reconstructions and forecasts.

Reconstruction and inference based on sparse observations are important in numerous applications.^{1–6} Bolton and Zanna^{3} used Convolutional Neural Networks (CNNs) to hindcast ocean models, and Yeo^{7} reconstructed time series from nonlinear dynamics based on sparse observations. Oikonomo *et al.*^{8} proposed a method for filling data gaps in groundwater level observations, and Kong *et al.*^{2} used reconstruction techniques to model the characteristics of cartridge valves.

The above mentioned applications and methods are just some of the many examples of reconstruction of a dynamic process based on limited information. Here, we focus on the reconstruction of flow. This problem can be formulated as follows: Let $w\u2208Rd$, $d\u2208N$, represent a state vector of the flow containing, for example, velocity, pressure, and temperature. Here, we will focus on in-compressible unsteady fluid flows in 2D, so $w=(u,v)\u2208R2$, where *u* and *$v$* are the two velocity components. The velocity vector, $w$, is typically obtained from computational fluid dynamic simulations on a meshed spatial domain, $P$, at discrete times, $T={t1,\u2026,tK}$.

Let $P={p1,\u2026,pN}$ consist of *N* grid points *p*_{n}, *n* = 1, …, *N*. Then, the state of the flow, $w$, evaluated on $P$ at a time $ti\u2208T$, can be represented as a vector $x(i)\u2208R2N$,

The collection of *x*^{(i)}, *i* = 1, …, *K*, constitutes the dataset ** X**. In order to account for incompressibility, we introduce a discrete divergence operator

*L*

_{di$v$}, which is given by a

*N*× 2

*N*matrix associated with a finite difference scheme, and

Furthermore, we assume that the state is measured only at specific points in $P$, that is, at $Q={q1,\u2026,qM}\u2282P$ with typically *M* ≪ *N*. Hence, there is $M={m(i)\u2208R2M:\u2009m(i)=C\u2009x(i),\u2009\u2200x(i)\u2208X}$, where $C\u2208R2M\xd72N$ represents a sampling matrix. More specifically, ** C** is a two block matrix,

and $O\u2208RM\xd7N$ is the zero matrix. The problem of reconstructing fluid flow *x*^{(i)} ∈ ** X** from

*m*^{(i)}∈

**is presented as a schematic plot in Fig. 1.**

*M*There have been a wide range of methods for solving the problem, e.g., Refs. 9–14. In particular, the use of proper orthogonal decomposition (POD)^{9} techniques has been popular. It is a traditional dimensional reduction technique where, based on a dataset, a number of basis functions are constructed. The key idea is that linear combinations of these basis functions can reconstruct the original data within some error margin, efficiently reducing the dimension of the problem. In a modified version of the POD, the Gappy POD (GPOD),^{10} the aim is to fill the gap in-between sparse measurements. Given a POD basis, one can minimize the *L*_{2}-error with the measurements and find a linear combination of the POD-basis that complements the measurements. If the basis is not known, an iterative scheme can be formulated to optimize the basis based on the measurements. The original application of GPOD^{10} was related to reconstruction of human faces, and it has later been applied to fluid flow reconstruction.^{4,15–17} This motivates us to use this method for comparison in our study.

A similar approach is the Compressed Sensing (CS) technique.^{11} As for the GPOD method, the aim is to solve a linear system; in the CS case, it will be a under-determined linear system. Hence, there is a need for additional information about the system to be able to solve it; typically, this can be a condition/constraint related to the smoothness of the solution. The core difference between CS and GPOD is, however, the sparsity constraint. That is, instead of minimizing the L_{2}-norm, we minimize the L_{1}-norm. Minimizing the L_{1}-norm favors sparse solutions, i.e., solutions with a small number of nonzero coefficients.

Another reconstruction approach is Dynamical Mode Decomposition (DMD).^{12} Instead of using principal components in the spatial domain, DMD seeks to find modes or representations that are associated with a specific frequency in the data, i.e., modes in the temporal domain. Again, the goal is to find a solution to an undetermined linear system and reconstruct based on the measurements by minimizing the deviance from the observed values.

During the past decade, data-driven methods have become popular, partly because of the growth and availability of data, but also driven by new technology and improved hardware. Modeling non-linear relationships with linear approximations is one of the fundamental limitation of the DMD, CS, and GPOD methods. There have been efforts that involve neural networks and other machine learning methods to solve the flow reconstruction and decomposition problem. For example, Pawar and San^{18} used data assimilation to correct a physics-based model coupled with a neural network as a surrogate to resolved flow dynamics in multiscale systems. Other examples include the work of Erichson *et al.*^{19} that used shallow neural networks to learn the input-to-output mapping between sensor measurements and flow fields for a deterministic reconstruction of the data and that of Pérez *et al.*^{20} that reconstructed three-dimensional flow fields from two-dimensional data through higher order dynamic mode decomposition,^{21} while Gao *et al.*^{22} used an attentional generative adversarial neural network to reconstruct and interpolate air pollution data from Beijing. Probabilistic reconstruction of flow fields is not that widespread; however, Maulik *et al.*^{23} recently developed a probabilistic neural network for fluid flow surrogate modeling and data recovery through Mixture Density Networks (MDN). MDNs can account for complex multi-modal conditional distributions but are, in general, difficult to optimize. In addition, MDNs have tendency to overfit the training data and are prone to mode collapse.^{24} These are examples of recent initiatives that seek to either reconstruct or decompose flow fields based on the input from limited information with data-driven methods.

Another promising and interesting method where the neural networks are informed with a physical law, the so-called Physic-Informed Neural Networks (PINNs),^{14} has been developed during the past few years. In PINNs, the reconstruction is typically informed by a Partial Differential Equation (PDE) (e.g., Navier Stokes equation for fluid flow), and thus, the neural network can learn to fill the gap between measurements that are in compliance with this equation. This is what Raissi *et al.*^{25} have shown for the benchmark examples such as flow around a 2D and 3D cylinder.

Although PINNs are showing encouraging results, we have yet to see applications to complex systems such as atmospheric or oceanographic systems, where other aspects have to be accounted for in large scale oceanic circulation models that are driven by forcing, such as tides, bathymetry, and river-influx. That being said, these problems may be resolved through PINNs in the future.

Popular non-linear data-driven approaches for reconstruction of fluid flow are different variations of autoencoders.^{13,26,27} An autoencoder^{28} is a special configuration of an artificial neural network that first encodes the data by gradually decreasing the size of the hidden layers. Through this process, the data are represented in a lower dimensional space. A second neural network then takes the output of the encoder as the input and decodes the representation back to its original shape. These two neural networks together constitute an autoencoder.

Principal Component Analysis (PCA)^{29} also represent the data in a different and more compact space. However, PCA reduces the dimension of the data by finding orthogonal basis functions or principal components through singular value decomposition. In fact, it has been showed with a linear activation function, PCA and autoencoders produce the same basis functions.^{30} The probabilistic version of the autoencoder is called Variational Auto-Encoder (VAE).^{31} Conditional Variational Auto-Encoders (CVAEs)^{32} are conditional probabilistic autoencoders, that is, the model is dependent on some additional information such that it is possible to create representations that are dependent on this information.

Here, we address the mentioned problem from a probabilistic point of view. Let $x:P\u2192R2N$ and $m:Q\u2192R2M$ be two multivariate random variables associated with the flow on $P$ and $Q$, respectively. Then, the datasets ** X** and

**consist of the realizations of**

*M***and**

*x***, respectively. Using**

*m***and**

*X***, we intend to approximate the probability distribution**

*M**p*(

**|**

*x***). This would allow us not only to predict**

*m*

*x*^{(i)}given

*m*^{(i)}but also to estimate an associated uncertainty. We use a variational autoencoder to approximate

*p*(

**|**

*x***). The method we use is a Bayesian Neural Network (BNN)**

*m*^{33}approximated through variational inference,

^{34,35}which we have called

*Semi-Conditional Variational Auto-Encoder*(SCVAE). A detailed description of the SCVAE method for reconstruction and associated uncertainty quantification is given in Sec. III B.

Here, we focus on fluid flow, being the main driving mechanism behind transport and dilution of tracers in marine waters. The world’s oceans are under tremendous stress,^{36} UN has declared 2021–2030 as the ocean decade,^{80} and an ecosystem based Marine Spatial Planning initiative has been launched by the Intergovernmental Oceanographic Commission of UNESCO (IOC).^{37}

Local and regional current conditions determine transport of tracers in the ocean.^{38,39} Examples are accidental release of radioactive, biological, or chemical substances from industrial complexes, e.g., organic waste from fish farms in Norwegian fjords,^{40} plastic,^{41} or other contaminants that might have adverse effects on marine ecosystems.^{42}

The ability to predict the environmental impact of a release, i.e., concentrations as a function of distance and direction from the source, requires reliable current conditions.^{43,44} Subsequently, these transport predictions support design of marine environmental monitoring programs.^{45–48} The aim here is to model current conditions in a probabilistic manner using SCVAEs. This allows for predicting footprints in a Monte Carlo framework, providing simulated data for training networks used for analyzing environmental time series.^{49}

We will compare results with the GPOD method.^{50} We are aware that there are recent methods (e.g., PINNS and traditional autoencoder) that may perform better on the specific datasets than the GPOD; however, the simplicity, versatility, and, not least, popularity of GPOD^{5,51,52} make it a great method for comparison.

The reminder of this manuscript is outlined as follows: Sec. II presents a motivating example for the SCVAE-method in comparison with the GPOD-method. In Sec. III, we review both the VAE and CVAE methods and present the SCVAE. Results from experiments on two different datasets are presented in Sec. IV. Section V summarizes and discusses the method, experiments, drawbacks, benefits, potential extensions, and further work.

## II. A MOTIVATING EXAMPLE

Here, we illustrate the performance of the proposed method vs the GPOD method in order to give a motivation for this study. We use data from simulations of a two dimensional viscous flow around a cylinder at the Reynolds number of 160 obtained from Ref. 53. The simulations were performed by Weinkauf and Theisel^{54} with the Gerris Flow Solver software.^{55} The dataset consists of a two velocity components, *u* and *$v$*, on an uniform 400 × 50 × 1001 grid of the [−0.5, 7.5] × [−0.5, 0.5] × [15, 23] spatial–temporal domain. The simulations are run from *t* = 0 to *t* = 23, but velocities are only extracted from *t* = 15 to *t* = 23. In particular, we have 400 points in the horizontal direction, 50 points in the vertical direction, and 1001 points in time. The cylinder has a diameter of 0.125 and is centered at the origin (see Fig. 2).

The left vertical boundary (inlet) has the Dirichlet boundary condition, *u* = 1 and *$v$* = 0. The homogeneous Neumann boundary condition is imposed at the right boundary (outlet) and with homogeneous Dirichlet conditions on the remaining boundaries. At the start of the simulation, *t* = 0, both velocities were equal to zero. We plot a snapshot of the velocities in Fig. 2.

In the experiment below, we extract the data downstream from the cylinder, that is, from grid points 40–200 in the horizontal direction, and keep all grid points in the vertical direction. Hence, $P$ contains *N* = 8000 points, 160 points in the horizontal direction, and 50 in the vertical direction. The temporal resolution is kept as before, that is, the number of time steps in $T$ is *K* = 1001.

For validation purposes, the dataset was split into a train, validation, and test dataset. For both the SCVAE and the GPOD, the goal was to minimize the *L*_{2} error between the true and the modeled flow state. The restriction of the GPOD is that the number of components *r* can be at most 2*M*. To deal with this problem and to account for the flow incompressibility, we added the regularization term *λ*$\u2225$*L*_{di$v$}*x*^{(i)}$\u2225$, *λ* > 0, to the objective function (see Appendix A). For the GPOD method, the parameters *r* and *λ* are optimized on the validation dataset in order to have the smallest mean error. We give more details about objective functions for the SCVAE in Sec. III B. For now, we mention that there are two versions, where one version uses an additional divergence regularization term similar to GPOD.

In Fig. 3, we plot the mean of the relative *L*_{2} error calculated on the test data for both methods with and without the div-regularization (in-compressible fluid). The results are presented for three, four, and five measurement locations, that is, *M* = 3, 4, 5. For each of these three cases, we selected 20 randomly different configurations of *M*. In particular, we created 20 subgrids $Q$, each containing five randomly sampled spatial grid points. Next, we removed one and then two points from each of the 20 subgrids $Q$ to create new subgrids of 4 and 3 measurements, respectively. The figure is a box-and-whisker plot of these 20 configurations, where outliers are defined as data points that are outside 1.5 times the inter quartile range from the upper and lower quantiles.

As shown in Fig. 3, both methods perform well with 5 measurements. The resulting relative errors have comparable mean and variance. When reducing the number of observations, the SCVAE method maintains low errors, while the GPOD error increases. The SCVAE seems to benefit from the additional regularization of minimizing the divergence in terms of lower error and less variation in the error estimates. The effect is more profound with fewer measurements.

The key benefit of the SCVAE is that its predictions are optimal for the given measurement locations. In contrast, the POD based approaches and particularly the GPOD create a set of basis functions (principal components) based on the training data independently of the measurements. While this has an obvious computational advantage, the number of principle components for complex flows can be high and, as a result, many more measurements are needed.^{6,50,56} There are a number of algorithms that aim to optimize the measurement locations to achieve the best performance of the POD based methods (see Refs. 17, 50, and 51). In practice, however, the locations are often fixed and other approaches are needed. The results in Fig. 3 suggest that the SCVAE could be one of these approaches.

## III. METHODS

Before we introduce the model used for reconstruction of flows, we give a brief introduction to VAEs and CVAEs. For a detailed introduction, see Ref. 57. VAEs are neural network models that has been used for learning structured representations in a wide variety of applications, e.g., image generation,^{58} interpolation between sentences,^{59} and compressed sensing.^{26}

### A. Variational auto-encoders (VAE)

Let us assume that the dataset ** X** is generated by a random process that involves an unobserved continuous random variable

**. The process consists of two steps: (i) a value**

*z*

*z*^{(i)}is sampled from a prior $p\theta *(z)$ and (ii)

*x*^{(i)}is generated from a conditional distribution, $p\theta *(x|z)$. In the case of flow reconstruction,

**could be unknown boundary or initial conditions, tidal and wind forcing, etc. However, generally,**

*z***is just a convenient construct to represent**

*z***, rather than a physically explained phenomena. Hence, it is, for convenience, assumed that $p\theta *(z)$ and $p\theta *(x|z)$ come from parametric families of distributions**

*X**p*

_{θ}(

**) and**

*z**p*

_{θ}(

**|**

*x***), and their density functions are differentiable almost everywhere with respect to both**

*z***and**

*z**θ*. A probabilistic autoencoder is a neural network that is trained to represent its input

**as**

*X**p*

_{θ}(

**) via**

*x**latent representation*

**∼**

*z**p*

_{θ}(

**), that is,**

*z*As *p*_{θ}(** z**) is unknown and observations

*z*^{(i)}are not accessible, we must use

**in order to generate**

*X***∼**

*z**p*

_{θ}(

**|**

*z***). That is, the network can be viewed as consisting of two parts: an**

*x**encoder p*

_{θ}(

**|**

*z***) and a**

*x**decoder p*

_{θ}(

**|**

*x***). Typically, the true posterior distribution**

*z**p*

_{θ}(

**|**

*z***) is intractable but could be approximated with variational inference.**

*x*^{34,35}That is, we define the so-called recognition model

*q*

_{ϕ}(

**|**

*z***) with variational parameters**

*x**ϕ*, which aims to approximate

*p*

_{θ}(

**|**

*z***). The recognition model is often parameterized as a Gaussian. Thus, the problem of estimating**

*x**p*

_{θ}(

**|**

*z***) is reduced to finding the best possible estimate for**

*x**ϕ*, effectively turning the problem into an optimization problem.

An autoencoder that uses a recognition model is called Variational Auto-Encoder (VAE). In order to get good prediction, we need to estimate the parameters *ϕ* and *θ*. The marginal likelihood is equal to the sum over the marginal likelihoods of the individual samples, that is, $\u2211i=1K\u2061logp\theta (x(i))$. Therefore, we further on present estimates for an individual sample. The Kullback–Leibler divergence between two probability distributions *q*_{ϕ}(** z**|

*x*^{(i)}) and

*p*

_{θ}(

**|**

*z*

*x*^{(i)}), defined as

can be interpreted as a measure of distinctiveness between these two distributions.^{60} It can be shown (see Ref. 57) that

where

Since KL-divergence is non-negative, we have $logp\theta (x(i))\u2265L(\theta ,\varphi ;x(i))$ and $L(\theta ,\varphi ;x(i))$ is called Evidence Lower Bound (ELBO) for the marginal likelihood log *p*_{θ}(*x*^{(i)}). Thus, instead of maximizing the marginal probability, one can instead maximize its variational lower bound to which we also refer to as an objective function. It can be further shown that the ELBO can be written as

Reformulating the traditional VAE framework as a constraint optimization problem, it is possible to obtain the *β*-VAE^{61} objective function if $p\theta (z)=N(0,I)$,

where *β* > 0. Here, *β* is a regularization coefficient that constrains the capacity of the latent representation ** z**. The $Eq\varphi (z|x(i))logp\theta (x(i)|z)$ can be interpreted as the reconstruction term, while the KL-term can be interpreted

*βD*

_{KL}[

*q*

_{ϕ}(

**|**

*z*

*x*^{(i)})$\u2225$

*p*

_{θ}(

**)] as a regularization term.**

*z*Conditional Variational Auto-Encoders^{32} (CVAE) are similar to VAEs but differ by conditioning on an additional property of the data (e.g., a label or class), here denoted ** c**. Conditioning both the recognition model and the true posteriori on both

*x*^{(i)}and

**results in the CVAE ELBO,**

*c*In the decoding phase, CVAE allows for conditional probabilistic reconstruction and permits sampling from the conditional distribution *p*_{θ}(** z**|

**), which has been useful for generative modeling of data with known labels (see Ref. 32). Here, we investigate a special case of the CVAE when**

*c***is a partial observation of**

*c***. We call this the Semi-Conditional Variational Auto-Encoder (SCVAE).**

*x*### B. Semi-conditional variational auto-encoder

The SCVAE takes the input data ** X**, conditioned on

**, and approximates the probability distribution**

*M**p*

_{θ}(

**|**

*x***,**

*z***). Then, we can generate**

*m*

*x*^{(i)}based on the observations

*m*^{(i)}and latent representation

**. As**

*z*

*m*^{(i)}=

*C*

*x*^{(i)}, where

*C*is a non-stochastic sampling matrix, we have

Therefore, from Eq. (7), the ELBO for SCVAE is

where $p\theta (z|m(i))=N(0,I)$. Similarly, as for the *β*-VAE,^{61} we can obtain a relaxed version of Eq. (8) by maximizing the parameters {*ϕ*, *θ*} of the expected log-likelihood $Eq\varphi (\u22c5)[logp\theta (x(i)|m(i),z)]$) and treat it as a constrained optimization problem,

where *ϵ* > 0 is small. The subscript *q*_{ϕ}(·) is short for *q*_{ϕ}(** z**|

*m*^{(i)},

*x*^{(i)}). Since

*m*^{(i)}is dependent on

*x*^{(i)}, we have

*q*

_{ϕ}(

**|**

*z*

*m*^{(i)},

*x*^{(i)}) =

*q*

_{ϕ}(

**|**

*z*

*x*^{(i)}). Equation (9) can be expressed as a Lagrangian under the Karush–Kuhn–Tucker (KKT) conditions.

^{62,63}Hence,

According to the complementary slackness KKT condition *β* ≥ 0, we can rewrite Eq. (10) as

The objective functions in Eqs. (8) and (11), later in Eq. (13), show that if conditioning on a feature that is a known function of the original data, such as measurements, we do not need to account for them in the encoding phase. The measurements are then coupled with the encoded data in the decoder. We sketch the main components of the SCVAE in Fig. 4.

In order to preserve some physical properties of the data ** X**, we can condition yet on another feature. Here, we utilize the incompressibility property of the fluid, i.e.,

*d*^{(i)}=

*L*

_{di$v$}

*x*^{(i)}≈ 0 [see Eq. (2)]. We intend to maximize a log-likelihood under an additional constrain

*d*^{(i)}compared to Eq. (9). That is,

subject to

and

where *ϵ*, *δ* > 0 are small. Equation (12) can expressed as a Lagrangian under the Karush–Kuhn–Tucker (KKT) conditions as before, and as a consequence of the complementary slackness condition *λ*, *β* ≥ 0, we can obtain the objective function

where $p(z|m(i),d(i))=N(0,I)$. For convenience of notation, we refer to the objective function [Eq. (11)] as the case with *λ* = 0 and the objective function [Eq. (13)] as the case with *λ* > 0. Observe that under the Gaussian assumptions on the priors, Eq. (13) is equivalent to Eq. (11) if *λ* = 0. Thus, from now on, we will refer to it as a special case of Eq. (13) and denote as $L0$. Optimizing Eq. (13) can be interpreted as a multi task learning,^{64} i.e., we optimize the network to solve more than one task at a time. Multitask learning reduces the risk of overfitting^{65} and produces more consistent results.^{64}

Similar to Ref. 31, we obtain $q\varphi (z|x(i))=N(\mu (i),diag(\sigma (i))2)$, that is, *ϕ* = {*μ*^{(i)}, *σ*^{(i)}}. This allows us to express the KL-divergence terms in a closed form and avoid issues related to differentiability of the ELBOs. Under these assumptions, the KL-divergence terms can be integrated analytically, while the terms $Eq\varphi (z|x(i))logp\theta (x(i)|z,m(i))$ and $Eq\varphi (z|x(i))logp\theta (d(i)|z,m(i))$ require estimation by sampling,

where

Here, *ϵ*^{l} is an auxiliary (noise) variable with independent marginal *p*(** ϵ**), and

*g*

_{ϕ}(·) is a differentiable transformation of

**, parametrized by**

*ϵ**ϕ*(see Ref. 31 for details). We denote $L\lambda $,

*λ*≥ 0 [Eq. (13)], with the approximation above as $L^\lambda $, that is,

The objective function $L^\lambda $ can be maximized by gradient descent. Since the gradient $\u2207\theta ,\varphi \u2009L^\lambda $ cannot be calculated for large datasets, Stochastic gradient descent methods (see Refs. 66 and 67) are typically used where

Here, $XR=x(ir)r=1R$, *R* < *K* is a mini-batch consisting of randomly sampled datapoints, $MR=m(ir)r=1R$, and $DR=d(ir)r=1R$. After the network is optimized, a posterior predictive distribution *p*_{θ}(** x**|

**,**

*m***) can be approximated with a Monte Carlo estimator.**

*d*#### 1. Uncertainty quantification

Let $\theta ^$ and $\varphi ^$ be an estimation of generative and variational parameters, as described in Sec. III B. Then, the decoder can be used to predict the posterior as

While sampling from the latent space has been viewed typically as an approach for generating new samples with similar properties, here, we use it to estimate the prediction uncertainty of the trained model. From Eq. (17), we are able to estimate the mean prediction $x^*$ and empirical covariance matrix $\Sigma ^$ using a Monte Carlo estimator. We get

and

where $x(j)\u223cp\theta ^(x|m*,d*)$. The empirical standard deviation is then $\sigma ^=diag(\Sigma ^)$. To estimate the confidence region, we assume that the predicted $p\theta ^(x|m*,d*)$ is well approximated by a normal distribution *N*(** μ**,

**Σ**). Given that $x^*$ and $\Sigma ^$ are approximations of

**and**

*μ***Σ**, obtained from

*N*

_{MC}samples as above, a confidence region estimate for a prediction

*x*^{*}can be given as

where $\chi k2(p)$ is the quantile function for probability *p* of the chi-squared distribution with *k* = min{*N*_{MC}, 2*N*} degrees of freedom and $\Sigma ^+$ is the pseudoinverse of $\Sigma ^$. Using the singular value decomposition, $\Sigma ^=USUT$, the corresponding interval for $(x*)n$, *n* = 1, …, 2*N*, is

where $unT$ is *n*th row of the matrix ** U**.

## IV. EXPERIMENTS

We will present the SCVAE method on two different datasets. The first one is the 2D flow around a cylinder described in Sec. II, and the second is ocean currents on the seafloor created by the Bergen Ocean Model (BOM).^{68} The data ** X** consist of the two dimensional velocities $w$ = (

*u*,

*$v$*). To illustrate the results, we will plot

*u*and

*$v$*components of

*x*^{(i)}∈

**[see Eq. (1)]. For validation of the models, the data**

*X***are split into train, test, and validation subsets, which are subscripted accordingly, if necessary. The datasets, spitting, and preprocessing for each case are described in Secs. IV A and IV B.**

*X*We use a schematically simple architecture to explore the SCVAE. The main ingredient of the encoder is the convolutional neural network (CNN),^{69,70} and for the decoder, we use transposed CNN-layers.^{71} The SCVAE has a slightly different architecture in each case, which we present in Appendix C.

The SCVAE is trained to maximize the objective function in Eq. (16) with the backpropagation algorithm^{72} and the Adam algorithm.^{73} We use an adaptive approach of weighing the reconstruction term with KL-divergence and/or divergence terms,^{74} that is, finding the regularization parameters *β* and *λ*. Specifically, we calculate the proportion of contribution of each term to the total value of the objective function and scale the terms accordingly. This approach prevents posterior collapse. Posteriori collapse occurs if the KL-divergence term becomes too close to zero, resulting in a non-probabilistic reconstruction. The approach of weighing the terms proportionally iteratively adjusts the weight of the KL-divergence term, *β*, such that posterior collapse is mitigated. For the result shown here, we trained the SCVAEs with early stopping criteria of 50 epochs, i.e., the optimization is stopped if we do not see any improvement after 50 epochs and returns the best model. We use a two-dimensional Gaussian distribution for *p*_{θ}(** z**|

*m*^{(i)},

*d*^{(i)}) in all the experiments.

Let the test data *X*_{test} consist of *n* instances *x*^{(i)}, *i* = 1, …, *n*, and $x^(i)$ denote a prediction of the true *x*^{(i)} given *m*^{(i)}. In the case of the SCVAE, $x^(i)$ is the mean prediction obtained as in Eq. (18). For the GPOD method, $x^(i)$ is a deterministic output of the optimization problem (see Appendix A). In order to compare the SCVAE results with the results of the GPOD method, we introduce the mean of the relative error for the prediction,

and the mean of the absolute error for the divergence,

### A. 2D flow around a cylinder

Here, we return to the example in Sec. II. In the following, we give some additional details of the data preprocessing and model implementation.

#### 1. Preprocessing

The data are reduced, as described in Sec. II. We assess the SCVAE with a sequential split for train, test, and validation: the last 15% of the data is used for test, the last 30% of the remaining data is used for validation, and the first 70% is used for training. To improve the conditioning of the optimization problem, we scale the data as described in Appendix B. The errors, $E$ [Eq. (21)] and $Ediv$ [Eq. (22)], are calculated after re-scaling the data back. The input to the SVAE *x*^{(i)} was reshaped as an array with dimension (160 × 50 × 2) in order to apply convolutional layers. Here, we use five, four, three, and two fixed spatial measurements, that is, four different subgrids $Q$,

The flow state at these specific locations constitutes ** M**.

#### 2. Model

A schematic description of the model is given in Appendix C. The first layer of the encoder is a zero-padding layer that expands the horizontal and vertical dimensions by adding zeros on the boundaries, a stencil of four in the horizontal direction and three in the vertical direction. The subsequent layers consist of two convolutional layers, where the first and second layers have 160 and 200 filters, respectively. We use a kernel size and strides of 2 in both convolutional layers and ReLu activation functions. This design compresses the data into a (42 × 14 × 200) shape. The compressed representation from the convolutional layers is flattened and is further compressed into a 64 dimensional vector through a traditional dense layer. Two outputs layers are defined to represent the mean and log-variance of the latent representation ** z**. The reparameterization trick is realized in a third layer, the so-called lambda layer, which takes the mean and log-variance as an input and generates

**. The output of the encoder are the samples**

*z*

*z*^{(i)}and the mean and the log-variance of

*z*^{(i)}.

The decoder takes the latent representation *z*^{(i)} and the measurements *m*^{(i)} as the input. The input *m*^{(i)} is flattened and then concatenated with *z*^{(i)}. The next layer is a dense layer with shape (42 × 14 × 200). Afterward, there are two transposed convolutional layers with filters of 200 and 160. The strides and the kernel size are the same as those for the encoder. The final layer is a transposed convolutional layer with the same dimension as the input to the encoder, the dimension of *x*^{(i)}. A linear activation function is used for this output layer. The last layer of the model is a lambda layer that removes the zero-padding. In Sec. IV A 3, we show statistics of the probabilistic reconstruction and compare with the GPOD method.

#### 3. Results

In Fig. 5, we have plotted the reconstructed velocity fields and associated statistics. The observations placements are shown as stars (black and white). The SCVAE with the objective function $L^0$ [see Eq. (7)] was used for this prediction. To generate the posterior predictive distributions [Eq. (17)], we sample 100 realizations from $z\u223cN(0,I)$, which allows for calculation mean prediction and uncertainty estimates [see Eq. (18)].

We emphasize again that the SCVAEs with $L^0$ and $L^\lambda $, *λ* > 0, are two different models. For the sake of notation, we here refer to *λ* = 0 when we mean the model with the objective function in Eq. (7) and to *λ* > 0 when in Eq. (13). The same holds for the GPOD method (see Appendix B). When *λ* = 0, the number of the principle components *r* is less, 2*M*. The number *r* is chosen such that the prediction on the validation data has the smallest possible error on average. If *λ* > 0, no restrictions on *r* are imposed. In this case, both *λ* and *r* are estimated from the validation data.

The results show that the SCVAE reconstructs the data well with the associated low uncertainty. This can be explained by the periodicity in the data. In particular, the training and validation datasets represent the test data well enough.

In Fig. 6, we have plotted four time series of the reconstructed test data at two specific grid points, together with the confidence regions constructed as in Eq. (20) with *p* = 0.95. Figures 6(a) and 6(c) represent the reconstruction at the grid point (6, 31), and Figs. 6(b) and 6(d) represent that at (101, 25) for *u* and *$v$*, respectively. The SCVAE reconstruction is significantly better than the GPOD and close to the true solution for all time steps.

Figure 7 shows the difference between the true values and the model prediction in time for the same two locations. This figure has to be seen in context with Fig. 5. The difference marginals are obtained based on the confidence region in Fig. 11. In Table I, we display the relative errors [Eq. (21)] for the SCVAE and the GPOD method, both with and without divergence regularization, for five, four, three, and two measurement locations given in Eq. (23).

. | . | Measurement locations . | |||
---|---|---|---|---|---|

Method . | Regularization . | 5 . | 4 . | 3 . | 2 . |

SCVAE | λ = 0 | 0.30 × 10^{−2} | 0.33 × 10^{−2} | 0.26 × 10^{−2} | 0.28 × 10^{−2} |

λ > 0 | 0.31 × 10^{−2} | 0.32 × 10^{−2} | 0.30 × 10^{−2} | 0.28 × 10^{−2} | |

GPOD | λ = 0 | 2.35 × 10^{−2} | 2.49 × 10^{−2} | 3.38 × 10^{−2} | 17.38 × 10^{−2} |

λ > 0 | 2.12 × 10^{−2} | 2.33 × 10^{−2} | 3.15 × 10^{−2} | 16.38 × 10^{−2} |

. | . | Measurement locations . | |||
---|---|---|---|---|---|

Method . | Regularization . | 5 . | 4 . | 3 . | 2 . |

SCVAE | λ = 0 | 0.30 × 10^{−2} | 0.33 × 10^{−2} | 0.26 × 10^{−2} | 0.28 × 10^{−2} |

λ > 0 | 0.31 × 10^{−2} | 0.32 × 10^{−2} | 0.30 × 10^{−2} | 0.28 × 10^{−2} | |

GPOD | λ = 0 | 2.35 × 10^{−2} | 2.49 × 10^{−2} | 3.38 × 10^{−2} | 17.38 × 10^{−2} |

λ > 0 | 2.12 × 10^{−2} | 2.33 × 10^{−2} | 3.15 × 10^{−2} | 16.38 × 10^{−2} |

The results of the SCVAE depend on two stochastic inputs, which are (i) randomness in the initialization of the prior weights and (ii) random mini-batch sampling. We have trained the model with each measurement configuration ten times and chose the model that performs the best on the validation dataset. Ideally, we would run test cases where we used all the values as measurements, i.e., ** M** =

**, and test how well the model would reconstruct in this case. This would then give us the lower bound of the best reconstruction that is possible for this specific architecture and hyper-parameter settings. However, this scenario was not possible to test due to limitations in the memory in the GPU. Therefore, we have used a large enough**

*X**M*, which still allowed us to run the model. In particular, we used every fifth and second pixel in the horizontal and vertical directions, which resulted in a total of (32 × 25) measurement locations, or

*M*= 800. We believe that training the model with these settings gave us a good indication of the lower bound of the reconstruction error. The error observed was of the magnitude of 10

^{−3}.

This lower bound has been reached for all measurement configurations [see Eq. (23)]. However, a larger computational cost was needed to reach the lower bound for fewer measurement locations. Figure 8 shows the number of epochs as a boxplot diagram. For each measurement configuration and regularization technique, the model is run ten times. The variation of the number of epochs for each measurement locations is due to different priors of the weights and random mini-batch sampling. In comparison with GPOD, the SCVAE error is ten times lower than the GPOD-error, and this difference becomes larger with fewer measurements. Note that adding regularization did not have much effect on the relative error. From the motivating example, we observed that regularizing with *λ* > 0 is better in terms of a more consistent and low variable error estimation. Here, we selected from the ten trained models the one that performed best on the validation dataset. This model selection approach shows that there are no significant differences between the two regularization techniques. The associated errors in the divergence of the velocity fields are reported in Table II.

. | . | Measurement locations . | |||
---|---|---|---|---|---|

Method . | Regularization . | 5 . | 4 . | 3 . | 2 . |

SCVAE | λ = 0 | 0.1439 | 0.1580 | 0.1383 | 0.143 2 |

λ > 0 | 0.1533 | 0.1408 | 0.1468 | 0.141 0 | |

GPOD | λ = 0 | 0.1052 | 0.1047 | 0.0943 | 0.088 66 |

λ > 0 | 0.1039 | 0.1051 | 0.0966 | 0.066 9 |

. | . | Measurement locations . | |||
---|---|---|---|---|---|

Method . | Regularization . | 5 . | 4 . | 3 . | 2 . |

SCVAE | λ = 0 | 0.1439 | 0.1580 | 0.1383 | 0.143 2 |

λ > 0 | 0.1533 | 0.1408 | 0.1468 | 0.141 0 | |

GPOD | λ = 0 | 0.1052 | 0.1047 | 0.0943 | 0.088 66 |

λ > 0 | 0.1039 | 0.1051 | 0.0966 | 0.066 9 |

### B. Current data from Bergen ocean model

We tested the SCVAE on simulations from the Bergen Ocean Model (BOM).^{68} The BOM is a three-dimensional terrain-following nonhydrostatic ocean circulation model with capabilities of resolving mesoscale to large-scale processes. Here, we use velocities simulated by Ali *et al.*^{43} The simulations were conducted on the entire North Sea with 800 m horizontal and vertical grid resolution and 41 layers for the period from 1 January 2012 to 15 January 2012. Forcing of the model consists of wind, atmospheric pressure, harmonic tides, rivers, and initial fields for salinity and temperature. For details of the setup of the model, forcing, and the simulations, we refer to Ref. 43.

Here, the horizontal velocities in the bottom layer of an excerpt of the model domain, with dimensions 25.6 × 25.6 km^{2} in the southern North Sea [center at (58.36°*N*, 1.91°*E*)] are used as the dataset for reconstruction. In Fig. 9, we have plotted time series of the mean and extreme values of the two velocity components, *u* and *$v$*, for each time *t* in $T$.

#### 1. Preprocessing

We extract the 32 × 32 central grid from the bottom layer velocity data. Hence, $P$ contains *N* = 1024 points. The temporal resolution is originally 105 000, and the time between each time step is 1 min. We downsample the temporal dimension of the original data uniformly such that the number of time steps in $T$ is *K* = 8500. We train and validate the SCVAE with two different data splits: randomized and sequential in time. For the sequential split, we have used the last 15% for the test, the last 30% of the remaining data is used for validation, and the fist 70% for training. In Fig. 9, the red and blue vertical lines indicate the data split for this case. For the random split, the instances *x*^{(i)} are drawn randomly from ** X** with the same percentage. The data were scaled as described in Appendix B. The input

*x*^{(i)}to the SCVAE was shaped as (32 × 32 × 2) in order to apply convolutional layers. We use nine, five, and three fixed spatial measurement locations. In particular, the subgrid $Q$ is given as

As before, the values of *u* and *$v$* at these specific locations constitute the measurements *m*^{(i)} ∈ ** M**.

#### 2. Model

A schematic description of the model is given in Appendixes C 3 and C 4. The first two layers of the encoder are convolutional layers with 64 and 128 filters with strides and a kernel size of 2 and ReLu activation functions. This compresses the data into a shape of (8 × 8 × 128). The next layers are flattening and dense layers, where the latter have 16 filters and ReLu activation. The subsequent layers define the mean and log-variance of the latent representation ** z**, which is input to a lambda layer for realization of the reparameterization trick. The encoder outputs the samples

*z*^{(i)}, the mean, and the log-variance of

*z*^{(i)}.

The input to the decoder is the output *z*^{(i)} of the encoder and the measurement *m*^{(i)}. To concatenate the inputs, *m*^{(i)} is flattened. After concatenation of *z*^{(i)} and *m*^{(i)}, the next layer is a dense layer with shape (8 × 8 × 128) and ReLu activation. This allows for the use of transposed convolutional layers to obtain the original shape of the data. Hence, the following layers are two transposed convolutional layers with 64 and 128 filters, strides and kernel size of 2, and ReLu activation. The final layer is a transposed convolutional with linear activation functions and a filter size of shape (32 × 32 × 2), i.e., the same shape as *x*^{(i)}.

#### 3. Results

We illustrate the obtained posterior predictive distribution in terms of the predictive mean and standard deviation for the prediction at a specific time. The SCVAE is compared with the GPOD method, both with *λ* > 0 and *λ* = 0, for measurement locations given in Eq. (24) for random and sequential split cases. To generate the posterior predictive distributions [Eq. (17)], we sample 200 realizations from $z\u223cN(0,I)$, which allows for calculation mean prediction and uncertainty estimates [see Eq. (18)]. Figure 10 shows the results of the prediction at time step 1185 for both the *u* and *$v$* components and associated uncertainty estimates for a trained model with *λ* = 0 and *Q*_{3} measurement locations [see Eq. (24)].

In Fig. 11, we plot the true solution and the predicted mean velocity [Eq. (18)] with the associated uncertainty [Eq. (20)] for two grid points. We plot only the first 600 time steps for readability. The first grid point is (26, 6) and (4, 1). One location is ∼5.1 km from the nearest observation, and another one is about 16.1 km away.

Figure 12 has to be viewed in context with Fig. 11 and shows the difference between the true and the predicted solutions with the associated difference marginal in time for the two locations as in Fig. 11.

Integrating over the latent space generates a posterior distribution of the reconstruction, as described in Sec. III B 1. It is also possible to use the latent space to generate new statistically sound versions of *u* and *$v$*. This is presented in Fig. 13 where it is sampled uniformly over the two dimensional latent space $z\u223cN(0,I)$, and the result shows how different variations can be created with the SCVAE model, given only the sparse measurements.

These sampled velocities could be used for ensemble simulations when estimating uncertainty in a passive tracer transport (see Ref. 48).

The SCVAE results are compared with results of the GPOD method (see Tables III and IV). The tables show the errors as calculated in Eqs. (21) and (22) of the test dataset for both sequential and random split. For the sequential splitting, the SCVAE is better for three measurement locations, while the GPOD method performs better for nine and five locations. From Fig. 9, we observe that test dataset seems to arise from a different process than the train and validation data (especially for *$v$*). Thus, the SCVAE generalize worse than a simpler model such as the GPOD.^{75} For the three location case, the number of components in the GPOD is not enough to compete with the SCVAE.

. | . | . | Measurement locations . | ||
---|---|---|---|---|---|

Split . | Regularization . | Method . | 9 . | 5 . | 3 . |

Random | λ = 0 | SCVAE | 0.1379 | 0.2097 | 0.2928 |

GPOD | 0.3300 | 0.3822 | 0.4349 | ||

λ > 0 | SCVAE | 0.1403 | 0.2025 | 0.3016 | |

GPOD | 0.2971 | 0.3579 | 0.4039 | ||

Time dependent | λ = 0 | SCVAE | 0.3493 | 0.3913 | 0.4155 |

GPOD | 0.3767 | 0.4031 | 0.4678 | ||

λ > 0 | SCVAE | 0.3527 | 0.3889 | 0.4141 | |

GPOD | 0.3362 | 0.3695 | 0.4462 |

. | . | . | Measurement locations . | ||
---|---|---|---|---|---|

Split . | Regularization . | Method . | 9 . | 5 . | 3 . |

Random | λ = 0 | SCVAE | 0.1379 | 0.2097 | 0.2928 |

GPOD | 0.3300 | 0.3822 | 0.4349 | ||

λ > 0 | SCVAE | 0.1403 | 0.2025 | 0.3016 | |

GPOD | 0.2971 | 0.3579 | 0.4039 | ||

Time dependent | λ = 0 | SCVAE | 0.3493 | 0.3913 | 0.4155 |

GPOD | 0.3767 | 0.4031 | 0.4678 | ||

λ > 0 | SCVAE | 0.3527 | 0.3889 | 0.4141 | |

GPOD | 0.3362 | 0.3695 | 0.4462 |

. | . | . | Measurement locations . | ||
---|---|---|---|---|---|

Split . | Regularization . | Method . | 9 . | 5 . | 3 . |

Random | λ = 0 | SCVAE | 3.75 × 10^{−5} | 3.62 × 10^{−5} | 3.42 × 10^{−5} |

GPOD | 6.51 × 10^{−5} | 5.88 × 10^{−5} | 5.02 × 10^{−5} | ||

λ > 0 | SCVAE | 3.60 × 10^{−5} | 3.60 × 10^{−5} | 3.13 × 10^{−5} | |

GPOD | 6.23 × 10^{−5} | 4.77 × 10^{−5} | 4.14 × 10^{−5} | ||

Time dependent | λ = 0 | SCVAE | 2.02 × 10^{−5} | 1.80 × 10^{−5} | 1.69 × 10^{−5} |

GPOD | 5.09 × 10^{−5} | 4.03 × 10^{−5} | 4.15 × 10^{−5} | ||

λ > 0 | SCVAE | 2.05 × 10^{−5} | 1.99 × 10^{−5} | 1.85 × 10^{−5} | |

GPOD | 4.39 × 10^{−5} | 3.65 × 10^{−5} | 2.92 × 10^{−5} |

. | . | . | Measurement locations . | ||
---|---|---|---|---|---|

Split . | Regularization . | Method . | 9 . | 5 . | 3 . |

Random | λ = 0 | SCVAE | 3.75 × 10^{−5} | 3.62 × 10^{−5} | 3.42 × 10^{−5} |

GPOD | 6.51 × 10^{−5} | 5.88 × 10^{−5} | 5.02 × 10^{−5} | ||

λ > 0 | SCVAE | 3.60 × 10^{−5} | 3.60 × 10^{−5} | 3.13 × 10^{−5} | |

GPOD | 6.23 × 10^{−5} | 4.77 × 10^{−5} | 4.14 × 10^{−5} | ||

Time dependent | λ = 0 | SCVAE | 2.02 × 10^{−5} | 1.80 × 10^{−5} | 1.69 × 10^{−5} |

GPOD | 5.09 × 10^{−5} | 4.03 × 10^{−5} | 4.15 × 10^{−5} | ||

λ > 0 | SCVAE | 2.05 × 10^{−5} | 1.99 × 10^{−5} | 1.85 × 10^{−5} | |

GPOD | 4.39 × 10^{−5} | 3.65 × 10^{−5} | 2.92 × 10^{−5} |

With random split on the train, test, and validation data, we see that the SCVAE is significantly better than the GPOD. The training data and measurements represent the test data and test measurements better with random splitting. This highlights the importance of large datasets that cover as many outcomes as possible. Demanding that *λ* > 0 in Eq. (16) does not improve the result. From the SCVAE models with *λ* = 0, we learn that the reconstructed representations should have low divergence without explicitly demanding it during optimization. However, as discussed in the 2D flow around cylinder experiment, demanding *λ* > 0 seems to improve the conditioning of the optimization problem and give more consistent results. In Fig. 14, we present a boxplot of the number of epochs against the number of measurements. For each measurement configuration and regularization technique, the model is optimized ten times. The variation in the number of epochs for each measurement and regularization technique is due to different priors of the weights and mini-batch sampling.

## V. DISCUSSION

We have presented the SCVAE method for efficient data reconstruction based on sparse observations. The derived objective functions for the network optimization show that the encoding is independent of measurements. This allows for a simpler model structure with fewer model parameters than a CVAE and results in an optimization procedure that requires less computational resources.

We have shown that the SCVAE is suitable for the reconstruction of fluid flow. The method is showcased on two different datasets: velocity data from simulations of 2D flow around a cylinder and bottom currents from the BOM. The fact that the fluids studied in the experiments are incompressible served as a motivation for adding an extra constraint to the objective function with *λ* > 0.

Our investigation of additional regularization showed that the mean reconstruction errors over all models were lower with *λ* > 0 compared to the model where *λ* = 0, but the best reconstruction errors were similar for *λ* = 0 and *λ* > 0.

The SCVAE is a probabilistic model, which allows us to make predictions, estimate their uncertainty, and draw multiple samples from the predictive distribution. The last two properties make the SCVAE a useful method especially when the predictions are used as a component in a larger application, i.e., ensemble simulations of tracer transport. Motivated by Ref. 17, we compared the SCVAE predictions with the predictions of a modified GPOD method.

Unlike the GPOD method, a benefit with the SCVAE method is that it scales well to larger datasets. Another aspect and as suggested by the experiments in Sec. IV, the GPOD seems more sensitive to the number of measurement locations than the SCVAE. On the other hand, the experiments suggested that GPOD is better than SCVAE with a larger number of measurement locations if the training data and the test data are too different. Essentially, the SCVAE is overfit to the training data and, as a result, performs poorly on the test dataset. This fact shows the importance of training the SCVAE on large datasets, which covers as many potential flow patterns as possible. Furthermore, the results show that the GPOD is more sensitive to the measurement location choice than the SCVAE, and the GPOD method is not expected to perform well on a complex flow with very few fixed measurement locations.

VAEs have been used for generating data in computer vision,^{31} and autoencoders are natural choices in reconstruction tasks.^{13} Many reconstruction approaches, including the GPOD approach, first create a basis and then use the basis and minimize the error of the observations.^{50,76} This makes the GPOD suitable for fast optimization of measurement locations that minimize the reconstruction error. On the other hand, the SCVAE optimizes the basis function given the measurements, i.e., they are known and fixed. This makes it challenging to use the framework for optimizing sensor layout. However, if the measurement locations are fixed and large amounts of training data are available, the SCVAE outperforms the GPOD for reconstruction. SCVAE optimizes the latent representation and the neural network model parameters and variational and generative parameters, given the measurements. This ensures that the reconstruction is adapted to the specific configuration of measurements.

A limitation of our experiments is that we used only 100 and 200 samples and constructed the confidence region under further simplifying assumptions. The uncertainty estimate could be improved by increasing the sample size and better model for the confidence region.

Natural applications for the SCVAE are related to environmental data, where we often have sparse measurements. It is possible to optimize the sensor layout to the best possible one for detecting unintentional discharges in the marine environment by using a simple transport model, forced by given flow fields, to predict the area of influence (see the work of Oleynik *et al*).^{48} The SCVAE can be used to improve such an approach by efficiently generating realistic flow fields in a Monte Carlo framework. The incompressibility constraint is important in this case, and it assures conservation of mass in the transport model. Such a framework may be important as the input to design, environmental risk assessments, and emergency preparedness plans.

We have highlighted the SCVAE through the reconstruction of currents and flow field reconstruction; however, the SCVAE method is not limited to fluid flow problems. For instance, the same principles could be used in computer vision to generate a new picture based on sparse pixel representations or in time series reconstruction.

A survey paper on Bayesian networks that accounts for time dependence in the model itself, the so-called dynamical BNNs, was recently published.^{77} A variety of recurrent neural networks are presented, applicable for spatial and temporal data without explicitly addressing the flow reconstruction problem. A natural extension of the SCVAE can be to implement it as a dynamical BNN, i.e., to predict the current state *p*_{θ}(*x*_{t}|*m*_{t}, *x*_{t−1}), given the measurements and the reconstruction from the previous time step. This could potentially improve the reconstruction further.

## DATA AVAILABILITY

The 2D flow around a cylinder dataset is simulated by Weinkauf^{54} using the Free Software *Gerris Flow Solver*.^{55} The data that support the findings of this study are openly available in http://tinoweinkauf.net/notes/cylinder2d.html.^{53} The BOM dataset is simulated by Alfatih Ali and contains time series of CO_{2} concentration, velocity components, time, and position in longitude and latitude. The data that support the findings of this study are openly available in Zenodo at http://doi.org/10.5281/zenodo.806088.^{78}

## ACKNOWLEDGMENTS

This work is part of the project ACTOM, funded through the ACT program (Accelerating CCS Technologies, Horizon2020 Project No. 294766). This work received financial contributions from the Research Council of Norway (RCN), Norway; the Netherlands Enterprise Agency (RVO), the Netherlands; the Department for Business, Energy and Industrial Strategy (BEIS) together with extra funding from NERC and EPSRC research councils, United Kingdom; and the U.S. Department of Energy (U.S. DOE), USA. Kristian Gundersen was supported by the Research Council of Norway through the CLIMIT program (Project No. 254711, BayMode) and the European Union Horizon 2020 research and innovation program under Grant Agreement No. 654462, STEMM-CCS. The authors acknowledge NVIDIA Corporation for providing their GPUs in the academic GPU Grant Program.

### APPENDIX A: GPOD METHOD WITH DIVERGENCE REGULARIZATION

Let the vectors *x*^{(i)} in ** X** be organized as a snapshots matrix $Xh\u2208R2N\xd7K$. Here, we consider that the latent space be given by the

*r*principal components of the matrix

*X*_{h}, assuming that

*r*≪

*N*. Thus,

*X*_{h}≈ Φ

*A*

_{h}, where Φ is the 2

*N*×

*r*matrix of principal components and

*A*

_{h}= Φ

^{+}

*X*

_{h}is the

*r*×

*K*representation of

*X*

_{h}.

Let $x*\u2208R2N$ be an unknown state that we need to reconstruct from *M* spatial measurements *m*^{*} = *C**x*^{*}. We assume that there is an ** a** such that

*x*^{*}=

**Φ**, and we search for a solution of

*a***=**

*C*Φ*a*

*m*^{*}. Even if the system

**=**

*C*Φ*a*

*m*^{*}is overdetermined, the matrix

*C*Φ could be ill-conditioned or rank-deficient. However, since the number of sensors is usually small, 2

*M*<

*r*, the system is underdetermined and regularization is required.

Assuming that the flow is incompressible, the natural regularization is to penalize the divergence error of a solution. That is, we solve

where $Ldiv:R2N\u2192RN$ is a linear operator approximating the divergence and *λ* > 0 is a regularization constant. Finally, we decode *x*^{*} from the measurements *m*^{*} as *x*^{*} = **Φ a**

^{*}.

### APPENDIX B: SCALING OF DATA

Let $Ttrain$ contain the times $tli$, *i* = 1, …, *n*, corresponding to the training data. We define

and

as the largest and smallest values of *u* and *$v$* on $P$ and $Ttrain$. Then, the middle points are given as

and the half lengths are given as

Then, the whole data are scaled as

and the divergence operator *L*_{di$v$} is scaled accordingly.

After the optimization is completed, the data are scaled back, i.e.,

### APPENDIX C: DETAILS ON THE EXPERIMENTS

We use Keras^{79} in the implementation of the SCVAE for all experiments. Here, we present details on the architecture of the decoders and encoders for the different experiments (Figs. 15,Figs. 16,Figs. 17,Figs. 18). We have optimized the SCVAE models with different number of measurements. That is, the shape of the input layer to the decoder will be dependent on the measurements (sensor-input layer). Here, we present details on the architecture of the encoders and decoders with the largest number of measurements for SCVAE models for both experiments. There is one extra dimension in the figures showing the encoders and decoders. Here, this dimension is one, but the framework is implemented to allow for more dimensions in time.

#### 1. Encoder for 2D flow around cylinder data experiment

#### 2. Decoder for 2D flow around cylinder experiment

#### 3. Encoder for BOM data experiment

#### 4. Decoder for BOM data experiment

## REFERENCES

_{2}: A case study

_{2}signals from sources at the seafloor to help designing risk-based monitoring programs

_{2}leak in a varying marine environment

_{2}seep in a varying marine environment

_{2}discharges from unknown locations on the seafloor

_{2}in the vicinity of a seep in the North Sea

The simulations are run from *t* = 0 to *t* = 23, but velocities are only extracted from *t* = 15 to *t* = 23.