Models of many engineering and natural systems are imperfect. The discrepancy between the mathematical representations of a true physical system and its imperfect model is called the model error. These model errors can lead to substantial differences between the numerical solutions of the model and the state of the system, particularly in those involving nonlinear, multi-scale phenomena. Thus, there is increasing interest in reducing model errors, particularly by leveraging the rapidly growing observational data to understand their physics and sources. Here, we introduce a framework named MEDIDA: Model Error Discovery with Interpretability and Data Assimilation. MEDIDA only requires a working numerical solver of the model and a small number of noise-free or noisy sporadic observations of the system. In MEDIDA, first, the model error is estimated from differences between the observed states and model-predicted states (the latter are obtained from a number of one-time-step numerical integrations from the previous observed states). If observations are noisy, a data assimilation technique, such as the ensemble Kalman filter, is employed to provide the analysis state of the system, which is then used to estimate the model error. Finally, an equation-discovery technique, here the relevance vector machine, a sparsity-promoting Bayesian method, is used to identify an interpretable, parsimonious, and closed-form representation of the model error. Using the chaotic Kuramoto–Sivashinsky system as the test case, we demonstrate the excellent performance of MEDIDA in discovering different types of structural/parametric model errors, representing different types of missing physics, using noise-free and noisy observations.

The discovery of the governing equations of physical systems is often based on the first principles, which has been the origin of most advances in science and engineering. However, in many important applications, some of the underlying physical processes are not well understood and, therefore, are missing or poorly represented in the mathematical models of such systems. Accordingly, these *imperfect models* (“models” hereafter) can merely closely track the dynamics of the *true physical system* (“system” hereafter) while failing to exactly represent it. Global climate models (GCMs) and numerical weather prediction (NWP) models are prominent examples of such models, often suffering from parametric and structural errors. The framework proposed here integrates machine learning (ML) and data assimilation (DA) techniques to discover close-form, interpretable representations of these model errors. This framework can leverage the ever-growing abundance of observational data to reduce the errors in models of nonlinear dynamical systems, such as the climate system.

## I. INTRODUCTION

The difference between the solution of the model and the system becomes prominent in many problems involving complex, nonlinear, multi-scale phenomena such as those in engineering,^{1–3} thermo-fluids,^{4,5} and climate/weather prediction;^{6,7} see Levine and Stuart^{8} for an insightful overview. The deviation of the model from the system is called, in different communities, model uncertainty (structural and/or parametric), model discrepancy, model inadequacy, missing dynamics, or “model error”; hereafter, the latter is used.

Recently, many studies have focused on leveraging the rapid advances in ML techniques and availability of data (e.g., high-quality observations) to develop more accurate models. Several main approaches include learning fully data-driven (equation-free) models^{9–15} or data-driven subgrid-scale closures.^{16–24} In a third approach, corrections to the state or its temporal derivative (tendency) are learned from the deviation of the model predictions from the observations.^{25–29} More specifically, the model is initialized with the observed state, integrated forward in time, and the difference between the predicted state and the observation at the later time is computed. Repeated many times, a correction scheme, e.g., a deep neural network (DNN), can be trained to nudge the model’s predicted trajectory (or tendency) to that of the system. To deal with observations with noise, a number of studies have integrated DA with DNNs.^{30–36}

These studies often used DNNs, showing promising results for a variety of test cases. However, while powerfully expressive, DNNs are currently hard to interpret and often fail to generalize (especially extrapolate) when the systems’ parameters change.^{17,24,37,38} The interpretability of models is crucial for robust, reliable, and generalizable decision-critical predictions or designs.^{3,7} Posing the task of system identification as a linear regression problem, based on a library of nonlinear terms, exchanges the expressivity of DNNs for the sake of interpretability.^{39,40} The closed-form representation of the identified models and their parsimony (i.e., sparsity in the space of the employed library) is the key advantage of these methods, leading to highly interpretable models. A number of studies have used such methods to discover closed-form full models or closures.^{40–46} While the results are promising, noisy data, especially in the chaotic regimes, significantly degrades the quality of the discovered models.^{40,43,45,46}

So far, there has not been any work on discovering closed-form representation of model error using the differences between model predictions and observations (approach 3) or on combining the sparsity-promoting regression methods with DA to alleviate observation noise. Here, we introduce MEDIDA (Model Error Discovery with Interpretability and Data Assimilation), a general-purpose, data-efficient, non-intrusive framework to discover the structural (and parametric) model errors in the form of missing/incorrect terms of partial differential equations (PDEs). MEDIDA uses differences between predictions from a working numerical solver of the model and noisy sporadic (sparse in time) observations. The framework is built on two main ideas:

discovering interpretable and closed-form model errors using relevance vector machine (RVM), a sparsity-promoting Bayesian regression method,

^{47}andreducing the effects of observation noise using DA methods such as ensemble Kalman filter (EnKF) to generate the “analysis state.”

Below, we present the problem statement and introduce MEDIDA. Subsequently, its performance is demonstrated using a chaotic Kuramoto–Sivashinsky (KS) test case.

## II. PROBLEM STATEMENT

Suppose the exact mathematical representation of a system is a nonlinear PDE,

in a continuous domain. Here, $u(t)$ is the state variable at time $t$. While (1) is unknown, we assume to have access to sporadic pairs of observations of the state variable $uo$. These observations might be contaminated by measurement noise. The set of observed states at $ti$ is denoted as ${uo(ti\u2212\Delta ti),uo(ti)}i=1n$. Note that $ti$ do not have to be equally spaced. Furthermore, $\Delta ti$ should be similar for all $i$ but do not have to be the same (hereafter, we use $\u2200i,\Delta ti=\Delta t$ for convenience).

Moreover, suppose that we have a *model* of the system,

Without loss of generality,^{8} we assume that the deviation of (2) from (1) is additive; we further assume that the deviation is only a function of the state.^{30,31} Therefore, the model error is

Our goal is to find a closed-form representation of $h$ given a working numerical solver of (2) and noisy or noise-free observations ${uo(ti\u2212\Delta t),uo(ti)}i=1n$.

## III. FRAMEWORK: MEDIDA

MEDIDA has three main steps (Fig. 1): Step (1) Collecting sporadic observations of the system and numerically integrating the model forward for short-time (Sec. III A); Step (2) Construction and solving a regression problem using RVM,^{47} which leads to the sparse identification of the model error (Sec. III B); and Step (3) If observations are noisy, following step 1, stochastic EnKF^{48,49} is used to estimate an analysis state, an estimate of the system’s state, for Step 2 (Sec. III C). We emphasize that at no point does MEDIDA need any knowledge of the system (1). Also, note that other equation-discovery techniques and ensemble-based DA methods can be used in steps 2 and 3.

### A. Interpretable model error

Consider the discretized (2),

For brevity, we use the notation of an explicit scheme but the scheme can also be implicit, as shown in the test case. The domain is discretized on the grid of $x\u2208RM$. The observation at $ti\u2212\Delta t$, i.e., $uo(ti\u2212\Delta t)$, is the initial condition and $um(ti)$ is the predicted state at time $ti$ [Fig. 1(a)].

At each time $ti$, subtracting the state predicted by the model ($um(ti)$) from the observed state ($uo(ti)$) results in an approximation of the model error at $ti$ [Fig. 1(b)],

The difference between the states of the system and the model is similar to the analysis increment in the DA literature, where the best estimate of the state replaces $uo(ti)$ in (5) (see Sec. III C). The idea of accounting for model error through the analysis increment was first introduced by Leith^{50} and is at the core of many recent applications of ML aiming to *nudge* the model to its correct trajectory^{31} or to account for the model error.^{51,52} However, in this manuscript, we aim to *discover* an interpretable representation of model error, $h(u(t))$.

Note that to accurately discover $h(u(t))$ as defined in (3), $\Delta ui$ should be *dominated by the model error*, and the contributions from numerical errors [in obtaining $um(ti)$] and measurement errors [in $uo(ti)$] should be minimized as much as possible. As discussed in Sec. III C, DA can be utilized to reduce the contributions from the observation noise. Computing $\Delta ui$ after only one $\Delta t$ prevents the accumulation of numerical error from integrating (4). Moreover, this avoids nonlinear growth of the model error and its further entanglement with truncation errors, which can complicate the discovery of just model error. Therefore, in MEDIDA, the model time step and observation intervals are set to be the same. Note that while the size of $\Delta t$ could be restricted by the availability of the observation pairs, increasing $M$ can be used to reduce truncation errors from spatial discretizations.

Integrating (4) from $uo(ti\u2212\Delta t)$ and computing $\Delta ui$ from (5) are repeated for $n$ samples (step 1), and the vectors of model error are stacked to form $\Delta u\u2208RnM$ (step 2) [see Figs. 1(a) and 1(b)]. Similar to past studies,^{40,41,43,53} we further assume that the model error spans over the space of the bases or training vectors, i.e.,

where $\varphi i$ is a linear or nonlinear function of the state variable, the building block of the library of training bases ${\varphi k(u(t))}k=1q$. The selection of the bases should be guided by the physical understanding of the system and the model.

Here, in the discretized form, we assume that $\Phi (u)$ is a linear combination of polynomial terms up to $pth$-order and spatial derivatives up to $dth$-order, i.e.,

where in $ur$, each element of $u$ is raised to the power of $r$, $Dxsu$ denotes the $sth$ spatial derivative of $u$, and $\u2299$ is the element-wise (Hadamard) product. Therefore, the model error lies on the space of library of the training bases evaluated using the observed state at each $ti$, $\Phi (uo,ti)\u2208RM\xd7q$. For all the $n$ samples, the library of the bases are stacked to form $\Phi (uo)\u2208RnM\xd7q$ [Fig. 1(b)].

At the end of step 2, the regression problem for discovery of model error is formulated as

where $c\u2208Rq$ is the vector of coefficients corresponding to the training bases, and $c\u2217=[c1\u2217,c2\u2217,\u2026,cq\u2217]\u22a4$ is a minimizer of the regression problem, i.e., the vector of coefficients of the model error. Finally, the discovered (corrected) model is identified as

### B. Solution of the regression problem

In this study, we use RVM^{47} to compute $c\u2217$ in (8) for inputs of $\Delta u$ and $\Phi (uo)$. RVM leads to a sparse identification of columns of $\Phi $ with posterior distribution of the corresponding weights, i.e., the *relevant vectors*. See Zhang and Lin^{41} for a detailed discussion in the context of PDEs discovery.

The breakthrough in equation-discovery originates in the introduction of parsimony.^{39} While the original least absolute shrinkage and selection operator (LASSO)-type regularization has yielded promising results in a broad range of applications, RVMs have been found to achieve the desired sparsity by a more straightforward hyper-parameter tuning and a relatively lower sensitivity to noise.^{41,44,54} The hyper-parameter in RVMs is a threshold to prune the bases with higher posterior variance, i.e., highly uncertain bases are removed. To avoid over-fitting, we choose the hyper-parameter as a compromise between the sparsity and accuracy of the corrected model at the elbow of the L-curve, following Mangan *et al.*^{55}

### C. Data assimilation

Steps 1 and 2 described above suffice to find an accurate corrected model (9) if the fully observed state is noise-free. However, noise can cause substantial challenges in the sparse identification of nonlinear systems, a topic of ongoing research.^{40,42,43,54,56} In MEDIDA, we propose to use DA, a powerful set of tools to deal with noisy and partial observation (spatiotemporal sparsity). Here, we use stochastic EnKF ^{57} [Fig. 1(c)]. In this study, we assume that observations of the full state are available; i.e., $H(u(t))=u(t)\u2208RM$, where $H(.)$ is the observation operator. Dealing with partial observations and more general forms of observation errors (e.g., as in Hamilton *et al.*^{58}) remains subject of future investigations.

The result of this step is the analysis state used to construct the vector of model error and the library of the bases [Fig. 1(b)].

At each sample time $ti\u2212\Delta t$, the observations are further perturbed with Gaussian white noise to obtain an ensemble of size $N$ of the initial conditions,

where $\sigma b$ is standard deviation of the observation noise ($\sigma obs$) times an inflation factor,^{48,59} and $k$ denotes the $kth$ ensemble member.

Subsequently, the model is evolved for each of these ensemble members, i.e., $ukm(ti)=f(uko(ti\u2212\Delta t))$. This ensemble of model prediction is used to construct the background covariance matrix as

where $um\xaf(ti)$ is the ensemble average of the model prediction and $E[.]$ denotes the expected value. Having the observation operator to be linear, and non-sparse, the Kalman gain, $K\u2208RM\xd7M$, is then calculated as

where $I\u2208RM\xd7M$ is the identity matrix.

For each sample at the prediction time $ti$, the $k$th member of the ensemble of the observations is generated as

Note that while the EnKF can be used for non-Gaussian observation noise, the method would be sub-optimal and may require further modification.^{60} Hereon, we limit our experiments to Gaussian noise.

Finally, the $k$th member of the ensemble of the analysis state is

The process of EnKF assimilates the noisy observations to the background forecast obtained from the model, $ukm(ti)$. Subsequently, the ensemble average of the analysis states, $ua\xaf(ti)$, is treated as a close estimate of the true state of the system, $uo(ti)$ in Sec. III A.

Similarly, the difference between the ensemble averages of the analysis state and the model prediction, i.e.,

There are a few points about this DA step that needs to be further clarified. First, as discussed in Sec. III A, the model is evolved for only one time step. Although this may hinder the traditional spin-up period used in EnKF, as a compromise, we resort to a large ensemble and inflation to ensure a successful estimation of the state.

### D. Performance metric

To quantify the accuracy of MEDIDA, we use the normalized distance between the vector of coefficients, defined as

where $cs$, $cm$, $c\u2217$ are coefficient vectors of system (1), the model (2), and the corrected model (9).^{40,43} The $ith$ element of the coefficient vector is a scalar corresponding to the $ith$ term in the equation. Quantitatively, the goal is to achieve $\epsilon \u2217\u226a\epsilon m$. Note that the implicit assumption in this metric is that the model and the model error can be expressed on the space spanned by the bases of the library.

## IV. TEST CASE: KS EQUATION

To evaluate the performance of MEDIDA, we use a chaotic KS system, a challenging test case for system identification, particularly from noisy observations.^{40,43} The PDE is

where $u\u2202xu$ is convection, $\u2202x2u$ is anti-diffusion (destabilizing), and $\u2202x4u$ is hyper-diffusion (stabilizing). The domain is periodic, $u(0,t)=u(L,t)$. We use $L=32\pi $, which leads to a highly chaotic system.^{9,62}

Here, (17) is the system. We have created nine (imperfect) models as shown in Table I. In cases $1$–$3$, one of the convection, anti-diffusion, or hyper-diffusion terms is entirely missing (i.e., structural uncertainty). In cases $4$–$6$, some or all of the coefficients of the system are incorrect (parametric uncertainty). Finally, in cases $7$-$9$, a mixture of parametric and structural uncertainties, with missing and extra terms, is present.

. | Model: $0=\u2202tu+$ . | $\epsilon m(%)$ . | Corrected model (noise-free): $0=\u2202tu+$ . | $\epsilon \u2217(%)$ . | Corrected model (noisy): $0=\u2202tu+$ . | $\epsilon \u2217(%)$ . |
---|---|---|---|---|---|---|

1 | $\u2202x2u+\u2202x4u$ | 57.74 | $0.97u\u2202xu+\u2202x2u+\u2202x4u$ | 1.50 | $0.98u\u2202xu+\u2202x2u+\u2202x4u$ | 1.43 |

2 | $u\u2202xu+\u2202x4u$ | 57.74 | $u\u2202xu+0.99\u2202x2u+\u2202x4u$ | 0.39 | $u\u2202xu+0.98\u2202x2u+\u2202x4u$ | 0.98 |

3 | $u\u2202xu+\u2202x2u$ | 57.74 | $u\u2202xu+\u2202x2u+1.02\u2202x4u$ | 1.08 | $u\u2202xu+\u2202x2u+1.02\u2202x4u$ | 0.98 |

4 | $0.5u\u2202xu+\u2202x2u+\u2202x4u$ | 28.87 | $0.99u\u2202xu+\u2202x2u+\u2202x4u$ | 0.75 | $0.99u\u2202xu+\u2202x2u+\u2202x4u$ | 0.63 |

5 | $u\u2202xu+0.5\u2202x2u+\u2202x4u$ | 28.87 | $u\u2202xu+1.00\u2202x2u+\u2202x4u$ | 0.14 | $u\u2202xu+0.98\u2202x2u+\u2202x4u$ | 0.99 |

6 | $u\u2202xu+\u2202x2u+0.5\u2202x4u$ | 28.87 | $u\u2202xu+\u2202x2u+1.00\u2202x4u$ | 0.14 | $u\u2202xu+\u2202x2u+1.00\u2202x4u$ | 0.15 |

7 | $0.5u\u2202xu+2\u2202x4u$ | 86.60 | $0.97u\u2202xu+1.00\u2202x2u+\u2202x4u$ | 1.55 | $0.98u\u2202xu+0.98\u2202x2u+\u2202x4u$ | 1.63 |

8 | $u\u2202xu+\u2202x2u+0.5\u2202x3u+\u2202x4u$ | 28.87 | $u\u2202xu+\u2202x2u+\u2202x4u$ | 0.28 | $u\u2202xu+\u2202x2u\u22120.01\u2202x3u+\u2202x4u$ | 0.83 |

9 | $u\u2202xu+\u2202x2u+0.5\u2202x3u$ | 64.55 | $u\u2202xu+\u2202x2u\u22120.01\u2202x3u+1.02\u2202x4u$ | 1.29 | $u\u2202xu+\u2202x2u\u22120.01\u2202x3u+1.00\u2202x4u$ | 0.55 |

. | Model: $0=\u2202tu+$ . | $\epsilon m(%)$ . | Corrected model (noise-free): $0=\u2202tu+$ . | $\epsilon \u2217(%)$ . | Corrected model (noisy): $0=\u2202tu+$ . | $\epsilon \u2217(%)$ . |
---|---|---|---|---|---|---|

1 | $\u2202x2u+\u2202x4u$ | 57.74 | $0.97u\u2202xu+\u2202x2u+\u2202x4u$ | 1.50 | $0.98u\u2202xu+\u2202x2u+\u2202x4u$ | 1.43 |

2 | $u\u2202xu+\u2202x4u$ | 57.74 | $u\u2202xu+0.99\u2202x2u+\u2202x4u$ | 0.39 | $u\u2202xu+0.98\u2202x2u+\u2202x4u$ | 0.98 |

3 | $u\u2202xu+\u2202x2u$ | 57.74 | $u\u2202xu+\u2202x2u+1.02\u2202x4u$ | 1.08 | $u\u2202xu+\u2202x2u+1.02\u2202x4u$ | 0.98 |

4 | $0.5u\u2202xu+\u2202x2u+\u2202x4u$ | 28.87 | $0.99u\u2202xu+\u2202x2u+\u2202x4u$ | 0.75 | $0.99u\u2202xu+\u2202x2u+\u2202x4u$ | 0.63 |

5 | $u\u2202xu+0.5\u2202x2u+\u2202x4u$ | 28.87 | $u\u2202xu+1.00\u2202x2u+\u2202x4u$ | 0.14 | $u\u2202xu+0.98\u2202x2u+\u2202x4u$ | 0.99 |

6 | $u\u2202xu+\u2202x2u+0.5\u2202x4u$ | 28.87 | $u\u2202xu+\u2202x2u+1.00\u2202x4u$ | 0.14 | $u\u2202xu+\u2202x2u+1.00\u2202x4u$ | 0.15 |

7 | $0.5u\u2202xu+2\u2202x4u$ | 86.60 | $0.97u\u2202xu+1.00\u2202x2u+\u2202x4u$ | 1.55 | $0.98u\u2202xu+0.98\u2202x2u+\u2202x4u$ | 1.63 |

8 | $u\u2202xu+\u2202x2u+0.5\u2202x3u+\u2202x4u$ | 28.87 | $u\u2202xu+\u2202x2u+\u2202x4u$ | 0.28 | $u\u2202xu+\u2202x2u\u22120.01\u2202x3u+\u2202x4u$ | 0.83 |

9 | $u\u2202xu+\u2202x2u+0.5\u2202x3u$ | 64.55 | $u\u2202xu+\u2202x2u\u22120.01\u2202x3u+1.02\u2202x4u$ | 1.29 | $u\u2202xu+\u2202x2u\u22120.01\u2202x3u+1.00\u2202x4u$ | 0.55 |

The system and the models are numerically solved using different time-integration schemes and time-step sizes to introduce additional challenges and a more realistic test case. To generate $uo$, (17) is integrated with the exponential time-differencing fourth-order Runge–Kutta^{63} with time-step $\Delta to$ [in Fig. 1(a), $1t=103\Delta to$]. The models are integrated with second-order Crank–Nicolson and Adams–Bashforth schemes with $\Delta tm=\Delta t=5\Delta to$. For both system and models, $M=1024$ Fourier modes are used (different discretizations or $M$ for system and models could have been used too). $uo$ is collected after a spin-up period of $\tau =200\Delta t$ and uniformly at $ti=i\tau $. $\Phi $, is constructed with $d=4$ and $p=4$ as defined in (7). Open-source codes are used to generate $\Phi $ ^{40} and solve (8) using RVM.^{44}

### A. Noise-free observations

First, we examine the performance of MEDIDA for noise-free observations (only steps 1 and 2). The motivation is twofold: (i) to test steps 1 and 2 before adding the complexity of noisy observations/DA, and (ii) in many applications, an estimate of $uo$ is already available (see Sec. V).

Here, $uo$ are from the numerical solutions of (17). The models, MEDIDA-corrected models, and the corresponding errors for $n=100$ samples show $\epsilon \u2217<2%$, between 40 and 200 times improvement compared to $\epsilon m$ (Table I).

To conclude, MEDIDA is a data-efficient framework. Its performance is insensitive to $n$ such that $\epsilon \u2217$ changes by $0.1%$ for $n\u2208{10,100,1000}$. There are two reasons for this: (i) RVM is known to be data-efficient (compared to DNNs)^{64} and (ii) each grid point at $ti$ provides a data-point, i.e., a row in $\Delta u$ and $\Phi $, although these are not all independent/uncorrelated data-points.

### B. Noisy observations

Next, we examine the performance of MEDIDA for noisy observations, obtained from adding noise to the numerical solution of KS: $uo(t)=u(t)+N(0,\sigma obs2).$ Here, $\sigma obs/\sigma u=0.01$, where $\sigma u$ is the standard deviation of the state (equivalent to $0.1$ following the methodology in Refs. 43 and 65). Without step 3 (DA), the model error discovery fails, leading to many spurious terms and $\epsilon \u2217$ of $O(10%)--O(100%)$, comparable or even worse than the model error, $\epsilon m$ (Fig. 2).

Table I shows the corrected models (with steps 1–3). With $N=10M$, in all cases except for 3, 6, and 9, $\epsilon \u2217<2%$, between 30 and 60 times lower than $\epsilon m$. For those three cases, increasing $N$ further reduces $\epsilon \u2217$ (Fig. 2), leading to the discovery of accurate models with $\epsilon \u2217<1%$ at large ensembles with $N=100M\u2212500M$ (Table I). Note that one common aspect of these cases is that the model is missing or mis-representing the hyper-diffusion term. The larger $N$ is needed to further reduce the noise in the analysis state to prevent amplification of the noise due to the higher-order derivative of this term. It should also be highlighted that while $N$ at the order of $M$ or larger might seem impractical, each ensemble member requires only *one* time-step integration (thus, a total of $nN$ time steps).

MEDIDA is found to be data-efficient in these experiments too. Like before, $\epsilon \u2217$ changes by $0.1%$ for $n\u2208{10,100,1000}$ for all cases except for 3, 6, and 9. For these three cases, $\epsilon \u2217$ improves by about $10%$ as $n$ is increased from $10$ to $100$ but then changes by only $0.1%$ when $n$ is further increased to $1000$.

## V. DISCUSSION AND SUMMARY

We introduced MEDIDA, a data-efficient and non-intrusive framework to discover interpretable, closed-form structural (or parametric) model error for chaotic systems. MEDIDA only needs a working numerical solver of the model and $n$ sporadic pairs of noise-free or noisy observations of the system. Model error is estimated from differences between the observed and predicted states (obtained from one-time-step integration of the model from earlier observations). A closed-form representation of the model error is estimated using an equation-discovery technique such as RVM. MEDIDA is expected to work accurately if (i) the aforementioned differences are dominated by model error and not by numerical (discretization) error or observation noise and (ii) the library of the bases is adequate (further discussed below). Regarding (i), the numerical error can be reduced by using higher numerical resolutions while the observation noise can be tackled using DA techniques such as EnKF.

The performance of MEDIDA is demonstrated using a chaotic KS system for both noise-free and noisy observations. In the absence of noise, even with a small $n$, MEDIDA shows excellent performance and accurately discovers linear and nonlinear model errors, leading to corrected models that are very close to the system. These results already show the potential of MEDIDA as in some important applications, noise-free estimates of the system (reanalysis states) are routinely generated and are available; this is often the case for atmosphere and some other components of the Earth system.^{66–68} For example, such reanalysis states have been recently used to correct the model errors of an NWP model,^{28} a quasi-geostrophic (QG) model,^{31} and a modular arbitrary-order-ocean-atmosphere model (MAOOAM).^{69}

In the presence of noise, the model error could not be accurately discovered without DA. Once EnKF is employed, MEDIDA accurately discovers linear and nonlinear model errors. A few cases with model errors involving linear but fourth-order derivatives require larger ensembles. This is because higher-quality analysis states are needed to avoid amplification of any remaining noise as a result of high-order derivatives.

Although the number of ensemble members used in MEDIDA is more than what it is commonly used in operational DA, the associated cost is tractable since each ensemble is evolved only for one time step. This is another advantage of evolving the model for only one time step, which as discussed before, and is also motivated by reducing the accumulation of truncation errors and nonlinear growth of the model errors. Still, if needed, inexpensive surrogates of the model that could provide accurate one-time-step forecasts can be used to efficiently produce the ensembles.^{35,70}

Also, note that in EnKF, ensemble members are evolved using the model. If the model error is large, the analysis state might be too far from the system’s state, degrading the performance of MEDIDA. In all cases examined here, even though the model errors are large, with large-enough ensembles, the approximated analysis states are accurate enough to enable MEDIDA to discover the missing or inaccurate terms. However, in more complex systems, this might become a challenge that requires devising new remedies. One potential resolution is an iterative procedure, in which the analysis state is generated and a corrected model is identified, which is then used to produce a new analysis state and a new corrected model, and this continues until convergence. Although such iterative approaches currently lack proof of convergence, they have shown promises in similar settings;^{39,58,71,72} for example, Hamilton *et al.*^{58} update the error in the observations iteratively until the convergence of the estimated state from the EnKF. The corresponding cost and possible convergence properties of such an iterative approach for MEDIDA remain to be investigated. Similarly, other methods for dealing with noise in equation discovery^{43,56,73} could also be integrated into MEDIDA.

The choice of an adequate library for RVM is crucial in MEDIDA. Although an exhaustive library of the training vectors with any arbitrary bases is straightforward, it quickly becomes computationally intractable. Any *a priori* knowledge of the system, such as locality, homogeneity,^{30} Galilean invariance, and conservation properties,^{41} can be considered to construct a concise library. Conversely, the library can be expanded to “explore the computational universe,” e.g., using gene expression programming.^{74} Even further, additional constraints, such as stability of the corrected model, can be imposed.^{75–77} Effective strategies for the selection of an adequate and concise library can be investigated in future work using more complex test cases.

Finally, beyond dealing with noisy observations, another challenge in the discovery of interpretable model errors is approximating the library given spatially sparse observations. Approaches that leverage auto-differentiation for building the library in equation discovery have recently shown promising results^{78} and can be integrated into MEDIDA.

MEDIDA is shown to work well for a widely used but simple chaotic prototype, the KS system. The next step will be investigating the performance of MEDIDA for more complex test cases, such as a two-layer quasi-geostrophic model. Scaling MEDIDA to more complex, higher dimensional systems will enable us to discover interpretable model errors for GCMs, NWP models, and other models of the climate system.

## ACKNOWLEDGMENTS

We are grateful to Yifei Guan for helpful comments on the manuscript. We also would like to thank two anonymous reviewers for their constructive and insightful comments. This work was supported by an award from the ONR Young Investigator Program (No. N00014-20-1-2722), a grant from the NSF CSSI Program (No. OAC-2005123), and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program. Computational resources were provided by NSF XSEDE (Allocation No. ATM170020) and NCAR’s CISL (Allocation No. URIC0004).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Rambod Mojgani:** Conceptualization (equal); Investigation (equal); Methodology (equal); Writing – original draft (equal); Writing – review & editing (equal). **Ashesh Chattopadhyay:** Conceptualization (equal); Methodology (equal); Writing – review and editing (equal). **Pedram Hassanzadeh:** Conceptualization (equal); Funding acquisition (lead); Resources (lead); Writing – review and editing (equal).

## DATA AVAILABILITY

The code and data that support the findings of this study are available at https://github.com/envfluids/MEDIDA.

## REFERENCES

*Data Assimilation: A Mathematical Introduction*, Texts in Applied Mathematics (Springer International Publishing, 2015).