We consider the problem of data-assisted forecasting of chaotic dynamical systems when the available data are in the form of noisy partial measurements of the past and present state of the dynamical system. Recently, there have been several promising data-driven approaches to forecasting of chaotic dynamical systems using machine learning. Particularly promising among these are hybrid approaches that combine machine learning with a knowledge-based model, where a machine-learning technique is used to correct the imperfections in the knowledge-based model. Such imperfections may be due to incomplete understanding and/or limited resolution of the physical processes in the underlying dynamical system, e.g., the atmosphere or the ocean. Previously proposed data-driven forecasting approaches tend to require, for training, measurements of all the variables that are intended to be forecast. We describe a way to relax this assumption by combining data assimilation with machine learning. We demonstrate this technique using the Ensemble Transform Kalman Filter to assimilate synthetic data for the three-variable Lorenz 1963 system and for the Kuramoto–Sivashinsky system, simulating a model error in each case by a misspecified parameter value. We show that by using partial measurements of the state of the dynamical system, we can train a machine-learning model to improve predictions made by an imperfect knowledge-based model.

A common desired function is that of cyclic forecasting of a chaotic system’s state, in which a set of forecasts to various time increments in the future is periodically produced once every cycle time $T$ (e.g., for weather forecasting, $T=6$ h is common). To do this, at the beginning of each cycle, available measurements taken during the last cycle are combined with (“assimilated into”) a forecast of the current state made at the beginning of the previous cycle to produce an estimate of the current system state. This estimate then serves as the initial condition for the prediction model, which produces the next set of forecasts and so on. The process of cyclic formation of the required initial conditions is called “Data Assimilation.” Recently, prediction models utilizing machine learning have been developed and shown to have great promise. Thus, the training and use of machine-learning-assisted prediction models in cyclic forecasting systems is of considerable interest. To enable this, it is necessary to train the machine-learning system, using measured data, to forecast variables that are not measured. Here, we show that this can be accomplished using data assimilation and demonstrate the ability of machine learning to improve the forecast of variables for which no directly measured training data are available.

## I. INTRODUCTION

The application of machine learning (ML) to forecasting chaotic dynamical systems (e.g., Ref. 24) has been a subject of much recent interest.^{2,5–8,10,17,26,29–31,40} Promising results of tests promote the possibility of using ML for large scale applications such as weather forecasting. Particularly, Ref. 29 indicates that machine-learning techniques can be scaled effectively to forecast arbitrarily high-dimensional chaotic attractors. In cases such as weather forecasting where a skillful knowledge-based model is available, hybrid approaches that use ML to correct model deficiencies are especially promising.^{31,33} By “knowledge-based,” we mean derived from fundamental principles such as the laws of physics.

So far, most data-driven forecasting approaches assume that training data are available for all the variables to be forecast. This assumption is relaxed in Refs. 5–8, which describe techniques to develop a data-driven forecast model on a specified, but potentially only partially measured, model state space. Using an approach different from those in Refs. 5–8, in this paper, we develop a technique for hybrid knowledge-based/ML systems in the partially measured scenario. Both our hybrid approach and the purely data-driven approaches in Refs. 5, 6, 8, and 37 are based on the paradigm of data assimilation (DA), which has long been used in weather forecasting and other applications.

When a forecast model is available, DA (e.g., Refs. 3, 12, and 21) determines initial conditions for an ongoing sequence of forecasts, based on an ongoing sequence of measurements. DA does not require that current measurements are sufficient to determine the current model state; alternatively, it typically uses the forecast model to propagate information from past measurements to augment current information. This technique helps both to constrain the model state and to filter noise in the measurements.

Ideas from ML and DA have been combined in various ways in the literature. DA has been used with data-driven forecast models based on delay-coordinate embedding^{14} and analog forecasting.^{22} On the other hand, with appropriate training data, machine-learning techniques can be taught to perform DA without explicitly formulating a forecast model.^{25} DA has also been employed as a paradigm and a tool for the training of machine-learning systems.^{1,5–8,13,32}

We give a brief introduction to DA in Sec. II A. In Sec. II B, we formulate a potentially effective scheme combining DA (Sec. II A) with machine-learning-assisted prediction (as described in Ref. 31) to create a cyclic forecast system. In Sec. II C, we propose an iterative procedure to improve the DA cycle and potentially produce improved ML-assisted forecast models. In Sec. III, we show the results of numerical experiments.

## II. METHODS

We consider the problem of forecasting an autonomous, chaotic, nonlinear dynamical system represented by the equation

In Eq. (1), $x~(t)$ is the true dynamical system state at time $t$ and $F$ is the vector field. We allow for the case where $x~$ is infinite-dimensional, as it is in a partial differential equation. We assume that we are only able to obtain partial measurements of the state $x~$ at regular time intervals. We denote the measurements made at time $k\Delta t$ by $yk$ so that

In Eq. (2), $H~$ is often called the observation operator; here, we call it the measurement operator. For our exposition, we assume $H~$ to be linear, as will be true in the numerical experiments performed in this paper. (Many DA methods allow a nonlinear measurement operator, and our ML method is compatible with this case.) The random variable $\eta k$ represents measurement noise, which is typically assumed to be Gaussian with mean $0$ and covariance matrix $R$.

In applications, we do not have full knowledge of the dynamics of the system, and, as such, Eq. (1) is unknown. What often is available is an imperfect model of the dynamics, with model error due to imperfect representations of some of the physical processes that govern the system. We assume that we have access to such an imperfect “knowledge-based” model denoted by $G$ and that the dynamical equation

can, via evolution of $x$, be used to forecast future $x~$-dependent properties of interest. In some applications, Eq. (1) represents a partial differential equation, and $x$ approximates $x~$ on a model grid. Thus, we assume that there is a projection operator (which could be the identity) from the space of $x~$ to the space of $x$. We write $xkt$ to represent the projection of the true state $x~(k\Delta t)$ onto the space of $x$ so that $x(k\Delta t)$ is meant to approximate $xkt$. We rewrite Eq. (2) as

where $H$ approximates $H~$ (for example, via interpolation from a model grid to an observation location).

### A. Data assimilation

In DA, the initial state $x(0)$ for the forecast model, Eq. (3), is determined from the current measurement $y0$ and past measurements ${yk}\u2212Tm\u2264k<0$, using the same model to propagate information (imperfectly) from the past to the present. Here, $Tm$ represents the number of past times $k\Delta t$ from which measurements are available. Many DA methods can be formulated as the approximate solution of a nonlinear least-squares problem,^{16,23,35} similar to the following. Given the measurements ${yk}\u2212Tm\u2264k\u22640$, find a trajectory $x(t)$ of Eq. (3) that minimizes the cost function given by

Usually, $\rho $ is taken to equal $1$, but here, we allow also $\rho >1$ in order to discount past measurements in part to compensate for model error in $G$; we will discuss the role of $\rho $ further after Eq. (8). In the situation where the measurement errors ($\eta k$) are Gaussian and the model $G$ is a perfect representation of the dynamical system, i.e., $G=F$, minimizing Eq. (5) with $\rho =1$ will give us the trajectory of the model $x(t)$ that is the most likely trajectory of the dynamical system in the Bayesian sense (here, we assume a “flat” prior likelihood; the choice of prior distribution becomes increasingly unimportant as $T$ increases.)

In forecast applications, the purpose of DA is to determine an estimate $x(0)$ of the current true system state $x0t$. This estimate is used as an initial condition for Eq. (3) to forecast $x(t)$ for $t>0$. As such, the trajectory $x(t)$ for $t<0$ that minimizes the cost function $J0$ is often not computed explicitly; only its final state $x(0)$ is needed as the algorithmic output. DA is typically done sequentially using the state estimate from an earlier time to compute the estimate for the current time. To be more precise, consider the cost function $Jj$ that generalizes Eq. (5) as follows:

Then,

Suppose for some $\u2212Tm<j<=0$ that $xj\u22121a$ represents an estimate of $xj\u22121t$ based on the cost function $Jj\u22121$, i.e., based on measurements $yk$ up to and including $k=j\u22121$. The next estimate $xja$ is computed in two steps as follows:

*Forecast*: The model, Eq. (3), is applied from time $(j\u22121)\Delta t$ to $j\Delta t$, using $xj\u22121a$ as the initial condition, to obtain an estimate $xjb$ to $xjt$, called the “background” state at time $j\Delta t$.*Analysis*: The background $xjb$ is used along with the current measurement vector $yj$ to compute the “analysis” state $xja$ that estimates the minimizer of $Jj$ at time $j\Delta t$.

These two steps are repeated to compute $xj+1b$, $xj+1a$, and so on.

Different DA methods perform the analysis step differently.^{3,12,21,35} In some methods, including those based on the Kalman filter, the cost functions are approximated by quadratic functions of $x(j\Delta t)$ so that the right side of Eq. (7) becomes

Here, $Pjb$ is a covariance matrix associated with the background state at time $j\Delta t$. Thus, the quantity to be minimized is a sum of (squared) weighted distances from $x(j\Delta t)$ to the background state (which is forecast from previous observations) and to the current observations. In an ensemble Kalman filter,^{9,11,15}$Pjb$ is computed as the sample covariance of an ensemble of forecasts. The parameter $\rho $ effectively inflates this covariance, compensating for (in addition to model error) sampling errors due to finite ensemble size and the effect of nonlinearity on the error dynamics; this technique is called multiplicative covariance inflation.^{16,35,39} We refer to $\rho $ as the “covariance inflation parameter.”

In the Appendix, we describe the particular DA method we use, the “Ensemble Transform Kalman Filter” (ETKF),^{4,38} a type of an ensemble Kalman filter. We emphasize that the approach we describe in Sec. II B can be used with any DA method. The forecast and analysis steps we have described are applied iteratively to produce a time series of analysis states $xja$. We will use this time series in Sec. II B to train the ML component of our method.

### B. ML-DA algorithm

In what follows, we use a reservoir computer implementation of our ML-DA algorithm based on Refs. 17 and 26, which is similar to that of Ref. 31. The main component of a reservoir computer is the reservoir network. The reservoir network is a sparse, randomly generated graph with weighted, directed edges and dynamical nodes whose states evolve in response to input signals transported along network edges. The network topology can be represented completely by a weighted adjacency matrix, denoted $A$. We choose $A$ to be a $Dr\xd7Dr$ sparse, randomly generated matrix. The network is constructed to have an average in-degree (number of incoming edges per node) denoted by $\u27e8d\u27e9$, and the nonzero elements of $A$, representing the edge weights in the network, are initially chosen independently from the uniform distribution over the interval $[0,1]$. All the edge weights in the network are then uniformly scaled via multiplication of the adjacency matrix by a constant factor to set the largest magnitude eigenvalue of the matrix to a desired value $\omega $, which is called the “spectral radius” of $A$. The state of the reservoir, given by the vector $r(t)$, consists of the components $rn$ for $1\u2264n\u2264Dr$, where $rn(t)$ denotes the scalar state of the $n$th node in the network. The reservoir is coupled to the $M$ dimensional input through a $Dr\xd7M$ dimensional matrix $Win$. Each row of the matrix $Win$ has exactly one nonzero element, which is independently and randomly chosen from the uniform distribution on the interval $[\u2212\zeta ,\zeta ]$, where $\zeta $ is a tunable parameter. We choose the locations of these nonzero elements such that each node receives input from only one input variable and each input variable is input to an equal number of nodes.

As outlined at the beginning of Sec. II, we note that the finite-dimensional representation of the true dynamical system states is given by $xjt$. We have $T+Ts$ measurements given by ${yj}$, in the interval $\u2212T\u2212Ts\u2264j\u22640$, which are related to the true state of the dynamical system by Eq. (2). We refer to $T$ and $Ts$ as the number of training and synchronization samples, where $T\u226bTs$. We further assume that the model $G$ [Eq. (3)] is imperfect. Using the model $G$ and the DA algorithm outlined in the Appendix, we obtain an analysis state $xja$ at each time step $j$, $\u2212T\u2212Ts\u2264j\u22640$. We are interested in forecasting the state of the dynamical system for $j>0$. In the standard forecast approach (without ML), one would predict the future state of the dynamical system [Eq. (1)] using $x0a$ as the initial condition and Eq. (3) as the dynamical model. We will call this forecast the baseline forecast, to which we will compare the forecast made by our ML-DA algorithm. The ML-DA algorithm is an extension, to the case where the model state is only partially observed, of the algorithm proposed in Ref. 31 for hybrid forecasting of chaotic dynamical systems using ML in conjunction with a knowledge-based model. The analysis states $xja$ lie in the model state space; once they are determined by the DA procedure, training proceeds as in the hybrid method of Ref. 31. Figure 1 shows a schematic illustration of the ML-DA scheme described below.

#### 1. Training

Use the model $G$ to create a set of forecasts from the latest $T$ analysis states $xja$, $\u2212T\u2264j\u22640$. We denote these time-$\Delta t$ forecasts by $xjM$, $\u2212T+1\u2264j\u22641$.

Initialize the reservoir state $r\u2212T\u2212Ts$ to a random value.

- Evolve the reservoir computer using the following equation:for $\u2212T\u2212Ts\u2264j\u22640$. Here, $tanh$ is applied element-by-element to its input vector.(9)$rj+1=tanh\u2061[Arj+Winxja]$
- Find a set of output weights $Wout$ so thatfor $\u2212T+1\u2264j\u22640$. The matrix $Wout$ is computed using regularized least-squares regression. Thus, we find that $Wout$ minimizes the following cost function:(10)$xjH=Wout[rjxjM]\u2243xja$Here, $\beta $ is a regularization parameter. Notice that we have not used the first $Ts$ reservoir states to find the optimal $Wout$. This is to ensure that the reservoir states used in the computation of $Wout$ do not depend strongly on the initial random value of the reservoir state, $r\u2212T\u2212Ts$, and thus are “synchronized” to the measurements that drive Eq. (9) through the analysis states $xja$.(11)$\u2113(Wout)=\u2211j=\u2212T+10\Vert Wout[rjxjM]\u2212xja\Vert 2+\beta \Vert Wout\Vert 2.$

#### 2. Prediction

We now describe the hybrid forecast model that uses the trained reservoir along with the knowledge-based model $G$ to forecast the state $x(j\Delta t)$ for $j>0$.

The hybrid reservoir runs in a closed-loop feedback mode according to the following steps:

- Obtain the hybrid ML prediction at time $j\Delta t$ (denoted $xjH$) according toThe reservoir state and knowledge-based forecasts at the first step, $r1$ and $x1M$, are computed from the final training analysis $x0a$ according to steps 1 and 3 of the training procedure.(12)$xjH=Wout[rjxjM].$
Compute the imperfect model forecast at time $(j+1)\Delta t$, $xj+1M$, by evolving $xjH$ using the imperfect model $G$.

- Evolve the reservoir according to(13)$rj+1=tanh\u2061[Arj+WinxjH].$
Increment $j$ and repeat steps 1–3.

If one wants to initialize a forecast at a time $J\Delta t$ for $J>0$, one should continue to run the DA procedure for $J$ more steps after the training time period, set $xJH=xJa$, then proceed as above starting with step 2 for $j=J$. The $J$ DA cycles after time 0 can be performed using the trained hybrid model in place of the knowledge-based model, similarly to the approach described in Sec. II C.

We note that the ensemble-based DA method we use could be used to initialize an ensemble forecast. However, for simplicity, our results use only forecasts from the ensemble mean.

### C. Iterated ML-DA algorithm

Once we have trained a hybrid model using the process described in Sec. II B 1, we can iterate the training procedure in the following sense. An iterative approach, initialized differently, is described in Ref. 8. As shown in Fig. 2, we replace the knowledge-based model used during DA with the previously trained hybrid model. At iteration $i=1$, the hybrid is trained using the analysis states produced using only the knowledge-based model. We denote the output matrix to be determined by the training, the state of the reservoir at time $j\Delta t$, the analysis state at time $j\Delta t$, and the hybrid model predictions for time $j\Delta t$ at iteration $i$ as $Wout,i$, $rj,i$, $xj,ia$, and $xj,iH$, respectively.

Training proceeds as in Sec. II B 1 with the following modifications.

#### 1. Training

- For $\u2212T\u2212Ts\u2264j<\u2212T$, we use the knowledge-based model to perform DA, producing analysis states $xj,ia$. We do not use the hybrid model during this synchronization time period because the reservoir states $rj,i$ might not yet be synchronized with the measurements. These reservoir states are generated according to step 3 of Sec. II B 1; i.e.,(14)$rj+1,i=tanh\u2061[Arj,i+Winxj,ia].$
For $T\u2264j\u22640$, we use the hybrid model to perform DA in place of the knowledge-based model, producing analysis states $xj,ia$, as depicted in Fig. 2. We continue to use Eq. (14) to evolve the reservoir states.

Using these new analysis states, we follow the steps contained in Sec. II B 1 to compute a new hybrid output matrix, $Wout,i+1$.

Once the new hybrid output matrix has been computed, predictions can be produced by following Sec. II B 2.

In the case of an ensemble-based assimilation method (such as the ETKF described in the Appendix), we augment each ensemble member to include, in addition to the system state $xja,k$, a corresponding reservoir state $rjk$. This reservoir state is synchronized to the corresponding ensemble system state by evolving it as in Eq. (14),

Then, $rj+1,ik$ is used in the hybrid model to produce the $k$th background ensemble member for the next analysis cycle. We note that the reservoir state(s) at $j=\u2212T+1$ needed to begin DA using the hybrid model must only be computed at the first iteration. Afterward, these initial synchronized states may be reused at the beginning of subsequent DA cycles.

### D. Experiment design

We evaluate the forecast performance of the ML-DA algorithm in comparison with the performance of baseline for forecasts prepared without using DA. We consider two dynamical systems as our examples: the Lorenz 1963 system^{24} and the Kuramoto–Sivashinsky (KS) system.^{20,34} For the latter system, we additionally evaluate the error in the analysis states from using the iterative method described in Sec. II C.

For the purpose of evaluating forecast performance, simulated true states $xjt$ are generated by the perfect model dynamics in the time interval $\u2212T\u2212TS\u2264j\u2264P$, where $T$ and $TS$ are described in Sec. II B. In all cases, transient dynamics are discarded so that the simulated true states lie in the model attractor. The data in the interval $1\u2264j\u2264P$ are set aside to verify the baseline and ML-DA forecasts but is not otherwise used since it is considered to be in the future. We assume that we only have access to the measurements ${yj}\u2212T\u2212TS\u2264j\u22640$ generated from the simulated true states via Eq. (2) and do not know the perfect model, Eq. (1). We further assume that we have access to an imperfect model $G$ [Eq. (3)]. We use this imperfect model to perform DA, to train the ML-DA, and to prepare both the baseline and ML-DA forecasts for $j>0$.

For DA, we use the ETKF, as described in the Appendix, with 15 ensemble members for the Lorenz 1963 system and 30 ensemble members for the KS system. To initialize the forecasts and the ML training, we perform DA using the measurements $yj$ ($\u2212T\u2212TS\u2264j\u22640$) and the model $G$ to obtain the analysis states $x\u2212T\u2212Tsa,\u2026,x0a$.

#### 1. Baseline forecasts

Using $x0a$ as the initial condition, we compute a forecast using the model $G$ from time $j=1$ to $j=P$. We call this forecast $xjf,base$, $1\u2264j\u2264P$.

#### 2. ML-DA forecasts

We follow the ML-DA scheme detailed in Sec. II B using a reservoir computer with the reservoir and training parameters listed in Table I for the Lorenz 1963 and the Kuramoto–Sivashinsky (KS) dynamical systems. We use the measurements $yj$ ($\u2212T\u2212TS\u2264j\u22640$) and the imperfect model $G$ corresponding to the Lorenz and KS systems, respectively, to train the reservoir computer. The exact form of the measurements and of the imperfect model is described in Secs. III A and III B. We then forecast using the ML-DA scheme from $j=1$ to $j=P$. We call this forecast $xjf,ml$, $1\u2264j\u2264P$.

#### 3. Normalized RMS error

The normalized RMS error in the baseline forecast ($ejbase$) and the RMS error in the ML-DA forecast ($ejml$) are calculated as follows:

where $||\cdots ||$ denotes the Euclidean norm, $\u27e8\cdots \u27e9$ denotes an average over the prediction interval ($1\u2264j\u2264P$), and $ej$, $xjf$ denote either $ejbase$, $xjf,base$, or $ejml$, $xjf,ml$. The superscript $f$ in $xjf$ stands for forecast.

#### 4. Forecast valid time

We define the forecast Valid Time ($t{VT}$) as the time $j\Delta t$ at which the RMS error ($ej$) first exceeds a threshold $\kappa $ chosen to be $0.9$. In order to report the forecast quality on a meaningful natural time scale, we normalize the forecast valid time by the “Lyapunov time” of the dynamical system. The Lyapunov time is defined as the inverse of the Largest Lyapunov Exponent (LLE), denoted $\Lambda max$, of the dynamical system being forecast. In a chaotic system, infinitesimal errors grow exponentially on average as $exp\u2061(\Lambda maxt)$. Thus, $\Lambda max\u22121$ is a natural time scale for evaluating forecast quality in a chaotic system.

Since forecast valid time can vary considerably depending on what part of the attractor forecasts are initialized from, we present valid time results as box plots reflecting 100 different runs for the Lorenz System and 50 runs for the KS system. Each run uses a different initial condition for the training data (and, therefore, for the following forecast) and a different random realization of the reservoir. The box plots depict the median, the 25th and 75th percentiles and the 5th and 95th percentiles (depicted by dashed lines) computed over these trials.

We evaluate the statistical significance of the difference between each sample of ML-DA and baseline forecast valid times using Mood’s median test.^{27} This test considers the null hypothesis that the two samples come from distributions with the same median. We compute the median of the union of the two samples and then count the number of forecast valid times in each sample that exceeds this median. We then compute the p-value of these counts using Pearson’s chi-squared test; this represents the probability that a discrepancy between the two counts of at least the same size would have occurred if the null hypothesis were true. We choose to reject the null hypothesis and declare the difference in the samples to be statistically significant if the p-value is less than $10\u22123$.

#### 5. Analysis RMS error

For the iterative method in Sec. II C, we evaluate the error in the analysis state at a particular iteration $i$ by computing the normalized RMS error over the entire analysis period,

where $\u27e8\cdots \u27e9$ denotes an average over the DA interval ($\u2212T\u2212TS\u2264j\Delta t\u22640$). In addition, we evaluate the analysis error in the measured or unmeasured system variables by including only the particular set of variables in the norms in Eq. (17).

## III. RESULTS

We evaluate the performance of ML-DA using the procedures described in Sec. II D.

### A. Lorenz 1963

The Lorenz 1963 system is described by the equations

where $a=10$, $b=28$, and $c=8/3$. In this example, the true dynamical system, Eq. (1), is given by Eqs. (18)–(20). We obtain simulated data by integrating Eqs. (18)–(20) using a fourth order Runge–Kutta solver with time step $\tau =0.01$. We sample the time series at intervals $\Delta t=0.01$. These data are taken to represent the true system state $xjt$, $\u2212T\u2212TS\u2264j\u2264P$, where the subscript $j$ indicates that the data were sampled at time $j\Delta t$. We use the data in the interval $\u2212T\u2212TS\u2264j\u22641$ to create simulated measurements,

We will study two examples with two different forms of the measurement operator $H$, one in which only the $X1$ variable is measured,

and another in which the $X1$ and $X3$ variables are measured,

Measurement noise $\eta j$ with a Gaussian distribution that has mean zero and covariance matrix $R$ is added to the measurements. The covariance matrix is diagonal, with the diagonal elements equal to $\sigma noise2$, which corresponds to the situation in which the errors in the measurements of the different state vector components are statistically independent. In Secs. III A 1 and III A 2, we let $\sigma noise=0.1$. For comparison, the standard deviations $\sigma X1,\sigma X2,\sigma X3$ of the variables $X1,X2,X3$ are $7.9$, $8.9$, and $8.6$, respectively. In Sec. III A 3, we consider the effect of varying $\sigma noise$.

We let the imperfect model equations be given by

Thus, the imperfect model $G$ differs from the perfect model $F$ in the parameter $b$ by a multiplicative factor $(1+\u03f5)$. Notice also that the model error is in the evaluation of $X2$, which is unmeasured in our experiments.

#### 1. Optimizing the covariance inflation parameter

The quality of both the baseline and ML-DA forecasts depends on the covariance inflation parameter $\rho $ [see Eqs. (8) and (A4)]. This parameter is used in the DA cycle that initializes both forecasts, as well as forming the training data for the ML-DA. It is thus crucial that the covariance inflation parameter is optimized independently for both the forecast schemes. In Fig. 3, for $H=[100001]$, we demonstrate the dependence of $t{VT}$ on the covariance inflation parameter $\rho $ when the model error is $\u03f5=0.1$. Each forecast valid time is shown for a set of 100 forecasts (see Sec. II D 4) made with the baseline scheme (magenta box plots) and the ML-DA scheme (light blue box plots). Figure 3 shows that the ML-DA scheme dramatically improves $t{VT}$ when the model has substantial error. In particular, the best median $t{VT}$ for the ML-DA (which occurs at $\rho =1.2$) exceeds the best median $t{VT}$ baseline (which occurs at $\rho =1.05$) by about a factor of 3.

In the results that follow, for each choice of method, model error, and noise magnitude, we choose the covariance inflation parameter $\rho $ to be the value that yields the highest median $t{VT}$, among the values shown in Fig. 3. For our usual values $\u03f5=0.1$ and $\rho noise=0.1$, Fig. 4 shows a typical example comparing the true and predicted time series; notice that the ML-DA scheme improves the forecast for all three variables, including those that are unobserved.

#### 2. Dependence on the model error

Figure 5 demonstrates the effectiveness of the ML-DA algorithm at different values of the model error ($\u03f5$). As expected, we find that the ML-DA optimized over $\rho $ yields the most improvement relative to the baseline optimized over $\rho $ when the model error is large. In the case when the model error is zero, the ML-DA algorithm degrades the forecasts somewhat; since there is no model error to correct, the modification made to the model by the ML is spurious.

#### 3. Effects of measurement noise

Figure 6 shows the effects of varying amounts of measurement noise on the baseline (magenta box plots) and ML-DA (light blue box plots) forecast valid time for the Lorenz 1963 system. The measurement operator is $H=[100]$ and simulated uncorrelated Gaussian noise with standard deviation $\sigma noise$ is added to the measurements. We assume knowledge of the correct measurement noise so that the DA procedure uses $R=\sigma noise2I$. For comparison, the standard deviation of the $X1$ variable is $\sigma X1\u22437.9$. We see that the ML-DA outperforms the baseline forecast by a larger amount when the noise is smaller.

### B. Kuramoto–Sivashinsky (KS) system

In this section, we consider an example of a spatiotemporal chaotic dynamical system called the Kuramoto–Sivashinsky (KS) system defined by the Partial Differential Equation (PDE),

Equation (27) defines the evolution of a one-dimensional spatiotemporal scalar field $u(x,t)$ on the spatial domain $x\u2208[0,L)$. We assume periodic boundary conditions so that $u(x+L,t)=u(x,t)$. In this example, we let Eq. (27) represent the true, unknown space and time continuous dynamics. We obtain a time series of finite-dimensional representations of the state that we assume to satisfy Eq. (1) by discretizing the domain $[0,L)$ into $Q$ grid points and integrating Eq. (27) using a pseudo-spectral PDE solver^{19} with time step $\tau =0.25$. We sample the time series at intervals of $\Delta t=0.25$. Thus, our simulated true system states takes the form of a $Q$-dimensional vector $xjt$, $\u2212T\u2212TS\u2264j\u2264P$, where the subscript $j$ indicates that the data were sampled at time $j\Delta t$. As in the previous example of the Lorenz 1963 system (Sec. III A), we generate simulated measurements by adding random Gaussian noise to the simulated true states. We use the simulated measurements in the interval $\u2212T\u2212TS\u2264j\u22640$ as the training data for the ML-DA scheme and set aside the data in the interval $1\u2264j\u2264P$ for forecast verification.

We assume that we have access to an imperfect model, $G$, of the KS dynamical system given by the equations

Thus, if $\u03f5=0$, then the model represents the true dynamics perfectly, while a nonzero value of $\u03f5$ indicates an imperfect model. We also assume that our measurements are of the form given by Eq. (2). The measurement operator $H$ is a projection operator that outputs $\Theta $ uniformly spaced measurements (of the $Q$ total variables) in $xjt$. The measurement noise is Gaussian with zero mean and a diagonal covariance matrix with diagonal elements equal to $\sigma noise2$, where $\sigma noise=0.1$. For comparison, the standard deviation of the KS time series variable is $\sigma KS=1.3$.

#### 1. Forecast quality and dependence on model error

To demonstrate the effectiveness of the ML-DA technique, we consider a KS system with $L=35$, $Q=64$ and a measurement operator $H$ that corresponds to $\Theta =16$ uniformly spaced measurements. The upper three panels of Fig. 7 show (from top to bottom) the time dependence of the true discretized Kuramoto–Sivashinksy equation dynamics, the error in the baseline prediction, and the error in the ML-DA prediction when the model error is $\u03f5=0.1$. The bottom panel in Fig. 7 shows the normalized RMS error of both predictions. Notice that the ML-DA again significantly outperforms the baseline scheme.

Figure 8 shows $t{VT}$ for $50$ different forecasts per box plot made with the ML-DA and the baseline scheme at four different values of the model error $\u03f5$. As for the Lorenz 1963 system, as the size of the error in the imperfect model increases, the ML-DA scheme outperforms the baseline scheme by an increasing ratio, but as expected, it degrades performance somewhat when the model error is zero (and, thus, no ML correction is needed). In general, this degradation must persist for a sufficiently small model error; for the KS system, model error $\u03f5=0.01$ is not large enough for us to see a clear advantage to either scheme.

#### 2. Analysis error using iterative ML-ETKF

To demonstrate that the ML-DA hybrid model is also capable of improving the analysis states in a DA cycle, we re-do the DA run that produced the training data using the hybrid model in place of the knowledge-based model. We then re-train the hybrid model and iterate as described in Sec. II C. Figure 9 shows the error in analysis states computed using iteratively produced hybrid models. Here, iteration $0$ denotes the original set of analysis states computed using only the knowledge-based model. We see that using the first two generations of the trained ML-DA to perform DA results in a significant reduction in error in the analyses of both the measured and unmeasured variables. Later, iterations give smaller reductions in error. We also measured the forecast valid time for each hybrid model trained on the iteratively produced analysis time series. We found that there was only a small increase in the median valid time after one iteration ($\u223c10$%) and that successive iterations did not improve the valid time.

## IV. CONCLUSIONS AND DISCUSSION

We have demonstrated that DA can successfully extend the hybrid approach introduced in Ref. 31, which trains a machine-learning component based on reservoir computing to improve a knowledge-based forecast model, to the case where a multivariate time series of complete, directly-measured model states is not available for training. Though we use a specific ensemble DA scheme, ETKF, in our numerical experiments, any DA scheme could be used to process the available measurements into estimated model states for use as training data. Using a knowledge-based model rather than a purely data-driven approach allows us to perform DA before training the machine-learning component. A potential drawback to this approach is that biases present in the knowledge-based model will affect the training data, reducing the effectiveness of the training in correcting for these biases. Using the iterative approach described in Sec. II C, however, we are able to reduce the effect of these biases on the analysis errors by using the previously trained hybrid model as the forecast model on the DA. In applications such as weather forecasting, where multiple knowledge-based models are available, another approach would be to use a different model in the DA procedure than in the hybrid training and prediction procedure.

Another natural extension of the method described in this article would be to integrate the machine-learning training with the data-assimilation cycle in such a way that the training is updated each time new data become available. The use of reservoir computing as the machine-learning component of our hybrid model allows a large number of parameters (the elements of the output weight matrix) to be trained efficiently using linear regression, which could be done sequentially (like DA) through techniques such as recursive least squares. It could be fruitful to more closely integrate the state estimation and the estimation of the parameters of the hybrid model. The unified Bayesian framework described in Refs. 5 and 7 could be applied to a hybrid model, though a successful implementation may require limiting the number of machine-learning parameters.

## AUTHORS’ CONTRIBUTIONS

A.W. and J.P. contributed equally to this work.

## ACKNOWLEDGMENTS

This work was supported by the Defense Advanced Research Projects Agency (DARPA) (Grant No. HR00111890044). The work of Alexander Wikner was supported in part by the National Science Foundation (NSF) (Award No. DGE-1632976). The work of Istvan Szunyogh was also supported by the Office of Naval Research (ONR) (Award No. N00014-18-2509). We thank the reviewers for their helpful comments.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request. A software implementation in MATLAB of the hybrid reservoir computer described here (as well as in Refs. 31 and 40) may be found in GitHub at https://github.com/awikner/CHyPP, Ref. 41.

### APPENDIX: THE ENSEMBLE TRANSFORM KALMAN FILTER

The particular data-assimilation algorithm used in this paper is the Ensemble Transform Kalman Filter (ETKF),^{4,38} a variant of the Ensemble Kalman Filter (EnKF).^{9,11,15} Our formulation is based on the local analysis used in the LETKF^{16,28,36} for spatially extended systems larger than the KS system used in Sec. III B.

Recall from Sec. II A that the analysis state at time $j\Delta t$ is computed from the analysis state at time $(j\u22121)\Delta t$ by a forecast step followed by an analysis step. An EnKF uses an ensemble ${xja,k}1\u2264k\u2264E$ of analysis states whose mean $x\xafja$ represents the estimated state and whose covariance $Pja$ represents the uncertainty in the estimate. We now describe the forecast and analysis steps for the ensemble.

In an EnKF, the background ensemble members $xjb,k$ are forecast separately using the model, Eq. (3), so that

The sample covariance $Pjb$ of the background ensemble is

where $x\xafjb$ is the mean of the background ensemble members. The goal of the analysis step, in which the types of EnKF differ, is to compute an analysis ensemble whose mean $x\xafja$ and sample covariance $Pja$ are consistent with the Kalman filter.^{18}

We now present the computations that form the analysis step of the ETKF, as formulated in Ref. 16 (in which the consistency with the Kalman filter is explained). Recall that $yj$ is a vector of measurements made at time $j\Delta t$.

- Create the matrixwhose columns represent deviations of the ensemble members from the ensemble mean.(A3)$Xjb=[xjb,1\u2212x\xafjb|xjb,2\u2212x\xafjb|\cdots |xjb,E\u2212x\xafjb],$
Compute the matrix $Yjb=HXjb$ and the vector $y\xafjb$.

- Create a matrixwhere $C=(Yjb)TR\u22121$ and $\rho $ is the covariance inflation parameter described in Sec. II A.(A4)$P~ja=[(E\u22121)I/\rho +CYjb],$
Compute $Wja=[(E\u22121)P~ja]1/2$, where the $[\u22c5]1/2$ denotes the symmetric square root.

Compute $w\xafja=P~aC(yj\u2212y\xafjb)$ and add it to the $k$th column of $Wja$ to obtain $wja,k$.

Compute $xja,k=Xjbwja,k+x\xafjb$. The analysis ensemble is given by ${xja,k}1\u2264k\u2264E$.

For our results, the analysis ensemble is initialized arbitrarily in a neighborhood of the model attractor. Then, the forecast and analysis steps described above are applied iteratively to generate the analysis time series used to train the ML system.

## REFERENCES

*Data Assimilation: Methods, Algorithms, and Applications*, Fundamentals of Algorithms (SIAM, 2016).

*Neural Networks: Tricks of the Trade*(Springer, 2012), pp. 659–686.