We consider the problem of data-assisted forecasting of chaotic dynamical systems when the available data are in the form of noisy partial measurements of the past and present state of the dynamical system. Recently, there have been several promising data-driven approaches to forecasting of chaotic dynamical systems using machine learning. Particularly promising among these are hybrid approaches that combine machine learning with a knowledge-based model, where a machine-learning technique is used to correct the imperfections in the knowledge-based model. Such imperfections may be due to incomplete understanding and/or limited resolution of the physical processes in the underlying dynamical system, e.g., the atmosphere or the ocean. Previously proposed data-driven forecasting approaches tend to require, for training, measurements of all the variables that are intended to be forecast. We describe a way to relax this assumption by combining data assimilation with machine learning. We demonstrate this technique using the Ensemble Transform Kalman Filter to assimilate synthetic data for the three-variable Lorenz 1963 system and for the Kuramoto–Sivashinsky system, simulating a model error in each case by a misspecified parameter value. We show that by using partial measurements of the state of the dynamical system, we can train a machine-learning model to improve predictions made by an imperfect knowledge-based model.

A common desired function is that of cyclic forecasting of a chaotic system’s state, in which a set of forecasts to various time increments in the future is periodically produced once every cycle time T (e.g., for weather forecasting, T=6 h is common). To do this, at the beginning of each cycle, available measurements taken during the last cycle are combined with (“assimilated into”) a forecast of the current state made at the beginning of the previous cycle to produce an estimate of the current system state. This estimate then serves as the initial condition for the prediction model, which produces the next set of forecasts and so on. The process of cyclic formation of the required initial conditions is called “Data Assimilation.” Recently, prediction models utilizing machine learning have been developed and shown to have great promise. Thus, the training and use of machine-learning-assisted prediction models in cyclic forecasting systems is of considerable interest. To enable this, it is necessary to train the machine-learning system, using measured data, to forecast variables that are not measured. Here, we show that this can be accomplished using data assimilation and demonstrate the ability of machine learning to improve the forecast of variables for which no directly measured training data are available.

The application of machine learning (ML) to forecasting chaotic dynamical systems (e.g., Ref. 24) has been a subject of much recent interest.2,5–8,10,17,26,29–31,40 Promising test results suggest the possibility of using ML for large-scale applications such as weather forecasting. In particular, Ref. 29 indicates that machine-learning techniques can be scaled effectively to forecast arbitrarily high-dimensional chaotic attractors. In cases such as weather forecasting, where a skillful knowledge-based model is available, hybrid approaches that use ML to correct model deficiencies are especially promising.31,33 By “knowledge-based,” we mean derived from fundamental principles such as the laws of physics.

So far, most data-driven forecasting approaches assume that training data are available for all the variables to be forecast. This assumption is relaxed in Refs. 5–8, which describe techniques to develop a data-driven forecast model on a specified, but potentially only partially measured, model state space. Using an approach different from those in Refs. 5–8, in this paper, we develop a technique for hybrid knowledge-based/ML systems in the partially measured scenario. Both our hybrid approach and the purely data-driven approaches in Refs. 5, 6, 8, and 37 are based on the paradigm of data assimilation (DA), which has long been used in weather forecasting and other applications.

When a forecast model is available, DA (e.g., Refs. 3, 12, and 21) determines initial conditions for an ongoing sequence of forecasts, based on an ongoing sequence of measurements. DA does not require that the current measurements be sufficient to determine the current model state; instead, it typically uses the forecast model to propagate information from past measurements to augment the current information. This helps both to constrain the model state and to filter noise in the measurements.

Ideas from ML and DA have been combined in various ways in the literature. DA has been used with data-driven forecast models based on delay-coordinate embedding14 and analog forecasting.22 On the other hand, with appropriate training data, machine-learning techniques can be taught to perform DA without explicitly formulating a forecast model.25 DA has also been employed as a paradigm and a tool for the training of machine-learning systems.1,5–8,13,32

We give a brief introduction to DA in Sec. II A. In Sec. II B, we formulate a potentially effective scheme combining DA (Sec. II A) with machine-learning-assisted prediction (as described in Ref. 31) to create a cyclic forecast system. In Sec. II C, we propose an iterative procedure to improve the DA cycle and potentially produce improved ML-assisted forecast models. In Sec. III, we show the results of numerical experiments.

We consider the problem of forecasting an autonomous, chaotic, nonlinear dynamical system represented by the equation

\frac{d\tilde{x}(t)}{dt} = F[\tilde{x}(t)]. \qquad (1)

In Eq. (1), x̃(t) is the true dynamical system state at time t and F is the vector field. We allow for the case where x̃ is infinite-dimensional, as it is in a partial differential equation. We assume that we are only able to obtain partial measurements of the state x̃ at regular time intervals. We denote the measurements made at time kΔt by y_k so that

y_k = \tilde{H}\,\tilde{x}(k\Delta t) + \eta_k. \qquad (2)

In Eq. (2), H̃ is often called the observation operator; here, we call it the measurement operator. For our exposition, we assume H̃ to be linear, as will be true in the numerical experiments performed in this paper. (Many DA methods allow a nonlinear measurement operator, and our ML method is compatible with this case.) The random variable η_k represents measurement noise, which is typically assumed to be Gaussian with mean 0 and covariance matrix R.

In applications, we do not have full knowledge of the dynamics of the system, and, as such, Eq. (1) is unknown. What often is available is an imperfect model of the dynamics, with model error due to imperfect representations of some of the physical processes that govern the system. We assume that we have access to such an imperfect “knowledge-based” model denoted by G and that the dynamical equation

\frac{dx(t)}{dt} = G[x(t)] \qquad (3)

can, via evolution of x, be used to forecast future x̃-dependent properties of interest. In some applications, Eq. (1) represents a partial differential equation, and x approximates x̃ on a model grid. Thus, we assume that there is a projection operator (which could be the identity) from the space of x̃ to the space of x. We write x_k^t to represent the projection of the true state x̃(kΔt) onto the space of x so that x(kΔt) is meant to approximate x_k^t. We rewrite Eq. (2) as

y_k = H x_k^t + \eta_k, \qquad (4)

where H approximates H̃ (for example, via interpolation from a model grid to an observation location).

In DA, the initial state x(0) for the forecast model, Eq. (3), is determined from the current measurement y_0 and past measurements {y_k}, −T_m ≤ k < 0, using the same model to propagate information (imperfectly) from the past to the present. Here, T_m represents the number of past times kΔt from which measurements are available. Many DA methods can be formulated as the approximate solution of a nonlinear least-squares problem,16,23,35 similar to the following. Given the measurements {y_k}, −T_m ≤ k ≤ 0, find a trajectory x(t) of Eq. (3) that minimizes the cost function given by

J_0[x] = \sum_{k=-T_m}^{0} \rho^{k} \left[ y_k - H x(k\Delta t) \right]^T R^{-1} \left[ y_k - H x(k\Delta t) \right]. \qquad (5)

Usually, ρ is taken to equal 1, but here, we also allow ρ > 1 in order to discount past measurements, in part to compensate for model error in G; we will discuss the role of ρ further after Eq. (8). In the situation where the measurement errors η_k are Gaussian and the model G is a perfect representation of the dynamical system, i.e., G = F, minimizing Eq. (5) with ρ = 1 will give us the trajectory of the model x(t) that is the most likely trajectory of the dynamical system in the Bayesian sense (here, we assume a “flat” prior likelihood; the choice of prior distribution becomes increasingly unimportant as T_m increases).

In forecast applications, the purpose of DA is to determine an estimate x(0) of the current true system state x_0^t. This estimate is used as an initial condition for Eq. (3) to forecast x(t) for t > 0. As such, the trajectory x(t) for t < 0 that minimizes the cost function J_0 is often not computed explicitly; only its final state x(0) is needed as the algorithmic output. DA is typically done sequentially, using the state estimate from an earlier time to compute the estimate for the current time. To be more precise, consider the cost function J_j that generalizes Eq. (5) as follows:

J_j[x] = \sum_{k=-T_m}^{j} \rho^{k-j} \left[ y_k - H x(k\Delta t) \right]^T R^{-1} \left[ y_k - H x(k\Delta t) \right]. \qquad (6)

Then,

J_j[x] = \rho^{-1} J_{j-1}[x] + \left[ y_j - H x(j\Delta t) \right]^T R^{-1} \left[ y_j - H x(j\Delta t) \right]. \qquad (7)

Suppose for some −T_m < j ≤ 0 that x_{j−1}^a represents an estimate of x_{j−1}^t based on the cost function J_{j−1}, i.e., based on measurements y_k up to and including k = j − 1. The next estimate x_j^a is computed in two steps as follows:

  1. Forecast: The model, Eq. (3), is applied from time (j − 1)Δt to jΔt, using x_{j−1}^a as the initial condition, to obtain an estimate x_j^b of x_j^t, called the “background” state at time jΔt.

  2. Analysis: The background x_j^b is used along with the current measurement vector y_j to compute the “analysis” state x_j^a that estimates the minimizer of J_j at time jΔt.

These two steps are repeated to compute x_{j+1}^b, x_{j+1}^a, and so on.

Different DA methods perform the analysis step differently.3,12,21,35 In some methods, including those based on the Kalman filter, the cost functions are approximated by quadratic functions of x(jΔt) so that the right side of Eq. (7) becomes

\left[ x(j\Delta t) - x_j^b \right]^T \left( \rho P_j^b \right)^{-1} \left[ x(j\Delta t) - x_j^b \right] + \left[ y_j - H x(j\Delta t) \right]^T R^{-1} \left[ y_j - H x(j\Delta t) \right]. \qquad (8)

Here, P_j^b is a covariance matrix associated with the background state at time jΔt. Thus, the quantity to be minimized is a sum of (squared) weighted distances from x(jΔt) to the background state (which is forecast from previous observations) and to the current observations. In an ensemble Kalman filter,9,11,15 P_j^b is computed as the sample covariance of an ensemble of forecasts. The parameter ρ effectively inflates this covariance, compensating for (in addition to model error) sampling errors due to finite ensemble size and the effect of nonlinearity on the error dynamics; this technique is called multiplicative covariance inflation.16,35,39 We refer to ρ as the “covariance inflation parameter.”

In the Appendix, we describe the particular DA method we use, the “Ensemble Transform Kalman Filter” (ETKF),4,38 a type of ensemble Kalman filter. We emphasize that the approach we describe in Sec. II B can be used with any DA method. The forecast and analysis steps we have described are applied iteratively to produce a time series of analysis states x_j^a. We will use this time series in Sec. II B to train the ML component of our method.

In what follows, we use a reservoir computer implementation of our ML-DA algorithm based on Refs. 17 and 26, which is similar to that of Ref. 31. The main component of a reservoir computer is the reservoir network. The reservoir network is a sparse, randomly generated graph with weighted, directed edges and dynamical nodes whose states evolve in response to input signals transported along network edges. The network topology can be represented completely by a weighted adjacency matrix, denoted A. We choose A to be a D_r × D_r sparse, randomly generated matrix. The network is constructed to have an average in-degree (number of incoming edges per node) denoted by ⟨d⟩, and the nonzero elements of A, representing the edge weights in the network, are initially chosen independently from the uniform distribution over the interval [0, 1]. All the edge weights in the network are then uniformly scaled via multiplication of the adjacency matrix by a constant factor to set the largest magnitude eigenvalue of the matrix to a desired value ω, which is called the “spectral radius” of A. The state of the reservoir, given by the vector r(t), consists of the components r_n for 1 ≤ n ≤ D_r, where r_n(t) denotes the scalar state of the nth node in the network. The reservoir is coupled to the M-dimensional input through a D_r × M dimensional matrix W_in. Each row of the matrix W_in has exactly one nonzero element, which is independently and randomly chosen from the uniform distribution on the interval [−ζ, ζ], where ζ is a tunable parameter. We choose the locations of these nonzero elements such that each node receives input from only one input variable and each input variable is input to an equal number of nodes.
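As a concrete illustration, a minimal Python sketch of this construction is given below. The function name make_reservoir and its arguments are hypothetical, and details such as how nodes are assigned to input variables are one reasonable choice rather than the exact implementation used here.

    import numpy as np

    def make_reservoir(Dr, M, avg_degree, omega, zeta, seed=0):
        """Build the adjacency matrix A and input coupling W_in described above."""
        rng = np.random.default_rng(seed)
        # Sparse random adjacency matrix: each entry is nonzero with probability
        # avg_degree / Dr (average in-degree avg_degree); nonzero weights are
        # drawn uniformly from [0, 1].
        A = np.where(rng.random((Dr, Dr)) < avg_degree / Dr,
                     rng.random((Dr, Dr)), 0.0)
        # Rescale so the largest-magnitude eigenvalue (spectral radius) equals omega.
        A *= omega / np.max(np.abs(np.linalg.eigvals(A)))
        # Input matrix: one nonzero element per row, uniform in [-zeta, zeta];
        # node n listens to input variable (n * M) // Dr, so each input variable
        # drives an (approximately) equal number of nodes.
        W_in = np.zeros((Dr, M))
        for n in range(Dr):
            W_in[n, (n * M) // Dr] = rng.uniform(-zeta, zeta)
        return A, W_in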

As outlined at the beginning of Sec. II, we note that the finite-dimensional representation of the true dynamical system states is given by x_j^t. We have T + T_s measurements given by {y_j} in the interval −T − T_s ≤ j ≤ 0, which are related to the true state of the dynamical system by Eq. (2). We refer to T and T_s as the number of training and synchronization samples, where T ≫ T_s. We further assume that the model G [Eq. (3)] is imperfect. Using the model G and the DA algorithm outlined in the Appendix, we obtain an analysis state x_j^a at each time step j, −T − T_s ≤ j ≤ 0. We are interested in forecasting the state of the dynamical system for j > 0. In the standard forecast approach (without ML), one would predict the future state of the dynamical system [Eq. (1)] using x_0^a as the initial condition and Eq. (3) as the dynamical model. We will call this forecast the baseline forecast, to which we will compare the forecast made by our ML-DA algorithm. The ML-DA algorithm is an extension, to the case where the model state is only partially observed, of the algorithm proposed in Ref. 31 for hybrid forecasting of chaotic dynamical systems using ML in conjunction with a knowledge-based model. The analysis states x_j^a lie in the model state space; once they are determined by the DA procedure, training proceeds as in the hybrid method of Ref. 31. Figure 1 shows a schematic illustration of the ML-DA scheme described below.

FIG. 1.

(a) Training the ML-DA prediction scheme and (b) hybrid forecasting as described in Sec. II B. The output matrix in (a), W_out, is unknown at the time of training. The goal of training is to determine W_out such that x_j^H best approximates x_j^a in the regularized least-squares sense.


1. Training

  1. Use the model G to create a set of forecasts from the latest analysis states x_j^a, −T ≤ j ≤ 0. We denote these time-Δt forecasts by x_j^M, −T + 1 ≤ j ≤ 1.

  2. Initialize the reservoir state r_{−T−T_s} to a random value.

  3. Evolve the reservoir computer using the following equation:
    r_{j+1} = \tanh\left( A r_j + W_{in} x_j^a \right) \qquad (9)
    for −T − T_s ≤ j ≤ 0. Here, tanh is applied element-by-element to its input vector.
  4. Find a set of output weights W_out so that
    W_{out} \begin{bmatrix} r_j \\ x_j^M \end{bmatrix} \approx x_j^a \qquad (10)
    for −T + 1 ≤ j ≤ 0. The matrix W_out is computed using regularized least-squares regression. Thus, we find the W_out that minimizes the following cost function:
    \sum_{j=-T+1}^{0} \left\| W_{out} \begin{bmatrix} r_j \\ x_j^M \end{bmatrix} - x_j^a \right\|^2 + \beta \left\| W_{out} \right\|^2. \qquad (11)
    Here, β is a regularization parameter. Notice that we have not used the first T_s reservoir states to find the optimal W_out. This is to ensure that the reservoir states used in the computation of W_out do not depend strongly on the initial random value of the reservoir state, r_{−T−T_s}, and thus are “synchronized” to the measurements that drive Eq. (9) through the analysis states x_j^a. (A minimal Python sketch of this training procedure is given after this list.)
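The following minimal Python sketch illustrates the training procedure, assuming the reconstructed forms of Eqs. (9)–(11) above (a linear readout W_out acting on the concatenation of the reservoir state r_j and the model forecast x_j^M). All names (train_hybrid, model_step, and so on) are hypothetical.

    import numpy as np

    def train_hybrid(A, W_in, model_step, x_analysis, T_s, beta):
        """Drive the reservoir with analysis states and fit W_out by ridge regression.

        x_analysis : analysis states x_j^a, one row per time step, oldest first
        model_step : callable advancing the imperfect model G by one time step Delta t
        """
        n_steps = x_analysis.shape[0]
        Dr = A.shape[0]
        r = np.zeros(Dr)              # initial reservoir state (step 2; a random vector works too)
        feats, targets = [], []
        for j in range(n_steps - 1):
            # Step 3: reservoir update driven by the analysis state [Eq. (9)].
            r = np.tanh(A @ r + W_in @ x_analysis[j])
            # Step 1: one-step knowledge-based forecast from the analysis state.
            x_model = model_step(x_analysis[j])
            if j + 1 >= T_s:          # discard the first T_s synchronization steps
                feats.append(np.concatenate([r, x_model]))
                targets.append(x_analysis[j + 1])
        Phi = np.array(feats)         # rows are the concatenations [r_j ; x_j^M]
        Y = np.array(targets)         # rows are the targets x_j^a
        # Step 4: regularized (ridge) least squares for W_out [Eq. (11)].
        W_out = np.linalg.solve(Phi.T @ Phi + beta * np.eye(Phi.shape[1]), Phi.T @ Y).T
        return W_out, r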

2. Prediction

We now describe the hybrid forecast model that uses the trained reservoir along with the knowledge-based model G to forecast the state x(jΔt) for j>0.

The hybrid reservoir runs in a closed-loop feedback mode according to the following steps:

  1. Obtain the hybrid ML prediction at time jΔt (denoted x_j^H) according to
    x_j^H = W_{out} \begin{bmatrix} r_j \\ x_j^M \end{bmatrix}. \qquad (12)
    The reservoir state and knowledge-based forecast at the first step, r_1 and x_1^M, are computed from the final training analysis x_0^a according to steps 1 and 3 of the training procedure.
  2. Compute the imperfect model forecast at time (j + 1)Δt, x_{j+1}^M, by evolving x_j^H using the imperfect model G.

  3. Evolve the reservoir according to
    r_{j+1} = \tanh\left( A r_j + W_{in} x_j^H \right). \qquad (13)
  4. Increment j and repeat steps 1–3.

If one wants to initialize a forecast at a time JΔt for J > 0, one should continue to run the DA procedure for J more steps after the training time period, set x_J^H = x_J^a, and then proceed as above starting with step 2 for j = J. The J DA cycles after time 0 can be performed using the trained hybrid model in place of the knowledge-based model, similarly to the approach described in Sec. II C. (A Python sketch of this prediction loop is given at the end of this subsection.)

We note that the ensemble-based DA method we use could be used to initialize an ensemble forecast. However, for simplicity, our results use only forecasts from the ensemble mean.
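A minimal Python sketch of the closed-loop prediction procedure, under the same assumptions as the training sketch above, is as follows; the function and argument names are hypothetical.

    import numpy as np

    def hybrid_forecast(A, W_in, W_out, model_step, x0_analysis, r0, n_steps):
        """Closed-loop hybrid prediction: reservoir and imperfect model in feedback."""
        # r_1 and x_1^M are computed from the final training analysis x_0^a
        # (steps 1 and 3 of the training procedure).
        r = np.tanh(A @ r0 + W_in @ x0_analysis)
        x_model = model_step(x0_analysis)
        preds = []
        for _ in range(n_steps):
            # Step 1: hybrid output combining reservoir and model forecast [Eq. (12)].
            x_hybrid = W_out @ np.concatenate([r, x_model])
            preds.append(x_hybrid)
            # Step 2: advance the imperfect model from the hybrid state.
            x_model = model_step(x_hybrid)
            # Step 3: advance the reservoir, driven by the hybrid state [Eq. (13)].
            r = np.tanh(A @ r + W_in @ x_hybrid)
        return np.array(preds)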

Once we have trained a hybrid model using the process described in Sec. II B 1, we can iterate the training procedure in the following sense. An iterative approach, initialized differently, is described in Ref. 8. As shown in Fig. 2, we replace the knowledge-based model used during DA with the previously trained hybrid model. At iteration i = 1, the hybrid is trained using the analysis states produced using only the knowledge-based model. We denote the output matrix to be determined by the training, the state of the reservoir at time jΔt, the analysis state at time jΔt, and the hybrid model predictions for time jΔt at iteration i as W_{out,i}, r_{j,i}, x_{j,i}^a, and x_{j,i}^H, respectively.

FIG. 2.

Training subsequent iterations (i ≥ 2) of the ML-DA algorithm using the synchronized hybrid model during DA as described in Sec. II C. The index i denotes the current iteration of the algorithm, and we consider the training procedure shown in Fig. 1(a) to occur at i = 1. At iteration i, W_{out,i−1} is known from the previous training iteration, while we endeavor to determine W_{out,i} such that x_{j,i}^H best approximates x_{j,i}^a in the regularized least-squares sense.


Training proceeds as in Sec. II B 1 with the following modifications.

1. Training

  1. For −T − T_s ≤ j < −T, we use the knowledge-based model to perform DA, producing analysis states x_{j,i}^a. We do not use the hybrid model during this synchronization time period because the reservoir states r_{j,i} might not yet be synchronized with the measurements. These reservoir states are generated according to step 3 of Sec. II B 1; i.e.,
    r_{j+1,i} = \tanh\left( A r_{j,i} + W_{in} x_{j,i}^a \right). \qquad (14)
  2. For −T ≤ j ≤ 0, we use the hybrid model to perform DA in place of the knowledge-based model, producing analysis states x_{j,i}^a, as depicted in Fig. 2. We continue to use Eq. (14) to evolve the reservoir states.

  3. Using these new analysis states, we follow the steps contained in Sec. II B 1 to compute a new hybrid output matrix, W_{out,i}.

  4. Once the new hybrid output matrix has been computed, predictions can be produced by following Sec. II B 2.

In the case of an ensemble-based assimilation method (such as the ETKF described in the Appendix), we augment each ensemble member to include, in addition to the system state x_{j,i}^{a,k}, a corresponding reservoir state r_{j,i}^k. This reservoir state is synchronized to the corresponding ensemble system state by evolving it as in Eq. (14),

r_{j+1,i}^k = \tanh\left( A r_{j,i}^k + W_{in} x_{j,i}^{a,k} \right). \qquad (15)

Then, r_{j+1,i}^k is used in the hybrid model to produce the kth background ensemble member for the next analysis cycle. We note that the reservoir state(s) at j = −T + 1 needed to begin DA using the hybrid model must only be computed at the first iteration. Afterward, these initial synchronized states may be reused at the beginning of subsequent DA cycles.

We evaluate the forecast performance of the ML-DA algorithm in comparison with the performance of baseline forecasts prepared without using ML. We consider two dynamical systems as our examples: the Lorenz 1963 system24 and the Kuramoto–Sivashinsky (KS) system.20,34 For the latter system, we additionally evaluate the error in the analysis states obtained from the iterative method described in Sec. II C.

For the purpose of evaluating forecast performance, simulated true states x_j^t are generated by the perfect model dynamics in the time interval −T − T_s ≤ j ≤ P, where T and T_s are described in Sec. II B. In all cases, transient dynamics are discarded so that the simulated true states lie in the model attractor. The data in the interval 1 ≤ j ≤ P are set aside to verify the baseline and ML-DA forecasts but are not otherwise used, since they are considered to be in the future. We assume that we only have access to the measurements {y_j}, −T − T_s ≤ j ≤ 0, generated from the simulated true states via Eq. (2), and do not know the perfect model, Eq. (1). We further assume that we have access to an imperfect model G [Eq. (3)]. We use this imperfect model to perform DA, to train the ML-DA, and to prepare both the baseline and ML-DA forecasts for j > 0.

For DA, we use the ETKF, as described in the Appendix, with 15 ensemble members for the Lorenz 1963 system and 30 ensemble members for the KS system. To initialize the forecasts and the ML training, we perform DA using the measurements y_j (−T − T_s ≤ j ≤ 0) and the model G to obtain the analysis states x_{−T−T_s}^a, …, x_0^a.

1. Baseline forecasts

Using x_0^a as the initial condition, we compute a forecast using the model G from time j = 1 to j = P. We call this forecast x_j^{f,base}, 1 ≤ j ≤ P.

2. ML-DA forecasts

We follow the ML-DA scheme detailed in Sec. II B using a reservoir computer with the reservoir and training parameters listed in Table I for the Lorenz 1963 and the Kuramoto–Sivashinsky (KS) dynamical systems. We use the measurements y_j (−T − T_s ≤ j ≤ 0) and the imperfect model G corresponding to the Lorenz and KS systems, respectively, to train the reservoir computer. The exact form of the measurements and of the imperfect model is described in Secs. III A and III B. We then forecast using the ML-DA scheme from j = 1 to j = P. We call this forecast x_j^{f,ml}, 1 ≤ j ≤ P.

TABLE I.

Reservoir and training parameters.

Parameter      Lorenz 1963      KS
D_r            1000             2000
⟨d⟩
ω              0.9              0.6
ζ              0.1              1.0
β              10^{-4}          10^{-4}
T              20 000           100 000

3. Normalized RMS error

The normalized RMS error in the baseline forecast (e_j^{base}) and the normalized RMS error in the ML-DA forecast (e_j^{ml}) are calculated as follows:

e_j = \frac{\left\| x_j^f - x_j^t \right\|}{\sqrt{\left\langle \left\| x_j^t \right\|^2 \right\rangle}}, \qquad (16)

where ‖·‖ denotes the Euclidean norm, ⟨·⟩ denotes an average over the prediction interval (1 ≤ j ≤ P), and e_j, x_j^f denote either e_j^{base}, x_j^{f,base} or e_j^{ml}, x_j^{f,ml}. The superscript f in x_j^f stands for forecast.

4. Forecast valid time

We define the forecast Valid Time (t_VT) as the time jΔt at which the normalized RMS error e_j first exceeds a threshold κ, chosen to be 0.9. In order to report the forecast quality on a meaningful natural time scale, we normalize the forecast valid time by the “Lyapunov time” of the dynamical system. The Lyapunov time is defined as the inverse of the Largest Lyapunov Exponent (LLE), denoted Λ_max, of the dynamical system being forecast. In a chaotic system, infinitesimal errors grow exponentially on average as exp(Λ_max t). Thus, Λ_max^{-1} is a natural time scale for evaluating forecast quality in a chaotic system.
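A minimal Python sketch of this evaluation, assuming the form of Eq. (16) given above, is shown below; the array and argument names are hypothetical (x_pred and x_true hold the forecast and true states for j = 1, …, P, one row per time step).

    import numpy as np

    def valid_time(x_pred, x_true, dt, lambda_max, kappa=0.9):
        """Normalized RMS error, Eq. (16), and valid time in Lyapunov time units."""
        # Normalization: RMS of the true state over the prediction interval.
        norm = np.sqrt(np.mean(np.sum(x_true**2, axis=1)))
        e = np.linalg.norm(x_pred - x_true, axis=1) / norm   # e_j for j = 1, ..., P
        exceed = np.nonzero(e > kappa)[0]                    # first crossing of the threshold
        if exceed.size:
            t_vt = (exceed[0] + 1) * dt
        else:
            t_vt = x_true.shape[0] * dt   # never exceeds: report the full interval (sketch choice)
        return t_vt * lambda_max          # i.e., divide by the Lyapunov time 1 / Lambda_max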

Since forecast valid time can vary considerably depending on what part of the attractor forecasts are initialized from, we present valid time results as box plots reflecting 100 different runs for the Lorenz System and 50 runs for the KS system. Each run uses a different initial condition for the training data (and, therefore, for the following forecast) and a different random realization of the reservoir. The box plots depict the median, the 25th and 75th percentiles and the 5th and 95th percentiles (depicted by dashed lines) computed over these trials.

We evaluate the statistical significance of the difference between each sample of ML-DA and baseline forecast valid times using Mood’s median test.27 This test considers the null hypothesis that the two samples come from distributions with the same median. We compute the median of the union of the two samples and then count the number of forecast valid times in each sample that exceed this median. We then compute the p-value of these counts using Pearson’s chi-squared test; this represents the probability that a discrepancy between the two counts of at least the same size would have occurred if the null hypothesis were true. We choose to reject the null hypothesis and declare the difference in the samples to be statistically significant if the p-value is less than 10^{-3}.
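For reference, this procedure corresponds to SciPy's scipy.stats.median_test, as in the sketch below; the sample arrays are hypothetical placeholders, and SciPy's default handling of ties and continuity correction may differ in detail from the computation described here.

    import numpy as np
    from scipy.stats import median_test

    # Hypothetical samples of forecast valid times (in Lyapunov time units).
    vt_mlda = np.array([4.1, 3.8, 5.0, 4.4, 3.9, 4.7])
    vt_baseline = np.array([1.2, 1.5, 1.1, 1.6, 1.3, 1.4])

    # Mood's median test: pool the samples, count values above the grand median,
    # and apply Pearson's chi-squared test to the resulting 2 x 2 contingency table.
    stat, p_value, grand_median, table = median_test(vt_mlda, vt_baseline)
    significant = p_value < 1e-3    # reject the null hypothesis of equal medians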

5. Analysis RMS error

For the iterative method in Sec. II C, we evaluate the error in the analysis state at a particular iteration i by computing the normalized RMS error over the entire analysis period,

e_i^a = \frac{\sqrt{\left\langle \left\| x_{j,i}^a - x_j^t \right\|^2 \right\rangle}}{\sqrt{\left\langle \left\| x_j^t \right\|^2 \right\rangle}}, \qquad (17)

where ⟨·⟩ denotes an average over the DA interval (−T − T_s ≤ j ≤ 0). In addition, we evaluate the analysis error in the measured or unmeasured system variables by including only the particular set of variables in the norms in Eq. (17).

We evaluate the performance of ML-DA using the procedures described in Sec. II D.

The Lorenz 1963 system is described by the equations

dX_1/dt = a(X_2 − X_1), \qquad (18)
dX_2/dt = X_1(b − X_3) − X_2, \qquad (19)
dX_3/dt = X_1 X_2 − c X_3, \qquad (20)

where a = 10, b = 28, and c = 8/3. In this example, the true dynamical system, Eq. (1), is given by Eqs. (18)–(20). We obtain simulated data by integrating Eqs. (18)–(20) using a fourth order Runge–Kutta solver with time step τ = 0.01. We sample the time series at intervals Δt = 0.01. These data are taken to represent the true system state x_j^t, −T − T_s ≤ j ≤ P, where the subscript j indicates that the data were sampled at time jΔt. We use the data in the interval −T − T_s ≤ j ≤ 0 to create simulated measurements,

y_j = H x_j^t + \eta_j. \qquad (21)

We will study two examples with two different forms of the measurement operator H, one in which only the X_1 variable is measured,

H = [1 \;\; 0 \;\; 0], \qquad (22)

and another in which the X_1 and X_3 variables are measured,

H = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \qquad (23)

Measurement noise η_j with a Gaussian distribution that has mean zero and covariance matrix R is added to the measurements. The covariance matrix is diagonal, with the diagonal elements equal to σ_noise^2, which corresponds to the situation in which the errors in the measurements of the different state vector components are statistically independent. In Secs. III A 1 and III A 2, we let σ_noise = 0.1. For comparison, the standard deviations σ_{X_1}, σ_{X_2}, σ_{X_3} of the variables X_1, X_2, X_3 are 7.9, 8.9, and 8.6, respectively. In Sec. III A 3, we consider the effect of varying σ_noise.

We let the imperfect model equations be given by

dX_1/dt = a(X_2 − X_1), \qquad (24)
dX_2/dt = X_1[b(1 + ϵ) − X_3] − X_2, \qquad (25)
dX_3/dt = X_1 X_2 − c X_3. \qquad (26)

Thus, the imperfect model G differs from the perfect model F in the parameter b by a multiplicative factor (1 + ϵ). Notice also that the model error is in the evaluation of X_2, which is unmeasured in our experiments.
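For concreteness, the following Python sketch implements the right-hand sides of Eqs. (18)–(20) and (24)–(26) and a single step of the fourth-order Runge–Kutta solver with step τ; the function names are hypothetical, and setting eps = 0 recovers the perfect model.

    import numpy as np

    def lorenz_rhs(X, a=10.0, b=28.0, c=8.0 / 3.0, eps=0.0):
        """Lorenz 1963 right-hand side; eps > 0 gives the imperfect model G of
        Eqs. (24)-(26), with b replaced by b*(1 + eps) in the X2 equation."""
        x1, x2, x3 = X
        return np.array([a * (x2 - x1),
                         x1 * (b * (1.0 + eps) - x3) - x2,
                         x1 * x2 - c * x3])

    def rk4_step(X, tau=0.01, **kwargs):
        """One fourth-order Runge-Kutta step of size tau."""
        k1 = lorenz_rhs(X, **kwargs)
        k2 = lorenz_rhs(X + 0.5 * tau * k1, **kwargs)
        k3 = lorenz_rhs(X + 0.5 * tau * k2, **kwargs)
        k4 = lorenz_rhs(X + tau * k3, **kwargs)
        return X + (tau / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)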

1. Optimizing the covariance inflation parameter

The quality of both the baseline and ML-DA forecasts depends on the covariance inflation parameter ρ [see Eqs. (8) and (A4)]. This parameter is used in the DA cycle that initializes both forecasts, as well as in the DA cycles that form the training data for the ML-DA. It is thus crucial that the covariance inflation parameter be optimized independently for both forecast schemes. In Fig. 3, for the measurement operator H of Eq. (23) (in which X_1 and X_3 are measured), we demonstrate the dependence of t_VT on the covariance inflation parameter ρ when the model error is ϵ = 0.1. Each forecast valid time is shown for a set of 100 forecasts (see Sec. II D 4) made with the baseline scheme (magenta box plots) and the ML-DA scheme (light blue box plots). Figure 3 shows that the ML-DA scheme dramatically improves t_VT when the model has substantial error. In particular, the best median t_VT for the ML-DA (which occurs at ρ = 1.2) exceeds the best median t_VT for the baseline (which occurs at ρ = 1.05) by about a factor of 3.

FIG. 3.

Dependence of the forecast valid time (t_VT) for the Lorenz 1963 model on the covariance inflation parameter (ρ) when H is given by Eq. (23), the measurement noise has magnitude σ_noise = 0.1, and the model error is ϵ = 0.1. Each box plot represents the result of 100 independent trials comparing the baseline and the ML-DA forecasts, each using a different Lorenz 1963 time series starting from a random initial condition in the attractor. The box plots depict the median, the 25th and 75th percentiles, and the 5th and 95th percentiles (depicted by dashed lines) computed over these trials. The “*” above a pair of boxes indicates where Mood’s median test determined that the ML-DA median forecast valid time was significantly greater than that of the baseline for that value of ρ, with a p-value less than 10^{-3}.


In the results that follow, for each choice of method, model error, and noise magnitude, we choose the covariance inflation parameter ρ to be the value that yields the highest median t_VT among the values shown in Fig. 3. For our usual values ϵ = 0.1 and σ_noise = 0.1, Fig. 4 shows a typical example comparing the true and predicted time series; notice that the ML-DA scheme improves the forecast for all three variables, including those that are unobserved.

FIG. 4.

Forecast time series plots for the Lorenz 1963 system using the ML-DA and the baseline scheme for H = [1 0 0] and model error ϵ = 0.1.


2. Dependence on the model error

Figure 5 demonstrates the effectiveness of the ML-DA algorithm at different values of the model error (ϵ). As expected, we find that the ML-DA optimized over ρ yields the most improvement relative to the baseline optimized over ρ when the model error is large. In the case when the model error is zero, the ML-DA algorithm degrades the forecasts somewhat; since there is no model error to correct, the modification made to the model by the ML is spurious.

FIG. 5.

Dependence of t_VT on the model error (ϵ) for the Lorenz 1963 system; (a) and (b) differ in the measurement operator used, which is indicated in the figures. Each box plot represents the result of 100 independent trials comparing the baseline and the ML-DA forecasts, each using a different Lorenz 1963 time series starting from a random initial condition in the attractor. The box plots depict the median, the 25th and 75th percentiles, and the 5th and 95th percentiles (depicted by dashed lines) computed over these trials. The covariance inflation parameter is independently optimized for both sets of forecasts at each value of the model error (ϵ), and the optimal value is displayed next to each box. The “*” above a pair of boxes indicates where Mood’s median test determined that the ML-DA median forecast valid time was significantly greater than that of the baseline for that value of ϵ and optimal ρ, with a p-value less than 10^{-3}.


3. Effects of measurement noise

Figure 6 shows the effects of varying amounts of measurement noise on the baseline (magenta box plots) and ML-DA (light blue box plots) forecast valid time for the Lorenz 1963 system. The measurement operator is H = [1 0 0], and simulated uncorrelated Gaussian noise with standard deviation σ_noise is added to the measurements. We assume knowledge of the correct measurement noise so that the DA procedure uses R = σ_noise^2 I. For comparison, the standard deviation of the X_1 variable is σ_{X_1} ≈ 7.9. We see that the ML-DA outperforms the baseline forecast by a larger amount when the noise is smaller.

FIG. 6.

Effect of varying amounts of measurement noise on the baseline (magenta box plots) and ML-DA (light blue box plots) forecast valid time for the Lorenz 1963 system, with model error ϵ = 0.2. Each box plot represents the result of 100 independent trials comparing the baseline and the ML-DA forecasts, each using a different Lorenz 1963 time series starting from a random initial condition in the attractor. The box plots depict the median, the 25th and 75th percentiles, and the 5th and 95th percentiles (depicted by dashed lines) computed over these trials. As in Fig. 5, the covariance inflation parameter is independently optimized for both sets of forecasts at each value of the measurement noise (σ_noise), and the optimal value is displayed next to each box. The “*” above a pair of boxes indicates where Mood’s median test determined that the ML-DA median forecast valid time was significantly greater than that of the baseline for that value of σ_noise and optimal ρ, with a p-value less than 10^{-3}. The measurement operator is H = [1 0 0], and simulated uncorrelated Gaussian noise with standard deviation σ_noise is added to the measurements. For comparison, the standard deviation of the X_1 variable is σ_{X_1} ≈ 7.9.


In this section, we consider an example of a spatiotemporal chaotic dynamical system called the Kuramoto–Sivashinsky (KS) system defined by the Partial Differential Equation (PDE),

\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} + \frac{\partial^2 u}{\partial x^2} + \frac{\partial^4 u}{\partial x^4} = 0. \qquad (27)

Equation (27) defines the evolution of a one-dimensional spatiotemporal scalar field u(x, t) on the spatial domain x ∈ [0, L). We assume periodic boundary conditions so that u(x + L, t) = u(x, t). In this example, we let Eq. (27) represent the true, unknown space- and time-continuous dynamics. We obtain a time series of finite-dimensional representations of the state, which we assume to satisfy Eq. (1), by discretizing the domain [0, L) into Q grid points and integrating Eq. (27) using a pseudo-spectral PDE solver19 with time step τ = 0.25. We sample the time series at intervals of Δt = 0.25. Thus, our simulated true system states take the form of a Q-dimensional vector x_j^t, −T − T_s ≤ j ≤ P, where the subscript j indicates that the data were sampled at time jΔt. As in the previous example of the Lorenz 1963 system (Sec. III A), we generate simulated measurements by adding random Gaussian noise to the simulated true states. We use the simulated measurements in the interval −T − T_s ≤ j ≤ 0 as the training data for the ML-DA scheme and set aside the data in the interval 1 ≤ j ≤ P for forecast verification.

We assume that we have access to an imperfect model, G, of the KS dynamical system given by the equations

\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} + (1 + \epsilon) \frac{\partial^2 u}{\partial x^2} + \frac{\partial^4 u}{\partial x^4} = 0. \qquad (28)

Thus, if ϵ = 0, then the model represents the true dynamics perfectly, while a nonzero value of ϵ indicates an imperfect model. We also assume that our measurements are of the form given by Eq. (2). The measurement operator H is a projection operator that outputs Θ uniformly spaced measurements (of the Q total variables) in x_j^t. The measurement noise is Gaussian with zero mean and a diagonal covariance matrix with diagonal elements equal to σ_noise^2, where σ_noise = 0.1. For comparison, the standard deviation of the KS time series variable is σ_KS = 1.3.
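A minimal sketch of this measurement setup, assuming H simply selects Θ uniformly spaced grid points out of the Q model variables, is given below; the variable names are hypothetical.

    import numpy as np

    # Measurement operator selecting Theta uniformly spaced grid points out of Q,
    # and noisy measurements y_j = H x_j^t + eta_j as in Eq. (2).
    Q, Theta, sigma_noise = 64, 16, 0.1
    idx = np.arange(0, Q, Q // Theta)           # measured grid locations
    H = np.zeros((Theta, Q))
    H[np.arange(Theta), idx] = 1.0

    rng = np.random.default_rng(0)
    def measure(x_true):
        """Return a noisy partial measurement of the Q-dimensional state x_true."""
        return H @ x_true + rng.normal(0.0, sigma_noise, size=Theta)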

1. Forecast quality and dependence on model error

To demonstrate the effectiveness of the ML-DA technique, we consider a KS system with L = 35 and Q = 64 and a measurement operator H that corresponds to Θ = 16 uniformly spaced measurements. The upper three panels of Fig. 7 show (from top to bottom) the time dependence of the true discretized Kuramoto–Sivashinsky equation dynamics, the error in the baseline prediction, and the error in the ML-DA prediction when the model error is ϵ = 0.1. The bottom panel in Fig. 7 shows the normalized RMS error of both predictions. Notice that the ML-DA again significantly outperforms the baseline scheme.

FIG. 7.

The top 3 panels in this figure are spatiotemporal plots with the 64 spatial grid point locations plotted vertically and the Lyapunov time plotted horizontally. From top to bottom in these panels, the color axis shows the true value of the ith element of x_j^t, i = 1, 2, …, Q, from the KS system, the difference between the true ith element of x_j^t and the prediction from the baseline scheme, and the difference between the true ith element of x_j^t and the prediction from the ML-DA scheme. The bottom plot shows the normalized RMS error as a function of time for the baseline (magenta) and ML-DA (light blue) schemes, while the dotted vertical lines mark the forecast valid time using each scheme. For both schemes, the model error was ϵ = 0.1, and the state was measured at Θ = 16 equally spaced locations of the full Q = 64 grid points with measurement noise magnitude σ_noise = 0.1.


Figure 8 shows t_VT for 50 different forecasts per box plot made with the ML-DA and the baseline scheme at four different values of the model error ϵ. As for the Lorenz 1963 system, as the size of the error in the imperfect model increases, the ML-DA scheme outperforms the baseline scheme by an increasing ratio, but, as expected, it degrades performance somewhat when the model error is zero (and, thus, no ML correction is needed). By continuity, this degradation must persist for sufficiently small nonzero model error; for the KS system, a model error of ϵ = 0.01 is not large enough for us to see a clear advantage for either scheme.

FIG. 8.

Forecast valid time of the ML-DA (light blue box plots) and baseline (magenta box plots) schemes for the KS system at different values of the model error (ϵ). Each box plot represents the result of 50 independent trials comparing the baseline and the ML-DA forecasts, each using a different KS time series starting from a random initial condition in the attractor. The box plots depict the median, the 25th and 75th percentiles, and the 5th and 95th percentiles (depicted by dashed lines) computed over these trials. As in Fig. 5, the covariance inflation parameter is independently optimized for both sets of forecasts at each value of the model error (ϵ), and the optimal value is displayed next to each box. The “*” above a pair of boxes indicates where Mood’s median test determined that the ML-DA median forecast valid time was significantly greater than that of the baseline for that value of ϵ and optimal ρ, with a p-value less than 10^{-3}.


2. Analysis error using iterative ML-ETKF

To demonstrate that the ML-DA hybrid model is also capable of improving the analysis states in a DA cycle, we redo the DA run that produced the training data using the hybrid model in place of the knowledge-based model. We then retrain the hybrid model and iterate as described in Sec. II C. Figure 9 shows the error in analysis states computed using the iteratively produced hybrid models. Here, iteration 0 denotes the original set of analysis states computed using only the knowledge-based model. We see that using the first two generations of the trained ML-DA to perform DA results in a significant reduction in error in the analyses of both the measured and unmeasured variables. Later iterations give smaller reductions in error. We also measured the forecast valid time for each hybrid model trained on the iteratively produced analysis time series. We found that there was only a small increase in the median valid time after one iteration (10%) and that successive iterations did not improve the valid time.

FIG. 9.

Normalized RMS error during DA as we iteratively use the ML-DA to produce analysis time series for the KS system and use those time series to train the hybrid model. A covariance inflation parameter of ρ=2, which we found to produce the lowest analysis error, was used to produce analysis time series at each step of the iteration. In addition, all knowledge-based models used in this plot had an error of ϵ=0.2. The plotted values are the median normalized RMS error values computed over 50 different sets of measurement data. The variance in the error over these different data sets is very small; therefore, we have chosen not to plot the standard deviation. At iteration 0, the error values are those from using only the knowledge-based model as our forecast model during DA. We also plot the error values computed over only the measured or the unmeasured variables.


We have demonstrated that DA can successfully extend the hybrid approach introduced in Ref. 31, which trains a machine-learning component based on reservoir computing to improve a knowledge-based forecast model, to the case where a multivariate time series of complete, directly measured model states is not available for training. Though we use a specific ensemble DA scheme, the ETKF, in our numerical experiments, any DA scheme could be used to process the available measurements into estimated model states for use as training data. Using a knowledge-based model rather than a purely data-driven approach allows us to perform DA before training the machine-learning component. A potential drawback to this approach is that biases present in the knowledge-based model will affect the training data, reducing the effectiveness of the training in correcting for these biases. Using the iterative approach described in Sec. II C, however, we are able to reduce the effect of these biases on the analysis errors by using the previously trained hybrid model as the forecast model in the DA. In applications such as weather forecasting, where multiple knowledge-based models are available, another approach would be to use a different model in the DA procedure than in the hybrid training and prediction procedure.

Another natural extension of the method described in this article would be to integrate the machine-learning training with the data-assimilation cycle in such a way that the training is updated each time new data become available. The use of reservoir computing as the machine-learning component of our hybrid model allows a large number of parameters (the elements of the output weight matrix) to be trained efficiently using linear regression, which could be done sequentially (like DA) through techniques such as recursive least squares. It could be fruitful to more closely integrate the state estimation and the estimation of the parameters of the hybrid model. The unified Bayesian framework described in Refs. 5 and 7 could be applied to a hybrid model, though a successful implementation may require limiting the number of machine-learning parameters.

A.W. and J.P. contributed equally to this work.

This work was supported by the Defense Advanced Research Projects Agency (DARPA) (Grant No. HR00111890044). The work of Alexander Wikner was supported in part by the National Science Foundation (NSF) (Award No. DGE-1632976). The work of Istvan Szunyogh was also supported by the Office of Naval Research (ONR) (Award No. N00014-18-2509). We thank the reviewers for their helpful comments.

The data that support the findings of this study are available from the corresponding author upon reasonable request. A software implementation in MATLAB of the hybrid reservoir computer described here (as well as in Refs. 31 and 40) may be found in GitHub at https://github.com/awikner/CHyPP, Ref. 41.

The particular data-assimilation algorithm used in this paper is the Ensemble Transform Kalman Filter (ETKF),4,38 a variant of the Ensemble Kalman Filter (EnKF).9,11,15 Our formulation is based on the local analysis used in the LETKF16,28,36 for spatially extended systems larger than the KS system used in Sec. III B.

Recall from Sec. II A that the analysis state at time jΔt is computed from the analysis state at time (j − 1)Δt by a forecast step followed by an analysis step. An EnKF uses an ensemble {x_j^{a,k}}_{1≤k≤E} of analysis states whose mean x̄_j^a represents the estimated state and whose covariance P_j^a represents the uncertainty in the estimate. We now describe the forecast and analysis steps for the ensemble.

In an EnKF, the background ensemble members x_j^{b,k} are forecast separately using the model, Eq. (3), so that

x_j^{b,k} = x_{j-1}^{a,k} + \int_{(j-1)\Delta t}^{j\Delta t} G[x^k(t)]\, dt, \qquad (A1)

where x^k(t) is the solution of Eq. (3) with initial condition x^k((j − 1)Δt) = x_{j−1}^{a,k}.

The sample covariance P_j^b of the background ensemble is

P_j^b = \frac{1}{E - 1} \sum_{k=1}^{E} \left( x_j^{b,k} - \bar{x}_j^b \right) \left( x_j^{b,k} - \bar{x}_j^b \right)^T, \qquad (A2)

where x̄_j^b is the mean of the background ensemble members. The goal of the analysis step, in which the types of EnKF differ, is to compute an analysis ensemble whose mean x̄_j^a and sample covariance P_j^a are consistent with the Kalman filter.18

We now present the computations that form the analysis step of the ETKF, as formulated in Ref. 16 (in which the consistency with the Kalman filter is explained). Recall that yj is a vector of measurements made at time jΔt.

  1. Create the matrix
    X_j^b = \left[ x_j^{b,1} - \bar{x}_j^b \;\; x_j^{b,2} - \bar{x}_j^b \;\; \cdots \;\; x_j^{b,E} - \bar{x}_j^b \right], \qquad (A3)
    whose columns represent deviations of the ensemble members from the ensemble mean.
  2. Compute the matrix Y_j^b = H X_j^b and the vector ȳ_j^b = H x̄_j^b.

  3. Create a matrix
    \tilde{P}_j^a = \left[ (E - 1) I / \rho + C Y_j^b \right]^{-1}, \qquad (A4)
    where C = (Y_j^b)^T R^{-1} and ρ is the covariance inflation parameter described in Sec. II A.
  4. Compute W_j^a = [(E − 1) \tilde{P}_j^a]^{1/2}, where [·]^{1/2} denotes the symmetric square root.

  5. Compute w̄_j^a = \tilde{P}_j^a C (y_j − ȳ_j^b) and add it to the kth column of W_j^a to obtain w_j^{a,k}.

  6. Compute x_j^{a,k} = X_j^b w_j^{a,k} + x̄_j^b. The analysis ensemble is given by {x_j^{a,k}}_{1≤k≤E}.

For our results, the analysis ensemble is initialized arbitrarily in a neighborhood of the model attractor. Then, the forecast and analysis steps described above are applied iteratively to generate the analysis time series used to train the ML system.
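For concreteness, the following Python sketch implements the analysis step enumerated above, using the reconstructed forms of Eqs. (A3) and (A4); the function name and the convention that ensemble members are stored as columns are hypothetical.

    import numpy as np

    def etkf_analysis(ensemble_b, y, H, R, rho):
        """ETKF analysis step: one background ensemble member per column of ensemble_b."""
        E = ensemble_b.shape[1]
        xb_mean = ensemble_b.mean(axis=1)
        Xb = ensemble_b - xb_mean[:, None]           # Eq. (A3): deviations from the mean
        Yb = H @ Xb
        yb_mean = H @ xb_mean
        C = Yb.T @ np.linalg.inv(R)
        # Eq. (A4): analysis covariance in ensemble space, with covariance inflation rho.
        Pa_tilde = np.linalg.inv((E - 1) * np.eye(E) / rho + C @ Yb)
        # Symmetric square root of (E - 1) * Pa_tilde (step 4).
        vals, vecs = np.linalg.eigh((E - 1) * Pa_tilde)
        Wa = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
        # Mean weight vector (step 5), added to each column of Wa.
        wa_mean = Pa_tilde @ C @ (y - yb_mean)
        Wa = Wa + wa_mean[:, None]
        # Step 6: map the weights back to state space to form the analysis ensemble.
        return Xb @ Wa + xb_mean[:, None]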

1. H. D. I. Abarbanel, P. J. Rozdeba, and S. Shirman, Neural Comput. 30, 2025 (2018).
2. T. Arcomano, I. Szunyogh, J. Pathak, A. Wikner, B. R. Hunt, and E. Ott, Geophys. Res. Lett. 47, e2020GL087776 (2020), https://doi.org/10.1029/2020GL087776.
3. M. Asch, M. Bocquet, and M. Nodet, Data Assimilation: Methods, Algorithms, and Applications, Fundamentals of Algorithms (SIAM, 2016).
4. C. H. Bishop, B. J. Etherton, and S. J. Majumdar, Mon. Weather Rev. 129, 420 (2001).
5. M. Bocquet, J. Brajard, A. Carrassi, and L. Bertino, Found. Data Sci. 2, 55 (2020).
6. M. Bocquet, J. Brajard, A. Carrassi, and L. Bertino, Nonlinear Process. Geophys. 26, 143 (2019).
7. M. Bocquet, A. Farchi, and Q. Malartic, Found. Data Sci. (published online, 2020).
8. J. Brajard, A. Carrassi, M. Bocquet, and L. Bertino, J. Comput. Sci. 44, 101171 (2020).
9. G. Burgers, P. Jan van Leeuwen, and G. Evensen, Mon. Weather Rev. 126, 1719 (1998).
10. A. Chattopadhyay, P. Hassanzadeh, and D. Subramanian, Nonlinear Process. Geophys. 27, 373 (2020).
12. S. Fletcher, Data Assimilation for the Geosciences: From Theory to Applications (Elsevier, 2017).
13. G. A. Gottwald and S. Reich, “Supervised learning from noisy observations: Combining machine-learning techniques with data assimilation,” arXiv:2007.07383 (2020).
14. F. Hamilton, T. Berry, and T. Sauer, Phys. Rev. X 6, 011021 (2016).
15. P. L. Houtekamer and H. L. Mitchell, Mon. Weather Rev. 126, 796 (1998).
16. B. R. Hunt, E. J. Kostelich, and I. Szunyogh, Physica D 230, 112 (2007).
17. H. Jaeger and H. Haas, Science 304, 78 (2004).
18. R. E. Kalman, J. Basic Eng. 82, 35 (1960).
19. A.-K. Kassam and L. N. Trefethen, SIAM J. Sci. Comput. 26, 1214 (2005).
20. Y. Kuramoto and T. Tsuzuki, Prog. Theor. Phys. 55, 356 (1976).
21. K. Law, A. Stuart, and K. Zygalakis, Data Assimilation: A Mathematical Introduction (Springer International, 2015), Vol. 62.
22. R. Lguensat, P. Tandeo, P. Ailliot, M. Pulido, and R. Fablet, Mon. Weather Rev. 145, 4093 (2017).
23. Z. Li and I. M. Navon, Q. J. R. Meteorol. Soc. 127, 661 (2001).
24. E. N. Lorenz, J. Atmos. Sci. 20, 130 (1963).
25. Z. Lu, J. Pathak, B. Hunt, M. Girvan, R. Brockett, and E. Ott, Chaos 27, 041102 (2017).
26. M. Lukoševičius, in Neural Networks: Tricks of the Trade (Springer, 2012), pp. 659–686.
27. A. M. Mood, Introduction to the Theory of Statistics, 1st ed. (McGraw-Hill, 1950), pp. 394–399.
28. E. Ott, B. R. Hunt, I. Szunyogh, A. V. Zimin, E. J. Kostelich, M. Corazza, E. Kalnay, D. Patil, and J. A. Yorke, Tellus A 56, 415 (2004).
29. J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott, Phys. Rev. Lett. 120, 024102 (2018).
30. J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott, Chaos 27, 121102 (2017).
31. J. Pathak, A. Wikner, R. Fussell, S. Chandra, B. R. Hunt, M. Girvan, and E. Ott, Chaos 28, 041101 (2018).
32. G. V. Puskorius and L. A. Feldkamp, IEEE Trans. Neural Networks 5, 279 (1994).
33. M. Reichstein, G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, N. Carvalhais, and Prabhat, Nature 566, 195 (2019).
34. G. Sivashinsky, Acta Astronaut. 4, 1177 (1977).
35. I. Szunyogh, Applicable Atmospheric Dynamics: Techniques for the Exploration of Atmospheric Dynamics, 1st ed. (World Scientific, 2014), Chap. 4, pp. 405–536.
36. I. Szunyogh, E. J. Kostelich, G. Gyarmati, E. Kalnay, B. R. Hunt, E. Ott, E. Satterfield, and J. A. Yorke, Tellus A 60, 113 (2008).
37. F. Tomizawa and Y. Sawada, Geosci. Model Dev. Discuss. 2020, 1–33 (2020).
38. X. Wang, C. H. Bishop, and S. J. Julier, Mon. Weather Rev. 132, 1590 (2004).
39. J. S. Whitaker and T. M. Hamill, Mon. Weather Rev. 130, 1913 (2002).
40. A. Wikner, J. Pathak, B. Hunt, M. Girvan, T. Arcomano, I. Szunyogh, A. Pomerance, and E. Ott, Chaos 30, 053111 (2020).
41. A. Wikner, “MATLAB code for Combined Hybrid-Parallel Prediction,” GitHub (2020), https://github.com/awikner/CHyPP.