A machine-learning approach called “reservoir computing” has been used successfully for short-term prediction and attractor reconstruction of chaotic dynamical systems from time series data. We present a theoretical framework that describes conditions under which reservoir computing can create an empirical model capable of skillful short-term forecasts and accurate long-term ergodic behavior. We illustrate this theory through numerical experiments. We also argue that the theory applies to certain other machine learning methods for time series prediction.

A long-standing problem is prediction and analysis of data generated by a chaotic dynamical system whose equations of motion are unknown. Techniques based on delay-coordinate embedding have been successful for sufficiently low-dimensional systems. Recently, machine-learning approaches such as reservoir computing have shown promise in treating a larger class of systems. We develop a theory of how prediction with reservoir computing or related machine-learning methods can “learn” a chaotic system well enough to reconstruct the long-term dynamics of its attractor.

## I. INTRODUCTION

Reservoir computing^{1–4} is a machine-learning approach that has demonstrated success at a variety of tasks, including time series prediction^{5–8} and inferring unmeasured variables of a dynamical system from measured variables.^{9,10} In this approach, a “reservoir” is a high-dimensional, non-autonomous (driven) dynamical system, chosen independently of the task. A particular task provides an input time series, and the reservoir state as a function of time is regarded as a “raw” output time series, which is post-processed to fit the task. The post-processing function is determined, typically by linear regression, from a limited-time “training” data set consisting of the desired output time series for a given input time series.

Reservoir computing can be performed entirely in software, typically with an artificial neural network model, or with a physical reservoir; examples of the latter include a bucket of water,^{11} an electronic circuit with a time delay,^{12} a field-programmable gate array (FPGA),^{13} an optical network of semiconductor lasers,^{14} and an optoelectronic phase-delay system.^{15} Other machine-learning techniques, including deep learning,^{16,17} attempt to optimize internal system parameters to fit the training data; doing so requires a mathematical model of the machine-learning system. By contrast, reservoir computing does not require a model for the reservoir, nor the ability to alter the reservoir dynamics, because it seeks only to optimize the parameters of the post-processing function. The ability to use a physical reservoir as a “black box” allows for various potential advantages over other machine-learning techniques, including greatly enhanced speed.

In this article, we consider the task of predicting future measurements from a deterministic dynamical system, whose equations of motion are unknown, from limited time series data. We describe a general framework that includes the reservoir computing prediction method proposed by Jaeger and Haas.^{5} With appropriate modifications, the same framework applies to other machine-learning methods for time series prediction [including a long short-term memory (LSTM) approach^{18}], as we discuss further in Sec. IV. We assume the vector **u**(*t*) of measurements to be a function **h** of the finite-dimensional system state **s**(*t*),

$$\mathbf{u}(t) = \mathbf{h}(\mathbf{s}(t)). \qquad (1)$$

For simplicity, we assume that there is no measurement noise, although our discussion below could be modified for the case that Eq. (1) is an approximation. We do not assume that **h** is invertible, nor that **h** or **s**(*t*) is known in practice. The training data consist of a finite time series {**u**(*t*)} of measurements. We predict future values of **u**(*t*) by a sequence of three steps, which we call listening, training, and predicting.

Listening consists of using the training time series as input to the reservoir, which we model as a discrete-time deterministic process

$$\mathbf{r}(t+\tau) = \mathbf{f}[\mathbf{r}(t), \mathbf{u}(t)]. \qquad (2)$$

Here, **r**(*t*) is the reservoir state, *τ* is a time increment, and we assume **f** to be a differentiable function. We emphasize that in practice, a formula for **f** need not be known; only its outputs are used for training and prediction. For convenience, we assume that the full reservoir state **r**(*t*) can be measured or computed although our arguments can be modified easily for the more general case that the reservoir output is a function of its internal state. We call Eq. (2) the “listening reservoir.”
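
As a concrete illustration of the listening reservoir (2), the sketch below iterates a small echo-state-style network. The particular choice $\mathbf{f}(\mathbf{r}, \mathbf{u}) = \tanh(\mathbf{M}\mathbf{r} + \mathbf{W}_{\mathrm{in}}\mathbf{u})$ and the random matrices `M` and `W_in` are illustrative assumptions (anticipating the network of Sec. III); the theory requires only that **f** be differentiable.

```python
# A minimal listening reservoir, Eq. (2): r(t + tau) = f[r(t), u(t)].
# The tanh network below is an assumed example form for f, not the only choice.
import numpy as np

rng = np.random.default_rng(seed=0)
N, d = 50, 3                                # reservoir and input dimensions
M = rng.uniform(-1, 1, (N, N))
M *= 0.9 / max(abs(np.linalg.eigvals(M)))   # rescale spectral radius to 0.9
W_in = rng.uniform(-1, 1, (N, d))

def f(r, u):
    """One step of the listening reservoir driven by input u."""
    return np.tanh(M @ r + W_in @ u)

r = np.zeros(N)
for t in range(100):                        # drive with a bounded input series
    u = np.array([np.sin(0.1 * t), np.cos(0.2 * t), 0.5])
    r = f(r, u)
```

Because $\tanh$ is bounded, the reservoir state remains in $(-1, 1)^N$ no matter what bounded input drives it.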

Training consists of determining a post-processing function $\hat{\psi}$ that, when applied to the reservoir output **r**(*t* + *τ*), estimates the next input **u**(*t* + *τ*). (We view $\hat{\psi}$ as an approximation to an “ideal” post-processing function $\psi$, to be introduced in Sec. II B.) Thus, the goal of training is to find $\hat{\psi}$ such that $\hat{\psi}(\mathbf{r}(t+\tau)) \approx \mathbf{u}(t+\tau)$, or equivalently

$$\hat{\psi}\bigl(\mathbf{f}[\mathbf{r}(t), \mathbf{u}(t)]\bigr) \approx \mathbf{u}(t+\tau) \qquad (3)$$

for *t* large enough that the listening reservoir (2) has evolved beyond transient dynamics. We compute $\hat{\psi}$ by a fitting procedure, such as linear regression, on the training time series {**u**(*t*)} and the corresponding time series {**r**(*t*)} determined from the listening reservoir (2).
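
If, for example, $\hat{\psi}$ is taken to be linear, $\hat{\psi}(\mathbf{r}) = \mathbf{W}_{\mathrm{out}}\mathbf{r}$, the fitting procedure reduces to ridge regression on the listening trajectory. In this sketch, the reservoir and the driving series are synthetic stand-ins; only the pairing of $\mathbf{r}(t+\tau)$ with $\mathbf{u}(t+\tau)$ and the regression step reflect the training procedure described here.

```python
# Sketch of training with a linear post-processing function psi_hat(r) = W_out r,
# fit by ridge regression. The reservoir and input series are illustrative.
import numpy as np

rng = np.random.default_rng(seed=1)
N, d, beta = 50, 3, 1e-6
M = rng.uniform(-1, 1, (N, N))
M *= 0.9 / max(abs(np.linalg.eigvals(M)))
W_in = rng.uniform(-1, 1, (N, d))
f = lambda r, u: np.tanh(M @ r + W_in @ u)

# Listening: drive the reservoir with the training series {u(t)}.
T = 500
u_series = np.stack([[np.sin(0.1 * t), np.cos(0.2 * t), np.sin(0.05 * t)]
                     for t in range(T + 1)])
r = np.zeros(N)
R, U = [], []
for t in range(T):
    r = f(r, u_series[t])
    if t > 100:                    # discard transient before synchronization
        R.append(r)                # reservoir state r(t + tau) ...
        U.append(u_series[t + 1])  # ... paired with the next input u(t + tau)
R, U = np.array(R).T, np.array(U).T     # shapes (N, n) and (d, n)

# Ridge regression: W_out = U R^T (R R^T + beta I)^(-1)
W_out = U @ R.T @ np.linalg.inv(R @ R.T + beta * np.eye(N))
```

The regularization term $\beta$ discourages overfitting, as in the Tikhonov-regularized regression used in Sec. III.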

Predicting then proceeds by modifying the reservoir to run autonomously with a feedback loop, replacing its input [**u**(*t*) in Eq. (2)] with its post-processed output from the previous time increment,

$$\hat{\mathbf{r}}(t+\tau) = \mathbf{f}[\hat{\mathbf{r}}(t), \hat{\psi}(\hat{\mathbf{r}}(t))]. \qquad (4)$$

We call Eq. (4) the “predicting reservoir.” When initialized (from the listening reservoir state) with $\hat{\mathbf{r}}(t_0) = \mathbf{r}(t_0)$, iterating the predicting reservoir yields a time series $\{\hat{\psi}(\hat{\mathbf{r}}(t_0+\tau)), \hat{\psi}(\hat{\mathbf{r}}(t_0+2\tau)), \ldots\}$ of predictions for the future measurements $\{\mathbf{u}(t_0+\tau), \mathbf{u}(t_0+2\tau), \ldots\}$. (Our notation reflects the fact that for *t* > *t*_{0}, the predicting reservoir state $\hat{\mathbf{r}}(t)$ estimates the state $\mathbf{r}(t)$ that would result from evolving the listening reservoir (2) with the future measurements.)
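
Closing the loop as in Eq. (4) amounts to feeding the post-processed output back in as the next input. In the sketch below, `W_out` is a small random matrix standing in for a trained readout, so the output is not a meaningful forecast; it only illustrates the autonomous iteration of the predicting reservoir.

```python
# Sketch of the predicting reservoir, Eq. (4): the external input is replaced
# by the post-processed output psi_hat(r) = W_out r in a feedback loop.
# All matrices here are illustrative assumptions, not a fitted model.
import numpy as np

rng = np.random.default_rng(seed=2)
N, d = 50, 3
M = rng.uniform(-1, 1, (N, N))
M *= 0.9 / max(abs(np.linalg.eigvals(M)))
W_in = rng.uniform(-1, 1, (N, d))
W_out = rng.normal(0, 0.01, (d, N))     # stands in for a trained readout
f = lambda r, u: np.tanh(M @ r + W_in @ u)

r_hat = rng.uniform(-0.5, 0.5, N)       # in practice, initialized as r(t0)
predictions = []
for _ in range(20):
    u_hat = W_out @ r_hat               # post-processed output u_hat(t)
    r_hat = f(r_hat, u_hat)             # fed back as the next input
    predictions.append(W_out @ r_hat)
predictions = np.array(predictions)     # estimates of u(t0 + tau), u(t0 + 2 tau), ...
```
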

The reservoir prediction method we have described has been shown to produce successful short-term forecasts for a variety of dynamical systems.^{5,7,8} If the system has a chaotic attractor, then, as for any imperfect model, the prediction error $\|\hat{\psi}(\hat{\mathbf{r}}(t)) - \mathbf{u}(t)\|$ cannot remain small for *t* ≫ *t*_{0}. However, in some cases, the long-term time series $\{\hat{\psi}(\hat{\mathbf{r}}(t))\}$ continues to behave like the measurements from a typical trajectory on the attractor, and in this sense the predicting reservoir (4) approximately reproduces the ergodic properties of the dynamical system that generated the measurements.^{8} We refer to this ability, often called attractor reconstruction, as replication of the “climate.”

In this article, we develop and illustrate a theory of how reservoir prediction is able to “learn” the dynamics of a system well enough to produce both accurate short-term forecasts and accurate long-term climate. We make use of the notion of *generalized synchronization*,^{19–22} which in our context means that the reservoir state **r**(*t*) becomes asymptotically a continuous function $\varphi $ of **s**(*t*), in the limit that the listening reservoir (2) is run infinitely long. In Sec. II, we argue that the following four conditions are sufficient for both short-term prediction and attractor/climate replication.

1. The listening reservoir (2) achieves generalized synchronization with the process {**s**(*t*)}, so that $\mathbf{r}(t) \approx \varphi(\mathbf{s}(t))$ for a continuous function $\varphi$, within the time interval covered by the training time series.
2. The synchronization function $\varphi$ is one-to-one, or at least carries enough information about its input to recover **u**(*t*) = **h**(**s**(*t*)) from $\varphi(\mathbf{s}(t))$.
3. Training is successful in finding a function $\hat{\psi}$ such that Eq. (3) holds, or equivalently, in view of generalized synchronization, that $\hat{\psi}(\varphi(\mathbf{s}(t))) \approx \mathbf{h}(\mathbf{s}(t))$.
4. The attractor approached by the listening reservoir is also stable for the predicting reservoir (4).

Conditions 1–3 enable short-term prediction. Condition 4 ensures that the climate established by generalized synchronization of the listening reservoir is preserved when its input is replaced by a feedback term to form the predicting reservoir. One of the main points of Sec. II is to precisely formulate the stability condition described in condition 4.

We remark that generalized synchronization of the listening reservoir^{6,10} is related to the “echo state property,”^{1,23} which states that an infinite history of inputs $\{\mathbf{u}(t-\tau), \mathbf{u}(t-2\tau), \ldots\}$ uniquely determines **r**(*t*), subject to the condition that the trajectory {**r**(*t*)} is bounded. Indeed, if {**s**(*t*)} is a trajectory of an invertible dynamical system, then the past inputs are functions of **s**(*t*), so the echo state property implies that if the listening reservoir (2) has run for an infinite period of time in a bounded domain, then **r**(*t*) is a function of **s**(*t*) [though it does not imply that this function is continuous]. We believe that for the reservoir prediction method we described, it is desirable (though not strictly necessary) to have the echo state property and generalized synchronization. In Sec. II A, we show why both properties hold if the listening reservoir is uniformly contracting as a function of **r**, and that we can quantify the amount of transient time it takes for the reservoir to achieve the approximation $\mathbf{r}(t) \approx \varphi(\mathbf{s}(t))$ of condition 1.

Conditions 2 and 3 are significantly more difficult to ensure *a priori*. In Sec. II B, we argue why it is plausible that these conditions can be achieved. In Secs. II C and II D, we describe the consequences of conditions 1–3 for short-term prediction and formulate more precisely the stability criterion of condition 4 that determines whether the correct attractor and climate are approximately reproduced by the long-term dynamics of the predicting reservoir (4). In Sec. II E, we describe how a model for the reservoir dynamics can be used to compute Lyapunov exponents that reflect climate stability.

In Sec. III, we give examples of short-term state and long-term climate predictions using the Lorenz equations as our input system. In addition to a case where the climate is approximated well, we show a case where the predicted climate is inaccurate, although the short-term forecast is still reasonably accurate. We compute the Lyapunov exponents of the predicting reservoir (4) and show that the transition from accurate climate to inaccurate climate corresponds to a Lyapunov exponent crossing zero. When this Lyapunov exponent is positive but close to zero, the reservoir prediction remains close to the correct climate for a transient period, and we relate the average duration of this transient to the value of the Lyapunov exponent.

## II. THEORY

We consider the application of the reservoir prediction method described in the introduction to a time series {**u**(*t*)} that is a function **h** of a trajectory {**s**(*t*)} of the dynamical system

$$\mathbf{s}(t+\tau) = \mathbf{g}(\mathbf{s}(t)), \qquad (5)$$

where **g** is differentiable and invertible, and we assume that **s**(*t*) evolves on a bounded attractor *A*. In preparation for training and prior to prediction, the reservoir state **r**(*t*) evolves according to the listening reservoir (2). The system described by Eqs. (5) and (2), coupled by Eq. (1), is often called a drive-response, skew-product, or one-way coupled system. The coupled system dynamics are illustrated by Fig. 1. We next consider the evolution of the coupled system as *t* → *∞*.

### A. Listening and generalized synchronization

The goal of training can be regarded as finding a post-processing function $\hat{\psi}$ such that $\hat{\psi}(\mathbf{r}(t))$ is in approximate identical synchronization^{20} with $\mathbf{u}(t) = \mathbf{h}(\mathbf{s}(t))$ when **r**(*t*) is evolved with the listening reservoir (2). The desired relationship $\mathbf{u}(t) \approx \hat{\psi}(\mathbf{r}(t))$ can also be thought of as approximate generalized synchronization between **u**(*t*) [or the underlying state **s**(*t*)] and **r**(*t*). The existence of such a relationship would be implied by stochastic synchronization,^{19} which in our context means a one-to-one correspondence between **r**(*t*) and **s**(*t*) in the limit *t* → *∞*. However, in drive-response systems, the definition of *generalized synchronization*^{21,22} requires only that the response state be asymptotically a function of the drive state: in our case, that there is a continuous function $\varphi$ such that $\mathbf{r}(t) - \varphi(\mathbf{s}(t)) \to 0$ as *t* → *∞*. The existence of such a $\varphi$ is typically easier to establish than its invertibility. Next, we describe conditions on the reservoir system **f** that guarantee generalized synchronization.

Although weaker conditions are possible, we assume uniform contraction for **f**, as is often the case in practice. By *uniform contraction*, we mean that there is some *ρ* < 1 such that for all **r**_{1}, **r**_{2}, and **u**, we have $|\mathbf{f}(\mathbf{r}_1, \mathbf{u}) - \mathbf{f}(\mathbf{r}_2, \mathbf{u})| < \rho\,|\mathbf{r}_1 - \mathbf{r}_2|$. It then follows that two trajectories {**r**_{1}(*t*), **u**(*t*)} and {**r**_{2}(*t*), **u**(*t*)} of (2) with the same input time series approach each other exponentially: $|\mathbf{r}_1(t) - \mathbf{r}_2(t)| \le |\mathbf{r}_1(0) - \mathbf{r}_2(0)|\,\rho^{t/\tau}$. Thus, for a given input time series {**u**(*t*)}, the reservoir state **r**(*t*) is asymptotically independent of its initial state; this is essentially what Jaeger^{1} called the “echo state property.” Furthermore, because **g** is invertible and *A* is bounded, the results of Hirsch, Pugh, and Shub^{24,25} (a direct proof is given by Stark^{26}) imply that uniform contraction guarantees generalized synchronization, as defined above. (In general, the synchronization function $\varphi$ cannot be determined analytically from **f**, **g**, and **h**.) A weaker form of generalized synchronization can also be guaranteed^{26} by the non-uniform contraction implied by negative conditional Lyapunov exponents.
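
This contraction argument is easy to observe numerically. In the sketch below, the weight matrix is rescaled so its operator norm is 0.5, which (since $\tanh$ is 1-Lipschitz) guarantees uniform contraction with *ρ* = 0.5; two copies of the reservoir driven by the same input then converge regardless of their initial states. The network itself is an illustrative assumption.

```python
# Numerical illustration of the echo state property under uniform contraction:
# two reservoir states driven by the SAME input series converge exponentially,
# so r(t) becomes asymptotically independent of its initial state.
import numpy as np

rng = np.random.default_rng(seed=3)
N, d = 50, 3
M = rng.uniform(-1, 1, (N, N))
M *= 0.5 / np.linalg.norm(M, 2)     # operator norm 0.5 => contraction rho = 0.5,
                                    # because tanh is 1-Lipschitz
W_in = rng.uniform(-1, 1, (N, d))
f = lambda r, u: np.tanh(M @ r + W_in @ u)

r1 = rng.uniform(-1, 1, N)          # two different initial states
r2 = rng.uniform(-1, 1, N)
d0 = np.linalg.norm(r1 - r2)
for t in range(200):
    u = np.array([np.sin(0.1 * t), np.cos(0.2 * t), 0.3])
    r1, r2 = f(r1, u), f(r2, u)     # identical input drives both copies
d_final = np.linalg.norm(r1 - r2)
# d_final is many orders of magnitude below d0 when the map is contracting
```
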

We remark that if the listening reservoir (2) is uniformly contracting, then $\mathbf{r}(t) - \varphi(\mathbf{s}(t))$ converges to zero exponentially. If the designer of the reservoir can guarantee a specific contraction rate *ρ*, this determines the convergence rate, so that the amount of transient time needed to make the approximation $\mathbf{r}(t) \approx \varphi(\mathbf{s}(t))$ accurate can be known in practice.

Generalized synchronization implies that the set of (**s**, **r**) such that **s** is on its attractor *A* and $\mathbf{r} = \varphi(\mathbf{s})$ is an attractor for the drive-response system given by Eqs. (5), (1), and (2). Below we will use the fact that this set is invariant: $\mathbf{r}(t) = \varphi(\mathbf{s}(t))$ implies $\mathbf{r}(t+\tau) = \varphi(\mathbf{s}(t+\tau))$.

### B. Training

Recall that training seeks a function $\hat{\psi}$ that predicts the current measurement vector **u**(*t*) from the current listening reservoir state **r**(*t*) [which is computed from past measurements], and that when generalized synchronization is achieved, accuracy of this prediction is equivalent to $\hat{\psi}(\varphi(\mathbf{s}(t))) \approx \mathbf{h}(\mathbf{s}(t))$. For the rest of Sec. II, we assume that there is a function $\psi$ defined on $\varphi(A)$ such that $\psi(\varphi(\mathbf{s})) = \mathbf{h}(\mathbf{s})$ for all **s** in *A*. This assumption means that in the asymptotic limit of generalized synchronization, the listening reservoir state $\mathbf{r}(t) = \varphi(\mathbf{s}(t))$ uniquely determines **u**(*t*) = **h**(**s**(*t*)). The goal of training can then be described as finding a function $\hat{\psi}$, defined on the state space of the reservoir, that approximates $\psi$ on the set $\varphi(A)$. We summarize our notation in Table I.

**Table I.** Summary of notation.

| Symbol | Description |
| --- | --- |
| **Dynamical system to be predicted** | |
| $\mathbf{s}(t)$ | System state |
| $\mathbf{g}: \mathbf{s}(t) \to \mathbf{s}(t+\tau)$ | System evolution |
| $A$ | Attractor for $\mathbf{s}(t)$ |
| **Measurements** | |
| $\mathbf{u}(t)$ | Measurement vector |
| $\mathbf{h}: \mathbf{s}(t) \to \mathbf{u}(t)$ | Measurement function |
| **Reservoir** | |
| $\mathbf{r}(t)$ | Listening reservoir state |
| $\mathbf{f}: [\mathbf{r}(t), \mathbf{u}(t)] \to \mathbf{r}(t+\tau)$ | Listening reservoir evolution |
| $\hat{\mathbf{r}}(t)$ | Predicting reservoir state |
| $\hat{\mathbf{u}}(t) = \hat{\psi}(\hat{\mathbf{r}}(t))$ | Predicted measurements |
| $\mathbf{f}: [\hat{\mathbf{r}}(t), \hat{\mathbf{u}}(t)] \to \hat{\mathbf{r}}(t+\tau)$ | Predicting reservoir evolution |
| **Generalized synchronization** | |
| $\varphi: \mathbf{s} \to \mathbf{r}$ for $\mathbf{s}$ in $A$ | Synchronization function |
| $\psi: \mathbf{r} \to \mathbf{u}$ for $\mathbf{r}$ in $\varphi(A)$ | Ideal post-processing function |
| $\hat{\psi}: \hat{\mathbf{r}}(t) \to \hat{\mathbf{u}}(t)$ | Actual post-processing function |


Although the existence of $\psi$ is not strictly necessary for the reservoir to make useful predictions, if no such $\psi$ exists, then it seems unlikely that training can successfully achieve the desired approximation $\hat{\psi}(\varphi(\mathbf{s}(t))) \approx \mathbf{h}(\mathbf{s}(t))$, and thus unlikely that **u**(*t*) can be approximated as a function of the reservoir state during either listening or predicting. The existence of $\psi$ is guaranteed if $\varphi$ is one-to-one on *A*; then $\psi = \mathbf{h} \circ \varphi^{-1}$. Furthermore, if **h** is one-to-one on *A* (in other words, if the measurements at a given time determine the system state), then $\varphi$ must be one-to-one on *A* in order for $\psi$ to exist. Thus, we propose that a goal of reservoir design should be to yield a one-to-one synchronization function $\varphi$ for a variety of input systems. In practice, having a sufficiently high-dimensional reservoir may suffice; embedding results^{27,28} imply that if the dimension of the reservoir state **r** is more than twice the dimension of *A*, functions from *A* to the reservoir state space are typically one-to-one. We note that in practice, the dimension of **r** must be much larger than twice the dimension of *A* in order to provide a suitable basis for approximating $\psi$, in the sense described below.

Careful consideration of conditions under which training successfully determines an accurate approximation $\hat{\psi}$ to $\psi$ is beyond the scope of our theory. However, we argue that success is plausible if the training time series is sufficiently long that the trajectory {**s**(*t*)} well samples its attractor *A*, if the dimension of the reservoir state **r**(*t*) is sufficiently high, and if the dynamics of the coordinates of **r**(*t*) are sufficiently heterogeneous. If, for example, training uses linear regression of $\{\mathbf{u}(t)\} = \{\mathbf{h}(\mathbf{s}(t))\}$ versus {**r**(*t*)}, then since $\mathbf{r}(t) \approx \varphi(\mathbf{s}(t))$, the coordinates of the vector-valued function $\varphi(\mathbf{s})$ can be thought of as “basis functions”;^{6} training seeks a linear combination $\hat{\psi}$ of these basis functions that approximates **h**(**s**) on *A*. A suitable basis for training (using a linear or nonlinear combination) is plausible if the listening reservoir yields a sufficiently large variety of responses to its input.

### C. Prediction and attractor reconstruction

After training determines the post-processing function $\hat{\psi}$, prediction proceeds by initializing $\hat{\mathbf{r}}(t_0) = \mathbf{r}(t_0)$ and evolving $\hat{\mathbf{r}}(t)$ for *t* ≥ *t*_{0} according to the predicting reservoir (4). The reservoir state **r**(*t*_{0}) is determined by evolving the listening reservoir (2) for an interval of time preceding *t*_{0}; this could be the time interval used for training, or it could be a later time interval that uses inputs {**u**(*t*)} measured after training (we call this feature “training reusability”^{29}). We assume that the listening time preceding *t*_{0} is sufficiently long to achieve generalized synchronization, so that $\hat{\mathbf{r}}(t_0) = \mathbf{r}(t_0) \approx \varphi(\mathbf{s}(t_0))$ is near $\varphi(A)$. For *t* ≥ *t*_{0}, the predicted value of **u**(*t*) is

$$\hat{\mathbf{u}}(t) = \hat{\psi}(\hat{\mathbf{r}}(t)). \qquad (6)$$

Consider now the idealized scenario that our approximations are instead exact relations: $\hat{\psi} = \psi$ on $\varphi(A)$, and $\hat{\mathbf{r}}(t_0) = \mathbf{r}(t_0) = \varphi(\mathbf{s}(t_0))$. Suppose hypothetically that the measurements {**u**(*t*)} for *t* ≥ *t*_{0} (these are the values we want to predict in practice) are available, so that we can evolve both the listening reservoir (2) depicted in Fig. 1 and the predicting reservoir (4) depicted in Fig. 2, and compare their outputs. Then we claim that the two reservoirs agree exactly: $\hat{\mathbf{r}}(t) = \mathbf{r}(t)$ and $\hat{\mathbf{u}}(t) = \mathbf{u}(t)$ for all *t* ≥ *t*_{0}. First notice that $\hat{\mathbf{u}}(t_0) = \hat{\psi}(\hat{\mathbf{r}}(t_0)) = \psi(\varphi(\mathbf{s}(t_0))) = \mathbf{h}(\mathbf{s}(t_0)) = \mathbf{u}(t_0)$. Then $\hat{\mathbf{r}}(t_0+\tau) = \mathbf{f}[\hat{\mathbf{r}}(t_0), \hat{\mathbf{u}}(t_0)] = \mathbf{f}[\mathbf{r}(t_0), \mathbf{u}(t_0)] = \mathbf{r}(t_0+\tau)$, and $\mathbf{r}(t_0+\tau) = \varphi(\mathbf{s}(t_0+\tau))$ due to generalized synchronization. Similarly, $\hat{\mathbf{u}}(t_0+\tau)$ then equals $\mathbf{u}(t_0+\tau)$, so $\hat{\mathbf{r}}(t_0+2\tau) = \mathbf{r}(t_0+2\tau) = \varphi(\mathbf{s}(t_0+2\tau))$, etc. This agreement between the trajectories also shows that $\varphi(A)$ is an invariant set for the idealized predicting reservoir

$$\hat{\mathbf{r}}(t+\tau) = \mathbf{f}[\hat{\mathbf{r}}(t), \psi(\hat{\mathbf{r}}(t))], \qquad (7)$$

and that its dynamics, observed through $\psi $, are equivalent to the dynamics of *A* observed through **h**.

Thus, if the time series {**u**(*t*)} of measurements has enough information to reconstruct the attractor *A*, then we can regard $\varphi(A)$ and the idealized predicting reservoir (7) as an exact reconstruction of *A* and its dynamics. When the approximation $\hat{\psi} \approx \psi$ is not exact on $\varphi(A)$, the actual predicting reservoir (4) is still initialized near $\varphi(A)$, but $\varphi(A)$ is only approximately invariant. The better the approximation, the more accurate the predictions $\hat{\mathbf{u}}(t) \approx \mathbf{u}(t)$ will be, at least in the short term. However, if the system (5) that generates the measurements {**u**(*t*)} is chaotic, the prediction error $\|\hat{\mathbf{u}}(t) - \mathbf{u}(t)\|$ will typically grow exponentially as *t* increases.

Nonetheless, it remains possible that $\hat{\mathbf{u}}(t)$ will maintain a climate similar to that of **u**(*t*) in the long term. This will happen if (and, practically speaking, only if) the predicting reservoir trajectory $\{\hat{\mathbf{r}}(t)\}$ remains close to $\varphi(A)$ for all time, and its attractor has a climate similar to that of the idealized predicting reservoir on $\varphi(A)$. In this sense, climate replication (attractor reconstruction) relies on both state-space stability and structural stability of the predicting reservoir near the idealized reconstructed attractor $\varphi(A)$.

Structural stability is difficult to ensure rigorously, but in practice small perturbations of the dynamics near an attractor tend to yield small perturbations to the climate. Thus, we argue that climate replication is likely if $\varphi (A)$, which according to our assumptions is invariant for the idealized predicting reservoir, is also attracting, in the sense described below.

### D. Stability and Lyapunov exponents

Recall that generalized synchronization implies that the set $\varphi(A)$ is attracting for the listening reservoir (2) when driven by $\mathbf{u}(t) = \mathbf{h}(\mathbf{s}(t))$, where **s**(*t*) evolves on *A*. Whether $\varphi(A)$ is attracting for the predicting reservoir is complicated by the fact that it is invariant only in the idealized case $\hat{\psi} = \psi$, and that $\psi$ is defined only on $\varphi(A)$, so that the idealized predicting reservoir (7) is also defined only on $\varphi(A)$. For its stability to be well-defined, the domain of $\psi$ must be extended to a neighborhood of $\varphi(A)$, and whether $\varphi(A)$ is attracting depends on how the extension is chosen.

Thus, the suitability of the empirically determined function $\hat{\psi}$ for climate prediction depends not only on how well it approximates $\psi$ on $\varphi(A)$ but also on how it behaves near $\varphi(A)$. For a particular $\hat{\psi}$, we consider hypothetically a particular extension of $\psi$ such that $\hat{\psi} \approx \psi$ near $\varphi(A)$. This extension gives the idealized predicting reservoir a full set of Lyapunov exponents on $\varphi(A)$, some of which correspond to infinitesimal perturbations tangent to $\varphi(A)$ and some of which correspond to infinitesimal perturbations transverse to $\varphi(A)$. Then $\varphi(A)$ is attracting if the transverse Lyapunov exponents are all negative, and unstable if there is a positive transverse Lyapunov exponent.

If the generalized synchronization function $\varphi $ is one-to-one and differentiable, then the tangential Lyapunov exponents of the system (5) on *A* are reproduced as the tangential Lyapunov exponents of the idealized predicting reservoir on $\varphi (A)$. Generalized synchronization does not always yield a differentiable $\varphi $,^{26,30} but even when differentiability cannot be guaranteed, it is possible in practice to reproduce much of the Lyapunov spectrum of *A*, including negative Lyapunov exponents in some cases, with a predicting reservoir.^{8}

We remark that unlike the conditional Lyapunov exponents for a drive-response system (such as the listening reservoir), which correspond to perturbations of the response system state, for the predicting reservoir it is not clear in advance which perturbations correspond to transverse Lyapunov exponents. However, in a numerical experiment where the equations for the driving system (5) and the reservoir are known, the existence or absence of a positive transverse Lyapunov exponent can be inferred by computing all of the positive Lyapunov exponents of the predicting reservoir and eliminating those that are Lyapunov exponents of *A*.

### E. Computation of Lyapunov exponents

We now describe how to estimate the Lyapunov exponents of the idealized predicting reservoir (7) on $\varphi(A)$, for a particular extension of $\psi$ to a neighborhood of $\varphi(A)$, from its empirical approximation $\hat{\psi}$. To do so, we assume that we have a formula for **f**, so that we can compute its Jacobian matrix. (We emphasize that we estimate the Lyapunov exponents in order to corroborate the theory we have presented; their computation, and a formula for **f**, are not needed for the reservoir prediction method we have described.) If climate replication is successful, we can simply generate a long trajectory of the predicting reservoir (4) and use it to compute the Lyapunov exponents of the trajectory.^{8} However, this trajectory cannot be expected to remain close to $\varphi(A)$ if that set is unstable. Nonetheless, if we have a sufficiently long time series {**u**(*t*)} of measurements, we can estimate the Lyapunov exponents of $\varphi(A)$, whether or not it is stable, as follows.

First, we use the time series {**u**(*t*)} to generate a trajectory {**r**(*t*)} of the listening reservoir (2); as we have argued, **r**(*t*) will approach $\varphi (A)$ under the conditions for generalized synchronization. Then along this trajectory, which is an approximate trajectory for the predicting reservoir, we compute Lyapunov exponents using the Jacobian matrix of the predicting reservoir (4).
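
The Jacobian-based computation just described is the standard QR (Benettin-type) algorithm for Lyapunov exponents along a trajectory. As a self-contained sketch, we apply it here to the Hénon map rather than to a reservoir, since its Jacobian is available in closed form; the same loop would be applied to the Jacobian of the predicting reservoir (4) evaluated along the listening trajectory {**r**(*t*)}.

```python
# QR method for Lyapunov exponents from Jacobians along a trajectory,
# illustrated on the Henon map (a stand-in for the predicting reservoir).
import numpy as np

a, b = 1.4, 0.3
step = lambda x, y: (1.0 - a * x**2 + y, b * x)
jacobian = lambda x, y: np.array([[-2.0 * a * x, 1.0],
                                  [b,            0.0]])

x, y = 0.1, 0.1
for _ in range(1000):            # discard the transient onto the attractor
    x, y = step(x, y)

Q = np.eye(2)
log_sums = np.zeros(2)
n_steps = 10000
for _ in range(n_steps):
    # Evolve an orthonormal frame with the Jacobian, re-orthonormalizing by QR;
    # the diagonal of R accumulates the local stretching rates.
    Q, R = np.linalg.qr(jacobian(x, y) @ Q)
    log_sums += np.log(np.abs(np.diag(R)))
    x, y = step(x, y)
lyapunov = log_sums / n_steps    # approximately (0.42, -1.62) for these parameters
```

A useful check on the computation: the exponents must sum to the average of $\ln|\det J|$, which for this map is $\ln b$.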

## III. NUMERICAL EXPERIMENTS

In this section, we give examples of short-term state and long-term climate predictions for the Lorenz system,^{31} with standard parameter values that yield chaotic trajectories,

$$\frac{dx}{dt} = 10(y - x), \qquad \frac{dy}{dt} = x(28 - z) - y, \qquad \frac{dz}{dt} = xy - \frac{8}{3}z. \qquad (8)$$

We consider the case where the measurement function **h** is the identity, so that $\mathbf{u}(t) = \mathbf{s}(t) = [x(t), y(t), z(t)]^{T}$. For the reservoir, we use an artificial neural network similar to the one used by Jaeger and Haas;^{5} our listening reservoir [a continuous-time version of Eq. (2)] evolves according to

$$\frac{d\mathbf{r}}{dt} = \gamma\bigl[-\mathbf{r}(t) + \tanh\bigl(\mathbf{M}\mathbf{r}(t) + \sigma\mathbf{W}_{\mathrm{in}}\mathbf{u}(t)\bigr)\bigr], \qquad (9)$$

where **r** is an *N*-dimensional vector, *γ* is a scalar, and **M** is an adjacency matrix representing internal network connections. The matrix $\sigma\mathbf{W}_{\mathrm{in}}$ consists of “input weights”; in our numerical results, we will fix $\mathbf{W}_{\mathrm{in}}$ and vary the scalar input strength *σ*. The vector function $\tanh$ is computed by applying the scalar hyperbolic tangent to each coordinate of its input vector. We compute trajectories of both the Lorenz and reservoir systems using the fourth-order Runge-Kutta method with time step *τ* = 0.001. We will show cases where climate replication (attractor reconstruction) succeeds and cases where it fails, and we compare the results with Lyapunov exponents we compute for the predicting reservoir.
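
The sketch below integrates the Lorenz system together with a listening reservoir of the form of Eq. (9) by fourth-order Runge-Kutta. Several simplifying assumptions are made to keep it small: the dimension is *N* = 100 rather than 2000, the input **u**(*t*) is held fixed across the substeps of each Runge-Kutta step, and the random matrices are merely in the spirit of the implementation details given in this section.

```python
# Listening reservoir, Eq. (9), driven by the Lorenz system via RK4.
# Reduced dimensions and a frozen-input RK4 step are simplifying assumptions.
import numpy as np

def lorenz(s):
    x, y, z = s
    return np.array([10.0 * (y - x), x * (28.0 - z) - y, x * y - 8.0 * z / 3.0])

def rk4(deriv, state, dt):
    k1 = deriv(state)
    k2 = deriv(state + 0.5 * dt * k1)
    k3 = deriv(state + 0.5 * dt * k2)
    k4 = deriv(state + dt * k3)
    return state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

rng = np.random.default_rng(seed=4)
N, gamma, sigma, tau = 100, 10.0, 0.014, 0.001
M = rng.uniform(-1, 1, (N, N))
M[rng.random((N, N)) > 0.02] = 0.0          # sparse random connectivity
M *= 0.9 / max(abs(np.linalg.eigvals(M)))   # spectral radius 0.9
W_in = np.zeros((N, 3))                     # one nonzero entry per row
W_in[np.arange(N), rng.integers(0, 3, N)] = rng.uniform(-1, 1, N)

s = np.array([1.0, 1.0, 1.0])
r = np.zeros(N)
for _ in range(5000):                       # 5 time units of listening
    drdt = lambda rr: gamma * (-rr + np.tanh(M @ rr + sigma * W_in @ s))
    r = rk4(drdt, r, tau)                   # u(t) = s(t) frozen over this step
    s = rk4(lorenz, s, tau)
```
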

We consider post-processing functions of the form $\hat{\psi}(\mathbf{r}) = \mathbf{W}_{\mathrm{out}}\mathbf{q}(\mathbf{r})$, where **q**(**r**) is the 2*N*-dimensional vector consisting of the *N* coordinates of **r** followed by their squares, and the “output weight” matrix $\mathbf{W}_{\mathrm{out}}$ is determined by a linear regression procedure described below. The listening reservoir (9) and the post-processing function are illustrated as an input-output system in Fig. 3. The goal of training is that the post-processed output $\mathbf{W}_{\mathrm{out}}\mathbf{q}(\mathbf{r}(t+\tau))$, based on input up to time *t*, estimates the subsequent input **u**(*t* + *τ*). Once $\mathbf{W}_{\mathrm{out}}$ is determined, the external input can be replaced in a feedback loop by the post-processed output to form the predicting reservoir, as depicted in Fig. 4. The predicting reservoir evolves according to

$$\frac{d\hat{\mathbf{r}}}{dt} = \gamma\bigl[-\hat{\mathbf{r}}(t) + \tanh\bigl(\mathbf{M}\hat{\mathbf{r}}(t) + \sigma\mathbf{W}_{\mathrm{in}}\mathbf{W}_{\mathrm{out}}\mathbf{q}(\hat{\mathbf{r}}(t))\bigr)\bigr], \qquad (10)$$

and the predicted value of **u**(*t*) is $\hat{\mathbf{u}}(t) = \hat{\psi}(\hat{\mathbf{r}}(t)) = \mathbf{W}_{\mathrm{out}}\mathbf{q}(\hat{\mathbf{r}}(t))$.

Details of our reservoir implementation are as follows. The reservoir dimension is *N* = 2000, and we use *γ* = 10. The *N*-by-*N* adjacency matrix **M** is chosen randomly with sparse Erdős-Rényi connectivity and spectral radius 0.9; specifically, each element is chosen independently to be nonzero with probability 0.02, nonzero elements are chosen uniformly between −1 and 1, and the resulting matrix is rescaled so that the magnitude of its largest eigenvalue is 0.9. The *N*-by-3 matrix **W**_{in} is chosen randomly so that each row has one nonzero element, chosen uniformly between −1 and 1. We evolve the Lorenz system and the listening reservoir (9) from time *t* = −100 to *t* = 60, and we discard 100 time units of transient evolution, so that training is based on **u**(*t*) and **r**(*t*) for 0 ≤ *t* ≤ 60. For training, we constrain the 3-by-2*N* matrix **W**_{out} to have only 3*N* nonzero elements, namely, the first *N* elements of its first two rows, and the first *N*/2 and last *N*/2 elements of its third row. (Thus, we fit the *x* and *y* coordinates of the Lorenz state with linear functions of **r**, and the *z* coordinate with a linear combination of the first *N*/2 coordinates of **r** and the squares of the second *N*/2 coordinates; for the Lorenz system, this is advantageous over using a purely linear function of **r**.^{8}) Subject to this constraint, we select **W**_{out} so as to minimize the error function

$$\sum_{0 \le k\Delta t \le T} \bigl\|\mathbf{W}_{\mathrm{out}}\mathbf{q}(\mathbf{r}(k\Delta t)) - \mathbf{u}(k\Delta t)\bigr\|^{2} + \beta\,\|\mathbf{W}_{\mathrm{out}}\|^{2}; \qquad (11)$$

here we have coarsely sampled the training data every 0.02 time units in order to reduce the amount of computation required by the regression. The second term in the error function modifies ordinary linear least-squares regression in order to discourage overfitting; this modification is often called ridge regression or Tikhonov regularization. Below, we will show results with regularization parameter *β* = 10^{−6} and with *β* = 0 (no regularization). We begin prediction by initializing $r\u0302(T)=r(T)$ and evolving the predicting reservoir (10), where *T *=* *60 is the end of the listening and training periods.
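
The regularized fit with the squared-feature vector $\mathbf{q}(\mathbf{r}) = [\mathbf{r}; \mathbf{r}^2]$ can be sketched as follows. The reservoir states and targets here are synthetic stand-ins for the listening trajectory, and the sparsity constraint on $\mathbf{W}_{\mathrm{out}}$ described above is omitted for simplicity.

```python
# Tikhonov-regularized fit of W_out over the squared-feature vector
# q(r) = [r; r^2]. States and targets are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(seed=5)
N, n_samples, beta = 100, 400, 1e-6
R = np.tanh(rng.normal(size=(N, n_samples)))   # stand-in states r(t)
U = rng.normal(size=(3, n_samples))            # stand-in targets u(t)

Q = np.vstack([R, R**2])                       # q(r): 2N-dimensional features
# Normal equations for the ridge minimizer of Eq. (11) without the constraint:
W_out = U @ Q.T @ np.linalg.inv(Q @ Q.T + beta * np.eye(2 * N))
U_fit = W_out @ Q                              # fitted outputs psi_hat(r(t))
```
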

In Fig. 5, we show the actual *z*(*t*) from a trajectory of the Lorenz system, and predictions $\hat{z}(t)$ from two reservoirs that are identical except for their input strength parameter values [*σ* = 0.012 for Fig. 5(a) and *σ* = 0.014 for Fig. 5(b)]. Each reservoir is trained with the same Lorenz trajectory and with regularization parameter *β* = 10^{−6}. Both reservoirs predict the short-term future similarly well, but for larger values of the prediction time *t* − *T*, only the second prediction continues with a Lorenz-like climate. We compare the two climate predictions over a longer time period in Fig. 6, which shows Poincaré return maps of successive *z*(*t*) maxima. In Fig. 6(a), the red dots (showing the reservoir prediction) initially are near the blue dots (representing the Lorenz attractor), but eventually the red dots approach a period-two orbit, indicated by the arrows. The large distance of the upper left arrow from the blue dots indicates that this period-two orbit for the reservoir is not on the Lorenz attractor. In contrast, the red dots in Fig. 6(b) remain near the blue dots at all times, indicating that the reservoir replicates the climate in the long term.

Based on the arguments in Sec. II A, we hypothesize that for both *σ* = 0.012 and *σ* = 0.014, the listening reservoir (9) evolves toward a set $\varphi_\sigma(A)$, where *A* is the Lorenz attractor and $\varphi_\sigma$ is a generalized synchronization function. Our choice of spectral radius 0.9 for the adjacency matrix **M** is consistent with common practice in reservoir computing,^{32} although it does not guarantee uniform contraction for the listening reservoir.^{23} However, it does guarantee that the eigenvalues of the Jacobian matrix of the right side of (9), evaluated at **r** = **u** = **0**, have real parts at most *γ*(–1 + 0.9) = 10(–0.1) = –1. This suggests an asymptotic contraction rate of –1 or faster for the listening reservoir, and hence that after discarding 100 transient time units, **r**(*t*) is extremely close to $\varphi_\sigma(A)$ for *t* ≥ 0.

Based on the arguments in Sec. II C, we hypothesize that the set $\varphi_\sigma(A)$ is approximately invariant for the predicting reservoir (10). Based on the results in Figs. 5 and 6, we hypothesize further that for *σ* = 0.014, there is an attracting invariant set for the predicting reservoir near $\varphi_\sigma(A)$, but that between *σ* = 0.014 and *σ* = 0.012, there is a bifurcation that causes this invariant set either to become unstable or to be destroyed entirely. To corroborate this hypothesis, we compute the Lyapunov exponents of the predicting reservoir for an approximate trajectory on $\varphi_\sigma(A)$, as described in Sec. II E.

Figure 7 shows the three largest Lyapunov exponents of the predicting reservoir (10) as the input strength *σ* varies from 0.004 to 0.02. We do not change the matrices **M** and **W**_{in}, but for each value of *σ*, we perform a separate training (with *β* = 10^{−6} as before), resulting in a different output weight matrix **W**_{out}. The exponents colored red and blue approximate the positive and zero Lyapunov exponents of the Lorenz attractor *A* (the approximation is closest for *σ* ≥ 0.01). Reproduction of the positive exponent of *A* in the reservoir dynamics on $\varphi_\sigma(A)$ is a necessary consequence of successful attractor reconstruction and does not indicate instability of $\varphi_\sigma(A)$ to transverse perturbations. The exponent colored green estimates the largest of the transverse Lyapunov exponents described in Sec. II D. This exponent passes through zero, indicating a bifurcation, at *σ* ≈ 0.013.
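The standard tangent-space method for such Lyapunov exponent computations (evolve an orthonormal basis of perturbations along the trajectory, re-orthonormalizing with QR factorizations) can be sketched generically as below; for a self-contained demonstration we apply it to the Hénon map rather than reproducing the full predicting reservoir here:

```python
import numpy as np

def lyapunov_exponents(step, jac, x0, n_steps):
    """Estimate all Lyapunov exponents of a discrete-time map by evolving
    an orthonormal tangent basis along the trajectory, re-orthonormalizing
    with a QR factorization at each step and averaging log |R_ii|."""
    x = np.asarray(x0, dtype=float)
    dim = x.size
    Q = np.eye(dim)
    log_sums = np.zeros(dim)
    for _ in range(n_steps):
        Q, R = np.linalg.qr(jac(x) @ Q)
        log_sums += np.log(np.abs(np.diag(R)))
        x = step(x)
    return log_sums / n_steps

# Demonstration on the Henon map with standard parameters (a, b) = (1.4, 0.3);
# the largest exponent is known to be approximately 0.42, and the exponents
# sum to ln(b) because |det(Jacobian)| = b at every point.
a, b = 1.4, 0.3
henon = lambda v: np.array([1.0 - a * v[0]**2 + b * v[1], v[0]])
henon_jac = lambda v: np.array([[-2.0 * a * v[0], b], [1.0, 0.0]])
exponents = lyapunov_exponents(henon, henon_jac, [0.1, 0.1], 20000)
```

For a continuous-time system such as the predicting reservoir, the same scheme applies with `jac(x)` replaced by the linearized flow map over one integration step.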

Next, we compare the change in stability indicated by the computed transverse Lyapunov exponent to a more direct computation indicating success or failure of climate replication. To detect when the prediction $\hat{\mathbf{u}}(t)=\hat{\psi}(\hat{\mathbf{r}}(t))$ of the Lorenz state diverges from the true Lorenz attractor, we let Δ(*t*) be the Euclidean distance between the vector field $d\hat{\mathbf{u}}/dt$ implied by the predicting reservoir and the vector field (right-hand side) of the Lorenz system (8), evaluated at $[x,y,z]^T=\hat{\mathbf{u}}(t)$. [We calculate the reservoir-implied vector field by the chain rule $d\hat{\mathbf{u}}/dt=D\hat{\psi}(\hat{\mathbf{r}}(t))\,d\hat{\mathbf{r}}/dt$, where $D\hat{\psi}$ is the Jacobian matrix of $\hat{\psi}=\mathbf{W}_{\rm out}\mathbf{q}$, and $d\hat{\mathbf{r}}/dt$ is given by Eq. (10).] For each value of *σ* depicted in Fig. 7, we calculate the vector field discrepancy Δ(*t*) for the prediction time period *t* ≥ *T*. If Δ(*t*) does not exceed a threshold value of 20 for a duration of 800 time units, we consider the climate to be approximately reproduced. (Our threshold value 20 is small compared to the typical magnitude of the Lorenz vector field.) Otherwise, we say that the prediction has “escaped” from the Lorenz attractor. In Fig. 7, we show a black dot at each value of *σ* for which we detect escape; these values are the same as those for which the computed transverse Lyapunov exponent is positive. The height of each black dot represents an observed divergence rate *λ**, computed as follows.
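A minimal sketch of this discrepancy computation, using the feature vector $\mathbf{q}(\mathbf{r}) = (\mathbf{r}, \mathbf{r}^2)$ implied by the training setup above (the helper names are ours):

```python
import numpy as np

def lorenz_rhs(u, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system; standard parameters assumed."""
    x, y, z = u
    return np.array([sigma * (y - x), x * (rho - z), x * y - beta * z])

def discrepancy(W_out, r_hat, dr_hat_dt):
    """Delta(t): Euclidean distance between the reservoir-implied vector
    field and the Lorenz vector field, evaluated at u_hat = W_out q(r_hat).

    With q(r) = (r, r^2), the chain rule gives du_hat/dt = W_out dq/dt,
    where dq/dt = (dr/dt, 2 r dr/dt) elementwise."""
    q = np.concatenate([r_hat, r_hat**2])
    u_hat = W_out @ q
    dq_dt = np.concatenate([dr_hat_dt, 2.0 * r_hat * dr_hat_dt])
    du_hat_dt = W_out @ dq_dt
    return np.linalg.norm(du_hat_dt - lorenz_rhs(u_hat))
```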

When we detect escape for a particular value of *σ*, we reinitialize the predicting reservoir (10) using $\hat{\mathbf{r}}(t_0)=\mathbf{r}(t_0)$ for 1000 different values of *t*_{0} ≥ *T*, where the values of **r**(*t*_{0}) are determined by continuing to run the listening reservoir (9) for *t* ≥ *T*. For each *t*_{0}, we evolve the predicting reservoir until the first time *t*_{1} for which Δ(*t*_{1}) ≥ 20, or until *t*_{1} − *t*_{0} = 800, whichever comes first. If divergence from the attractor is governed by Lyapunov exponent *λ*, we should have $\Delta(t_1)\approx\Delta(t_0)\exp(\lambda(t_1-t_0))$ in a certain average sense. We compute the observed exponential divergence rate $\lambda^*=\langle\ln[\Delta(t_1)/\Delta(t_0)]\rangle/\langle t_1-t_0\rangle$, where the angle brackets represent an average over the 1000 values of *t*_{0}. The computed values of *λ** are shown as black dots in Fig. 7. The approximate agreement of *λ** with the green curve (especially for 0.01 ≤ *σ* ≤ 0.013) demonstrates that the computed transverse Lyapunov exponent reflects divergence of predictions from the Lorenz attractor.
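The averaging step in this estimator is straightforward; a sketch (function name ours):

```python
import numpy as np

def observed_divergence_rate(delta_t0, delta_t1, durations):
    """lambda* = <ln(Delta(t1)/Delta(t0))> / <t1 - t0>, where the angle
    brackets average over the restart times t0 (1000 of them in the text)."""
    delta_t0 = np.asarray(delta_t0, dtype=float)
    delta_t1 = np.asarray(delta_t1, dtype=float)
    return np.mean(np.log(delta_t1 / delta_t0)) / np.mean(durations)
```

Note that if Δ grows exactly exponentially at rate *λ* over every interval, the estimator returns *λ* exactly, since the numerator averages *λ*(*t*_{1} − *t*_{0}) and the denominator averages *t*_{1} − *t*_{0}.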

To illustrate the correspondence between the computed transverse Lyapunov exponent and the observed divergence rates in a case where their dependence on *σ* is more complicated, we show in Fig. 8 the analogue of Fig. 7 in a case where no regularization (*β* = 0) is used in the training. Again, we see precise correspondence between detected failure of climate replication (presence of a black dot) and positive values of the transverse Lyapunov exponent (green curve), and good agreement with the observed divergence rates for these values of *σ*. In this case, there are two bifurcations, one near *σ* = 0.012 and one near *σ* = 0.016.

We remark that when we use regularization (*β* = 10^{−6}) in the training, we do not observe as complicated a dependence of the computed transverse Lyapunov exponent on the input strength *σ* as in Fig. 8. Instead, the computed transverse Lyapunov exponent is typically negative and slowly varying across a wide range of *σ* values, for which climate replication is successful. In Fig. 9, we use the transverse Lyapunov exponent computation, averaged over 10 different randomly generated reservoirs, to give a quantitative illustration of the advantage of regularization. When regularization is used, the negative means and small standard deviations of the computed transverse Lyapunov exponent indicate robust climate stability over the entire range 0.05 ≤ *σ* ≤ 0.5. (By contrast, Figs. 5–7 depicted values of *σ* ≤ 0.02.) With no regularization, the means are larger and more variable, indicating less stability and greater sensitivity to the value of *σ*, and the standard deviations are significantly larger, indicating lack of robustness from one random reservoir realization to another.

## IV. CONCLUSIONS AND DISCUSSION

We presented in Sec. II a partial explanation for how reservoir computing prediction is able to reconstruct the attractor (replicate the climate) for a chaotic process from limited time series data. We argued that the reservoir dynamics (2) can be designed so that during the listening period on which training is based, the reservoir state **r**(*t*) is approximately a continuous function $\varphi$ of the state **s**(*t*) of the chaotic process. This property, called generalized synchronization, is closely related to the echo state property for reservoir computing. We showed that both properties hold if the listening reservoir (2) is uniformly contracting as a function of the reservoir state; other criteria for these properties have also been identified.^{23,26,32}

Ideally, the synchronization function $\varphi$ should be one-to-one in order to recover the process dynamics from the reservoir dynamics. Investigation of conditions that can guarantee $\varphi$ to be one-to-one could help guide reservoir design. However, even in the absence of a guarantee, we noted that embedding results suggest that $\varphi$ is likely to be one-to-one if the reservoir state space is sufficiently high-dimensional compared with the dimensionality of the chaotic process.

Practically speaking, a necessary condition for climate replication is that training should be successful in approximately recovering the measured state $\mathbf{u}(t)=\mathbf{h}(\mathbf{s}(t))$ from the reservoir state **r**(*t*); this depends on the amount of training data available and the method of regression used, among other things. We did not address the theoretical aspects of training, but we argued that success is plausible if the reservoir is sufficiently high-dimensional and heterogeneous to yield a large variety of basis functions for the regression.

We showed that in the limit that the approximations we described are exact, the predicting reservoir (4) exactly predicts future values of **u**(*t*). Thus, accurate approximations yield commensurately accurate short-term forecasts. Long-term climate replication depends on stability of the predicting reservoir dynamics with respect to perturbations produced by the approximations. We discussed how to estimate Lyapunov exponents for the predicting reservoir in numerical experiments, whether or not the desired climate is stable. We emphasize that our computation of Lyapunov exponents was intended to illustrate our theory and that the method we described requires measurements {**u**(*t*)} over a long time period to maintain the desired climate. If one's goal is to estimate the Lyapunov exponents of the process that produced {**u**(*t*)} from a limited amount of data, one should seek parameters of the predicting reservoir that replicate the climate and simply compute the Lyapunov exponents of the resulting trajectory.^{8}

In Sec. III, we gave examples of climate replication successes and failures, and showed how they correspond to the Lyapunov exponents we computed. We emphasize that the results and the ranges of *σ* we displayed were selected to illustrate and analyze failures that can occur with inadequate input strength (Figs. 5–7) or without regularization (Fig. 8) in the training. With regularization, we are able to obtain robust climate replication [indicated by Fig. 9(a)] over a wide range of input strengths.

We remark that for simplicity, our theory considered discrete-time reservoir dynamics. Discrete time is the appropriate way to model software reservoirs, but physical reservoirs typically are better modeled by continuous time. With appropriate modifications, our theory applies to the continuous-time case. The prediction time increment *τ* used in the training should be the amount of time information takes to traverse the feedback loop depicted in Fig. 4. However, with a physical reservoir, careful calibration of the sampled training data may be necessary to meet the goal of predicting **u**(*t* + *τ*) based on the listening reservoir's response to input up to time *t*, in part because *τ* is a property of the predicting reservoir and not of the listening reservoir.

Finally, we argue that in addition to reservoir computing, the theory we presented in Sec. II applies to some other machine learning methods for time series prediction. The essential features a prediction method needs for our theory to apply are: (1) that the method maintains an internal state, or “memory,” that depends on the sequence of inputs it receives during training; (2) that it is trained to predict a short time increment ahead, after receiving the input time series for a relatively long time interval; and (3) that it is used to predict farther into the future by iterating its incremental forecasts through a feedback loop. These features are present, for example, in prediction using the FORCE method for training reservoirs^{33} and in recent work using long short-term memory (LSTM) networks for prediction.^{18} For methods that (unlike reservoir computing) train parameters that affect the internal state in the absence of feedback, our theory applies if we take the function **f** in Eq. (2) to represent the update rule for the internal state **r** after training has selected parameter values. Although our description of how training arrives at the pair of functions $(\mathbf{f},\hat{\psi})$ was specific to reservoir computing, our discussion of how these functions can be used with Eqs. (4) and (6) for prediction and attractor reconstruction is independent of which machine-learning method is used to determine the functions.

## ACKNOWLEDGMENTS

We gratefully acknowledge the support of grants from ARO (W911NF-12-1-0101) and DARPA. We thank Michelle Girvan, Jaideep Pathak, Sarthak Chandra, Daniel Gauthier, and Daniel Canaday for their input.