A machine-learning approach called “reservoir computing” has been used successfully for short-term prediction and attractor reconstruction of chaotic dynamical systems from time series data. We present a theoretical framework that describes conditions under which reservoir computing can create an empirical model capable of skillful short-term forecasts and accurate long-term ergodic behavior. We illustrate this theory through numerical experiments. We also argue that the theory applies to certain other machine learning methods for time series prediction.

A long-standing problem is prediction and analysis of data generated by a chaotic dynamical system whose equations of motion are unknown. Techniques based on delay-coordinate embedding have been successful for sufficiently low-dimensional systems. Recently, machine-learning approaches such as reservoir computing have shown promise in treating a larger class of systems. We develop a theory of how prediction with reservoir computing or related machine-learning methods can “learn” a chaotic system well enough to reconstruct the long-term dynamics of its attractor.

Reservoir computing1–4 is a machine-learning approach that has demonstrated success at a variety of tasks, including time series prediction5–8 and inferring unmeasured variables of a dynamical system from measured variables.9,10 In this approach, a “reservoir” is a high-dimensional, non-autonomous (driven) dynamical system, chosen independently of the task. A particular task provides an input time series, and the reservoir state as a function of time is regarded as a “raw” output time series, which is post-processed to fit the task. The post-processing function is determined, typically by linear regression, from a limited-time “training” data set consisting of the desired output time series for a given input time series.

Reservoir computing can be performed entirely in software, typically with an artificial neural network model, or with a physical reservoir; examples of the latter include a bucket of water,11 an electronic circuit with a time delay,12 a field-programmable gate array (FPGA),13 an optical network of semiconductor lasers,14 and an optoelectronic phase-delay system.15 Other machine-learning techniques, including deep learning,16,17 attempt to optimize internal system parameters to fit the training data; doing so requires a mathematical model of the machine-learning system. By contrast, reservoir computing does not require a model for the reservoir, nor the ability to alter the reservoir dynamics, because it seeks only to optimize the parameters of the post-processing function. The ability to use a physical reservoir as a “black box” allows for various potential advantages over other machine-learning techniques, including greatly enhanced speed.

In this article, we consider the task of predicting future measurements from a deterministic dynamical system, whose equations of motion are unknown, from limited time series data. We describe a general framework that includes the reservoir computing prediction method proposed by Jaeger and Haas.5 With appropriate modifications, the same framework applies to other machine-learning methods for time series prediction [including a long short-term memory (LSTM) approach18] as we discuss further in Sec. IV. We assume the vector u(t) of measurements to be a function h of the finite-dimensional system state s(t)

u(t)=h(s(t)).
(1)

For simplicity, we assume that there is no measurement noise although our discussion below could be modified for the case that Eq. (1) is an approximation. We do not assume that h is invertible nor that h or s is known in practice. Training data consists of a finite time series {u(t)} of measurements. We predict future values of u(t) by a sequence of three steps, which we call listening, training, and predicting.

Listening consists of using the training time series as input to the reservoir, which we model as a discrete time deterministic process

r(t+τ)=f[r(t),u(t)].
(2)

Here, r(t) is the reservoir state, τ is a time increment, and we assume f to be a differentiable function. We emphasize that in practice, a formula for f need not be known; only its outputs are used for training and prediction. For convenience, we assume that the full reservoir state r(t) can be measured or computed although our arguments can be modified easily for the more general case that the reservoir output is a function of its internal state. We call Eq. (2) the “listening reservoir.”
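To make the listening step concrete, the following minimal Python sketch drives a reservoir map of the form of Eq. (2) with an input time series and records the resulting states. The particular map f used here (a contracting tanh network with randomly chosen matrices A and Win) and all dimensions are illustrative assumptions, not the implementation studied later in this article.

```python
import numpy as np

def listen(f, r0, inputs):
    """Drive the listening reservoir r(t + tau) = f(r(t), u(t)) [Eq. (2)]
    with a measured input series and record the reservoir states."""
    r, states = r0, []
    for u in inputs:
        r = f(r, u)              # next reservoir state
        states.append(r)
    return np.array(states)      # shape (T, N)

# A hypothetical choice of f: a uniformly contracting tanh network.
rng = np.random.default_rng(0)
N, d = 200, 3                    # reservoir and input dimensions (assumed)
A = rng.standard_normal((N, N))
A *= 0.9 / np.linalg.norm(A, 2)  # operator norm 0.9 < 1, so f contracts in r
W_in = rng.standard_normal((N, d))
f = lambda r, u: np.tanh(A @ r + W_in @ u)

inputs = rng.standard_normal((1000, d))   # placeholder for the measurements u(t)
states = listen(f, np.zeros(N), inputs)
```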

Training consists of determining a post-processing function ψ̂ that, when applied to the reservoir output r(t + τ), estimates the next input u(t + τ). (We view ψ̂ as an approximation to an “ideal” post-processing function ψ, to be introduced in Sec. II B.) Thus, the goal of training is to find ψ̂ such that ψ̂(r(t+τ)) ≈ u(t+τ), or equivalently

ψ̂(r(t)) ≈ u(t),
(3)

for t large enough that the listening reservoir (2) has evolved beyond transient dynamics. We compute ψ̂ by a fitting procedure, such as linear regression, on the training time series {u(t)} and the corresponding time series {r(t)} determined from the listening reservoir (2).
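Continuing the sketch above, one common way to carry out the training step (an assumption for illustration, not the only option) is ridge-regularized linear regression for a linear post-processing function ψ̂(r) = Wout r; the alignment of states and targets below follows Eq. (3).

```python
import numpy as np

def train_linear_readout(states, targets, beta=1e-6):
    """Fit W_out so that W_out r(t) approximates u(t) [Eq. (3)] by ridge regression.
    states:  (T, N) reservoir states r(t), transients discarded
    targets: (T, d) corresponding measurements u(t)
    beta:    regularization strength (assumed value)"""
    R, U = states, targets
    N = R.shape[1]
    W_out = np.linalg.solve(R.T @ R + beta * np.eye(N), R.T @ U).T
    return W_out                 # psi_hat(r) = W_out @ r

# Usage with the listening sketch: states[k] is the reservoir response to
# inputs[0..k], so it is paired with the next measurement inputs[k+1].
# W_out = train_linear_readout(states[100:-1], inputs[101:])
```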

Predicting then proceeds by modifying the reservoir to run autonomously with a feedback loop, replacing its input [u(t) in Eq. (2)] with its post-processed output from the previous time increment

r̂(t+τ)=f[r̂(t),ψ̂(r̂(t))].
(4)

We call Eq. (4) the “predicting reservoir.” When initialized (from the listening reservoir state) with r̂(t0)=r(t0), iterating the predicting reservoir yields a time series {ψ̂(r̂(t0+τ)), ψ̂(r̂(t0+2τ)), …} of predictions for future measurements {u(t0+τ), u(t0+2τ), …}. (Our notation reflects the fact that for t > t0, the predicting reservoir state r̂(t) estimates the state r(t) that would result from evolving the listening reservoir (2) with the future measurements.)
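The prediction step of Eq. (4) then closes the feedback loop; under the same assumed map f and linear ψ̂ as in the sketches above, it can be written as follows.

```python
import numpy as np

def predict(f, W_out, r0, n_steps):
    """Run the predicting reservoir [Eq. (4)], feeding the post-processed
    output back in as input, starting from a listening state r0 = r(t0)."""
    r, preds = r0, []
    for _ in range(n_steps):
        r = f(r, W_out @ r)      # r_hat(t + tau) = f(r_hat(t), psi_hat(r_hat(t)))
        preds.append(W_out @ r)  # prediction of u(t + tau) [Eq. (6)]
    return np.array(preds)

# Usage: initialize from the final listening state and forecast 500 steps.
# forecast = predict(f, W_out, states[-1], 500)
```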

The reservoir prediction method we have described has been shown to produce successful short-term forecasts for a variety of dynamical systems.5,7,8 If the system has a chaotic attractor, then, as for any imperfect model, the prediction error ψ̂(r̂(t)) − u(t) cannot remain small for t ≫ t0. However, in some cases, the long-term time series {ψ̂(r̂(t))} continues to behave like the measurements from a typical trajectory on the attractor, and in this sense the predicting reservoir (4) approximately reproduces the ergodic properties of the dynamical system that generated the measurements.8 We refer to this ability, often called attractor reconstruction, as replication of the “climate.”

In this article, we develop and illustrate a theory of how reservoir prediction is able to “learn” the dynamics of a system well enough to produce both accurate short-term forecasts and accurate long-term climate. We make use of the notion of generalized synchronization,19–22 which in our context means that the reservoir state r(t) becomes asymptotically a continuous function ϕ of s(t), in the limit that the listening reservoir (2) is run infinitely long. In Sec. II, we argue that the following four conditions are sufficient for both short-term prediction and attractor/climate replication.

  1. The listening reservoir (2) achieves generalized synchronization with the process {s(t)}, so that r(t) ≈ ϕ(s(t)) for a continuous function ϕ, within the time interval covered by the training time series.

  2. The synchronization function ϕ is one-to-one, or at least carries enough information about its input to recover u(t) = h(s(t)) from ϕ(s(t)).

  3. Training is successful in finding a function ψ̂ such that Eq. (3) holds, or equivalently in view of generalized synchronization, that ψ̂(ϕ(s(t))) ≈ h(s(t)).

  4. The attractor approached by the listening reservoir is also stable for the predicting reservoir (4).

Conditions 1–3 enable short-term prediction. Condition 4 ensures that the climate established by generalized synchronization of the listening reservoir is preserved when its input is replaced by a feedback term to form the predicting reservoir. One of the main points of Sec. II is to precisely formulate the stability condition described in condition 4.

We remark that generalized synchronization of the listening reservoir6,10 is related to the “echo state property,”1,23 which states that an infinite history of inputs {u(t−τ), u(t−2τ), …} uniquely determines r(t), subject to the condition that the trajectory {r(t)} is bounded. Indeed, if {s(t)} is a trajectory of an invertible dynamical system, then the past inputs are functions of s(t), so the echo state property implies that if the listening reservoir (2) has run for an infinite period of time in a bounded domain, then r(t) is a function of s(t) [though it does not imply that this function is continuous]. We believe that for the reservoir prediction method we described, it is desirable (though not strictly necessary) to have the echo state property and generalized synchronization. In Sec. II A, we show why both properties hold if the listening reservoir is uniformly contracting as a function of r, and that we can quantify the amount of transient time it takes for the reservoir to achieve the approximation r(t) ≈ ϕ(s(t)) of condition 1.

Conditions 2 and 3 are significantly more difficult to ensure a priori. In Sec. II B, we argue why it is plausible that these conditions can be achieved. In Secs. II C and II D, we describe the consequences of conditions 1–3 for short-term prediction and formulate more precisely the stability criterion of condition 4 that determines whether the correct attractor and climate are approximately reproduced by the long-term dynamics of the predicting reservoir (4). In Sec. II E, we describe how a model for the reservoir dynamics can be used to compute Lyapunov exponents that reflect climate stability.

In Sec. III, we give examples of short-term state and long-term climate predictions using the Lorenz equations as our input system. In addition to a case where the climate is approximated well, we show a case where the predicted climate is inaccurate, although the short-term forecast is still reasonably accurate. We compute the Lyapunov exponents of the predicting reservoir (4) and show that the transition from accurate climate to inaccurate climate corresponds to a Lyapunov exponent crossing zero. When this Lyapunov exponent is positive but close to zero, the reservoir prediction remains close to the correct climate for a transient period, and we relate the average duration of this transient to the value of the Lyapunov exponent.

We consider the application of the reservoir prediction method described in the introduction to a time series {u(t)} that is a function h of a trajectory {s(t)} of the dynamical system

s(t+τ)=g(s(t)),
(5)

where g is differentiable and invertible, and we assume that s(t) evolves on a bounded attractor A. In preparation for training and prior to prediction, the reservoir state r(t) evolves according to the listening reservoir (2). The system described by Eqs. (5) and (2), coupled by Eq. (1), is often called a drive-response, skew-product, or one-way coupled system. The coupled system dynamics are illustrated by Fig. 1. We next consider the evolution of the coupled system as t → ∞.

FIG. 1. Drive-response system dynamics, with the drive state s(t) coupled to the listening reservoir state r(t) through the measurement vector u(t).

The goal of training can be regarded as finding a post-processing function ψ̂ such that ψ̂(r(t)) is in approximate identical synchronization20 with u(t)=h(s(t)), when r(t) is evolved with the listening reservoir (2). The desired relationship u(t) ≈ ψ̂(r(t)) can also be thought of as approximate generalized synchronization between u(t) [or the underlying state s(t)] and r(t). The existence of such a relationship would be implied by stochastic synchronization,19 which in our context means a one-to-one correspondence between r(t) and s(t) in the limit t → ∞. However, in drive-response systems, the definition of generalized synchronization21,22 requires only that the response state be asymptotically a function of the drive state: in our case, that there is a continuous function ϕ such that r(t) − ϕ(s(t)) → 0 as t → ∞. The existence of such a ϕ is typically easier to establish than its invertibility. Next, we describe conditions on the reservoir system f that guarantee generalized synchronization.

Although weaker conditions are possible, we assume uniform contraction for f, as is often the case in practice. By uniform contraction, we mean that there is some ρ < 1 such that for all r1, r2, and u we have that |f(r1,u) − f(r2,u)| < ρ|r1 − r2|. It then follows that two trajectories {r1(t), u(t)} and {r2(t), u(t)} of (2) with the same input time series approach each other exponentially: |r1(t) − r2(t)| ≤ |r1(0) − r2(0)| ρ^(t/τ). Thus, for a given input time series {u(t)}, the reservoir state r(t) is asymptotically independent of its initial state; this is essentially what Jaeger1 called the “echo state property.” Furthermore, because g is invertible and A is bounded, and due to the results of Hirsch, Pugh, and Shub24,25 (a direct proof is given by Stark26), uniform contraction implies generalized synchronization, as defined above. (In general, the synchronization function ϕ cannot be determined analytically from f, g, and h.) A weaker form of generalized synchronization can also be guaranteed26 from the non-uniform contraction implied by negative conditional Lyapunov exponents.

We remark that if the listening reservoir (2) is uniformly contracting, then r(t) − ϕ(s(t)) converges to zero exponentially. If the designer of the reservoir can guarantee a specific contraction rate ρ, this determines the convergence rate, so that the amount of transient time needed to make the approximation r(t) ≈ ϕ(s(t)) accurate can be known in practice.
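The contraction argument is easy to check numerically: two copies of the listening reservoir, started from different states but driven by the same input sequence, should converge at a rate set by the contraction factor ρ. The sketch below uses the same assumed tanh map as in the earlier sketches, rescaled so that ρ ≤ 0.5.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 3
A = rng.standard_normal((N, N))
A *= 0.5 / np.linalg.norm(A, 2)        # operator norm 0.5, so rho <= 0.5
W_in = rng.standard_normal((N, d))
f = lambda r, u: np.tanh(A @ r + W_in @ u)

r1, r2 = rng.standard_normal(N), rng.standard_normal(N)
for t in range(51):
    u = rng.standard_normal(d)         # the same input drives both copies
    r1, r2 = f(r1, u), f(r2, u)
    if t % 10 == 0:
        print(t, np.linalg.norm(r1 - r2))   # decays roughly like 0.5**t
```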

Generalized synchronization implies that the set of (s, r) such that s is on its attractor A and r=ϕ(s) is an attractor for the drive-response system given by Eqs. (5), (1), and (2). Below we will use the fact that this set is invariant: r(t)=ϕ(s(t)) implies r(t+τ)=ϕ(s(t+τ)).

Recall that training seeks a function ψ̂ that predicts the current measurement vector u(t) from the current listening reservoir state r(t) [which is computed from past measurements] and that when generalized synchronization is achieved, accuracy of this prediction is equivalent to ψ̂(ϕ(s(t))) ≈ h(s(t)). For the rest of Sec. II, we assume that there is a function ψ defined on ϕ(A) such that ψ(ϕ(s))=h(s) for all s in A. This assumption means that in the asymptotic limit of generalized synchronization, the listening reservoir state r(t)=ϕ(s(t)) uniquely determines u(t) = h(s(t)). The goal of training can then be described as finding a function ψ̂ defined on the state space of the reservoir that approximates ψ on the set ϕ(A). We summarize our notation in Table I.

TABLE I. Summary of notation.

Dynamical system to be predicted:
  s(t)                         System state
  g: s(t) → s(t+τ)             System evolution
  A                            Attractor for s(t)

Measurements:
  u(t)                         Measurement vector
  h: s(t) → u(t)               Measurement function

Reservoir:
  r(t)                         Listening reservoir state
  f: [r(t), u(t)] → r(t+τ)     Listening reservoir evolution
  r̂(t)                         Predicting reservoir state
  û(t) = ψ̂(r̂(t))               Predicted measurements
  f: [r̂(t), û(t)] → r̂(t+τ)     Predicting reservoir evolution

Generalized synchronization:
  ϕ: s → r for s in A          Synchronization function
  ψ: r → u for r in ϕ(A)       Ideal post-processing function
  ψ̂: r̂(t) → û(t)               Actual post-processing function

Although the existence of ψ is not strictly necessary for the reservoir to make useful predictions, if no such ψ exists, then it seems unlikely that training can successfully achieve the desired approximation ψ̂(ϕ(s(t))) ≈ h(s(t)), and thus unlikely that u(t) can be approximated as a function of the reservoir state during either listening or predicting. The existence of ψ is guaranteed if ϕ is one-to-one on A; then ψ = h∘ϕ⁻¹. Furthermore, if h is one-to-one on A (in other words, the measurements at a given time determine the system state), then ϕ must be one-to-one on A in order for ψ to exist. Thus, we propose that a goal of reservoir design should be to yield a one-to-one synchronization function ϕ for a variety of input systems. In practice, having a sufficiently high-dimensional reservoir may suffice; embedding results27,28 imply that if the dimension of the reservoir state r is more than twice the dimension of A, functions from A to the reservoir state space are typically one-to-one. We note that in practice, the dimension of r must be much larger than twice the dimension of A in order to provide a suitable basis for approximating ψ, in the sense described below.

Careful consideration of conditions under which training is successful in determining an accurate approximation ψ̂ to ψ is beyond the scope of our theory. However, we argue that success is plausible if the training time series is sufficiently long that the trajectory {s(t)} well samples its attractor A, if the dimension of the reservoir state r(t) is sufficiently high, and if the dynamics of the coordinates of r(t) are sufficiently heterogeneous. If, for example, training uses linear regression of {u(t)}={h(s(t))} versus {r(t)}, then since r(t) ≈ ϕ(s(t)), the coordinates of the vector-valued function ϕ(s) can be thought of as “basis functions”;6 training seeks a linear combination ψ̂ of these basis functions that approximates h(s) on A. A suitable basis for training (using a linear or nonlinear combination) is plausible if the listening reservoir yields a sufficiently large variety of responses to its input.

After training determines the post-processing function ψ̂, prediction proceeds by initializing r̂(t0)=r(t0) and evolving r̂(t) for t ≥ t0 according to the predicting reservoir (4). The reservoir state r(t0) is determined by evolving the listening reservoir (2) for an interval of time preceding t0; this could be the time interval used for training, or it could be a later time interval that uses inputs {u(t)} measured after training (we call this feature “training reusability”29). We assume that the listening time preceding t0 is sufficiently long to achieve generalized synchronization, so that r̂(t0) = r(t0) ≈ ϕ(s(t0)) is near ϕ(A). For t > t0, the predicted value of u(t) is

û(t)=ψ̂(r̂(t)).
(6)

Figure 2 depicts the dynamics of the predicting reservoir (4).

FIG. 2. Predicting reservoir dynamics, with the listening reservoir input u(t) replaced by the estimate û(t) determined from the predicting reservoir state r̂(t).

Consider now the idealized scenario that our approximations are instead exact relations ψ̂=ψ on ϕ(A), and r̂(t0)=r(t0)=ϕ(s(t0)). Suppose hypothetically that the measurements {u(t)} for t ≥ t0 (these are the values we want to predict in practice) are available, so that we can evolve both the listening reservoir (2) depicted in Fig. 1 and the predicting reservoir (4) depicted in Fig. 2, and compare their outputs. Then we claim that the two reservoirs agree exactly: r̂(t)=r(t) and û(t)=u(t) for all t ≥ t0. First notice that û(t0)=ψ̂(r̂(t0))=ψ(ϕ(s(t0)))=h(s(t0))=u(t0). Then r̂(t0+τ)=f[r̂(t0),û(t0)]=f[r(t0),u(t0)]=r(t0+τ), and r(t0+τ)=ϕ(s(t0+τ)) due to generalized synchronization. Similarly, û(t0+τ) then equals u(t0+τ), so r̂(t0+2τ)=r(t0+2τ)=ϕ(s(t0+2τ)), etc. This agreement between the trajectories also shows that ϕ(A) is an invariant set for the idealized predicting reservoir

r(t+τ)=f[r(t),ψ(r(t))],
(7)

and that its dynamics, observed through ψ, are equivalent to the dynamics of A observed through h.

Thus, if the time series {u(t)} of measurements has enough information to reconstruct the attractor A, then we can regard ϕ(A) and the idealized predicting reservoir (7) as an exact reconstruction of A and its dynamics. When the approximation ψ̂ ≈ ψ is not exact on ϕ(A), the actual predicting reservoir (4) is still initialized near ϕ(A), but ϕ(A) is only approximately invariant. The better the approximation, the more accurate the predictions û(t) ≈ u(t) will be, at least in the short term. However, if the system (5) that generates the measurements {u(t)} is chaotic, the prediction error û(t) − u(t) will typically grow exponentially as t increases.

Nonetheless, it remains possible that û(t) will maintain a climate similar to u(t) in the long term. This will happen if (and practically speaking, only if) the predicting reservoir trajectory {r̂(t)} remains close to ϕ(A) for all time, and its attractor has a similar climate to that of the idealized predicting reservoir on ϕ(A). In this sense, climate replication (attractor reconstruction) relies on both state-space stability and structural stability of the predicting reservoir near the idealized reconstructed attractor ϕ(A).

Structural stability is difficult to ensure rigorously, but in practice small perturbations of the dynamics near an attractor tend to yield small perturbations to the climate. Thus, we argue that climate replication is likely if ϕ(A), which according to our assumptions is invariant for the idealized predicting reservoir, is also attracting, in the sense described below.

Recall that generalized synchronization implies that the set ϕ(A) is attracting for the listening reservoir (2), when driven by u(t)=h(s(t)) where s(t) evolves on A. Whether ϕ(A) is attracting for the predicting reservoir is complicated by the fact that it is invariant only in the idealized case ψ̂=ψ and that ψ is defined only on ϕ(A), so that the idealized predicting reservoir (7) is also defined only on ϕ(A). For its stability to be well-defined, the domain of ψ must be extended to a neighborhood of ϕ(A), and whether ϕ(A) is attracting depends on how the extension is chosen.

Thus, the suitability of the empirically determined function ψ̂ for climate prediction depends not only on how well it approximates ψ on ϕ(A) but also on how it behaves near ϕ(A). For a particular ψ̂, we consider hypothetically a particular extension of ψ such that ψ̂ ≈ ψ near ϕ(A). This extension gives the idealized predicting reservoir a full set of Lyapunov exponents on ϕ(A), some of which correspond to infinitesimal perturbations tangent to ϕ(A) and some of which correspond to infinitesimal perturbations transverse to ϕ(A). Then ϕ(A) is attracting if the transverse Lyapunov exponents are all negative, and is unstable if there is a positive transverse Lyapunov exponent.

If the generalized synchronization function ϕ is one-to-one and differentiable, then the tangential Lyapunov exponents of the system (5) on A are reproduced as the tangential Lyapunov exponents of the idealized predicting reservoir on ϕ(A). Generalized synchronization does not always yield a differentiable ϕ,26,30 but even when differentiability cannot be guaranteed, it is possible in practice to reproduce much of the Lyapunov spectrum of A, including negative Lyapunov exponents in some cases, with a predicting reservoir.8 

We remark that unlike the conditional Lyapunov exponents for a drive-response system (such as the listening reservoir), which correspond to perturbations of the response system state, for the predicting reservoir it is not clear in advance which perturbations correspond to transverse Lyapunov exponents. However, in a numerical experiment where the equations for the driving system (5) and the reservoir are known, the existence or absence of a positive transverse Lyapunov exponent can be inferred by computing all of the positive Lyapunov exponents of the predicting reservoir and eliminating those that are Lyapunov exponents of A.

We now describe how to estimate the Lyapunov exponents of the idealized predicting reservoir (7) on ϕ(A), for a particular extension of ψ to a neighborhood of ϕ(A), from its empirical approximation ψ̂. To do so, we assume that we have a formula for f, so that we can compute its Jacobian matrix. (We emphasize that we estimate the Lyapunov exponents in order to corroborate the theory we have presented; their computation, and a formula for f, are not needed for the reservoir prediction method we have described.) If climate replication is successful, we can simply generate a long trajectory of the predicting reservoir (4) and use it to compute the Lyapunov exponents of the trajectory.8 However, this trajectory cannot be expected to remain close to ϕ(A) if the set is unstable. Nonetheless, if we have a sufficiently long time series {u(t)} of measurements, we can estimate the Lyapunov exponents of ϕ(A), whether or not it is stable, as follows.

First, we use the time series {u(t)} to generate a trajectory {r(t)} of the listening reservoir (2); as we have argued, r(t) will approach ϕ(A) under the conditions for generalized synchronization. Then along this trajectory, which is an approximate trajectory for the predicting reservoir, we compute Lyapunov exponents using the Jacobian matrix of the predicting reservoir (4).
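For the discrete-time formulation, and for the toy map f(r, u) = tanh(A r + Win u) with linear readout Wout used in the earlier sketches, this procedure reduces to a standard QR (Benettin-style) iteration with the Jacobian of the predicting reservoir evaluated along the listening trajectory. The sketch below is an assumed illustration of that bookkeeping, not the continuous-time computation of Sec. III.

```python
import numpy as np

def lyapunov_spectrum(A, W_in, W_out, listening_states, tau, k=3):
    """Estimate the k largest Lyapunov exponents of the predicting reservoir
    r -> tanh((A + W_in W_out) r) along a listening trajectory {r(t)}."""
    N = A.shape[0]
    M = A + W_in @ W_out                         # effective matrix of Eq. (4)
    Q = np.linalg.qr(np.random.default_rng(2).standard_normal((N, k)))[0]
    sums = np.zeros(k)
    for r in listening_states:                   # r(t) approximates phi(s(t))
        pre = M @ r                              # argument of tanh at this state
        J = (1.0 - np.tanh(pre) ** 2)[:, None] * M   # Jacobian of the predicting map
        Q, R = np.linalg.qr(J @ Q)
        sums += np.log(np.abs(np.diag(R)))
    return sums / (len(listening_states) * tau)  # exponents per unit time
```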

In this section, we give examples of short-term state and long-term climate predictions for the Lorenz system,31 with standard parameter values that yield chaotic trajectories

dx/dt = 10(y − x),   dy/dt = x(28 − z) − y,   dz/dt = xy − 8z/3.
(8)

We consider the case where the measurement function h is the identity, so that u(t) = s(t) = [x(t), y(t), z(t)]^T. For the reservoir, we use an artificial neural network similar to the one used by Jaeger and Haas;5 our listening reservoir [a continuous-time version of Eq. (2)] evolves according to

dr(t)/dt = γ[−r(t) + tanh(Mr(t) + σWin u(t))],
(9)

where r is an N-dimensional vector, γ is a scalar, and M is an adjacency matrix representing internal network connections. The matrix σWin consists of “input weights”; in our numerical results, we will fix Win and vary the scalar input strength σ. The vector function tanh is computed by applying the scalar hyperbolic tangent to each coordinate of its input vector. We compute trajectories of both the Lorenz and reservoir systems using the 4th order Runge-Kutta method with time step τ = 0.001. We will show cases where climate replication (attractor reconstruction) succeeds and where it fails, and compare the results with Lyapunov exponents we compute for the predicting reservoir.
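For concreteness, the following sketch co-integrates the Lorenz system (8) and the listening reservoir (9) with 4th-order Runge-Kutta. The reservoir size, random seed, and short run length are illustrative assumptions (the results below use N = 2000 and much longer runs), and the input u is held fixed over each integration step for simplicity.

```python
import numpy as np

def lorenz(s):
    x, y, z = s
    return np.array([10.0 * (y - x), x * (28.0 - z) - y, x * y - 8.0 * z / 3.0])

def rk4(deriv, state, dt):
    k1 = deriv(state)
    k2 = deriv(state + 0.5 * dt * k1)
    k3 = deriv(state + 0.5 * dt * k2)
    k4 = deriv(state + dt * k3)
    return state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

rng = np.random.default_rng(3)
N, gamma, sigma, dt = 500, 10.0, 0.014, 0.001                  # N reduced for illustration
M = rng.uniform(-1, 1, (N, N)) * (rng.random((N, N)) < 0.02)   # sparse random adjacency matrix
M *= 0.9 / np.max(np.abs(np.linalg.eigvals(M)))                # spectral radius 0.9
W_in = np.zeros((N, 3))                                        # one nonzero element per row
W_in[np.arange(N), rng.integers(0, 3, N)] = rng.uniform(-1, 1, N)

def reservoir_deriv(r, u):
    # Eq. (9): dr/dt = gamma * [-r + tanh(M r + sigma W_in u)]
    return gamma * (-r + np.tanh(M @ r + sigma * (W_in @ u)))

s, r = np.array([1.0, 1.0, 1.0]), np.zeros(N)
for _ in range(5000):                      # short illustrative listening run
    u = s                                  # h is the identity here, so u(t) = s(t)
    r = rk4(lambda rr: reservoir_deriv(rr, u), r, dt)
    s = rk4(lorenz, s, dt)
```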

We consider post-processing functions of the form ψ̂(r)=Woutq(r), where q(r) is the 2N-dimensional vector consisting of the N coordinates of r followed by their squares, and the “output weight” matrix Wout is determined by a linear regression procedure described below. The listening reservoir (9) and the post-processing function are illustrated as an input-output system in Fig. 3. The goal of training is that the post-processed output Woutq(r(t+τ)) based on input up to time t estimates the subsequent input u(t + τ). Once Wout is determined, the external input can be replaced in a feedback loop by the post-processed output to form the predicting reservoir, as depicted in Fig. 4. The predicting reservoir evolves according to

dr̂(t)/dt = γ[−r̂(t) + tanh(Mr̂(t) + σWinWout q(r̂(t)))],
(10)

and the predicted value of u(t) is û(t)=ψ̂(r̂(t))=Woutq(r̂(t)).
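Continuing the integration sketch above, the feedback loop of Eq. (10) only requires the quadratic feature vector q and the fitted output matrix Wout (assumed here to have been computed already, e.g., as in the regression sketch following Eq. (11) below).

```python
import numpy as np

def q(r):
    """Feature vector used by the post-processing function: the N coordinates
    of r followed by their squares, so q(r) has dimension 2N."""
    return np.concatenate([r, r ** 2])

def predicting_deriv(r_hat, M, W_in, W_out, gamma, sigma):
    # Eq. (10): the external input u is replaced by W_out q(r_hat)
    u_hat = W_out @ q(r_hat)
    return gamma * (-r_hat + np.tanh(M @ r_hat + sigma * (W_in @ u_hat)))

# Usage with the rk4 helper above, starting from the final listening state r:
# for _ in range(n_steps):
#     r = rk4(lambda rr: predicting_deriv(rr, M, W_in, W_out, gamma, sigma), r, dt)
#     u_hat = W_out @ q(r)     # predicted measurement, Eq. (6)
```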

FIG. 3. Listening reservoir based on an artificial neural network with N neurons. The input vector u(t) ∈ ℝ³ is mapped to the N-dimensional reservoir state space by the input weight matrix σWin, and the resulting reservoir state is mapped back to ℝ³ by the post-processing function ψ̂ = Wout q.

FIG. 4. The predicting reservoir replaces the external input of the listening reservoir with the post-processed reservoir output. The time increment τ in our discussion represents the amount of time for information to travel once around the feedback loop.

Details of our reservoir implementation are as follows. The reservoir dimension is N = 2000, and we use γ = 10. The N-by-N adjacency matrix M is chosen randomly with sparse Erdős–Rényi connectivity and spectral radius 0.9; specifically, each element is chosen independently to be nonzero with a probability of 0.02, nonzero elements are chosen uniformly between –1 and 1, and the resulting matrix is rescaled so that the magnitude of its largest eigenvalue is 0.9. The N-by-3 matrix Win is chosen randomly so that each row has one non-zero element, chosen uniformly between –1 and 1. We evolve the Lorenz system and the listening reservoir (9) from time t = –100 to t = 60, and we discard 100 time units of transient evolution, so that training is based on u(t) and r(t) for 0 ≤ t ≤ 60. For training, we constrain the 3-by-2N matrix Wout to have only 3N nonzero elements, namely, the first N elements of its first two rows, and the first N/2 and last N/2 elements of its third row. (Thus, we fit the x and y coordinates of the Lorenz state with linear functions of r, and the z coordinate with a linear combination of the first N/2 coordinates of r and the squares of the second N/2 coordinates; for the Lorenz system, this is advantageous over using a purely linear function of r.8) Subject to this constraint, we select Wout so as to minimize the error function

∑_{k=1}^{3000} ‖Wout q(r(0.02k)) − u(0.02k)‖² + β ‖Wout‖²;
(11)

here we have coarsely sampled the training data every 0.02 time units in order to reduce the amount of computation required by the regression. The second term in the error function modifies ordinary linear least-squares regression in order to discourage overfitting; this modification is often called ridge regression or Tikhonov regularization. Below, we will show results with regularization parameter β = 10⁻⁶ and with β = 0 (no regularization). We begin prediction by initializing r̂(T)=r(T) and evolving the predicting reservoir (10), where T = 60 is the end of the listening and training periods.
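A plausible implementation of this constrained ridge regression (not necessarily our exact code) fits each row of Wout separately on its allowed subset of the features q(r) = [r; r²]:

```python
import numpy as np

def fit_w_out(R_samples, U_samples, beta=1e-6):
    """Fit the 3-by-2N matrix W_out minimizing Eq. (11), subject to the
    sparsity constraint described in the text.
    R_samples: (K, N) reservoir states r(0.02 k) sampled from the listening run
    U_samples: (K, 3) corresponding Lorenz states u(0.02 k)"""
    K, N = R_samples.shape
    Q = np.hstack([R_samples, R_samples ** 2])       # features q(r), shape (K, 2N)
    allowed = [np.arange(N),                         # x: linear in r
               np.arange(N),                         # y: linear in r
               np.concatenate([np.arange(N // 2),    # z: first N/2 coordinates of r
                               np.arange(N + N // 2, 2 * N)])]  # ...and last N/2 squares
    W_out = np.zeros((3, 2 * N))
    for i, idx in enumerate(allowed):
        X = Q[:, idx]
        w = np.linalg.solve(X.T @ X + beta * np.eye(len(idx)), X.T @ U_samples[:, i])
        W_out[i, idx] = w                            # ridge solution for this row
    return W_out
```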

In Fig. 5, we show the actual z(t) from a trajectory of the Lorenz system, and predictions ẑ(t) from two reservoirs that are identical except for their input strength parameter values [σ = 0.012 for Fig. 5(a) and σ = 0.014 for Fig. 5(b)]. Each reservoir is trained with the same Lorenz trajectory and with regularization parameter β = 10⁻⁶. Both reservoirs predict the short-term future similarly well, but for larger values of the prediction time t − T, only the second prediction continues with a Lorenz-like climate. We compare the two climate predictions over a longer time period in Fig. 6, which shows Poincaré return maps of successive z(t) maxima. In Fig. 6(a), the red dots (showing the reservoir prediction) initially are near the blue dots (representing the Lorenz attractor), but eventually the red dots approach a period two orbit, indicated by the arrows. The large distance of the upper left arrow from the blue dots indicates that this period two orbit for the reservoir is not on the Lorenz attractor. In contrast, the red dots in Fig. 6(b) remain near the blue dots at all times, indicating that the reservoir replicates the climate in the long term.

FIG. 5. Predicted (red) and actual (blue) z(t) for a chaotic Lorenz system trajectory, using the same randomly generated reservoir with different input strengths σ = 0.012 [panel (a)] and σ = 0.014 [panel (b)]. Both predictions remain well correlated with the actual trajectory for roughly 10 time units. After decorrelation, the first prediction approaches a periodic orbit, whereas the second prediction appears to continue with a climate similar to that of the actual trajectory.
FIG. 6. Poincaré return map of successive local maxima of z(t) for the actual (blue) and predicted (red) trajectories for t − T from 0 to 300, using the same Lorenz trajectory and reservoir as Fig. 5, again with σ = 0.012 [panel (a)] and σ = 0.014 [panel (b)]. Here z_max^(n) represents the nth local maximum of z(t). The first prediction approaches a period two orbit (indicated by the arrows) that is not on the Lorenz attractor, whereas the second prediction remains close to the Lorenz attractor.

Based on the arguments in Sec. II A, we hypothesize that for both σ = 0.012 and σ = 0.014, the listening reservoir (9) evolves toward a set ϕσ(A), where A is the Lorenz attractor and ϕσ is a generalized synchronization function. Our choice of spectral radius 0.9 for the adjacency matrix M is consistent with common practice in reservoir computing,32 although it does not guarantee uniform contraction for the listening reservoir.23 However, it does guarantee that the eigenvalues of the Jacobian matrix of the right side of (9), evaluated at r = u = 0, have real parts at most γ(–1 + 0.9) = 10(–0.1) = –1. This suggests an asymptotic contraction rate of –1 or faster for the listening reservoir, and that after discarding 100 transient time units, r(t) is extremely close to ϕσ(A) for t ≥ 0.
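The eigenvalue bound quoted above is also easy to verify numerically, under the same hypothetical matrix construction used in the earlier integration sketch: rescale a sparse random M to spectral radius 0.9 and check that the Jacobian γ(−I + M) of the right side of Eq. (9) at r = u = 0 has eigenvalues with real part at most −1.

```python
import numpy as np

rng = np.random.default_rng(4)
N, gamma = 500, 10.0
M = rng.uniform(-1, 1, (N, N)) * (rng.random((N, N)) < 0.02)
M *= 0.9 / np.max(np.abs(np.linalg.eigvals(M)))    # spectral radius 0.9

# Jacobian of the right side of Eq. (9) at r = u = 0; tanh'(0) = 1.
J = gamma * (-np.eye(N) + M)
print(np.max(np.linalg.eigvals(J).real))           # should be <= -1
```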

Based on the arguments in Sec. II C, we hypothesize that the set ϕσ(A) is approximately invariant for the predicting reservoir (10). Based on the results in Figs. 5 and 6, we hypothesize further that for σ = 0.014, there is an attracting invariant set for the predicting reservoir near ϕσ(A), but that between σ = 0.014 and σ = 0.012, there is a bifurcation that causes this invariant set either to become unstable or to be destroyed entirely. To corroborate this hypothesis, we compute the Lyapunov exponents of the predicting reservoir for an approximate trajectory on ϕσ(A), as described in Sec. II E.

Figure 7 shows the three largest Lyapunov exponents of the predicting reservoir (10) as the input strength σ varies from 0.004 to 0.02. We do not change the matrices M and Win, but for each value of σ, we perform a separate training (with β = 10⁻⁶ as before), resulting in a different output weight matrix Wout. The exponents colored red and blue approximate the positive and zero Lyapunov exponents of the Lorenz attractor A (the approximation is closest for σ ≥ 0.01). Reproduction of the positive exponent of A in the reservoir dynamics on ϕσ(A) is a necessary consequence of successful attractor reconstruction and does not indicate instability of ϕσ(A) to transverse perturbations. The exponent colored green estimates the largest of the transverse Lyapunov exponents described in Sec. II D. This exponent passes through zero, indicating a bifurcation, at σ ≈ 0.013.

FIG. 7. The three largest Lyapunov exponents of the predicting reservoir (10) on the invariant set ϕσ(A) for the listening reservoir (9), as a function of the input strength σ, for the same reservoir as Figs. 5 and 6. Two exponents that are approximately constant as a function of σ, which approximate the two largest Lyapunov exponents of the Lorenz attractor, are colored red and blue; the more variable exponent, which we call the transverse Lyapunov exponent and which determines climate stability, is colored green. For values of σ for which we detect divergence from the Lorenz climate, we graph with a black dot the observed divergence rate λ*, computed as described in the text.

Next, we compare the change in stability indicated by the computed transverse Lyapunov exponent to a more direct computation indicating success or failure of climate replication. To detect when the prediction û(t)=ψ̂(r̂(t)) of the Lorenz state diverges from the true Lorenz attractor, we let Δ(t) be the Euclidean distance between the vector field dû/dt implied by the predicting reservoir and the vector field (right-hand side) of the Lorenz system (8), evaluated at [x, y, z]^T = û(t). [We calculate the reservoir-implied vector field by the chain rule dû/dt = Dψ̂(r̂(t)) dr̂/dt, where Dψ̂ is the Jacobian matrix of ψ̂ = Wout q, and dr̂/dt is given by Eq. (10).] For each value of σ depicted in Fig. 7, we calculate the vector field discrepancy Δ(t) for the prediction time period t ≥ T. If Δ(t) does not exceed a threshold value 20 for a duration of 800 time units, we consider the climate to be approximately reproduced. (Our threshold value 20 is small compared to the typical magnitude of the Lorenz vector field.) Otherwise, we say that the prediction has “escaped” from the Lorenz attractor. In Fig. 7, we show a black dot at each value of σ for which we detect escape; these values are the same as those for which the computed transverse Lyapunov exponent is positive. The height of each black dot represents an observed divergence rate λ*, computed as follows.

When we detect escape for a particular value of σ, we reinitialize the predicting reservoir (10) using r̂(t0)=r(t0) for 1000 different values of t0 ≥ T, where the values of r(t0) are determined by continuing to run the listening reservoir (9) for t ≥ T. For each t0, we evolve the predicting reservoir until the first time t1 for which Δ(t1) ≥ 20, or until t1 − t0 = 800, whichever comes first. If divergence from the attractor is governed by Lyapunov exponent λ, we should have Δ(t1) ≈ Δ(t0) exp[λ(t1 − t0)] in a certain average sense. We compute the observed exponential divergence rate λ* = ⟨ln[Δ(t1)/Δ(t0)]/(t1 − t0)⟩, where the angle brackets represent an average over the 1000 values of t0. The computed values of λ* are shown as black dots in Fig. 7. The approximate agreement of λ* with the green curve (especially for 0.01 ≤ σ ≤ 0.013) demonstrates that the computed transverse Lyapunov exponent reflects divergence of predictions from the Lorenz attractor.
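In outline, the quantities entering this estimate can be computed as below; the sketch assumes the matrices and the feature vector q from the earlier sketches, and it omits the loop that evolves the predicting reservoir from each r(t0) and records the escape data.

```python
import numpy as np

def vector_field_discrepancy(r_hat, M, W_in, W_out, gamma, sigma):
    """Delta(t): distance between the vector field du_hat/dt implied by the
    predicting reservoir (10) and the Lorenz vector field (8), evaluated at
    u_hat = W_out q(r_hat)."""
    N = len(r_hat)
    u_hat = W_out @ np.concatenate([r_hat, r_hat ** 2])                     # Eq. (6)
    dr_dt = gamma * (-r_hat + np.tanh(M @ r_hat + sigma * (W_in @ u_hat)))  # Eq. (10)
    D_psi = W_out[:, :N] + W_out[:, N:] * (2.0 * r_hat)   # Jacobian of psi_hat = W_out q
    du_dt = D_psi @ dr_dt                                 # chain rule for du_hat/dt
    lorenz = np.array([10.0 * (u_hat[1] - u_hat[0]),
                       u_hat[0] * (28.0 - u_hat[2]) - u_hat[1],
                       u_hat[0] * u_hat[1] - 8.0 * u_hat[2] / 3.0])
    return np.linalg.norm(du_dt - lorenz)

def divergence_rate(restarts):
    """lambda*: average of ln[Delta(t1)/Delta(t0)] / (t1 - t0) over restarts,
    where each restart supplies (Delta(t0), Delta(t1), t1 - t0) obtained by
    evolving the predicting reservoir until Delta >= 20 or 800 time units pass."""
    return np.mean([np.log(d1 / d0) / dt for d0, d1, dt in restarts])
```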

To illustrate the correspondence between the computed transverse Lyapunov exponent and the observed divergence rates in a case where their dependence on σ is more complicated, we show in Fig. 8 the analogue of Fig. 7 in a case where no regularization (β = 0) is used in the training. Again, we see precise correspondence between detected failure of climate replication (presence of a black dot) and positive values of the transverse Lyapunov exponent (green curve), and good agreement with the observed divergence rates for these values of σ. In this case, there are two bifurcations, one near σ = 0.12 and one near σ = 0.16.

FIG. 8. The three largest Lyapunov exponents of the predicting reservoir (10), and the estimated divergence rate λ*, as a function of σ, using the same color scheme as Fig. 7. Here we use a different randomly generated reservoir than in Fig. 7, and no regularization (β = 0) in the training.

We remark that when we use regularization (β = 10⁻⁶) in the training, we do not observe as complicated a dependence of the computed transverse Lyapunov exponent on the input strength σ as in Fig. 8. Instead, the computed transverse Lyapunov exponent is typically negative and slowly varying across a wide range of σ values, for which climate replication is successful. In Fig. 9, we use the transverse Lyapunov exponent computation, averaged over 10 different randomly generated reservoirs, to give a quantitative illustration of the advantage of regularization. When regularization is used, the negative means and small standard deviations of the computed transverse Lyapunov exponent indicate robust climate stability over the entire range 0.05 ≤ σ ≤ 0.5. (By contrast, Figs. 5–7 depicted values of σ ≤ 0.02.) With no regularization, the means are larger and more variable, indicating less stability and greater sensitivity to the value of σ, and the standard deviations are significantly larger, indicating lack of robustness from one random reservoir realization to another.

FIG. 9. The means and standard deviations of the three largest Lyapunov exponents for the same 10 randomly generated reservoirs trained with regularization parameter β = 10⁻⁶ [panel (a)] and with β = 0 [panel (b)]. Again, the red and blue curves approximate the two largest exponents of the Lorenz attractor, and the green curve is the computed transverse Lyapunov exponent.

We presented in Sec. II a partial explanation for how reservoir computing prediction is able to reconstruct the attractor (replicate the climate) for a chaotic process from limited time series data. We argued that the reservoir dynamics (2) can be designed so that during the listening period on which training is based, the reservoir state r(t) is approximately a continuous function ϕ of the state s(t) of the chaotic process. This property, called generalized synchronization, is closely related to the echo state property for reservoir computing. We showed that both properties hold if the listening reservoir (2) is uniformly contracting as a function of the reservoir state; other criteria for these properties have also been identified.23,26,32

Ideally, the synchronization function ϕ should be one-to-one in order to recover the process dynamics from the reservoir dynamics. Investigation of conditions that can guarantee ϕ to be one-to-one could help guide reservoir design. However, even in the absence of a guarantee, we noted that embedding results suggest that ϕ is likely to be one-to-one if the reservoir state space is sufficiently high-dimensional compared with the dimensionality of the chaotic process.

Practically speaking, a necessary condition for climate replication is that training should be successful in approximately recovering the measured state u(t)=h(s(t)) from the reservoir state r(t); this depends on the amount of training data available and the method of regression used, among other things. We did not address the theoretical aspects of training, but we argued that success is plausible if the reservoir is sufficiently high-dimensional and heterogeneous to yield a large variety of basis functions for the regression.

We showed that in the limit that the approximations we described are exact, the predicting reservoir (4) exactly predicts future values of u(t). Thus, accurate approximations yield commensurately accurate short-term forecasts. Long-term climate replication depends on stability of the predicting reservoir dynamics with respect to perturbations produced by the approximations. We discussed how to estimate Lyapunov exponents for the predicting reservoir in numerical experiments, whether or not the desired climate is stable. We emphasize that our computation of Lyapunov exponents was intended to illustrate our theory and that the method we described requires measurements {u(t)} over a long time period to maintain the desired climate. If one's goal is to estimate the Lyapunov exponents of the process that produced {u(t)} from a limited amount of data, one should seek parameters of the predicting reservoir that replicate the climate and simply compute the Lyapunov exponents of the resulting trajectory.8 

In Sec. III, we gave examples of climate replication successes and failures, and showed how they correspond to the Lyapunov exponents we computed. We emphasize that the results and the ranges of σ we displayed were selected to illustrate and analyze failures that can occur with inadequate input strength (Figs. 5–7) or without regularization (Fig. 8) in the training. With regularization, we are able to obtain robust climate replication [indicated by Fig. 9(a)] over a wide range of input strengths.

We remark that for simplicity, our theory considered discrete-time reservoir dynamics. Discrete time is the appropriate way to model software reservoirs, but physical reservoirs typically are better modeled by continuous time. With appropriate modifications, our theory applies to the continuous-time case. The prediction time increment τ used in the training should be the amount of time information takes to traverse the feedback loop depicted in Fig. 4. However, with a physical reservoir, careful calibration of the sampled training data may be necessary to meet the goal of predicting u(t + τ) based on the listening reservoir's response to input up to time t, in part because τ is a property of the predicting reservoir and not of the listening reservoir.

Finally, we argue that in addition to reservoir computing, the theory we presented in Sec. II applies to some other machine learning methods for time series prediction. The essential features a prediction method needs for our theory to apply are: (1) that the method maintains an internal state, or “memory,” that depends on the sequence of inputs it receives during training; (2) that it is trained to predict a short time increment ahead, after receiving the input time series for a relatively long time interval; and (3) that it is used to predict farther into the future by iterating its incremental forecasts through a feedback loop. These features are present, for example, in prediction using the FORCE method for training reservoirs33 and in recent work using long short-term memory (LSTM) networks for prediction.18 For methods that (unlike reservoir computing) train parameters that affect the internal state in the absence of feedback, our theory applies if we take the function f in Eq. (2) to represent the update rule for the internal state r after training has selected parameter values. Although our description of how training arrives at the pair of functions (f,ψ̂) was specific to reservoir computing, our discussion of how these functions can be used with Eqs. (4) and (6) for prediction and attractor reconstruction is independent of which machine-learning method is used to determine the functions.

We gratefully acknowledge the support of grants from ARO (W911NF-12-1-0101) and DARPA. We thank Michelle Girvan, Jaideep Pathak, Sarthak Chandra, Daniel Gauthier, and Daniel Canaday for their input.

1. H. Jaeger, GMD Technical Report No. 148 (German National Research Center for Information Technology, 2001).
2. W. Maass, T. Natschläger, and H. Markram, Neural Comput. 14, 2531 (2002).
3. H. Jaeger, Scholarpedia 2, 2330 (2007).
4. M. Lukoševičius and H. Jaeger, Comput. Sci. Rev. 3, 127 (2009).
5. H. Jaeger and H. Haas, Science 304, 78 (2004).
6. U. Parlitz and A. Hornstein, Chaos Complexity Lett. 1, 135 (2005).
7. F. Wyffels and B. Schrauwen, Neurocomputing 73, 1958 (2010).
8. J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott, Chaos 27, 121102 (2017).
9. Z. Lu, J. Pathak, B. Hunt, M. Girvan, R. Brockett, and E. Ott, Chaos 27, 041102 (2017).
10. R. S. Zimmermann and U. Parlitz, Chaos 28, 043118 (2018).
11. C. Fernando and S. Sojakka, in European Conference on Artificial Life (Springer, 2003), pp. 588–597.
12. L. Appeltant, M. C. Soriano, G. Van der Sande, J. Danckaert, S. Massar, J. Dambre, B. Schrauwen, C. R. Mirasso, and I. Fischer, Nat. Commun. 2, 468 (2011).
13. N. D. Haynes, M. C. Soriano, D. P. Rosin, I. Fischer, and D. J. Gauthier, Phys. Rev. E 91, 020801 (2015).
14. D. Brunner, S. Reitzenstein, and I. Fischer, in IEEE International Conference on Rebooting Computing (ICRC) (IEEE, 2016), pp. 1–2.
15. L. Larger, A. Baylón-Fuentes, R. Martinenghi, V. S. Udaltsov, Y. K. Chembo, and M. Jacquot, Phys. Rev. X 7, 011015 (2017).
16. Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
17. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
18. P. R. Vlachas, W. Byeon, Z. Y. Wan, T. P. Sapsis, and P. Koumoutsakos, Proc. R. Soc. A 474, 20170844 (2018).
19. V. Afraimovich, N. Verichev, and M. I. Rabinovich, Radiophys. Quantum Electron. 29, 795 (1986).
20. L. M. Pecora and T. L. Carroll, Phys. Rev. Lett. 64, 821 (1990).
21. N. F. Rulkov, M. M. Sushchik, L. S. Tsimring, and H. D. Abarbanel, Phys. Rev. E 51, 980 (1995).
22. L. Kocarev and U. Parlitz, Phys. Rev. Lett. 76, 1816 (1996).
23. I. B. Yildiz, H. Jaeger, and S. J. Kiebel, Neural Networks 35, 1 (2012).
24. M. Hirsch and C. Pugh, “Global analysis,” in Proceedings of Symposia in Pure Mathematics (American Mathematical Society, 1970), Vol. 14, pp. 133–164.
25. M. Hirsch and C. Pugh, Invariant Manifolds, Lecture Notes in Mathematics No. 583 (Springer-Verlag, 1977).
26. J. Stark, Physica D 109, 163 (1997).
27. H. Whitney, Ann. Math. 37, 645 (1936).
28. T. Sauer, J. A. Yorke, and M. Casdagli, J. Stat. Phys. 65, 579 (1991).
29. J. Pathak, A. Wikner, R. Fussell, S. Chandra, B. R. Hunt, M. Girvan, and E. Ott, Chaos 28, 041101 (2018).
30. B. R. Hunt, E. Ott, and J. A. Yorke, Phys. Rev. E 55, 4029 (1997).
31.
32. K. Caluwaerts, F. Wyffels, S. Dieleman, and B. Schrauwen, in 2013 International Joint Conference on Neural Networks (IEEE, 2014), pp. 1–6.
33. D. Sussillo and L. F. Abbott, Neuron 63, 544 (2009).