We develop and test machine learning techniques for successfully using past state time series data and knowledge of a time-dependent system parameter to predict the evolution of the “climate” associated with the long-term behavior of a *non-stationary* dynamical system, where the non-stationary dynamical system is itself unknown. By the term *climate*, we mean the statistical properties of orbits rather than their precise trajectories in time. By the term *non-stationary*, we refer to systems that are, themselves, varying with time. We show that our methods perform well on test systems predicting both continuous gradual climate evolution and relatively sudden climate changes (which we refer to as “regime transitions”). We consider not only noiseless (i.e., deterministic) non-stationary dynamical systems, but also climate prediction for non-stationary dynamical systems subject to stochastic forcing (i.e., dynamical noise), and we develop a method for handling this latter case. The main conclusion of this paper is that machine learning has great promise as a new and highly effective approach to accomplishing data-driven prediction of non-stationary systems.

Prediction of the behavior of dynamical systems that are themselves changing with time or are influenced by independently time evolving external environmental conditions is an important task that appears in many scientific, technological, medical, sociological, and economic settings. Often, knowledge of the scientific principles, underlying equations, and conditions enabling a basis for the construction of models capable of predicting the behavior of such non-stationary systems is lacking or very incomplete. However, one often has available past measurements of the system state as a function of time. Can this kind of data be used to enable prediction of future dynamical system behavior? In this paper, we address this question via an approach utilizing machine learning and find that it can be extremely effective for this purpose.

## I. INTRODUCTION/BACKGROUND

Predicting the time evolution of a dynamical system is a central requirement for enabling many applications. Moreover, it is often the case that first principles knowledge of the dynamical system to be predicted is unavailable. Alternatively, one may have access to measured time series data of the previous states of the system. For situations where the system to be predicted is *stationary* in time, machine learning has proven to be effective in using available time series data to accomplish prediction tasks.^{1–25} In this stationary context, particularly for systems with chaotic behavior, it is useful to partition the problem of prediction into (i) short-term prediction of the state of the system and (ii) long-term prediction/replication of the statistical/ergodic dynamical behavior of the system. That is, even after short-term prediction fails (e.g., due to the sensitive dependence of chaos), we may still desire to replicate the statistical properties of the orbits of the real system. We have in mind the situation in which, following an initial transient, the orbit from an initial condition goes to an “attractor,” i.e., a subset of the state space on which the orbit continues to reside, while, perhaps, evolving chaotically. In such a stationary situation, a measurement of the state at a random instant of time can be thought of as corresponding to a sample taken from an underlying probability distribution, which is invariant under the flow of time. For such systems, the ergodic theorem allows us to characterize the long-term statistical properties of the orbits by equating the time average of a state function over a single typical orbit to the state space average of that function taken with respect to an invariant measure.^{26} If we succeed in predicting trajectories that have the same long-term statistical properties as those produced by trajectories of the actual system, then we say we have succeeded in replicating the “climate” of that system.
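As a concrete check of this time-average/measure-average equivalence (an illustration of ours, not from the original), consider the stationary logistic map at $\lambda = 4$, whose invariant density is known in closed form, $\rho(x) = 1/(\pi\sqrt{x(1-x)})$, with mean exactly $1/2$; a minimal Python sketch:

```python
import numpy as np

# Illustrative check of the ergodic theorem for the stationary logistic map
# x -> 4x(1 - x): its invariant density rho(x) = 1/(pi*sqrt(x(1 - x)))
# has mean exactly 1/2, so the time average of x over a single typical
# orbit should converge to 0.5.
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 0.9)
for _ in range(1000):                      # discard the initial transient
    x = 4.0 * x * (1.0 - x)

total, n_steps = 0.0, 200_000
for _ in range(n_steps):
    x = 4.0 * x * (1.0 - x)
    if x <= 0.0 or x >= 1.0:               # guard against the rare
        x = rng.uniform(0.1, 0.9)          # finite-precision collapse to 0
    total += x

time_average = total / n_steps
print(time_average)                        # close to the measure average, 0.5
```

The guard is a purely numerical precaution: in floating point, an iterate landing extremely close to $0.5$ maps to exactly $1$ and then to the fixed point $0$.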

In contrast to the situation described in the above paragraph, we are interested in prediction of *non-stationary* dynamical systems, i.e., systems that themselves vary with time. However, in the case of a non-stationary system, ergodicity no longer holds. Thus, for non-stationary systems, we cannot meaningfully define a representative statistical quantity in the above way,^{27} and we wish to extend our concept of “climate” to the non-stationary case. To do this, we consider an ensemble of trajectories originating from many randomly chosen initial conditions in the past. At any subsequent given time, after a sufficiently large convergence time has passed since initialization, the ensemble traces out a so-called “snapshot attractor.”^{27–29} This attractor possesses a probability distribution that (under appropriate general conditions) is independent of the set of initial conditions chosen for the ensemble and thus provides an appropriate measure representing the probability distribution of states on the snapshot attractor.^{27} [The collection of all snapshot attractors (at every time instant) is referred to, in the mathematical and terrestrial climate-related literature, as the pullback attractor of the system.^{30}] This time-dependent snapshot probability distribution can then be used to obtain expectation values of state-dependent functions, thereby defining the time-dependent climate of a non-stationary system.
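The snapshot-attractor construction can be sketched numerically (our illustration, using a drifting-parameter logistic map of the kind treated in Sec. III as a stand-in system): two ensembles launched from very different initial distributions produce nearly identical snapshot histograms once the convergence time has passed.

```python
import numpy as np

def snapshot(x0, n_steps, lam0=3.89, gamma=1e-5):
    """Evolve an ensemble under x_{n+1} = lam_n x_n (1 - x_n) with a slow
    parameter drift lam_{n+1} = lam_n + gamma; return the final ensemble."""
    x, lam = x0.copy(), lam0
    for _ in range(n_steps):
        x = lam * x * (1.0 - x)
        lam += gamma
    return x

rng = np.random.default_rng(1)
# two ensembles drawn from very different initial distributions
ens_a = snapshot(rng.uniform(0.01, 0.99, 20_000), 2_000)
ens_b = snapshot(rng.uniform(0.40, 0.60, 20_000), 2_000)

# their empirical snapshot measures (histograms) nearly coincide, i.e.,
# the snapshot attractor "forgets" the choice of initial ensemble
h_a, edges = np.histogram(ens_a, bins=50, range=(0, 1), density=True)
h_b, _ = np.histogram(ens_b, bins=50, range=(0, 1), density=True)
l1_distance = float(np.sum(np.abs(h_a - h_b)) * (edges[1] - edges[0]))
print(l1_distance)    # small: the two snapshot measures agree
```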

The goal of this paper will be to devise and test machine learning approaches for the prediction of the future evolution of the climate of non-stationary dynamical systems.

We note that, although machine learning has previously been extensively investigated and applied for the task of short-term state prediction of stationary dynamical systems (especially chaotic systems) (see Refs. ^{1–3} and ^{13–22}) and has also been shown to be effective for replicating the climate of stationary systems,^{4,5,23,24} machine learning has previously received little attention as a general approach to the important task of predicting climate evolution in non-stationary dynamical systems. For recent papers on using machine learning for short-term state prediction of non-stationary systems, see Refs. 19, 25, and 31. In the context of our stated goal, an important issue is that of “regime transitions.” Specifically, for different system parameters, stationary dynamical systems can generally exhibit a variety of different types of motions with different climates, and, correspondingly, slowly time varying non-stationary systems can experience *regime transitions* where, once the system crosses some critical condition, the system’s behavior rapidly becomes qualitatively and quantitatively different. There are a variety of mechanisms for regime transitions ranging from basic bifurcations of periodic orbits (e.g., Hopf, saddle-node, and period doubling bifurcations) to transitions of chaotic attractors (e.g., period doubling cascades,^{32,33} intermittency,^{34} and crises^{35,36}). For a very recent work on using machine learning to predict system collapse and chaotic transients associated with crisis transitions, see Ref. 37.

Thus, in addition to slow, temporally continuous climate change, regime transitions are important in many fields. For example, in the context of terrestrial climate, reference is typically made to “tipping points” representing abrupt transitions in the Earth’s climate.^{38,39} Geological records indicate that the Earth has experienced such transitions in the past; for example, Greenland ice-core records indicate a series of abrupt climatic events in the Last Glacial period,^{40} and there is growing concern that anthropogenic activities such as greenhouse gas emissions could drive the climate system to a regime transition,^{41,42} such as the possible shutdown of the thermohaline circulation in the Atlantic Ocean^{43,44} and the potential transition from a wet to a dry Indian summer monsoon.^{45} Such regional transitions could also affect the climate of geographically distant locations via teleconnection.^{46–48} Naturally, both temporally continuous behavior evolution as well as regime change events are of immense consequence not only for atmospheric climate, but also in many other dynamical contexts, e.g., economics,^{49,50} ecology,^{51–62} medicine,^{63,64} technology,^{65,66} and sociology.^{67}

The goal of predicting the climate of non-stationary systems from time series state data, in principle, requires some knowledge of the future evolution of the underlying dynamical system that produces the data. Such knowledge might be directly available or derivable from the available past state time series data. The need for this type of knowledge can be seen by considering the example where the underlying system time dependence experiences an abrupt change such that the system after the change is unrelated to the system before the change, thus making predictions of the state changes past the time of the abrupt system change untenable. With this in mind, we restrict our considerations to the case where the underlying system evolution is smooth and slow compared to the time scale ($\tau_s$) of the state evolution ($\tau_{ns} \gg \tau_s$, where $\tau_{ns}$ is the “non-stationarity time scale” over which the underlying system evolves). We emphasize that we are interested in climate prediction over the long time scale $\tau_{ns}$. In this paper, we consider the scenario where the non-stationarity of the system to be predicted is due to time-dependent system parameters where the time dependence is known, although the system itself is otherwise unknown. Discussion of the importance of this scenario in application is given in Sec. IV A. (For simplicity, in what follows, we will only consider the case where there is one relevant time-dependent system parameter.)

The main conclusion of this paper is that machine learning offers a surprisingly promising avenue for using measurements of past system state time series to accomplish the important general goal of predicting the long-term dynamical behavior (climate) of non-stationary dynamical systems, including prediction of apparently continuous change and of more sudden regime transitions.

In what follows, we use reservoir computing^{68–70} as our machine learning platform of choice for practical reasons. Reservoir computing is a computationally inexpensive framework since it requires training only its output layer via a simple linear regression. The resulting reduced training time allows for relatively rapid testing of various methodologies and parameter regimes. However, we expect other types of machine learning, e.g., deep learning,^{71} to also work well for this task via the methods we develop and test in this paper.

Another issue we address is that of predicting the time evolution of a non-stationary dynamical system evolving under the influence of dynamical noise, which is often unavoidable in real situations. We present a simple method that uses the residual error distribution from the training procedure to inject appropriately determined noise into the reservoir prediction scheme to perturb the reservoir output in a way that mimics the effects of the dynamical noise present in the real system, thus enabling prediction of the noise-influenced climate.

The rest of the paper is organized as follows. In Sec. II, we give a brief overview of reservoir computing and of a widely employed basic scheme for using it for prediction of stationary dynamical systems. In Sec. III, we discuss noiseless and noisy non-stationary versions of the logistic map and their behaviors. In Sec. IV A, using the noiseless non-stationary logistic map, we introduce and test our method. In Sec. IV B, we generalize the method of Sec. IV A to include dynamical noise. Sections V and VI test our methods on non-stationary versions of the Lorenz 63 system^{72} and the Kuramoto–Sivashinsky system,^{73,74} respectively. Section VII presents conclusions. Referring to Secs. III, V, and VI, we note that the example systems tested include a simple discrete-time map (Sec. III), a low-dimensional continuous-time system (three ordinary differential equations; Sec. V), and a partial differential equation potentially exhibiting spatiotemporal chaos (Sec. VI).

## II. REVIEW OF RESERVOIR COMPUTING (RC) AND ITS USE FOR PREDICTION OF STATIONARY SYSTEMS

Reservoir computing (e.g., see the review paper^{68}) is a machine learning approach that has proven to be effective in tasks involving dynamical time series data. Although there are various ways of implementing RC,^{75–80} in this paper, we consider the widely used implementation that utilizes a network of neuron-like nodes with recurrent connections, the “reservoir,” to process temporal/sequential data.^{69,70} The main idea is to drive the reservoir with an input signal and form a trainable linear combination of the nonlinear response of the individual network nodes to obtain an output sequence that closely approximates some desired sequence.

For illustrative and background purposes, in this section, we describe the application of reservoir computing to predicting the time evolution of an unknown *stationary* system. Given a set of measurements of the state of a system from some time in the past $t=\u2212T$ up to some time labeled $t=0$, we desire to predict the state of the system for $t>0$. Even after prediction of the state of the system fails, we may still desire that the predicted trajectory replicates the climate of the actual system in the long-time limit (Refs. 4 and 5).

We denote the state of the dynamical system (from which the measurements are obtained) at time $t$ by a $K$ dimensional vector $u(t) = [u_1(t), u_2(t), \ldots, u_K(t)]^T$. If our reservoir consists of $N$ nodes, then the $K$ dimensional input vector at each iteration is coupled into the reservoir via an $(N \times K)$ dimensional input-to-reservoir coupling matrix, $W_{in}$ (see Fig. 1). We chose to construct this matrix by randomly selecting one element of each row to be a nonzero number chosen randomly from a uniform distribution on the interval $[-\chi, \chi]$. Each reservoir node $i$ has a scalar state $r_i(t)$. The state of the reservoir at time $t$ is denoted by the $N$ dimensional vector $r(t) = [r_1(t), r_2(t), \ldots, r_N(t)]^T$. The reservoir evolves dynamically given the linearly coupled input and a network adjacency matrix $A$, according to
$$r(t + \Delta t) = \tanh\left[ A r(t) + W_{in} u(t) \right], \tag{1}$$

where the hyperbolic tangent function is applied to each element of its vector argument. The directed adjacency matrix is a randomly generated sparse matrix such that it has a predetermined average number of connections per node, $\langle d \rangle$. For our task, the desired output time sequence is of the same dimension as the input sequence; therefore, a $(K \times N)$ dimensional reservoir-to-output coupling matrix $W_{out}$ is used to linearly map the reservoir state vector $r(t)$ to the output, denoted $\tilde{u}$, at the next time step, $t + \Delta t$. That is, the output is the $K$-vector, $W_{out} r(t) = \tilde{u}(t + \Delta t)$. These steps are diagrammatically represented in Fig. 1(a). The matrix elements of $W_{out}$ are determined by the training so that the output of the reservoir computer closely approximates the desired output for $-T \le t \le 0$. To do this, we minimize a “cost function,”
$$\sum_{-T \le t \le 0} \left\Vert W_{out} r(t) - u(t + \Delta t) \right\Vert^2 + \alpha \left\Vert W_{out} \right\Vert^2,$$

where we include a Tikhonov regularization parameter $\alpha$ to prevent overfitting by penalizing large output weights. This minimization is a simple linear regression. The set of coefficients of $W_{out}$, which minimizes this $L_2$ norm, is taken to be the “trained” set of output weights for the network. The choice of $\alpha$ and other “hyperparameters” [such as the average degree $\langle d \rangle$ and the spectral radius $\rho_s$ of $A$ (i.e., the largest eigenvalue magnitude of $A$)] as well as the maximal strength of the input-to-reservoir coupling $\chi$ are empirically chosen to minimize the discrepancy between a target prediction and the reservoir prediction over a validation dataset (e.g., by performing a grid search over the space of these hyperparameters). Furthermore, we constrain our search by restricting our attention to a set of hyperparameters that satisfy the “echo state condition” (see, e.g., Ref. 69), which requires that, starting from any two different reservoir states $r_1$ and $r_2$, iterations of Eq. (1), after a transient phase, yield the same time series for the reservoir state $r(t)$; i.e., the reservoir eventually “forgets” its initial condition, meaning that its evolution depends only on its input $u(t)$ and is independent of its initial state. If $N$ is sufficiently large, it can be expected that the output $\tilde{u}(t + \Delta t)$ will be extremely close to the desired output $u(t + \Delta t)$; i.e., the reservoir will give an accurate one time step prediction.

Once we have trained the reservoir on previous measurements of the system states, using the “open-loop” configuration for $-T \le t \le 0$ shown in Fig. 1(a), we predict the system evolution for $t > 0$ by using the “closed-loop” configuration shown in Fig. 1(b). In this configuration, the output from the reservoir computer at one step is fed back as input at the next iteration. Thus, on the first iteration, the input is $u(t)$ and [neglecting the small difference between $u(t)$ and $\tilde{u}(t)$] the output is ideally $u(t + \Delta t)$, which, when fed back, yields an output $u(t + 2\Delta t)$, which, when fed back, yields an output $u(t + 3\Delta t)$, and so on. However, due to the small deviation between $u(t)$ and $\tilde{u}(t)$, errors build up with multiple circuits of the feedback loop, and eventually, the prediction fails. Nevertheless, it is observed that this scheme often produces predictions that are useful for a limited time. One notable point is that, because the reservoir network is “recurrent” (i.e., has directed links forming closed paths cycling between nodes), the reservoir system retains memory of its past states and inputs.

As mentioned in Sec. I, we are interested in both discrete-time dynamical systems (maps; Secs. III and IV), as well as continuous-time dynamical systems (flows; Secs. V and VI). In the case of maps, $t$ is an integer, where $t=n$ $(n=0,1,2,\u2026)$, and the reservoir time step is $\Delta t=1$. In the case of flows, we will choose $\Delta t$ to be small compared to the characteristic time for variation of $u(t)$ so that the predicted discrete outputs, separated by $\Delta t$, give a good representation of the continuous-time evolution.

It is important to note (e.g., Sec. VI) that the closed-loop configuration of Fig. 1(b) is itself a stationary dynamical system whose dynamics, through the training of $W_{out}$, has essentially been devised so as to mimic the dynamics of the unknown system producing the measurement data (see Ref. 4).

In Sec. IV, we modify this setup (applicable for the prediction of *stationary* dynamical evolution processes) for application to non-stationary, possibly stochastic, processes.

## III. THE NON-STATIONARY LOGISTIC MAP

In this section, we consider the one-dimensional logistic map that provides a simple platform for formulating, illustrating, and testing the application of RC to the task of predicting the time evolution of a non-stationary discrete-time dynamical system. Because of its simplicity, the logistic map allows us to rapidly test various methodologies and techniques. Furthermore, we will subsequently show (Secs. V and VI) that our results and methods, demonstrated for the simple example of the logistic map, carry over to other, less simple systems. Before we demonstrate the reservoir computer’s ability to predict the non-stationary climate of the logistic map, we briefly review three variations of this system: the stationary logistic map, a non-stationary logistic map, and a stochastic non-stationary logistic map. We demonstrate that the dynamics and regime transitions have different characteristics for each of these three variations of this system.

We will first briefly review the behavior of the stationary logistic map system,
$$x_{n+1} = \lambda x_n (1 - x_n), \tag{2}$$

for different but fixed values of $\lambda$. We randomly initiate a trajectory in the interval $(0,1)$, allow it to evolve under the action of the map for fixed $\lambda$, and consider only its long-term behavior (after any transients associated with the choice of initial condition have died away). Plotting the iterations of $x$ in this long-term limit for each value of $\lambda$, we generate a bifurcation diagram, as shown in Fig. 2(a). For $\lambda \in (1,3)$, a typical initial condition $x_0 \in (0,1)$ will evolve to a fixed point attractor, $x^* = 1 - 1/\lambda$. As $\lambda$ is increased through $\lambda = 3$, the stationary point $x^*$ bifurcates to a period $2$ orbit. As $\lambda$ is increased further, the attracting orbit undergoes a period doubling cascade^{32,33} terminating at $\lambda = \lambda_\infty \approx 3.569\ldots$. In the range $\lambda_\infty < \lambda < 4$, chaos occurs, accompanied by an infinite number of windows of different periodicity; e.g., see Sec. 2.2 in Ref. 81. For this section, in order to conveniently address a variety of regime transition types, we focus on the period $3$ window $\lambda \in [3.82, 3.87]$, as shown in Fig. 2(b). As $\lambda$ increases through a critical value $\lambda_{c1} \approx 3.828\ldots$, the period $3$ window is initiated from a chaotic situation below $\lambda_{c1}$ through an intermittency transition^{34} associated with a period $3$ tangent bifurcation. As $\lambda$ increases beyond $\lambda_{c2}$, this period $3$ orbit undergoes a period doubling cascade, leading to chaos cycling through three bands (with interspersed windows). This three-piece chaotic attractor widens for larger $\lambda$ until it collides with the unstable period $3$ orbit generated at the start of the window. This collision results in a crisis transition^{35,82} causing an abrupt broadening of the three-piece chaotic attractor into a single-band chaotic attractor, thereby terminating the period $3$ window (at $\lambda_{c3}$).
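These regimes are easy to verify numerically; a small sketch of ours (parameter values chosen for illustration):

```python
import numpy as np

def orbit(lam, x0=0.3, transient=2000, keep=12):
    """Iterate the stationary logistic map x -> lam*x*(1 - x) and return a
    few post-transient points."""
    x = x0
    for _ in range(transient):
        x = lam * x * (1.0 - x)
    out = []
    for _ in range(keep):
        x = lam * x * (1.0 - x)
        out.append(x)
    return np.array(out)

# fixed-point regime: for lam in (1, 3) the attractor is x* = 1 - 1/lam
pts = orbit(2.5)
print(pts[0])    # ~0.6 = 1 - 1/2.5

# period-3 window: for lam = 3.835 the attractor is a stable 3-cycle,
# so points three iterates apart coincide
pts3 = orbit(3.835)
print(np.max(np.abs(pts3[3:] - pts3[:-3])))    # ~0
```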

In the tests and examples in the rest of this paper, we will often consider situations with windows, like the period three window of Fig. 2(b). We emphasize, however, that we do this not necessarily because we want to treat windows, but rather because, as noted above for the case of Fig. 2(b), windows contain within them a variety of general types of common regime transitions (intermittency transitions,^{34} period doubling cascades, crises^{35,82}). Thus, windows provide a convenient platform for studying and testing the effectiveness of our methods when regime transitions are present.

Now, we modify the stationary system to create a non-stationary situation where the parameter $\lambda $ drifts linearly with time,
$$x_{n+1} = \lambda_n x_n (1 - x_n), \tag{3a}$$

$$\lambda_{n+1} = \lambda_n + \gamma. \tag{3b}$$

[Later in this paper, we will consider the case where the linear drift with time, Eq. (3b), is replaced by a nonlinear drift with time.] To discuss the dynamics of this system, we consider the situation where $\lambda_n$ varies very slowly ($|\gamma| \ll \lambda_n$). For $\gamma = 10^{-5}$, Fig. 3(a) shows a typical orbit evolved under the action of Eq. (3) over the interval of the period $3$ window. Notice that the period doubling bifurcation, from period $3$ to period $6$, looks different, more “abrupt” [see blow-up in Fig. 3(b)], in this non-stationary case than in the stationary case shown in Fig. 2(b). This can be understood as follows. Just before the bifurcation, the orbit is approximately moving on the period $3$ attractor. When $\lambda_n$ increases past $\lambda_{c2}$, the period $3$ orbit points become unstable, but the trajectory remains, for a few iterations, very close to the now-unstable period $3$ orbit; then, after a few more time steps, the trajectory becomes sufficiently displaced from the unstable period $3$ orbit (since $\lambda_n$ is changing) and is repelled from it, quickly moving to the now stable period $6$ orbit. Thus, we obtain the observed delayed, but rapid, transition from a period $3$ orbit to a period $6$ orbit. As the drift rate of $\lambda$ increases, the transition from a period $3$ orbit to a period $6$ orbit is further delayed, and the details of the period doubling cascade become less discernible and are eventually washed out, as seen from Figs. 3(c) and 3(d), which apply for $\gamma = 5 \times 10^{-5}$. The main point illustrated in Figs. 3(a)–3(d) is that non-stationarity can qualitatively alter the dynamical behavior of a system, especially during a regime transition.
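The delayed transition just described can be seen directly in a small simulation (our illustration; the static period-doubling value $\lambda_{c2} \approx 3.8415$ is an approximate literature value we assume here):

```python
import numpy as np

# Sweep lambda slowly through the period-3 window and record the orbit;
# the orbit departs from the (by then unstable) period-3 orbit only
# noticeably AFTER the static bifurcation point lam_c2 ~ 3.8415.
lam, gamma, x = 3.8350, 1e-5, 0.5
lams, xs = [], []
for _ in range(3000):
    x = lam * x * (1.0 - x)
    lams.append(lam)
    xs.append(x)
    lam += gamma

lams, xs = np.array(lams), np.array(xs)
gap = np.abs(xs[3:] - xs[:-3])                    # ~0 while locked to period 3
depart = int(np.argmax(gap[500:] > 0.02)) + 500   # skip the initial transient
lam_depart = float(lams[depart])
print(lam_depart)    # beyond lam_c2: a delayed, then rapid, transition
```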

Similarly, dynamical noise can also significantly alter the climate and features of regime transitions. To illustrate this, we consider the non-stationary stochastic logistic map,^{83} given by Eq. (4),
$$x_{n+1} = \lambda_n x_n (1 - x_n) + \epsilon_n, \tag{4}$$

where $\epsilon_n$ (the dynamical noise) is a number randomly and independently chosen from a distribution $\zeta(\epsilon)$. For simplicity, we will consider the case in which the noise distribution is uniform on the interval $[-\epsilon, \epsilon]$. As shown in Ref. 84 for the stationary system bifurcation diagram, the dynamical noise acts to blur the fine structure of the period doubling cascade (near $\lambda_\infty$) and appears to cause the period doubling cascade to terminate after a few period doublings. Turning to the non-stationary case, in Fig. 3(e), we have plotted a typical trajectory in the period $3$ window evolving under the action of Eq. (4) with a noise of strength $\epsilon = 0.002$ and $\lambda$ drifting linearly at a rate $\gamma = 10^{-5}$. We see that the dynamical noise “blurs” the period doubling bifurcation from a period $3$ orbit to a period $6$ orbit. We also see, before the occurrence of the internal crisis at the upper $\lambda$ range of the window, that the noisy trajectory intermittently leaves (and returns to) the region of state space that corresponds to the three-piece chaotic attractor of the stationary system. Thus, dynamical noise can cause a significant change in features of regime transitions, including blurring of the sharp parameter value at which a noiseless transition occurs.
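A small numerical comparison (ours; parameter values illustrative) makes the blurring concrete: inside the stable period-$3$ range, the mean residual $|x_{n+3} - x_n|$ is near zero without noise but markedly larger with noise of strength $\epsilon = 0.002$.

```python
import numpy as np

def mean_period3_residual(eps, seed=3, lam0=3.8350, gamma=1e-5, n=600):
    """Mean |x_{n+3} - x_n| for the drifting logistic map with uniform
    dynamical noise on [-eps, eps]; the lambda range stays inside the
    stable period-3 regime."""
    rng = np.random.default_rng(seed)
    x, lam = 0.5, lam0
    xs = []
    for _ in range(n):
        x = lam * x * (1.0 - x) + rng.uniform(-eps, eps)
        lam += gamma
        xs.append(x)
    xs = np.array(xs[300:])               # discard the initial transient
    return float(np.mean(np.abs(xs[3:] - xs[:-3])))

r_clean = mean_period3_residual(0.0)
r_noisy = mean_period3_residual(0.002)
print(r_clean, r_noisy)    # the noise visibly blurs the period-3 orbit
```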

We desire to accurately predict and replicate, from past time series data, the effects of both noiseless and noisy non-stationarity on the climate associated with time evolution of the system.

## IV. APPLICATION OF RC TO CLIMATE PREDICTION OF THE NON-STATIONARY LOGISTIC MAP

### A. The noiseless non-stationary logistic map

Now, we turn to our stated goal: given a finite set of measurements of the state of an unknown non-stationary dynamical system from some time in the past $t = -T$ up to some time $t = 0$, we predict the climate and regime transitions for $t > 0$. More precisely, when the prediction is performed over an ensemble of trajectories, each initiated from a randomly chosen initial condition, we want our predicted ensemble of trajectories to replicate the non-stationary climate of the actual system. For work previously applying initial condition ensembles for studying and understanding non-stationary dynamics, see Refs. ^{85–87}. First, we demonstrate how naïvely applying RC using the measurements of the previous states of the system as the only input to the reservoir (i.e., following the procedure of Sec. II) fails to accomplish our goal. We will call this setup “Setup 1.” Considering the noiseless system given by Eq. (3) for $\gamma = 10^{-5}$, we train the reservoir in the “open-loop” configuration [Fig. 1(a)] using a set of training data, $\{x_n\}$ for $n \le 0$ (1000 data points), and then allow the reservoir to autonomously predict a trajectory, $x_n$ for $n > 0$, forward in time using the “closed-loop” configuration [Fig. 1(b)]. Table I displays the set of hyperparameters used in all tasks in this paper involving the prediction of the logistic map. Figure 4(a) compares the actual trajectory of the system (black dots) and the reservoir-predicted trajectory (red dots). We see that the reservoir-predicted trajectory does not undergo any of the regime transitions that the actual trajectory undergoes. Instead, the reservoir-predicted trajectory appears to evolve according to a fixed stationary map. Going back to the description of the prediction in the stationary case, Sec. II, this is to be expected because the memory of the reservoir is short compared to the timescale for change of $\lambda_n$ (i.e., $\gamma^{-1}$), and the training produces a single fixed $W_{out}$, which, when used in the closed-loop prediction configuration, produces a stationary reservoir dynamical system. As such, we can think of the training procedure of Fig. 1(a) acting on non-stationary training data as choosing a $W_{out}$ that in some sense yields the stationary system that is most consistent with the non-stationary training data $-T < t < 0$. Figure 4(b) shows the map traced out by the reservoir-predicted trajectory (red dots) and by the actual trajectory (black dots) from Fig. 4(a).

Table I. Reservoir hyperparameters used for all logistic map prediction tasks in this paper.

| Hyperparameter | Symbol | Value |
| --- | --- | --- |
| Reservoir size | $N$ | 500 |
| Average number of connections per node | $\langle d \rangle$ | 1 |
| Spectral radius of the reservoir’s adjacency matrix | $\rho_s$ | 0.55 |
| Maximal coupling strength of the input to the reservoir | $\chi$ | 0.23 |
| Tikhonov regularization parameter | $\alpha$ | $2.15 \times 10^{-11}$ |


In order to take into account non-stationarity, we must break the stationarity of the closed-loop dynamical system of Fig. 1. Instead of providing only the state of the dynamical system as input to the reservoir, we also provide the reservoir with the value of the driving parameter $\lambda_n$ of the system during both the training and the prediction. We call this setup, illustrated in Fig. 5, “Setup 2.” (A similar scheme was very recently independently formulated in the pre-print Ref. 37 for studying crises.) We have coupled this parameter to the reservoir in the same way we couple the measured states of the dynamical system; i.e., each reservoir node is coupled to a single randomly selected element of the state vector $u(t)$ or to the driving parameter. Using Setup 2, we will now demonstrate the successful application of RC to the prediction of the noiseless non-stationary logistic map. With an additional modification to this method, we will later extend its capabilities to successfully predict the stochastic logistic map as well.
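The essential change from the stationary scheme of Sec. II is small and can be sketched as follows (our own illustration, not the authors' code; hyperparameter values merely plausible): the input vector is augmented with $\lambda_n$, which is supplied from its known time dependence during both open-loop training and closed-loop prediction.

```python
import numpy as np

rng = np.random.default_rng(4)

# training data from the drifting logistic map (Sec. III)
T, gamma = 1000, 1e-5
x, lam = 0.5, 3.81
xs, lams = [], []
for _ in range(T + 1):
    xs.append(x)
    lams.append(lam)
    x = lam * x * (1.0 - x)
    lam += gamma
xs, lams = np.array(xs), np.array(lams)

# Setup 2: the 2-dimensional input is (x_n, lam_n)
N, chi, rho_s, alpha = 500, 0.23, 0.55, 1e-6
W_in = np.zeros((N, 2))
for i in range(N):                      # each node sees one input component
    W_in[i, rng.integers(2)] = rng.uniform(-chi, chi)
A = np.zeros((N, N))
for i in range(N):                      # sparse: ~1 connection per node
    A[i, rng.integers(N)] = rng.uniform(-1, 1)
A *= rho_s / max(abs(np.linalg.eigvals(A)))

# open-loop training: drive with the measured (x_n, lam_n) pairs
R, r = np.zeros((N, T)), np.zeros(N)
for n in range(T):
    r = np.tanh(A @ r + W_in @ np.array([xs[n], lams[n]]))
    R[:, n] = r
Y = xs[1:T + 1].reshape(1, T)           # target: the next state x_{n+1}
W_out = Y @ R.T @ np.linalg.inv(R @ R.T + alpha * np.eye(N))
train_mse = float(np.mean((W_out @ R - Y) ** 2))

# closed-loop prediction: feed x back, but keep supplying the known lam_n
preds, lam = [], lams[-1]
for _ in range(500):
    out = (W_out @ r)[0]
    preds.append(out)
    r = np.tanh(A @ r + W_in @ np.array([out, lam]))
    lam += gamma
preds = np.asarray(preds)
print(train_mse)    # one-step training error is tiny
```

Because $\lambda_n$ continues to evolve according to its known time dependence during the closed-loop phase, the feedback system is no longer stationary, which is exactly what Setup 1 lacked.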

We assume that we have a set of measurements of the state of the system, as well as knowledge of the time variation of the parameter, $\{x_n, \lambda_n\}_{n=-T}^{n=0}$, from some time in the past $n = -T$ to the present $n = 0$, and we desire to predict the future climate of this non-stationary system. We note that the task of prediction where non-stationarity is due to time variation of a *known* parameter is often of great practical interest. For instance, suppose we are trying to predict how the Earth’s climate will change over time, and we have a set of measurements that describe the climate from the past as well as of relevant quantities that we think may directly influence the dynamical behavior of the terrestrial climate system, such as atmospheric CO$_2$ concentration.^{41,42} We may be interested in forecasting how the climate would change in the future for different rates of increase of CO$_2$ concentration in the atmosphere. In that case, we may treat CO$_2$ concentration levels as a known time-dependent driving parameter of the system. With this in mind, we train the reservoir, as shown in Fig. 5(a), and then we employ the “closed-loop” configuration of Fig. 5(b) to predict the future states of the system. For the case with $\gamma = 10^{-5}$, $\epsilon = 0$, the results are shown in Fig. 4(c). The black dots represent the actual trajectory evolving under Eqs. (3a) and (3b), while the red dots represent the trajectory predicted by the reservoir computer. The reservoir computer was trained on a portion of the actual trajectory corresponding to the interval $\lambda \in [3.81, 3.82]$ ($T = 1000$ data points).
Although the reservoir quickly loses the ability to predict the precise time evolution of the true system orbit, as expected for a chaotic orbit, we see that it was able to replicate and predict the climate, including the various regime transitions: the intermittency transition near $\lambda=3.828$ from a chaotic orbit to a stable period three orbit, the cascade of period doubling bifurcations starting at $\lambda\approx3.845$ leading to chaos near $\lambda=3.857$, and the crisis transition at the upper boundary of the period $3$ window ($\lambda\approx3.856$). Notice that the reservoir-predicted trajectory was able to accurately capture the effects of non-stationarity, for example, in the sudden transition from a period $3$ orbit to a period $6$ orbit discussed previously. We have also considered the reverse bifurcations (obtained as $\lambda_n$ decreases with time), and they are well-predicted by the machine learning method; we omit these figures here but will show the corresponding figures for the next example (the Lorenz system, Sec. V). Figure 4(d) shows a similar result, but here, the system is drifting at a faster rate ($\gamma=5\times10^{-5}$). We see that this faster drift alters the details of the transition quite a bit (e.g., we cannot clearly see a period doubling cascade), while the reservoir, this time trained on $T=250$ data points (over a similar range of $\lambda$ as for the previous case), again accurately mimics the dynamics.

We can verify more quantitatively that the long-term dynamical behavior predicted by the reservoir computer is indeed characteristic of that of the true system as follows. We initiate a trajectory $x_n$ from an initial condition randomly selected from the interval $(0,1)$. We train the reservoir over a portion of the actual trajectory and then predict forward in time. We repeat this many times, each time using a trajectory initiated from a different randomly chosen initial condition. [The number of iterations of this step is made large enough that $\Gamma$ (defined below) appears to be well converged; i.e., $\Gamma$ does not change appreciably when the number of iterates is increased.] After prediction has started, for each of the actual and predicted trajectories, we collect those points in the trajectory that are contained in the segment of the trajectory where $\lambda\in[\lambda^*,\lambda^*+\Delta\lambda]$ for a predetermined small $\Delta\lambda$ and a set of values $\lambda^*$. For each such interval, we then record the orbit points for the actual and predicted trajectories, from which we approximate the probability distribution functions of $x$ on the intervals $[\lambda^*,\lambda^*+\Delta\lambda]$. This allows us to compare the $x$-distributions from the true system and from the RC prediction for different $\lambda^*$ values. To assist in doing this, we define the distance between the two probability distributions as
$$\Gamma(\lambda) = \frac{2}{\Delta x}\int \left| f_a(x;\lambda) - f_p(x;\lambda) \right| dx,$$
where $f_a(x;\lambda)$ is the approximation to the cumulative probability distribution of $x$ over a small interval $[\lambda^*,\lambda^*+\Delta\lambda]$ obtained from the ensemble of actual trajectories, $f_p(x;\lambda)$ is that obtained from the ensemble of reservoir-predicted trajectories, and $\Delta x=x_{\max}-x_{\min}$ is the range in $x$ of the attracting set. For each portion of the trajectory of the system for which we calculate $\Gamma$, the corresponding $\Delta x$ is determined by taking the difference between the maximum and minimum values of the map points, $x_{\max}-x_{\min}$, over the entire ensemble of trajectories, which provides a good estimate of the size of the attractor over that portion of the trajectory. Without the normalization factor of $2/\Delta x$, this metric is the Wasserstein metric,^{88} a notion of distance between two probability distributions. We choose this metric because it accounts for the underlying metric space from which the distributions are drawn and also because, since it uses cumulative distributions rather than histograms, it does not depend on an (arbitrary) choice of histogram binning width. This latter property is of particular importance in the analysis we perform. As an example, consider the situation of predicting an orbit in the periodic regime. If the reservoir-predicted orbit accurately captures the periodicity of the true orbit, but the predicted orbit values are slightly displaced from the true orbit values, then, depending on the choice of the histogram binning width, a statistical distance that directly compares the predicted and true probability distributions (e.g., the Kullback–Leibler divergence^{89}) may suggest that the prediction was either very good or very bad. This arbitrariness is absent when comparing the underlying cumulative probability distributions using the Wasserstein metric.
An intuitive interpretation of the Wasserstein metric is as the minimum "work" needed to transform one probability distribution into the other, where we imagine that the distributions are approximated by histograms with very narrow bins, each bin containing very many samples, and that the transformation from one distribution to the other is accomplished by shifting samples to different bins, in terms of which the work is defined as the average over all samples of their $x$-length shifts. We added the normalization factor $2/\Delta x$ in our definition of $\Gamma$ because, in general, the integral has units that are the same as those of $x$ and because we wish a small value of the metric $\Gamma$ to have meaning independent of $\Delta x$ [e.g., if we are comparing the logistic map vs the Lorenz system (Sec. V)]. Figure 6(a) shows $\Gamma(\lambda)$, calculated using an ensemble of $2000$ randomly initiated trajectories, at locations throughout the period $3$ window for the case with $\gamma=10^{-5}$. We see that there are two instances where $\Gamma(\lambda)$ peaks sharply. The first spike, near $\lambda\approx3.829$, occurs when the actual system undergoes a transition from chaos to period $3$ and the reservoir fails to predict the location of this transition. Note, however, that the narrow width of this spike indicates that this error in the transition point is fairly small. The second spike occurs when the reservoir fails to predict exactly when the interior crisis leads to the abrupt broadening of the three-piece chaotic attracting set into a single-band chaotic set near $\lambda\approx3.860$. Figure 6(b) shows plots of the cumulative probability distributions of the true system (solid curves) and of the RC prediction (dashed curves) at several values of $\lambda$ indicated in Fig. 6(a) (corresponding to $\lambda_1^*=3.825$, $\lambda_2^*=3.845$, $\lambda_3^*=3.865$, and $\Delta\lambda=0.001$).
For those cases, we calculated $\Gamma(\lambda_1^*)=0.0211$, $\Gamma(\lambda_2^*)=0.0064$, and $\Gamma(\lambda_3^*)=0.0245$. Excellent climate predictions are indicated by the close agreement of the solid and dashed curves in Fig. 6(b) and the small values of $\Gamma$ in Fig. 6(a).
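The distribution comparison just described can be sketched in a few lines of Python. The function below (an illustrative implementation, not the authors' code) estimates $\Gamma$ from two samples of orbit points by integrating the absolute difference of their empirical cumulative distributions and applying the $2/\Delta x$ normalization:

```python
import numpy as np

def gamma_distance(samples_a, samples_p):
    """Normalized Wasserstein-1 distance between two empirical distributions.

    Integrates |F_a(x) - F_p(x)| over x and multiplies by 2/Delta_x, where
    Delta_x is the range of the pooled samples (an estimate of attractor size).
    """
    pooled = np.sort(np.concatenate([samples_a, samples_p]))
    dx = pooled[-1] - pooled[0]
    # Empirical CDFs evaluated just to the right of each pooled sample point
    F_a = np.searchsorted(np.sort(samples_a), pooled, side="right") / len(samples_a)
    F_p = np.searchsorted(np.sort(samples_p), pooled, side="right") / len(samples_p)
    # CDFs are piecewise constant: integrate |F_a - F_p| between sample points
    w1 = np.sum(np.abs(F_a[:-1] - F_p[:-1]) * np.diff(pooled))
    return 2.0 * w1 / dx

rng = np.random.default_rng(1)
# Two samples of the same (uniform) distribution: Gamma should be small
g_same = gamma_distance(rng.random(4000), rng.random(4000))
# A shifted distribution: Gamma should be substantially larger
g_diff = gamma_distance(rng.random(4000), rng.random(4000) + 0.5)
```

Because the comparison uses cumulative distributions directly, no histogram binning width enters, matching the binning-independence property emphasized above.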

We now consider the case where the time evolution of $\lambda_n$ is not linear,

Figure 7(a) shows how the slowly changing parameter $\lambda_n$ varies in time for $\theta_0=-\pi/8$, $\gamma=0.001$, $\lambda_0=3.79$, $\lambda_1=0.03$, and $m=10^{-5}$ [note that for this case, $\lambda_1>(m/\gamma)$, so that the variation of $\lambda_n$ with time $n$ is non-monotonic]. Figure 7(b) compares a trajectory evolving under the action of this map to one predicted by the reservoir computer, again using the method illustrated in Fig. 5. Here, we have trained the reservoir computer on a set of data ($T=1000$) where $\lambda$ changes almost linearly [see Figs. 7(a) and 7(b)]. After this training segment, $\lambda_n$ deviates more strongly from linear behavior, and remarkably, the reservoir computer can predict the corresponding changes to the nature of the dynamics of the trajectory. That is, the reservoir computer is able to predict regime transitions and dynamics to which it was not exposed during its training. After training, $\lambda$ initially continues to increase through $\lambda_{c1}$, and the dynamics undergo a transition from chaotic to periodic via a tangent bifurcation; shortly after, $\lambda$ begins to decrease, and the system undergoes an intermittency transition in the reverse direction (i.e., from periodic to chaotic) as $\lambda$ decreases below $\lambda_{c1}$. After this reverse transition, $\lambda$ rapidly increases and traverses the range of the period $3$ window, which results in an intermittency transition from chaotic below $\lambda_{c1}$ to periodic above it, then an abbreviated period doubling cascade starting near $\lambda_{c2}$, and a crisis transition from a three-piece chaotic attractor to a single-band chaotic attractor near $\lambda_{c3}$. The reservoir computer successfully predicts the multiple forward and backward regime transitions experienced by the actual system in this window, as shown in Figs. 7(b) and 7(c).
Figure 7(d) shows $\Gamma(\lambda)$, obtained by using an ensemble of $2000$ randomly initiated trajectories and the corresponding reservoir predictions, over the entire prediction interval. Once again, we see local maxima in $\Gamma(\lambda)$ at points near a regime transition, and, as before, this can be attributed to the reservoir predicting a transition slightly earlier or later than it actually occurs. Figure 7(c) explicitly compares $f_a(x;\lambda)$ and $f_p(x;\lambda)$ for $\theta_1^*=\pi$, $\theta_2^*=1.5\pi$, $\theta_3^*=2.25\pi$, and $\Delta\theta=0.05$ ($\lambda_1^*=3.8253$, $\lambda_2^*=3.8111$, and $\lambda_3^*=3.8858$, respectively). The calculated errors between the distributions are small, $\Gamma(\theta_1^*)=0.0195$, $\Gamma(\theta_2^*)=0.0342$, and $\Gamma(\theta_3^*)=0.0390$, indicating good prediction of the climate at these points.

### B. The noisy non-stationary logistic map

Next, we consider the noisy (or stochastic) logistic map, given by Eq. (4). Once again, we assume that we have a set of measurements of $x_n$ and knowledge of the time variation of the parameter, $\{x_n,\lambda_n\}_{n=-T}^{n=0}$, from some point in the past ($n=-T$) up to the present ($n=0$), and we wish to predict the future trajectory of the system ($n>0$). Applying the method of Sec. IV A (Setup 2, Fig. 5) leads to predicted trajectories that do not accurately capture the noisy dynamics of the true system. Figure 8 shows this result for two different noise levels ($\lambda$ is taken to drift linearly with $\gamma=10^{-5}$).

Note, in particular, the lack of spread in the red dots (as opposed to the black dots) in the period $3$ range of $\lambda$, as well as the actual system's (black dots) intermittent bursting away from the three chaotic attractor pieces preceding the interior crisis at the upper-$\lambda$ boundary of the window (a crisis which, as seen from the red dots, occurs abruptly in the prediction, without being preceded by bursting).

We see that despite training the reservoir computer using a dataset obtained from a noisy trajectory, in agreement with Ref. 90, we obtain a prediction that appears to approximate a *noiseless* trajectory. This is because the linear regression used in the training (see Sec. II) fits the $W_{\text{out}}$ elements to many data pairs $(x_n,x_{n+1})$, thus averaging out the dynamical noise component (which has zero mean). That is, at time step $n$, if the input to the trained reservoir is $x_n$, then its output is an approximation to $\lambda_n x_n(1-x_n)$, the result of applying the *noiseless* map to $x_n$. To restore the effect of dynamical noise on the climate, we begin by representing the true noisy orbit by introducing an intermediate variable $v_n$,
$$v_{n+1} = \lambda_n x_n(1-x_n),$$
in terms of which, by Eq. (4a), the true noisy value of $x_{n+1}$ is $x_{n+1}=v_{n+1}+\epsilon_n$, where $\epsilon_n$ (the noise) is a random number with probability distribution $\zeta(\epsilon_n)$. This can be represented diagrammatically as shown in Fig. 9(a), where an input $x_n$ is mapped according to the noiseless logistic map to $v_{n+1}$. Based on this, we modify our previous non-stationary reservoir prediction scheme [Fig. 5(b)] by adding a stochastic noise component, as illustrated in Fig. 9(b), where $\tilde{v}_{n+1}$ is the reservoir approximation to $v_{n+1}$ and $\tilde{\epsilon}_n$ is an independent random variable with probability distribution $\tilde{\zeta}(\tilde{\epsilon}_n)$, where $\tilde{\zeta}$ is our approximation to the actual noise probability distribution $\zeta$. Given the training data time series $\{x_n\}$ used to train the RC system, we obtain $\tilde{\zeta}$ as follows. Inputting $\lambda_n$ and $x_n$ to the trained RC system, we obtain the corresponding $\tilde{v}_{n+1}$ for each $n$ in the training time interval. Next, we determine $\tilde{\epsilon}_n$ via $\tilde{\epsilon}_n=x_{n+1}-\tilde{v}_{n+1}$, where $x_{n+1}$ comes from the training data. We then make a histogram of this collection $\{\tilde{\epsilon}_n\}$ of values and take this histogram to be our approximate noise probability distribution $\tilde{\zeta}$.

Once we have determined a suitable $\tilde{\zeta}(\tilde{\epsilon})$, we use it to generate samples $\tilde{\epsilon}_n$ for $n$ in the prediction time range for use in the prediction phase, Fig. 9(b). Figure 10 shows results for two different noise levels with a linearly drifting $\lambda_n$ ($\gamma=10^{-5}$), again demonstrating good climate prediction.
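The noise-estimation and re-injection procedure can be sketched as follows (an illustrative Python example; for simplicity, the known noiseless map stands in for the trained reservoir's output $\tilde{v}_{n+1}$, which, by the regression argument above, it approximates):

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy non-stationary logistic map: x_{n+1} = lam_n x_n (1 - x_n) + eps_n
T = 5000
lam = 3.70 + 1e-5 * np.arange(T + 1)
eps_true = rng.uniform(-0.01, 0.01, T)     # zero-mean dynamical noise
x = np.empty(T + 1)
x[0] = 0.5
for n in range(T):
    # clip is a guard; for this lam range the orbit stays inside (0, 1)
    x[n + 1] = np.clip(lam[n] * x[n] * (1.0 - x[n]) + eps_true[n], 0.0, 1.0)

# Stand-in for the trained reservoir's noiseless response v~_{n+1}: here the
# known noiseless map is used directly (the RC regression averages noise out).
v_tilde = lam[:-1] * x[:-1] * (1.0 - x[:-1])

# Residuals approximate the noise realizations; histogram them to build zeta~
eps_tilde = x[1:] - v_tilde
counts, edges = np.histogram(eps_tilde, bins=50)

def sample_noise(size):
    """Draw eps~ from the histogram zeta~: pick a bin, then jitter within it."""
    bins = rng.choice(len(counts), size=size, p=counts / counts.sum())
    return rng.uniform(edges[bins], edges[bins + 1])

eps_new = sample_noise(10000)   # samples to re-inject during prediction
```

During closed-loop prediction, each reservoir output $\tilde{v}_{n+1}$ would then be perturbed by a fresh draw `sample_noise(1)` before being fed back, as in Fig. 9(b).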

Since $\epsilon_n=x_{n+1}-v_{n+1}$ and $\tilde{\epsilon}_n=x_{n+1}-\tilde{v}_{n+1}$, the accuracy of our numerical approximation of the distribution $\zeta$ by $\tilde{\zeta}$ depends on the accuracy of the reservoir's approximation $\tilde{v}_{n+1}$ to the actual, noiseless system response $v_{n+1}$. In the Appendix, we discuss this issue further.

## V. APPLICATION OF RC TO PREDICTING THE LORENZ 63 SYSTEM

Having shown the reservoir computer’s ability to predict climate evolution and regime transitions in a simple discrete-time one-dimensional map, we now implement this approach for the continuous-time three-dimensional Lorenz system,^{72}

For certain choices of the parameters $\sigma$, $\rho$, and $\beta$, typical initial conditions on the attractor of this system give rise to chaotic trajectories. Previous work has shown the ability of reservoir computing to predict the short-term trajectories, as well as the long-term climate, of this system when the parameters are fixed.^{1,3–5,18} Here, we expand on those findings to show that we can predict the long-term dynamical behavior of this system when its parameters are drifting, making the trajectories statistically non-stationary.

We consider time varying $\rho$ in Eq. (8b) while fixing $\sigma=10$ and $\beta=8/3$. To display how the system climate evolves as $\rho$ varies, we take a Poincaré surface of section specified by $dz/dt=xy-\beta z=0$ and $d^2z/dt^2<0$ (i.e., we consider maxima of $z$). We first construct a stationary system bifurcation diagram as follows. For each $\rho_n=n\Delta\rho$, with $\Delta\rho=0.01$, in the interval $[25,200]$, we allow a trajectory to evolve under this system and observe its stationary behavior (i.e., with $\rho_n$ constant in time). Figure 11(a) shows the resulting plot of the $z$-coordinate maxima, $z_m$, vs $\rho$, where $z_m\cong z(t_m)$ and $t_m$ denotes the time of the $m$th piercing of the surface of section [$x(t_m)y(t_m)=\beta z(t_m)$ with $d^2z/dt^2<0$ at $t=t_m$]. In applying RC to predict the time evolution of this system, we restrict our attention to $\rho$ in the range $[140,170]$. In this range [see Fig. 11(b)], there are eight types of regime transitions, four as $\rho$ increases and four corresponding reverse transitions as $\rho$ decreases. The transitions as $\rho$ increases are $(1)$ the crisis near $\rho\approx142$ where the one-piece chaotic attractor splits into a two-piece chaotic attractor, $(2)$ a cascade of pairwise chaotic band splittings beginning near $\rho\approx145$, $(3)$ a period halving cascade beginning near $\rho\approx146.5$, and $(4)$ an intermittency transition from a period $2$ orbit to chaos near $\rho\approx166$.
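A stationary-system scan of this kind can be reproduced with a short script. The following Python sketch (with illustrative choices of integration times and initial condition, not the paper's exact procedure) integrates the Lorenz equations with a fourth-order Runge–Kutta step and collects the $z$ maxima at one fixed $\rho$:

```python
import numpy as np

def lorenz_rhs(u, rho, sigma=10.0, beta=8.0 / 3.0):
    x, y, z = u
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(u, rho, dt):
    k1 = lorenz_rhs(u, rho)
    k2 = lorenz_rhs(u + 0.5 * dt * k1, rho)
    k3 = lorenz_rhs(u + 0.5 * dt * k2, rho)
    k4 = lorenz_rhs(u + dt * k3, rho)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def z_maxima(rho, t_transient=50.0, t_collect=50.0, dt=0.01):
    """Collect successive maxima of z(t): crossings of dz/dt = xy - beta*z = 0
    with d2z/dt2 < 0, located here simply as local maxima of the sampled z."""
    u = np.array([1.0, 1.0, 1.0])
    for _ in range(int(t_transient / dt)):   # discard the initial transient
        u = rk4_step(u, rho, dt)
    zs, maxima = [u[2]], []
    for _ in range(int(t_collect / dt)):
        u = rk4_step(u, rho, dt)
        zs.append(u[2])
        if len(zs) >= 3 and zs[-2] > zs[-3] and zs[-2] > zs[-1]:
            maxima.append(zs[-2])
    return np.array(maxima)

zm = z_maxima(rho=160.0)   # a rho value inside the range scanned in Fig. 11
```

Sweeping `rho` over $[25,200]$ in steps of $0.01$ and plotting each returned `zm` array against $\rho$ reproduces a bifurcation diagram of the type shown in Fig. 11(a).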

We first demonstrate the reservoir's ability to predict the four transitions as $\rho$ increases linearly according to $\rho(t)=\rho_0+\gamma t$, where $\gamma>0$ and $\rho_0$ is an initial value of $\rho$. We chose $\gamma=0.1$. As before with the logistic map (Fig. 5), we assume that we have available to us a set of measurements of the system's state, as well as knowledge of the time variation of the parameter $\rho$, $\{u(t),\rho(t)\}_{t=-T}^{t=0}$, from some point in the past $t=-T$ (taken as $T\cong170$ for this example) up to the "present" $t=0$, in increments of $\Delta t$ dictated by the sampling rate of the measurement process. We will use $\Delta t=0.01$ in the examples that follow. The state of the system is represented by $u(t)=[x(t),y(t),z(t)]^T$. We train the reservoir using our measurement dataset and then predict forward for the duration of time it takes the system to traverse the interval of $\rho$ that is of interest to us. We normalize each input to the reservoir, including the parameter $\rho$, by dividing it by its root-mean-square value over the training interval so that we do not introduce a large bias into the activation function of the reservoir nodes. Tables II and III list the RC hyperparameters that we use for all prediction tasks in this paper involving the noiseless and noisy Lorenz systems, respectively. We then use the sequence $\{z_m\}$ of resulting $z$ maxima to display the results. Specifically, we plot the difference between consecutive $z$ maxima, $z_{m+1}-z_m$, vs $\rho$ in Figs. 12(a) and 12(b) for the trained reservoir (red dots) and the true orbit (black dots). We see that the reservoir was able to accurately capture all the transitions (for increasing $\rho$) discussed above, including the period halving cascade and the abrupt transition to chaos near $\rho\approx169$.
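The input normalization mentioned above can be sketched as follows (illustrative Python; the toy measurement values stand in for the $x$, $y$, $z$, and $\rho$ channels):

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy stand-ins for training-interval measurements (columns = input channels)
train = np.column_stack([
    rng.normal(0.0, 8.0, 1000),           # x(t)
    rng.normal(0.0, 9.0, 1000),           # y(t)
    150.0 + rng.normal(0.0, 40.0, 1000),  # z(t)
    np.linspace(140.0, 141.7, 1000),      # rho(t), drifting linearly
])

# Divide each channel by its RMS over the training interval so that the
# large-magnitude channels (here z and rho) do not bias the tanh activations.
rms = np.sqrt(np.mean(train ** 2, axis=0))
train_scaled = train / rms
```

After scaling, every input channel has unit RMS over the training interval, so the parameter channel enters the reservoir on the same footing as the state channels.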
To more quantitatively verify that the reservoir predicts trajectories that mimic the long-term dynamical properties of the true system, we compare the cumulative distributions of $z$ maxima for chosen intervals $[\rho^*,\rho^*+\Delta\rho]$, where $\Delta\rho=0.5$, obtained from an ensemble of $2000$ randomly initiated trajectories for both the true system, Eqs. (8), and the corresponding reservoir predictions [see Figs. 12(c) and 12(d)]. Figure 12(d) shows a plot of $\Gamma(\rho)$ for different values of $\rho$ in the prediction window, showing that this quantity is small throughout most of the interval. There is, however, a large spike near $\rho=167$, which corresponds to the transition from a period two orbit to a chaotic orbit. This spike occurs because, as was the case for the logistic map, the reservoir does not always get the exact location of the transition correct. However, the predicted transition point is still fairly close to the actual transition point. This can be seen from the immediate drop in $\Gamma(\rho)$ after the spike [and from Figs. 12(a) and 12(b)]. In Fig. 12(c), we show the cumulative probability distributions for $\rho_1^*=139$, $\rho_2^*=149$, $\rho_3^*=159$, and $\rho_4^*=169.4$. The corresponding errors in the distributions were $\Gamma(\rho_1^*)=0.001$, $\Gamma(\rho_2^*)=0.0021$, $\Gamma(\rho_3^*)=0.0050$, and $\Gamma(\rho_4^*)=0.0058$. These small values confirm that the reservoir has accurately replicated the non-stationary climate of the system at these $\rho$ values.

TABLE II. RC hyperparameters for the noiseless Lorenz system.

| Description | Symbol | Value |
| --- | --- | --- |
| Reservoir size | $N$ | 1000 |
| Average number of connections per node | $\langle d\rangle$ | 3 |
| Spectral radius of the reservoir's adjacency matrix | $\rho_s$ | 0.85 |
| Maximal coupling strength of the input to the reservoir | $\chi$ | 0.6775 |
| Tikhonov regularization parameter | $\alpha$ | $2.15\times10^{-11}$ |
| Numerical integration time step = RC time step | $\Delta t$ | 0.01 |


TABLE III. RC hyperparameters for the noisy Lorenz system.

| Description | Symbol | Value |
| --- | --- | --- |
| Reservoir size | $N$ | 1000 |
| Average number of connections per node | $\langle d\rangle$ | 4 |
| Spectral radius of the reservoir's adjacency matrix | $\rho_s$ | 0.85 |
| Maximal coupling strength of the input to the reservoir | $\chi$ | 0.4 |
| Tikhonov regularization parameter | $\alpha$ | $1.29\times10^{-10}$ |
| Numerical integration time step = RC time step | $\Delta t$ | 0.01 |


Next, we demonstrate the reservoir's ability to predict the transitions in the opposite direction (i.e., for $\rho$ decreasing in time); specifically, we take $\rho(t)=\rho_0+\gamma t$ with $\gamma=-0.1$ and $\rho_0$ above the window. In addition, we also add white noise $\epsilon_z(t)$ to the right-hand side of the Lorenz equation for $z(t)$, Eq. (8c), implemented by adding a random number sampled from a uniform distribution over the interval $[-0.05,0.05]$ at each $\Delta t=0.01$ time step. We use the method developed in Sec. IV B to approximate a distribution $\tilde{\zeta}(\tilde{\epsilon}_z)$ from which we can sample $\tilde{\epsilon}_z$ and re-inject approximate noise [Fig. 9(b)] into our prediction scheme to replicate the effects of the dynamical noise present in the actual system. In Fig. 13, we show the results of the reservoir prediction of this noisy non-stationary process. Figure 13(a) and its zoomed-in view, Fig. 13(b), compare a plot of $z_{m+1}-z_m$ vs $\rho$ for a true trajectory and for that predicted by the reservoir. We see that the reservoir is able to anticipate the intermittency-type regime transition from chaos to a period two orbit at $\rho\approx165$, the noisy period doubling starting near $\rho\approx154$, and the noisy pairwise band merging near $\rho\approx145$ (as well as the noise-induced intermittency-like behavior just prior to the band merging). Figure 13(d) shows $\Gamma(\rho)$ for various values of $\rho$ in the prediction window. Again, we see a spike near $\rho\approx165$, which is due to the reservoir's slightly inaccurate prediction of when the intermittency transition initiating the start of the window occurs. There is also a growing discrepancy between the reservoir-predicted cumulative probability distribution and that obtained from actual trajectories toward the end of the prediction window due to error in the reservoir's prediction of when the specific period doubling bifurcations and the transition to chaos occur.
Even so, although the predictions of these transitions are shifted from their true locations, their existence and occurrence are still well-anticipated. Figure 13(c) compares the approximate cumulative distributions at locations $\rho_1^*=139$, $\rho_2^*=149$, $\rho_3^*=159$, and $\rho_4^*=168.5$ as obtained from the true trajectories and as predicted by the reservoir for $2000$ randomly initialized trajectories. The calculated errors at these points are small, $\Gamma(\rho_1^*)=0.0123$, $\Gamma(\rho_2^*)=0.0077$, $\Gamma(\rho_3^*)=0.0025$, and $\Gamma(\rho_4^*)=0.0008$, once again indicating that the reservoir has satisfactorily replicated the dynamics of the different regimes.

For comparison, Fig. 13(e) shows $\Gamma(\rho)$ for the same noisy Lorenz system but with $\rho_0$ below the window and $\rho(t)$ increasing with time, i.e., $\gamma=+0.1$. We note that this plot differs from the previous case [Fig. 13(d)], with $\rho(t)$ decreasing, in that there is no spike in $\Gamma$ near the start of the prediction window. We do, however, see a relatively larger error in the reservoir's prediction of the transition from a period $2$ orbit to chaos.

We now revisit the case of non-monotonic parameter variation [cf. Eq. (6) and Fig. 7 of Sec. IV A], this time for the noiseless Lorenz system,

where we set $\rho_0=120$, $\rho_1=10$, $\gamma=0.02$, and $\mu=0.1$. Figure 14(a) shows the time variation of $\rho(t)$. Note that the training is over an interval of time during which $\rho(t)$ is approximately linear and increasing. The prediction is then carried out as $\rho(t)$ changes highly nonlinearly and, furthermore, contains an interval where $\rho(t)$ decreases with time, $d\rho/dt<0$. Figure 14(b), and the blow-up in Fig. 14(c), compares $z_{m+1}-z_m$ vs time for a true trajectory evolving according to the Lorenz system with that from the reservoir prediction. Figure 14(d) shows the values of $\Gamma(\theta)$ over the prediction interval. As with the previous cases, we see a spike near the intermittency transition from the period $2$ orbit to chaos near $\theta\approx2.67\pi$ (which corresponds to $\rho\approx165$), which is due to the reservoir's slightly inaccurate prediction of when the transition occurs. The approximate cumulative distributions, computed over small intervals $\Delta\theta=0.05$, are obtained at three times, $\theta_1^*=\pi$, $\theta_2^*=1.8\pi$, and $\theta_3^*=2.9\pi$ [shown in Fig. 14(e)]. At these $\theta$ values, the calculated errors between the distributions obtained from the actual orbits and those obtained from the reservoir predictions are small, $\Gamma(\theta_1^*)=0.0006$, $\Gamma(\theta_2^*)=0.0014$, and $\Gamma(\theta_3^*)=0.0084$. We note that, although the reservoir was trained over an interval during which $\rho$ increased approximately linearly, it was still able to predict the transitions and climate as $\rho$ changed nonlinearly. In particular, the reservoir was able to accurately capture the transition from the two-piece chaotic set to a single-band chaotic set near $\theta\approx1.75\pi$. After this point, $\rho$ increases rapidly, leading to a transition from a period $2$ orbit to chaos, which the reservoir is able to accurately predict.
To summarize, our results for the parameter time variation, Eq. (9), show that the reservoir was able to replicate the long-term dynamical properties of the non-stationary Lorenz system when it was trained on a set of measurements where the driving parameter $\rho $ varied differently than it did during prediction.

## VI. APPLICATION OF RC TO PREDICTING THE KURAMOTO–SIVASHINSKY SYSTEM

We consider a scalar field $w(x,t)$ in one spatial variable $x$ and time $t$, with periodic boundary conditions $w(x,t)=w(x+2\pi,t)$ and with $w(x,t)$ satisfying the noisy, possibly non-stationary, Kuramoto–Sivashinsky (KS) equation,

where $\xi(x,t)$ represents spatially dependent noise that is white in time and the parameter $\kappa$ can be either constant in time, in which case we write $\kappa=\kappa_0$, or varying with time [$\kappa=\kappa(t)$]. Furthermore, for simplicity, following Christiansen *et al.*,^{91} we consider solutions $w(x,t)$ having odd symmetry about $x=0$, $w(x,t)=-w(-x,t)$, in which case we can expand $w(x,t)$ in a Fourier series of the form

where $a_p(t)=-a_{-p}(t)$, $a_p(t)$ is real, and $a_0(t)=0$. (In the general, non-symmetric case, the Fourier coefficients, rather than being purely imaginary, $ia_p$, would have both real and imaginary parts.) In order that the noise not break this convenient symmetry, we take the noise to also have the same symmetry, in which case its Fourier representation has the same form as Eq. (11),

Hence, the $\xi_p(t)$ are real white noise terms that are uncorrelated for $p\neq p'$ and delta-correlated in time, $\langle\xi_p(t)\xi_p(t')\rangle=\text{(const.)}\times\delta(t-t')$, where the angle brackets denote an ensemble average.

Substituting Eqs. (11) and (12) into Eq. (10), we obtain an infinite, coupled set of first-order ordinary differential equations,^{91}

We solve Eq. (13) numerically using the fourth-order Runge–Kutta method with a time step of $0.0001$ and a truncation approximation at $p_{\max}=16$ (i.e., we set $a_p=0$ for $p>p_{\max}$). Figure 15 shows an example of a stationary system solution for $w(x,t)$ with $\kappa_0=0.02975$, where time is plotted horizontally, $x$ is plotted vertically, and the value of $w$ is color-coded.

Considering the noiseless [$\xi(x,t)=0$], stationary ($\kappa=\kappa_0$) version of Eq. (10), again following Ref. 91, we use a Poincaré map with surface of section $a_1=0$ to obtain a stationary system bifurcation diagram showing the various regime transitions and bifurcations. This is shown in Fig. 16(a), which plots the values of $a_6(t_n)$ at many surface of section crossings ($t=t_n$) vs $\kappa_0$ for $\kappa_0$ in the range $0.0296<\kappa_0<0.02989$. In this range of $\kappa$, we see chaotic dynamics interspersed with periodic windows of different periodicities (e.g., the period $5$ window near $\kappa_0\approx0.02985$). We also see that the size of the attractor is larger for smaller values of $\kappa_0$.

Applying our RC prediction method for a non-stationary, noiseless case,
$$\kappa(t) = \kappa(0) + \gamma t,$$
with parameters given in Table IV, we obtain the results in Fig. 16(b), comparing the RC prediction (red dots) with the numerical solution of the KS equation (black dots). Here, we set $\gamma=5.5\times10^{-8}$ and $\kappa(0)=0.029825$. In this case, the (relatively) fast drift of the parameter $\kappa$ results in the loss of fine detail in the regime transitions in this period $5$ window (e.g., we cannot discern a period doubling cascade in each of the five branches near the lower $\kappa$ range of the window). Instead, we see a sudden transition from chaos below $\kappa\approx0.029844$ to a period $5$ orbit above it and, similarly, an abrupt transition from the period $5$ orbit below $\kappa\approx0.029854$ to chaos above it. We see from Fig. 16(b) that the reservoir system is able to capture the transitions from chaos to periodic and then back to chaos. The red dots represent the reservoir-predicted surface of section crossings of the $a_6$-coordinate, $a_6(t_n)$, and the black dots represent $a_6(t_n)$ obtained from a numerical solution of Eq. (13). Figure 16(c) shows a plot of $\Gamma(\kappa)$ over this period $5$ prediction window, calculated using an ensemble of $1000$ randomly selected initial conditions and with $\Delta\kappa=5\times10^{-7}$. As before, we see that the value of $\Gamma$ is small throughout the window, indicating that the reservoir-predicted trajectories adequately replicate the non-stationary climate of this system. The spikes in $\Gamma$ near $\kappa\approx0.029844$ and $\kappa\approx0.029853$ are due to the reservoir's slightly inaccurate prediction of the exact locations of the transitions from chaotic to periodic motion and from periodic back to chaotic, respectively. The narrow width of these spikes indicates that this error is small. Figure 16(d) shows the cumulative probability distributions obtained from the numerically integrated solutions of Eq. (13) (solid curves) and from the reservoir-predicted trajectories (dashed curves) at $\kappa_1^*=0.0298400$, $\kappa_2^*=0.0298500$, and $\kappa_3^*=0.0298565$. The corresponding values of $\Gamma$ are $\Gamma(\kappa_1^*)=0.0026$, $\Gamma(\kappa_2^*)=0.0038$, and $\Gamma(\kappa_3^*)=0.0038$. The agreement between the curves and the small values of $\Gamma$ show that the reservoir-predicted trajectories accurately capture the dynamics of the different regimes.

TABLE IV. RC hyperparameters for the Kuramoto–Sivashinsky system.

| Description | Symbol | Value |
| --- | --- | --- |
| Reservoir size | $N$ | 2000 |
| Average number of connections per node | $\langle d\rangle$ | 3 |
| Spectral radius of the reservoir's adjacency matrix | $\rho_s$ | 0.45 |
| Maximal coupling strength of the input to the reservoir | $\chi$ | 0.6775 |
| Tikhonov regularization parameter | $\alpha$ | $1.67\times10^{-7}$ |
| Numerical integration time step for Eq. (13) | $\Delta t$ | 0.0001 |
| RC time step | $\Delta t_{RC}$ | 0.005 |


To ensure reservoir stability during the closed-loop prediction phase for the non-stationary, noiseless case, we found it necessary to insert a small noise $\delta u(t)$, sampled from a uniform distribution over the interval $[-\delta u, \delta u]$, into the reservoir update equation [Eq. (1)] at each time step.

For the above example, we used $\delta u = 9\times 10^{-5}$. In addition, we used a “multi-pass” training scheme to obtain the above results as follows. The reservoir is run in the open-loop training configuration using Eq. (15), the training data $\{u(t), \kappa(t)\}_{t=-T}^{t_0}$, and a realization of the input noise process $\{\delta u_1(t)\}_{t=-T}^{t_0}$. The reservoir states $\{r(t)\}$ are collected. The reservoir, after re-initializing it to some starting value $r(0)$, is again run in the open-loop configuration using Eq. (15), the same training data $\{u(t), \kappa(t)\}_{t=-T}^{t_0}$, and a *different* realization of the input noise process $\{\delta u_2(t)\}_{t=-T}^{t_0}$. The reservoir states are collected again. This procedure is repeated a predetermined number of times (ten in this case), and a linear regression is performed on the collection of all resulting reservoir states to determine the output weight matrix $W_{out}$, which maps the collection of reservoir states $\{r(t)\}$ to $\{u(t)\}_{t=-T}^{t_0}$. (The closed-loop prediction phase is performed in the usual way, i.e., without adding any noise to the reservoir update equation.) This method uses a single set of training data and multiple realizations of an input noise process to allow determination of a more robust $W_{out}$. (A detailed exposition and analysis of noise stabilization of machine learning prediction will be the subject of a future publication.)
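The multi-pass scheme described above can be sketched as follows. This is a minimal, illustrative implementation assuming a generic tanh reservoir update with a combined state-plus-parameter input; the reservoir size, update rule [standing in for the paper's Eq. (15)], and all constants here are illustrative rather than the paper's exact setup.

```python
import numpy as np

def multipass_train(u, kappa, n_passes=10, noise_amp=9e-5,
                    alpha=1.67e-7, seed=0):
    """Drive the reservoir open-loop n_passes times with the SAME
    training data but a fresh input-noise realization on each pass,
    pool all resulting reservoir states, and fit W_out by a single
    Tikhonov-regularized (ridge) regression over the pooled states."""
    rng = np.random.default_rng(seed)
    T, dim = u.shape
    N = 200                                        # toy reservoir size
    mask = rng.random((N, N)) < 3.0 / N            # ~3 connections per node
    A = rng.uniform(-1, 1, (N, N)) * mask
    A *= 0.45 / np.max(np.abs(np.linalg.eigvals(A)))  # set spectral radius
    W_in = rng.uniform(-0.6775, 0.6775, (N, dim + 1))  # state + kappa input
    states, targets = [], []
    for _ in range(n_passes):
        r = np.zeros(N)                            # re-initialize each pass
        for t in range(T - 1):
            du = rng.uniform(-noise_amp, noise_amp, dim)  # fresh input noise
            inp = np.concatenate([u[t] + du, [kappa[t]]])
            r = np.tanh(A @ r + W_in @ inp)
            states.append(r.copy())
            targets.append(u[t + 1])
    R = np.asarray(states)
    Y = np.asarray(targets)
    # Ridge regression over the pooled states:
    # W_out = Y^T R (R^T R + alpha I)^(-1)
    W_out = Y.T @ R @ np.linalg.inv(R.T @ R + alpha * np.eye(N))
    return W_out, A, W_in
```

Pooling states from several noise realizations of the same training data makes the regression average over the input perturbations, which is what yields the more robust $W_{out}$.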

Next, we apply our RC prediction method to the noisy, non-stationary case with $\gamma = -5\times 10^{-7}$, and, as in Sec. V, the noise ($\xi_p$) is numerically implemented by adding a random number at each iteration of our numerical solution of Eq. (13) for $a_p$, $0 < p \le p_{\max}$, where the random number, independent for each value of $p$, is uniformly chosen in the range $[-5\times 10^{-5}, 5\times 10^{-5}]$. Table V lists the hyperparameters used for this task. As in the previous case, we use the multi-pass training scheme with ten passes (i.e., we run the reservoir in the open-loop configuration ten times using the same set of training data and a different noise realization each time) and noise sampled from the same distribution as before. We consider the range $\kappa \in [0.02960, 0.02988]$, in which, due to the rate of non-stationarity and the added dynamical noise, we cannot discern the periodic windows [see the black dots in Fig. 16(e), representing $a_6(t_n)$ obtained from a numerical solution of Eq. (13)] that are visible in the stationary bifurcation diagram of Fig. 16(a). Here, we see that the size of the attractor gradually increases with decreasing $\kappa$. The reservoir-predicted trajectory [represented by the red dots in Fig. 16(e)] accurately captures this gradual increase in the size of the attractor. Figure 16(f) shows $\Gamma(\kappa)$, obtained from $5000$ randomly initialized trajectories, over the prediction window. From the small value of $\Gamma$ over the prediction interval, we see that the reservoir-predicted trajectory also accurately captures the $\kappa$-dependent density variation of $a_6(t_n)$. This indicates that the non-stationary climate of this system is well replicated by the reservoir-predicted trajectories.

Table V. Hyperparameters used for the RC prediction of the noisy, non-stationary KS system.

| Parameter | Symbol | Value |
| --- | --- | --- |
| Reservoir size | N | 2000 |
| Average number of connections per node | ⟨d⟩ | 2 |
| Spectral radius of the reservoir’s adjacency matrix | ρ_{s} | 0.65 |
| Maximal coupling strength of the input to the reservoir | χ | 0.9 |
| Tikhonov regularization parameter | α | 7.74 × 10^{−10} |
| Numerical integration time step for Eq. (13) | Δt | 0.0001 |
| RC time step | Δt_{RC} | 0.005 |


## VII. CONCLUSION

In this paper, we have formulated and tested a machine learning method for prediction of the behavior, or “climate” (defined in Sec. I), of non-stationary dynamical systems (both noiseless and noisy) from time series data of the system state and knowledge of the time dependence of a system parameter responsible for the non-stationarity, but with no further knowledge of the nature of the system to be predicted. The methods introduced in Sec. IV are based on a previously formulated machine learning feedback prediction scheme for the dynamics of stationary systems (Sec. II), which we modify in order to adapt it for the purpose of climate prediction of noiseless non-stationary dynamical systems (Sec. IV A) and of noisy non-stationary dynamical systems (Sec. IV B). In the case of prediction for noiseless non-stationary systems, the modified prediction method (Fig. 5) is the stationary system prediction method (Fig. 1) altered to include input of the known time-dependent parameter {and this input applies for both the training phase [Fig. 5(a)] and the prediction phase [Fig. 5(b)]}. In the case of noisy non-stationary systems, the prediction method (Fig. 9) contains a further alteration in which dynamical noise approximating that of the unknown system to be predicted is injected at the output side of the machine learning system; we describe how to obtain, from the available time series data, the required approximate probability distribution of the actual noise.

These methods are tested on non-stationary versions of three systems: a simple discrete-time map (the logistic map, Sec. IV), a continuous-time set of three autonomous ordinary differential equations (the Lorenz 63 equations, Sec. V), and a spatiotemporal partial differential equation (the Kuramoto–Sivashinsky equation, Sec. VI). The tests are evaluated by comparing orbit characteristics of the actual systems with those of the predictions, using plots of orbits, cumulative probability distributions, and the Wasserstein metric for the distance between probability distributions.

Some dynamical orbit features whose predictions are tested include the following:

- Apparently continuous evolution of the distribution of chaotic orbit densities in chaotic regions without apparent regime transitions.

- Non-stationarity-induced phenomena, such as the apparent delay and increased abruptness of period doublings [Fig. 3(b)].

- The effects of dynamical noise, e.g., the apparent blurring of features present in stationary system bifurcation diagrams and the intermittent bursting out of the small multiple-piece chaotic attractors preceding interior crisis transitions [Figs. 8 and 10(b)].

- Various types of regime transitions, including the forward and backward versions of period doubling cascades of periodic orbits followed by pairwise splitting cascades of chaotic attractors, intermittency transitions between periodicity and chaos, and crisis transitions.

Results of our tests show that the future pattern of climate variation with time is extremely well predicted and that the quantitative agreement between the predicted and actual behavior is also remarkably good, although there are some notable small-to-moderate shifts in the predicted occurrences of regime transitions. Based on these results, we expect that machine learning will be of great future utility as a means of predicting the statistical behavior (climate) of unknown, non-stationary dynamical systems from time series data. Resolution of the question of how, and whether, this expectation is realized for the many potential practical applications (see Sec. I) awaits further study.

## ACKNOWLEDGMENTS

We thank Brian Hunt for insightful comments, suggestions, and advice. The authors acknowledge the University of Maryland supercomputing resources (http://hpcc.umd.edu) made available for conducting the research reported in this paper. This work was supported, in part, by the National Science Foundation (NSF) under Grant Nos. DMS-1813027 and DGE-1632976 and by the Defense Advanced Research Projects Agency (DARPA) under Grant No. W31P4Q-20-C-0077.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX: ANALYSIS OF THE NOISE-INJECTION SCHEME

As shown in Fig. 9, to approximately replicate the effects of uncorrelated stochastic forcing, we inject an uncorrelated noise term $\tilde{\epsilon}_n$ into the reservoir output at each time step. We desire that the distribution $\tilde{\zeta}$ of the injected noise $\tilde{\epsilon}_n$ be a reasonable approximation to the distribution $\zeta$ of the actual noise $\epsilon_n$. To obtain this estimate, during the training, we take the difference between the reservoir output $\tilde{v}_{n+1}$ and the actual noisy system trajectory $x_{n+1}$ as our noise estimate, $\tilde{\epsilon}_n = x_{n+1} - \tilde{v}_{n+1}$. If the reservoir output $\tilde{v}_{n+1}$ were the actual noiseless output $v_{n+1}$, then $\tilde{\epsilon}_n = \epsilon_n$; but the reservoir makes some error $\Delta v_{n+1} = \tilde{v}_{n+1} - v_{n+1}$, and $\tilde{\epsilon}_n$ is only an approximation of the true noise, i.e., $\tilde{\epsilon}_n = \epsilon_n + \Delta v_{n+1}$. Thus, to the extent that the reservoir error $\Delta v_{n+1}$ is small compared to the actual noise $\epsilon_n$, our approximation will be good. Noting that $\epsilon_n$ is an uncorrelated noise, while both $v_{n+1}$ and $\tilde{v}_{n+1}$ come from deterministic mappings of a state $x_n$ that is independent of $\epsilon_n$ (where these mappings are the mapping of the system to be predicted and its reservoir approximation), we see that the reservoir error $\Delta v_{n+1}$ is uncorrelated with $\epsilon_n$. Letting $\zeta_{\Delta}(\Delta v)$ denote the distribution function of $\Delta v$ computed from the time series $\Delta v_n$, by virtue of $\tilde{\epsilon}_n = \epsilon_n + \Delta v_{n+1}$, $\tilde{\zeta}$ is given by the convolution

$$\tilde{\zeta}(\tilde{\epsilon}) = \int \zeta(\epsilon)\, \zeta_{\Delta}(\tilde{\epsilon} - \epsilon)\, d\epsilon.$$

Roughly speaking, the approximation $\tilde{\zeta}(\tilde{\epsilon})$ is obtained by broadening $\zeta(\epsilon)$ by the width of $\zeta_{\Delta}$, which, if small compared to the width of $\zeta$, may be expected to yield a good approximation of $\zeta$. Using a flat-top distribution for $\zeta(\epsilon)$ (as in our previous numerical experiments), we demonstrate this for the non-stationary logistic map for two noise levels [widths of $\zeta(\epsilon)$], $0.0005$ and $0.002$, and $\gamma = 10^{-5}$, as follows. An ensemble of randomly chosen initial conditions is evolved to obtain the actual orbit $x_n$, using the logistic map with noise $\epsilon_n$, and the trained reservoir is used to obtain the time series $\tilde{\epsilon}_n = x_{n+1} - \tilde{v}_{n+1}$, where $\tilde{v}_{n+1}$ is the reservoir output when its input is $x_n$. In addition, we also use our determination of the actual orbit $x_n$ to obtain the time series $v_n$ via Eq. (7). We then use these time series to form histogram approximations to $\zeta$, $\tilde{\zeta}$, and $\zeta_{\Delta}$. Figures 17(a) and 17(b) show numerical results for the distributions $\zeta_{\Delta}$ (green lines), $\zeta$ (blue lines), and $\tilde{\zeta}$ (black lines). Figure 17(a) shows a logistic map case where the dynamical noise level is substantially larger than the reservoir error (the width of $\zeta_{\Delta}$), and $\tilde{\zeta}$ provides a good approximation to $\zeta$. Figure 17(b) provides an example where the reservoir error is of the same order as the noise, leading $\tilde{\zeta}$ to noticeably deviate from $\zeta$; we note, however, that the reservoir training error could be decreased by using a larger reservoir with well-optimized hyperparameters, so that $\tilde{\zeta}$ better approximates $\zeta$.
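The broadening argument can be checked numerically. The following sketch uses synthetic data with illustrative amplitudes (not the paper's values) to show that when the reservoir-error width is small compared with the flat-top noise width, the estimated noise distribution is only slightly broadened:

```python
import numpy as np

# Synthetic check of the appendix's broadening argument: the noise
# estimate eps_tilde = eps + dv is the true flat-top noise eps broadened
# by an independent "reservoir error" dv, so its distribution is the
# convolution of zeta with zeta_Delta.
rng = np.random.default_rng(1)
n = 200_000
eps = rng.uniform(-0.002, 0.002, n)  # true noise eps_n, distribution zeta
dv = rng.normal(0.0, 2e-4, n)        # reservoir error, distribution zeta_Delta
eps_tilde = eps + dv                 # noise estimate, distribution zeta~

# When dv is narrow compared with eps, the broadening is slight: the
# standard deviation of eps_tilde exceeds that of eps by roughly
# sigma_dv**2 / (2 * sigma_eps**2) in relative terms (~1.5% here).
width_true = eps.std()
width_est = eps_tilde.std()
```

With a reservoir error of the same order as the noise, the same arithmetic shows the broadening becomes substantial, which is the regime illustrated in Fig. 17(b).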
