Drawing on ergodic theory, we introduce a novel training method for machine learning-based forecasting of chaotic dynamical systems. The training enforces dynamical invariants—such as the Lyapunov exponent spectrum and the fractal dimension—of the systems of interest, enabling longer and more stable forecasts when operating with limited data. The technique is demonstrated in detail using reservoir computing, a specific kind of recurrent neural network. Results are given for the Lorenz 1996 chaotic dynamical system and a spectral quasi-geostrophic model of the atmosphere, both typical test cases for numerical weather prediction.

Numerical weather prediction (NWP) requires vast amounts of computational power to solve global forecasting models. Machine learning (ML) systems—acting as surrogate models for the governing differential equations—promise to lessen this computational burden by using a larger time step than a classical numerical solver or through dedicated hardware acceleration. In order to be integrated into the data assimilation optimization routines of NWP, ML models must reproduce the correct intrinsic error growth rates of the chaotic system along with other dynamical invariants. We introduce here a training method for recurrent neural networks—a type of ML model that has shown great success in predicting chaotic dynamical systems—that enforces these invariants. By providing information on the physics of the system, we bias the network toward reproducing the correct physical characteristics and enhance its predictive capabilities.

Predicting the future trajectory of a dynamical system—a time series whose evolution is governed by a set of differential equations—is crucial in fields such as weather prediction, economics, chemistry, physics, and many others.1,2 A prediction can be generated by deriving the governing equations of motion (EOM) for the system and integrating forward in time, perhaps with data being used to determine the value of particular constants or the initial conditions. Machine learning (ML), on the other hand, allows the construction of a forecast purely from observational data in lieu of a physical model. When the EOM are expensive to evaluate numerically, ML can be used to construct a surrogate model; such models can be integrated into data assimilation3 algorithms—such as the Kalman filter4,5—a typical use case when data are noisy and the model imperfect, as in numerical weather prediction (NWP).6

The inclusion of physical knowledge—EOM, conservation laws, and dynamical invariants—into ML algorithms has been a topic of ongoing interest.7–15 Enforcing these laws effectively reduces the searchable parameter space for a workable model, decreasing the training time and increasing the accuracy of the resulting models. An ML model trained without knowledge of the underlying physics may fail to generalize and can produce solutions that violate fundamental constraints on the physical system.16 Many of the examples cited above involve conservation of quantities based on the symmetry of the equations of motion, such as conservation of energy and momentum,13 or the inclusion of previously derived differential equations11 as components of the ML training. “Physics informed” neural networks14,15,17–19 add the known or partially known differential equations as a soft constraint in the loss function of the neural network, but conservation laws are not necessarily enforced and the equations must be known.

Many physical dynamical systems of interest are dissipative—e.g., any dynamical system containing friction—meaning that they exchange energy and mass with the surrounding environment.20 High dimensional dissipative systems are very likely to exhibit chaos21—making them extremely sensitive to initial conditions. Enforcing conservation of quantities, such as momentum, mass, or energy,11,13,22,23 for dissipative systems in isolation may not be sufficient for generalization due to the exchange of energy/momentum with the environment. Problems exhibiting chaotic dynamics, such as weather forecasting, have fractal phase space trajectories that make it difficult to specify analytic constraints.11 

With the goal of enforcing dynamical invariants, we suggest an alternative cost function for dissipative systems based on ergodicity, rather than symmetry. This has broad implications for time series prediction of dynamical systems. After formulating the invariants, we give a recurrent neural network (RNN)24 example applied to the Lorenz 1996 system25 and quasi-geostrophic dynamics26 where we add soft constraints into the training of the network in order to ensure that these dynamical invariant quantities are reproduced.

Ergodicity is a property of the evolution of a dynamical system. A system exhibiting ergodicity, called ergodic, is one in which the trajectories of that system will eventually visit the entire available phase space,27 with time spent in each part proportional to its volume. In general, the available phase space is a subset of the entire phase space volume; for instance, a Hamiltonian system will only visit the hypersurface with constant energy.20 Ergodicity implies that time averages over the system trajectories can be replaced with spatial averages,
\lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} g\bigl(F^{t}(u)\bigr)\, dt = \int_{B} g(u)\, \rho_{B}(u)\, du   (1)
for an arbitrary function g, where F^t is the application of the flow of the dynamical system over time t, and ρ_B(u) defines an invariant density over the finite set B.27 The invariant density gives an intuitive measure of how often a trajectory visits each part of B. The invariant density defines the invariant measure27
d\mu_{B}(u) = \rho_{B}(u)\, du.   (2)

B will often consist of exotic geometries, such as quasiperiodic orbits and strange attractors.20 The strange attractor is a hypersurface that contains the available subspace for a chaotic dynamical system—see Fig. 1. Deterministic chaotic systems are of importance to a vast array of applications, such as in NWP,3 chemical mixing,31,32 optics,33 robotics,34 and many other fields.
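
To make Eq. (1) concrete, the short sketch below (an illustration added here, not taken from the original work) compares a long time average of an observable along a single chaotic trajectory of the fully chaotic logistic map with the corresponding average over its known invariant density; for an ergodic system the two estimates agree.

```python
import numpy as np

# Illustration of Eq. (1) for the logistic map x -> 4x(1-x), whose invariant
# density on B = [0, 1] is rho_B(x) = 1/(pi*sqrt(x*(1-x))).

def g(x):
    return x ** 2          # an arbitrary observable

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 0.9)
n_steps = 500_000
time_avg = 0.0
for _ in range(n_steps):
    x = 4.0 * x * (1.0 - x)        # apply the map (the "flow" F^t for a discrete system)
    time_avg += g(x)
time_avg /= n_steps

# Spatial average of g weighted by the invariant density rho_B(u)
u = np.linspace(1e-6, 1.0 - 1e-6, 200_000)
rho = 1.0 / (np.pi * np.sqrt(u * (1.0 - u)))
space_avg = np.trapz(g(u) * rho, u)

print(f"time average  = {time_avg:.4f}")   # ~0.375
print(f"space average = {space_avg:.4f}")  # exactly 3/8
```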

FIG. 1.

(left) 2D slice through the strange attractor of the spectral QG model reproduced from Fig. 3 in Reinhold and Pierrehumbert26 using the implementation in Demaeyer et al. 2020.28 An attractor is a hypersurface that draws in nearby trajectories of the system such that the system will eventually be constrained to stay on the manifold. The motion of the dynamical system on a strange attractor is chaotic with extreme sensitivity to initial conditions. Strange attractors can be analyzed through the invariant measure that describes how often the system visits each part of the attractor.29,30 Related quantities, such as the fractal dimension and the Lyapunov exponents,30 are globally invariant through any smooth change of coordinates and, thus, are natural invariant quantities for chaotic systems. (right) The Lyapunov exponents of the spectral QG model. There are two positive exponents making the system chaotic.

Despite being deterministic, the precise long-term prediction of chaotic systems is impossible due to the exponential growth of errors, as quantified by the system’s Lyapunov spectrum. The Lyapunov spectrum, composed of a system’s Lyapunov exponents (LEs), characterizes a dynamical system29,35 by giving a quantitative measure of how a volume of phase space stretches or shrinks over time.

For the prediction of chaotic systems, we suggest that although short-term predictions will inevitably diverge, long-term predictions of any system must preserve the invariants of motion characterized by the invariant measure μ_B [Eq. (2)]. Furthermore, enforcing such invariants could help to generalize the training of neural networks designed to emulate dissipative chaotic systems, in much the same way that conservation of energy and momentum has for conservative systems. While any function g(u) integrated against the invariant density is a constant—as seen in Eq. (1)—the LEs and the fractal dimension are, by the multiplicative ergodic theorem,36 both invariant under smooth coordinate transformations and have algorithms that make them feasible to compute from observed data.29

In Secs. III–V, we provide a concrete example using the fractal dimension and LEs as invariants that must be enforced when training a neural network. We use an RNN based on the reservoir computer (RC) architecture.37–39 We impose a loss function that takes into account the preservation of the LEs and fractal dimension and detail the benefits of doing so. We stress that the concept is not limited to RC models and can, in fact, apply to any neural network architecture.

An RNN is a network composed of nonlinear elements that are connected in such a way as to enable self-excitation.40 Therefore, given a state of the network r(t−1), the next state
r(t) = F_r\bigl(r(t-1), u(t); \theta\bigr)   (3)
is a function of the input u, the RNN equations F_r, and the internal weights θ. The label over the input data t ∈ Z—conveniently called time—gives the natural order and allows the analysis of the RNN as a dynamical map. r(t) can then be decoded by a function W_out(r(t)) = û. W_out is typically trained so that û is as close to the target output as possible.24 In time series prediction tasks, û ≈ u(t) so that the driven system [Eq. (3)] can become autonomous (with no external input),
r(t) = F_r\bigl(r(t-1), W_{out}(r(t-1)); \theta\bigr)   (4)
and predict the future of the dynamical system.
Reservoir computing37,38,41–44 is a simplified form of RNN for which only the large-scale parameters of the network are varied, with the detailed weights selected from probability distributions. For an RC with tanh(·) nonlinear activation functions at the nodes, the RNN equations become45
r(t) = (1 - \alpha)\, r(t-1) + \alpha \tanh\bigl(A\, r(t-1) + W_{in}\, u(t) + \sigma_b\bigr).   (5)
The elements of the N × N adjacency matrix A are fixed, i.e., not trained—in contrast to other RNN architectures—with only its overall properties chosen, such as the size N, density ρ_A, and spectral radius ρ_SR. W_in ∈ R^{N×D} maps the input into the high dimensional reservoir space, u ∈ R^D ↦ W_in u ∈ R^N; the elements of W_in are chosen from the interval [−σ, σ]. α is the leak rate and is related to the time constant of the RC.39 σ_b is an input bias governing the fixed point of the RC and the strength of the tanh(·) nonlinearity. See Lukosevicius39 for detailed explanations of the architecture and parameter choices.
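
As an illustration of Eqs. (4) and (5), a minimal sketch of the driven reservoir update and of the closed-loop (autonomous) forecast is given below; the matrices A, W_in, and W_out, the leak rate alpha, and the bias sigma_b are assumed to have been constructed and trained as described in the text, and a linear readout is assumed for simplicity.

```python
import numpy as np

def reservoir_step(r, u, A, W_in, alpha, sigma_b):
    """One driven update, Eq. (5): r(t-1), u(t) -> r(t)."""
    return (1.0 - alpha) * r + alpha * np.tanh(A @ r + W_in @ u + sigma_b)

def autonomous_forecast(r, A, W_in, W_out, alpha, sigma_b, n_steps):
    """Closed-loop prediction, Eq. (4): feed the readout back as the input."""
    predictions = []
    for _ in range(n_steps):
        u_hat = W_out @ r                       # decode the reservoir state
        r = reservoir_step(r, u_hat, A, W_in, alpha, sigma_b)
        predictions.append(W_out @ r)
    return np.array(predictions)                # n_steps x D forecast
```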

Training the RC includes training the function W_out—often taken to be a matrix W_out and trained through linear regression—as well as finding the correct parameters N, ρ_A, ρ_SR, σ, σ_b. When predicting, the RC must first achieve synchronization with the input data46 during a “spinup” phase before the prediction can begin. In Refs. 6, 45, and 47, it is shown how to train the RC network through a two-step training procedure that takes into account both the one-step prediction accuracy as well as the long-term forecast skill.
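
The sketch below illustrates the linear (ridge) regression fit of the readout matrix W_out; it assumes the reservoir_step function from the previous sketch, a spinup segment that is discarded, and the regularization coefficient β introduced in the standard training described below. The alignment of states and targets is the simple same-time pairing; other conventions are possible.

```python
import numpy as np

def drive_reservoir(u_train, r0, A, W_in, alpha, sigma_b, n_spinup=100):
    """Drive the reservoir with the D x T training data and record its states."""
    r, states = r0, []
    for u_t in u_train.T:
        r = reservoir_step(r, u_t, A, W_in, alpha, sigma_b)
        states.append(r)
    return np.stack(states[n_spinup:], axis=1)          # N x (T - n_spinup)

def train_readout(r_states, u_targets, beta):
    """Tikhonov-regularized least squares for the D x N readout matrix W_out.

    u_targets should be the training data aligned with r_states,
    e.g. u_train[:, n_spinup:].
    """
    N = r_states.shape[0]
    return u_targets @ r_states.T @ np.linalg.inv(r_states @ r_states.T + beta * np.eye(N))
```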

RC has been shown to be extremely successful in time series prediction tasks. Its simple form allows easy computation of the Jacobian with respect to the internal (hidden) states and other quantities that aid in the dynamical systems analysis of the RNN. Platt et al.46 showed that the RC, when well trained, can reproduce dynamical invariants, such as the LEs and fractal dimension. It was further shown that the reproduction of these quantities maximized the prediction time and ensured the stability of the predictions. The training procedure used in previous works did not enforce the dynamical invariants directly; rather, the results relied on using long-term forecasts as a proxy. Here, we reformulate the RC training to explicitly account for the dynamical invariant quantities.

The standard training of an RC is determined by the training data and the selection of the parameters governing the global properties of the RC: N, ρ_A, ρ_SR, σ, σ_b and a regularization coefficient β. Once these quantities are chosen and the weights instantiated, then W_out is given as
W_{out} = u\, r^{T} \bigl( r\, r^{T} + \beta I \bigr)^{-1},
where u is the D × T matrix of input data, T is the number of time steps, and r is the N × T matrix of reservoir states.45 We can add into the selection of these parameters prior knowledge of the global invariants of the system u and a penalty term for longer-range forecasts. Therefore, we construct a loss function
\mathcal{L} = \frac{1}{M} \sum_{i=1}^{M} \int_{0}^{t_f} \bigl\lVert u^{f}_{i}(t) - u_{i}(t) \bigr\rVert\, dt + \sum_{x} \epsilon_{x}\, \bigl\lvert C_{x}^{RC} - C_{x}^{data} \bigr\rvert   (6)
that can be minimized over the RC parameters and with ϵ_x hyperparameters. The selection of the parameters leads to the matrix W_out based on training data u_train. Platt et al.45 generated a number of long-term forecasts u^f(t) and compared them to the data u; with enough data, this procedure often leads to a model that reproduces the correct dynamical invariants. Without the explicit enforcement of these invariants, however, the model can fail to capture the dynamics—particularly for high dimensional systems and in cases where the number of trajectories M is constrained. Here, we add the dynamical invariants (C_x) as a constraint in order to directly train for generalizability, similar to Beucler et al.49 This scheme is illustrated in Fig. 2. The global optimization routine used to minimize the cost function was the covariance matrix adaptation evolution strategy (CMA-ES).50,51 CMA-ES is a derivative-free global optimization routine involving the iterative update of sampled points through an evolutionary strategy. At each iteration, only the sampled points with the lowest loss are kept, and the distribution of these points is used to update the mean and covariance matrix of the routine from which new samples are drawn. As we are optimizing a non-convex function, the technique is not guaranteed to converge to the global optimum in the given time specified for optimization—indeed, we often see differences in the parameters found by the routine depending on the initial sampling.
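
A minimal sketch of this derivative-free search using the open-source cma package is given below. The loss callable is a stand-in for Eq. (6)—forecast error over validation trajectories plus ϵ_x-weighted invariant mismatches—and the parameter vector, initial guess, and helper names are illustrative assumptions rather than the authors' implementation.

```python
import cma  # pip install cma

def optimize_macro_parameters(loss_fn, x0, sigma0=0.3, max_iter=200):
    """Ask/tell CMA-ES loop over the RC macro-parameters.

    loss_fn : callable mapping a parameter vector, e.g.
              (rho_A, rho_SR, sigma, sigma_b, log10(beta)), to the Eq. (6)-style
              cost obtained after building and training an RC with those values.
    """
    es = cma.CMAEvolutionStrategy(x0, sigma0, {"maxiter": max_iter})
    while not es.stop():
        candidates = es.ask()                                   # sample parameter sets
        es.tell(candidates, [loss_fn(x) for x in candidates])   # keep the fittest
        es.disp()
    return es.result.xbest, es.result.fbest

# Example usage with a toy quadratic loss standing in for the true RC cost:
# best_x, best_cost = optimize_macro_parameters(
#     lambda x: sum(xi ** 2 for xi in x), x0=[0.5, 0.6, 0.1, 1.0, -6.0])
```
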
FIG. 2.

Parameter optimization of a reservoir computer showing the introduction of dynamical invariants. The observed data are split into training, validation, and testing sets with the invariants calculated from the data.29 These quantities can then be incorporated into the loss function to improve the overall training of the RC. A general discussion of the training strategy is found in Platt et al.45 

We use the Lyapunov exponents and the fractal dimension as examples of dynamical invariants in order to demonstrate the technique. With the equations of motion, such as Eq. (4) for the RC, it is quite simple to calculate these quantities using well known and efficient algorithms.30 When training directly from data—without knowledge of the underlying system—we may not know the equations of motion; therefore, these quantities must be estimated. The largest LE can often be approximated from time series data,29,52,53 and the fractal dimension can be calculated using various techniques.29,54 A calculation of the full LE spectrum is more difficult. The use of other dynamical invariants derived from the invariant measure [Eq. (2)] is also possible, for instance, the energy density spectrum of a fluid dynamical system as a function of wavenumber.
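
As one example of how the largest LE can be obtained when a map is available (e.g., the trained RC of Eq. (4) or one step of a numerical integrator), the sketch below uses the classic two-trajectory renormalization (Benettin-style) estimate; the map f, the initial condition, and the time per map application dt are inputs assumed to be supplied by the user.

```python
import numpy as np

def largest_lyapunov_exponent(f, x0, dt, n_steps=20_000, n_transient=1_000, d0=1e-8):
    """Estimate the largest LE of a discrete-time map f by renormalizing a
    small perturbation against a reference trajectory at every step."""
    rng = np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_transient):              # relax onto the attractor first
        x = f(x)
    v = rng.standard_normal(x.shape)
    y = x + d0 * v / np.linalg.norm(v)        # perturbed companion trajectory
    log_growth = 0.0
    for _ in range(n_steps):
        x, y = f(x), f(y)
        d = np.linalg.norm(y - x)
        log_growth += np.log(d / d0)
        y = x + (d0 / d) * (y - x)            # rescale the perturbation back to size d0
    return log_growth / (n_steps * dt)        # average exponential growth rate
```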

Our first test case for the RC is the Lorenz 1996 system (L96),25 a standard testbed for data assimilation applications in numerical weather prediction. L96 describes the evolution of a scalar quantity over a number of sites positioned uniformly over a periodic lattice of constant latitude. The evolution of this scalar quantity is governed by terms representing advection and diffusion,
\frac{dx_i}{dt} = \bigl(x_{i+1} - x_{i-2}\bigr)\, x_{i-1} - x_i + F, \qquad i = 1, \ldots, D, \ \text{with cyclic indices}.   (7)
In this case, we take the number of sites to be D = 10 and forcing F = 8.0 with the purpose of making the system hyperchaotic, with three positive Lyapunov exponents (Fig. 3).
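
A short sketch of generating training data from Eq. (7) with D = 10 and F = 8 follows; the integration length, time step, and tolerances are illustrative choices, not those of the original experiments.

```python
import numpy as np
from scipy.integrate import solve_ivp

def l96_rhs(t, x, F=8.0):
    """Lorenz 1996 tendencies, Eq. (7); np.roll handles the cyclic site index."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

D, F, dt = 10, 8.0, 0.01
rng = np.random.default_rng(1)
x0 = F * np.ones(D) + 0.1 * rng.standard_normal(D)   # perturb the unstable fixed point
t_eval = np.arange(0.0, 200.0, dt)
sol = solve_ivp(l96_rhs, (0.0, t_eval[-1]), x0, t_eval=t_eval, args=(F,),
                rtol=1e-8, atol=1e-8)
u_train = sol.y                                       # D x T array of training data
```
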
FIG. 3.

Lyapunov exponents of the 10D Lorenz 1996 system25 with F = 8.0. There are three positive LEs, a single zero exponent, and the remaining LEs are negative.

The results for C_RC = LEs [Eq. (6)] are shown in Fig. 4. When no global information is given to the RC, it can fail to generalize when presented with an unseen input. Simply providing the largest LE to the RC during training enables the neural network to

  1. generalize to unseen data so that there are good predictions over the entire range of possible initial conditions and

  2. reconstruct the attractor as in Refs. 46 and 55 with the prediction giving the correct ergodic properties of the data even after the prediction necessarily diverges from the ground truth.

In this case, providing the largest LE was sufficient to improve the predictions, with no further gains coming from providing additional LEs. Indeed, a slight decrease in the mean prediction time is observed when providing more LEs. A calculation of the LEs of the reservoir after optimization shows that providing the single largest exponent and the sample trajectories is sufficient for the entire LE spectrum to be reproduced. Providing additional LEs during training, therefore, adds more constraints, but no new information, resulting in a slightly less optimal result.
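
The full spectrum referred to here (and in the caption of Fig. 4) can be computed for any map with a known Jacobian—such as the autonomous RC of Eq. (4)—with the standard QR-decomposition method.56 A minimal sketch follows; the map f, its Jacobian jac, and the time per map application dt are assumed inputs.

```python
import numpy as np

def lyapunov_spectrum(f, jac, x0, dt, n_steps=20_000, n_transient=1_000):
    """QR method: propagate an orthonormal frame with the Jacobian and
    accumulate the log of the diagonal of R at each re-orthonormalization."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_transient):
        x = f(x)
    D = x.size
    Q = np.eye(D)
    log_r = np.zeros(D)
    for _ in range(n_steps):
        Q, R = np.linalg.qr(jac(x) @ Q)
        signs = np.sign(np.diag(R))                 # fix signs so diag(R) > 0
        Q, R = Q * signs, signs[:, None] * R
        log_r += np.log(np.diag(R))
        x = f(x)
    return np.sort(log_r / (n_steps * dt))[::-1]    # exponents, largest first
```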

FIG. 4.

The RC is initialized ten times, providing k LEs and M = 7 long-term forecasts [Eq. (6)]; we report on the average of the distribution of predictions for the ten dimensional Lorenz 1996 system25 [Eq. (7)]. Valid prediction time (VPT), defined in Fig. 6, is the amount of time for which the forecast is “close” to the model truth. There are ten total LEs—three positive, shown in Fig. 3. When zero LEs are provided, the RC has no global information and the prediction time is poor. Providing the largest LE is sufficient for generalizable predictions, and this quantity is quite easily calculated from numerical data.29 The last column shows the alternative result when providing only the fractal dimension (no LEs provided) as another example of an invariant quantity. The size of the RC is N = 400, and the number of test initial conditions is 1000. The forecast time is given as an average over the 1000 trajectories and then scaled by the largest LE to give the prediction in terms of the number of Lyapunov timescales. The LEs were calculated using the QR decomposition method56 from the equations of motion. Results of how the forecast and hyperparameter distributions change as the number of exponents is varied are found in the supplementary material.

The second invariant given is the fractal dimension of the data calculated through the Kaplan–Yorke formulation,57,58
D_{KY} = \alpha + \frac{\sum_{i=1}^{\alpha} \lambda_i}{\lvert \lambda_{\alpha+1} \rvert},   (8)
with λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_D the ordered LEs of the D-dimensional system and α the largest index for which the cumulative sum of the LEs remains nonnegative.29 There are alternate definitions and methods for calculating the fractal dimension.54
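
A small sketch of evaluating Eq. (8) from an ordered spectrum follows; the Lorenz 1963 values in the usage line are approximate, well-known numbers used purely for illustration.

```python
import numpy as np

def kaplan_yorke_dimension(les):
    """Kaplan-Yorke dimension, Eq. (8), from a Lyapunov spectrum."""
    les = np.sort(np.asarray(les, dtype=float))[::-1]   # largest first
    csum = np.cumsum(les)
    if csum[-1] >= 0:                 # sum never crosses zero: not dissipative on average
        return float(len(les))
    k = int(np.argmax(csum < 0))      # number of leading LEs with nonnegative running sum
    if k == 0:                        # even the largest LE is negative
        return 0.0
    return k + csum[k - 1] / abs(les[k])

# Example: the Lorenz 1963 spectrum (~0.906, 0, -14.57) gives D_KY ~ 2.06.
print(kaplan_yorke_dimension([0.906, 0.0, -14.572]))
```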

Providing the fractal dimension to the loss function as an invariant has a similar effect to providing the LEs, raising the mean valid prediction time (VPT) from roughly 2 to 4. The fractal dimension may not, however, be unique to a particular set of data, while the LE spectrum has a much greater chance of constraining the shape of the resulting strange attractor. Therefore, we see that there is not as much improvement in the forecast when providing the fractal dimension compared to the LE spectrum.
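
For reference, a minimal sketch of the VPT metric (defined via the normalized RMSE in the caption of Fig. 6) is given below; the threshold ε = 0.3 follows the text, while the function signature and array shapes are illustrative assumptions.

```python
import numpy as np

def valid_prediction_time(u_forecast, u_truth, sigma, dt, lambda_1, eps=0.3):
    """First time the normalized RMSE exceeds eps, in units of Lyapunov time.

    u_forecast, u_truth : arrays of shape (D, T)
    sigma               : long-term standard deviation of each component, shape (D,)
    lambda_1            : largest Lyapunov exponent of the true system
    """
    err = (u_forecast - u_truth) / sigma[:, None]
    rmse = np.sqrt(np.mean(err ** 2, axis=0))     # RMSE(t) averaged over components
    exceeded = np.flatnonzero(rmse > eps)
    n_valid = exceeded[0] if exceeded.size else rmse.shape[0]
    return n_valid * dt * lambda_1
```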

It is typically possible to calculate an approximation of the largest LE from numerical data29—for a large and complex system, however, the calculation is often difficult and the result may be inaccurate. The robustness of the proposed training technique to errors in provided LEs is tested in Fig. 5, where multiples of the correct exponent are provided to the RC during training. The addition of the LE acts as a guide to the training, not as a hard requirement—see Eq. (6). Errors in the provided exponent up to a factor of 2–3 cause only a slight degradation of performance. This experiment shows that the training method is robust to the numerical errors in the calculation of the largest LE from measured data.

FIG. 5.

The effect of including an incorrect LE in the training for the 10D Lorenz 96 system. The RC dimension is N = 400 with 1 LE and M = 7 trajectories with t_f = 3 provided during training [Eq. (6)]. The peak prediction time occurs when the exact exponent is given, but even large errors of up to a factor of 2–3 cause only minimal degradation to the prediction skill.

For more complex higher-dimensional dynamical systems, the Lyapunov spectrum or the Kaplan–Yorke dimension is quite difficult, if not impossible, to calculate. However, our previous results showed that capturing the leading Lyapunov exponent (LLE) greatly enhanced prediction skill, and even with complex models, this quantity can be estimated more readily either from data52,53 or from a model.56,59 We, therefore, explore the value of representing the LLE in the case of quasi-geostrophic (QG) dynamics,60 which assume that large-scale atmospheric disturbances are governed by the conservation of potential temperature and absolute potential vorticity, while the horizontal velocity is quasi-geostrophic and the pressure quasi-hydrostatic. Numerical models based on the QG approximation were a precursor to the larger scale primitive equation models used for global numerical weather prediction3 and are frequently used in data assimilation studies targeting the atmosphere and ocean.61,62 Here, we consider the two-layer baroclinic model of Charney and Straus63 used to study the planetary-scale motions of a thermally driven atmosphere in the presence of topography. We further incorporate the adaptation of Reinhold and Pierrehumbert,26 which includes an additional wave in the zonal direction, making the system highly baroclinically unstable. We use the implementation of Demaeyer et al.,28 which provides a truncated two-layer QG atmospheric model on a mid-latitude β-plane, frictionally coupled to a mountain and a valley, with a dimension of 20 in the spectral space of the model.

For the atmospheric streamfunctions ψ_a^1/ψ_a^3 at heights 250/750 hPa and the vertical velocity ω = dp/dt, the equations of motion are derived to be
(9)
(10)
with ∇ = x̂ ∂_x + ŷ ∂_y, k′_d the friction coefficient between the layers, k_d the friction coefficient between the atmosphere and the ground, h/H_a the ratio of ground height to the characteristic depth of the atmospheric layer, Δp = 500 hPa the pressure differential between the layers, and J(g_1, g_2) = ∂_x g_1 ∂_y g_2 − ∂_y g_1 ∂_x g_2 the Jacobian. More details are given in Refs. 26 and 28.

After integrating the model forward in time, we ask if an RC model is capable of predicting the dynamics given a significant amount of training data. An example forecast and the distribution of predictions are shown in Fig. 6, where the conventionally trained6,45 RC model successfully predicts the synoptic scale atmospheric dynamics for a number of months. Such significant predictive power on a low resolution QG model is an interesting result in and of itself, showcasing the ability of RNNs to resolve more realistic atmospheric dynamics.

FIG. 6.

Example prediction and the probability distribution function of the valid prediction time (VPT) for 200 initial conditions over the 20 dimensional QG system described in Sec. V B. VPT6,48 is calculated as the time when the root mean square error RMSE(t) = \sqrt{ \frac{1}{D} \sum_{i=1}^{D} \bigl[ \frac{u_i^f(t) - u_i(t)}{\sigma_i} \bigr]^2 } exceeds a certain value ϵ, in this case ϵ = 0.3, approximately in line with Refs. 6, 45, and 48. D is the system dimension, σ is the long-term standard deviation of the time series, and u^f is the RC forecast. The inverse of the largest LE, 1/λ_1 ≈ 12 days, gives the natural timescale for error growth of the system and, thus, can be used as a measurement of the predictive skill. The RC returns excellent prediction times for this low resolution model. The size of the RC is N = 1500, and 100 000 training steps were provided with Δt = 80 min, giving about 30 years of data; we consider this the “data rich” case because extra training data do not have a significant impact on the predictive ability of the network.

When we reduce the amount of data and set M = 1—the limited data case in Eq. (6) with a single forecast as part of the training loss—the RC model loses its predictive power. Adding in the information contained in the LLE, which is 0.01, enables the model to recover a large amount of predictive capability. The calculated LLE of the RC with no provided LEs is 0.14, compared to 0.01 when it is provided. The mismatch between the LEs of the two systems is a clear indication that synchronization between the two is not achieved.46 The impact of the added information provided via the LLE is clear in Fig. 7, where the average VPT has extended from only a few days to multiple weeks.

FIG. 7.

Prediction time in days across 100 predictions for the QG model with an RC trained with and without input of LEs and with t_f = 1 in the optimization routine [Eq. (6)]. The size of the reservoir is N = 500, and 50 000 training steps are provided with Δt = 80 min. Without the provided exponents in this low information regime, unlike in the data rich example in Fig. 6, the RC model is unable to infer the correct dynamical invariants and fails to generalize.

Chaotic dynamical systems are difficult to predict due to their sensitivity to initial conditions.64 Better understanding and accounting for dynamical uncertainties has, however, allowed fields such as numerical weather prediction to provide useful and continually improving forecasts.65 Previous works (e.g., Refs. 13 and 49) proposed that including conserved quantities, such as energy/momentum, may help to improve the application of neural networks to physical systems. However, the introduction of the proposed conserved quantities is not generally applicable to dissipative chaotic dynamical systems. Thus, we instead considered dynamical invariants based on the invariant measure.

We provided a concrete example using quantities derived from the invariant measure, such as the Lyapunov exponents and the fractal dimension, to train a particular RNN architecture called reservoir computing (RC). Previous RC training algorithms used long-term forecasts initialized from many different initial conditions in order to improve generalizability,6,47 essentially imposing these invariant measures by proxy. Here, we imposed the invariant measures as constraints directly in the training algorithm, allowing the RC to generalize with fewer data. Fortunately, we have found that much of the value of this additional constraint is achieved through the use of the leading Lyapunov exponent. While the entire Lyapunov spectrum can be quite difficult to calculate, particularly for large systems, the leading Lyapunov exponent can be estimated by using numerical techniques, such as the breeding method66,67 or other methods described in Refs. 21, 56, and 59. This provides an opportunity for extension of this technique to higher-dimensional systems.

Recent works from Refs. 68–70 have shown promise in producing data-driven surrogate weather models that are competitive by some metrics with conventional operational forecast models. A key property that has not yet been demonstrated with such surrogate models is their ability to reproduce dynamical quantities, such as the LEs, which give an average measure of the response to small errors in the initial conditions. For weather models, in particular, the enforcement of LEs is crucial for the correct operation of data assimilation algorithms.3 Platt et al.46 demonstrated the importance of reconstructing the LE spectrum for producing a skillful deterministic forecast model. Similarly, Penny et al.6 indicated the ability of the RC to reproduce accurate finite-time LEs as a requirement for RC-based ensemble forecasts to produce good estimates of the forecast error covariance, which is the primary tool used in conventional data assimilation methods to project observational data onto unobserved components of the system. This information can then be used to make the RC robust to sparse and noisy measurements—the realistic scenario in operational weather prediction systems, yet one that is rarely taken into account in neural network applications. The introduction of explicit constraints in the training cost function both improves predictions and trains the RNN to reconstruct the correctly shaped attractor.45,55

We provide in this supplementary section a few detailed graphs that give added insight into the workings of the optimization routine. Figure 8 more clearly shows how the additional information in the dynamical invariants enables the routine to find the correct set of parameters. Even with many provided trajectories, it is difficult to locate the correct minimum without the added dynamical information. Figures 9 and 10 show how the found parameters change with the added information. There is a clear optimal spectral radius around 0.2 that the RC without LEs fails to find. Additionally, we show that the no-LE case also fails to reproduce the correct largest LE, a clear sign of suboptimality.46

We express our profound sadness at the passing of Henry D. I. Abarbanel.

J. A. Platt, S. G. Penny, and H. D. I. Abarbanel acknowledge support from the Office of Naval Research (ONR) (Grant Nos. N00014-19-1-2522 and N00014-20-1-2580). S. G. Penny and T. A. Smith acknowledge support from the NOAA (Grant No. NA20OAR4600277). T.-C. Chen acknowledges support from the NOAA Cooperative Agreement with the Cooperative Institute for Research in Environmental Sciences at the University of Colorado Boulder (No. NA17OAR4320101).

The authors have no conflicts to disclose.

Jason A. Platt: Conceptualization (lead); Investigation (lead); Methodology (lead); Writing – original draft (lead). Stephen G. Penny: Conceptualization (equal); Investigation (equal); Project administration (equal); Resources (equal); Software (equal); Writing – review & editing (equal). Timothy A. Smith: Investigation (equal); Software (equal); Writing – review & editing (equal). Tse-Chun Chen: Formal analysis (equal); Investigation (equal); Methodology (equal); Writing – review & editing (equal). Henry D. I. Abarbanel: Investigation (equal); Methodology (equal); Project administration (equal); Writing – review & editing (equal).

The basic RC implementation used in this study is available: https://github.com/japlatt/BasicReservoirComputing.

1. H. Abarbanel, Predicting the Future: Completing Models of Complex Systems (Springer, New York, 2013).
2. S. H. Strogatz, Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry and Engineering (Westview Press, 2000).
3. E. Kalnay, Atmospheric Modeling, Data Assimilation and Predictability (Cambridge University Press, 2002).
4. R. E. Kalman, "A new approach to linear filtering and prediction problems," J. Basic Eng. 82(1), 35–45 (1960).
5. J. Mandel, "A brief tutorial on the ensemble Kalman filter," arXiv:0901.3725 (2009).
6. S. G. Penny, T. A. Smith, T.-C. Chen, J. A. Platt, H.-Y. Lin, M. Goodliff, and H. D. I. Abarbanel, "Integrating recurrent neural networks with data assimilation for scalable data-driven state estimation," J. Adv. Modell. Earth Syst. 14, e2021MS002843 (2022).
7. G. Karniadakis, Y. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang, "Physics-informed machine learning," Nat. Rev. Phys. 3, 422–440 (2021).
8. S. Ha and H. Jeong, "Discovering conservation laws from trajectories via machine learning," arXiv:2102.04008 (2021).
9. F. Alet, D. D. Doblar, A. Zhou, J. B. Tenenbaum, K. Kawaguchi, and C. Finn, "Noether networks: Meta-learning useful conserved quantities," arXiv:2112.03321 (2021).
10. Z. Liu and M. Tegmark, "Machine learning conservation laws from trajectories," Phys. Rev. Lett. 126, 180604 (2021).
11. T. Beucler, M. Pritchard, S. Rasp, J. Ott, P. Baldi, and P. Gentine, "Enforcing analytic constraints in neural networks emulating physical systems," Phys. Rev. Lett. 126, 098302 (2021).
12. Z. Chen, J. Zhang, M. Arjovsky, and L. Bottou, "Symplectic recurrent neural networks," in International Conference on Learning Representations (2020), https://openreview.net/forum?id=BkgYPREtPr.
13. S. Greydanus, M. Dzamba, and J. Yosinski, "Hamiltonian neural networks," in Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Curran Associates, Inc., 2019), Vol. 32.
14. M. Raissi, P. Perdikaris, and G. Karniadakis, "Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations," J. Comput. Phys. 378, 686–707 (2019).
15. L. Yang, X. Meng, and G. E. Karniadakis, "B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy data," J. Comput. Phys. 425, 109913 (2021).
16. A. Azizi and M. Pleimling, "A cautionary tale for machine learning generated configurations in presence of a conserved quantity," Sci. Rep. 11(1), 6395 (2021).
17. N. Doan, W. Polifke, and L. Magri, "Physics-informed echo state networks," J. Comput. Sci. 47, 101237 (2020).
18. N. A. K. Doan, W. Polifke, and L. Magri, "Short- and long-term predictions of chaotic flows and extreme events: A physics-constrained reservoir computing approach," Proc. R. Soc. A 477(2253), 20210135 (2021).
19. A. Racca and L. Magri, "Automatic-differentiated physics-informed echo state network (API-ESN)," in Computational Science—ICCS 2021, edited by M. Paszynski, D. Kranzlmüller, V. V. Krzhizhanovskaya, J. J. Dongarra, and P. M. Sloot (Springer International Publishing, Cham, 2021), pp. 323–329.
20. H. Goldstein, C. Poole, and J. Safko, Classical Mechanics, 3rd ed. (Addison-Wesley, 2001).
21. I. Ispolatov, V. Madhok, S. Allende, and M. Doebeli, "Chaos in high-dimensional dissipative dynamical systems," Sci. Rep. 5, 12506 (2015).
22. L. Zanna and T. Bolton, "Data-driven equation discovery of ocean mesoscale closures," Geophys. Res. Lett. 47(17), e2020GL088376, https://doi.org/10.1029/2020GL088376 (2020).
23. T. Beucler, M. Pritchard, P. Gentine, and S. Rasp, "Towards physically-consistent, data-driven models of convection," in IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium (IEEE, 2020), pp. 3987–3990.
24. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
25. E. N. Lorenz, "Predictability: A problem partly solved," in Predictability of Weather and Climate, edited by T. Palmer and R. Hagedorn (Cambridge University Press, Cambridge, 2006).
26. B. B. Reinhold and R. T. Pierrehumbert, "Dynamics of weather regimes: Quasi-stationary waves and blocking," Mon. Weather Rev. 110(9), 1105–1145 (1982).
27. G. Datseris and U. Parlitz, Nonlinear Dynamics: A Concise Introduction Interlaced with Code, 1st ed. (Springer, 2022).
28. J. Demaeyer, L. De Cruz, and S. Vannitsem, "qgs: A flexible Python framework of reduced-order multiscale climate models," J. Open Source Softw. 5, 2597 (2020).
29. H. D. I. Abarbanel, The Analysis of Observed Chaotic Data (Springer-Verlag, New York, 1996).
30. J. P. Eckmann and D. Ruelle, "Ergodic theory of chaos and strange attractors," Rev. Mod. Phys. 57, 617–656 (1985).
31. J. M. Ottino, S. R. Wiggins, M. R. Bringer, C. J. Gerdts, H. Song, J. D. Tice, and R. F. Ismagilov, "Microfluidic systems for chemical kinetics that rely on chaotic mixing in droplets," Philos. Trans. R. Soc. London, Ser. A 362(1818), 923–935 (2004).
32. R. O. Grigoriev, M. F. Schatz, and V. Sharma, "Chaotic mixing in microdroplets," Lab Chip 6, 1369–1372 (2006).
33. F. T. Arecchi, G. Giacomelli, P. L. Ramazza, and S. Residori, "Experimental evidence of chaotic itinerancy and spatiotemporal chaos in optics," Phys. Rev. Lett. 65, 2531–2534 (1990).
34. X. Zang, S. Iqbal, Y. Zhu, X. Liu, and J. Zhao, "Applications of chaotic dynamics in robotics," Int. J. Adv. Rob. Syst. 13(2), 60 (2016).
35. A. M. Lyapunov, "The general problem of the stability of motion," Int. J. Control 55(3), 531–534 (1992).
36. V. I. Oseledec, "A multiplicative ergodic theorem. Lyapunov characteristic numbers for dynamical systems," Trudy Mosk. Mat. Obsc. 19, 197 (1968).
37. W. Maass, T. Natschläger, and H. Markram, "Real-time computing without stable states: A new framework for neural computation based on perturbations," Neural Comput. 14, 2531–2560 (2002).
38. H. Jaeger, "The 'echo state' approach to analysing and training recurrent neural networks—With an erratum note," GMD Technical Report No. 148 (German National Research Center for Information Technology, Bonn, Germany, 2010), pp. 1–47.
39. M. Lukosevicius, "A practical guide to applying echo state networks," in Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science Vol. 7700, edited by G. Montavon, G. B. Orr, and K. R. Müller (Springer, Berlin, Heidelberg, 2012).
40. J. L. Elman, "Finding structure in time," Cogn. Sci. 14(2), 179–211 (1990).
41. H. Jaeger, "Short term memory in echo state networks," GMD Report No. 152 (German National Research Institute for Computer Science (GMD), 2002).
42. H. Jaeger and H. Haas, "Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication," Science 304, 78–80 (2004).
43. M. Lukosevicius and H. Jaeger, "Reservoir computing approaches to recurrent neural network training," Comput. Sci. Rev. 3, 127–149 (2009).
44. H. Jaeger, "Long short-term memory in echo state networks: Details of a simulation study," Constructor University Technical Reports 27 (2012).
45. J. A. Platt, S. G. Penny, T. A. Smith, T.-C. Chen, and H. D. I. Abarbanel, "A systematic exploration of reservoir computing for forecasting complex spatiotemporal dynamics," arXiv:2201.08910 (2022).
46. J. A. Platt, A. S. Wong, R. Clark, S. G. Penny, and H. D. I. Abarbanel, "Robust forecasting using predictive generalized synchronization in reservoir computing," Chaos 31, 123118 (2021).
47. A. Griffith, A. Pomerance, and D. J. Gauthier, "Forecasting chaotic systems with very low connectivity reservoir computers," Chaos 29(12), 123108 (2019).
48. P. Vlachas, J. Pathak, B. Hunt, T. Sapsis, M. Girvan, E. Ott, and P. Koumoutsakos, "Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics," Neural Netw. 126, 191–217 (2020).
49. T. Beucler, M. Pritchard, S. Rasp, J. Ott, P. Baldi, and P. Gentine, "Enforcing analytic constraints in neural networks emulating physical systems," Phys. Rev. Lett. 126(9), 098302 (2021).
50. N. Hansen, S. D. Müller, and P. Koumoutsakos, "Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES)," Evol. Comput. 11(1), 1–18 (2003).
51. N. Hansen, "The CMA evolution strategy: A tutorial," arXiv:1604.00772 (2023).
52. M. T. Rosenstein, J. J. Collins, and C. J. De Luca, "A practical method for calculating largest Lyapunov exponents from small data sets," Physica D 65(1), 117–134 (1993).
53. H. Kantz, "A robust method to estimate the maximal Lyapunov exponent of a time series," Phys. Lett. A 185(1), 77–87 (1994).
54. J. Theiler, "Estimating fractal dimension," J. Opt. Soc. Am. A 7, 1055–1073 (1990).
55. Z. Lu, B. R. Hunt, and E. Ott, "Attractor reconstruction by machine learning," Chaos 28(6), 061104 (2018).
56. K. Geist, U. Parlitz, and W. Lauterborn, "Comparison of different methods for computing Lyapunov exponents," Prog. Theor. Phys. 83(5), 875–893 (1990).
57. J. L. Kaplan and J. A. Yorke, "Chaotic behavior of multidimensional difference equations," in Functional Differential Equations and Approximation of Fixed Points, edited by H.-O. Peitgen and H.-O. Walther (Springer, Berlin, 1979), pp. 204–227.
58. P. Frederickson, J. L. Kaplan, E. D. Yorke, and J. A. Yorke, "The Liapunov dimension of strange attractors," J. Differ. Equ. 49(2), 185–207 (1983).
59. G. Benettin, L. Galgani, and J.-M. Strelcyn, "Kolmogorov entropy and numerical experiments," Phys. Rev. A 14, 2338–2345 (1976).
60. J. G. Charney, "On the scale of atmospheric motions," Geof. Publ. 17, 1–17 (1948).
61. S.-C. Yang, M. Corazza, A. Carrassi, E. Kalnay, and T. Miyoshi, "Comparison of local ensemble transform Kalman filter, 3DVAR, and 4DVAR in a quasigeostrophic model," Mon. Weather Rev. 137(2), 693–709 (2009).
62. K. Swanson, R. Vautard, and C. Pires, "Four-dimensional variational assimilation and predictability in a quasi-geostrophic model," Tellus A: Dyn. Meteorol. Oceanogr. 50(4), 369–390 (2022).
63. J. G. Charney and D. M. Straus, "Form-drag instability, multiple equilibria and propagating planetary waves in baroclinic, orographically forced, planetary wave systems," J. Atmos. Sci. 37(6), 1157–1176 (1980).
64. E. N. Lorenz, "Deterministic nonperiodic flow," J. Atmos. Sci. 20(2), 130–141 (1963).
65. E. Kalnay, Atmospheric Modeling, Data Assimilation, and Predictability (Cambridge University Press, 2003).
66. Z. Toth and E. Kalnay, "Ensemble forecasting at NMC: The generation of perturbations," Bull. Am. Meteorol. Soc. 74(12), 2317–2330 (1993).
67. Z. Toth and E. Kalnay, "Ensemble forecasting at NCEP and the breeding method," Mon. Weather Rev. 125(12), 3297–3319 (1997).
68. J. Pathak, S. Subramanian, P. Harrington, S. Raja, A. Chattopadhyay, M. Mardani, T. Kurth, D. Hall, Z. Li, K. Azizzadenesheli, P. Hassanzadeh, K. Kashinath, and A. Anandkumar, "FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators," arXiv:2202.11214 (2022).
69. R. Lam, A. Sanchez-Gonzalez, M. Willson, P. Wirnsberger, M. Fortunato, A. Pritzel, S. Ravuri, T. Ewalds, F. Alet, Z. Eaton-Rosen, W. Hu, A. Merose, S. Hoyer, G. Holland, J. Stott, O. Vinyals, S. Mohamed, and P. Battaglia, "GraphCast: Learning skillful medium-range global weather forecasting," arXiv:2212.12794 (2022).
70. K. Bi, L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, "Pangu-Weather: A 3D high-resolution model for fast and accurate global weather forecast," arXiv:2211.02556 (2022).