Drawing on ergodic theory, we introduce a novel training method for machine learning based forecasting of chaotic dynamical systems. The training enforces dynamical invariants—such as the Lyapunov exponent spectrum and the fractal dimension—of the systems of interest, enabling longer and more stable forecasts when operating with limited data. The technique is demonstrated in detail using reservoir computing, a specific kind of recurrent neural network. Results are given for the Lorenz 1996 chaotic dynamical system and a spectral quasi-geostrophic model of the atmosphere, both typical test cases for numerical weather prediction.
Numerical weather prediction (NWP) requires vast amounts of computational power to solve global forecasting models. Machine learning (ML) systems—acting as surrogate models for the differential equations—have the promise of lessening the computational burden by using a larger time step than a classical numerical solver or through dedicated hardware acceleration. In order to be integrated into the data assimilation optimization routines of numerical weather prediction, ML models must reproduce the correct intrinsic error growth rates of the chaotic system and other dynamical invariants. We introduce here a training method for recurrent neural networks, a type of ML model that has shown great success in predicting chaotic dynamical systems, that enforces these invariants. By providing information on the physics of the system, we bias the network toward reproducing the correct physical characteristics and enhance its predictive capabilities.
I. INTRODUCTION
Predicting the future trajectory of a dynamical system—a time series whose evolution is governed by a set of differential equations—is crucial in fields such as weather prediction, economics, chemistry, physics, and many others.1,2 A prediction can be generated by deriving the governing equations of motion (EOM) for the system and integrating them forward in time, perhaps with data being used to determine the value of particular constants or the initial conditions. Machine learning (ML), on the other hand, allows the construction of a forecast purely from observational data in lieu of a physical model. When the EOM are expensive to evaluate numerically, ML can be used to construct a surrogate model; such models can be integrated into data assimilation3 algorithms—such as the Kalman filter4,5—a typical use case when data are noisy and the model imperfect, as in numerical weather prediction (NWP).6
The inclusion of physical knowledge—EOM, conservation laws, and dynamical invariants—into ML algorithms has been a topic of ongoing interest.7–15 Enforcing these laws effectively reduces the searchable parameter space for a workable model, decreasing the training time and increasing the accuracy of the resulting models. An ML model trained without knowledge of the underlying physics may fail to generalize and can produce solutions that violate fundamental constraints on the physical system.16 Many of the examples cited above involve conservation of quantities based on the symmetry of the equations of motion, such as conservation of energy and momentum,13 or the inclusion of previously derived differential equations11 as components of the ML training. “Physics informed” neural networks14,15,17–19 add the known or partially known differential equations as a soft constraint in the loss function of the neural network, but conservation laws are not necessarily enforced and the equations must be known.
Many physical dynamical systems of interest are dissipative—e.g., any dynamical system containing friction—meaning that they exchange energy and mass with the surrounding environment.20 High dimensional dissipative systems are very likely to exhibit chaos21—making them extremely sensitive to initial conditions. Enforcing conservation of quantities, such as momentum, mass, or energy,11,13,22,23 for dissipative systems in isolation may not be sufficient for generalization due to the exchange of energy/momentum with the environment. Problems exhibiting chaotic dynamics, such as weather forecasting, have fractal phase space trajectories that make it difficult to specify analytic constraints.11
With the goal of enforcing dynamical invariants, we suggest an alternative cost function for dissipative systems based on ergodicity, rather than symmetry. This has broad implications for time series prediction of dynamical systems. After formulating the invariants, we give a recurrent neural network (RNN)24 example applied to the Lorenz 1996 system25 and quasi-geostrophic dynamics26 where we add soft constraints into the training of the network in order to ensure that these dynamical invariant quantities are reproduced.
II. DERIVING DYNAMICAL INVARIANTS
The long-term trajectories of dissipative chaotic systems will often consist of exotic geometries, such as quasiperiodic orbits and strange attractors.20 The strange attractor is a hypersurface that contains the available subspace for a chaotic dynamical system—see Fig. 1. Deterministic chaotic systems are of importance to a vast array of applications, such as NWP,3 chemical mixing,31,32 optics,33 robotics,34 and many other fields.
Although chaotic systems are deterministic, their precise long-term prediction is impossible due to the exponential growth of errors, as quantified by the system’s Lyapunov spectrum. The Lyapunov spectrum, composed of a system’s Lyapunov exponents (LEs), characterizes a dynamical system29,35 by giving a quantitative measure of how a volume of phase space stretches or shrinks over time.
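For concreteness, the sketch below shows the standard tangent-space QR (Benettin-type) algorithm for estimating the Lyapunov spectrum when the one-step map and its Jacobian are available; the function names and interfaces are illustrative assumptions rather than those of any particular implementation.

```python
import numpy as np

def lyapunov_spectrum(step, jacobian, x0, n_steps, dt):
    """Estimate the Lyapunov spectrum of the discrete-time map x -> step(x).

    step(x)     : advances the state by one time step of length dt
    jacobian(x) : Jacobian of step evaluated at x
    Returns the exponents in units of 1/time, sorted in descending order.
    """
    x = np.asarray(x0, dtype=float)
    d = x.size
    Q = np.eye(d)                 # orthonormal set of tangent vectors
    log_growth = np.zeros(d)      # accumulated log stretching along each direction
    for _ in range(n_steps):
        Q = jacobian(x) @ Q       # push tangent vectors through the linearized map
        Q, R = np.linalg.qr(Q)    # re-orthonormalize to prevent collapse onto the leading direction
        log_growth += np.log(np.abs(np.diag(R)))
        x = step(x)               # advance the reference trajectory
    return np.sort(log_growth / (n_steps * dt))[::-1]
```

The diagonal of R records how much each orthonormal direction is stretched per step; averaging the logarithms of these factors over a long trajectory yields the exponents.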
For the prediction of chaotic systems, we suggest that although short-term predictions will inevitably diverge, long-term predictions of any system must preserve the invariants of motion characterized by the invariant measure [Eq. (2)]. Furthermore, enforcing such invariants could help to generalize the training of neural networks designed to emulate dissipative chaotic systems, in much the same way that conserving energy and momentum has done for conservative systems. While any function integrated against the invariant density is a constant—as seen in Eq. (1)—the LEs and the fractal dimension are, by the multiplicative ergodic theorem,36 invariant under smooth coordinate transformations and have algorithms that make them feasible to compute from observed data.29
In Secs. III–V, we provide a concrete example using the fractal dimension and LEs as invariants that must be enforced when training a neural network. We use an RNN based on the reservoir computer (RC) architecture.37–39 We impose a loss function that takes into account the preservation of the LEs and fractal dimension and detail the benefits of doing so. We stress that the concept is not limited to RC models and can, in fact, apply to any neural network architecture.
III. RECURRENT NEURAL NETWORKS AND RESERVOIR COMPUTING
Training the RC includes training the readout function—often taken to be a matrix and trained through linear regression—as well as finding the correct hyperparameters. When predicting, the RC must first achieve synchronization with the input data46 during a “spinup” phase before the prediction can begin. In Refs. 6, 45, and 47, it is shown how to train the RC network through a two-step training procedure that takes into account both the one-step prediction accuracy and the long-term forecast skill.
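As an illustration of this structure, the following minimal sketch uses a common tanh reservoir update, a spinup phase for synchronization, and a ridge-regression readout; the matrix names (A, W_in, W_out), parameter values, and update form are common choices assumed for illustration, not the exact configuration used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 3                                      # reservoir size, input dimension
A = rng.normal(size=(N, N)) / np.sqrt(N)
A *= 0.6 / np.max(np.abs(np.linalg.eigvals(A)))    # rescale to an illustrative spectral radius
W_in = rng.uniform(-0.1, 0.1, size=(N, D))

def drive(u_seq, r0=None):
    """Run the reservoir on an input sequence (T, D); return the hidden states (T, N)."""
    r = np.zeros(N) if r0 is None else r0
    states = []
    for u in u_seq:
        r = np.tanh(A @ r + W_in @ u)
        states.append(r)
    return np.array(states)

def train_readout(u_train, spinup=200, beta=1e-6):
    """Discard a spinup transient, then fit the linear readout by ridge regression."""
    R = drive(u_train)[spinup:-1]                  # reservoir states r_t
    Y = u_train[spinup + 1:]                       # one-step-ahead targets u_{t+1}
    W_out = np.linalg.solve(R.T @ R + beta * np.eye(N), R.T @ Y).T
    return W_out

def predict(W_out, u_spinup, n_steps):
    """Synchronize on u_spinup, then run the reservoir autonomously in closed loop."""
    r = drive(u_spinup)[-1]
    preds = []
    for _ in range(n_steps):
        u = W_out @ r
        preds.append(u)
        r = np.tanh(A @ r + W_in @ u)
    return np.array(preds)
```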
RC has been shown to be extremely successful in time series prediction tasks. Its simple form allows the easy computation of the Jacobian with respect to the internal (hidden) states and of other quantities that aid in the dynamical systems analysis of the RNN. Platt et al.46 showed that the RC, when well trained, can reproduce dynamical invariants, such as the LEs and fractal dimension. It was further shown that the reproduction of these quantities maximized the prediction time and ensured the stability of the predictions. The training procedure used in previous works did not enforce the dynamical invariants directly; rather, the results relied on using long-term forecasts as a proxy. Here, we reformulate the RC training to explicitly account for the dynamical invariant quantities.
IV. ENFORCING INVARIANTS
We use the Lyapunov exponents and the fractal dimension as examples of dynamical invariants in order to demonstrate the technique. With the equations of motion, such as Eq. (4) for the RC, it is quite simple to calculate these quantities using well known and efficient algorithms.30 When training directly from data—without knowledge of the underlying system—we may not know the equations of motion; therefore, these quantities must be estimated. The largest LE can often be approximated from time series data,29,52,53 and the fractal dimension can be calculated using various techniques.29,54 A calculation of the full LE spectrum is more difficult. The use of other dynamical invariants derived from the invariant measure [Eq. (2)] is also possible, for instance, the energy density spectrum of a fluid dynamical system as a function of wavenumber.
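A schematic form of such a training loss, combining short-term forecast error with soft penalties on the LE spectrum and on a fractal dimension (here the Kaplan-Yorke dimension as one common choice), is sketched below; the penalty weights and helper functions are illustrative assumptions and not the precise form of Eq. (6).

```python
import numpy as np

def kaplan_yorke_dimension(les):
    """Kaplan-Yorke (Lyapunov) dimension computed from a Lyapunov spectrum."""
    les = np.sort(np.asarray(les, dtype=float))[::-1]
    csum = np.cumsum(les)
    if not np.any(csum >= 0):
        return 0.0
    k = int(np.max(np.where(csum >= 0)[0]))   # largest index with non-negative partial sum
    if k == les.size - 1:
        return float(les.size)
    return (k + 1) + csum[k] / abs(les[k + 1])

def invariant_loss(forecast, truth, le_model, le_target, w_le=1.0, w_dim=1.0):
    """Short-term forecast error plus soft penalties on the LEs and on the
    fractal dimension implied by the spectrum (illustrative weighting)."""
    le_model, le_target = np.asarray(le_model), np.asarray(le_target)
    loss = np.mean((forecast - truth) ** 2)                       # forecast skill term
    loss += w_le * np.mean((le_model - le_target) ** 2)           # LE mismatch penalty
    loss += w_dim * (kaplan_yorke_dimension(le_model)
                     - kaplan_yorke_dimension(le_target)) ** 2    # dimension mismatch penalty
    return loss
```

Because the penalties are soft, an inexact target (e.g., only the largest LE estimated from data) still guides the optimization without acting as a hard constraint.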
V. RESULTS
A. Lorenz 1996
The results for the training loss of Eq. (6) are shown in Fig. 4. When no global information is given to the RC, it can fail to generalize when presented with an unseen input. Simply providing the largest LE to the RC during training enables the neural networks to
(1) generalize to unseen data, so that there are good predictions over the entire range of possible initial conditions, and
(2) reconstruct the attractor as in Refs. 46 and 55, with the prediction giving the correct ergodic properties of the data even after the prediction necessarily diverges from the ground truth.
Providing the fractal dimension to the loss function as an invariant has a similar effect to providing the LEs, raising the mean valid prediction time (VPT) relative to the unconstrained case. The fractal dimension may not, however, be unique to a particular set of data, while the LE spectrum has a much greater chance of constraining the shape of the resulting strange attractor. We therefore see less improvement in the forecast when providing the fractal dimension than when providing the LE spectrum.
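Since the fractal dimension used as a constraint must typically be estimated from data (Sec. IV), one common estimator is the Grassberger-Procaccia correlation dimension. A minimal sketch follows, with illustrative radii and no careful treatment of the scaling region, which would be needed in practice.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, radii):
    """Slope of log C(r) vs log r, where C(r) is the fraction of point pairs
    in X (shape (T, D)) separated by less than r."""
    radii = np.asarray(radii, dtype=float)
    dists = pdist(np.asarray(X, dtype=float))           # all pairwise distances
    C = np.array([np.mean(dists < r) for r in radii])   # correlation sum C(r)
    mask = C > 0                                         # avoid log(0) at very small r
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), 1)
    return slope
```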
It is typically possible to calculate an approximation of the largest LE from numerical data29—for a large and complex system, however, the calculation is often difficult and the result may be inaccurate. The robustness of the proposed training technique to errors in provided LEs is tested in Fig. 5, where multiples of the correct exponent are provided to the RC during training. The addition of the LE acts as a guide to the training, not as a hard requirement—see Eq. (6). Errors in the provided exponent up to a factor of 2–3 cause only a slight degradation of performance. This experiment shows that the training method is robust to the numerical errors in the calculation of the largest LE from measured data.
B. Synoptic scale atmospheric model
For more complex higher-dimensional dynamical systems, the Lyapunov spectrum or the Kaplan–Yorke dimension is quite difficult, if not impossible, to calculate. However, our previous results showed that capturing the leading Lyapunov exponent (LLE) greatly enhanced prediction skill, and even for complex models, this quantity can be estimated more readily either from data52,53 or from a model.56,59 We, therefore, explore the value of representing the LLE in the case of quasi-geostrophic (QG) dynamics,60 which assume that large-scale atmospheric disturbances are governed by the conservation of potential temperature and absolute potential vorticity, while the horizontal velocity is quasi-geostrophic and the pressure quasi-hydrostatic. Numerical models based on the QG approximation were a precursor to the larger scale primitive equation models used for global numerical weather prediction3 and are frequently used in data assimilation studies targeting the atmosphere and ocean.61,62 Here, we consider the two-layer baroclinic model of Charney and Strauss63 used to study the planetary-scale motions of a thermally driven atmosphere in the presence of topography. We further incorporate the adaptation of Reinhold and Pierrehumbert26 to include an additional wave in the zonal direction, making the flow highly baroclinically unstable. We use the implementation of Demaeyer et al.,28 which provides a truncated two-layer QG atmospheric model on a mid-latitude β-plane, frictionally coupled to a mountain and a valley, with a dimension of 20 in the spectral space of the model.
After integrating the model forward in time, we ask if an RC model is capable of predicting the dynamics given a significant amount of training data. An example forecast and distribution of predictions are shown in Fig. 6, where the conventionally trained6,45 RC model successfully predicts the synoptic scale atmospheric dynamics for a number of months. Such significant predictive power on a low resolution QG model is an interesting result in and of itself, showcasing the ability of RNNs to resolve more realistic atmospheric dynamics.
When we reduce the amount of data—the limited data case in Eq. (6) with a single forecast as part of the training loss—the RC model loses its predictive power. Adding in the information contained in the LLE enables the model to recover a large amount of predictive capability. The calculated LLE of the RC trained with no provided LEs is 0.14, which does not match that of the true system; this mismatch between the LEs of the two systems is a clear indication that synchronization between the two is not achieved.46 The impact of the added information provided via the LLE is clear in Fig. 7, where the average VPT has extended from only a few days to multiple weeks.
VI. DISCUSSION AND CONCLUSION
Chaotic dynamical systems are difficult to predict due to their sensitivity to initial conditions.64 Better understanding and accounting for dynamical uncertainties has, however, allowed fields such as numerical weather prediction to provide useful and continually improving forecasts.65 Previous works (e.g., Refs. 13 and 49) proposed that including conserved quantities, such as energy/momentum, may help to improve the application of neural networks to physical systems. However, the introduction of such conserved quantities is not generally applicable to dissipative chaotic dynamical systems. Thus, we instead considered dynamical invariants based on the invariant measure.
We provided a concrete example using quantities derived from the invariant measure, such as the Lyapunov exponents and the fractal dimension, to train a particular RNN architecture called reservoir computing (RC). Previous RC training algorithms used long-term forecasts initialized from many different initial conditions in order to improve generalizability,6,47 essentially imposing these invariant measures by proxy. Here, we imposed the invariant measures as constraints directly in the training algorithm, allowing the RC to generalize with fewer data. Fortunately, we have found that much of the value of this additional constraint is achieved through the use of the leading Lyapunov exponent. While the entire Lyapunov spectrum can be quite difficult to calculate, particularly for large systems, the leading Lyapunov exponent can be estimated by using numerical techniques, such as the breeding method66,67 or other methods described in Refs. 21, 56, and 59. This provides an opportunity for extension of this technique to higher-dimensional systems.
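As a rough illustration of how a breeding-type estimate of the leading exponent works, the sketch below repeatedly rescales the difference between a control run and a perturbed run and averages the logarithmic growth; the rescaling interval, perturbation amplitude, and interface are illustrative choices under the assumption of a callable step(x) that advances the model by dt.

```python
import numpy as np

def breeding_lle(step, x0, dt, n_cycles=2000, rescale_every=10, eps=1e-6, seed=0):
    """Estimate the leading Lyapunov exponent by repeatedly rescaling the
    difference between a control run and a perturbed ('bred') run."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    xp = x + eps * rng.normal(size=x.shape)      # initial small perturbation
    total_log_growth = 0.0
    for _ in range(n_cycles):
        for _ in range(rescale_every):           # let the perturbation grow freely
            x, xp = step(x), step(xp)
        d = np.linalg.norm(xp - x)
        total_log_growth += np.log(d / eps)
        xp = x + (eps / d) * (xp - x)            # rescale the perturbation back to amplitude eps
    return total_log_growth / (n_cycles * rescale_every * dt)
```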
Recent works from Refs. 68–70 have shown promise in producing data-driven surrogate weather models that are competitive by some metrics with conventional operational forecast models. A key property that has not yet been demonstrated with such surrogate models is their ability to reproduce dynamical quantities, such as the LEs, which indicate an average measure of the response to small errors in the initial conditions. For weather models, in particular, the enforcement of LEs is crucial for the correct operation of data assimilation algorithms.3 Platt et al.46 demonstrated the importance of reconstructing the LE spectrum for producing a skillful deterministic forecast model. Similarly, Penny et al.6 indicated the ability of the RC to reproduce accurate finite-time LEs as a requirement for RC-based ensemble forecasts to produce good estimates of the forecast error covariance, which is the primary tool used in conventional data assimilation methods to project observational data onto unobserved components of the system. This information can then be used to make the RC robust to sparse and noisy measurements. While such sparse and noisy observations are the more realistic scenario in online weather prediction systems, they are rarely taken into account in neural network applications. The introduction of explicit constraints in the training cost function both improves predictions and trains the RNN to reconstruct the correctly shaped attractor.45,55
SUPPLEMENTARY MATERIAL
We provide in this supplementary section a few detailed graphs that give added insight into the workings of the optimization routine. Figure 8 shows more clearly how the additional information in the dynamical invariants enables the routine to find the correct set of parameters. Even with many provided trajectories, it is difficult to locate the correct minimum without added dynamical information. Figures 9 and 10 show how the found parameters change with the added information. There is a clear optimal spectral radius of ≈0.2 that the RC without LEs fails to find. Additionally, we show that the no-LE case also fails to reproduce the correct largest LE, a clear sign of suboptimality.46
DEDICATION
We express our profound sadness at the passing of Henry D. I. Abarbanel.
ACKNOWLEDGMENTS
J. A. Platt, S. G. Penny, and H. D. I. Abarbanel acknowledge support from the Office of Naval Research (ONR) (Grant Nos. N00014-19-1-2522 and N00014-20-1-2580). S. G. Penny and T. A. Smith acknowledge support from the NOAA (Grant No. NA20OAR4600277). T.-C. Chen acknowledges support from the NOAA Cooperative Agreement with the Cooperative Institute for Research in Environmental Sciences at the University of Colorado Boulder (No. NA17OAR4320101).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Jason A. Platt: Conceptualization (lead); Investigation (lead); Methodology (lead); Writing – original draft (lead). Stephen G. Penny: Conceptualization (equal); Investigation (equal); Project administration (equal); Resources (equal); Software (equal); Writing – review & editing (equal). Timothy A. Smith: Investigation (equal); Software (equal); Writing – review & editing (equal). Tse-Chun Chen: Formal analysis (equal); Investigation (equal); Methodology (equal); Writing – review & editing (equal). Henry D. I. Abarbanel: Investigation (equal); Methodology (equal); Project administration (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The basic RC implementation used in this study is available at https://github.com/japlatt/BasicReservoirComputing.