Forecasting the behavior of high-dimensional dynamical systems using machine learning requires efficient methods to learn the underlying physical model. We demonstrate spatiotemporal chaos prediction using a machine learning architecture that, when combined with a next-generation reservoir computer, displays state-of-the-art performance with a computational time $103$–$104$ times faster for training process and training data set $\u223c102$ times smaller than other machine learning algorithms. We also take advantage of the translational symmetry of the model to further reduce the computational cost and training data, each by a factor of $\u223c$10.

Modeling and predicting high-dimensional dynamical systems, such as spatiotemporal chaotic systems, continues to be a physics grand challenge and require efficient methods to leverage computational resources and efficiently process large amounts of data. In this work, we implement a highly efficient machine learning (ML) parallel scheme for spatiotemporal forecasting where each model unit predicts a single spatial location. This reduces the number of trainable parameters to the minimum number possible, thus speeding up the algorithm and reducing the data set size needed for training. Moreover, when combined with next-generation reservoir computers (NG-RCs), our approach presents state-of-the-art performance with a computational cost and training data dramatically reduced in comparison with other machine learning approaches. We also show that the computational cost and training data set size can be further reduced when the system displays translational symmetry, which is commonly present in spatiotemporal systems with cyclic boundary conditions. Although many real systems do not have such symmetry, our results highlight the importance of symmetry addressing when it is present in the system.

## I. INTRODUCTION

Many nonlinear systems display temporal dynamics that depend on spatial location, such as the heart,^{1} optical devices,^{2} and fluid flow.^{3} These systems may display spatiotemporal chaos, which has finite spatiotemporal correlations, a loss of long-term predictability,^{4} and the appearance of coherent structures.^{3} Also, these systems often display multi-scale behavior, where information and energy flow across scales. Modeling spatiotemporal chaos is difficult for these reasons and continues to be a physics grand challenge.

One approach to this problem is to use machine learning, which may speed up prediction by learning only variables of interest, such as the macroscale behavior^{5–7} or improve the prediction accuracy by fusing model predictions and experimental observations.^{8–13} Some researchers use ML algorithms, such as deep learning,^{6,14} time embedding techniques,^{15–17} or sparse system identifiers,^{18} to learn the underlying ordinary or partial differential equations, but this subsequently requires precise numerical methods for model integration. Another approach is to learn the system *flow*, which allows for one-step-ahead prediction using a coarser spatiotemporal grid, likely leading to faster prediction. The next-generation reservoir computer (NG-RC), for example, excels at this task^{17} and is mathematically equivalent to a traditional reservoir computer (RC) but has an optimal form.^{19}

In typical ML approaches, the spatial variable is discretized at $L$ points with step size $\delta L$, assumed one-dimensional for exposition simplicity, and time is discretized with step size $\delta t$. During supervised training, blocks of data with $Nin$ spatial points and $k$ time steps are fed into the artificial neural network (ANN) used to predict the behavior at $Nout$ locations at one or more temporal steps. Often, $Nin=Nout=L$ and $k\delta t$ is longer than the correlation time,^{6,20} which is problematic because it causes the model to focus on unrelated observations. Also, the ANN is large, which increases the number of trainable parameters and, hence, increases the required computer resources and the training data set size. Recently, a parallel reservoir computing scheme^{21} was introduced with $Nout<Nin<L$ to reduce the computational cost.

Here, our primary contribution is to demonstrate a new algorithm for learning spatiotemporal systems. It makes predictions of the temporal dynamics of the system at a single spatial location based on data drawn from a small spatiotemporal neighborhood of the point. Prediction over all spatial points is realized using parallel machines. Quantitatively, we take $Nin<L$ and $Nout=1$, where $Nin\delta L$ is less than or comparable to the spatial correlation length, and take $k$ so that $k\delta t$ is less than the correlation time. Hence, we use $L$ parallel ANNs for one-step-ahead prediction. We apply our method to a heuristic atmospheric weather model using the NG-RC^{17} as the core learning machine, which reduces the computational complexity and the data set size required for training while displaying state-of-the-art accuracy. Accounting for the translational symmetry of this model further reduces the training computational time and data. Figure 1 illustrates our scheme.

We highlight that addressing symmetries has proven to be important for improving other ML approaches.^{22–26} Furthermore, ML algorithms can reveal hidden symmetries, such as translational invariance present in a simple 1D uniform motion or black hole dynamics,^{27} even if their presence is not obvious.

The rest of this paper is organized as follows. In Sec. II, we formally describe the high-dimensional dynamical system used as the learning system to train our ML approach for spatiotemporal chaos prediction. In Sec. III, we describe our parallel scheme of ML models where each model unit predicts a single discretized spatial location represented by a system variable. In Sec. IV, we introduce the theoretical background of the NG-RC, followed by brief descriptions of the training procedure and the prediction mode. Section V is dedicated to the results. Here, we use our parallel scheme of NG-RCs to forecast high-dimensional chaos from the model equations introduced in Sec. II. We compare the cases where the parallel NG-RCs are trained independently to the case where a translational symmetry is taken into account to improve the performance. Finally, we apply our approach to a lower-dimensional case of a Lorenz96 model system and to an even lower-dimensional case where fine-scale variables are not present. Last, we present a discussion comparing our results to other works and our conclusions in Sec. VI. We also include details on ridge regression parameter optimization and computational complexity in Appendixes A and B.

## II. EXTENDED LORENZ96 MODEL

We demonstrate our approach using numerically generated data from a heuristic atmospheric weather model introduced by Lorenz^{4,28} and extended by Thornes *et al*.^{29} It has an unspecified macroscopic scalar variable $xl$ on a discrete grid (position $l$) representing the observations around a latitude circle. To represent some convective-scale quantity across spatiotemporal scales, this variable is driven by a finer-scale variable $yj,l$, which is coupled to the macroscopic variable as well as the finest-scale variable $zi,j,l$ representing, for example, individual clouds in the atmosphere.

The model is described by a set of coupled differential equations given by

where the indices $l=1,\u2026,L$, $j=1,\u2026,J$, and $i=1,\u2026,I$ are $mod(l,L)$, $mod(j,J)$, and $mod(i,I)$, respectively, to represent cyclic boundary conditions. The terms $Syl=\u2211j=1Jyj,l$ and $Szj,l=\u2211i=1Izi,j,l$ represent the couplings between the different spatiotemporal scales.

Here, $F=20$ is a spatially homogeneous, large-scale forcing term, $h=1$ is the coupling strength between the different spatial scales, and the parameters $b=c=d=e=g=10$ set the magnitude and time scale of the fast variables. With these parameters, there is a factor of 100 difference in spatiotemporal scales from the finest ($z$) to the coarsest ($x$) scale. For future reference, we specify time in model time units (MTU), where 1 MTU corresponds approximately to 5 atmospheric days.^{4} We take $L=36$ to set the number of coherent structures appropriate for the Earth’s weather and $J=10$.^{4} For the fastest variable, we take $I=J$ following previous studies.^{6,7} There are $L\u2217J=360$ fine-scale and $L\u2217J\u2217I=3600$ finest-scale variables, and hence, there are $L[1+J(1+I)]=3996$ total variables.

We focus on learning and predicting only the slow macroscopic variables $xl$ without observing the finer-scale dynamics.^{6,7,30} Because of the fast time scale of $yl,j$ and $zi,j,l$ in comparison with $xl$, $Syl$ acts as a noise-like term in driving $xl$. It is known that many ML algorithms, including an NG-RC,^{17} can learn in the presence of large noise, and hence, we expect that we can make accurate predictions as demonstrated below.

## III. THE PARALLEL ML SCHEME

Our goal is to learn the one-step-ahead dynamics at a single location $xl$ based on using $Nin=(2Nnn+1)$ spatial points and $k$ temporal points, illustrated by the dashed boundary shown in the middle panel of Fig. 1(a), shown for $Nnn=2$, $Nin=5$, and $k=3$, values we use in Sec. V. We seek an ML model that predicts $xl$ at time step $tm+1$ based on these input data. Thus, there are $L$ independent ML models to predict the dynamics at all spatiotemporal points. Our scheme can work with a variety of ML algorithms, but we use an NG-RC because of its proven ability to make accurate predictions with limited data and low computational resources.^{17}

## IV. THE NG-RC

We operate the NG-RC in two modes shown in Fig. 1: training and forecasting. In both cases, we create a feature vector

composed of linear and nonlinear parts, respectively, where $c$ is a constant and $\u2a01$ is the concatenation operator. The linear part is formed by the current and previous $k\u22121$ values of the variable $xl$ and its $Nnn$ nearest neighbors of each side of this location. Hence, the dimension of $Ol,lin$ is $dlin=kNin=15$ for our choice of parameters.

We take the nonlinear part to be the unique second-order monomials of $Ol,lin$, which is appropriate for this problem because Eq. (1) contains only quadratic nonlinear terms. For the unique quadratic monomials, the dimension of $Ol,nonlin$ is $dnonlin=dlin(dlin+1)/2=120$ so that the total feature vector $Ol,total$ has $dtotal=1+dlin+dnonlin=136$ components, which is also equal to the number of trainable parameters for each NG-RC. Minimizing $dtotal$ is one important metric for reducing the computational resources during training. As an aside, we mention that other nonlinear functions can be used in the NG-RC, such as higher-order polynomials or radial basis functions, but are not needed here.

During training, data from the solution to Eq. (1) with $Nin$ spatial points and $M$ temporal points ($ttrain=M\delta t$ training time) is fed into each NG-RC ($L$ total) in an open-loop manner as illustrated in Fig. 1(a). Here, the goal is to have the one-step-ahead prediction $x\xafl(tm+1)$ of the NG-RC equal to the model prediction $xl(tm+1)$, where $tm+1=tm+\delta t$. That is, we seek a solution to

that minimizes $||x\xafl\u2212xl||2+\alpha ||Wl||2$ over all $M$ temporal points. Here, $Wl$ is found using regularized regression with regularization parameter $\alpha $ (see Appendix A for $\alpha $ optimization). For our parameters, $Wl$ is a $(1\xd7dtotal)=(1\xd7136)$ matrix. For future reference, the computational complexity of training a single NG-RC scales as $Mdtotal2$ and, hence, as $LMdtotal2$ for all $L$ NG-RCs.

After finding $Wl$, we switch to a prediction mode, where the training data are no longer input. The output of each NG-RC is sent to the input in a closed-loop manner and used to create $Ol,total(t)$ for each NG-RC as shown in Fig. 1(b). The parallel NG-RCs now form an autonomous spatiotemporal dynamical system. Here, the “warm up” of the NG-RC requires $k$ previous time steps as the initial conditions, which is often available if we immediately switching to the forecasting mode from the training mode.

## V. RESULTS

To test the accuracy and speed of our new algorithm, we generate spatiotemporal data by integrating Eq. (1) using a fourth order Runge–Kutta method with a fixed step size of 0.001 MTU, where we save data at steps of $\delta t=0.01$ MTU. We use initial conditions $x1=F+0.01$, $xl\u22601=F$, and $yj,l=zi,j,l=0$, integrate for ten MTU to dissipate transients, and discard these data. We integrate for an additional 11 000 MTU to generate the data from which we select an interval $ttrain$ ($M=ttrain/\delta t$) to use as a training data set. After integration, we normalize the data to have zero mean and unit standard deviation. After training, we select $k$ consecutive time steps to warm up the NGRCs for the prediction mode and select the following testing time interval to generate the ground-truth test data (data never seen by the NG-RCs during training), an example of which is shown in Fig. 2(a).

### A. Single NG-RC

As a baseline, we first predict the extended Lorenz96 system using a non-parallel architecture formed by a single NG-RC. The model receives all $L$ variables as input and is trained to perform one-step ahead prediction at all spatial locations; i.e., $Nin=Nout=L$. Here, the dimension of $Ol,lin$ is $dlin=kNin=108$ when $k=3$ as used here. Thus, the feature vector $Ol,total$ has $dtotal=5995$ components. The large number of features causes the model to focus on unrelated observations outside of the spatial correlation length, and the single NG-RC fails to predict the spatiotemporal dynamics after a short period. Figure 2(b) shows the NG-RC-predicted spatiotemporal dynamics of the extended Lorenz96 model using a relatively small training data set size $ttrain=10$ MTU ($M=1,000$). As we show in Secs. V D and V E, the performance of this approach can be improved using a much larger large training data set, which helps the model learn the appropriated features and to filter out the uncorrelated ones.

To quantify the prediction quality, we determine the normalized root-mean-square error over all spatial locations given by

The right-hand side of Eq. (4) is already normalized because $xl$ has unit standard deviation.

As shown in Fig. 2(d), the NRMSE for the single NG-RC prediction increases from nearly zero, eventually reaching a saturated value. We define a prediction horizon as the time where NRMSE = 0.3 (vertical dashed line), which is equal to 0.1 MTU for the single NG-RC prediction realization shown in Fig. 2. When averaged over 100 predictions for different initial conditions, the prediction horizon is equal to 0.01 $\xb1$ 0.03 MTU or approximately 0.05 $\xb1$ 0.14 atmospheric days for the extended Lorenz96 model.

### B. *L* independent NG-RCs

In this section, we implement our scheme that uses smaller NG-RCs operating in parallel. We train $L=36$ NG-RCs, each with $dtotal=136$. In terms of computational complexity, training the $L$ parallel NG-RCs is around 50 times less expensive than training the single NG-RC used in Sec. V A, as the NG-RC computational complexity scales as $O(Mdtotal2)$ as we discuss later. Furthermore, we choose $Nin=5$ so that each parallel NG-RC focuses on data from nearby spatial locations within the spatial correlation distance. Figure 3(b) shows the NG-RC-predicted spatiotemporal dynamics of the extended Lorenz96 model using $ttrain=10$ MTU ($M=1000$), the same training data set as in the previous example. The difference between ground-truth and prediction is initially small [Fig. 3(c)] but eventually diverge because the chaotic nature of the system will amplify a small difference between the two. Even though long-term prediction is lost, the predicted behavior has coherent structures that are visually similar to the extended Lorenz96 model. Here, the prediction horizon is equal to 0.92 for the single prediction realization shown in Fig. 3. When averaged over 100 predictions for different initial conditions, the prediction horizon is equal to 0.66 $\xb1$ 0.15 MTU or approximately 3.2 $\xb1$ 0.7 atmospheric days for the extended Lorenz96 model. We comment on the accuracy of our prediction in Sec. VI.

### C. Using translational symmetry

We further decrease the training data set size by taking into account the translational invariance of the extended Lorenz96 model given in Eq. (1), which arises from the cyclic boundary conditions and the spatially independent model parameters. Because of this symmetry, $Wl$ should be independent of $l$ in the asymptotic limit $ttrain\u2192\u221e$. To quantify the independence for the finite $ttrain$ used here, we measure the similarity of the $Wl$’s by defining a correlation coefficient

where $T$ is the transpose operation, $\u22c5$ is the dot product operation, and $C=1$ ($C=0$) indicates (un)correlated matrices. For the case presented in Fig. 3, we find that it is equal to 0.96, indicating that the NG-RC does a reasonable job of discovering this symmetry even with short $ttrain$.

We force translational symmetry by training a single $W$ and use it for all spatial locations $l$. Operationally, we concatenate all $Ol,total$ to create a data structure that has dimension $LMdtotal$, and hence, the training computational complexity scales as $LMdtotal2$, the same as in the previous scheme. This procedure relies on the fact that each spatial location has identical behavior—in a statistical sense—and hence, all the data are produced by the same underlying dynamical flow. Effectively, it increases the training time to $Lttrain$ for an observation time $ttrain$ of the spatiotemporal dynamics. A similar approach has been used recently when using a traditional RC for forecasting spatiotemporal dynamics.^{31}

Accounting for the translational symmetry somewhat improves the prediction horizon for the same training data size, i.e., the same value of $M$, as seen in Fig. 4, which displays a prediction horizon of 1.04 for the same initial condition used in Figs. 2 and 3. The mean prediction horizon for 100 different initial conditions increases by $\u223c$29% to 0.85 $\xb1$ 0.17 MTU.

### D. Prediction accuracy comparison

As seen in Fig. 5, the single NG-RC approach shows a mean prediction horizon near zero (poor prediction ability) for $ttrain$ up to $\u223c60$ MTU where it starts to improve and reaches the value of 0.82 $\xb1$ 0.15 MTU for $ttrain=1,000$ (the maximum $ttrain$ we explore in this work). For the $L$ independent NG-RCs, the prediction horizon is near zero, begins to improve for $ttrain\u22732$, and saturates above $ttrain\u227340$, obtaining a similar value for the single NG-RC approach with a training data set 25 times smaller. This makes the $L$ independent parallel NG-RCs to have a computational complexity $\u22481.4\xd7103$ smaller than the single NG-RC. On the other hand, we obtain reasonable performance when respecting the translational symmetry for the smallest training time shown in the plot, and we obtain nearly the same prediction performance as the other approaches for $ttrain\u22731$. Thus, we see that we can reduce the observation time and computational cost by $\u223c1/L$, which is expected because the effective training time is $L$ times longer.

### E. Lower-dimensional cases for the Lorenz 96 model

Here, we apply our model to lower-dimensional cases of the extended Lorenz96 model and compare our results to previous research that use these simplified models.

#### 1. Parallel NG-RC model for *L* **=** *J* **=** *I* **=** 8

First, we apply our parallel NG-RC approach ($Nin=5,Nout=1$) to the extended Lorenz96 system with $L=J=I=8$ and compare our results to our baseline method (single NG-RC) and to previous works. Figure 6(a) shows the mean prediction horizon as a function of the number of training steps $M$ for these cases. Here, we use training steps $M$ rather than training time $ttrain$ as the horizontal axis for a direct comparison to other works. For our previous results presented above, the conversion is $ttrain=M\delta t$.

When using a single NG-RC (green stars), we obtain a mean prediction horizon of $1.03\xb10.39$ MTU for $M=\u2273104$ ($ttrain\u2273100MTU$ ) where it starts to saturate. Here, the feature vector has $dtotal=325$ components. On the other hand, when using $L$ independent NG-RCs (blue squares), our model obtains a mean prediction horizon of $1.05\xb10.38$ MTU with $M=4000$ corresponding to a training time of $ttrain=40$ MTU. Here, each one of the $L$ parallel NG-RCs is trained individually with $M$ training points from the respective region using $dtotal=136$ features, which results in a computational cost $\u223c2$ times smaller than the single NG-RC. Unsurprisingly, the single NG-RC performs very similar to the $L$ parallel NG-RCs, as the number of spatial variables is low ($L=8$) and the single NG-RC input dimension $Nin=L$ contains fewer unrelated variables than the higher dimensional case shown in Fig. 5. Finally, when using a single $W$ that respects the translation symmetry (orange circles), we obtain a similar result (with no statistical difference) for $M=400$ ($ttrain=4$ MTU), reducing the training time by a factor of $\u223c$10 ($\u223cL$) and a similar reduction in the computational cost. This reduction is expected because data from all $L$ spatial locations are concatenated to form a single training data set in this method with an effective size $L$ times longer than $ttrain=4$ MTU. Figure 7 shows typical predictions for the three cases discussed above.

Chattopadhyay *et al.*^{6} also predict the dynamics of this simplified extended Lorenz96 system using $Nin=Nout=L=8$ using a traditional RC with $dtotal=5000$ nodes (equal to the size of their feature vector) with $M=1\xd7104\u22122\xd7106$ training steps. Their results are shown in Fig. 6(a) (red triangles). Their approach shows an improvement in performance when increasing $M$ to $5\xd7105$, where the mean prediction horizon is similar to our NG-RC approaches. We highlight that our parallel NG-RCs that respect translation symmetry (orange circles) present similar results with a computational cost $\u223c2.1\xd7105$ smaller than their RC trained with $M=5\xd7105$ data points (see Appendix B for more details in computational complexity comparisons).

Intriguingly, the results of Chattopadhyay *et al.* show a step-like improvement when increasing the training steps to $M\u2265106$. They do not know the reason for this step and leave it for future investigations. It is important to notice that they use a single training data set in their work, whereas we average our results over multiple training data sets. One possible hypothesis for their observation is that the step improvement is related to the specific details of their training set.

To explore this possibility, we show the mean prediction horizon of our approaches using the best-performing single training data set in Fig. 6(b). We see that our best-performing training data set shows a similar mean prediction horizon using less training data than the RC trained with $M\u2265106$ data points. Our results could be coincidental but points out one issue that may be responsible for explaining their observation. The one take-away message is that it is important to average over training data sets to remove possible spurious effects on the prediction error due to the specific details of the training data set, which becomes more important for the short set sizes used here.

As another comparison of our work to past reports, Pyle *et al.*^{7} also predict the same low-dimensional Lorenz96 system using a non-parallel ML scheme except that they use an approach similar to the NG-RC with $k=1$ and all monomials up to quartic order. They use a single training data set of $M=5\xd7105$ training points, the same as Chattopadhyay *et al.*, and obtain a mean prediction horizon of $1.60\xb10.53$ MTU. In terms of computational cost, our parallel approaches that do not (do) exploit the symmetry of the model are $2.1\xd7102$ ($2.1\xd7103$) less expensive than their approach.

#### 2. Parallel NG-RC model for the Lorenz96 model without fine-scale variables with *L* **=** 40, *J* **=** *I* **=** 0

Here, we use our approach to predict the dynamics of a simpler Lorenz96 model that does not include the fine spatiotemporal variables $yj,l$ and $zi,j,l$ [$J=I=0$—see Eq. (1)] with $L=40$.

First, we use our single NG-RC baseline model, a single NG-RC with $dtotal=7381$ features. Similar to the predictions for the other Lorenz96 models shown in Secs. V A–V D and V E 1 the single NG-RC only starts to improve its performance for a higher $M$ in comparison with the parallel approaches, as shown in Fig. 8, which displays the mean prediction horizon as a function of the training steps $M$. The maximum mean prediction horizon for the single NG-RC is $7.3\xb11.3$ $\Lambda $ for $M=105$ ($ttrain=1000$ MTU), where $\Lambda =1/\lambda =1/1.68$ MTU is the Lyapunov time with $L=40$ and $F=8$ (parameters used here).^{14}

When using $L$ independently trained NG-RCs (blue squares), the performance begins to improve for $M\u2273200$ ($ttrain\u22732$) and saturates for $M=6000$ ($ttrain=60$), where the mean prediction horizon is $8.0\xb11.7$ $\Lambda $. Each parallel NG-RC has $dtotal=136$ features, which makes the computational complex of this approach $1.2\xd7103$ times smaller than the single NG-RC method.

In comparison, using a single $W$ that respects the translation symmetry (orange circles) results in a mean prediction horizon of $7.7\xb11.7$ $\Lambda $ with $M=100$ (a training time of $ttrain=1$ MTU). That is, we observe a similar prediction horizon using a factor of 60 smaller training data set in comparison with the $L$ independent NG-RCs. Figure 9 shows typical predictions for the three cases.

Vlachas *et al.*^{14} use a parallel RC scheme^{21} to predict the same Lorenz96 model. They use 20 parallel RCs, each with $dtotal=3,000$ nodes, $Nin=10>Nout=2$, and $M=100000$. Their method obtains a mean prediction horizon of approximately $3.3$ $\Lambda $, represented by the red down triangle in Fig. 8. When comparing to our approaches, we find that our $L$ independent NG-RCs model (blue squares) obtains a prediction horizon $2.4$ times longer using a training data set $17$ times smaller and with a computational complexity $\u223c4\xd7103$ times shorter than their result. When considering the translational symmetry (orange circles), our approach obtains the same $2.4$ improvement factor on the prediction horizon, but with a training data set $103$ times smaller and with a computational complexity $\u223c2.4\xd7105$ times shorter than their approach.

Recently, Platt *et al.*^{32} optimized the parallel RC architecture to obtain a mean prediction horizon twice as long as Vlachas *et al.* using $Nin=6>Nout=2$ with smaller RCs (each with $dtotal=720$ nodes) and training data set ($M=40000$). Their result is represented by the purple right triangle in Fig. 8. Our parallel model with independent NG-RCs obtains slightly better performance in the mean prediction horizon with a computational complexity $\u223c102$ shorter using $\u223c7\xd7$ less training data than Platt *et al.* Considering the translational symmetry, our model presents a computational complexity $5.6\xd7103$ smaller using $4\xd7102$ less training data. See Appendix B for more details on the computational complexity comparison. Last, Platt *et al.* also use a single RC with $dtotal=6000$ nodes to predict the same system. While this model presents a worse performance in comparison with the three NG-RC approaches presented here, it surprisingly outperforms the parallel RC model by Vlachas *et al.*, indicating an important improvement due to the optimizations done by Platt *et al*.

## VI. DISCUSSIONS AND CONCLUSIONS

We emphasize that parallel machine learning architectures provide high-efficiency prediction of a high-dimensional spatiotemporal dynamical system. Partitioning the learning system in small subsystems, each of which can be predicted using a smaller ML model unit, provides better predictions, and is less computationally expensive than using a single model to predict the entire system.

In our proposed method, we maximize the parallelism and let each parallel ML model unit predict a single variable of the system. Thus, there are as many parallel units as the number of predicted variables in the system (here the slow variables of the Lorenz96 model). Furthermore, this parallel approach lends itself to parallelization methods using graphical processor units or multi-processor computer clusters.

We show that our parallel scheme composed of independently trained NG-RCs outperforms a non-parallel model composed of a single larger NG-RC. For the higher dimensional Lorenz96 system addressed in this work ($L=36,I=J=10$), the parallelization provides a $\u223c1.4\xd7103$ improvement factor in the computational cost while obtaining similar prediction accuracy using 25 times smaller training data set than the non-parallel architecture. We further decrease both training data and computational cost by another factor of $\u223cL$ by addressing the system translational symmetry, which is sometimes present in spatiotemporal systems with cyclic boundary conditions.

We also predict the dynamics of lower-dimensional Lorenz96-like models to compare our results to previous research. We demonstrate that our method is more accurate or can be training with less data and computational cost or both. For the extended Lorenz96 with $L=I=J=8$, we show that our parallel approach obtains similar results to a traditional RC implemented by Chattopadhyay *et al.*^{6} but with a computational cost up to $\u223c2.1\xd7104$ smaller using up to $1.2\xd7102$ less training data. When considering the translational symmetry, these numbers improve by another factor of 10. Chattopadhyay *et al.* also implement a deep learning network for this problem, but the deep learning method demonstrates no accuracy improvement and requires more computation time.

We also use our approach to predict the Lorenz96 system for $L=40,I=J=0$, where we obtain better results with a training data set up to $103$ times smaller and computational costs up to $2.4\xd7105$ smaller than parallel implementations of traditional RCs implemented by Vlachas *et al.*^{14} We also obtain better results than parallel implementations of traditional RCs that have implemented Platt *et al.*^{32} using up to $4\xd7102$ times less training data and a computational costs up to $5.6\xd7103$ smaller.

The low computational cost, less training data requirement, and fewer optimizable parameters of the NG-RC in comparison with other ML approaches allow us to implement our parallel architecture in a standard desktop computer and obtain sub-second training and prediction times. To give an absolute scale for the computational cost for our approach, we produce the results for this paper using Python 3.7.6, NumPy 1.19.15, and scikit-learn 0.24.2 on an x86-64 CPU running Windows 10. For the results presented in Figs. 3 and 4, the computation time for training all $L=36$ NG-RCs with $M=1000$ data points is $55\xb11$ ms, while the runtime for predicting one time step is $394\xb13$ or $10.9\xb10.1$ $\mu $s per spatial location per step.

Future directions of our research include hybrid approaches using NG-RCs that combine model-generated and experimentally observed data, such as those explored using a traditional RC.^{9–12} Also, our method should generalize to two- and three-dimensional fluid dynamics problems where an even greater reduction in the required data set size is anticipated. Combined with our approach of one-step-ahead prediction, a coarser spatiotemporal grid can be used, offering the possibility of greatly speeding up spatiotemporal simulations.

## ACKNOWLEDGMENTS

We gratefully acknowledge discussions of this work with E. Bollt and the financial support of the Air Force Office of Scientific Research (Contract No. FA9550-20-1-0177).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Wendson A. S. Barbosa:** Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Resources (lead); Software (lead); Visualization (lead); Writing – original draft (equal); Writing – review & editing (equal). **Daniel J. Gauthier:** Conceptualization (equal); Data curation (equal); Formal analysis (equal); Funding acquisition (lead); Investigation (equal); Methodology (equal); Project administration (lead); Supervision (lead); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX A: RIDGE REGRESSION PARAMETER OPTIMIZATION

We train our model using a supervised learning algorithm where the input data $x$ and the desired output $y$ are known in advance for the entire training period. Here, $y$ is evaluated at time $tm+1$, whereas $x$ is evaluated at time $tm$ because we are making a next-step-ahead prediction. We use Ridge regression to find the matrix of weights $W$ that minimizes

where $y\xaf=WOtotal$ is the model output, $Ototal$ is the feature vector obtained from the input data $x$, and $\alpha $ is the ridge parameter. Here, $||\u22c5||$ represents the L2-norm. When $\alpha $ is zero, ridge regression reduces to regular least-square regression. The ridge parameter adds a penalty term to prevent overfitting.

We use a grid-search procedure to find the optimal $\alpha $ that maximizes the mean prediction horizon for each case shown in Sec. V. Figure 10 shows the optimization results for all cases shown in Figs. 5, 6, and 8. For each $\alpha $, we calculate the mean prediction horizon for ten different training sets. For each training set, we perform predictions for 10 different initial conditions, totaling 100 predictions per $\alpha $. We repeat this process for different training times $ttrain$. It is seen that for $L=36,J=I=10$ [Figs. 10(a)–10(c)] and for $L=J=I=8$ [Figs. 10(d)–10(f)], the three NG-RC algorithms are not sensitive to the value of $\alpha $ for higher training times. On the other hand, for $L=40,J=I=0$ [Figs. 10(g)–10(i)], the three algorithms show a trend to have a worse performance for higher values of $\alpha $.

### APPENDIX B: COMPUTATIONAL COMPLEXITY

Here, we provide an estimation of the computational complexity of our parallel NG-RC approach in comparison with the other RC-based approaches mentioned above. The main contribution to the computational complexity for both the NG-RC and the regular RC is performing the ridge regression, which scales as $O(Mdtotal2)$ for $M$ training points and $dtotal$ features. The comparison for the extended Lorenz96 system with ($L=36,J=I=10$), ($L=J=I=8$), and ($L=40,J=I=0$) is shown in Tables I–III, respectively. Note that for the case where we train a single $W$ respecting the translational symmetry (first row of each table), the number of training points is multiplied by the number of spatial locations $L$ to reflect the training data concatenation as discussed in Sec. V C.

. | ML model . | M . | d_{total}
. | N_{in}
. | N_{out}
. | Parallel units trained . | Speed up . |
---|---|---|---|---|---|---|---|

Our approach—Respecting symmetry | NG-RC | 100 × 36 | 136 | 5 | 1 | … | … |

Our approach—L independent NG-RCs | NG-RC | 4000 | 136 | 5 | 1 | 36 | 40 |

Our approach—Single NG-RC | NG-RC | 100 000 | 5995 | 36 | 36 | … | 5.4 × 10^{4} |

. | ML model . | M . | d_{total}
. | N_{in}
. | N_{out}
. | Parallel units trained . | Speed up . |
---|---|---|---|---|---|---|---|

Our approach—Respecting symmetry | NG-RC | 100 × 36 | 136 | 5 | 1 | … | … |

Our approach—L independent NG-RCs | NG-RC | 4000 | 136 | 5 | 1 | 36 | 40 |

Our approach—Single NG-RC | NG-RC | 100 000 | 5995 | 36 | 36 | … | 5.4 × 10^{4} |

. | ML model . | M . | d_{total}
. | N_{in}
. | N_{out}
. | Parallel units trained . | Speed up . |
---|---|---|---|---|---|---|---|

Our approach—Respecting symmetry | NG-RC | 400 × 8 | 136 | 5 | 1 | … | … |

Our approach—L independent NG-RCs | NG-RC | 4000 | 136 | 5 | 1 | 8 | 10 |

Our approach—Single NG-RC | NG-RC | 10 000 | 325 | 8 | 8 | … | 18 |

Chattopadhyay et al.^{6} | RC | 500 000 | 5000 | 8 | 8 | … | 2.1 × 10^{5} |

Pyle et al.^{7} | NG-RC | 500 000 | 495 | 8 | 8 | … | 2.1 × 10^{3} |

. | ML model . | M . | d_{total}
. | N_{in}
. | N_{out}
. | Parallel units trained . | Speed up . |
---|---|---|---|---|---|---|---|

Our approach—Respecting symmetry | NG-RC | 400 × 8 | 136 | 5 | 1 | … | … |

Our approach—L independent NG-RCs | NG-RC | 4000 | 136 | 5 | 1 | 8 | 10 |

Our approach—Single NG-RC | NG-RC | 10 000 | 325 | 8 | 8 | … | 18 |

Chattopadhyay et al.^{6} | RC | 500 000 | 5000 | 8 | 8 | … | 2.1 × 10^{5} |

Pyle et al.^{7} | NG-RC | 500 000 | 495 | 8 | 8 | … | 2.1 × 10^{3} |

. | ML model . | M . | d_{total}
. | N_{in}
. | N_{out}
. | Parallel units trained . | Speed up . |
---|---|---|---|---|---|---|---|

Our approach—Respecting symmetry | NG-RC | 100 × 40 | 136 | 5 | 1 | … | … |

Our approach—L independent NG-RCs | NG-RC | 6000 | 136 | 5 | 1 | 40 | 60 |

Our approach—Single NG-RC | NG-RC | 100 000 | 7381 | 40 | 40 | … | 7 × 10^{4} |

Vlachas et al.^{14} | RC | 100 000 | 3,000 | 10 | 2 | 20 | 2.4 × 10^{5} |

Platt et al.^{32} | RC | 40 000 | 720 | 6 | 2 | 20 | 5.6 × 10^{3} |

. | ML model . | M . | d_{total}
. | N_{in}
. | N_{out}
. | Parallel units trained . | Speed up . |
---|---|---|---|---|---|---|---|

Our approach—Respecting symmetry | NG-RC | 100 × 40 | 136 | 5 | 1 | … | … |

Our approach—L independent NG-RCs | NG-RC | 6000 | 136 | 5 | 1 | 40 | 60 |

Our approach—Single NG-RC | NG-RC | 100 000 | 7381 | 40 | 40 | … | 7 × 10^{4} |

Vlachas et al.^{14} | RC | 100 000 | 3,000 | 10 | 2 | 20 | 2.4 × 10^{5} |

Platt et al.^{32} | RC | 40 000 | 720 | 6 | 2 | 20 | 5.6 × 10^{3} |

In our analysis for the RC complexity, we do not account for the cost of multiplying the node states with the adjacency matrix that represent the network because these RC approaches use neural networks with sparse connectivity (we assume this fact when it is not stated). Also, we do not take into account special function evaluation costs, such as the hyperbolic tangent present in the traditional RC. For the NG-RC, we do not take into account the computational cost of the feature vector creation that happens before training.

## REFERENCES

*Controlling Optical Chaos, Spatio-Temporal Dynamics, and Patterns*, Advances in Atomic, Molecular, and Optical Physics Vol. 54, edited by P. Berman, C. Lin, and E. Arimondo (Academic Press, 2007), pp. 615–697.

*Turbulence, Coherent Structures, Dynamical Systems and Symmetry*, Cambridge Monographs on Mechanics (Cambridge University Press, 1996).

*Seminar on Predictability*(ECMWF, Reading, UK, 1996), Vol. 1, pp. 1–18.