Dissipative partial differential equations that exhibit chaotic dynamics tend to evolve to attractors that exist on finite-dimensional manifolds. We present a data-driven reduced-order modeling method that capitalizes on this fact by finding a coordinate representation for this manifold and then a system of ordinary differential equations (ODEs) describing the dynamics in this coordinate system. The manifold coordinates are discovered using an undercomplete autoencoder—a neural network (NN) that reduces and then expands dimension. Then, the ODE, in these coordinates, is determined by a NN using the neural ODE framework. Both of these steps only require snapshots of data to learn a model, and the data can be widely and/or unevenly spaced. Time-derivative information is not needed. We apply this framework to the Kuramoto–Sivashinsky equation for domain sizes that exhibit chaotic dynamics with estimated manifold dimensions ranging from 8 to 28. With this system, we find that dimension reduction improves performance relative to predictions in the ambient space, where artifacts arise. Then, with the low-dimensional model, we vary the training data spacing and find excellent short- and long-time statistical recreation of the true dynamics for widely spaced data (spacing of 0.7 Lyapunov times). We end by comparing performance with various degrees of dimension reduction and find a “sweet spot” in terms of performance vs dimension.

Data-driven modeling of chaotic dissipative dynamical systems has garnered substantial attention due to the rise in a variety of machine learning tools. Here, we describe how to take advantage of the low-dimensional manifold that the data lie on in conjunction with a neural-network approach to ordinary differential equations (ODEs) for reduced-order modeling of systems with chaotic dynamics. Our approach is to first learn an invariant manifold from data and then learn an ODE, using the so-called neural ODE framework, for the dynamics on the manifold. Distinctive features of our approach are that it seeks a minimal-dimensional state representation, and it maintains the Markovian character of the true dynamics. With this method, we recreate the chaotic dynamics of the Kuramoto–Sivashinsky equation using training data that are widely spaced in time.

A common question in many applications is as follows: given a time series of measurements on a system, how can a predictive model be generated to estimate future states of the system? For problems, such as weather forecasting, just knowing the future state of the system is useful. In other problems, such as minimizing turbulent drag on an aircraft, models are desirable for creating control policies. These models can sometimes be generated from first principles, but insufficient information on the system often limits the ability to write them explicitly. Even when a model can be written out explicitly, it may be very high-dimensional and computationally expensive to simulate. Thus, it is often desirable to generate a low-dimensional model from data.

We consider data sets {u(t1), u(t2), …, u(tN)}, where u(ti) ∈ Rd comes from measuring the state of a system at a given time ti. The full state at a given time can be represented by direct measurements (e.g., position and velocity of a pendulum) or by a representation that is diffeomorphic to the state. One such representation is time delay measurements of the system, as shown by Takens.1 In the case of the pendulum, for example, this could correspond to writing the state as u(t) = [θ(t), θ(t − τ), θ(t − 2τ), θ(t − 3τ)], where θ is the angle of the pendulum and τ is the delay time. If u lies on a d-manifold, then a time delay representation in R2d+1 is diffeomorphic to u.1

For many systems of interest, the state space is high-dimensional and the dynamics are chaotic. Despite this complex behavior, when these systems are dissipative, the long-time dynamics often lie on a smooth invariant finite-dimensional inertial manifold M ⊂ Rd of a much lower dimension (dM) than that of the state space.2 Two such systems are the Kuramoto–Sivashinsky equation (KSE)3–6 and the complex Ginzburg–Landau equation.7 Similarly, for the Navier–Stokes equations in two and three spatial dimensions, it has been shown that there is an approximate inertial manifold.8,9 It is approximate in the sense that there are small variations in many dimensions that can be assumed to be zero while still providing an accurate approximation of the state.

With a mapping to the manifold coordinate system, the state can be represented (at least locally) in the manifold coordinates h(t) ∈ RdM. That is, mappings χ and χˇ exist such that h = χ(u) and u = χˇ(h). In the machine learning literature, χ and χˇ correspond to the encoder and decoder, respectively, of a so-called undercomplete autoencoder structure,10 as we further discuss below. It should be noted that there is no guarantee that M can be globally represented with a Cartesian representation in dM dimensions. Indeed, in general, this cannot be done, and an “atlas” of overlapping local representations, or “charts,” must be used.11 The application considered here will not require this more general formalism, but for related work using charts, see Ref. 12. Our aim here is to use data from a spatiotemporally chaotic system to learn the mappings χ(u) and χˇ(h) back and forth between Rd and a coordinate system on M and then to learn the evolution equation for the dynamics in this coordinate system. This will be a minimal-dimensional representation for the dynamics of the system in question. We first introduce some background for the dimension reduction problem and then for the time evolution problem.

Dimension reduction is a challenging and widely studied problem, and many approaches have been considered. Frequently, a linear dimension reduction technique is used (i.e., M is taken to exist in a linear subspace of Rd). This choice can be rationalized by invocation of Whitney’s theorem:13 any smooth dM-manifold can be embedded into R2dM. Sauer et al. later refined this result by showing that this embedding can be performed by almost every smooth map from a dM-manifold to Rn for n > 2db,14 where db is the box-counting dimension of the attractor that lies in the manifold M. The box-counting dimension is one of a family of fractal dimensions that can be computed for a given attractor15 and must be less than or equal to the manifold dimension. Thus, for almost every smooth map, including linear ones, h ∈ R2dM+1 contains the same information as u (i.e., a map exists for reconstructing u from h).

Cunningham and Ghahramani16 give an overview of linear dimension reduction techniques from a machine learning perspective. Many of these collapse to one of the most common methods—principal component analysis (PCA). In PCA, the linear transformation that minimizes the reconstruction error, or equivalently maximizes the variance captured from the data, is found. This transformation comes from projecting data onto the leading left singular vectors U ∈ Rd×dh of the centered snapshot matrix X = [u(t0) − ⟨u⟩, …, u(tN) − ⟨u⟩] (here, ⟨·⟩ denotes the ensemble average). This is equivalent to finding the eigenvectors of the covariance matrix XXT. Similarly, the eigenvectors of XTX can be found and related to the eigenvectors of XXT through the singular value decomposition; this approach is sometimes known as classical scaling.17 The projection onto these modes is u~ = UUTu, where the tilde denotes the approximation of the state u. In this projection, h = UTu is the low-dimensional representation. Although this projection is the optimal linear transformation for minimizing the reconstruction error, h may still require as many as 2dM + 1 dimensions to contain the same information as u.
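As a concrete illustration of the PCA projection described above, the following minimal numpy sketch builds h = UTu from the SVD of the centered snapshot matrix; the random snapshot array and the choice dh = 8 are placeholders, not the KSE data.

```python
import numpy as np

# Minimal sketch of PCA dimension reduction via the SVD of the centered
# snapshot matrix; `snapshots` (N x d) and d_h are illustrative placeholders.
rng = np.random.default_rng(0)
snapshots = rng.standard_normal((1000, 64))   # stand-in for data u(t_i) in R^d
d_h = 8

u_mean = snapshots.mean(axis=0)
X = (snapshots - u_mean).T                    # d x N centered snapshot matrix
U, s, _ = np.linalg.svd(X, full_matrices=False)
U_dh = U[:, :d_h]                             # leading left singular vectors

h = (snapshots - u_mean) @ U_dh               # low-dimensional representation h = U^T u
u_tilde = h @ U_dh.T + u_mean                 # rank-d_h reconstruction
print("reconstruction MSE:", np.mean((u_tilde - snapshots) ** 2))
```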

Finding a minimal representation in general requires a nonlinear transformation. Many different techniques exist for nonlinear dimension reduction, often with the goal of preserving some property of the original dataset in the lower dimension. Some popular methods include kernel PCA, diffusion maps, local linear embedding (LLE), isometric feature mapping (Isomap), and t-distributed stochastic neighbor embedding (tSNE).17 In kernel PCA, the matrix XTX is replaced with a kernel matrix K(ui, uj). This can be viewed as an application of the “kernel trick” to the covariance matrix between data that have been mapped into a higher-dimensional (often infinite-dimensional) “feature space.”17 The low-dimensional representation is then computed using the eigenvectors such that h = [∑i vi,1K(ui, u), …, ∑i vi,dhK(ui, u)], where vi,k is the ith element of the kth eigenvector of K(ui, uj).

Similarly, diffusion maps, LLE, and Isomap can be viewed as extensions of kernel PCA with special kernels.17 In diffusion maps, a Gaussian kernel is applied to the data, giving the matrix Aij = exp(−||ui − uj||/2ϵ), where ϵ is a tuning parameter that represents the local connectivity.18 Then, the dimension reduction is performed by computing the eigenvalue decomposition of this matrix (normalized so columns add to one), giving hi = [vi,2, …, vi,dh+1]. The first eigenvector is the trivial all-ones vector.18 In LLE, a linear combination of the k nearest neighbors of a data point is used to minimize ∑i ||ui − ∑j Wi,juj||2, where Wi,j = 0 if uj is not a neighbor of ui. Then, the low-dimensional representation h is calculated to minimize ∑i ||hi − ∑j Wi,jhj||2 using the W calculated in the first step.19 The solution that minimizes this cost function can be found by computing the eigenvectors of (I − W)T(I − W) and letting hi = [vi,1, …, vi,dh]. For Isomap, the kernel matrix comes from computing the double-centered geodesic distances,20,21 K(ui, uj) = −(1/2)HD(ui, uj)2H. Here, D(ui, uj) is the geodesic distance, which is computed in two steps. First, a graph of the k nearest neighbors between points is constructed, weighted by their Euclidean distances, and then the distance between points is computed as the shortest pathway between any two points along the graph. After computing D, the matrix is centered with H = I − (1/N)11T (1 = [1, …, 1]T), and the eigenvalue decomposition of K gives hi = [√λ1 vi,1, …, √λdh vi,dh]. Isomap was used to reduce the dimension of many dynamical systems in Ref. 22.
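A minimal sketch of the diffusion-map construction just described is given below; the synthetic data, the value of ϵ, and dh are placeholders, we normalize rows rather than columns (the two conventions are transposes of each other), and the dense eigendecomposition limits this to modest data set sizes.

```python
import numpy as np

# Sketch of a diffusion map: Gaussian kernel, Markov normalization, and an
# eigendecomposition whose leading nontrivial eigenvectors give h.
rng = np.random.default_rng(1)
U_data = rng.standard_normal((500, 64))        # rows are states u_i (placeholder data)
d_h, eps = 2, 10.0

dists = np.linalg.norm(U_data[:, None, :] - U_data[None, :, :], axis=-1)
A = np.exp(-dists / (2.0 * eps))               # Gaussian kernel matrix
P = A / A.sum(axis=1, keepdims=True)           # normalize so rows sum to one

evals, evecs = np.linalg.eig(P)
order = np.argsort(-evals.real)
evecs = evecs[:, order].real
h = evecs[:, 1:d_h + 1]                        # skip the trivial constant eigenvector
```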

The last algorithm we mention here is tSNE.23 Unlike the previous methods, which use eigenvalue decompositions to solve the optimization problem, tSNE uses gradient descent. In tSNE, the objective is to match the distribution of the data in the high-dimensional space to that of the data in the low-dimensional space. This is done by finding h that minimizes the Kullback–Leibler (KL) divergence ∑i≠j pij log(pij/qij), where p and q are the distributions of the high- and low-dimensional data, respectively. In tSNE, these distributions are approximated by pij = exp(−||ui − uj||2/2σ2)/∑k≠l exp(−||uk − ul||2/2σ2) and qij = (1 + ||hi − hj||2)^−1/∑k≠l(1 + ||hk − hl||2)^−1. There is no guarantee of finding a global minimum or of finding the same solution every time. tSNE is frequently used for visualizing high-dimensional data in two or three dimensions because it often separates complex data sets into visually distinct clusters.

A few major drawbacks of these nonlinear dimension reduction techniques are that they do not provide the function χˇ that reconstructs u from h, they do not provide a method for incorporating out-of-sample data, and they do not scale well to large amounts of data. Except for tSNE, the Nyström extension can be used to resolve the out-of-sample problem for these methods.24 However, using the Nyström extension results in different functions for χ depending on whether the data are in-sample or out-of-sample.

Because of these limitations, instead of using one of the above techniques, we tackle the dimension reduction problem directly by approximating the mappings to the manifold coordinates χ and back χˇ as NNs; i.e., we use an undercomplete autoencoder.25,26 We describe autoencoders in more detail in Sec. II.

We now turn to developing, from data, dynamical models for the evolution of a state u(t). We consider systems that display deterministic, Markovian dynamics; therefore, if u(t) is the full state (either directly or diffeomorphically through embedding of a sufficient number of time delays), then the dynamics can either be represented by a discrete-time map

u(ti + τ) = F(u(ti))   (1)

or an ordinary differential equation (ODE)

du/dt = f(u).   (2)

The learning goal is an accurate representation of F or f or their lower-dimensional analogs in the case that we apply dimension reduction.

We consider first the discrete-time problem. This is easier because that is the format in which we usually have data. For simple systems that decay to a fixed point or display quasiperiodic dynamics with discrete frequencies, linear discrete-time methods, such as dynamic mode decomposition (DMD),27 can predict dynamics. This type of idea can be extended to nonlinear systems when there is a change of basis, which makes the dynamics linear. In Lusch et al.,28 it was shown that an autoencoder can learn this change of basis. However, this approach has not been used to predict the long-time behavior of chaotic systems, which is our interest here. More generally, for every dynamical system, there is a linear, but infinite-dimensional Koopman operator that describes the evolution of an arbitrary observable.29 Our goal is to reduce dimension; therefore, we do not take this approach.

Predicting chaotic dynamics in state space requires a nonlinear model. One successful class of approaches for doing this includes reservoir networks30 and recurrent neural networks (RNNs).31,32 These methods use discrete-time maps, are non-Markovian, and typically increase the dimension of the state space. Reservoir networks work by evolving forward a high-dimensional reservoir state r(t) ∈ Rdr and finding a mapping from r to u. This reservoir state is evolved forward in time by some function r(t + τ) = G(r(t), Win u(t)), where G and Win are chosen a priori. Then, the task is finding the optimal parameters p in u~(t + τ) = Wout(r(t + τ); p) that minimize ||u(t + τ) − u~(t + τ)||2. This is non-Markovian because the prediction of the state u(t + τ) depends on the previous u(t) and r(t). In Pathak et al.,30 a reservoir network was used to predict the chaotic dynamics of the KSE. For data with a state space dimension d = 64, they chose a reservoir dimension dr = 5000. That is, here, the dimension of the representation has not been reduced, but rather expanded, by two orders of magnitude. An interesting way to address this issue is to realize the reservoir architecture in hardware rather than software.33
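The following numpy sketch illustrates the reservoir-computing structure described above: a fixed random reservoir is driven by the input sequence and only the linear readout is fit by ridge regression. The tanh update, all sizes, and the regularization are illustrative choices and are not those of Ref. 30.

```python
import numpy as np

# Sketch of reservoir computing: only W_out is trained; G and W_in are fixed.
rng = np.random.default_rng(2)
d, d_r, N = 64, 500, 2000
u = rng.standard_normal((N, d))                   # stand-in for a trajectory u(t_i)

A = rng.uniform(-1, 1, (d_r, d_r))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # scale spectral radius below one
W_in = rng.uniform(-0.1, 0.1, (d_r, d))

r = np.zeros((N, d_r))
for i in range(N - 1):                            # r(t+tau) = G(r(t), W_in u(t))
    r[i + 1] = np.tanh(A @ r[i] + W_in @ u[i])

beta = 1e-6                                       # ridge-regression parameter
R, Y = r[1:], u[1:]                               # fit W_out so W_out r(t+tau) ~ u(t+tau)
W_out = np.linalg.solve(R.T @ R + beta * np.eye(d_r), R.T @ Y).T
u_pred = r[1:] @ W_out.T                          # one-step predictions of the state
```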

Similar to reservoir networks, RNNs have a hidden state hr ∈ Rdhr that is evolved forward in time. The hidden state is evolved forward by hr(t + τ) = σh(u(t), hr(t); Wh), and the future state is estimated from the hidden state, u~(t + τ) = σu(hr(t + τ); Wu). The functions σh and σu take different forms depending upon the type of RNN—two examples are the long short-term memory (LSTM)34 and the gated recurrent unit.35 Regardless of architecture, the functions σh and σu are constructed from NNs with parameters Wh and Wu. These parameters come from minimizing the same loss as in reservoir computing, ||u(t + τ) − u~(t + τ)||2. Vlachas et al.31 provide a comparison between reservoir networks and RNNs for chaotic dynamics of the KSE. Both provide predictive capabilities for multiple Lyapunov times but, as noted, are high-dimensional and non-Markovian. Additionally, these methods typically require evenly spaced data for training, and start-up often requires multiple known states instead of a single initial condition.

In contrast to these high-dimensional non-Markovian approaches to learning a discrete-time mapping, we have shown in prior work10 that a dense NN can approximate F(u). This approach is advantageous because, like the underlying system, it is Markovian—prediction of the next state only depends on the current state. Unlike the previous methods, this approach also drastically reduces the dimension of the state by representing it in terms of the manifold coordinates. For example, for chaotic dynamics of the KSE with a domain size of L = 22, we showed that the state u ∈ R64 could be represented by h ∈ R8. We found this coordinate system using an undercomplete autoencoder that corrects upon PCA, as described in more detail in Sec. III A. Representing the state in this minimal fashion allows for faster prediction and may provide insight by limiting the system to only the essential degrees of freedom. Iten et al.36 illustrated the latter point by recovering physical variables from data in some low-dimensional example settings.

Rather than approximating F(u) in Eq. (1), the discrete-time representation, one can, in principle, approximate f(u) in Eq. (2), the continuous-time representation. This is more challenging because one does not generally have access to data for du/dt. When the time derivative du/dt is known (or estimated from closely spaced data) for all of the states, Gonzalez-Garcia et al.37 showed that a dense NN can provide an accurate functional form of f for the KSE in a regime that exhibits a periodic orbit. In their work, they input the state and its derivatives into the NN. A similar data-driven method that requires du/dt is “Sparse Identification of Nonlinear Dynamics” (SINDy).38 In this approach, a dictionary of candidate nonlinear functions is selected to represent f(u), and sparse regression is used to identify the dominant terms based on data for both u and du/dt. This idea has been extended for use with autoencoders for dimension reduction, while still requiring time-derivative data, in some cases with invariant manifolds of dimension three or smaller.39 In a distinct approach, Raissi et al.40 showed that du/dt can be approximated from closely spaced data with a multistep time-integration scheme, such as the second-order Adams–Bashforth formula, and a NN can be trained to match this approximation.

Now, we turn to the more general situation, where one desires a data-driven ODE representation but does not have data on time derivatives or state data sufficiently closely spaced to accurately approximate them. In this case, the task of learning f(u) is closely related to the problem of data assimilation.68 In that problem, the general functional form of f is given and the task is to learn a relatively small number of parameters, whereas in the present case, we wish to represent f(u) as an essentially arbitrary function. A framework for doing this, with f given as a neural network, was presented by Chen et al.,41 who dubbed the approach “neural ODEs.” To determine the neural network parameters of f, the difference between data and predictions is minimized. To determine the derivatives of this loss with respect to the parameters, either an adjoint problem can be solved or an automatic differentiation method can be used. With a given f, the state evolution at arbitrary points in time can be computed as the solution to Eq. (2). We further describe, and apply, this approach in Sec. II.

Neural ODEs have been applied to a number of time series problems. Maulik et al.42 used neural ODEs and LSTMs for the Burgers equation and showed that both outperform a Galerkin projection approach. Portwood et al.43 used the neural ODE approach to find an evolution equation for dissipation in decaying isotropic turbulence, showing that it outperformed a heuristic model. Neural ODEs have also been applied to flow around a cylinder in the time-periodic regime, where velocity field simulation data were projected onto eight PCA modes, and the time evolution of these modes was determined.44 

The present work combines nonlinear dimension reduction using autoencoders with the neural ODE method to model the dynamics of a system displaying spatiotemporal chaos, the Kuramoto–Sivashinsky equation, over a range of parameter values. In Sec. II, we introduce the framework of our methodology. Section III A briefly describes the results for the dimension reduction problem alone, and then Sec. III B uses the reduced-dimensional descriptions to illustrate performance of the neural ODE approximation for closely spaced data. An important conclusion here is that dimension reduction can improve neural ODE performance relative to predictions in the ambient space, where artifacts arise. Section III C examines the role of data spacing on neural ODE performance, showing that even for chaotic systems, good performance on both short-time predictions and long-time statistics can be obtained even from widely spaced data, up to a fairly well-defined limit. Finally, Sec. III D shows comparisons of neural ODE performance with various degrees of dimension reduction, finding a “sweet spot” in terms of performance vs dimension. We summarize in Sec. IV.

We consider data u ∈ Rd that lie on a dM-dimensional manifold M that is embedded in the ambient space Rd of the data. With the data, we find a coordinate transformation χ: Rd → RdM giving the state in the manifold coordinates, h ∈ RdM. We also find the inverse χˇ: RdM → Rd to reconstruct u from h. Then, we describe the dynamics on M with the differential equation

dh/dt = g(h).   (3)

If we do not do any dimension reduction, then Eq. (3) is the same as Eq. (2). Of course, for an arbitrary data set, we do not know dM a priori; therefore, in Sec. III D, we present results with various choices for dh, where h ∈ Rdh. Using the mapping to the manifold coordinates, the evolution in the manifold coordinates, and the mapping back, we evolve new initial conditions forward in time. Therefore, our task is to approximate χ, χˇ, and g. We also consider the case with no dimension reduction, where we determine f(u), the right hand side (RHS) of the ODE in the ambient space. In general, there can be no expectation that any of these functions have a simple (e.g., polynomial) form; therefore, here, we approximate them with NNs, as detailed below.

Figure 1 illustrates this framework on data from a solution of the Lorenz equation45 that we embedded in four dimensions by mapping the z coordinate of the Lorenz equation to the Archimedean spiral. The change of coordinates for this embedding is given by [u1,u2,u3,u4]=[x,y,αzcosαz,αzsinαz], where x,y,z are the standard variables in the Lorenz equation and α=0.2. In Fig. 1(a), we show an example of learning the manifold coordinates for this system. The three spatial dimensions for the embedded data are u1, u3, and u4, and the color is u2. For a trajectory to be chaotic, it must lie on at least a three-dimensional manifold; therefore, the minimum embedding dimension for a chaotic trajectory that requires all the steps in this framework is four dimensions. Figure 1(b) illustrates learning the vector field g in the manifold coordinates. In Fig. 1(c), we show how, after learning these functions, new initial conditions can be mapped to the manifold coordinates, evolved forward in time, and then mapped back to the ambient space.
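For concreteness, a short sketch of generating the embedded Lorenz data of Fig. 1 follows; standard Lorenz parameter values are assumed, and the integration tolerances are arbitrary.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Integrate the Lorenz equations and wrap z onto the Archimedean spiral to
# produce four-dimensional data lying on a three-dimensional manifold.
def lorenz(t, X, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = X
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t_eval = np.arange(0, 100, 0.01)
sol = solve_ivp(lorenz, (0, 100), [1.0, 1.0, 1.0], t_eval=t_eval, rtol=1e-8, atol=1e-8)
x, y, z = sol.y

alpha = 0.2
u = np.stack([x, y, alpha * z * np.cos(alpha * z), alpha * z * np.sin(alpha * z)], axis=1)
print(u.shape)   # (N, 4) ambient-space data used as input to the autoencoder
```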

FIG. 1.

(a) Learning the three-dimensional manifold coordinates of the four-dimensional Lorenz butterfly wrapped on the Archimedean spiral (color is the fourth dimension). (b) Learning the vector field in the manifold coordinates. (c) Transforming new initial conditions (black dots) into the manifold coordinates, evolving them according to the learned vector field g (black curves) and transforming back into the original space.


To find χ and χˇ, we represent them as NNs and find parameters that minimize a reconstruction loss averaged over a batch of data given by

L = ⟨||u − χˇ(χ(u; θ1); θ2)||2⟩,   (4)

where θ1 and θ2 are the weights of χ and χˇ, respectively. Here, and elsewhere, we use stochastic gradient descent methods to determine the parameters. Further details are given in Sec. III A. This architecture is known as an undercomplete autoencoder, where χ is an encoder and χˇ is a decoder.26 Autoencoders are widely used for dimension reduction; in the fluid mechanics context, they have been used for many examples, including flow around an airfoil,46 flow around a flat plate,47 Kolmogorov flow,48 and channel flow.49,69
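A minimal Keras sketch of such an undercomplete autoencoder, trained on the reconstruction loss of Eq. (4), is shown below; the layer widths, activations, and training settings are illustrative placeholders rather than the exact architectures reported in Table I.

```python
import numpy as np
from tensorflow import keras

# Undercomplete autoencoder: encoder chi reduces dimension, decoder chi_check
# expands it back; both are trained jointly on the mean squared reconstruction error.
d, d_h = 64, 8

enc_in = keras.Input(shape=(d,))
h = keras.layers.Dense(500, activation="sigmoid")(enc_in)
h = keras.layers.Dense(d_h, activation="linear")(h)
encoder = keras.Model(enc_in, h, name="chi")

dec_in = keras.Input(shape=(d_h,))
u_out = keras.layers.Dense(500, activation="sigmoid")(dec_in)
u_out = keras.layers.Dense(d, activation="linear")(u_out)
decoder = keras.Model(dec_in, u_out, name="chi_check")

ae_in = keras.Input(shape=(d,))
autoencoder = keras.Model(ae_in, decoder(encoder(ae_in)))
autoencoder.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")

u_train = np.random.randn(10000, d).astype("float32")  # stand-in for KSE snapshots
autoencoder.fit(u_train, u_train, batch_size=64, epochs=10, verbose=0)
h_train = encoder.predict(u_train, verbose=0)           # manifold coordinates chi(u)
```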

As noted above, to approximate g in Eq. (3), we use the neural ODE approach of Chen et al.41 In the neural ODE framework, we use g to estimate h~(ti+τ), an approximation of the reduced state h(ti+τ), at some time ti+τ by integrating

h~(ti + τ) = h(ti) + ∫_{ti}^{ti+τ} g(h(t); θ3) dt,   (5)

where θ3 is the set of weights of the NN. Then, g is learned by minimizing the difference between the prediction of the state [h~(ti+τ)] and the known state [h(ti+τ)], in the manifold coordinates, at that time. Specifically, the loss we minimize is

J = ⟨||h~(ti + τ) − h(ti + τ)||1⟩.   (6)

Here, ||.||1 denotes the L1-norm—of course, other norms can be used. By taking this approach, we can estimate g from data spaced further apart in time than when g (or more precisely, dh/dt) is estimated directly from finite differences.

The difficulty comes in determining the gradient of the loss with respect to the weights of the NN, ∂J/∂θ3.

One approach to computation of ∂J/∂θ3 is to use automatic differentiation to backpropagate through all of the steps of the time-integration scheme used to solve Eq. (5). Another approach involves solving an adjoint problem backward in time, which is the primary method used in Ref. 41 and indeed is the classical approach to optimization problems involving differential equation constraints.68 A drawback of backpropagating through the solver is that to calculate the gradient, the entire state must be stored at each time step, which can lead to memory issues.41 However, we consider a chaotic system, which puts an implicit constraint on how far apart data can feasibly be sampled (roughly one Lyapunov time) and still yield a good estimate of g. In the present work, we reach this limit before the memory limit of backpropagation becomes an issue. In our trials, the adjoint method yielded results in good agreement with backpropagation but required longer computation times for training; therefore, we chose to use the backpropagation approach.
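The following PyTorch sketch illustrates this training procedure using the torchdiffeq package released with the neural ODE work of Ref. 41: the state is integrated forward by τ with odeint and the L1 loss of Eq. (6) is backpropagated through the solver (odeint_adjoint would give the adjoint alternative). The network width and the batch of (h(ti), h(ti + τ)) pairs are placeholders.

```python
import torch
from torchdiffeq import odeint  # odeint_adjoint would use the adjoint method instead

# Neural ODE for the manifold dynamics: g is a dense NN giving dh/dt.
d_h, tau = 8, 0.25

class ODEFunc(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d_h, 200), torch.nn.Sigmoid(),
            torch.nn.Linear(200, 200), torch.nn.Sigmoid(),
            torch.nn.Linear(200, d_h),
        )

    def forward(self, t, h):                     # dh/dt = g(h; theta_3)
        return self.net(h)

g = ODEFunc()
opt = torch.optim.Adam(g.parameters(), lr=1e-3)
t_span = torch.tensor([0.0, tau])

h0 = torch.randn(256, d_h)                       # states h(t_i) from the encoder (placeholder)
h1 = torch.randn(256, d_h)                       # known states h(t_i + tau) (placeholder)
for step in range(100):
    opt.zero_grad()
    h1_pred = odeint(g, h0, t_span)[-1]          # prediction h~(t_i + tau)
    loss = torch.mean(torch.abs(h1_pred - h1))   # L1 loss of Eq. (6)
    loss.backward()                              # backpropagate through the solver
    opt.step()
```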

Note that we have separated the training problems for determining the manifold coordinates (i.e., the dimension reduction problem) from that for learning the dynamics in those coordinates. In principle, one could have put those problems together, minimizing a single loss that is a linear combination of Eqs. (4) and (6). Indeed, Champion et al.39 did something similar, though it must be noted that their approach requires data for time derivatives, while ours does not, and also that they knew in advance the dimensions of the manifolds where their data live, which we do not. In any case, the invariant manifold dimensions in their examples were three or less. There are two reasons why we treated the two problems separately. The first is conceptual. The question of finding the manifold on which a data set lies is a self-contained problem. It seems natural to address this problem first and only afterward address the question of how the dynamics evolve on that manifold. The second reason is more practical. Even if one knows a priori the dimension of the manifold, solving simultaneously for the autoencoder and time evolution means simultaneously solving for three neural networks with a very complicated loss landscape. Indeed, when we attempted to train both models together, the predictive capabilities were extremely poor. Furthermore, in the problems we address, the manifold dimension is not known a priori; therefore, trying to learn a vector field (RHS of an ODE) when one does not even have an idea of how many dimensions that vector field should have may lead to much wasted computer time. Therefore, we take the view that it makes more sense to first get an estimate of how many dimensions are required to represent the data and only then to estimate the vector field that determines the dynamics.

The data sets we consider are numerical solutions of the 1D Kuramoto–Sivashinsky equation (KSE),

vt = −vvx − vxx − vxxxx,   (7)

in a domain of length L with periodic boundary conditions. The domain size determines the types of dynamics this system exhibits. For the domain sizes we consider, trajectories exhibit sustained chaos. The dimension of the manifold that contains the global attractor has been computationally approximated using several approaches.10,50,51 We generate high-dimensional state representations (u ∈ Rd) via a Galerkin projection of v onto Fourier modes. Then, we use an exponential time differencing method52 to integrate the resulting system of ordinary differential equations forward in time. The code used is available from Cvitanović et al.53 The data vectors u that we use are the solutions v of the KSE evaluated on d = 64 equally spaced grid points in the domain. We performed resolution checks on a grid of d = 128 and found that, for the domain sizes we consider, trajectories from the same initial condition track one another for multiple Lyapunov times.
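For reference, a sketch of generating KSE trajectories with an exponential time differencing scheme (ETDRK4, in the spirit of Kassam and Trefethen) is given below; the actual data were produced with the code of Ref. 53, so the initial condition, step size, and trajectory length here are illustrative.

```python
import numpy as np

# Fourier-Galerkin KSE solver with ETDRK4 time stepping; L = 22 and d = 64
# grid points are illustrative choices matching the smallest domain.
L, d, dt = 22.0, 64, 0.25
x = L * np.arange(d) / d
u = 0.1 * np.cos(2 * np.pi * x / L) * (1 + np.sin(2 * np.pi * x / L))
v = np.fft.fft(u)                                # Fourier coefficients of u

k = 2 * np.pi * np.fft.fftfreq(d, d=L / d)       # wavenumbers
Lin = k**2 - k**4                                # linear operator for -u_xx - u_xxxx
E, E2 = np.exp(dt * Lin), np.exp(dt * Lin / 2)

# ETDRK4 coefficients evaluated by contour integration to avoid cancellation.
M = 16
r = np.exp(1j * np.pi * (np.arange(1, M + 1) - 0.5) / M)
LR = dt * Lin[:, None] + r[None, :]
Q = dt * np.real(np.mean((np.exp(LR / 2) - 1) / LR, axis=1))
f1 = dt * np.real(np.mean((-4 - LR + np.exp(LR) * (4 - 3 * LR + LR**2)) / LR**3, axis=1))
f2 = dt * np.real(np.mean((2 + LR + np.exp(LR) * (-2 + LR)) / LR**3, axis=1))
f3 = dt * np.real(np.mean((-4 - 3 * LR - LR**2 + np.exp(LR) * (4 - LR)) / LR**3, axis=1))

g_fac = -0.5j * k
g_fac[d // 2] = 0.0                              # zero the Nyquist mode in the derivative

def nonlin(w):                                   # nonlinear term -u u_x in Fourier space
    return g_fac * np.fft.fft(np.real(np.fft.ifft(w)) ** 2)

snapshots = []
for n in range(4000):                            # 4000 steps of dt = 0.25 -> 1000 time units
    Nv = nonlin(v)
    a = E2 * v + Q * Nv
    Na = nonlin(a)
    b = E2 * v + Q * Na
    Nb = nonlin(b)
    c = E2 * a + Q * (2 * Nb - Nv)
    Nc = nonlin(c)
    v = E * v + Nv * f1 + 2 * (Na + Nb) * f2 + Nc * f3
    snapshots.append(np.real(np.fft.ifft(v)))
snapshots = np.array(snapshots)                  # data matrix of u on the 64-point grid
```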

Section III A briefly describes the dimension reduction (manifold representation results). Section III B considers neural ODE predictions, with and without dimension reduction, for closely spaced data. Section III C shows the effect of data spacing on the results, and Sec. III D illustrates the effect of the degree of dimension reduction. We summarize in Sec. IV.

We consider datasets for three domain sizes, L = 22, 44, and 66, for which we estimate the manifold dimension to be dM = 8, 18, and 28, as explained in Sec. III A. The relevant timescale for each of these systems is the Lyapunov timescale (the inverse of the largest Lyapunov exponent), which we previously found to be very close to the integral time for KSE data.10 For these domain sizes, the Lyapunov times are τL = 22.2, 12.3, and 11.6, respectively.50,54 We use 10^5 time units of data for training at each domain size and vary how frequently we sample this dataset in time. For training the NNs, we use Keras55 for the autoencoders and PyTorch56 for the neural ODEs. The neural ODE code is a modification of the code used in Ref. 41, which includes methods for computing the gradient with both the adjoint method and backpropagation.

In each section, we use autoencoders to approximate the map to the manifold coordinates χ and back χˇ. We found in Ref. 10 that a useful way to represent the map to the manifold coordinates (h=χ(u)) is as a difference from a linear projection of the data onto the leading dh PCA modes,

h = χ(u; θ1) = Pdh(UTu) + E(UTu; θ1).   (8)

Here, E is a NN, U is the full PCA matrix, and Pdh is the projection onto the leading dh modes. Similarly, we can learn a difference for the decoder [u~=χˇ(h)],

u~ = χˇ(h; θ2) = U([h, 0]T + D(h; θ2)),   (9)

where D is a NN. By taking this approach, we simplify the problem such that the NN only needs to correct upon PCA. We refer to this as a hybrid NN (HNN).
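One way to realize this hybrid structure is sketched below, written in PyTorch for compactness (the autoencoders in this work were trained in Keras); the exact placement of the corrections E and D reflects our reading of Eqs. (8) and (9), and the stand-in PCA basis and all sizes are illustrative.

```python
import torch

# Hybrid (PCA-corrected) autoencoder: the NNs E and D only learn corrections
# to the linear PCA projection and reconstruction.
d, d_h = 64, 8

class HybridAutoencoder(torch.nn.Module):
    def __init__(self, U):
        super().__init__()
        self.register_buffer("U", U)                      # full PCA basis, d x d
        self.E = torch.nn.Sequential(torch.nn.Linear(d, 500), torch.nn.Sigmoid(),
                                     torch.nn.Linear(500, d_h))
        self.D = torch.nn.Sequential(torch.nn.Linear(d_h, 500), torch.nn.Sigmoid(),
                                     torch.nn.Linear(500, d))

    def encode(self, u):                                  # our reading of Eq. (8)
        a = u @ self.U                                    # full PCA coefficients U^T u
        return a[:, :d_h] + self.E(a)                     # leading modes plus NN correction

    def decode(self, h):                                  # our reading of Eq. (9)
        a = torch.nn.functional.pad(h, (0, d - d_h))      # [h, 0]^T
        return (a + self.D(h)) @ self.U.T                 # corrected coefficients back to Rd

U, _ = torch.linalg.qr(torch.randn(d, d))                 # stand-in for the PCA matrix
model = HybridAutoencoder(U)
u = torch.randn(32, d)
u_tilde = model.decode(model.encode(u))                   # approaches u after training
```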

In Ref. 10, we also took advantage of the translational invariance of the problem. We used the first Fourier mode method-of-slices57 to phase shift the data. In this method, the phase is calculated as ϕ(t) = atan2{Im[a1(t)], Re[a1(t)]}, where a1 is the projection of the state onto the first Fourier mode. Then, the state is phase shifted at each time using the Fourier shift theorem, u^a(t) = a(t)e^−ikϕ(t). This approach removes the continuous symmetry, which reduces the dimension of the problem by one. More precisely, it decouples the evolution of the phase ϕ from the evolution of the pattern u^a—the phase evolution depends on u^a but not vice versa. This reduction in dimension improves the performance of the autoencoder but introduces near-discontinuities in the flow, arising when a1 approaches zero, that must be accounted for by warping time with “in-slice” time.57 These near-discontinuities add additional difficulties in training the time evolution models; therefore, we did not take this approach here.

In Ref. 10, we trained these HNNs using an Adam optimizer with a learning rate of 0.001 for multiple dimensions dh. Figure 2 shows the mean squared reconstruction error (MSE) of a test data set not used in training at various dimensions for each of the domain sizes. When using this hybrid approach, we see a significant drop in the MSE at dimensions of dh = 8, 18, and 28 for the domain sizes L = 22, 44, and 66, respectively. We take these values of dh to be the “correct” dimension, dM, in Secs. III B and III C. It is worth noting that finding this drop requires careful hyperparameter tuning. For example, we found that using rectified linear units rather than sigmoids resulted in higher MSE for dh ≥ dM while having little effect on dh < dM, making the drop in MSE less evident. Due to difficulties in autoencoder training and due to the increased complexity of the system at larger domain sizes, it can be difficult to determine the dimension from the drop in the autoencoder MSE alone. In Sec. III D, we further examine the issue of estimating manifold dimension by considering the performance of neural ODE models as a function of dimension—if the estimate of dM is poor (especially on the low side), then the dynamic model is poor.

FIG. 2.

MSE of test data as a function of dimension for domain sizes L=22, 44, and 66. The expected manifold dimensions are dM=8, 18, and 28 for these domain sizes, respectively.


The estimated dimension of dM = 8 at L = 22 is in agreement with estimates by Ding et al.,50 Vlachas et al.,58 Yang and Radons,59 and Cvitanović et al.60 Ding et al. reached this conclusion by considering a finite number of unstable periodic orbits embedded in the long-time attractor and performing Floquet analysis on these solutions. Vlachas et al. also used an autoencoder approach, varying the dimension and considering the MSE at each dimension. In that work, they used different autoencoder architectures (e.g., a convolutional autoencoder) with different activation functions (continuously differentiable exponential linear units) and still found the same drop at dh = 8. Yang and Radons59 showed that the Lyapunov vectors can be split into two categories by looking for a drop in the Lyapunov exponents. The positive and weakly negative Lyapunov exponents give Lyapunov vectors called “physical modes” that were shown to span a linear space that locally approximates the inertial manifold.59 This drop in Lyapunov exponents was found to be at dh = 8 in Ref. 60 and at dh = 9 in Ref. 59.

For larger domains, the dimension estimates reported here are consistent with Yang et al.51 In that work, they tracked the number of physical modes for larger domain sizes of the KSE. We showed in Ref. 10 that our estimates of dM agree with the predicted dimensions and linear scaling of dimensions shown in Yang et al.51 Linear scaling is also seen in Sondak and Protopapas,61 though their estimate of the manifold dimension is higher than ours or any of the other works mentioned here. They approximated dimension by training autoencoders with a high-dimensional latent space and then considered the variance of each latent variable. Then, dM was taken to be the number of latent variables with a variance above some threshold.

These results guide our selection of dimension in Secs. III B and III C. Furthermore, we find that once dh ≥ dM, the same MSE is achieved whether we use the full HNN or simply project onto dM PCA modes [by setting E(UTu) = 0 in Eq. (8)]. This observation indicates, for the KSE with periodic boundary conditions, that the leading PCA amplitudes fully parameterize the state. Note that this parametrization is nonlinear because a nonlinear decoder is still needed to reconstruct the full state. Therefore, in Secs. III B and III C, χ is the projection onto dM PCA modes and χˇ is a NN that approximates the remaining PCA coefficients from the leading ones. Table I shows the architectures for the autoencoders and for the NNs used for time evolution in Secs. III B and III C.

TABLE I.

Architectures of NNs and matrices used in Secs. III A–III C. “Shape” indicates the dimension of each layer and “activation” the corresponding activation functions (S is the sigmoid activation).26

                            Function    Shape                      Activation
Encoder (L = 22, 44, 66)    χ           d:dM                       Linear
Decoder (L = 22, 44)        χˇ          dM:500:d                   S:linear
Decoder (L = 66)            χˇ          dM:500:500:d               S:S:linear
ODE                         f, g        d,dM:200:200:200:d,dM      S:S:S:linear
Discrete map                G           dM:200:200:200:dM          S:S:S:linear

Now that we have a map to the manifold coordinates and back, the next step is approximating the ODE. Here, we train three different types of ODEs to evaluate the impact of mapping the data to the manifold coordinates. We learn the RHS of the ODE in the manifold coordinates, dh/dt = g(h); in physical space, du/dt = f(u); and in the space of the spatial Fourier coefficients, du^/dt = f^(u^), where u^ = F(u) and F is the discrete Fourier transform. For the last two cases, there is no dimension reduction; therefore, χ is the identity in the first case and the discrete Fourier transform in the second, with χˇ being, again, the identity in the first case and the discrete inverse Fourier transform in the second.

We train each of the NNs with data spaced 0.25 time units apart in these trials. This timescale is substantially shorter than the Lyapunov time, which decouples the issue of dimension from that of time resolution. Each NN is trained until the loss levels off, using an Adam optimizer with a learning rate of 10^−3 that we drop to 10^−4 halfway through training. First, the autoencoder is trained and then the neural ODE. To avoid spurious results, we train five autoencoders and then 10 ODEs and select the combination of an autoencoder and ODE that provides the best short-time tracking. Due to random weight initialization and the stochastic nature of the optimization, we train multiple models and use the one that performs best; this is a common practice in neural-network training, as noted by Bishop in Sec. 9.6 of Ref. 62. As mentioned earlier, the models are trained to minimize the loss in Eq. (6). In previous work,10 we added an additional contribution to the loss for the time evolution problem that penalizes predicted states that fall outside the envelope of the joint PDF of energy production and dissipation. In the present work, we found excellent performance of the best models without the need for this additional term because we did not phase shift the data and thus did not have to work in “in-slice” time.

In Fig. 3, we compare the short- and long-time trajectories between the data at L = 22 [Fig. 3(a)] and the NN models [Figs. 3(b)–3(d)]. For this domain size, the Lyapunov time is τL = 22.2. The same initial condition, which is on the attractor, is used for all cases. The first model prediction, shown in Fig. 3(b), comes from g (the low-dimensional ODE, with dh = 8); the second model prediction, shown in Fig. 3(c), comes from f (the full physical space ODE); and the third model prediction, shown in Fig. 3(d), comes from f^ (the full Fourier space ODE). For each of these figures, the first 50 time units (about 2.3 Lyapunov times) are shown in the left column, and 450–500 time units (i.e., after the solution has evolved for more than 20 Lyapunov times) are shown on the right.

FIG. 3.

Examples of trajectory predictions for different models, L=22,τ=0.25. The left column shows predictions up to 2.3 Lyapunov times, and the right shows predictions after 23 Lyapunov times. (a) True trajectory and (b) predicted trajectory when approximating g with dM=8. (c) Predicted trajectory when approximating f. (d) Predicted trajectory when approximating F(f).


All three of the models are able to generate good predictions for times up to about t=30; this is a reasonable quantitative prediction horizon for a data-driven model of chaotic dynamics. After many Lyapunov times, a good model should still yield behavior consistent with the data. Indeed, at long times, the reduced model no longer matches the true solution but continues to exhibit realistic dynamics, as shown in Fig. 3(b). In Sec. III C, we discuss the long-time behavior of the reduced model in more detail. In contrast, the full state model predictions at long times, shown in Figs. 3(c) and 3(d), develop high-wavenumber striations, which eventually pollute the lower wavenumbers and are thus not faithful to the dynamics. This same behavior appears when modeling the full state evolution for the larger domain sizes L=44 and 66. Training high-dimensional neural ODEs is computationally more expensive, and the predictions from these models are worse.

To diagnose the origin of the spurious high-wavenumber behavior, we plot the magnitude of the Fourier modes as a function of time in Fig. 4 over a very long time interval—hundreds of Lyapunov times. In both cases, we observe a very slow linear growth in the high-wavenumber modes. The inset in Fig. 4(b) shows the time evolution of the RHS of the ODE for the real part of one of the high-wavenumber modes, f^30. At long times, f^30 settles to a constant—effectively a bias term. The time integration of this constant results in the linear growth seen in Fig. 4. We tested multiple activation functions in the NN architecture, and we see growth in the high wavenumbers regardless of whether sigmoids, hyperbolic tangents, or rectified linear units are used.

FIG. 4.

Magnitude of Fourier modes over time when evolving an initial condition with a neural ODE approximation of (a) the full-space dynamics in real space (f) and of (b) the full-space dynamics in Fourier space (f^), L=22,τ=0.25. The inset in (b) is the evolution of the real part of f^30.


To further explain the origin of the bias that leads to the linear growth, we first note that in the true system, the high wavenumbers are strongly damped due to hyperdiffusivity in the KSE; therefore, data sampled from the attractor have negligible content in these wavenumbers. In particular, with the data given, the model has no information with which to learn that these modes are strongly damped. Furthermore, the prediction horizon in the training of the neural ODE is short (0.25 here); therefore, small errors in high wavenumbers have a negligible effect on the loss. Thus, the model does not learn that there should be a strong damping term that would overwhelm any small bias that arises in the RHS due to the stochastic nature of the initialization and training of neural net representations. Indeed, explicitly introducing a damping term, modifying Eq. (3) to read dh/dt = g(h; θ3) − ah with a > 0, is sufficient to stabilize models against this slow linear growth. In Ref. 63, we show that this is the case when the damping is chosen to match the linear term in the KSE. The combination of these factors is why the trajectories of the high-dimensional models leave the attractor, while the trajectories of the reduced-dimensional models, where degrees of freedom that are strongly damped and thus have negligible amplitude on the attractor are absent, do not.
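A sketch of this stabilizing modification, wrapping a learned right-hand side with an explicit linear damping term, is given below; the damping coefficient and the module g are placeholders.

```python
import torch

# Wrap a learned neural ODE right-hand side g(t, h) with linear damping so
# that dh/dt = g(h; theta_3) - a h; a > 0 is an illustrative choice.
class DampedODEFunc(torch.nn.Module):
    def __init__(self, g, a=1.0):
        super().__init__()
        self.g = g          # learned NN right-hand side, callable as g(t, h)
        self.a = a

    def forward(self, t, h):
        return self.g(t, h) - self.a * h
```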

Now, we consider only reduced-dimension ODE models and address the dependence of model quality on the temporal spacing τ between data points. For L = 22, 44, and 66, we select the manifold dimensions mentioned in Sec. III A: dM = 8, 18, and 28, respectively. We judge the model performance based on, first, short-time tracking for L = 22 and then long-time statistical agreement for all domain sizes. In these cases, autoencoders and ODEs were both trained with τ much larger than the value of 0.25 used in Sec. III B, while keeping all other training parameters the same (e.g., architecture, optimizer, dimension). Figure 5 compares a true short-time trajectory [Fig. 5(a)] to the model prediction (with τ = 10) of this trajectory [Fig. 5(b)] starting from the same initial condition. Qualitatively, the space-time plots exhibit similar behavior over around 80 time units. The magnitude of the difference is shown in Fig. 5(c). Here, we see the difference growing at around 40 time units, due primarily to a growing spatial phase difference between the true and predicted results. Figure 5(d) shows the norm of this difference and the minimum value of the norm when shifting u~ to any position in the domain. The minimum translated difference, in Fig. 5(d), remains much smaller than the standard Euclidean distance, indicating that, in this case, the phase difference causes most of the error.
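The error measures of Fig. 5(d) can be computed as in the following sketch, where the minimum over all circular grid shifts of the prediction serves as a discrete proxy for continuous translation; the trajectory arrays are placeholders.

```python
import numpy as np

# Euclidean tracking error and its minimum over circular spatial shifts of
# the prediction, as a function of time.
def tracking_errors(u_true, u_pred):
    """u_true, u_pred: arrays of shape (n_times, d)."""
    err = np.linalg.norm(u_true - u_pred, axis=1)
    d = u_true.shape[1]
    shifted = np.stack([np.roll(u_pred, s, axis=1) for s in range(d)])   # (d, n_times, d)
    err_shift = np.linalg.norm(u_true[None] - shifted, axis=2).min(axis=0)
    return err, err_shift
```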

FIG. 5.

Example of trajectory and predictions for data spaced τ=10 apart, L=22: (a) true trajectory, (b) predicted trajectory, (c) absolute error between the two trajectories, and (d) norm of the error. The norm of the error is shown for the raw data and for the spatial shift that minimizes this error at each time.


We now consider ensemble average quantities to better understand the reconstruction quality of the models as a function of τ. In addition to the ODE models, we also consider discrete-time maps h(ti+τ)=G(h(ti)). Details of the discrete-time map NN architecture are given in Table I.

Figure 6 shows the root mean squared difference between the exact trajectory and the ODE/discrete-time map models as a function of time, averaged over 1000 initial conditions and normalized by the root mean squared difference between random training data, which is denoted by D. In this figure, the ODE models and discrete-time maps use data spaced τ = 10–16 apart. For τ < 10, there is little improvement in the ODE models’ performance. ODE predictions at and below τ = 15 all track well for short times, with the error reaching halfway to random at around t ≈ 1.5τL, and diverge at long times, with the error leveling off at around t ≈ 3τL. Then, performance degrades sharply at τ = 16 (yellow curve). In all cases, the discrete-time map performs worse than the ODE with the same data spacing. Furthermore, although we do not show it here, the performance of discrete-time maps becomes even worse when trying to interpolate between prediction times. This is a significant drawback not seen when using ODEs.

FIG. 6.

Normalized difference between trajectory and prediction as a function of time for different training data spacing, L=22,dh=8.


A similar trend appears in the temporal autocorrelation. Figure 7 shows the temporal autocorrelation of u at a given gridpoint, averaged over space and 1000 initial conditions. The temporal autocorrelation of the neural ODE matches the exact temporal autocorrelation well for data spaced up to τ = 15. Then, there is an abrupt deviation from the true solution at τ = 16. Also shown in Fig. 7 is the temporal autocorrelation for discrete-time maps. At τ = 10, the discrete-time map performs worse than all ODEs trained with τ ≤ 15, and its predictions worsen as τ increases.
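The temporal autocorrelation of Fig. 7 can be estimated as in the sketch below, which correlates u at each grid point with itself at a time lag and averages over space and trajectories; the trajectory array is a placeholder.

```python
import numpy as np

# Temporal autocorrelation averaged over grid points and trajectories.
def autocorrelation(u, max_lag):
    """u: array of shape (n_traj, n_times, d)."""
    u = u - u.mean(axis=1, keepdims=True)     # remove the temporal mean at each grid point
    var = (u ** 2).mean()
    return np.array([(u[:, : u.shape[1] - lag] * u[:, lag:]).mean() / var
                     for lag in range(max_lag)])
```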

FIG. 7.

Temporal autocorrelation for models learned with different data spacing, L=22,dh=8.


The other half of evaluating a model is determining whether trajectories stay on the attractor at long times. For this purpose, we consider the long-time joint PDF of the first (ux) and second (uxx) spatial derivatives of the solution. In the KSE, the spatial average of ux2 is the energy production and that of uxx2 is the dissipation.60 We select these quantities because of their relevance to the energy balance and because two-dimensional joint PDFs are more challenging and important to reconstruct than one-dimensional PDFs. In Fig. 8, the joint PDF for the data is compared to various models. The colormap is a log scale, with low probability excursions in black and no data in white regions. When τ = 10, the ODE model matches well, with differences primarily in the low probability excursions, while at τ = 10, the discrete-time mapping appears more diffuse in the high probability region and matches poorly. At τ = 15, the joint PDF for the ODE prediction still matches well, degrading once τ ≥ 16. This deterioration becomes more evident at τ = 20.

FIG. 8.

(a) Joint PDF at L=22,dh=8. (b) and (c) Joint PDFs when approximating the ODE and the discrete-time map, respectively, at τ=10. (d)–(f) Joint PDFs when approximating the ODE for τ=15, 16, and 20.


We quantify the difference between true and predicted PDFs by computing the KL divergence

DKL(P||P~) = ∫∫ P(ux, uxx) log[P(ux, uxx)/P~(ux, uxx)] dux duxx   (10)

between the true joint PDF, P, and the predicted PDF, P~. We take the convention of setting the quantity under the integral to 0 when P = 0 or P~ = 0, as was done in Ref. 64. The KL divergence appears in Fig. 9 for various τ, where we normalize τ by the Lyapunov time τL. For all domain sizes, the joint PDFs for neural ODE models match the true PDF very well when data are spaced below 0.7 Lyapunov times. In contrast, good reconstruction of the joint PDF with the discrete-time model, for L = 22, requires τ/τL ≲ 0.4.
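A sketch of this statistical comparison is given below: joint PDFs of ux and uxx are estimated from histograms, with the derivatives computed spectrally, and Eq. (10) is evaluated with the convention that bins where either PDF vanishes contribute nothing. The data arrays, domain size, and binning are placeholders.

```python
import numpy as np

# Estimate the joint PDF of u_x and u_xx from trajectory data and compute
# the KL divergence between the true and model PDFs.
def derivatives(u, L):
    k = 2 * np.pi * np.fft.fftfreq(u.shape[-1], d=L / u.shape[-1])
    u_hat = np.fft.fft(u, axis=-1)
    ux = np.real(np.fft.ifft(1j * k * u_hat, axis=-1))
    uxx = np.real(np.fft.ifft(-k ** 2 * u_hat, axis=-1))
    return ux, uxx

def kl_divergence(u_true, u_model, L=22.0, bins=50):
    ux_t, uxx_t = derivatives(u_true, L)
    ux_m, uxx_m = derivatives(u_model, L)
    edges = [np.linspace(ux_t.min(), ux_t.max(), bins + 1),
             np.linspace(uxx_t.min(), uxx_t.max(), bins + 1)]
    P, _, _ = np.histogram2d(ux_t.ravel(), uxx_t.ravel(), bins=edges, density=True)
    Q, _, _ = np.histogram2d(ux_m.ravel(), uxx_m.ravel(), bins=edges, density=True)
    mask = (P > 0) & (Q > 0)                 # drop bins where either PDF vanishes
    dA = np.diff(edges[0])[0] * np.diff(edges[1])[0]
    return np.sum(P[mask] * np.log(P[mask] / Q[mask])) * dA
```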

FIG. 9.

KL divergence of the models trained with different data spacing. Here, time is scaled with Lyapunov time for the corresponding domain size. For L=22,44,66, we use dh=8,18,28.


In Secs. III A–III C, the model dimension was fixed at values assumed to be the correct inertial manifold dimensions based on other studies.10,50,51 Here, we vary the dimension and consider the dynamical model performance, which provides more information on the necessary number of dimensions. For training, we again use 10^5 time units of data spaced 0.25 time units apart and train 5 autoencoders and 15 neural ODEs. The data are normalized by subtracting the mean and dividing by the standard deviation before training and are projected onto the full set of PCA modes before going into the autoencoder (UTu). Training was done with an Adam optimizer with the learning rate scheduled from 10^−3 to 10^−4 halfway through training. For L = 22, we trained for 5000 epochs, and for L = 44 and 66, we trained for 500 epochs due to the computational cost of the larger networks. In all these trials, the loss plateaued at the end of training, but further training in the L = 44 and 66 cases could lead to mild improvement. The code used for training these autoencoders is available on GitHub.65

Our first result here is the observation that, for models with a lower dimension than the “correct” one, reasonable approximations of the joint PDF of ux and uxx can be obtained, but only if a nonlinear encoder is used for dimension reduction. This point is illustrated in Fig. 10(a) for the case L = 22, where we see good reconstruction of the joint PDF when using a nonlinear encoder. Accordingly, the results below all use a nonlinear encoder; architectures are reported in Table II. However, even though this statistic is reconstructed well at low dimensions, Fig. 10(b) shows that the MSE for the autoencoder, with a nonlinear encoder, still drops significantly at dh = 8. This MSE is on test data normalized using the training data mean and standard deviation. As with the autoencoders used in Secs. III B and III C, we again performed manual hyperparameter tuning. Changing the activation functions from sigmoids to rectified linear units and removing the learning rate scheduler both resulted in higher MSE for dh ≥ dM.

FIG. 10.

(a) KL divergence of the best autoencoder and neural ODE pair at multiple dimensions for L=22. The nonlinear encoder corresponds to an autoencoder with a NN for encoding, and the linear encoder corresponds to an autoencoder with the projection onto the first dh Fourier modes for encoding. Both cases use a NN for decoding. (b) Mean squared error of reconstruction for the nonlinear autoencoder at each dimension.

TABLE II.

Architectures of NNs used in Sec. III D. Labels are the same as in Table I.

                        Function    Shape                   Activation
Encoder (L = 22)        χ           d:500:dh                S:linear
Encoder (L = 44, 66)    χ           d:500:500:dh            S:S:linear
Decoder (L = 22)        χˇ          dh:500:d                S:linear
Decoder (L = 44, 66)    χˇ          dh:500:500:d            S:S:linear
ODE                     g           dh:200:200:200:dh       S:S:S:linear

Figure 11 shows the ensemble-averaged short-time tracking error of the models with the best tracking as the dimension varies. For all domain sizes, the recreation gradually improves and then becomes dimension-independent as dimension increases. This happens because the short-time tracking is directly related to the loss minimized in training the neural ODEs, and h contains the same information as u if dh is large enough. As mentioned before, for L=22, 44, and 66, respectively, the “correct” manifold dimensions are 8, 18, and 28, and Fig. 11 shows the tracking error becoming dimension-independent near these values.

FIG. 11.

Short-time error at various dimensions. Domain sizes are L=22,44,66 for (a)–(c).


Now, we turn to long-time statistical recreation upon varying dimensions. Figure 12 shows the KL divergence from the true joint PDF P(ux, uxx) for all the autoencoder and neural ODE pairs trained for each dimension and domain size. For reference, the dashed lines show DKL when comparing the true joint PDF, used to compute DKL for the models, to a separate joint PDF from the true system, evolved forward for the same amount of time as the models but from a different initial condition. This gives a baseline for how two empirical joint PDFs from the true system compare. In all the cases, when the dimension is low, the models do a poor job of reconstructing the joint PDF. When we are near the expected manifold dimension, i.e., dh = 8, 18, and 28 for L = 22, 44, and 66, respectively, the typical model performance dramatically improves and the variation in performance between the models decreases, with DKL ≈ 10^−2 for the best-performing models and only a factor of two or three larger for the worst. Furthermore, DKL between true and best-model predictions is only slightly larger than that between the two time intervals for the true system, with particularly close agreement for L = 22 and 44. In Fig. 13, we plot the true PDFs and the best-model PDFs at L = 22, 44, and 66 and dh = 8, 18, and 28, illustrating the excellent agreement. However, with a further increase of dimension, the variation in performance again becomes large, although the best models still perform well. Like the short-time prediction, the joint PDFs of the best models become dimension-independent, but here, we see that errors arise, particularly the slow linear growth in high wavenumbers, when training the neural ODE with unnecessary dimensions. Indeed, recall from Sec. III B that this spurious slow linear growth at high wavenumbers arises even in the absence of any autoencoder at all.

FIG. 12.

KL divergence for all trained models at different dimensions. Domain sizes are L = 22, 44, and 66 for (a)–(c). The dotted lines on each plot correspond to the DKL computed by comparing two joint PDFs of the true system.
FIG. 13.

(a)–(c) True joint PDFs for L = 22, 44, and 66. (d)–(f) Joint PDFs from the best models (lowest DKL) for L = 22, 44, and 66 at dh = 8, 18, and 28, respectively.


In summary, Fig. 12 highlights that there is a "sweet spot," dh ≈ dM, where model performance is robust. Below this value, too few dimensions are retained and model performance is poor; above it, spurious slow growth of high-wavenumber modes pollutes the long-time predictions. A lack of robustness in model performance is therefore an indication that the dimension has been chosen incorrectly, either too low or too high. As noted previously, we also find that the long-time dynamics can be stabilized by adding a damping term.

Neural ODEs and undercomplete autoencoders can be used to create accurate data-driven models for the time evolution of spatiotemporally chaotic dynamical systems, with dimensions near the expected manifold dimension, which for the present examples lies in the approximate range of 8–28. Dimension reduction is a vital step in the process. Both training and trajectory prediction become more expensive as the dimension grows, and unnecessary dimensions introduce a small spurious bias that drives slow linear growth of high-wavenumber modes. With dimension reduction, we find that training with data spaced up to 0.7 Lyapunov times apart gives accurate short-time tracking and long-time statistical reconstruction. Finally, we investigate the effect of varying the dimension and find that model performance improves with dimension and stops improving once the manifold dimension is reached. However, because of the spurious excitation of high wavenumbers, keeping too many dimensions hurts training, resulting in many poor models. While we have used the Kuramoto–Sivashinsky equation as a model system, this computational approach, in which neural ODEs are mediated by data-driven manifold coordinates, may also be useful for modeling PDE systems such as atmospheric phenomena,45 infinite-dimensional discrete-time maps,66 and other complex spatially distributed systems.67 In forthcoming work, we will show how this reduced-order modeling approach can be integrated into a reinforcement learning framework for control of spatiotemporally chaotic dynamics.

This work was supported by the AFOSR (No. FA9550-18-1-0174) and ONR (No. N00014-18-1-2865) (Vannevar Bush Faculty Fellowship).

The authors have no conflicts to disclose.

Alec J. Linot: Conceptualization (equal); Formal analysis (equal); Investigation (lead); Methodology (equal); Software (lead); Validation (lead); Visualization (lead); Writing – original draft (lead); Writing – review and editing (equal). Michael D. Graham: Conceptualization (equal); Formal analysis (equal); Funding acquisition (lead); Methodology (equal); Resources (lead); Supervision (lead); Writing – original draft (supporting); Writing – review and editing (equal).

The data that support the findings of this study are available from the corresponding author upon reasonable request.

1. F. Takens, "Detecting strange attractors in turbulence," in Dynamical Systems and Turbulence, Warwick 1980, edited by D. Rand and L.-S. Young (Springer, Berlin, 1981), pp. 366–381.
2. R. Temam, "Inertial manifolds," Math. Intell. 12(4), 68–74 (1990).
3. C. Foias, B. Nicolaenko, G. R. Sell, and R. Temam, "Inertial manifold for the Kuramoto-Sivashinsky equation and an estimate of their lowest dimension," 1986, 279 (1988).
4. M. S. Jolly, R. Rosa, and R. Temam, "Evaluating the dimension of an inertial manifold for the Kuramoto-Sivashinsky equation," Adv. Differ. Equ. 5(1–3), 31–66 (2000).
5. R. Temam and X. Wang, "Estimates on the lowest dimension of inertial manifolds for the Kuramoto-Sivashinsky equation in the general case," Differ. Integral Equ. 7(4), 1095–1108 (1994).
6. S. Zelik, "Inertial manifolds and finite-dimensional reduction for dissipative PDEs," Proc. R. Soc. Edinb. A: Math. 144(6), 1245–1327 (2013).
7. C. R. Doering, J. D. Gibbon, D. D. Holm, and B. Nicolaenko, "Low-dimensional behaviour in the complex Ginzburg-Landau equation," Nonlinearity 1(2), 279–309 (1988).
8. C. Foias, O. Manley, and R. Temam, "Modelling of the interaction of small and large eddies in two dimensional turbulent flows," ESAIM: Math. Model. Numer. Anal. 22(1), 93–118 (1988).
9. R. Temam, "Do inertial manifolds apply to turbulence?," Phys. D 37(1–3), 146–152 (1989).
10. A. J. Linot and M. D. Graham, "Deep learning to discover and predict dynamics on an inertial manifold," Phys. Rev. E 101, 062209 (2020).
11. J. M. Lee, Introduction to Smooth Manifolds, Graduate Texts in Mathematics (Springer, 2003).
12. D. Floryan and M. D. Graham, "Charts and atlases for nonlinear data-driven models of dynamics on manifolds," arXiv:2108.05928 (2021).
13. H. Whitney, "The self-intersections of a smooth n-manifold in 2n-space," Ann. Math. 45(2), 220–246 (1944).
14. T. Sauer, J. A. Yorke, and M. Casdagli, "Embedology," J. Stat. Phys. 65(3), 579–616 (1991).
15. H. Hentschel and I. Procaccia, "The infinite number of generalized dimensions of fractals and strange attractors," Phys. D 8(3), 435–444 (1983).
16. J. P. Cunningham and Z. Ghahramani, "Linear dimensionality reduction: Survey, insights, and generalizations," J. Mach. Learn. Res. 16, 2859–2900 (2015).
17. L. J. P. Van Der Maaten, E. O. Postma, and H. J. Van Den Herik, "Dimensionality reduction: A comparative review," Technical Report TiCC-TR 2009-005, Tilburg University (2009), https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf.
18. A. L. Ferguson, A. Z. Panagiotopoulos, I. G. Kevrekidis, and P. G. Debenedetti, "Nonlinear dimensionality reduction in molecular simulation: The diffusion map approach," Chem. Phys. Lett. 509(1–3), 1–11 (2011).
19. S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science 290(5500), 2323–2326 (2000).
20. H. Choi and S. Choi, "Robust kernel isomap," Pattern Recognit. 40(3), 853–862 (2007).
21. J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science 290(5500), 2319–2323 (2000).
22. E. Bollt, "Attractor modeling and empirical nonlinear model reduction of dissipative dynamical systems," Int. J. Bifurc. Chaos 17(4), 1199–1219 (2007).
23. G. Hinton and S. Roweis, "Stochastic neighbor embedding," Adv. Neural Inf. Process. Syst. 15 (2003), https://papers.nips.cc/paper/2002/hash/6150ccc6069bea6b5716254057a194ef-Abstract.html.
24. Y. Bengio, J. F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet, "Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering," Adv. Neural Inf. Process. Syst. 16 (2004), https://papers.nips.cc/paper/2003/hash/cf05968255451bdefe3c5bc64d550517-Abstract.html.
25. G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science 313(5786), 504–507 (2006).
26. I. Goodfellow, Y. Bengio, and A. Courville, The Deep Learning Book (MIT Press, 2017), Vol. 521.
27. J. N. Kutz, S. L. Brunton, B. W. Brunton, and J. L. Proctor, Dynamic Mode Decomposition (Society for Industrial and Applied Mathematics, Philadelphia, PA, 2016).
28. B. Lusch, J. N. Kutz, and S. L. Brunton, "Deep learning for universal linear embeddings of nonlinear dynamics," Nat. Commun. 9, 4950 (2018).
29. M. Budišić, R. Mohr, and I. Mezić, "Applied Koopmanism," Chaos 22(4), 047510 (2012).
30. J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott, "Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach," Phys. Rev. Lett. 120(2), 024102 (2018).
31. P. Vlachas, J. Pathak, B. Hunt, T. Sapsis, M. Girvan, E. Ott, and P. Koumoutsakos, "Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics," Neural Netw. 126, 191–217 (2020).
32. P. R. Vlachas, W. Byeon, Z. Y. Wan, T. P. Sapsis, and P. Koumoutsakos, "Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks," Proc. R. Soc. A 474(2213), 20170844 (2018).
33. D. Canaday, A. Griffith, and D. J. Gauthier, "Rapid time series prediction with a hardware-based reservoir computer," Chaos 28(12), 123119 (2018).
34. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput. 9(8), 1735–1780 (1997).
35. K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder–decoder approaches," in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (Association for Computational Linguistics, Doha, Qatar, 2014), pp. 103–111.
36. R. Iten, T. Metger, H. Wilming, L. del Rio, and R. Renner, "Discovering physical concepts with neural networks," Phys. Rev. Lett. 124(1), 010508 (2020).
37. R. Gonzalez-Garcia, R. Rico-Martinez, and I. G. Kevrekidis, "Identification of distributed parameter systems: A neural net based approach," Comput. Chem. Eng. 22, S965–S968 (1998).
38. S. L. Brunton, J. L. Proctor, and J. N. Kutz, "Discovering governing equations from data by sparse identification of nonlinear dynamical systems," Proc. Natl. Acad. Sci. U.S.A. 113(15), 3932–3937 (2016).
39. K. Champion, B. Lusch, J. N. Kutz, and S. L. Brunton, "Data-driven discovery of coordinates and governing equations," Proc. Natl. Acad. Sci. U.S.A. 116(45), 22445–22451 (2019).
40. M. Raissi, P. Perdikaris, and G. E. Karniadakis, "Multistep neural networks for data-driven discovery of nonlinear dynamical systems," arXiv:1801.01236 (2018), pp. 1–19.
41. R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, "Neural ordinary differential equations," arXiv:1806.07366 (2019).
42. R. Maulik, A. Mohan, B. Lusch, S. Madireddy, P. Balaprakash, and D. Livescu, "Time-series learning of latent-space dynamics for reduced-order model closure," Phys. D 405, 132368 (2020).
43. G. D. Portwood, P. P. Mitra, M. D. Ribeiro, T. M. Nguyen, B. T. Nadiga, J. A. Saenz, M. Chertkov, A. Garg, A. Anandkumar, A. Dengel, R. Baraniuk, and D. P. Schmidt, "Turbulence forecasting via neural ODE," arXiv:1911.05180 (2019).
44. C. J. G. Rojas, A. Dengel, and M. D. Ribeiro, "Reduced-order model for fluid flows via neural ordinary differential equations," arXiv:2102.02248 (2021).
45. E. N. Lorenz, "Deterministic nonperiodic flow," J. Atmos. Sci. 20(2), 130–141 (1963).
46. N. Omata and S. Shirayama, "A novel method of low-dimensional representation for temporal behavior of flow fields using deep autoencoder," AIP Adv. 9(1), 015006 (2019).
47. N. J. Nair and A. Goza, "Leveraging reduced-order models for state estimation using deep learning," J. Fluid Mech. 897, R1 (2020).
48. J. Page, M. P. Brenner, and R. R. Kerswell, "Revealing the state space of turbulence using machine learning," Phys. Rev. Fluids 6, 034402 (2021).
49. M. Milano and P. Koumoutsakos, "Neural network modeling for near wall turbulent flow," J. Comput. Phys. 182(1), 1–26 (2002).
50. X. Ding, H. Chaté, P. Cvitanović, E. Siminos, and K. A. Takeuchi, "Estimating the dimension of an inertial manifold from unstable periodic orbits," Phys. Rev. Lett. 117(2), 1–5 (2016).
51. H. L. Yang, K. A. Takeuchi, F. Ginelli, H. Chaté, and G. Radons, "Hyperbolicity and the effective dimension of spatially extended dissipative systems," Phys. Rev. Lett. 102(7), 1–4 (2009).
52. A.-K. Kassam and L. N. Trefethen, "Fourth-order time-stepping for stiff PDEs," SIAM J. Sci. Comput. 26(4), 1214–1233 (2005).
53. P. Cvitanović, R. Artuso, R. Mainieri, G. Tanner, and G. Vattay, Chaos: Classical and Quantum (Niels Bohr Institute, Copenhagen, 2016).
54. R. A. Edson, J. E. Bunder, T. W. Mattner, and A. J. Roberts, "Lyapunov exponents of the Kuramoto–Sivashinsky PDE," ANZIAM J. 61(3), 270–285 (2019).
55. F. Chollet et al., see https://keras.io for Keras (2015).
56. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32 (Curran Associates, Inc., 2019), pp. 8024–8035.
57. N. B. Budanur, P. Cvitanović, R. L. Davidchack, and E. Siminos, "Reduction of SO(2) symmetry for spatially extended dynamical systems," Phys. Rev. Lett. 114(8), 084102 (2015).
58. P. Vlachas, G. Arampatzis, C. Uhler, and P. Koumoutsakos, "Learning the effective dynamics of complex multiscale systems," arXiv:2006.13431 (2021).
59. H.-L. Yang and G. Radons, "Geometry of inertial manifolds probed via a Lyapunov projection method," Phys. Rev. Lett. 108, 154101 (2012).
60. P. Cvitanović, R. L. Davidchack, and E. Siminos, "On the state space geometry of the Kuramoto–Sivashinsky flow in a periodic domain," SIAM J. Appl. Dyn. Syst. 9(1), 1–33 (2010).
61. D. Sondak and P. Protopapas, "Learning a reduced basis of dynamical systems using an autoencoder," Phys. Rev. E 104, 034202 (2021).
62. C. M. Bishop, Neural Networks for Pattern Recognition (Oxford University Press, Inc., New York, 1995).
63. A. J. Linot, J. W. Burby, Q. Tang, P. Balaprakash, M. D. Graham, and R. Maulik, "Stabilized neural ordinary differential equations for long-time forecasting of dynamical systems," arXiv:2203.15706 (2022).
64. F. Cazals and A. Lhéritier, "Beyond two-sample-tests: Localizing data discrepancies in high-dimensional spaces," in 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (IEEE, 2015), pp. 1–10.
65. A. J. Linot and M. D. Graham, see https://github.com/alinot5/KSE_ROM_NODE for Autoencoder (2022).
66. A. Y. Okulov, "Structured light entities, chaos and nonlocal maps," Chaos, Solitons and Fractals 133, 109638 (2020).
67. P. G. Kevrekidis, J. Cuevas-Maraver, Y. Drossinos, Z. Rapti, and G. A. Kevrekidis, "Reaction-diffusion spatial modeling of COVID-19: Greece and Andalusia as case examples," Phys. Rev. E 104, 024412 (2021).
68. M. Asch, M. Bocquet, and M. Nodet, Data Assimilation: Methods, Algorithms, and Applications (SIAM, 2016).
69. K. Fukami, T. Nakamura, and K. Fukagata, "Convolutional neural network based hierarchical autoencoder for nonlinear mode decomposition of fluid field data," Phys. Fluids 32, 095110 (2020).