Dissipative partial differential equations that exhibit chaotic dynamics tend to evolve to attractors that exist on finite-dimensional manifolds. We present a data-driven reduced-order modeling method that capitalizes on this fact by finding a coordinate representation for this manifold and then a system of ordinary differential equations (ODEs) describing the dynamics in this coordinate system. The manifold coordinates are discovered using an undercomplete autoencoder, a neural network (NN) that reduces and then expands dimension. Then, the ODE in these coordinates is determined by a NN using the neural ODE framework. Both of these steps only require snapshots of data to learn a model, and the data can be widely and/or unevenly spaced. Time-derivative information is not needed. We apply this framework to the Kuramoto–Sivashinsky equation for domain sizes that exhibit chaotic dynamics, with estimated manifold dimensions ranging from 8 to 28. With this system, we find that dimension reduction improves performance relative to predictions in the ambient space, where artifacts arise. Then, with the low-dimensional model, we vary the training data spacing and find excellent short- and long-time statistical recreation of the true dynamics for widely spaced data (spacing of Lyapunov times). We end by comparing performance with various degrees of dimension reduction and find a “sweet spot” in terms of performance vs dimension.
Data-driven modeling of chaotic dissipative dynamical systems has garnered substantial attention with the rise of a variety of machine learning tools. Here, we describe how to take advantage of the low-dimensional manifold that the data lie on, in conjunction with a neural-network approach to ordinary differential equations (ODEs), for reduced-order modeling of systems with chaotic dynamics. Our approach is to first learn an invariant manifold from data and then learn an ODE, using the so-called neural ODE framework, for the dynamics on the manifold. Distinctive features of our approach are that it seeks a minimal-dimensional state representation and that it maintains the Markovian character of the true dynamics. With this method, we recreate the chaotic dynamics of the Kuramoto–Sivashinsky equation using training data that are widely spaced in time.
I. INTRODUCTION
A common question in many applications is as follows: given a time series of measurements on a system, how can a predictive model be generated to estimate future states of the system? For problems such as weather forecasting, just knowing the future state of the system is useful. In other problems, such as minimizing turbulent drag on an aircraft, models are desirable for creating control policies. These models can sometimes be generated from first principles, but insufficient information on the system often limits the ability to write them explicitly. Even when a model can be written out explicitly, it may be very high-dimensional and computationally expensive to simulate. Thus, it is often desirable to generate a low-dimensional model from data.
We consider data sets , where comes from measuring the state of a system at a given time . The full state at a given time can be represented by direct measurements (e.g., position and velocity of a pendulum) or by a representation that is diffeomorphic to the state. One such representation is time delay measurements of the system, as shown by Takens.1 In the case of the pendulum, for example, this could correspond to writing the state as , where is the angle of the pendulum and is the delay time. If lies on a -manifold, then a time delay representation in is diffeomorphic to .1
For many systems of interest, the state space is high-dimensional and the dynamics are chaotic. Despite this complex behavior, when these systems are dissipative, the long-time dynamics often lie on a smooth invariant finite-dimensional inertial manifold of a much lower dimension () than that of the state space.2 Two such systems are the Kuramoto–Sivashinsky equation (KSE)3–6 and the complex Ginzburg–Landau equation.7 Similarly, for the Navier–Stokes equations in two and three spatial dimensions, it has been shown that there is an approximate inertial manifold.8,9 It is approximate in the sense that there are small variations in many dimensions that can be assumed to be zero and still provide an accurate approximation of the state.
With a mapping to the manifold coordinate system, the state can be represented (at least locally) in the manifold coordinates . That is, mappings and exist such that and . In the machine learning literature, and correspond to the encoder and decoder, respectively, of a so-called undercomplete autoencoder structure,10 as we further discuss below. It should be noted that there is no guarantee that can be globally represented with a Cartesian representation in dimensions. Indeed, in general, this cannot be done, and an “atlas” of overlapping local representations, or “charts,” must be used.11 The application considered here will not require this more general formalism, but for related work using charts, see Ref. 12. Our aim here is to use data from a spatiotemporally chaotic system to learn the mappings and back and forth between and a coordinate system on and then to learn the evolution equation for the dynamics in this coordinate system. This will be a minimal-dimensional representation for the dynamics of the system in question. We first introduce some background for the dimension reduction problem and then for the time evolution problem.
A. Dimension reduction methods
Dimension reduction is a challenging and widely studied problem, and many approaches have been considered. Frequently, a linear dimension reduction technique is used (i.e., is taken to exist in a linear subspace of ). This choice can be rationalized by invocation of Whitney’s theorem:13 any smooth -manifold can be embedded into . Sauer et al. later refined this result by showing that this embedding can be performed by almost every smooth map from a -manifold to for ,14 where is the box-counting dimension of the attractor that lies in the manifold . The box-counting dimension is one of a family of fractal dimensions that can be computed for a given attractor15 and must be less than or equal to the manifold dimension. Thus, for almost every smooth map, including linear ones, contains the same information as (i.e., a map exists for reconstructing from ).
Cunningham and Ghahramani16 give an overview of linear dimension reduction techniques from a machine learning perspective. Many of these collapse to one of the most common methods, principal component analysis (PCA). In PCA, the linear transformation that minimizes the reconstruction error or maximizes the variance in the data is found. This transformation comes from projecting data onto the leading left singular vectors of the centered snapshot matrix (here, ⟨·⟩ denotes the ensemble average). This is equivalent to finding the eigenvectors of the covariance matrix. Similarly, the eigenvectors of the Gram (inner-product) matrix can be found and related to the eigenvectors of the covariance matrix through the singular value decomposition. This approach is sometimes known as classical scaling.17 The projection onto these modes is , where superscript tilde denotes the approximation of the state. In this projection, the vector of mode amplitudes is the low-dimensional representation. Although this projection is the optimal linear transformation for minimizing the reconstruction error, it may still require as many as dimensions to contain the same information as the full state.
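For concreteness, the PCA projection just described can be sketched in a few lines of NumPy. This is a minimal illustration under our own conventions (snapshots stored as rows, so the PCA modes are the right singular vectors of the centered snapshot matrix), not the code used in this work:

```python
import numpy as np

def pca_reduce(X, n_modes):
    """Project snapshots X (n_samples x d) onto the leading PCA modes."""
    X_mean = X.mean(axis=0)
    Xc = X - X_mean                        # centered snapshot matrix
    # With snapshots as rows, the PCA modes are the leading right singular
    # vectors of Xc (equivalently, eigenvectors of the covariance matrix).
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    modes = Vt[:n_modes].T                 # (d, n_modes)
    a = Xc @ modes                         # low-dimensional representation
    X_tilde = a @ modes.T + X_mean         # optimal linear reconstruction
    return a, X_tilde, modes
```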
Finding a minimal representation in general requires a nonlinear transformation. Many different techniques exist for nonlinear dimension reduction, often with the goal of preserving some property of the original dataset in the lower dimension. Some popular methods include kernel PCA, diffusion maps, local linear embedding (LLE), isometric feature mapping (Isomap), and t-distributed stochastic neighbor embedding (tSNE).17 In kernel PCA, the matrix is replaced with a kernel matrix. This can be viewed as an application of the “kernel trick” on the covariance matrix between data that have been mapped into a higher-dimensional (often infinite-dimensional) “feature space.”17 The low-dimensional representation is then computed using the eigenvectors such that , where is the ith element of the kth eigenvector of .
Similarly, diffusion maps, LLE, and Isomap can be viewed as extensions of kernel PCA with special kernels.17 In diffusion maps, a Gaussian kernel is applied to the data giving the matrix , where is a tuning parameter that represents the local connectivity.18 Then, the dimension reduction is performed by computing the eigenvalue decomposition of this matrix (normalized so columns add to one) giving . The first eigenvector is the trivial all-ones vector.18 In LLE, a linear combination of the nearest neighbors to a data point is used to minimize , where if is not a neighbor of . Then, the low-dimensional representation is calculated to minimize using the calculated in the first step.19 The solution that minimizes this cost function can be found by computing the eigenvectors of and letting . For Isomap, the kernel matrix comes from computing the double-centered geodesic distances20,21 . Here, is the geodesic distance, which is computed in two steps. First, a graph of the k nearest neighbors between points is constructed, weighted by their Euclidean distances, and then the geodesic distance between two points is taken to be the length of the shortest path between them along the graph. After computing , the matrix is centered with (), and the eigenvalue decomposition of gives . Isomap was used to reduce the dimension of many dynamical systems in Ref. 22.
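As an illustration of how one of these kernel-based methods is used in practice, the following is a minimal diffusion-map sketch along the lines described above. The row-normalization convention (which makes the trivial leading eigenvector constant) and the parameter names are our own choices for illustration:

```python
import numpy as np

def diffusion_map(X, n_coords, eps):
    """Return n_coords nontrivial diffusion-map coordinates of the rows of X."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    A = np.exp(-D2 / eps)                     # Gaussian kernel; eps sets local connectivity
    M = A / A.sum(axis=1, keepdims=True)      # normalize to a Markov (row-stochastic) matrix
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)           # sort eigenvalues in decreasing order
    evals, evecs = evals.real[order], evecs.real[:, order]
    # The leading eigenvector is trivial (constant), so skip it.
    return evecs[:, 1:n_coords + 1] * evals[1:n_coords + 1]
```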
The last algorithm we mention here is tSNE.23 Unlike the previous methods, which use eigenvalue decompositions to solve the optimization problem, tSNE uses gradient descent. In tSNE, the objective is to match the distribution of the data in the high-dimensional space to the distribution of the data in the low-dimensional space. This is done by finding that minimizes the Kullback–Leibler (KL) divergence , where and are the distributions of the high- and low-dimensional data, respectively. In tSNE, these distributions are approximated by and . There is no guarantee of finding a global minimum or of finding the same solution every time. tSNE is frequently used for visualizing high-dimensional data in two or three dimensions because it often separates complex data sets out into visually distinct clusters.
A few major drawbacks of these nonlinear dimension reduction techniques are that they do not provide the function that reconstructs the full state from the low-dimensional representation, they do not provide a method for adding out-of-sample data, and they do not scale well to large amounts of data. Except for tSNE, the Nyström extension can be used to resolve the out-of-sample problem for these methods.24 However, using the Nyström extension results in different mapping functions depending on whether the data are in-sample or out-of-sample.
Because of these limitations, instead of using one of the above techniques, we tackle the dimension reduction problem directly by approximating the mappings to the manifold coordinates and back as NNs; i.e., we use an undercomplete autoencoder.25,26 We describe autoencoders in more detail in Sec. II.
B. Time evolution methods
We now turn to developing, from data, dynamical models for the evolution of a state u. We consider systems that display deterministic, Markovian dynamics; therefore, if u is the full state (either directly or diffeomorphically through embedding of a sufficient number of time delays), then the dynamics can be represented either by a discrete-time map

u(t_i + τ) = F(u(t_i)), (1)

or by an ordinary differential equation (ODE)

du/dt = f(u). (2)

The learning goal is an accurate representation of F or f or their lower-dimensional analogs in the case that we apply dimension reduction.
We consider first the discrete-time problem. This is easier because that is the format in which we usually have data. For simple systems that decay to a fixed point or display quasiperiodic dynamics with discrete frequencies, linear discrete-time methods, such as dynamic mode decomposition (DMD),27 can predict dynamics. This type of idea can be extended to nonlinear systems when there is a change of basis that makes the dynamics linear. In Lusch et al.,28 it was shown that an autoencoder can learn this change of basis. However, this approach has not been used to predict the long-time behavior of chaotic systems, which is our interest here. More generally, for every dynamical system, there is a linear, but infinite-dimensional, Koopman operator that describes the evolution of an arbitrary observable.29 Our goal is to reduce dimension; therefore, we do not take this approach.
Predicting chaotic dynamics in state space requires a nonlinear model. One successful class of approaches for doing this includes reservoir networks30 and recurrent neural networks (RNNs).31,32 These methods use discrete-time maps, are non-Markovian, and typically increase the dimension of the state space. Reservoir networks work by evolving forward a high-dimensional reservoir state and finding a mapping from to . This reservoir state is evolved forward in time by some function , where and are chosen a priori. Then, the task is finding the optimal parameters in that minimize . This is non-Markovian because the prediction of the state depends on the previous and . In Pathak et al.,30 a reservoir network was used to predict the chaotic dynamics of the KSE. For data with a state space dimension , they choose a reservoir dimension . That is, here, the dimension of the representation has not been reduced, but rather expanded, by two orders of magnitude. An interesting way to address this issue is to realize the reservoir architecture in hardware rather than software.33
Similar to reservoir networks, RNNs have a hidden state that is evolved forward in time. The hidden state is evolved forward by , and the future state is estimated from the hidden state . The functions and take different forms depending upon the type of RNN: two examples are the long short-term memory (LSTM)34 and the gated recurrent unit.35 Regardless of architecture, the functions and are constructed from NNs with parameters and . These parameters come from minimizing the same loss as in reservoir computing . Vlachas et al.31 provide a comparison between reservoir networks and RNNs for chaotic dynamics of the KSE. Both provide predictive capabilities for multiple Lyapunov times but, as noted, are high-dimensional and non-Markovian. Additionally, these methods typically require evenly spaced data for training, and start-up often requires multiple known states, instead of a single initial condition.
In contrast to these high-dimensional non-Markovian approaches to learning a discrete-time mapping, we have shown in prior work10 that a dense NN can approximate . This approach is advantageous because, like the underlying system, it is Markovian—prediction of the next state only depends on the current state. Unlike the previous methods, this approach also drastically reduced the dimension of the state by representing it in terms of the manifold coordinates. For example, for chaotic dynamics of the KSE with a domain size of , we showed that the state could be represented by . We found this coordinate system using an undercomplete autoencoder that corrected on PCA, as described in more detail in Sec. III A. Representing the state in this minimal fashion allows for faster prediction and may provide insight by limiting the system to only the essential degrees of freedom. Iten et al.36 illustrated the latter point by recovering physical variables from data in some low-dimensional example settings.
Rather than approximating F in Eq. (1), the discrete-time representation, one can, in principle, approximate f in Eq. (2), the continuous-time representation. This is more challenging because one does not generally have access to data for the time derivative. When the time derivative is known (or estimated from closely spaced data) for all of the states, Gonzalez-Garcia et al.37 showed that a dense NN can provide an accurate functional form of f for the KSE in a regime that exhibits a periodic orbit. In their work, they input the state and its derivatives into the NN. A similar data-driven method that requires time-derivative data is “Sparse Identification of Nonlinear Dynamics” (SINDy).38 In this approach, a dictionary of candidate nonlinear functions is selected to represent f, and sparse regression is used to identify the dominant terms based on data for both the state and its time derivative. This idea has been extended for use with autoencoders for dimension reduction, while still requiring time-derivative data, in some cases with invariant manifolds of dimension three or smaller.39 In a distinct approach, Raissi et al.40 showed that f can be approximated from closely spaced data with a multistep time-integration scheme, such as the second-order Adams–Bashforth formula, and a NN can be trained to match this approximation.
Now, we turn to the more general situation, where one desires a data-driven ODE representation but does not have data on time derivatives or state data sufficiently closely spaced to accurately approximate them. Here, the task of learning f is closely related to the problem of data assimilation.68 In that problem, the general functional form of f is given and the task is to learn a relatively small number of parameters, whereas in the present case, we wish to represent f as an essentially arbitrary function. A framework for doing this, with f given as a neural network, was presented by Chen et al.41 who dubbed the approach “neural ODEs.” To determine the neural network parameters of f, the difference between data and predictions is minimized. To determine the derivatives of the loss with respect to the parameters, either an adjoint problem can be solved or an automatic differentiation method can be used. With a given f, the state evolution at arbitrary points in time can be computed as the solution to Eq. (2). We further describe, and apply, this approach in Sec. II.
Neural ODEs have been applied to a number of time series problems. Maulik et al.42 used neural ODEs and LSTMs for the Burgers equation and showed that both outperform a Galerkin projection approach. Portwood et al.43 used the neural ODE approach to find an evolution equation for dissipation in decaying isotropic turbulence, showing that it outperformed a heuristic model. Neural ODEs have also been applied to flow around a cylinder in the time-periodic regime, where velocity field simulation data were projected onto eight PCA modes, and the time evolution of these modes was determined.44
The present work combines nonlinear dimension reduction using autoencoders with the neural ODE method to model the dynamics of a system displaying spatiotemporal chaos, the Kuramoto–Sivashinsky equation, over a range of parameter values. In Sec. II, we introduce the framework of our methodology. Section III A briefly describes the results for the dimension reduction problem alone, and then Sec. III B uses the reduced-dimensional descriptions to illustrate performance of the neural ODE approximation for closely spaced data. An important conclusion here is that dimension reduction can improve neural ODE performance relative to predictions in the ambient space, where artifacts arise. Section III C examines the role of data spacing on neural ODE performance, showing that even for chaotic systems, good performance on both short-time predictions and long-time statistics can be obtained even from widely spaced data, up to a fairly well-defined limit. Finally, Sec. III D shows comparisons of neural ODE performance with various degrees of dimension reduction, finding a “sweet spot” in terms of performance vs dimension. We summarize in Sec. IV.
II. FRAMEWORK
We consider data that lie on a d_M-dimensional manifold that is embedded in the ambient space of the data. With the data, we find a coordinate transformation giving the state in the manifold coordinates, h = χ(u). We also find the inverse χ̌ to reconstruct u from h. Then, we describe the dynamics on the manifold with the differential equation

dh/dt = g(h). (3)
If we do not do any dimension reduction, then Eq. (3) is the same as Eq. (2). Of course, for an arbitrary data set, we do not know d_M a priori; therefore, in Sec. III D, we present results with various choices for the dimension d_h of the manifold representation. Using the mapping to the manifold coordinates, the evolution in the manifold coordinates, and the mapping back, we evolve new initial conditions forward in time. Therefore, our task is to approximate χ, χ̌, and g. We also consider the case with no dimension reduction, where we determine f, the right-hand side (RHS) of the ODE in the ambient space. In general, there can be no expectation that any of these functions have a simple (e.g., polynomial) form; therefore, here, we approximate them with NNs, as detailed below.
Figure 1 illustrates this framework on data from a solution of the Lorenz equation45 that we embedded in four dimensions by mapping the coordinate of the Lorenz equation to the Archimedean spiral. The change of coordinates for this embedding is given by , where are the standard variables in the Lorenz equation and . In Fig. 1(a), we show an example of learning the manifold coordinates for this system. The three spatial dimensions for the embedded data are , , and , and the color is . For a trajectory to be chaotic, it must lie on at least a three-dimensional manifold; therefore, the minimum embedding dimension for a chaotic trajectory that requires all the steps in this framework is four dimensions. Figure 1(b) illustrates learning the vector field in the manifold coordinates. In Fig. 1(c), we show how, after learning these functions, new initial conditions can be mapped to the manifold coordinates, evolved forward in time, and then mapped back to the ambient space.
To find χ and χ̌, we represent them as NNs and find parameters that minimize a reconstruction loss averaged over a batch of data given by

L = ⟨ ||u − χ̌(χ(u; θ_E); θ_D)||² ⟩, (4)

where θ_E and θ_D are the weights of χ and χ̌, respectively. Here, and elsewhere, we use stochastic gradient descent methods to determine the parameters. Further details are given in Sec. III A. This architecture is known as an undercomplete autoencoder, where χ is an encoder and χ̌ is a decoder.26 Autoencoders are widely used for dimension reduction; in the fluid mechanics context, they have been used for many examples, including flow around an airfoil,46 flow around a flat plate,47 Kolmogorov flow,48 and channel flow.49,69
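A minimal PyTorch sketch of an undercomplete autoencoder trained on a reconstruction loss of the form of Eq. (4) is given below. The layer widths, activations, and training loop are illustrative placeholders rather than the architectures of Table I, and the autoencoders in this work were trained in Keras rather than PyTorch:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Undercomplete autoencoder: encoder maps u -> h, decoder maps h -> u_tilde."""
    def __init__(self, d, d_h, width=500):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, width), nn.Sigmoid(),
                                     nn.Linear(width, d_h))
        self.decoder = nn.Sequential(nn.Linear(d_h, width), nn.Sigmoid(),
                                     nn.Linear(width, d))

    def forward(self, u):
        return self.decoder(self.encoder(u))

def train_autoencoder(model, data, epochs=1000, lr=1e-3):
    """data: (n_samples, d) tensor of snapshots; minimizes the batch-averaged MSE."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        u_tilde = model(data)
        loss = torch.mean((data - u_tilde) ** 2)   # reconstruction loss, cf. Eq. (4)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```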
As noted above, to approximate g in Eq. (3), we use the neural ODE approach of Chen et al.41 In the neural ODE framework, we use g to estimate h̃(t_i + τ), an approximation of the reduced state h(t_i + τ), at some time t_i + τ by integrating

h̃(t_i + τ) = h(t_i) + ∫_{t_i}^{t_i+τ} g(h(t); θ_g) dt, (5)

where θ_g is the set of weights of the NN. Then, g is learned by minimizing the difference between the prediction of the state [h̃(t_i + τ)] and the known state [h(t_i + τ)], in the manifold coordinates, at that time. Specifically, the loss we minimize is

J = ⟨ ||h̃(t_i + τ) − h(t_i + τ)||² ⟩. (6)

Here, ||·|| denotes the L2-norm; of course, other norms can be used. By taking this approach, we can estimate g from data spaced further apart in time than when g (or, more precisely, dh/dt) is estimated directly from finite differences.
The difficulty comes in determining the gradient of the loss with respect to the weights of the NN, dJ/dθ_g.
One approach to computation of dJ/dθ_g is to use automatic differentiation to backpropagate through all of the steps of the time-integration scheme used to solve Eq. (5). Another approach involves solving an adjoint problem backward in time, which is the primary method used in Ref. 41 and indeed is the classical approach to optimization problems involving differential equation constraints.68 A drawback of backpropagating through the solver is that, to calculate the gradient, the entire state must be stored at each time step, which can lead to memory issues.41 However, we consider a chaotic system, which puts an implicit constraint on how far apart data can feasibly be sampled (roughly one Lyapunov time) and still yield a good estimate of g. In the present work, we reach this limit before the memory limit of backpropagation becomes an issue. In our trials, the adjoint method yielded results in good agreement with backpropagation but required longer computation times for training; therefore, we chose to use the backpropagation approach.
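To make the training procedure concrete, here is a simplified PyTorch sketch of the backpropagation-through-the-solver approach with a fixed-step RK4 integrator. It stands in for the modified code of Ref. 41 used in this work; the architecture, number of integration steps, and hyperparameters are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """NN approximation g(h) of the vector field in the manifold coordinates."""
    def __init__(self, d_h, width=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_h, width), nn.Sigmoid(),
                                 nn.Linear(width, width), nn.Sigmoid(),
                                 nn.Linear(width, d_h))

    def forward(self, h):
        return self.net(h)

def rk4_step(g, h, dt):
    """One fourth-order Runge-Kutta step; gradients flow through each evaluation of g."""
    k1 = g(h)
    k2 = g(h + 0.5 * dt * k1)
    k3 = g(h + 0.5 * dt * k2)
    k4 = g(h + dt * k3)
    return h + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def train_neural_ode(g, h_t, h_tau, tau, n_steps=10, epochs=1000, lr=1e-3):
    """Fit g so that integrating h_t forward by tau reproduces h_tau (cf. Eqs. (5) and (6)).

    h_t, h_tau: (n_pairs, d_h) tensors of manifold-coordinate states separated by tau.
    """
    opt = torch.optim.Adam(g.parameters(), lr=lr)
    dt = tau / n_steps
    for _ in range(epochs):
        h = h_t
        for _ in range(n_steps):               # backpropagation through the solver
            h = rk4_step(g, h, dt)
        loss = torch.mean(torch.sum((h - h_tau) ** 2, dim=1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return g
```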
Note that we have separated the training problems for determining the manifold coordinates (i.e., the dimension reduction problem) from that for learning the dynamics in those coordinates. In principle, one could have put those problems together, minimizing a single loss that is a linear combination of Eqs. (4) and (6). Indeed, Champion et al.39 did something similar, though it must be noted that their approach requires data for time derivatives, while ours does not, and also that they knew in advance the dimensions of the manifolds where their data live, which we do not. In any case, the invariant manifold dimensions in their examples were three or less. There are two reasons why we treated the two problems separately. The first is conceptual. The question of finding the manifold on which a data set lies is a self-contained problem. It seems natural to address this problem first and only afterward address the question of how the dynamics evolve on that manifold. The second reason is more practical. Even if one knows a priori the dimension of the manifold, solving simultaneously for the autoencoder and time evolution means simultaneously solving for three neural networks with a very complicated loss landscape. Indeed, when we attempted to train both models together, the predictive capabilities were extremely poor. Furthermore, in the problems we address, the manifold dimension is not known a priori; therefore, trying to learn a vector field (RHS of an ODE) when one does not even have an idea of how many dimensions that vector field should have may lead to much wasted computer time. Therefore, we take the view that it makes more sense to first get an estimate of how many dimensions are required to represent the data and only then to estimate the vector field that determines the dynamics.
The data sets we consider are numerical solutions of the 1D Kuramoto–Sivashinsky equation (KSE),

∂u/∂t = −u ∂u/∂x − ∂²u/∂x² − ∂⁴u/∂x⁴, (7)
in a domain of length L with periodic boundary conditions. The domain size determines the types of dynamics this system exhibits. For the domain sizes we consider, trajectories exhibit sustained chaos. The dimension of the manifold that contains the global attractor has been computationally approximated using several approaches.10,50,51 We generate high-dimensional state representations via a Galerkin projection of Eq. (7) onto Fourier modes. Then, we use an exponential time differencing method52 to integrate the resulting system of ordinary differential equations forward in time. The code used is available from Cvitanović et al.53 The data vectors that we use are solutions of the KSE on equally spaced grid points in the domain. We performed resolution checks on a grid of and found, for the domain sizes we consider, trajectories from the same initial condition track one another for multiple Lyapunov times.
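To illustrate how such training data can be generated, the following is a minimal NumPy sketch of a Fourier pseudospectral KSE solver using the ETDRK4 exponential-time-differencing scheme in the style of Kassam and Trefethen. It is not the ChaosBook.org code used here, and the grid size, time step, and initial condition are placeholders:

```python
import numpy as np

def kse_etdrk4(L=22.0, N=64, dt=0.25, n_steps=4000, n_transient=400):
    """Generate KSE snapshots u(x, t) on a periodic domain of length L via ETDRK4."""
    x = L * np.arange(N) / N
    u = 0.1 * np.cos(2 * np.pi * x / L) * (1 + np.sin(2 * np.pi * x / L))
    v = np.fft.fft(u)
    k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)          # wavenumbers
    Lop = k**2 - k**4                                   # linear operator of Eq. (7) in Fourier space
    E, E2 = np.exp(dt * Lop), np.exp(dt * Lop / 2)
    M = 16                                              # contour points for the phi-functions
    r = np.exp(1j * np.pi * (np.arange(1, M + 1) - 0.5) / M)
    LR = dt * Lop[:, None] + r[None, :]
    Q = dt * np.real(np.mean((np.exp(LR / 2) - 1) / LR, axis=1))
    f1 = dt * np.real(np.mean((-4 - LR + np.exp(LR) * (4 - 3 * LR + LR**2)) / LR**3, axis=1))
    f2 = dt * np.real(np.mean((2 + LR + np.exp(LR) * (-2 + LR)) / LR**3, axis=1))
    f3 = dt * np.real(np.mean((-4 - 3 * LR - LR**2 + np.exp(LR) * (4 - LR)) / LR**3, axis=1))
    g = -0.5j * k                                       # nonlinear term -u u_x = -(1/2) d(u^2)/dx
    snapshots = []
    for n in range(n_transient + n_steps):
        Nv = g * np.fft.fft(np.real(np.fft.ifft(v))**2)
        a = E2 * v + Q * Nv
        Na = g * np.fft.fft(np.real(np.fft.ifft(a))**2)
        b = E2 * v + Q * Na
        Nb = g * np.fft.fft(np.real(np.fft.ifft(b))**2)
        c = E2 * a + Q * (2 * Nb - Nv)
        Nc = g * np.fft.fft(np.real(np.fft.ifft(c))**2)
        v = E * v + Nv * f1 + 2 * (Na + Nb) * f2 + Nc * f3
        if n >= n_transient:                            # discard the transient off the attractor
            snapshots.append(np.real(np.fft.ifft(v)))
    return x, np.array(snapshots)
```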
III. RESULTS
Section III A briefly describes the dimension reduction (manifold representation) results. Section III B considers neural ODE predictions, with and without dimension reduction, for closely spaced data. Section III C shows the effect of data spacing on the results, and Sec. III D illustrates the effect of the degree of dimension reduction. We summarize in Sec. IV.
We consider datasets for three domain sizes, L = 22, 44, and 66, where we estimate the manifold dimension to be , , and , as explained in Sec. III A. The relevant timescale for each of these systems is the Lyapunov timescale (inverse of the largest Lyapunov exponent), which we previously found to be very close to the integral time for KSE data.10 For these domain sizes, the Lyapunov times are , , and , respectively.50,54 We use time units of data for training at each domain size and vary how frequently we sample this dataset in time. For training the NNs, we use Keras55 for the autoencoders and PyTorch56 for the neural ODEs. The neural ODE code is a modification of the code used in Ref. 41, which includes methods for computing the gradient both with the adjoint method and with backpropagation.
A. Dimension reduction with autoencoders
In each section, we use autoencoders to approximate the map to the manifold coordinates (χ) and back (χ̌). We found in Ref. 10 that a useful way to represent the map to the manifold coordinates is as a difference from a linear projection of the data onto the leading PCA modes,
Here, is a NN, is the full PCA matrix, and is the projection onto the leading modes. Similarly, we can learn a difference for the decoder [],
where is a NN. By taking this approach, we simplify the problem such that the NN only needs to correct upon PCA. We refer to this as a hybrid NN (HNN).
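One plausible realization of this PCA-plus-correction idea is sketched below. The exact hybrid forms are given in Ref. 10; the inputs to the correction networks, the layer sizes, and the variable names here are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    """Leading PCA coefficients plus an NN correction (one plausible hybrid encoder).

    U is the full (d x d) PCA basis with modes as columns; the NN E only has to
    learn a correction to the linear projection onto the leading d_h modes.
    """
    def __init__(self, U, d_h, width=500):
        super().__init__()
        d = U.shape[0]
        self.register_buffer("U", torch.as_tensor(U, dtype=torch.float32))
        self.d_h = d_h
        self.E = nn.Sequential(nn.Linear(d, width), nn.Sigmoid(),
                               nn.Linear(width, d_h))

    def forward(self, u):
        a = u @ self.U                       # full set of PCA coefficients
        return a[:, :self.d_h] + self.E(u)   # leading coefficients + learned correction

class HybridDecoder(nn.Module):
    """Rank-d_h PCA reconstruction plus an NN correction (one plausible hybrid decoder)."""
    def __init__(self, U, d_h, width=500):
        super().__init__()
        d = U.shape[0]
        self.register_buffer("U", torch.as_tensor(U, dtype=torch.float32))
        self.d_h = d_h
        self.D = nn.Sequential(nn.Linear(d_h, width), nn.Sigmoid(),
                               nn.Linear(width, d))

    def forward(self, h):
        u_lin = h @ self.U[:, :self.d_h].T   # linear PCA reconstruction from h
        return u_lin + self.D(h)             # NN corrects what the linear part misses
```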
In Ref. 10, we also took advantage of the translational invariance of the problem. We used the first Fourier mode method-of-slices57 to phase shift the data. In this method, the phase is calculated as , where is the projection of the state onto the first Fourier mode. Then, the state is phase shifted at each time using the Fourier shift theorem . This approach removes the continuous symmetry, which reduces the dimension of the problem by one. More precisely, it decouples the evolution of the phase from the evolution of the pattern —the phase evolution depends on but not vice versa. This reduction in dimension improves the performance of the autoencoder but introduces near-discontinuities in the flow that arise when approaches zero that must be accounted for by warping time with “in-slice” time.57 These near-discontinuities add additional difficulties in training the time evolution models; therefore, we did not take this approach here.
In Ref. 10, we trained these HNNs using an Adam optimizer with a learning rate of for multiple dimensions . Figure 2 shows the mean squared reconstruction error (MSE) of a test data set not used in training at various dimensions for each of the domain sizes. When using this hybrid approach, we see a significant drop in the MSE at dimensions of , , and for the domain sizes , , and , respectively. We take these values of to be the “correct” dimension, , in Secs. III B and III C. It is worth noting that finding this drop requires careful hyperparameter tuning. For example, we found using rectified linear units rather than sigmoids resulted in higher MSE for while having little effect on , making the drop in MSE less evident. Due to difficulties in autoencoder training and due to the increased complexity of the system at larger domain sizes, it can be difficult to determine the dimension from the drop in the autoencoder MSE alone. In Sec. III D, we further examine the issue of estimating manifold dimension by considering the performance of neural ODE models as a function of dimension—if the estimate of is poor (especially on the low side), then the dynamic model is poor.
The estimated dimension at of is in agreement with estimates by Ding et al.,50 Vlachas et al.,58 Yang and Radons,59 and Cvitanović et al.60 Ding et al. reached this conclusion by considering a finite number of unstable periodic orbits embedded in the long-time attractor and performing Floquet analysis on these solutions. Vlachas et al. also used an autoencoder approach, varying the dimension and considering the MSE at each dimension. In that work, they used different autoencoder architectures (e.g., convolutional autoencoder), with different activation functions (continuously differentiable exponential linear unit), and still found the same drop at . Yang and Radons59 showed that the Lyapunov vectors can be split into two categories by looking for a drop in the Lyapunov exponents. The positive and weakly negative Lyapunov exponents give Lyapunov vectors called “physical modes” that were shown to span a linear space that locally approximates the inertial manifold.59 This drop in Lyapunov exponents was found to be at in Ref. 60 and at in Ref. 59.
For larger domains, the dimension estimates reported here are consistent with Yang et al.51 In that work, they tracked the number of physical modes for larger domain sizes of the KSE. We showed in Ref. 10 that our estimates of agree with the predicted dimensions and linear scaling of dimensions shown in Yang et al.51 Linear scaling is also seen in Sondak and Protopapas,61 though their estimate of the manifold dimension is higher than ours or any of the other works mentioned here. They approximated dimension by training autoencoders with a high-dimensional latent space and then considered the variance of each latent variable. Then, was taken to be the number of latent variables with a variance above some threshold.
These results guide our selection of dimension in Secs. III B and III C. Furthermore, we find that once , the same MSE is achieved whether we use the full HNN or simply project onto PCA modes [by setting in Eq. (8)]. This observation indicates, for the KSE with periodic boundary conditions, that the leading PCA amplitudes fully parameterize the state. Note that this parametrization is nonlinear because a nonlinear decoder is still needed to reconstruct the full state. Therefore, in Secs. III B and III C, is the projection onto PCA modes and is a NN that approximates the remaining PCA coefficients from the leading ones. Table I shows the architectures for the autoencoders and for the NNs used for time evolution in Secs. III B and III C.
TABLE I. NN architectures used in Secs. III B and III C. S denotes a sigmoid activation.

| | Function | Shape | Activation |
|---|---|---|---|
| Encoder, L = 22, 44, 66 | χ | | Linear |
| Decoder, L = 22, 44 | χ̌ | | S:linear |
| Decoder, L = 66 | χ̌ | | S:S:linear |
| ODE | f, g | | S:S:S:linear |
| Discrete map | G | | S:S:S:linear |
B. Effect of dimension reduction on time evolution
Now that we have a map to the manifold coordinates and back, the next step is approximating the ODE. Here, we train three different types of ODEs to evaluate the impact of mapping the data to the manifold coordinates. We learn the RHS of the ODE in the manifold coordinates [g in Eq. (3)], in physical space [f in Eq. (2)], and in the space of the spatial Fourier coefficients, i.e., the discrete Fourier transform of the state. For the last two cases, there is no dimension reduction; therefore, χ is the identity in the first case and the discrete Fourier transform in the second, with χ̌ being, again, the identity in the first case and the inverse discrete Fourier transform in the second.
We train each of the NNs with data spaced 0.25 time units apart in these trials. This timescale is substantially shorter than the Lyapunov time, which decouples the issue of dimension from that of time resolution. Each NN is trained until the loss levels off using an Adam optimizer with a learning rate of that we drop to halfway through training. First, the autoencoder is trained and then the neural ODE. To avoid spurious results, we train five autoencoders and then 10 ODEs and select the combination of an autoencoder and ODE that provides the best short-time tracking. Due to random weight initialization and the stochastic nature of the optimization, we train multiple models and use the one that performs best. This is a common practice in neural-network training, as noted by Bishop in Sec. 9.6 of Ref. 62. As mentioned earlier, the models are trained to minimize the loss in Eq. (6). In previous work,10 we added an additional contribution to the loss for the time evolution problem that penalizes predicted states that fall outside the envelope of the joint PDF for energy production and dissipation. In the present work, we found excellent performance of the best models without the need for this additional term because we did not phase shift the data and thus did not have to work in “in-slice” time.
In Fig. 3, we compare the short- and long-time trajectories between the data at [Fig. 3(a)] and the NN models [Figs. 3(b)–3(d)]. For this domain size, the Lyapunov time is . The same initial condition, which is on the attractor, is used for all cases. The first model prediction, shown in Fig. 3(b), comes from (the low-dimensional ODE, with ); the second model prediction, shown in Fig. 3(c), comes from (the full physical space ODE); and the third model prediction, shown in Fig. 3(d), comes from (the full Fourier space ODE). For each of these figures, the first 50 time units (about Lyapunov times) are shown in the left column, and 450–500 time units (i.e., after the solution has evolved for more than 20 Lyapunov times) are shown on the right.
All three of the models are able to generate good predictions for times up to about ; this is a reasonable quantitative prediction horizon for a data-driven model of chaotic dynamics. After many Lyapunov times, a good model should still yield behavior consistent with the data. Indeed, at long times, the reduced model no longer matches the true solution but continues to exhibit realistic dynamics, as shown in Fig. 3(b). In Sec. III C, we discuss the long-time behavior of the reduced model in more detail. In contrast, the full state model predictions at long times, shown in Figs. 3(c) and 3(d), develop high-wavenumber striations, which eventually pollute the lower wavenumbers and are thus not faithful to the dynamics. This same behavior appears when modeling the full state evolution for the larger domain sizes and . Training high-dimensional neural ODEs is computationally more expensive, and the predictions from these models are worse.
To diagnose the origin of the spurious high-wavenumber behavior, we plot the magnitude of the Fourier modes as a function of time in Fig. 4 over a very long time interval (hundreds of Lyapunov times). In both cases, we observe a very slow linear growth in the high-wavenumber modes. The inset in Fig. 4(b) shows the time evolution of the RHS of the ODE for the real part of one of the high-wavenumber modes. At long times, this quantity settles to a constant, effectively a bias term. The time integration of this constant results in the linear growth seen in Fig. 4. We tested multiple activation functions in the NN architecture, and we see growth in the high wavenumbers regardless of whether sigmoids, hyperbolic tangents, or rectified linear units are used.
To further explain the origin of the bias that leads to the linear growth, we first note that in the true system, the high wavenumbers are strongly damped due to hyperdiffusivity in the KSE; therefore, data sampled from the attractor have negligible content in these wavenumbers. In particular, with the data given, the model has no information with which to learn that these modes are strongly damped. Furthermore, the prediction horizon in the training of the neural ODE is short (0.25 here); therefore, small errors in high wavenumbers have a negligible effect on the loss. Thus, the model does not learn that there should be a strong damping term that would overwhelm any small bias that arises in the RHS due to the stochastic nature of the initialization and training of neural net representations. Indeed, experiments show that explicitly introducing a damping term into Eq. (3) is sufficient to stabilize models against this slow linear growth. In Ref. 63, we show that this is the case when the damping is chosen to match the linear term in the KSE. The combination of these factors is why the trajectories of the high-dimensional models leave the attractor, while the trajectories of the reduced-dimensional models, where degrees of freedom that are strongly damped and thus have negligible amplitude on the attractor are absent, do not.
C. Effect of temporal data spacing on time evolution prediction
Now, we consider only reduced-dimension ODE models and address the dependence of model quality on the temporal spacing between data points. For L = 22, 44, and 66, we select the manifold dimensions mentioned in Sec. III A. We judge the model performance based on, first, short-time tracking and then long-time statistical agreement for all domain sizes. In these cases, autoencoders and ODEs were both trained with data spacing τ much larger than the value of 0.25 used in Sec. III B, while keeping all other training parameters the same (e.g., architecture, optimizer, dimension). Figure 5 compares a true short-time trajectory [Fig. 5(a)] to the model prediction of this trajectory [Fig. 5(b)] starting from the same initial condition. Qualitatively, the space-time plots exhibit similar behavior over around 80 time units. The magnitude of the difference is shown in Fig. 5(c). Here, we see the difference growing at around 40 time units due primarily to a growing spatial phase difference between the true and predicted results. Figure 5(d) shows the norm of this difference and the minimum value of the norm when shifting the prediction to any position in the domain. The minimum translated difference, in Fig. 5(d), remains much smaller than the standard Euclidean distance, indicating that, in this case, the phase difference causes most of the error.
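A small sketch of how the two error curves in Fig. 5(d) can be computed is given below. Here, the minimization is over grid-point translations (np.roll) rather than continuous Fourier shifts, an assumption made for simplicity:

```python
import numpy as np

def tracking_errors(u_true, u_pred):
    """Euclidean error and its minimum over all spatial translations of the prediction.

    u_true, u_pred: (n_times, N) arrays on a periodic grid. Comparing the two curves
    separates spatial phase drift from genuine error in the predicted pattern.
    """
    err = np.linalg.norm(u_true - u_pred, axis=1)
    shifted = np.stack([np.roll(u_pred, s, axis=1) for s in range(u_pred.shape[1])])
    err_min = np.linalg.norm(u_true[None] - shifted, axis=2).min(axis=0)
    return err, err_min
```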
We now consider ensemble average quantities to better understand the reconstruction quality of the models as a function of τ. In addition to the ODE models, we also consider discrete-time maps, h̃(t_i + τ) = G(h(t_i)). Details of the discrete-time map NN architecture are given in Table I.
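For comparison, a minimal sketch of such a discrete-time map, trained as a dense network on pairs of states separated by τ, is shown below; the layer sizes and hyperparameters are placeholders rather than those of Table I:

```python
import torch
import torch.nn as nn

class DiscreteMap(nn.Module):
    """Dense NN G mapping h(t_i) directly to an estimate of h(t_i + tau)."""
    def __init__(self, d_h, width=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_h, width), nn.Sigmoid(),
                                 nn.Linear(width, width), nn.Sigmoid(),
                                 nn.Linear(width, d_h))

    def forward(self, h):
        return self.net(h)

def train_discrete_map(G, h_t, h_tau, epochs=1000, lr=1e-3):
    """h_t, h_tau: (n_pairs, d_h) tensors of states separated by tau in time."""
    opt = torch.optim.Adam(G.parameters(), lr=lr)
    for _ in range(epochs):
        loss = torch.mean(torch.sum((G(h_t) - h_tau) ** 2, dim=1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G
```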
Figure 6 shows the root mean squared difference between the exact trajectory and ODE/discrete-time map models as a function of time, averaged over 1000 initial conditions and normalized by the root mean squared difference between random training data, which is denoted with . In this figure, the ODE models and discrete-time maps use data spaced . For , there is little improvement in the ODE models’ performance. ODE predictions at and below all track well for short times, with the error being halfway to random at around , and diverge at long times, with the error leveling off at around . Then, performance degrades sharply at (yellow curve). In all cases, the discrete-time map performs worse than the ODE with the same data spacing. Furthermore, although we do not show it here, the performance for discrete-time maps becomes even worse when trying to interpolate between prediction times. This is a significant drawback not seen when using ODEs.
A similar trend appears in the temporal autocorrelation. Figure 7 shows the temporal autocorrelation of at a given gridpoint averaged over space and 1000 initial conditions. The temporal autocorrelation of the neural ODE matches the exact temporal autocorrelation well for data spaced up to . Then, there is an abrupt deviation from the true solution at . Also, in Fig. 7, we show the temporal autocorrelation for discrete-time maps. At , discrete-time maps perform worse than all ODEs below , and predictions worsen when increasing .
The other half of evaluating a model is determining if trajectories stay on the attractor at long times. For this purpose, we consider the long-time joint PDF of the first (u_x) and second (u_xx) spatial derivatives of the solution. In the KSE, ⟨u_x²⟩ is the energy production and ⟨u_xx²⟩ is the dissipation,60 where ⟨·⟩ is the spatial average. We select these quantities because of their relevance to the energy balance and because two-dimensional joint PDFs are more challenging and important to reconstruct than one-dimensional PDFs. In Fig. 8, the joint PDF for the data is compared to various models. The colormap is a log scale, with low probability excursions in black and no data in white regions. When , the ODE model matches well, with differences primarily in the low probability excursions, while at , the discrete-time mapping appears more diffuse in the high probability region and matches poorly. At , the joint PDF for the ODE prediction still matches well, degrading once . This deterioration becomes more evident at .
We quantify the difference between true and predicted PDFs by computing the KL divergence

D_KL(P ∥ P̂) = ∫∫ P(u_x, u_xx) ln[ P(u_x, u_xx) / P̂(u_x, u_xx) ] du_x du_xx

between the true joint PDF, P, and the predicted PDF, P̂. We take the convention of setting the quantity under the integral to zero when P = 0 or P̂ = 0, as was done in Ref. 64. The KL divergence appears in Fig. 9 for various data spacings τ, where we normalize τ by the Lyapunov time. For all domain sizes, the joint PDFs for neural ODE models match the true PDF very well when data are spaced below Lyapunov times. In contrast, good reconstruction of the joint PDF with the discrete-time model, for , requires .
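A sketch of how the joint PDF and the KL divergence above can be estimated from snapshots is given below, using spectral derivatives on the periodic domain and NumPy histograms; the bin count and the function names are our own choices:

```python
import numpy as np

def joint_pdf(u, L, bins=50, value_range=None):
    """Empirical joint PDF of u_x and u_xx from snapshots u of shape (n_times, N)."""
    N = u.shape[1]
    ik = 2j * np.pi * np.fft.fftfreq(N, d=L / N)        # i*k for spectral differentiation
    ux = np.real(np.fft.ifft(ik * np.fft.fft(u, axis=1), axis=1))
    uxx = np.real(np.fft.ifft(ik**2 * np.fft.fft(u, axis=1), axis=1))
    P, xe, ye = np.histogram2d(ux.ravel(), uxx.ravel(), bins=bins,
                               range=value_range, density=True)
    return P, (xe, ye)

def kl_divergence(P, P_hat, edges):
    """D_KL between two joint PDFs on the same bins; terms with a zero PDF are dropped."""
    dA = np.diff(edges[0])[:, None] * np.diff(edges[1])[None, :]   # bin areas
    mask = (P > 0) & (P_hat > 0)
    return np.sum(P[mask] * np.log(P[mask] / P_hat[mask]) * dA[mask])
```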
D. Effect of model dimension on time evolution predictions
In Secs. III A–III C, the model dimension was fixed at values assumed to be the correct inertial manifold dimensions based on other studies.10,50,51 Here, we vary the dimension and consider the dynamical model performance, which provides more information on the necessary number of dimensions. For training, we again use time units of data spaced 0.25 time units apart and train 5 autoencoders and 15 neural ODEs. The data are normalized by subtracting the mean and dividing by the standard deviation before training and projected onto the full set of PCA modes before going into the autoencoder. Training was done with an Adam optimizer with learning rate scheduling from to halfway through training. For , we trained for 5000 epochs, and for , we trained for 500 epochs due to the computational cost of the larger networks. In all these trials, the loss plateaued at the end of training, but further training in the cases could lead to mild improvement. The code used for training these autoencoders is available on GitHub.65
Our first result here is the observation that, for models with a lower dimension than the “correct” one, reasonable approximations of the joint PDF of u_x and u_xx can be obtained but only if a nonlinear encoder is used for dimension reduction. This point is illustrated in Fig. 10(a), for the case , where we see good reconstruction of the joint PDF when using a nonlinear encoder. Accordingly, the results below all use a nonlinear encoder; architectures are reported in Table II. However, even though this statistic is reconstructed well at low dimensions, Fig. 10(b) shows that the MSE for the autoencoder, with a nonlinear encoder, still drops significantly at . This MSE is on test data normalized using the training data mean and standard deviation. As with the autoencoders used in Secs. III B and III C, we again performed manual hyperparameter tuning. Changing the activation functions from sigmoids to rectified linear units and removing the learning rate scheduler both resulted in higher MSE for .
TABLE II. NN architectures used in Sec. III D. S denotes a sigmoid activation; d is the ambient dimension and dh is the dimension of the manifold representation.

| | Function | Shape | Activation |
|---|---|---|---|
| Encoder, L = 22 | χ | d:500:dh | S:linear |
| Encoder, L = 44, 66 | χ | d:500:500:dh | S:S:linear |
| Decoder, L = 22 | χ̌ | dh:500:d | S:linear |
| Decoder, L = 44, 66 | χ̌ | dh:500:500:d | S:S:linear |
| ODE | g | dh:200:200:200:dh | S:S:S:linear |
Figure 11 shows the ensemble-averaged short-time tracking error of the models with the best tracking as the dimension varies. For all domain sizes, the recreation gradually improves and then becomes dimension-independent as dimension increases. This happens because the short-time tracking is directly related to the loss minimized in training the neural ODEs, and contains the same information as if is large enough. As mentioned before, for , , and , respectively, the “correct” manifold dimensions are , , and , and Fig. 11 shows the tracking error becoming dimension-independent near these values.
Now, we turn to long-time statistical recreation upon varying dimensions. Figure 12 shows the KL divergence from the true joint PDF for all the autoencoder and neural ODE pairs trained for each dimension and domain size. For reference, the dashed lines show when comparing the true joint PDF, used to compute for the models, to a separate joint PDF from the true system and evolved forward for the same amount of time as the models but from a different initial condition. This gives a baseline for how two empirical joint PDFs from the true system compare. In all the cases, when the dimension is low, the models do a poor job of reconstructing the joint PDF. When we are near the expected manifold dimension, i.e., , , and , for and , respectively, the typical model performance dramatically improves and the variation in performance between the models decreases, with for the best-performing models and only a factor of two or three larger for the worst. Furthermore, between true and best-model predictions is only slightly larger than that between the two time intervals for the true system, with particularly close agreement for and . In Fig. 13, we plot the true PDFs and the best-model PDFs at and , illustrating the excellent agreement. However, with a further increase of dimension, the variation in performance again becomes large, although the best models still perform well. Like the short-time prediction, the joint PDFs of the best models become dimension-independent, but here, we see that errors arise, particularly the slow linear growth in high wavenumbers, when training the neural ODE with unnecessary dimensions. Finally, recall that the error at high arises from spurious slow linear growth at high wavenumbers, which, as we saw in Sec. III B, arises in the absence of any autoencoder at all.
In summary, Fig. 12 highlights that there is a “sweet spot”, , where we see robust model performance. Below this, too few dimensions are retained; therefore, model performance is poor, and above it, spurious slow growth of high wavenumbers pollutes the long-time predictions. Therefore, lack of robustness in model performance is an indication that the dimension has been chosen incorrectly—either too low or too high. As noted previously, we also find that stabilizing long-time dynamics is possible with the addition of a damping term.
IV. CONCLUSION
Neural ODEs and undercomplete autoencoders can be used to create accurate data-driven models for time evolution of spatiotemporal chaotic dynamical systems, with dimensions near the expected manifold dimension, which for the present examples is in the approximate range of 8–28. Dimension reduction is a vital step in the process. Both the training process and the prediction of trajectories are more expensive with more dimensions, and additional dimensions lead to computational difficulties with the introduction of small spurious bias that leads to slow linear growth of high-wavenumber modes. With dimension reduction, we find that training with data spaced up to Lyapunov times gives accurate short-time tracking and long-time statistical reconstruction. Finally, we investigate the impact of varying the dimension and find that model performance improves with dimension and stops improving once the manifold dimension is reached. However, because of the spurious excitation of high wavenumbers, keeping too many dimensions hurts training, resulting in many poor models. While we have used the Kuramoto–Sivashinsky equation as a model system, this computational approach using data mediated neural ODEs might also be useful in the modeling of PDE systems, such as atmospheric phenomena,45 in the modeling of infinite-dimensional discrete-time maps,66 or in the modeling of other complex spatially distributed systems.67 In forthcoming work, we will show how this reduced-order modeling approach can be integrated into a reinforcement learning framework for control of spatiotemporally chaotic dynamics.
ACKNOWLEDGMENTS
This work was supported by the AFOSR (No. FA9550-18-1-0174) and ONR (No. N00014-18-1-2865) (Vannevar Bush Faculty Fellowship).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Alec J. Linot: Conceptualization (equal); Formal analysis (equal); Investigation (lead); Methodology (equal); Software (lead); Validation (lead); Visualization (lead); Writing – original draft (lead); Writing – review and editing (equal). Michael D. Graham: Conceptualization (equal); Formal analysis (equal); Funding acquisition (lead); Methodology (equal); Resources (lead); Supervision (lead); Writing – original draft (supporting); Writing – review and editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.