For a large class of dynamical systems, the optimally time-dependent (OTD) modes, a set of deformable orthonormal tangent vectors that track directions of instabilities along any trajectory, are known to depend “pointwise” on the state of the system on the attractor but not on the history of the trajectory. We leverage the power of neural networks to learn this “pointwise” mapping from the phase space to OTD space directly from data. The result of the learning process is a cartography of directions associated with strongest instabilities in the phase space. Implications for data-driven prediction and control of dynamical instabilities are discussed.

The optimally time-dependent (OTD) modes are a set of deformable orthonormal tangent vectors that track directions of instabilities along any trajectory of a dynamical system. Traditionally, these modes are computed by a time-marching approach, which involves solving multiple initial-boundary-value problems concurrently with the state equations. However, for a large class of dynamical systems, the OTD modes are known to depend “pointwise” on the state of the system on the attractor but not on the history of the trajectory. We propose a neural-network algorithm to learn this “pointwise” mapping from the phase space to OTD space directly from data. The neural network produces a cartography of directions of strongest instability in the phase space, as well as estimates for the leading Lyapunov exponents.

## I. INTRODUCTION

The theory of dynamical systems has a long history of trying to elucidate one of the most important concepts in science and engineering: instability. Dynamical instabilities are important because they can give rise to a variety of phenomena with unexpected, and even disastrous, consequences. They occur in fluid mechanics,^{1,2} climate dynamics,^{3} optics,^{4} and thermoacoustics^{5} and come in many shapes and forms. Perhaps the simplest of all are instabilities arising from a linear mechanism, whose investigation traditionally involves linearizing the governing equations around a fixed point and looking for unstable eigenvalues of the linearized problem.^{6} Generalization of this concept to periodic orbits led to the well-known Floquet theory, in which stability of periodic trajectories is ascertained by computing the spectrum of the monodromy matrix.^{6} The realization that episodes of transient growth cannot be predicted by linear or Floquet analyses came much later and in turn gave rise to the theory of non-normal and transient instabilities.^{7} This culminated with the introduction of the Lyapunov exponents and Lyapunov vectors, which provide a rigorous description of instability mechanisms in chaotic systems.^{8,9} Since then, considerable effort has been devoted to the development of algorithms for computation of Lyapunov exponents and Lyapunov vectors from data. Great strides have been made by the likes of Eckmann *et al.*,^{10} Sano and Sawada,^{11} Rosenstein *et al.*,^{12} Kantz,^{13} and Wolfe and Samelson,^{14} who proposed new methods to compute Lyapunov exponents and Lyapunov vectors from experimental measurements or improved on existing ones.

The key feature shared by these algorithms is that they monitor the fate of perturbations around a reference orbit for sufficiently long times, sometimes resorting to orthonormalization in order to prevent blow-up of the perturbation vectors. However, tracking the evolution of perturbations along a trajectory is nothing more than a time-marching approach in disguise. More importantly, it does not utilize the fact that for a large class of dynamical systems, the $i$th Lyapunov vector $u_i$ depends pointwise on the state of the system [with the mapping $x \longmapsto u_i(x)$ being uniquely determined by the phase-space point $x$], and consequently, the $i$th Lyapunov exponent—a quadratic form over $u_i(x)$—can be recovered by measure-averaging over many measurement points, rather than time-averaging over a long trajectory. This was pointed out by Ershov and Potapov,^{15} who recognized the benefits of using measure-averaging in lieu of time-averaging, as the former allows for the measurement points to be arranged in arbitrary order and even for them to belong to different trajectories. These results show that if one were able to compute the pointwise function $u_i(x)$ from data, then one would immediately have access to a complete picture of the directions of instabilities at any point in the phase space, as well as accurate estimates for the Lyapunov exponents.

As far as we know, the only attempt to compute the map $u_i(x)$ was made in Ref. 16, where we used Fenichel's theory of slow invariant manifolds^{17} to derive analytical expressions for $u_i(x)$ in situations exhibiting slow-fast dynamics. However, application of this method is limited because (a) the dimensionality of the system cannot be too large for the manifold analysis to be tractable and (b) the system must have a well-defined separation of time scales, short of which Fenichel’s theory becomes moot. Of course, the method being analytical, there is no data component to it. This is precisely what we set out to rectify in the present paper, with the introduction of a machine-learning algorithm that infers the map $u_i(x)$ from a large collection of state measurements.

Machine-learning (ML) algorithms have pervaded virtually every area of science and engineering because of two reasons. First, the amount of data available from experiments, field measurements, or numerical simulations has reached unprecedented levels. Second, ML algorithms have proven to be extremely versatile and powerful from the standpoint of extracting patterns and information from tremendously complex systems that would otherwise remain inaccessible. Applications of ML to dynamical systems include dimensionality reduction and flow-feature extraction,^{18–21} discovery of governing equations,^{22–24} design of optimal control strategies,^{25} turbulence closure modeling,^{26,27} integration of partial differential equations,^{28–30} and forecasting of dynamical instabilities in chaotic systems.^{31,32} The recent emergence of deep learning has led to a push in the latter direction, with recurrent neural networks^{33,34} and reservoir computing^{35,36} leading the way.

The learning algorithm that we propose in this paper produces a cartography of directions associated with strongest instabilities in the phase space, from which the leading Lyapunov exponents can be extracted. These directions coincide with the backward Lyapunov vectors of Legras and Vautard^{9} and the optimally time-dependent (OTD) modes of Babaee and Sapsis.^{37} (The equivalence between the two was established in Ref. 16.) The potential of the learning algorithm in problems related to prediction and control of transient instabilities and extreme events is considerable because the proposed method is fully data-driven, has no restricting assumptions other than invertibility, autonomy, ergodicity, and measure-invariance of the underlying dynamical system, and only requires state measurements as inputs.

## II. FORMULATION OF THE PROBLEM

### A. Preliminaries

We consider the autonomous dynamical system
$$\dot{x} = F(x), \tag{1}$$

where $x$ belongs to a compact Riemannian manifold $X$ endowed with the Borel $\sigma$-algebra and a measure $\mu$, the map $F : X \longrightarrow X$ is sufficiently smooth to ensure existence and uniqueness of solutions, and the overdot denotes differentiation with respect to the time variable $t$. We assume that the transformation
$$S^t : X \longrightarrow X, \qquad x_0 \longmapsto x(t; x_0), \tag{2}$$

sometimes referred to as the “flow map,” is invertible, measure-preserving, and ergodic. Measure-invariance is important from the standpoint of defining time-averages of scalar functions. (This is the well-known Birkhoff ergodic theorem.^{38}) Measure-invariance and ergodicity are important to guarantee that time-averages and measure-averages coincide,
$$\lim_{T \to \infty} \frac{1}{T} \int_0^T f(x(t))\,\mathrm{d}t = \int_X f(x)\,\mathrm{d}\mu(x), \tag{3}$$

for all $f \in L^2_\mu(X)$. In words, these assumptions ensure that trajectories asymptotically settle into an attractor $A \subset X$ (which may be steady, time-periodic, quasiperiodic, or chaotic) and remain on that attractor (i.e., there is no “switching” between attractors).
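The equivalence of time-averages and measure-averages can be illustrated numerically. The following sketch (our own toy example) uses the fully chaotic logistic map $x \mapsto 4x(1-x)$; the map is not invertible, so it does not satisfy every assumption above, but it is ergodic with the known invariant density $1/\big(\pi\sqrt{x(1-x)}\big)$, and both averages of $f(x) = x$ converge to $1/2$.

```python
import numpy as np

# Time-average of f along one long orbit of the logistic map x -> 4x(1-x).
def time_average(f, x0=0.3, n_steps=200_000, burn_in=1_000):
    x = x0
    for _ in range(burn_in):              # discard the transient
        x = 4.0 * x * (1.0 - x)
    total = 0.0
    for _ in range(n_steps):
        x = 4.0 * x * (1.0 - x)
        total += f(x)
    return total / n_steps

# Measure-average of f, sampling directly from the invariant measure:
# if u ~ Uniform(0,1), then x = sin^2(pi*u/2) has the arcsine density
# 1/(pi*sqrt(x(1-x))) on (0,1).
def measure_average(f, n_points=200_000, seed=0):
    u = np.random.default_rng(seed).random(n_points)
    x = np.sin(0.5 * np.pi * u) ** 2
    return float(np.mean(f(x)))

print(time_average(lambda x: x), measure_average(lambda x: x))  # both close to 1/2
```

Note that the measure-average uses independent samples in arbitrary order, exactly the freedom emphasized by Ershov and Potapov.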

Infinitesimal perturbations around a given trajectory obey the variational equation
$$\dot{v} = L(x; v), \tag{4}$$

where $v$ belongs to the tangent space of the manifold $X$ at point $x$, denoted by $T_xX$, and $L(x;v) \triangleq \mathrm{d}F(x;v)$ is the Gâteaux derivative of $F$ evaluated at $x$ along the direction $v$. In principle, the variational equation could be used to track directions of instabilities around trajectories. In practice, however, this is impossible, because any collection of perturbations $\{v_i\}_{i=1}^r$ evolved with (4) for a sufficiently long time would see the magnitude of its members grow or decay exponentially fast, and the angle between them rapidly vanish.

To compute a set of meaningful directions (or “modes”) from the variational equation, Babaee and Sapsis^{37} proposed to enforce orthonormality of the $v_i$’s at all times. One way to achieve this is to continuously apply the Gram–Schmidt algorithm to the collection $\{v_i\}_{i=1}^r$, starting with $v_1$ and moving down. Blanchard and Sapsis^{16} showed that the resulting vectors obey
$$\dot{u}_i = L(x; u_i) - \langle u_i, L(x; u_i)\rangle\, u_i - \sum_{k=1}^{i-1} \big[\langle u_k, L(x; u_i)\rangle + \langle u_i, L(x; u_k)\rangle\big]\, u_k, \tag{5}$$

for $i \in \{1,\ldots,r\}$, where the angle brackets denote a suitable inner product on $T_xX$. In the above, we recognize the variational equation (the left-hand side and the first term on the right-hand side), appended with terms enforcing the orthonormality constraint (the last two terms on the right-hand side). We also note that the summation index goes to $i-1$ rather than $r$ so that (5) assumes a lower-triangular form. (This reflects the very structure of the Gram–Schmidt algorithm.) The $u_i$’s have been referred to as the “OTD modes” and the collection $\{u_i\}_{i=1}^r$ as the “OTD subspace.”^{37} The terms “subspace” and “modes” are appropriate because the span of the collection $\{u_i\}_{i=1}^r$ is a real vector space, for which the $u_i$’s form an orthonormal basis.
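The time-marching-with-orthonormalization picture can be sketched in a few lines for a linear system $\dot{x} = Ax$, for which the operator $L$ is constant. The following minimal illustration (our own construction, with a discretized Gram–Schmidt step via QR rather than the continuous formulation of (5)) evolves $r = 2$ tangent vectors and re-orthonormalizes at every step; the modes converge to the most unstable eigenspace of $A$, and the Lagrange multipliers $\langle u_i, A u_i\rangle$ recover the two leading exponents.

```python
import numpy as np

# Evolve r tangent vectors under the (constant) linearized operator A
# and re-orthonormalize at every step; a discrete-time stand-in for the
# continuous Gram-Schmidt procedure behind eq. (5).
A = np.diag([1.0, 0.5, -1.0])        # eigenvalues 1.0 > 0.5 > -1.0
rng = np.random.default_rng(1)
U = np.linalg.qr(rng.standard_normal((3, 2)))[0]   # r = 2 orthonormal vectors

dt = 1e-3
for _ in range(20_000):
    U = U + dt * A @ U               # Euler step of the variational equation
    U, _ = np.linalg.qr(U)           # re-orthonormalize (Gram-Schmidt/QR)

lams = np.diag(U.T @ A @ U)          # Lagrange multipliers <u_i, A u_i>
print(lams)                          # close to [1.0, 0.5]
```

For this hyperbolic fixed point, the first mode aligns with the most unstable eigenvector and the two-mode subspace with the most unstable eigenspace, as discussed below.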

A key property of the OTD modes is that they and the $v_i$’s span the same subspace. The first OTD mode aligns with the most unstable direction, just like $v_1$ does. The second OTD mode is not free to align with the second-most unstable direction because it must remain orthogonal to $u_1$. However, the subspace spanned by $u_1$ and $u_2$ coincides with the two-dimensional subspace that is most rapidly growing (this is also the subspace spanned by $v_1$ and $v_2$). For hyperbolic fixed points, the OTD subspace aligns with the most unstable eigenspace of the associated linearized operator.^{37} For time-dependent trajectories, the OTD subspace aligns with the left eigenspace of the Cauchy–Green tensor, which characterizes transient instabilities.^{39} As a result, the $i$th Lyapunov exponent $\lambda_i$ can be recovered from the $i$th OTD mode,
$$\lambda_i = \lim_{T \to \infty} \frac{1}{T} \int_{t_0}^{t_0+T} \big\langle u_i(t), L(x(t))\, u_i(t)\big\rangle\,\mathrm{d}t. \tag{6}$$

The fact that the OTD modes track directions of instabilities along any trajectory has been utilized on multiple occasions, including in the context of prediction of extreme events in chaotic systems^{40} and design of low-dimensional controllers for stabilization of unsteady flows.^{41,42}

The OTD system (5) is a set of time-dependent differential equations, which must be solved concurrently with the dynamical system (1). The standard approach for infinite-dimensional systems is to discretize (1) and (5) in space and advance each using a time-stepping scheme. The dimension $d$ of the phase space after discretization may be quite large, with the discretized state potentially having thousands or even millions of degrees of freedom. In such cases, computation of the first $r$ OTD modes involves solving $r$ $d$-dimensional differential equations (the OTD equations), plus a $d$-dimensional differential equation for the state itself. For very large $d$, this procedure is computationally onerous, and alternative approaches might be desirable.

### B. Formulation of the learning problem

For dynamical systems satisfying the assumptions made earlier (autonomy, invertibility, ergodicity, and measure-invariance), the OTD modes asymptotically converge to a set of vectors defined at every point on the attractor.^{15,16,43} In other words, in the post-transient regime, $u_i$ only depends on the state $x$, but not on the history of the trajectory or its own initial condition $u_i(t_0)$. Hence, we may cease to view $u_i$ as being parametrized by $t$ and instead view it as a graph from the phase space to the tangent space,
$$u_i : X \longrightarrow T_xX, \qquad x \longmapsto u_i(x). \tag{7}$$

In this context, the collection $\{u_i(x)\}_{i=1}^r$ has been referred to as the “stationary Lyapunov basis” (SLB) at point $x$ (Ref. 15).

The existence of the SLB at every point $x$ of the attractor was established separately by Ershov and Potapov^{15} and Goldhirsch *et al.*^{43} as a consequence of the Oseledec theorem.^{44} The question of uniqueness and continuity with respect to $x$ was also addressed by Ershov and Potapov.^{15} They showed that for a given $x$, more than one SLB may exist, but only one is stable with respect to perturbations of the underlying state. So, the OTD modes $u_i(x)$ are uniquely determined by the point $x$ in the phase space. They also showed that if the Lyapunov spectrum is not quasidegenerate (i.e., all Lyapunov exponents are distinct, and there is no “crossing” of Lyapunov exponents under small perturbations), then the graph (7) is continuous in $x$. Uniqueness and continuity are important because they allow for the possibility of representing the graph (7) as a superposition of smooth basis functions, or as a realization of a Gaussian process. Once the graph (7)—or an approximation of it—is known, the Lyapunov exponents can be computed by replacing time-averaging with measure-averaging in (6),
$$\lambda_i = \int_X \big\langle u_i(x), L(x)\, u_i(x)\big\rangle\,\mathrm{d}\mu(x). \tag{8}$$

A promising approach is to learn the mapping (7) from data. This requires several ingredients. First, we assume that we have a large collection of “snapshots” $\{x_n\}_{n=1}^N$ available for the state. Each $x_n$ must belong to the attractor, but not necessarily to the same trajectory, a consequence of the use of measure-averaging. Second, we assume that we have a mechanism to compute the vector field $F(x_n)$ and the action of the linearized operator $L$ at $x_n$ in any direction $v$. Third, we need to eliminate the dependence of the OTD system (5) on time. This is done by applying the chain rule to the left-hand side of (5), resulting in
$$\mathrm{d}u_i(x; F(x)) = L(x; u_i) - \langle u_i, L(x; u_i)\rangle\, u_i - \sum_{k=1}^{i-1} \big[\langle u_k, L(x; u_i)\rangle + \langle u_i, L(x; u_k)\rangle\big]\, u_k, \tag{9}$$

where $\mathrm{d}u_i(x; F(x))$ is the Gâteaux derivative of $u_i$ evaluated at $x$ along the direction $F(x)$. There is no explicit temporal dependence in (9) so that the variable $x$ should no longer be thought of as a point on a particular time-dependent trajectory, but rather as any point on the attractor. System (9) may also be thought of as a partial differential equation on $X$.

For computational purposes, it is useful to consider the discretized counterpart of (9),
$$\nabla_x u_i\, F(x) = L(x)\, u_i - \langle u_i, L(x)\, u_i\rangle\, u_i - \sum_{k=1}^{i-1} \big[\langle u_k, L(x)\, u_i\rangle + \langle u_i, L(x)\, u_k\rangle\big]\, u_k, \tag{10}$$

where $x$ and $u_i$ belong to $\mathbb{R}^d$, $L(x) \triangleq \nabla_x F(x)$ is the Jacobian of the vector field $F : \mathbb{R}^d \longrightarrow \mathbb{R}^d$, and $\nabla_x u_i$ is the Jacobian of $u_i$ with respect to $x$. Although not explicitly shown in (10), the vector $u_i$ should be understood as $u_i(x)$. We are now in a position to state the learning problem:

*Given a dataset* $\{x_n\}_{n=1}^N$ *of snapshots belonging to the attractor* $A$, *and a mechanism to compute* $F(x_n)$ *and the action of* $L(x_n)$, *find the collection of graphs* $\{x \longmapsto u_i(x)\}_{i=1}^r$ *that best satisfies* (10) *at every* $x_n$.

In what follows, we solve the learning problem by a deep-learning approach based on neural networks.

## III. LEARNING THE OTD MODES FROM DATA

We will find it convenient to operate in the “big-data” regime, so we assume that the dataset used in the learning algorithm contains a very large number of snapshots. Before we proceed, we reiterate the fundamental assumptions that are made about the data. The underlying dynamical system from which data are collected should be autonomous, invertible, measure-preserving, and ergodic, and should have a nonquasidegenerate Lyapunov spectrum. As discussed in Sec. II B, these assumptions are key to ensure existence, uniqueness, and continuity of the SLB in the phase space.

### A. Network architecture

To learn the graphs $\{u_i\}_{i=1}^r$ from the collection of snapshots, we employ a neural-network approach. This is appropriate because each $u_i$ is a continuous function of $x$, which allows us to leverage the universal approximation theorem,^{45} which states that any continuous function can be approximated arbitrarily well by a sufficiently large and deep neural network.

Neural networks do suffer from several shortcomings. They come with heavy computational requirements, are sensitive to the choice of hyperparameters, and do not always perform well on noisy data. However, the pros largely outweigh the cons. First, neural networks are known to be more powerful and more expressive than simpler (neighborhood-based or autoregressive) methods. Second, as discussed in Sec. I, neural networks have been used successfully to solve partial differential equations, which is essentially what the learning problem proposed in Sec. II B amounts to. Third, neural networks are generally trained by some variant of stochastic gradient descent and thus do not suffer from the requirement that the entire dataset be loaded in memory at training time. With an eye on large-scale deep-learning applications, neural networks are, therefore, the *de facto* correct choice for our learning problem.

In what follows, we refer to the learned OTD modes as the “deep OTD (dOTD) modes.”

#### 1. Overview of the network architecture

We find it natural to assign to each OTD mode its own neural network $u_i(x; \theta_i)$, where $\theta_i$ denotes the parameters (weights and biases) of the $i$th network. We use the same fully-connected feed-forward architecture with $L$ hidden layers for all OTD modes [Fig. 1(a)]. (The input and output layers are numbered $0$ and $L+1$, respectively.) The loss function for the $i$th network is specified as
$$\ell_i^{\mathrm{pde}}(\theta_i) = \frac{1}{N} \sum_{n=1}^{N} \Big\| \nabla_x u_i(x_n; \theta_i)\, F(x_n) - L(x_n)\, u_i + \langle u_i, L(x_n)\, u_i\rangle\, u_i + \sum_{k=1}^{i-1} \big[\langle u_k, L(x_n)\, u_i\rangle + \langle u_i, L(x_n)\, u_k\rangle\big]\, u_k \Big\|_2^2, \tag{11}$$

which is nothing more than the residual of the $i$th OTD equation in system (10). We note that any SLB is a global minimizer of $\ell_i^{\mathrm{pde}}$, with $\ell_i^{\mathrm{pde}}$ being trivially zero. So, this choice of loss function gives rise to no estimation error, and the model is only limited by the hypothesis class (i.e., the set of functions within reach of the neural network for a given number of layers and neurons) and the tolerance specified for the optimization algorithm. Those are commonly referred to as “approximation error” and “optimization error,” respectively.^{46}
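The structure of this residual is easiest to see for a hyperbolic fixed point and an $x$-independent candidate mapping, for which $\nabla_x u_i = 0$ and the loss reduces to an algebraic expression. The sketch below (our own names; first mode only) shows that the residual vanishes exactly when the candidate is a unit eigenvector of $L$ and is nonzero otherwise.

```python
import numpy as np

# Residual of the first OTD equation in (10) for a constant candidate
# mapping u_1(x) = u.  Since grad_x u_1 = 0, the residual reduces to
# L u - <u, L u> u, which vanishes exactly when u is a unit eigenvector.
def residual_mode1(u, L_mat):
    u = u / np.linalg.norm(u)
    Lu = L_mat @ u
    return Lu - (u @ Lu) * u

L_mat = np.diag([1.0, 0.5, -1.0])
eigvec = np.array([1.0, 0.0, 0.0])    # eigenvector -> zero residual
other  = np.array([1.0, 1.0, 0.0])    # not an eigenvector -> nonzero residual

print(np.linalg.norm(residual_mode1(eigvec, L_mat)))
print(np.linalg.norm(residual_mode1(other, L_mat)))
```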

Equation (11) shows that the loss function for the $i$th dOTD mode depends on the first $i-1$ dOTD modes. This raises the question of the order in which the dOTD modes ought to be learned. In what follows, we opt for a “sequential” approach [Fig. 1(b)], whereby the parameters $\theta_i$ are optimized sequentially (starting with $\theta_1$ and moving down), and the outputs of the first $i-1$ networks are fed into the $i$th network as dummy inputs. (By “dummy inputs,” we mean quantities that are fed into the neural network for the sole purpose of computing the loss function.) We find this architecture to be intuitive because it mimics the triangular structure of the Gram–Schmidt algorithm. Our numerical experiments suggest that this approach is more stable (compared to other approaches described below), in the sense that it facilitates convergence of the optimization algorithm. The only issue has to do with error accumulation, arising as a result of using the outcomes of the first $i-1$ optimizations to compute the $i$th dOTD mode. However, this is easily fixed by tightening the tolerance of the optimization algorithm or doing multiple passes of training with decreasing tolerance.
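The sequential strategy can be sketched on a drastically simplified toy problem (our own construction, not the network itself): the trajectory is a hyperbolic fixed point, each mode is a single constant vector rather than a neural network, the optimizer is projected gradient descent with finite-difference gradients, and a Lyapunov steering term (anticipating Sec. III A 3) with an illustrative weight of 3.0 singles out the most unstable direction. Mode 2 is learned only after mode 1 and receives mode 1 as a fixed “dummy input.”

```python
import numpy as np

A = np.diag([1.0, 0.5, -1.0])            # linearized operator at the fixed point

# Squared residual of (10) for constant candidate u, including the
# lower-triangular coupling to previously learned modes, minus a
# Lyapunov steering term (illustrative weight 3.0).
def loss(u, prev):
    Lu = A @ u
    res = Lu - (u @ Lu) * u
    for p in prev:
        res = res - ((p @ Lu) + (u @ (A @ p))) * p
    return res @ res - 3.0 * (u @ Lu)

def project(u, prev):                    # Gram-Schmidt against earlier modes
    for p in prev:
        u = u - (u @ p) * p
    return u / np.linalg.norm(u)

def learn_mode(prev, seed, lr=0.05, n_iter=5_000, h=1e-6):
    rng = np.random.default_rng(seed)
    u = project(rng.standard_normal(3), prev)
    for _ in range(n_iter):
        g = np.array([(loss(u + h * e, prev) - loss(u - h * e, prev)) / (2 * h)
                      for e in np.eye(3)])     # finite-difference gradient
        u = project(u - lr * g, prev)
    return u

modes = []
for i in range(2):                       # sequential: mode 1, then mode 2
    modes.append(learn_mode(modes, seed=i))
lams = [float(u @ A @ u) for u in modes]
print(lams)                              # close to [1.0, 0.5]
```

The previously learned modes enter only through the loss and the projection, mirroring the role of the dummy inputs in the sequential architecture.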

Of course, the “sequential” approach is not the only option. Instead, one could solve the $r$ optimization problems concurrently, whereby the optimization algorithm performs one iteration for each neural network before updating the parameters. Each iteration would need to be done sequentially (i.e., starting with $\theta_1$, then $\theta_2$, etc.), and “coupling” between the dOTD modes would be done by passing the first $i-1$ parameter vectors at the current iteration to the $i$th neural network. Alternatively, one could combine the $r$ loss functions (11) and solve for all of the dOTD modes in a single optimization pass using the combined loss function. However, these two approaches appear to cause difficulty for the optimization algorithm, both in terms of execution speed and its ability to converge.

Use of the loss function (11) alone, which is solely based on the residual of the OTD system, might be problematic for two reasons. First, as discussed in Sec. II B, any SLB is a global minimizer of (11), but only one of them is stable (this is the SLB to which the OTD modes converge when computed with a time-marching approach). So, we need a mechanism to ensure that the optimization algorithm converges to the stable SLB and not to any of the unstable ones. Second, there may be other (possibly local) minimizers of (11) besides SLBs, so we also need a mechanism that prevents the optimization routine from getting trapped in an irrelevant minimum.

#### 2. Ensuring orthonormality of the dOTD modes

We begin with the question of how to ensure that the optimization algorithm converges to an SLB, with no consideration yet for whether that SLB is stable or not. We first note that minimization of the loss function (11) does not guarantee orthonormality of the resulting dOTD modes. [For example, the trivial solution $u_i = 0$ is a global minimizer of (11).] The reason is that the terms responsible for orthonormalizing the OTD modes in the time-dependent problem (5) are no longer sufficient to enforce this constraint. So, orthonormality must be enforced explicitly in the neural network. This is important because our numerical experiments suggest that the SLBs are the only orthonormal minimizers of (11). In other words, explicitly enforcing orthonormality of the dOTD modes helps the optimization algorithm systematically converge to an SLB, rather than some other irrelevant minimum.

Enforcing orthonormality of the dOTD modes can be realized in two ways. One approach is to embed Gram–Schmidt orthonormalization immediately after the last layer of the network so that the dOTD modes are orthonormal by construction. Another approach is to append to the loss function (11) a regularization term,
$$\ell_i^{\mathrm{reg}}(\theta_i) = \frac{1}{N} \sum_{n=1}^{N} \Big\{ \alpha \big(\|u_i(x_n; \theta_i)\|_2^2 - 1\big)^2 + \sum_{k=1}^{i-1} \beta_k\, \big\langle u_i(x_n; \theta_i), u_k(x_n; \theta_k)\big\rangle^2 \Big\}. \tag{12}$$

The first term in the curly brackets enforces normality of $u_i$, and the second term enforces orthogonality of $u_i$ and $u_k$ ($k < i$). The regularization parameters $\alpha$ and $\beta_k$ should be chosen on a case-by-case basis, which offers less flexibility than the Gram–Schmidt approach. We note that the regularization approach and the Gram–Schmidt approach have no consequence for the estimation error because their respective loss functions are exactly zero for any SLB.

In practice, we have found that the Gram–Schmidt approach is more robust than the regularization approach, in the sense that the former requires fewer iterations for the optimization algorithm to converge to an SLB. We note that each iteration in the Gram–Schmidt approach requires computing the gradients of the Gram–Schmidt layer with respect to the network parameters by backpropagation, which is significantly more expensive than computing the gradients of the regularizing terms in (12). However, this is a burden worth carrying, given that we have found cases in which the regularization approach failed to converge (for a given iteration budget) while the Gram–Schmidt approach succeeded, and no cases suggesting otherwise.
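The Gram–Schmidt output layer admits a compact sketch. The function below (in NumPy, for illustration; in a deep-learning framework this layer would be differentiated through by automatic differentiation) takes the raw network outputs as columns of a matrix and returns vectors that are orthonormal by construction, mimicking the lower-triangular structure of the algorithm.

```python
import numpy as np

# Gram-Schmidt "output layer": orthonormalize the columns of W in
# order, so mode i is only modified by projections on modes 1..i-1.
def gram_schmidt_layer(W):
    d, r = W.shape
    U = np.empty((d, r))
    for i in range(r):
        u = W[:, i].astype(float)
        for k in range(i):                     # subtract earlier projections
            u = u - (U[:, k] @ u) * U[:, k]
        U[:, i] = u / np.linalg.norm(u)
    return U

W = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 2.0]])   # raw network outputs
U = gram_schmidt_layer(W)
print(U.T @ U)    # identity: orthonormal columns by construction
```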

#### 3. Ensuring uniqueness of the dOTD modes

Now that we have designed a mechanism ensuring that the optimization algorithm converges to an SLB, we must address the question of how to isolate the stable SLB from all of the unstable ones. We begin with a simple example that provides insight into this issue. Consider a case in which the trajectory of interest is a hyperbolic fixed point, denoted by $x_e$. Theorem 2.3 in Babaee and Sapsis^{37} states that any $r$-dimensional eigenspace of the operator $L(x_e)$ is an SLB of $x_e$ and that the only stable SLB is that associated with the $r$ most unstable eigenvalues of $L(x_e)$. In this work, “stability” was determined by examining the time-dependent problem governing the evolution of a perturbed SLB. In the learning problem, however, we have eliminated any notion of temporality from the OTD system. Hence, for the case of a hyperbolic fixed point, the neural network, in its current manifestation, may converge to any of the $d$-choose-$r$ eigenspaces of $L(x_e)$ and not necessarily to the most unstable one. This is problematic because the OTD modes draw their power from their ability to track directions of greatest instabilities. Naturally, we wish our learning algorithm to also have this feature.

To make sure that the learning algorithm returns the SLB associated with directions of strongest instabilities, we use a criterion based on Lyapunov exponents. As discussed in Sec. II B, the $i$th Lyapunov exponent $\lambda_i$ can be computed as a measure-average of the Lagrange multiplier $\langle u_i(x), L(x)\, u_i(x)\rangle$. With a finite-size dataset, however, we can do no better than approximating $\lambda_i$ by a finite sum over the data points. If the dataset is composed of multiple long trajectories initialized on the attractor according to some probability distribution (in general, we want the initial conditions to be independent and identically distributed), we have that
$$\lambda_i = \lim_{T \to \infty} \frac{1}{N} \sum_{n=1}^{N} \big\langle u_i(x_n), L(x_n)\, u_i(x_n)\big\rangle, \tag{13}$$

where $x_n$ is the state of the $n$th trajectory after some long time $T$. The above limiting statement holds only when $T \to \infty$, but in practice, we merely require that $T$ be large enough so as to ensure convergence of the distribution of initial conditions to an invariant one. Equation (13) also holds when the dataset is composed of uniformly-sampled snapshots collected in the asymptotic regime of a single long trajectory, in which case (13) is equivalent to standard time-averaging. (Note that in either case, the snapshots may be arranged in any arbitrary order.) Equation (13) can be modified to account for the fact that $u_i$ is modeled as a neural network. We define
$$\hat{\lambda}_i(\theta_i) \triangleq \frac{1}{N} \sum_{n=1}^{N} \big\langle u_i(x_n; \theta_i), L(x_n)\, u_i(x_n; \theta_i)\big\rangle \tag{14}$$

as the “learned” Lyapunov exponent associated with the $i$th dOTD mode. This is the best approximation of $\lambda_i$ available, given the constraints related to finiteness of the dataset and representability of the OTD modes with neural networks.
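The finite-sum estimator is a one-liner. In the sketch below (with names of our choosing: `u_map` stands for the learned mode and `L_map` for the reconstructed Jacobian action), the estimator is checked on a fixed point of $\dot{x} = Ax$, where the exact first mode is the leading unit eigenvector and the estimate returns its eigenvalue.

```python
import numpy as np

# Finite-sum Lyapunov estimate: average the Lagrange multiplier
# <u_i(x_n), L(x_n) u_i(x_n)> over the snapshots.
def lyapunov_estimate(snapshots, u_map, L_map):
    vals = [u_map(x) @ L_map(x) @ u_map(x) for x in snapshots]
    return float(np.mean(vals))

A = np.diag([1.0, 0.5, -1.0])
snapshots = [np.zeros(3)] * 10                 # fixed-point "data"
u_map = lambda x: np.array([1.0, 0.0, 0.0])    # exact first mode
L_map = lambda x: A

print(lyapunov_estimate(snapshots, u_map, L_map))   # 1.0
```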

To steer the optimization toward the stable SLB, we append to the loss function (11) a Lyapunov regularization term,
$$\ell_i^{\mathrm{lyap}}(\theta_i) = -\sigma\big(\hat{\lambda}_i(\theta_i)\big), \tag{15}$$
that penalizes small Lyapunov exponents. Here, $\sigma$ is a monotonically increasing function on $\mathbb{R}$ that exacerbates differences between the $\hat{\lambda}_i$’s. Possible choices for $\sigma$ include $\sigma(a) = a$, $a^3$, $\sinh(a)$, and $-\exp(-a)$. Lyapunov regularization guarantees that the SLB to which the optimization algorithm converges is such that the $\hat{\lambda}_i$’s come in decreasing order; that is, the $i$th dOTD mode picks up the $i$th-largest Lyapunov exponent. Lyapunov regularization thus ensures that the dOTD modes learned by the neural network coincide with the unique stable solution of the time-dependent OTD system (5).

Adding Lyapunov regularization to the loss function (11) has the effect of introducing an estimation error because the value of $\ell_i^{\mathrm{lyap}}(\theta_i)$ for the optimal $\theta_i$ is generally nonzero, except in very specific cases [for example, if $\sigma(a) = a$ and $\hat{\lambda}_i$ is zero]. No estimation error is a feature worth preserving because it allows us to focus our attention on the approximation error and the optimization error, thereby facilitating design and optimization of the neural network. To this effect, we specify an optimization schedule so that Lyapunov regularization is “switched off” after a certain number of iterations. This allows us to “steer” the optimization algorithm into a favorable direction initially, while ensuring no estimation error for iterations subsequent to relaxation of Lyapunov regularization.

### B. Attractor reconstruction

The last ingredient needed to make the method fully data-driven is a mechanism to reconstruct the vector field $F(x_n)$ and the action of the Jacobian matrix $L(x_n)$ from the collection of snapshots $\{x_n\}_{n=1}^N$. We note that if the governing equations of the dynamical system are known, then there is no need for such a mechanism because $F(x_n)$ and $L(x_n)$ can be evaluated directly from the equations of motion. We also note that if we can only record some observable $f(x)$ of the state, but not the state itself, then we can use delay embedding to reconstruct the attractor and compute the dOTD modes in the embedded space. (This case will not be considered in this work.)

As discussed in Sec. I, discovery and reconstruction of governing equations from data is an active field of research. Any of the state-of-the-art methods could be applied to the present problem, each introducing its own level of complexity. The key issue is that reconstruction of $F(x)$ can be done offline, regardless of the dimensionality of the system. In other words, computation of $F(x_n)$ for each $x_n$ may be viewed as a preprocessing step and, therefore, does not add to the computational burden related to optimizing the parameters of the neural network. In what follows, we opt for perhaps the simplest of all approaches. We assume that the snapshots are sampled along a single long trajectory with a uniform and sufficiently small sampling time-step $\Delta t_s$ so that we may approximate $F(x_n)$ as a standard Euler-forward finite difference,
$$F(x_n) \approx \frac{x_{n+1} - x_n}{\Delta t_s}. \tag{16}$$

Higher-order finite-difference formulas (e.g., Adams–Bashforth, Adams–Moulton, or backward differentiation formulas) may be used if higher accuracy is desired. Finite differences have the advantage of being extremely cheap to compute, even for high-dimensional systems. Implementation is straightforward, and the requirement that $\Delta t_s$ be small is far from drastic.
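The Euler-forward reconstruction can be verified on a system whose vector field is known. The toy check below (our own example: a lightly damped linear oscillator, with snapshots generated by a fine RK4 integrator) recovers $F(x_n)$ to first order in the sampling time-step.

```python
import numpy as np

# Reconstruct F(x_n) from uniformly sampled snapshots via
# F(x_n) ~ (x_{n+1} - x_n) / dt_s, and compare with the exact F = Ax.
A = np.array([[0.0, 1.0], [-1.0, -0.1]])       # damped oscillator
F = lambda x: A @ x

def rk4_step(x, dt):                            # fine integrator for the "truth"
    k1 = F(x); k2 = F(x + 0.5 * dt * k1)
    k3 = F(x + 0.5 * dt * k2); k4 = F(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

dt_s = 1e-3                                     # sampling time-step
x = np.array([1.0, 0.0])
snapshots = [x]
for _ in range(1_000):
    x = rk4_step(x, dt_s)
    snapshots.append(x)
X = np.array(snapshots)

F_hat = (X[1:] - X[:-1]) / dt_s                 # Euler-forward estimate
err = np.max(np.linalg.norm(F_hat - X[:-1] @ A.T, axis=1))
print(err)                                      # O(dt_s) error
```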

To compute the action of the Jacobian matrix $L(x)$ from data, we employ the classical algorithm proposed independently by Eckmann *et al.*^{10} and Sano and Sawada.^{11} First, we scan the dataset to identify the $K$ nearest neighbors of each datapoint $x_n$. The nearest neighbors of $x_n$ are defined as those points $x_k$ of the orbit (past or future) that are contained in a ball of radius $\gamma$ centered at $x_n$,
$$\|x_n - x_k\| \leq \gamma. \tag{17}$$

If $\gamma$ is sufficiently small, then each vector $v_k^n = x_n - x_k$ may be viewed as a perturbation vector from the reference orbit. We, therefore, have
$$\dot{v}_k^n \approx L(x_n)\, v_k^n, \tag{18}$$

which allows us to compute the action of the Jacobian matrix $L(x_n)$ on the vectors $\{v_k^n\}_{k=1}^K$. Now, the critical step is to note that the vectors $\{v_k^n\}_{k=1}^K$ belong to the tangent space at point $x_n$ and so do the OTD modes. (In fact, the OTD modes form a basis of this space when $r = d$.) So, if we stack up the vectors $\{v_k^n\}_{k=1}^K$ into a matrix $V_n \in \mathbb{R}^{d \times K}$, then the least-square fit
$$u_i(x_n) \approx V_n V_n^{\dagger}\, u_i(x_n) \tag{19}$$

should be a reasonably good approximation for the dOTD modes. Here, $V_n^{\dagger}$ is the pseudoinverse of $V_n$. (We note that the least-square approach allows for $K$ exceeding the dimension of the phase space $d$.) From (19), we can compute the action of the linearized operator on the dOTD modes as
$$L(x_n)\, u_i(x_n) \approx \Delta V_n V_n^{\dagger}\, u_i(x_n), \tag{20}$$

where $\Delta V_n$ is a $d$-by-$K$ matrix with columns $(v_k^{n+1} - v_k^n)/\Delta t_s$. Equation (20) requires no information other than the snapshot data and can be used to evaluate the loss function (11).
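The reconstruction of the Jacobian action can be checked on a linear system, where $L(x) = A$ everywhere and the recovery is essentially exact. The sketch below (our own toy example; for brevity, the "neighbors" are generated directly inside a small ball rather than found by a nearest-neighbor search over an orbit) forms $V_n$ and $\Delta V_n$ from perturbation vectors advanced one sampling step and recovers $A$ from $\Delta V_n V_n^{\dagger}$.

```python
import numpy as np

# Neighbor-based reconstruction of the Jacobian: perturbation vectors
# v_k = x_n - x_k, advanced one sampling step, give Delta_V ~ L(x_n) V,
# so L_hat = Delta_V V^+ (pseudoinverse).  Toy check on xdot = Ax.
A = np.array([[0.0, 1.0], [-1.0, -0.1]])
dt_s = 1e-3

def advance(x):                        # one sampling step (RK4)
    k1 = A @ x; k2 = A @ (x + 0.5 * dt_s * k1)
    k3 = A @ (x + 0.5 * dt_s * k2); k4 = A @ (x + dt_s * k3)
    return x + dt_s / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

rng = np.random.default_rng(0)
x_n = np.array([1.0, 1.0])
neighbors = x_n + 1e-3 * rng.standard_normal((8, 2))   # K = 8 points in a small ball

V  = np.stack([x_n - x_k for x_k in neighbors], axis=1)             # d x K
Vp = np.stack([advance(x_n) - advance(x_k) for x_k in neighbors], axis=1)
Delta_V = (Vp - V) / dt_s
L_hat = Delta_V @ np.linalg.pinv(V)

print(np.max(np.abs(L_hat - A)))       # small reconstruction error
```

Note that $K = 8$ exceeds $d = 2$ here, which the pseudoinverse handles without modification.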

For low-dimensional systems, it is possible to form the matrix $\Delta V_n V_n^{\dagger} \in \mathbb{R}^{d \times d}$ explicitly and store it in memory. This computation can be done offline for every datapoint $x_n$, and consequently, the computational burden associated with reconstruction is nil for the neural network. For high-dimensional systems, however, forming and storing $\Delta V_n V_n^{\dagger}$ is not possible, so (20) must be computed online [i.e., every time the loss function (11) is evaluated], which introduces an additional cost. Another aggravating issue is that the complexity of most nearest-neighbor-search algorithms grows exponentially with the dimension of the state, making the above approach intractable for $d$ greater than about 25 (Ref. 47). The curse of dimensionality thus requires that we pursue a different strategy.

### C. Learning in a high-dimensional phase space

For high-dimensional systems, the neural-network approach is intractable for two reasons. First, there is the issue of reconstructing the Jacobian matrix (or, equivalently, its action on the dOTD modes), which was discussed in Sec. III B. Second, to evaluate the term $\nabla_x u_i(x^n; \theta_i)$ appearing in (11), we must compute the gradient of the dOTD modes with respect to the state vector, resulting in a $d$-by-$d$ matrix. For large $d$, this computation is virtually hopeless. So, to make the neural-network approach applicable to high-dimensional systems, we proceed to an order reduction of the phase space.

If $x$ arises from discretizing a partial differential equation (PDE) defined in a domain $\Omega$, then one approach is to randomly select $M$ points in that domain according to some probability distribution. Each snapshot $x^n$ then has dimension $M$, with $M$ presumably much smaller than $d$. When $x$ contains nodal values of the state, this approach amounts to randomly excising $M$ entries from each $x^n$. (The excised entries need not be the same for all the snapshots.) Random sampling has been used successfully in a number of problems related to deep learning of PDEs,^{29,48} largely because it has the advantage of being a “mesh-free” approach. However, applicability of this method to the present problem is limited because the algorithm for attractor reconstruction requires that the sampled points be the same for all snapshots. As a result, we may need a large number of sampled points (with $M$ possibly on the order of $d$) to faithfully capture the dynamics. If $M$ is reasonably small, however, the dOTD modes can be learned at the sampled points and subsequently reconstructed over the entire domain using any standard interpolation algorithm.

To avoid these difficulties, we use an approach based on the Galerkin projection. We assume that any state on the attractor can be represented as a superposition of proper-orthogonal-decomposition (POD) modes,
$$x \approx \bar{x} + \Phi\, \xi, \tag{21}$$
where $\bar{x}$ is the mean flow, $\Phi \in \mathbb{R}^{d \times n_s}$ contains the first $n_s$ POD modes, and $\xi \in \mathbb{R}^{n_s}$ contains the corresponding POD coefficients. Measure-invariance and ergodicity of the attractor allow us to view $\xi$ as a function of time or as a function of the state, so that we may use $\xi(t)$ and $\xi(x)$ interchangeably. Each POD mode can be computed directly from data by the method of snapshots,^{18} and the $i$th POD coefficient can be obtained by projecting $x - \bar{x}$ onto the $i$th POD mode. The number of retained POD modes is determined by examining the cumulative energy of the POD eigenvalues. In general, $n_s$ is selected so as to account for at least 95% of the total energy. We also note that implicit in (21) is a one-to-one correspondence between snapshots in the $x$-space ($x^n$) and snapshots in the $\xi$-space ($\xi^n$). This allows us to view any function of $x^n$ as a function of $\xi^n$ and vice versa.
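The POD construction of (21) can be sketched as follows. This is a generic SVD-based variant (rather than the method of snapshots cited in the text), with the energy criterion exposed as a parameter; the function name is illustrative.

```python
import numpy as np

def pod(X, energy=0.95):
    """SVD-based proper orthogonal decomposition of snapshots X (N, d).

    Returns the mean flow xbar, the retained modes Phi (d, ns), and the
    POD coefficients xi (N, ns), with ns the smallest number of modes
    capturing the requested fraction of the cumulative energy.
    """
    xbar = X.mean(axis=0)
    Xc = X - xbar                                   # centered snapshots
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)            # cumulative POD energy
    ns = int(np.searchsorted(cum, energy)) + 1
    Phi = Vt[:ns].T                                 # orthonormal modes (d, ns)
    xi = Xc @ Phi                                   # coefficients by projection
    return xbar, Phi, xi
```

With $n_s$ fixed this way, a snapshot is reconstructed as $\bar{x} + \Phi\,\xi$, exactly as in (21).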

In the POD subspace, the dynamics obeys
$$\dot{\xi} = G(\xi) \equiv \Phi^\top F(\bar{x} + \Phi\, \xi),$$
so, in principle, we could use the neural-network approach to learn the OTD modes in the $\xi$-subspace. We would simply apply the method proposed in Secs. III A and III B with $\xi$ in place of $x$, $G$ in place of $F$, and $\nabla_\xi G$ in place of $L$ and then project the dOTD modes learned in the $\xi$-subspace back to the full space, resulting in $\Phi u_i(\xi)$. However, this approach is flawed because by construction, it assumes that the OTD modes live in the same subspace as the state itself (that is, the POD subspace spanned by the columns of $\Phi$). In reality, the OTD modes belong to the tangent space at point $x$, whose principal directions have no reason to coincide with those of the state on the attractor. This inconsistency persists regardless of the number of POD modes used in (21) so that resorting to very large $n_s$ does not solve the problem.

All is not lost, though, since we can use different projection subspaces for the state and the OTD modes,
$$x \approx \bar{x} + \Phi\, \xi(x), \qquad u_i(x) \approx \Psi(x)\, \mu_i(x),$$
where $\Psi(x) \in \mathbb{R}^{d \times n_t}$ is a reduced orthonormal basis of the tangent space at $x$ and $\mu_i(x) \in \mathbb{R}^{n_t}$ contains the basis coefficients. We note that the number of POD modes ($n_s$) and columns of $\Psi$ ($n_t$) need not be the same, which allows for the possibility of learning the OTD modes in a subspace bigger than the POD subspace. If $\Psi(x^n)$ is a good approximation of the tangent space at $x^n$, then the $K$ nearest neighbors of $x^n$ satisfy
$$x_k - x^n \approx \Psi(x^n)\, a_k^n,$$
where $a_k^n$ contains the corresponding expansion coefficients. To compute $\Psi(x^n)$ from data, we solve the minimization problem
$$\min_{\Psi^\top \Psi = I}\; \sum_{k=1}^{K} \left\| (x_k - x^n) - \Psi \Psi^\top (x_k - x^n) \right\|^2,$$
which is tantamount to a principal component analysis (PCA) of the set of nearest neighbors $\{x_k\}_{k=1}^K$. In other words, the columns of $\Psi(x^n)$ are the leading $n_t$ POD modes of the $K$ nearest neighbors of $x^n$; we refer to them as the “tangent POD (tPOD) modes.” We note that this approach has been used as the basis for a number of manifold learning algorithms that involve reconstruction of the tangent space from data.^{49,50}
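A minimal sketch of this tangent-space reconstruction: PCA of the nearest-neighbor perturbation vectors, whose leading right singular vectors serve as the tPOD modes. The helper name and the use of $K$ nearest neighbors (instead of a radius) are illustrative choices.

```python
import numpy as np

def tangent_basis(X, n, K, nt):
    """tPOD modes at x^n: PCA of the K nearest-neighbor perturbations.

    Returns a (d, nt) matrix with orthonormal columns spanning the best
    nt-dimensional approximation of the tangent space at x^n.
    """
    ref = X[n]
    order = np.argsort(np.linalg.norm(X - ref, axis=1))
    idx = order[order != n][:K]
    V = X[idx] - ref                         # perturbation vectors (K, d)
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    return Vt[:nt].T                         # leading principal directions
```

When the data are exactly confined to a low-dimensional plane, the recovered basis spans that plane, which is the property the reduced-order learning step relies on.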

Now that we have two low-dimensional representations available for the state and the OTD modes, the learning problem reduces to finding the collection of graphs,
$$\{\xi \longmapsto \mu_i(\xi)\}_{i=1}^{r},$$
given a dataset of reduced states $\{\xi^n\}_{n=1}^N$ and tangent bases $\{\Psi(\xi^n)\}_{n=1}^N$. To derive an equation for $\mu_i$, we substitute (23a) and (23b) into (10) and arrive at

Three remarks are in order. First, we note that (27) is an $n_t$-dimensional system of differential equations, much less expensive to solve than the original $d$-dimensional system. Second, we note the presence of an additional term on the right-hand side of (27) arising from the dependence of the tPOD modes on the state $x$. Third, we have defined the reduced operator
$$L_\Psi(x) = \Psi(x)^\top L(x)\, \Psi(x), \tag{28}$$
which is the projection of the high-dimensional operator $L(x)$ onto the reduced basis $\Psi(x)$. Equation (28) provides additional insight as to why the “naive” approach described two paragraphs earlier was not a good one. That approach was equivalent to having $\Psi(x) = \Phi$ for every point $x$ on the attractor, leading to $L_\Psi = \Phi^\top L(x)\Phi$. However, this is a poor approximation for $L(x)$ because $\Phi$ is a reduced basis of the phase space but not of the tangent space.

To compute the reduced operator $L_\Psi(x^n)$ from data, we use an approach similar to that proposed in Sec. III B. We first recall that (18) provides a mechanism to compute the action of $L(x^n)$ on a collection of perturbation vectors $\{v_k^n\}_{k=1}^K$ at point $x^n$. The columns of $\Psi(x^n)$ are, by the POD construction, linear combinations of these perturbation vectors. Therefore, we may write $\Psi(x^n) = V_n \kappa_n$, where $\kappa_n \in \mathbb{R}^{K \times n_t}$ is a matrix of coefficients. This leads to
$$L_\Psi(x^n) \approx \Psi(x^n)^\top \Delta V_n\, \kappa_n. \tag{29}$$
The operator $L_\Psi(x^n)$ has the advantage of being low-dimensional, so it can be computed offline and stored in memory, along with the POD-reduced vector field $G(\xi^n)$.
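This projection step can be illustrated as below, assuming the perturbation matrix $V_n$ and its finite-difference image $\Delta V_n$ have already been reconstructed; `reduced_operator` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def reduced_operator(V, dV, Psi):
    """Reduced operator L_Psi from reconstructed data, cf. eq. (29).

    V   : (d, K) perturbation vectors at x^n
    dV  : (d, K) finite-difference images, dV ~ L(x^n) V
    Psi : (d, nt) orthonormal tangent basis

    Since the columns of Psi are linear combinations of those of V,
    Psi = V kappa, and L_Psi ~ Psi^T dV kappa.
    """
    kappa = np.linalg.pinv(V) @ Psi        # (K, nt) expansion coefficients
    return Psi.T @ dV @ kappa              # (nt, nt) reduced operator
```

When $\Delta V_n = L V_n$ holds exactly, the result coincides with the direct projection $\Psi^\top L \Psi$, which is a useful correctness check.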

The last term on the right-hand side of (27), however, is problematic because it involves the temporal derivative of the reduced tangent basis $\Psi$. We could use the chain rule and write $\dot{\Psi} = \nabla_\xi \Psi\, G(\xi)$, but the gradient $\nabla_\xi \Psi$ is expensive to compute. Another option is to use a finite-difference formula in the spirit of (16),
$$\dot{\Psi}(x^n) \approx \frac{\Psi(x^{n+1}) - \Psi(x^n)}{\Delta t_s}. \tag{30}$$
If $\Delta t_s$ is sufficiently small, then the tPOD modes smoothly deform from $x^n$ to $x^{n+1}$, so (30) is a good approximation. [We note in passing that continuity of $u_i(x)$ requires continuity of both $\Psi(x)$ and $\mu_i(x)$.] With this approach, the term $\Psi^\top \dot{\Psi}$ can be computed offline and passed to the neural network as a dummy input.
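A sketch of this finite-difference estimate follows. One practical detail assumed here, not spelled out in the text, is that PCA returns each tPOD mode only up to a sign, so consecutive bases are sign-aligned before differencing to keep the deformation smooth.

```python
import numpy as np

def psi_dot_term(Psi_n, Psi_np1, dt):
    """Finite-difference estimate of Psi^T Psi_dot at x^n, cf. eq. (30).

    Psi_n, Psi_np1 : (d, nt) tangent bases at consecutive snapshots
    dt             : sampling time Delta t_s
    """
    # sign-align columns of Psi(x^{n+1}) with those of Psi(x^n)
    signs = np.sign(np.sum(Psi_n * Psi_np1, axis=0))
    Psi_np1 = Psi_np1 * signs                 # smoothly deforming basis
    return Psi_n.T @ (Psi_np1 - Psi_n) / dt
```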

Finally, we address the question of whether the “local” tangent bases $\Psi (x)$ could be combined into a larger “global” subspace that does not depend on $x$. Mathematically, this is equivalent to seeking a reduced basis of the tangent bundle,
$$T\mathcal{X} = \bigsqcup_{x \in \mathcal{X}} T_x\mathcal{X},$$
where $\bigsqcup$ is the disjoint union operator. Such a construct would have the benefit of eliminating the last term on the right-hand side of (27). Also, it would be esthetically more appealing because the OTD modes would be learned in a common subspace, regardless of the point at which they are computed. To construct a reduced basis of $T\mathcal{X}$, one can perform POD on the set of tangent bases $\{\Psi(x^n)\}_{n=1}^N$. This results in a set of “bundle modes,” denoted by $\Pi \in \mathbb{R}^{d \times n_b}$, whose span is the best $n_b$-dimensional approximation of the tangent bundle $T\mathcal{X}$. The expectation is that the number of bundle modes $n_b$, although generally greater than the number of local tPOD modes $n_t$, will still be much smaller than $d$. The dOTD modes may then be sought as vectors in the bundle space; that is, $u_i(x) = \Pi\, \rho_i(x)$, where the coefficients $\rho_i \in \mathbb{R}^{n_b}$ satisfy

and $L_\Pi$ is defined as
$$L_\Pi(x) = \Pi^\top L(x)\, \Pi. \tag{33}$$
For $n_b$ not too large, the reduced operator $L_\Pi$ may be computed offline and stored in memory.

Applicability of the tangent-bundle method is limited to cases in which the governing equations of the dynamical system are available. This is because $L_\Pi$ cannot be computed solely from snapshot data, except when the linearized operator $L(x)$ is known explicitly. To see this, we first recognize that by construction, each bundle mode is a linear combination of the columns of $\Upsilon = [V_1, \dots, V_N]$. So, evaluation of (33) from data is conditioned on our ability to compute $L(x^n)\Upsilon$ or, equivalently, each member of the set $\{L(x^n)V_m\}_{m=1}^N$. However, there is no mechanism to compute $L(x^n)V_m$ from data when $m \ne n$. [Equation (18) holds only for $m=n$.] Thus, evaluation of (33) requires explicit knowledge of the linearized operator and, in turn, the governing equations. Because of this restriction, this approach will not be pursued any further in this work.

### D. Implementation

We conclude this section with a few words on implementation. The neural network is built from scratch in Python. We use Autograd^{51} for automatic differentiation. The activation function between hidden layers is the hyperbolic tangent, although other choices (e.g., the sigmoid or swish functions) are possible.^{52} The activation function for the output layer is linear. As discussed in Sec. III A 2, the last layer of the neural network is followed by a Gram–Schmidt layer. The weights are initialized according to Xavier initialization.^{53} For Lyapunov regularization, we use $\sigma = \sinh$.
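The Gram–Schmidt output layer can be sketched as below. This is an illustrative NumPy version; in the actual setup it would be written with `autograd.numpy` so that gradients propagate through the orthonormalization.

```python
import numpy as np  # in practice, autograd.numpy for differentiability

def gram_schmidt_layer(U):
    """Orthonormalize the raw outputs U (d, r) of the last linear layer,
    column by column, so the returned dOTD modes are orthonormal."""
    cols = []
    for i in range(U.shape[1]):
        u = U[:, i]
        for v in cols:
            u = u - (v @ u) * v          # remove projections on previous modes
        cols.append(u / np.sqrt(u @ u))  # normalize
    return np.stack(cols, axis=1)
```

The layer preserves the flag of subspaces spanned by the first $i$ outputs, which is what makes the sequential learning of the modes well defined.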

We use Adam^{54} to solve the optimization problem. In its current manifestation, the code supports mini-batching, although caution must be exercised when selecting the batch size in order not to deteriorate the accuracy of (14). (The results presented in Sec. IV do not use mini-batching.) We have found no need to specify a learning-rate schedule in the optimizer. Different stopping criteria may be used, depending on how the vector field and linearized operator are computed. If $F(x^n)$ and $L(x^n)$ are evaluated from the governing equations, and the hypothesis class is reasonably large, then optimization may be terminated when the loss function (11) is smaller than a user-specified tolerance. (As a result, the overall error is dominated by the approximation error.) If $F(x^n)$ and $L(x^n)$ are reconstructed from data, then it is preferable to terminate optimization after a user-specified number of iterations, in recognition of the fact that the reconstruction process is approximate and, therefore, introduces an estimation error.

To generate the dataset, we consider a single long trajectory of the dynamical system, rather than multiple shorter trajectories with distinct initial conditions. The dataset is constructed by collecting equally-spaced snapshots (with sampling time $\Delta t_s$) along that long trajectory. (For the results presented in Sec. IV, $\Delta t_s$ coincides with the time-step size $\Delta t$ used to advance the dynamical system.) For each snapshot, we reconstruct the vector field and linearized operator using the KNeighborsClassifier implemented in scikit-learn. We note that the nearest-neighbor search is carried out over the entire collection of snapshots, including those collected in the transient regime, so as to improve accuracy of the reconstruction algorithm.^{10} For high-dimensional systems, the nearest-neighbor search is conducted in the POD subspace to alleviate computational cost. (By this, we mean that we merely extract time stamps for the nearest neighbors in the $\xi$-space and then use the corresponding snapshots in the $x$-space in the reconstruction algorithm.) In the examples presented in Sec. IV, the vector field and linearized operator are formed explicitly, stored in memory, and passed to the neural network as dummy inputs. This is possible either because the dynamical system is low-dimensional or because we first proceeded to a reduction of the dynamics using the Galerkin approach described in Sec. III C.

Upon reconstruction of $F(x^n)$ and $L(x^n)$, we discard from the dataset those snapshots that were collected in the transient regime of the long trajectory. As discussed in Sec. III A 1, this step is critical to ensure validity of the measure-averaging operation. The resulting “truncated” dataset is then split into training, validation, and test sets. The training set is used to optimize the weights of the neural network and the test set to evaluate its predictive capability. The validation set is used to tune the hyperparameters of the neural network, as described in Ref. 55. (For a given dynamical system, we consider a range of random combinations of hyperparameters and select the architecture for which the validation error is the smallest.) The validation set is also used to tune the learning rate for the Adam optimizer, the schedule for Lyapunov regularization, and the number of nearest neighbors used in the reconstruction step.

The code is available on GitHub.^{56}

## IV. RESULTS

### A. Performance metrics

To evaluate the accuracy of the learning algorithm, we consider two performance metrics. The first is the PDE loss $\ell_i^{\text{pde}}$ defined in (11), which quantifies the extent to which the neural network satisfies the OTD equations (10). We use this metric as our primary guide in the search for appropriate hyperparameters because a small PDE loss on training and validation data is a key prerequisite for generalizability to test data.

However, as discussed in Sec. III A 1, the PDE loss does not discriminate between stable and unstable SLBs. So, we supplement it with
$$d_i = 1 - \left| \langle u_i^{\text{deep}}, u_i^{\text{num}} \rangle \right|, \tag{34}$$
where $u_i^{\text{deep}}$ denotes the $i$th dOTD mode and $u_i^{\text{num}}$ denotes the $i$th OTD mode computed by direct numerical integration of (5). The distance $d_i$ takes values between 0 and 1, with the former indicating that $u_i^{\text{deep}}$ and $u_i^{\text{num}}$ coincide and the latter indicating that $u_i^{\text{deep}}$ and $u_i^{\text{num}}$ are orthogonal. We could also use the distance between the subspaces spanned by $\{u_i^{\text{deep}}\}_{i=1}^r$ and $\{u_i^{\text{num}}\}_{i=1}^r$, but that measure, unlike (34), assigns the same score to all the SLBs, regardless of stability (when $r=d$).
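As a concrete stand-in for this metric, the snippet below uses a plausible sign-insensitive form consistent with the stated bounds: 0 when the modes coincide (up to sign) and 1 when they are orthogonal. Treat it as an assumption rather than the paper's exact definition of (34).

```python
import numpy as np

def mode_distance(u_deep, u_num):
    """Sign-insensitive distance between two unit-norm modes: 0 when they
    coincide (up to sign), 1 when they are orthogonal. An assumed form,
    consistent with the bounds stated for d_i."""
    return 1.0 - abs(u_deep @ u_num)
```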

If the OTD modes are learned in a reduced subspace, then it is useful to compute

where $\Theta$ is a placeholder for $\Psi(x)$ or $\Pi$, depending on whether the “local” or “bundle” approach is used. The above is equivalent to computing the distance function in the reduced subspace in which the modes are learned. This quantity is important because dimensionality reduction of the tangent space (or tangent bundle) introduces an additional error over which the neural network has no control. Thus, situations may arise in which $d_i^\Theta$ is small, but $d_i$ is large. Should this occur, a quick fix is to increase the dimension of the reduced subspace $\Theta$.

For each example considered below, we report training, validation, and test errors as measured by the two metrics $\ell_i^{\text{pde}}$ and $d_i$. (For the latter, we report its measure-average $\bar{d}_i$ over the dataset considered.) These numbers have been averaged over ten learning experiments, each corresponding to a different (random) initialization of the weights $\theta_i$. The hyperparameters, training set, validation set, and test set are kept the same from one learning experiment to the next. The results were found to be robust to small changes in hyperparameters and dataset selection.

### B. Examples

#### 1. Low-dimensional nonlinear system

We begin with a three-dimensional nonlinear system that was proposed by Noack *et al.*^{57} as a testbed for investigating Hopf bifurcations in laminar bluff-body flows. We choose this system because it provides a good illustration for many of the comments made in Secs. II and III. The governing equations are given by

with $\mu $ being a positive constant. The system admits a linearly unstable fixed point ($x=y=z=0$) and a stable periodic solution,

which defines a limit cycle of radius $\mu $ in the $z=\mu $ plane. As noted by Noack *et al.*,^{57} the limit cycle is asymptotically and globally stable.

For $\mu \le 1/8$, it is possible to derive analytical expressions for the OTD modes on the limit cycle,

where

These are the modes to which solutions of the OTD equations converge when computed with a time-stepping approach. [To the best of our knowledge, this is the first time that exact (nonasymptotic) expressions have been reported for unsteady OTD dynamics.] Each $u_i$ may be expressed as a function of the state $x = [x\ y\ z]^\top$ as follows:

The ordered set $\{u_1(x), u_2(x), u_3(x)\}$ is the unique stable SLB at point $x$ on the limit cycle, although other unstable SLBs exist. If we define

then any of the ordered triples $\{u_1, u_2^\pm, u_3^\pm\}$, $\{u_2^\pm, u_1, u_3^\pm\}$, and $\{u_2^\pm, u_3^\pm, u_1\}$ are solutions of the OTD equations, but only $\{u_1, u_2^+, u_3^+\}$ is stable.

We consider the case $\mu = 0.1$ and use the neural-network approach to learn the graphs $\{x \longmapsto u_i(x)\}_{i=1}^3$ from data. We begin by generating a long trajectory initiated close to the limit cycle. This trajectory is computed by a third-order Adams–Bashforth method with time-step size $\Delta t = 0.01$ for a total duration of $T=50$ time units. We record snapshots at every time step, resulting in an initial dataset comprising 5000 points. For each datapoint, we reconstruct the vector field and linearized operator using its seven nearest neighbors. We then discard the first 1000 datapoints corresponding to the transient regime. The resulting dataset comprises 4000 snapshots recorded in the interval $10 \le t \le 50$, spanning about six periods in the asymptotic regime of the long trajectory. We then divide up this dataset into three. The training set comprises 10 equally-spaced points over the period $20 \le t \le 20+2\pi$; the validation set, 629 equally-spaced points over the period $10 \le t \le 10+2\pi$; and the test set, the remaining points. (We use the same dataset splitting to train the three neural networks.) The neural network is composed of one hidden layer with 40 neurons, and the learning rate for the Adam optimizer is set to 0.04. Lyapunov regularization is active for the first 1000 iterations and turned off for the remainder of the optimization. The dOTD modes are learned sequentially, with the maximum number of iterations specified as 3000. For these parameters, the training, validation, and test errors are shown in Table I.

| $i$ | $\ell_i^{\text{pde}}$ (training) | $\bar{d}_i$ (training) | $\ell_i^{\text{pde}}$ (validation) | $\bar{d}_i$ (validation) | $\ell_i^{\text{pde}}$ (test) | $\bar{d}_i$ (test) |
|---|---|---|---|---|---|---|
| 1 | $1.2 \times 10^{-6}$ | $1.1 \times 10^{-6}$ | $2.7 \times 10^{-6}$ | $9.6 \times 10^{-7}$ | $2.5 \times 10^{-6}$ | $9.6 \times 10^{-7}$ |
| 2 | $7.9 \times 10^{-6}$ | $1.2 \times 10^{-4}$ | $3.9 \times 10^{-4}$ | $1.3 \times 10^{-4}$ | $1.9 \times 10^{-4}$ | $1.4 \times 10^{-4}$ |
| 3 | $8.2 \times 10^{-6}$ | $1.2 \times 10^{-4}$ | $3.9 \times 10^{-4}$ | $1.3 \times 10^{-4}$ | $1.9 \times 10^{-4}$ | $1.4 \times 10^{-4}$ |


Figure 2(a) shows the distance function $d_i$ computed on part of the test set. [Time series for the dOTD modes $u_i^{\text{deep}}$ and the numerically integrated OTD modes $u_i^{\text{num}}$ are shown in Figs. S1(a)–S1(c) of the supplementary material.] The key points are that (a) the neural network is able to learn the graphs $\{x \longmapsto u_i(x)\}_{i=1}^3$ from the limited number of training points supplied to it and (b) error accumulation is inconsequential because we were careful not to terminate the optimization prematurely. We note that $d_3$ closely follows $d_2$, a consequence of the fact that $u_3^{\text{deep}}$ is completely determined by $u_1^{\text{deep}}$ and $u_2^{\text{deep}}$. We also note that Lyapunov regularization is instrumental in the optimizer converging to the stable SLB (39). The Lyapunov exponents learned by the algorithm ($\hat{\lambda}_1 = -0.005$, $\hat{\lambda}_2 = -0.272$, and $\hat{\lambda}_3 = -0.655$) are reasonably close to the analytical values ($\lambda_1 = 0$, $\lambda_2 = -0.276$, and $\lambda_3 = -0.724$).

As discussed in Sec. I, the novelty of our approach is that the neural network provides, locally for each $x$, not only the directions of strongest instabilities but also the degree of instability along each of these directions. This is made visually clear in Fig. 2(b), which shows all the points from the test set in the three-dimensional phase space. Each point is colored according to its local leading Lyapunov exponent $\delta_1(x) = \langle u_1(x), L(x)\, u_1(x)\rangle$, as learned by the neural network. The neural network correctly learns that $\delta_1(x) = 0$ for all $x$ on the limit cycle. Figure 2(b) also shows, for two of the test points, the three dOTD modes (i.e., the directions of strongest instabilities) learned by the algorithm. Consistent with (39), the first mode is tangent to the trajectory. Figure 2(b) thus reinforces the claim made in Sec. I that our approach provides a “cartography” of instabilities in the phase space.
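The coloring quantity used in this cartography is a one-liner; the snippet assumes the leading dOTD mode $u_1$ is unit-norm and the linearized operator $L(x)$ is available as a matrix.

```python
import numpy as np

def local_lyapunov(u1, L):
    """Instantaneous growth rate delta_1(x) = <u_1(x), L(x) u_1(x)>
    along the (unit-norm) leading dOTD mode."""
    return u1 @ (L @ u1)
```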

#### 2. Charney–DeVore system

Next, we consider a modified version of the classical Charney–DeVore model, which describes atmospheric circulations at midlatitudes. We consider a six-dimensional truncation of the original system, which models a barotropic flow in a plane channel with topography.^{58} The governing equations are given by

with parameters

where $m=1$ or 2. The parameters $\alpha_m$ and $\delta_m$ account for zonal advection in the $z_1$ and $z_4$ directions, respectively; $\beta_m$ for the so-called $\beta$ effects; $\gamma_m$ and $\gamma_m^*$ for topographic interactions; $C$ for Ekman damping; and $z_1^*$ and $z_4^*$ for zonal forcing in the $z_1$ and $z_4$ directions, respectively. We set $z_1^* = 0.95$, $z_4^* = -0.76095$, $C = 0.1$, $\beta = 1.25$, $\gamma = 0.2$, and $b = 0.5$. These values of the parameters give rise to significant transitions between regimes of “zonal” and “blocked” flow, resulting from the nonlinear interaction between barotropic and topographic instabilities.^{58} The extreme episodes of blocked flow are of main interest to us because they are the manifestation of transient instabilities. Thus, it is in those intervals that we attempt to learn the OTD modes from data.

We use a third-order Adams–Bashforth scheme with time-step size $\Delta t = 0.05$ to generate a long trajectory spanning 4000 time units. We use zero initial conditions, except for $z_1(0) = 1.14$ and $z_4(0) = -0.91$. Snapshots are recorded at every time step, resulting in an initial dataset with 80 000 points. For each datapoint, we reconstruct the vector field and linearized operator using its 60 nearest neighbors and then discard the first 10 000 datapoints corresponding to the transient regime ($0 \le t \le 500$). The resulting dataset is then divided up into training, validation, and test sets. For the training set, we consider the interval $1075 \le t \le 1165$, during which the trajectory passes through a regime of blocked flow [see Figs. S2(a)–S2(f) in the supplementary material], and select 50 uniformly-spaced training points in this interval. The remaining 1750 points in this interval form the validation set. The test set comprises all datapoints in the original dataset, except for those used for training and validation. The test data, therefore, contain multiple episodes of blocked flow [Fig. 3(a)]. We only attempt to learn the first OTD mode $u_1$. The neural network has two hidden layers, each with 128 neurons. The learning rate for the Adam algorithm is 0.001. Lyapunov regularization is used for the first 2000 iterations and switched off thereafter. Optimization is terminated at 5000 iterations. For these parameters, the training, validation, and test errors are shown in Table II.

| $i$ | $\ell_i^{\text{pde}}$ (training) | $\bar{d}_i$ (training) | $\ell_i^{\text{pde}}$ (validation) | $\bar{d}_i$ (validation) | $\ell_i^{\text{pde}}$ (test) | $\bar{d}_i$ (test) |
|---|---|---|---|---|---|---|
| 1 | $2.2 \times 10^{-5}$ | $2.0 \times 10^{-2}$ | $7.0 \times 10^{-5}$ | $2.5 \times 10^{-2}$ | $2.1 \times 10^{-1}$ | $2.6 \times 10^{-1}$ |


Figure 3(b) shows time series for the distance $d_1$, which measures agreement between $u_1^{\text{deep}}$ and $u_1^{\text{num}}$. Not surprisingly, the neural network is able to learn the mapping from the phase space to the OTD space in the interval $1075 \le t \le 1165$ in which the training points were supplied. Much more remarkable is the outstanding agreement in other intervals of blocked flow (e.g., $1340 \le t \le 1410$ and $2900 \le t \le 3100$)—that is, the region of the phase space synonymous with transient instabilities—showing that the neural network only needs to know what a *single* interval of blocked flow looks like to be able to predict all other such intervals, past or future. This level of prediction capability is unprecedented in the context of extreme events in dynamical systems. Figure 3(b) and Fig. S3 in the supplementary material also show that agreement is generally poor outside intervals of blocked flow, which explains the relatively large test errors reported in Table II. This should come as no surprise because neural networks are known to perform poorly when the testing data look nothing like the training data. (The fundamental assumption for generalizability is that training and testing data are drawn independently from the same probability distribution.)

Our attempts to train the neural network in an interval of zonal flow, or in an interval containing both blocked and zonal flow regimes, were unsuccessful. We suspect that this is because the intervals of zonal flow are more “chaotic” (with more time scales involved) than intervals of blocked flow. Our numerical experiments suggest that improving expressivity of the neural network (by increasing the number of hidden layers and neurons) does not solve the problem. We note that a similar observation was made by Raissi,^{48} who attempted to machine-learn the Kuramoto–Sivashinsky equation with a neural network. Raissi^{48} noted that for this system, intervals of laminar flow posed no difficulty to the neural network, while chaotic intervals were “stubbornly” challenging, with the optimization algorithm not converging to the “right values” of the network parameters. This description aligns with what we have seen in the present investigation of the Charney–DeVore system. We leave investigation of this issue for future work.

#### 3. Flow past a cylinder

We conclude this section with an application of the learning algorithm to a high-dimensional dynamical system. This is to provide an illustration of the Galerkin approach proposed in Sec. III C. Specifically, we consider the flow of a two-dimensional fluid of density $\rho$ and kinematic viscosity $\nu$ past a rigid circular cylinder of diameter $D$ with a uniform free-stream velocity $U e_x$. The Navier–Stokes equations can be written in dimensionless form as
$$\partial_t x + x \cdot \nabla x = -\nabla p + \frac{1}{Re}\,\Delta x, \qquad \nabla \cdot x = 0,$$
with a no-slip boundary condition ($x = 0$) on the cylinder surface and a uniform flow ($x = e_x$) in the far field. Velocity, time, and length have been scaled with the cylinder diameter $D$ and free-stream velocity $U$, and the Reynolds number is $Re = UD/\nu$. We consider the case $Re = 50$, for which there exists a limit-cycle attractor, believed to be globally and asymptotically stable.^{59} Our computational approach (mesh topology, spatial discretization, and time-stepping scheme) is identical to that used by Blanchard *et al.*^{41,42}

This flow lends itself to dimensionality reduction because the limit-cycle attractor, while being part of an infinite-dimensional phase space, is low-dimensional, with a handful of POD modes faithfully capturing nearly all of the energy. (In fact, the system discussed in Sec. IV B 1 was originally introduced as a simplified model for this flow.) Low-dimensionality of the attractor is important for leveraging the full power of the reduced-order learning algorithm proposed in Sec. III C. We also note that learning the dOTD modes on the limit cycle does not have much merit from the standpoint of predicting instabilities, but it does from the standpoint of flow control. As discussed in Sec. V, having access to the OTD modes at any point along the periodic orbit is the stepping stone for the application of the OTD control strategy proposed by Blanchard *et al.*^{42}

We begin with the generation of a long trajectory on the limit cycle by integrating the Navier–Stokes equations for 400 time units (corresponding to about 52 periods) with a time-step size of 0.002. Snapshots are recorded every ten time steps, resulting in an initial dataset with 20 000 flow snapshots. The POD modes $\Phi$ are computed using 192 snapshots equally spaced over one period. We retain the first $n_s = 8$ POD modes, accounting for more than 99.9% of the cumulative energy. Time series for the POD coefficients $\xi \in \mathbb{R}^8$ are generated by projecting the 20 000 flow snapshots on the POD modes [Fig. 4(a)]. For each point $\xi^n$, we reconstruct the vector field $G(\xi^n)$ using an Euler-forward finite-difference approximation in the $\xi$-space. [Alternatively, one could project (16) on the POD modes.] To compute the reduced basis in which the OTD modes will be learned, we consider the local approach proposed in Sec. III C. We compute the tPOD modes $\{\Psi(x^n)\}_{n=1}^N$ using the 50 nearest neighbors of each $x^n$ and then use (29) and (30) to reconstruct the reduced linearized operator and the last term on the right-hand side of (27), respectively. We consider local tangent bases with dimension ranging from $n_t = 2$ to 6.
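The Euler-forward reconstruction of the reduced vector field can be sketched as follows, assuming the POD coefficients are stored row-wise at a uniform sampling time; the helper name is illustrative.

```python
import numpy as np

def reduced_vector_field(xi, dt):
    """Euler-forward estimate of G along the orbit:
    G(xi^n) ~ (xi^{n+1} - xi^n) / dt, for coefficients xi of shape (N, ns).
    Returns an (N - 1, ns) array of vector-field samples."""
    return (xi[1:] - xi[:-1]) / dt
```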

Upon completion of the reconstruction step, the 20 000-point dataset in the $\xi$-space is divided up into training, validation, and test sets. The training set comprises 20 equally-spaced points over the period $50 \le t \le 57.6$; the validation set, 55 points over the period $100 \le t \le 107.6$; and the test set, 110 points over the interval $200 \le t \le 215.2$. (We discard the remaining 19 815 points to avoid lengthy calculations.) Results are presented for the first two dOTD modes. For both, we use the same dataset splitting and a neural network with two 32-unit hidden layers. The Adam optimization algorithm uses a learning rate of 0.01 and is terminated after 2000 iterations. Lyapunov regularization is turned off after 100 iterations. For these parameters, the training, validation, and test errors are shown in Table III.

| $i$ | $\ell_i^{\text{pde}}$ (training) | $\bar{d}_i$ (training) | $\bar{d}_i^\Psi$ (training) | $\ell_i^{\text{pde}}$ (validation) | $\bar{d}_i$ (validation) | $\bar{d}_i^\Psi$ (validation) | $\ell_i^{\text{pde}}$ (test) | $\bar{d}_i$ (test) | $\bar{d}_i^\Psi$ (test) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | $8.8 \times 10^{-5}$ | $8.5 \times 10^{-3}$ | $8.5 \times 10^{-3}$ | $4.3 \times 10^{-4}$ | $6.7 \times 10^{-3}$ | $6.7 \times 10^{-2}$ | $6.9 \times 10^{-4}$ | $4.3 \times 10^{-3}$ | $4.3 \times 10^{-3}$ |
| 2 | $5.7 \times 10^{-5}$ | $7.5 \times 10^{-2}$ | $7.5 \times 10^{-2}$ | $4.9 \times 10^{-4}$ | $5.5 \times 10^{-2}$ | $5.5 \times 10^{-2}$ | $7.0 \times 10^{-4}$ | $6.0 \times 10^{-2}$ | $6.0 \times 10^{-2}$ |

. | Training set . | Validation set . | Test set . | ||||||
---|---|---|---|---|---|---|---|---|---|

i
. | $\u2113ipde$ . | $d\xafi$ . | $d\xafi\Psi $ . | $\u2113ipde$ . | $d\xafi$ . | $d\xafi\Psi $ . | $\u2113ipde$ . | $d\xafi$ . | $d\xafi\Psi $ . |

1 | 8.8 × 10^{−5} | 8.5 × 10^{−3} | 8.5 × 10^{−3} | 4.3 × 10^{−4} | 6.7 × 10^{−3} | 6.7 × 10^{−2} | 6.9 × 10^{−4} | 4.3 × 10^{−3} | 4.3 × 10^{−3} |

2 | 5.7 × 10^{−5} | 7.5 × 10^{−2} | 7.5 × 10^{−2} | 4.9 × 10^{−4} | 5.5 × 10^{−2} | 5.5 × 10^{−2} | 7.0 × 10^{−4} | 6.0 × 10^{−2} | 6.0 × 10^{−2} |

For $n_t = 4$, Figs. 4(b) and 4(c) show that the distances $\{d_i(x)\}_{i=1}^{2}$ and $\{d_i^{\Psi}(x)\}_{i=1}^{2}$ are virtually zero at the test points. This shows that (a) the error introduced by the low-dimensional reconstruction of the tangent space is negligible and (b) the neural network finds the best representation of the OTD modes in this reduced tangent space. Figures S4(a)–S4(f) in the supplementary material provide visual confirmation that the dOTD modes learned in the reconstructed tangent space are indistinguishable from their numerically integrated counterparts. These results illustrate the benefits of learning the OTD modes in a reduced subspace. Equally good agreement was obtained for $n_t = 2$ and 6 with the same network parameters.

## V. DISCUSSION

We now discuss possible improvements and modifications to the OTD learning algorithm introduced in Sec. III as well as implications for data-driven control of instabilities in dynamical systems.

In the examples discussed in Sec. IV, the nonlinearity appearing in the governing equations was no stronger than quadratic with respect to the state variables, and the neural network was able to learn the OTD graph using the state $x$ as its only “active” input. In cases in which the nonlinearity is known to be, or suspected to be, stronger than quadratic (e.g., with terms involving higher-order polynomials, trigonometric functions, or nonlinear differential operators), it is likely that supplying $x$ as the only input will call for wider, deeper networks than those used in Sec. IV. To keep the number of network parameters reasonably small and facilitate convergence of the optimization algorithm, one possibility is to use as additional inputs a library of nonlinear functions of the state; for example, $\{x_n, F(x_n), \sin(x_n), \exp(-x_n^2), x_n \cdot \nabla x_n\}_{n=1}^{N}$. This approach is in the same spirit as the SINDy algorithm,^{22} in which sparse regression is applied to a library of nonlinear functions of $x$ in order to discover governing equations from state measurements. (We note that equation discovery goes well beyond the SINDy algorithm, with more general techniques such as grammar-based equation discovery^{60} and process-based modeling^{61,62}.)
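A library-augmented input of this kind can be sketched as follows. The helper and the particular nonlinearities are our own illustration: `F` stands in for any available reconstruction of the vector field at `x`, and any other candidate functions could be appended in the same way.

```python
import numpy as np

def feature_library(x, F=None):
    """Stack the state x with a library of candidate nonlinearities,
    in the spirit of SINDy, to form an augmented network input.
    F is an optional precomputed estimate of the vector field at x."""
    feats = [x, np.sin(x), np.exp(-x**2)]
    if F is not None:
        feats.append(F)
    return np.concatenate(feats)

x = np.array([0.0, 1.0, -1.0])
z = feature_library(x)
# z stacks x, sin(x), and exp(-x^2): a 9-dimensional input here
```

The augmented vector `z` would then replace `x` as the network input, leaving the rest of the learning algorithm unchanged.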

We also note that in any laboratory experiment, sensing capabilities are limited by the apparatus, leading to errors in state estimation and reconstruction. Uncertainty in state measurements may be accounted for by trading the neural-network approach for one based on Gaussian processes (GPs) because GPs have the advantage of providing error estimates at each testing point. GPs have been found capable of handling sizable noise levels in a number of problems similar to the present one, including deep learning of partial differential equations and discovery of governing equations from noisy measurements.^{24,63} Another possibility is to use an approach based on reservoir computing, a type of neural-network architecture on which noise in the dataset has, paradoxically, a stabilizing effect. Vlachas *et al.*^{36} recently applied reservoir computing to the problem of predicting chaotic dynamics and found that addition of noise in the training data not only led to better generalization, but also improved performance of the network on both the training and testing datasets.
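To make the per-point error estimate concrete, the following is a minimal sketch of GP regression with a radial-basis-function kernel; the kernel choice and hyperparameters are placeholders, not values from the paper. The point is that a predictive standard deviation comes out alongside the mean at every test point.

```python
import numpy as np

def gp_posterior(X, y, Xs, ell=0.2, sig_n=1e-2):
    """GP regression with an RBF kernel: posterior mean and standard
    deviation at test points Xs, the latter being the per-point
    error estimate mentioned in the text."""
    def k(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
        return np.exp(-0.5 * d2 / ell**2)
    K = k(X, X) + sig_n**2 * np.eye(len(X))   # noisy training covariance
    Ks = k(Xs, X)
    mean = Ks @ np.linalg.solve(K, y)
    # Predictive variance: prior variance minus the explained part
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))

X = np.linspace(0, 1, 5)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
mean, std = gp_posterior(X, y, X)
# At the training points the predictive std is small (near the noise level)
```

Away from the training data, `std` grows toward the prior standard deviation, flagging states at which the learned dOTD modes should not be trusted.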

The method proposed in Sec. III is fully data-driven, in the sense that no input other than state snapshots is required to learn the dOTD modes. (The vector field and linearized operator are reconstructed using nothing but state snapshots.) If governing equations *are* available (either derived from first principles or reconstructed from data), then there is another possibility to learn the graphs $x \longmapsto u_i(x)$; that is, generate a large number of $\{x_n, u_i(x_n)\}$ pairs by solving the state and OTD equations numerically and model the input–output relationship with a neural network, whose parameters are found by minimizing the empirical risk

$$\ell_i^{\mathrm{emp}} = \frac{1}{N} \sum_{n=1}^{N} \ell\big(u_i(x_n), \hat{u}_i(x_n)\big),$$

where $\hat{u}_i(x_n)$ denotes the network prediction and $\ell$ is an appropriate loss function (e.g., quadratic loss or log-cosh loss). To $\ell_i^{\mathrm{emp}}$ may be appended the loss function (11), which then acts as a physics-informed^{24,64} (equation-based) regularization term. The downsides of this approach are that (a) access to governing equations is mandatory and (b) generating the input–output pairs requires solving $r+1$ $d$-dimensional initial-boundary-value problems, which may be computationally expensive.
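A minimal sketch of this empirical risk, with the two loss functions mentioned, follows; the predictor passed in is a stand-in for the trained network, not an implementation of it.

```python
import numpy as np

def quadratic_loss(u, u_hat):
    """Squared-error loss between a target mode and a prediction."""
    return np.sum((u - u_hat)**2, axis=-1)

def log_cosh_loss(u, u_hat):
    """Log-cosh loss: quadratic near zero, linear for large residuals."""
    return np.sum(np.log(np.cosh(u - u_hat)), axis=-1)

def empirical_risk(pairs, predict, loss=quadratic_loss):
    """Mean loss over (x_n, u_i(x_n)) pairs for a given predictor."""
    return np.mean([loss(u, predict(x)) for x, u in pairs])

# Toy check: a predictor that reproduces the targets has zero risk
pairs = [(np.zeros(2), np.ones(2)), (np.ones(2), np.ones(2))]
risk = empirical_risk(pairs, lambda x: np.ones(2))
# risk == 0.0
```

Appending the loss function (11) to this risk would simply add a second term to the quantity being minimized, penalizing predictions that violate the OTD equations.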

From the standpoint of predicting instabilities, the benefit of learning the dOTD modes from data is that it gives access to directions of instabilities at any point in the phase space and to the leading Lyapunov exponents, regardless of their sign. But the dOTD learning algorithm also has significant implications from the standpoint of controlling instabilities. As discussed in Sec. II A, we have recently shown that the OTD modes can be incorporated into reduced-order control algorithms for stabilization of unsteady high-dimensional flows. The OTD control strategy proposed in Ref. 42 requires solving the OTD equations concurrently with the state equations because the control force belongs to the OTD subspace. With the dOTD learning approach, this requirement disappears because the neural-network approach can be used to build a library of dOTD modes for various regions of the phase space. (Construction of the OTD library may be done offline.) Then, as the controlled trajectory wanders about in the phase space, the controller can look up in the OTD library the dOTD modes associated with the current state. (If the trajectory visits a state that is not present in the library, then one can interpolate between nearby states for which dOTD modes are available.) Library look-up can be done in real time because the computational complexity of the look-up algorithm scales with the dimension of the OTD subspace, which makes the approach very attractive from the standpoint of controlling high-dimensional systems in real time. We note that similar ideas were employed in the context of nonlinear model order reduction by Amsallem *et al.*^{20,65}
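The library look-up step can be sketched as follows. This is a toy nearest-neighbor version: the class name, the brute-force distance scan, and the toy bases are our own illustration, and a tree-based search or interpolation (with re-orthonormalization) between several neighbors would be natural refinements.

```python
import numpy as np

class OTDLibrary:
    """Offline library of precomputed dOTD modes, queried online by
    nearest-neighbor look-up in the phase space."""
    def __init__(self, states, modes):
        self.states = np.asarray(states)   # (N, d) stored states x_n
        self.modes = np.asarray(modes)     # (N, d, r) precomputed dOTD bases

    def query(self, x):
        """Return the dOTD basis stored at the library state nearest to x.
        The brute-force scan costs O(N d); a k-d tree would make
        repeated look-ups cheaper."""
        idx = np.argmin(np.linalg.norm(self.states - x, axis=1))
        return self.modes[idx]

# Toy library: two states in R^2, each with a rank-1 basis
lib = OTDLibrary([[0.0, 0.0], [1.0, 1.0]],
                 [np.eye(2)[:, :1], np.eye(2)[:, 1:]])
u = lib.query(np.array([0.9, 1.1]))
# nearest state is [1, 1], so u is the second stored basis
```

In a control loop, `query` would be called at every time step with the current state estimate, and the returned basis would define the subspace in which the control force acts.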

Finally, there is the issue of how the predictive capabilities of the neural network are affected by changes in system parameters. We first note that even a small change in system parameters has the potential of considerably altering the topology of the phase space and, in particular, the number, nature, and properties of the attractors. (This is apparent when a bifurcation occurs.) We also note that one of the prerequisites for the neural network to perform well on unseen data is that the training and testing data be drawn from the same underlying probability distribution. Large variations in system parameters are likely to violate this assumption, seriously compromising generalizability of the neural network. Small variations in system parameters may lead to good generalizability provided that the changes in phase-space topology and underlying probability distribution are also small.

## VI. CONCLUSION

For a large class of dynamical systems, the optimally time-dependent (OTD) modes, a set of deformable orthonormal tangent vectors that track directions of instabilities along any trajectory, are known to depend pointwise on the state of the system on the attractor but not on the history of the trajectory. We have developed a learning algorithm based on neural networks to learn this pointwise mapping from the phase space to OTD space using data collected along one or more trajectories of the system. The proposed method is fully data-driven as it requires no input other than snapshots of the state and is applicable regardless of the dimensionality of the system. The learning process provides a cartography of the directions associated with the strongest instabilities in the phase space, as well as accurate estimates for the leading Lyapunov exponents of the attractor. This has significant implications for data-driven prediction of dynamical instabilities, with the learning algorithm exhibiting unprecedented capabilities for prediction of extreme events, but also for the design and implementation of reduced-order controllers capable of operating in real time.

## SUPPLEMENTARY MATERIAL

See the supplementary material for supplementary figures referenced in the paper.

## ACKNOWLEDGMENTS

The authors acknowledge discussions with Dr. Hassan Arbabi. This work was supported by the Army Research Office (Grant No. W911NF-17-1-0306).

## REFERENCES

*The Theory of Chaotic Attractors*(Springer, 1985), pp. 273–312.

*Proceedings 1995 ECMWF Seminar on Predictability*(ECMWF, 1996), Vol. 1, pp. 143–156.

*International Conference on Machine Learning*(ICML, 2017), Vol. 34, pp. 1–5.

*ICML 2015 AutoML Workshop*(ICML, 2015), Vol. 238.

*Proceedings of the 13th International Conference on Artificial Intelligence and Statistics*(JMLR, 2010), pp. 249–256.

*ICML*(ICML, 1997), pp. 376–384.