Numerous physics theories are rooted in partial differential equations (PDEs). However, increasingly intricate physics equations, especially those lacking analytic solutions or closed forms, have impeded the further development of physics. Computationally solving PDEs by classic numerical approaches suffers from the trade-off between accuracy and efficiency and is not applicable to the empirical data generated by unknown latent PDEs. To overcome this challenge, we present KoopmanLab, an efficient module of the Koopman neural operator (KNO) family, for learning PDEs without analytic solutions or closed forms. Our module consists of multiple variants of the KNO, a kind of mesh-independent, neural-network-based PDE solver developed following dynamic system theory. The compact variants of KNO can accurately solve PDEs with small model sizes, while the large variants of KNO are more competitive in predicting highly complicated dynamic systems governed by unknown, high-dimensional, and non-linear PDEs. All variants are validated by mesh-independent and long-term prediction experiments implemented on representative PDEs (e.g., the Navier–Stokes equation and the Bateman–Burgers equation in fluid mechanics) and ERA5 (i.e., one of the largest high-resolution global-scale climate datasets in earth physics). These demonstrations suggest the potential of KoopmanLab to be a fundamental tool in diverse physics studies related to equations or dynamic systems.

Solving partial differential equations (PDEs) essentially requires characterizing an appropriate solution operator F that relates Φ = Φ(D; R^{d_φ}), a Banach space of inputs (i.e., initial values), with Γ = Γ(D; R^{d_γ}), a Banach space of solutions (i.e., target values), for a typically time-dependent PDE family defined on a bounded open set D ⊂ R^d,
(1)
(2)
(3)
In Eqs. (1)–(3), the set T = [0, ∞) denotes the time domain. The notation L_φ is a differential operator characterized by φ. Mapping κ is a function that lives in a function space determined by L_φ. Mapping γ is the solution of the PDE family that we attempt to obtain. The boundary and initial conditions are denoted by γ_B and γ_I, respectively. Mathematically, deriving an accurate solution operator F : (φ, γ_B, γ_I) ↦ γ is the key step to obtain the PDE solution γ. However, even in the case where the boundary and initial conditions are constant [i.e., the solution operator F : (φ, γ_B, γ_I) ↦ γ reduces to F : φ ↦ γ], deriving an analytic expression of the solution operator F can be highly non-trivial.1,2

The absence of analytic solutions of various important PDEs in science and engineering naturally calls for the rapid development of computational solvers, which attempt to approximate a parametric counterpart F_θ ≃ F parameterized by θ to derive the solution γ.1,3,4 To date, the joint efforts of physics, mathematics, and computer science have given birth to two mainstream families of PDE solvers:5 

  1. The first family of solvers are classic numerical ones. Typical instances of these solvers include finite element (FEM),6 finite difference (FDM),7 and finite volume (FVM)8 methods. In general, these methods discretize space and time domains following specific mesh designs and solve parameterized PDEs on meshes by certain iterative algorithms. Specifically, FEM subdivides the original domain into a set of sub-domains defined by a collection of element equations and recombines these element equations to derive the global solution.6 FDM approximates derivatives as finite differences measured on local values.7 FVM transforms the original problem into a series of surface flux calculations on local volumes.8 

  2. The second family of solvers are neural-network-based ones. In pursuit of accelerating PDE solving and improving applicability to real data, three kinds of neural-network-based solvers have been proposed:

    • One kind of solvers discretize domains D and T into x and y meshes and approximate a finite-dimensional and mesh-dependent solution operator F_θ by a parameterized neural network between finite Euclidean spaces, i.e., F_θ : R^x × R^y × Θ → R^x × R^y (e.g., see Refs. 9–11). Given an arbitrary input γ(x_t), the trained neural network can function as a solution operator to predict γ(x_{t+τ}) = F_θ(γ(x_t)) for a certain time difference τ.

    • Another kind of solvers directly parameterize the equation solution γ as a neural network, i.e., F_θ : D × T × Θ → R (e.g., see Refs. 12–15). These solvers are mesh-independent and accurate in learning a given PDE because they directly map an arbitrary domain and parameter setting to the target equation solution γ.

    • The last kind of solvers, including neural operators, attempt to parameterize a mesh-independent and infinite-dimensional solution operator with neural networks, i.e., F_θ : Φ × Θ → Γ (e.g., see Refs. 5 and 16–21). These mesh-independent solvers can be flexibly implemented on different discretization schemes and only need to be trained once for a given PDE family. The equation solution γ of different instances of the PDE family can be generated by a computationally reusable forward pass of the network,5,19 which can be further accelerated by fast Fourier transform.19 Representative demonstrations of this kind of solver are the Fourier neural operator19 and its variants (e.g., the adaptive Fourier neural operator22 and FourCastNet23,24). These frameworks not only solve PDEs with known expressions but are also able to predict complex dynamic systems governed by unknown PDEs on real datasets (e.g., the climate system23,24).

Although substantial progress has been accomplished by existing PDE solvers from various perspectives, there remain critical challenges in this booming direction.

In practice, the mesh-dependent property of classic numerical solvers implies an inevitable trade-off between computational accuracy and efficiency, i.e., fine-grained meshes ensure accuracy yet coarse-grained meshes are favorable for efficiency.19,25 However, in many cases, the applications of PDE solving (e.g., numerical weather forecasting26,27) require timely and accurate computation. To ensure both accuracy and speed, each computation in the downstream applications supported by classic numerical solvers frequently costs large amounts of computing resources. In cases with limited computing power, a significant time delay may occur. Moreover, all numerical solvers require the explicit definition of target PDEs as a priori knowledge and are less applicable to predicting real data generated by unknown PDEs.19 

As for neural-network-based solvers, challenges still arise from multiple perspectives, even though these solvers have significantly outperformed the classic numerical ones in prediction efficiency. Type (a) solvers, as we have suggested, are mesh-dependent and lack generalization capacities across different mesh designs.5 Type (b) solvers are limited to learning a concrete instance of the PDE rather than the entire family and, consequently, require restarted training given a different instance and cannot handle data with unknown PDEs.5 Although type (c) solvers can learn the entire PDE family in a mesh-independent manner,5,19 they may face challenges in characterizing the long-term behavior of the equation solution γ. To understand these challenges, let us consider the iterative update strategy of neural operators for any x_t ∈ D × {t},5
(4)
in which ε ∈ (0, ∞) denotes the time difference, σ : R → R is an arbitrary element-wise non-linear activation function, W : R^{d_γ̂} → R^{d_γ̂} stands for a linear layer, κ_θ : R^{2d+d_φ} → R^{d_γ̂} is a neural network parameterized by θ, and the mapping γ̂ : D × T → R^{d_γ̂} denotes the parameterized counterpart of the equation solution γ generated by the neural network (e.g., by embedding).5 In Eq. (4), the integral term associated with κ_θ defines a kernel integral operator that parameterizes the Green function J_φ : (D × T) × (D × T) → R,
(5)
where the Green function is determined by ϕ as well. One can see a similar form of Eq. (5) in Ref. 5. Computationally, the iteration of Eq. (4) can be significantly accelerated by Fourier transform, which leads to the well-known Fourier neural operator.19 
From a dynamic system perspective, Eq. (4) is similar to the iterative dynamics of an infinite-dimensional non-linear dynamic system of the equation solution γ_t = γ|_{D×{t}}, where each snapshot γ|_{D×{t}} is generated after function γ acts on all elements in set D × {t}. Mathematically, the dynamics is defined as
(6)
or equivalently
(7)
in which ζ : R^{d_γ} × T → R^{d_γ} denotes the associated infinite-dimensional evolution mapping.
The challenge faced by type (c) solvers lies in the fact that the evolution mapping ζ(·, ·) may be even more intricate than the equation solution γ itself. Let us consider the cocycle property of the flow mapping θ associated with ζ(·, ·), according to modern dynamic system theory28,
(8)
Operator ◦ denotes the composition of mappings. In general, Eq. (8) determines how the equation solution γ evolves across adjoining time intervals. In the special case where ζ(·, ·) is time-independent, i.e., ∂_t ζ(·, t) ≡ 0, Eq. (8) reduces to the autonomous case
(9)
Otherwise, Eq. (8) generally corresponds to the non-autonomous case where the underlying mechanisms governing the evolution of γ vary across time. Consequently, a large ɛ may correspond to a highly non-trivial evolution process of γ, making γ̂(x_{t+ε}) less predictable during iterative updating and significantly reducing the precision of Eq. (4). This phenomenon inevitably impedes the accurate prediction of the long-term dynamics (i.e., ɛ → ∞) of diverse non-linear PDE families (e.g., see those in epidemic prevention,29 economic modeling,30 and weather forecasting23,24). To overcome this obstacle, existing models are forced to improve accuracy at the cost of efficiency.

In this paper, we build on the Koopman neural operator (KNO), one of our latest works,31 to develop an efficient module for PDE solving and overcome the limitation in characterizing the long-term behaviors of complicated PDE families. As a study on computational physics programs, our research makes the following contributions compared with our previous work.31 

First, we generalize the original KNO to four kinds of variants. Beyond the original KNO, these differentiated variants offer more possibilities for data-specific and task-oriented solver designs. Specifically, the compact variants of KNO, realized by multi-layer perceptrons (MLP) and convolutional neural networks (CNN), can accurately solve PDEs with small model sizes. The large variants of KNO, implemented on Vision Transformers, can predict highly intricate dynamic systems governed by unknown, high-dimensional, and non-linear PDEs (e.g., the climate system).

Second, we propose KoopmanLab, a PyTorch module of the Koopman neural operator family, as a self-contained and user-friendly platform for PDE solving. All necessary tools, such as those for data loading, model construction, parameter manipulation, output visualization, and performance quantification, are offered in a user-friendly manner to support customized applications.

Third, we offer comprehensive validation of the proposed module on representative datasets, including those generated by important PDEs in fluid mechanics (e.g., the Navier–Stokes equation and the Bateman–Burgers equation) or obtained by global meteorological recording research (e.g., atmospheric, land, and oceanic climate fields in the ERA5 dataset).32 By measuring accuracy, quantifying efficiency, and comparing all KNO variants with other state-of-the-art alternatives (e.g., the Fourier neural operator19 and FourCastNet23,24), we suggest the potential of our module to serve as an ideal choice for PDE solving and dynamic system prediction.

Although the original Koopman neural operator has been proposed in our earlier work,31 here we elaborate on its mechanisms for completeness. We further present more mathematical details that are not covered in Ref. 31 to analyze the convergence of the original Koopman neural operator.

The Koopman neural operator (KNO) is proposed to deal with the non-linear, and potentially non-autonomous, dynamic system in Eqs. (6) and (7). The idea underlying KNO arises from the pursuit of transforming the non-linear system in Eqs. (6) and (7) into a sufficiently simple linear one
(10)
where g is an appropriate transform and A is a linear operator. In modern dynamic system theory,28 this pursuit may be achieved if we can develop an approach to characterize the Koopman operator K, an infinite-dimensional linear operator governing all possible observations of the dynamic system of the equation solution γ, to act on the flow mapping θ and linearize the dynamics of γ in an appropriate observation space. This idea has been extensively applied in plasma physics,33 fluid dynamics,34 robot kinetics,35 and neuroscience.36 
Mathematically, we need to find a set of observation functions (also termed measurement functions)28 
(11)
such that a family of Koopman operators can be identified for the autonomous [i.e., K_ε : G(R^{d_γ} × T) → G(R^{d_γ} × T)] or the non-autonomous [i.e., K_t^{t+ε} : G(R^{d_γ} × T) → G(R^{d_γ} × T)] case. These Koopman operators can function on the observations of γ to update them
(12)
(13)
where Eqs. (12) and (13) correspond to the autonomous and non-autonomous cases, respectively. The updating is implemented in a linear manner, which can be illustrated by taking the non-autonomous case as an example
(14)
Apart from the linear system of g(γ_t) in Eq. (14), one may also consider the Lie operator [i.e., the Lie derivative of g along the vector field γ], which is the generator of such a Koopman operator37–39 
(15)
Equation (15) defines a linear system of g(γ_t) as well
(16)
which can also be considered in the application.
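As a concrete toy illustration of this linearization (a standard textbook example, not drawn from Ref. 31), consider the two-dimensional system γ̇_1 = μγ_1, γ̇_2 = λ(γ_2 − γ_1^2). The observables g(γ) = (γ_1, γ_2, γ_1^2) span a Koopman-invariant subspace on which the dynamics are exactly linear, which the following sketch verifies numerically.

```python
import torch

mu, lam, eps = -0.05, -1.0, 0.01       # toy parameters and Koopman time step

def flow(state: torch.Tensor, dt: float = 1e-4) -> torch.Tensor:
    """Advance the nonlinear system by eps using small Euler sub-steps."""
    x1, x2 = state[0], state[1]
    for _ in range(int(eps / dt)):
        x1, x2 = x1 + dt * mu * x1, x2 + dt * lam * (x2 - x1 ** 2)
    return torch.stack([x1, x2])

def g(state: torch.Tensor) -> torch.Tensor:
    """Observables (x1, x2, x1^2) spanning a Koopman-invariant subspace."""
    return torch.stack([state[0], state[1], state[0] ** 2])

# Lifted generator A acting on (x1, x2, x1^2); the eps-step Koopman operator is exp(eps * A).
A = torch.tensor([[mu, 0.0, 0.0],
                  [0.0, lam, -lam],
                  [0.0, 0.0, 2.0 * mu]])
K = torch.linalg.matrix_exp(eps * A)

state = torch.tensor([1.0, 0.5])
print(torch.allclose(K @ g(state), g(flow(state)), atol=1e-4))  # True up to integration error
```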
To understand the linearization of g(γ_t) by the Koopman operator from the perspective of PDE solving, let us consider the Lax pair (M, N) of an integrable version of Eqs. (1)–(3),40
(17)
(18)
(19)
where D_x^n denotes the n-th total derivative operator and I is an identity operator. Equation (18) denotes an eigenvalue problem at moment t. A relationship between the linear operators M and N can be identified if we calculate the time derivative of Eq. (18),
(20)
which directly leads to
(21)
where [M, N] = MN − NM denotes the commutator of operators. Combining Eqs. (17)–(21) with Eq. (16), we can readily see the close relation between N and K_t^{t+ε},
(22)
which holds in the autonomous case as well. In sum, the linearization of g(γ_t) is intrinsically related to the Lax pair and the inverse scattering transform of integrable PDEs.40 Note that similar ideas have been comprehensively explored in mathematics and physics.41–44 
Once we find a way to derive the Koopman operator, we can reformulate Eq. (4) as
(23)
where γ̂_t = γ̂|_{D×{t}}. Certainly, an infinite-dimensional linear operator is not operable in practice. To enable neural networks to learn a potential Koopman operator, we need to consider K̂ ∈ R^{r×r}, a finite matrix, as a counterpart of K that acts on K = span(Ĝ), a finite invariant sub-space spanned by Ĝ = {g_1, …, g_r} ⊂ G(R^{d_γ} × T),
(24)
where (ν_1, …, ν_r) ∈ R^r and ⟨·, ·⟩ denotes the inner product. Mathematically, any finite set of eigenfunctions of the Koopman operator K can span a finite invariant sub-space.

Numerous previous studies have sought to characterize the Koopman operator by machine-learning-based approaches. Some approaches are highly practical but limited to the autonomous case [e.g., the case in Eq. (12)].45–48 Other approaches are more general in application but critically depend on a priori knowledge about the eigenvalue spectrum (e.g., the numbers of real and complex eigenvalues) of the Koopman operator to deal with the continuous spectrum problem.49 

In practice, a balance should be reached between mathematical completeness and computational practicality. An ideal Koopman-operator-based PDE solver should fit both autonomous and non-autonomous cases and limit the dependence on a priori knowledge as much as possible (even though these restraints inevitably reduce mathematical completeness). To explore such a balance, we introduced the Koopman neural operator (KNO), a flexible approach, in our previous work.31 

The formalization of KNO begins with the Krylov sequence50 of the observable defined by a unit time step ε ∈ (0, ∞), which is used in the Krylov subspace approach for computing the eigenvalues of large matrices.50 One can see its application in Koopman-operator-related algorithms such as Hankel-DMD,51 sHankel-DMD,52 and HAVOK.53 Specifically, the Krylov sequence is given as
(25)
which is generated by K and g(γ_0),
(26)
Computationally, the Krylov sequence can be sampled by a Hankel matrix of observations
(27)
where m ∈ N^+ denotes the dimension of delay-embedding. In Eq. (27), each column is a sampled result that approximates a function in the Krylov subspace.
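For concreteness, the delay-embedded Hankel matrix of Eq. (27) can be assembled from a sequence of observation snapshots as in the following sketch (an illustration of the data layout only; the variable names are ours).

```python
import torch

def hankel_matrix(obs: torch.Tensor, m: int) -> torch.Tensor:
    """Build an m x n Hankel matrix whose columns are delay-embedded windows of observations.

    obs: tensor of shape (T,) holding g(gamma_0), g(gamma_eps), g(gamma_2eps), ...
    """
    n = obs.shape[0] - m + 1                       # number of columns (shifted windows)
    columns = [obs[k:k + m] for k in range(n)]     # k-th column: g_k, g_{k+1}, ..., g_{k+m-1}
    return torch.stack(columns, dim=1)             # shape (m, n)

observations = torch.arange(8.0)                   # toy scalar observation sequence
print(hankel_matrix(observations, m=3))            # each column shifts the window by one step
```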
If the Koopman operator has a discrete spectrum (e.g., has eigenvalues), there exists an invariant subspace K of the Koopman operator, which can be spanned by the Krylov subspace
(28)
as long as n ≥ dim(K) − 1 [here, dim(·) denotes the dimensionality]. This property suggests the possibility of approximating the actual Koopman operator on K by K̂_t^{t+ε} : G(R^{d_γ} × T) → K, a finite Koopman operator restricted to K, for any t ∈ T. Mathematically, the matrix K̂ is required to satisfy the Galerkin projection relation
(29)
where h ∈ G(R^{d_γ} × T) is an arbitrary function.54,55 If the target Koopman operator is bounded and H_{m×n} spans its invariant subspace, the approximation can be realized by
(30)
where μ is a measure on G(R^{d_γ} × T) and ‖·‖_F denotes the Frobenius norm. Once a restricted Koopman operator is derived, we can obtain the following iterative dynamics:
(31)
in which H_{m×n}(k) is the k-th column of H_{m×n}.

As for the case where the Koopman operator has a continuous spectrum (e.g., has no eigenvalue), there is no finite invariant subspace to support computational approximation. Such an ill-posed situation remains for future exploration.

The restricted Koopman operator K̂ can be learned efficiently if it corresponds to an autonomous system, i.e., K̂_t^{t+ε} = K̂_ε. However, an online optimization will be necessary if it corresponds to a non-autonomous system, i.e., K̂_t^{t+ε} is time-varying. Limited by computing resources or data size, expensive online optimization may not always be available during PDE solving. Therefore, we propose a compromised approach to realize offline training under the ergodic assumption51,56 of the dynamic system of γ_t, i.e., γ_t ultimately visits every possible system state as t → ∞. Under this assumption, the proportion of retention time of γ_t at a certain system state is equivalent to the probability of this state in the space, making the time-average equivalent to the actual expectation at the limit of infinite time. Based on this property, we can define an expectation of the restricted Koopman operator associated with ɛ,
(32)
(33)
For a fixed time difference ɛ, the expected Koopman operator K̄_ε : G(R^{d_γ} × T) → K is a time-average of K̂_t^{t+ε} that can be learned during offline optimization.
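In this offline setting, the expected operator of Eqs. (32) and (33) can be estimated in a DMD-like fashion as the least-squares map that advances every delay-embedded column by one step. The sketch below illustrates this idea on a toy linear system; it is a simplified stand-in for the trainable linear layer that KoopmanLab actually optimizes.

```python
import torch

def fit_koopman(hankel: torch.Tensor) -> torch.Tensor:
    """Least-squares estimate of a finite Koopman operator K with K @ X ≈ Y,
    where X and Y collect the Hankel columns at times t and t + eps."""
    X, Y = hankel[:, :-1], hankel[:, 1:]
    return Y @ torch.linalg.pinv(X)        # minimizes ||K X - Y||_F

# Toy usage: the latent dynamic is exactly linear, so the estimate recovers it.
true_K = torch.tensor([[0.9, 0.1],
                       [0.0, 0.8]])
states = [torch.tensor([1.0, 1.0])]
for _ in range(20):
    states.append(true_K @ states[-1])
snapshots = torch.stack(states, dim=1)     # columns play the role of delay-embedded samples
print(fit_koopman(snapshots))              # close to true_K
```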
Given an ideal setting of m, we can ensure the convergence of the eigenvalues and eigenfunctions of K̂ to those of K under the ergodic assumption. Similar conclusions can be seen in Ref. 54. To understand this convergence, we need to indicate several important properties. First, as we have mentioned, there exists an equivalence relation between the time-average and the real expectation as time approaches infinity (i.e., the Birkhoff ergodic theorem51,56)
(34)
Second, Eq. (34) directly implies that
(35)
(36)
where * denotes the complex conjugate and ⟨·, ·⟩_K stands for the inner product of functions in K. Given the learned Koopman operator K̄_ε, Eq. (36) coincides with the definition of the actual Gramian matrix V associated with the inner product space K,
(37)
(38)
where we mark K̄_ε^{i−1}(R_n) = (K̄_ε^{i−1} g(γ_0), …, K̄_ε^{i−1} g(γ_{nε})) for convenience. Equation (38) is derived from the fact that H_{m×n} serves as the sampling of R_n. Meanwhile, the left side of Eq. (35) coincides with the empirical Gramian matrix V̂ associated with the matrix H_{m×n},
(39)
Our formal proof can be developed based on these two properties. Let us consider the first r < n elements of R_n,
(40)
which defines a possible basis of K.
Theoretically, the learned Koopman operator restricted to K can be represented by a companion matrix
(41)
The last column of C denotes the coordinate of K̄_ε^r g(γ_0) defined by the basis, which should be calculated as
(42)
Empirically, the learned Koopman operator restricted to span(H_{m×n}) can also be represented by a companion matrix, whose last column can be calculated as
(43)
It is trivial to prove
(44)
by applying Eqs. (36), (38), and (39). Similarly, we can show
(45)
Therefore, we can derive
(46)
based on Eqs. (44) and (45), implying that
(47)
In Eq. (47), the notation PM_k denotes the set of all k-th order principal minors of the corresponding matrix. Now, let us consider the characteristic polynomials of Ĉ and C,
(48)
(49)
whose distance in the limit m → ∞ can be measured as
(50)
(51)
Because the roots of a given polynomial evolve continuously as a function of its coefficients, Ĉ and C share the same set of eigenvalues in the limit m → ∞ since their characteristic polynomials converge to the same polynomial. Moreover, the convergence of Ĉ to C and the convergence of the eigenvalues of Ĉ to those of C eventually imply the convergence of the eigenfunctions.

In Ref. 31, we have proposed an architecture to implement the original Koopman neural operator on neural networks. The details of architecture designs are presented below:

  • Part 1: Observation. An encoder [e.g., a single non-linear layer with tanh activation function in the original Koopman neural operator] serves as the observation function g to transform φ_t = φ|_{D×{t}}, an arbitrary input of the PDE family (e.g., φ_t can be directly set as γ_t), into g(γ̂_t) ∈ G(R^{d_γ̂} × T),
    (52)
    See Fig. 1 for illustrations.
  • Part 2: Fourier transform. Similar to the Fourier neural operator,19 the original Koopman neural operator also utilizes the Fourier transform during the iterative update of the Green function in Eq. (5). Given g(γ̂_t), we derive the Fourier transform g_F = F(g), where we truncate the Fourier series at ω, a maximum frequency
    (53)
    Note that χ denotes the indicator function [i.e., χ_A(a) = 1 if a ∈ A, otherwise χ_A(a) = 0]. Computationally, the above transform is implemented by the fast Fourier transform. For convenience, we mark
    (54)
    as the transformed result of γ̂_t. Different from Ref. 19, our main motivation for using the truncated Fourier transform is to extract the low-frequency information (i.e., main system components) of the represented equation solution g(γ̂_t). Certainly, frequency truncation inevitably causes the loss of high-frequency information (i.e., high-frequency perturbations or edges). In the original Koopman neural operator, Part 5 is designed to complement the lost high-frequency information. See Fig. 1 for more details.
  • Part 3: Hankel representation and offline Koopman operator. Once we have derived g_F(γ̂_t) for every t ∈ εN^+, a Hankel matrix Ĥ_{m×n} of g_F(γ̂_t) is generated following m ∈ N, a dimension of delay-embedding (note that n ∈ N is the number of all accessible samples)
    (55)
    We train an o × o linear layer to learn a neural network representation of the Koopman operator K̄_ε : G(R^{d_γ̂} × T) → K̂ following Eqs. (32) and (33), where K̂ is spanned by Ĥ_{m×n}. The learned K̄_ε can be used to predict the future state of g_F(γ̂_{(m+n−1)ε}) as
    (56)
    where superscript T denotes the transpose of a matrix. See Fig. 1 for illustrations.
  • Part 4: Inverse Fourier transform. After g_F(γ̂_{(m+n)ε}) is predicted in Part 3, it is transformed from the Fourier space back to G(R^{d_γ̂} × T) by an inverse Fourier transform
    (57)
    where t = (m + n + r − 1)ε. For convenience, we mark
    (58)
    See Fig. 1 for instances of Part 4.
  • Part 5: High-frequency information complement. In the original Koopman neural operator, we use a convolutional layer to extract the high-frequency components of g(γ̂_t) because convolutional layers can amplify high-frequency components according to Fourier analysis.57 Therefore, we train a convolutional layer C on the outputs of Part 1 to extract their high-frequency information. As a complement of Parts 2–4, the convolutional layer realizes a forward prediction of high-frequency information
    (59)
    See Fig. 1 for illustrations.
  • Part 6: Inverse observation. Once two future states, g(γ̂_{(m+n)ε}) and g_C(γ̂_{(m+n)ε}), are predicted by Parts 2–4 and Part 5, they are unified in a linear manner
    (60)
    Given g_U(γ̂_{(m+n)ε}), a non-linear decoder [e.g., a single non-linear layer with tanh activation function in the original Koopman neural operator] is trained to approximate the inverse of the observation function
    (61)
    and derive
    (62)
    as the target state of the equation solution in space R^{d_γ̂}. See Fig. 1 for illustrations.

FIG. 1.

Conceptual illustrations of the neural network architecture of the original Koopman neural operator. (a) summarizes the key mathematical transforms in each part, where r is the prediction length. (b) visualizes a prediction instance on the two-dimensional Navier–Stokes equation.

Parts 1–6 define the iterative update strategy of Eq. (23). For any t′ > t with t ∈ εN, the iterative dynamics is given as
(63)
in which γ̂_{[t−mε, t]} denotes the vector (γ̂_{t−mε}, …, γ̂_t). The loss function for optimizing Eq. (63) is defined as
(64)
where λ_p, λ_r ∈ [0, ∞) denote the weights of the prediction and reconstruction processes in the loss function.

The one-unit architecture of the original Koopman neural operator is visualized in Fig. 1. A multi-unit architecture can be readily constructed by cascading copies of Parts 2–5 multiple times.
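To summarize Parts 1–6 operationally, the following is a schematic one-dimensional, single-unit forward pass written in plain PyTorch. It mirrors the structure described above (encode, truncated Fourier transform, repeated application of a learned linear operator, inverse transform, convolutional high-frequency branch, decode), but it is our own simplified sketch rather than the exact KoopmanLab implementation.

```python
import torch
import torch.nn as nn

class KNO1dSketch(nn.Module):
    """Schematic single-unit KNO for 1D fields (illustrative, not the library code)."""
    def __init__(self, o: int = 32, f: int = 16, r: int = 4):
        super().__init__()
        self.f, self.r = f, r
        self.encoder = nn.Sequential(nn.Linear(1, o), nn.Tanh())   # Part 1: observation g
        self.koopman = nn.Linear(o, o, bias=False)                  # Part 3: o x o linear operator
        self.hf_conv = nn.Conv1d(o, o, kernel_size=3, padding=1)    # Part 5: high-frequency branch
        self.decoder = nn.Sequential(nn.Linear(o, 1), nn.Tanh())    # Part 6: inverse observation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.encoder(x)                                  # (batch, grid, o)
        # Part 2: truncated Fourier transform along the spatial grid
        g_ft = torch.fft.rfft(g, dim=1)[:, : self.f, :]      # keep f low-frequency modes
        # Part 3: apply the learned linear operator r times in the Fourier space
        for _ in range(self.r):
            g_ft = torch.complex(self.koopman(g_ft.real), self.koopman(g_ft.imag))
        # Part 4: inverse Fourier transform back to the observation space
        low = torch.fft.irfft(g_ft, n=x.size(1), dim=1)      # (batch, grid, o)
        # Part 5: convolutional complement of high-frequency information
        high = self.hf_conv(g.permute(0, 2, 1)).permute(0, 2, 1)
        # Part 6: linear combination of the two branches, then decode
        return self.decoder(low + high)                      # (batch, grid, 1)

model = KNO1dSketch()
print(model(torch.randn(2, 256, 1)).shape)                   # torch.Size([2, 256, 1])
```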

Beyond the original Koopman neural operator (KNO),31 we generalize it to four kinds of variants to fit in with different application demands.

The compact KNO sub-family includes two compact variants of KNO realized by multi-layer perceptrons (MLP-KNO) and convolutional neural networks (CNN–KNO). These variants are proposed to accurately solve PDEs with small model sizes. Specifically, they are designed following Eqs. (52)–(62), where the encoder and decoder are defined as
(65)
(66)
In Eqs. (65) and (66), the mapping η denotes a non-linear activation function [e.g., we use tanh in our research], W_e and W_d are two weight matrices of the corresponding sizes, and C_e and C_d are two convolutional layers. Our motivation for proposing different types of encoders and decoders is to enable the defined model to fit diverse application demands. In general, a multi-layer perceptron encoder or decoder can be efficiently and flexibly implemented on any shape of data, while a convolutional encoder or decoder may be better at extracting the local information (e.g., boundaries) of high-resolution data.

Our proposed KoopmanLab module offers user-friendly tools to customize an MLP-KNO. A customized instance is presented below.
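The following sketch illustrates such a customization; the constructor and argument names (e.g., kp.model.koopman, backbone, autoencoder, o, f, r) are illustrative assumptions and may differ from the released interface, so please consult the repository for the exact API.

```python
import torch
import koopmanlab as kp  # the KoopmanLab module

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical constructor-style call for an MLP-based compact KNO; argument
# names are illustrative assumptions rather than the documented interface.
mlp_kno = kp.model.koopman(
    backbone="KNO2d",     # two-dimensional compact KNO
    autoencoder="MLP",    # multi-layer perceptron encoder/decoder, Eq. (65)
    o=32,                 # operator size
    f=16,                 # number of retained frequency modes
    r=8,                  # power of the Koopman operator used in prediction
    device=device,
)
```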

Similarly, a CNN–KNO can be customized using the following code, where we present one-dimensional and two-dimensional versions.
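A corresponding sketch for the CNN-based variants is given below; as above, the exact constructor and argument names are assumptions.

```python
import torch
import koopmanlab as kp

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical one- and two-dimensional CNN-KNO instances, Eq. (66); names are illustrative.
cnn_kno_1d = kp.model.koopman(backbone="KNO1d", autoencoder="Conv1d",
                              o=32, f=16, r=8, device=device)
cnn_kno_2d = kp.model.koopman(backbone="KNO2d", autoencoder="Conv2d",
                              o=32, f=16, r=8, device=device)
```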

To validate the proposed compact KNO sub-family in PDE solving tasks, we design mesh-independent and long-term prediction tasks on representative PDEs (e.g., the two-dimensional Navier–Stokes equation58 and the one-dimensional Bateman–Burgers equation59). The numerical datasets of these two equations are provided by Ref. 19.

Specifically, the incompressible two-dimensional Navier–Stokes equation has a vorticity form
(67)
(68)
(69)
in which γ denotes the vorticity, χ measures the velocity, and ψ is a time-independent forcing term. The viscosity coefficient is ν ∈ {10^{-3}, 10^{-4}} in our research. Given the data with the highest mesh resolution, one can further generate data with a lower resolution by direct down-sampling.19 The data with the highest mesh resolution has 2^13 grids.19 Our KoopmanLab module offers a function to load the data of the incompressible two-dimensional Navier–Stokes equation.
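A hedged sketch of such a loading call is given below; the function name, file path, and arguments are illustrative assumptions rather than the documented interface.

```python
import koopmanlab as kp

# Hypothetical loader for the two-dimensional Navier-Stokes dataset of Ref. 19;
# the function name, file name, and arguments are placeholders for illustration.
train_loader, test_loader = kp.data.navier_stokes(
    "data/ns_V1e-3_N5000_T50.mat",   # placeholder path to the downloaded dataset
    batch_size=10,
    T_in=10,                          # length of the observed history (placeholder)
    T_out=40,                         # length of the prediction horizon (placeholder)
    sub=1,                            # spatial down-sampling factor
)
```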
The one-dimensional Bateman–Burgers equation is defined as
(70)
(71)
in which γ_I stands for a periodic initial condition γ_I ∈ L²_periodic((0, 1); R) and the parameter ν ∈ (0, ∞) is the viscosity coefficient. We set ν = 100 in our research. The data with the highest mesh resolution has 2^16 grids.19 To load this dataset, one can use the following function:
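As above, the following is an illustrative sketch only; the loader name and arguments (kp.data.burgers, sub, etc.) are assumptions.

```python
import koopmanlab as kp

# Hypothetical loader for the one-dimensional Bateman-Burgers dataset of Ref. 19.
train_loader, test_loader = kp.data.burgers(
    "data/burgers_data_R10.mat",   # placeholder file name
    batch_size=10,
    sub=32,                        # spatial down-sampling factor controlling mesh resolution
)
```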

In Fig. 2(a), we validate the mesh-independent property of the proposed compact KNO sub-family adopting the same setting used in our earlier work.31 The mesh-independent property, as suggested by Refs. 5 and 16–21, arises from the fact that the neural operator is expected to learn the solution operator of an entire PDE family rather than be limited to a concrete parameterized instance. Specifically, we conduct the experiments on the data of the one-dimensional Bateman–Burgers equation associated with different mesh granularity conditions (i.e., spatial resolutions of meshes). Different versions of the compact KNO sub-family are defined by changing hyper-parameters [e.g., the operator size o, the frequency mode number f, and the power of the Koopman operator r = (t′ − t)/ε in Eq. (56)]. These models are trained on 1000 randomly selected samples with the lowest spatial resolution and conduct 1-s forward prediction on 200 samples associated with different resolutions. The batch size is set as 10, the learning rate is initialized as 0.001 and halved every 100 epochs, and the weights of prediction and reconstruction in Eq. (64) are set as (λ_p, λ_r) = (5, 0.5). All models are trained by the Adam optimizer61 on a single NVIDIA Titan V graphics processing unit (GPU). As shown in Fig. 2(a), the prediction errors of all versions of the compact KNO sub-family remain constant across different spatial resolutions, suggesting the capacity of the compact KNO sub-family to be mesh-independent. Mesh-independence is important for PDE solving because it allows one to train a neural-network-based PDE solver on data with low spatial resolution and directly apply the solver to data with high spatial resolution, which helps balance accuracy and efficiency in PDE solving. In our earlier work,31 one can further see a detailed comparison between the original KNO and FNO19 in the mesh-independent prediction task, where the original KNO outperforms FNO with a much smaller model size (e.g., a size of 5 × 10^3 for KNO and a size of 2 × 10^7 for FNO). Other neural operator models, such as the graph neural operator (GNO)5 and the multipole graph neural operator (MGNO),62 are no longer considered because they have been demonstrated to be less accurate than FNO as reported by Ref. 19.

FIG. 2.

Experimental validation of the compact KNO sub-family. (a) Test performances in the mesh-independent experiment on the Bateman–Burgers equation. (b) Test performances in the long-term prediction experiment on the two-dimensional Navier–Stokes equation with a viscosity coefficient ν = 10−3. (c) Test performances in the same long-term prediction experiment on the two-dimensional Navier–Stokes equation with ν = 10−4. (d) The examples of prediction results and errors (MSE) on the two-dimensional Navier–Stokes equation with ν = 10−3. (e) The examples of prediction results and errors (MSE) on the two-dimensional Navier–Stokes equation with ν = 10−4.


In Figs. 2(b)–2(e), we validate the compact KNO sub-family by a long-term prediction task designed on the two-dimensional Navier–Stokes equation datasets with viscosity coefficients ν = 10^{-3} [Fig. 2(b)] and ν = 10^{-4} [Fig. 2(c)]. A down-sampling scaling factor of 2 is defined to generate the datasets with 2^12 grids. For comparison, a one-unit FNO is defined following the default setting introduced in Ref. 19. A 40-time-interval prediction task is conducted on the dataset with ν = 10^{-3}, where models are trained on 1000 samples of γ|_{[0,1]^2 × [0,10]} and tested on 200 samples of γ|_{[0,1]^2 × [10,50]}. Similarly, a more challenging 10-time-interval prediction task is conducted on the dataset with ν = 10^{-4}, in which models are trained on 8000 samples of γ|_{[0,1]^2 × [0,10]} and tested on 200 samples of γ|_{[0,1]^2 × [10,20]}. Figures 2(b) and 2(c) report the prediction performance of all models as a function of increasing prediction duration length. Figures 2(d) and 2(e) visualize predicted instances and errors in the cases with ν = 10^{-3} [Fig. 2(d)] and ν = 10^{-4} [Fig. 2(e)] during testing. All experimental results suggest the strong potential of the compact KNO sub-family in characterizing the long-term evolution of PDE solutions. Combining these results with the model sizes measured in Table I, we suggest that the compact KNO sub-family realizes a better balance between accuracy and efficiency because a KNO variant with a smaller model size can still outperform FNO significantly. Meanwhile, although KNO costs more time in training and testing than FNO, it is significantly faster than the pseudo-spectral method,60 a representative numerical approach for simulating fluids. Because the datasets of the two-dimensional Navier–Stokes equation58 and the one-dimensional Bateman–Burgers equation59 used in our experiments are generated by the pseudo-spectral method60 following Ref. 19, our results suggest the potential of a trained KNO model to achieve an acceleration of multiple orders of magnitude while reproducing the accurate outputs of the pseudo-spectral method60 with small errors.

TABLE I.

Comparison between the implemented models in Figs. 2(b) and 2(c). Parameters o, f, and r in the model settings denote the size of the Koopman operator in Eq. (55), the number of remaining frequency components after frequency truncation in Eq. (53), and the iteration number of the Koopman operator in Eq. (55), respectively. The parameter numbers of machine learning models are counted by the tool provided by Refs. 23 and 24. On a single NVIDIA Titan V GPU, the training time cost is defined as the average time consumption of each epoch of training when the batch size is 10 and the sample number is 1000. The testing time cost is measured as the average time consumption for each implemented approach to finish the prediction or computation of γ|_{[0,1]^2 × [10,50]} once. The pseudo-spectral method60 is implemented following Refs. 23 and 24 to generate the datasets of the two-dimensional Navier–Stokes equation, where the unit time step is set as 10^{-4} to ensure numerical stability.

Models | Settings | Parameter number | Training time cost | Testing time cost
FNO19 | Default settings, one-unit | 233 897 | 16.386 s/epoch | 0.0567 s/sample
MLP-KNO | (o, f, r) = (32, 10, 12) | 206 538 | 89.082 s/epoch | 0.2389 s/sample
MLP-KNO | (o, f, r) = (32, 16, 8) | 526 026 | 64.964 s/epoch | 0.1634 s/sample
MLP-KNO | (o, f, r) = (48, 10, 12) | 464 170 | 86.740 s/epoch | 0.2380 s/sample
CNN–KNO | (o, f, r) = (32, 10, 12) | 206 538 | 89.928 s/epoch | 0.2360 s/sample
CNN–KNO | (o, f, r) = (32, 16, 8) | 526 026 | 67.814 s/epoch | 0.1736 s/sample
CNN–KNO | (o, f, r) = (48, 10, 12) | 464 170 | 89.042 s/epoch | 0.2287 s/sample
Pseudo-spectral method60 | Calculates γ|_{[0,1]^2 × [10,50]} | ⋯ | ⋯ | 511.09 s/sample

Different from the compact KNO sub-family, the ViT-KNO sub-family is proposed for dealing with more intricate situations (here ViT stands for Vision Transformer63). Numerous applications of PDE solving (e.g., global climate forecasting) require the solver to capture the underlying patterns of ultra-large datasets that may be related to certain unknown PDEs. Meanwhile, there may exist multiple variables of interest that are governed by a group of coupled PDEs. To fit these situations, we follow the main idea of the compact KNO sub-family to develop a kind of transformer-based PDE solver. The mechanism underlying the proposed ViT-KNO sub-family is not completely the same as Eqs. (52)–(62) because some mathematical details are modified to improve model applicability on noisy real datasets. We suggest the benefits of our modifications based on an experiment on ERA5, one of the largest datasets of global atmospheric, land, and oceanic climate fields.32 Nevertheless, more in-depth mathematical analyses of these modifications remain for future studies.

Let us consider a case where there exist h coupled variables, {γ^1, …, γ^h}, defined on domain D × T. The dynamics of these variables are governed by a group of PDEs with unknown expressions. The objective is to learn the equation solutions of these latent PDEs such that the dynamics of {γ^1, …, γ^h} can be accurately characterized.

The architecture of the ViT-KNO sub-family consists of seven parts. Below, we present a detailed computational implementation of each part.

  • Part 1: Observation. Similar to the encoder design in the compact KNO sub-family, an encoder component is implemented in ViT-KNO to serve as the observation function g and transform φ_t^s = φ^s|_{D×{t}} into g(γ̂_t^s) for each s ∈ {1, …, h}. Specifically, the encoder is realized by the token embedding layer in the Vision Transformer (ViT).63 Given a joint input (φ_t^1, …, φ_t^h) ∈ R^{d_φ × h}, we first transform it into a three-dimensional token tensor Φ_t by a convolutional layer C_e and an arbitrary non-linear transform Q,
    (72)
    where domain D is reorganized into u × v patches (i.e., tokens). A patch is a square, non-overlapping macro-mesh. If domain D has already been discretized into multiple meshes, then the size of a patch equals the number of meshes it covers. Parameter l denotes a customized embedding dimension, which is not necessarily the same as h. In real cases, the definition of the non-linear transform Q can be selected according to different application demands. The derived tensor g(Γ̂_t) denotes the joint representation of (g(γ̂_t^1), …, g(γ̂_t^l)). See Fig. 3 for illustrations.
  • Part 2: Fourier transform. Similar to the adaptive Fourier neural operator,23,24,64 a truncated Fourier transform is applied on the first two dimensions of g(Γ̂_t) to derive the Fourier series of each embedded variable s ∈ {1, …, l},
    (73)
    where ⟨u⟩ = {1, …, u}. For convenience, we mark
    (74)
    (75)
    in which s ∈ {1, …, l}. Similar to the compact KNO sub-family, frequency truncation leads to the loss of high-frequency information. In the ViT-KNO sub-family, Part 5 is designed for complementing high-frequency information. See Fig. 3 for details.
  • Part 3: Koopman-operator-associated component. After deriving g_F(Γ̂_t) for every t ∈ εN^+, a Koopman-operator-associated component is designed to function on the third dimension of every token in g_F(Γ̂_t) and realize the iterative dynamics
    (76)
    in which the Koopman operator K̄_ε is learned by a linear transform and the layer S is constructed by a non-linear activation function η acting on a linear layer W,
    (77)
    Although S is not a part of the original Koopman neural operator,31 including it can efficiently enhance the capacity of this component to characterize intricate large-scale data. In KoopmanLab, the leaky rectified linear unit (Leaky ReLU)65 is suggested as a default choice of S, which can also reduce to the ReLU function as a special case. See Fig. 3 for illustrations of Part 3.
  • Part 4: Inverse Fourier transform. Once g_F(Γ̂_{t+ε}) is derived in Part 3, an inverse Fourier transform is applied to map it back to the observation space
    (78)
    based on which we can define
    (79)
    See instances in Fig. 3.
  • Part 5: High-frequency information complement. As in the compact KNO sub-family, ViT-KNO contains a component for complementing high-frequency information. This component is also realized by a convolutional layer C that acts on the outputs of Part 1 to learn the dynamics of high-frequency information
    (80)
    See Fig. 3 for illustrations.
  • Part 6: Variable coupling. Given the two states, g(Γ̂_{t+ε}) and g_C(Γ̂_{t+ε}), predicted by Parts 2–4 and Part 5, we combine them in a linear form
    (81)
    Because ViT-KNO is designed to learn multi-variate systems governed by unknown coupled PDEs, we need to characterize the coupling relation among variables. Because we lack a priori knowledge about these underlying PDEs, we suggest capturing the coupling relation by optimizing a non-linear layer M,
    (82)
    Following the idea of adaptive Fourier neural operator,23,24,64 we use the Gaussian Error Linear Unit (GELU) as the activation function in this non-linear layer. See Fig. 3 for illustrations.
  • Part 7: Inverse observation. Given g_M(Γ̂_{t+ε}), a decoder is implemented to function as the inverse of the observation function
    (83)
    Similar to the compact KNO sub-family, there are two kinds of decoders included in our proposed KoopmanLab module
    (84)
    in which W_d and C_d denote linear and convolutional layers, respectively. These two kinds of decoder designs distinguish the two variants of the ViT-KNO sub-family. See Fig. 3 for illustrations.

FIG. 3.

Conceptual illustrations of neural network architectures of the ViT-KNO sub-family. (a) summarizes the computational design of each part. (b) illustrates an instance of the parallel design (i.e., multi-head design) of the ViT-KNO sub-family, where the depth of each head is 1.

Parts 1–7 define the iterative update strategy of Eq. (23) in a multi-variate case. For any t ∈ T, the iterative dynamics is given as
(85)
Multi-step prediction can be realized in an iterative manner. The loss function for optimizing Eq. (85) is
(86)
where λ_p, λ_r ∈ [0, ∞) are the weights of prediction and reconstruction.

Several computational tricks can be considered in the application. First, a LASSO regularization66 can be included to improve the robustness and sparsity of the Koopman operator K̄_ε in Eq. (76). This trick has been demonstrated to be effective in the adaptive Fourier neural operator23,24,64 and is applicable to the ViT-KNO sub-family as well. Second, the transformer architecture supports a parallel design of the ViT-KNO sub-family. Specifically, the third dimension of the output of Part 1 can be subdivided into multiple parts [e.g., g(Γ̂_t) ∈ R^{u×v×l} is subdivided into k parts such that each part is an element in R^{u×v×(l/k)}]. Then, Parts 2–6 are copied k × j times, where each group of j copies is organized into a sequence. Each sequence of j copies is referred to as a head in the transformer, processes a corresponding 1/k part of g(Γ̂_t) ∈ R^{u×v×l}, and shares parameters during optimization (see Fig. 3). Computationally, parameters k and j are referred to as the number and the depth of heads. The processed outputs of these k parallel heads are unified by Part 7 to derive the final prediction result. In our proposed KoopmanLab, these two tricks are included to improve computational efficiency.
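The head-splitting trick can be sketched as follows; this is our own minimal illustration of the tensor bookkeeping (parameter sharing among copies and the Fourier-domain mixing inside each head are omitted for brevity), not the library implementation.

```python
import torch
import torch.nn as nn

class HeadSplitMixer(nn.Module):
    """Minimal illustration of the multi-head parallel design (k heads of depth j)."""
    def __init__(self, embed_dim: int = 768, k: int = 16, j: int = 2):
        super().__init__()
        assert embed_dim % k == 0
        self.k = k
        # each head owns a depth-j stack acting on an (embed_dim // k)-sized chunk
        self.heads = nn.ModuleList([
            nn.Sequential(*[nn.Linear(embed_dim // k, embed_dim // k) for _ in range(j)])
            for _ in range(k)
        ])

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, u, v, embed_dim) -> split the embedding dimension into k chunks
        chunks = tokens.chunk(self.k, dim=-1)
        mixed = [head(chunk) for head, chunk in zip(self.heads, chunks)]
        return torch.cat(mixed, dim=-1)          # re-assemble before the decoder (Part 7)

print(HeadSplitMixer()(torch.randn(2, 180, 90, 768)).shape)   # torch.Size([2, 180, 90, 768])
```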

Our KoopmanLab module supports customizing ViT-KNO frameworks. Below, we present an instance of ViT-KNO with a multi-layer perceptron as the decoder
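An illustrative sketch of such an instantiation is shown below; the constructor and argument names (kp.model.koopman_vit, decoder, head_num, head_depth, etc.) are assumptions and may differ from the released interface.

```python
import torch
import koopmanlab as kp

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical ViT-KNO with an MLP decoder; argument names are illustrative assumptions.
vit_kno_mlp = kp.model.koopman_vit(
    decoder="MLP",
    resolution=(1440, 720),   # spatial resolution of the ERA5 fields
    patch_size=(8, 8),        # token (patch) size
    in_chans=20,              # number of input climate variables
    out_chans=20,             # number of predicted climate variables
    head_num=16,              # number of parallel heads
    head_depth=12,            # depth of each head
    embed_dim=768,            # embedding dimension l in Eq. (72)
    device=device,
)
```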

Similarly, a ViT-KNO whose decoder is a convolutional layer can be defined as the following:
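The convolutional-decoder counterpart can be sketched analogously, continuing the snippet above (same caveat on names).

```python
# Hypothetical ViT-KNO with a convolutional decoder; argument names remain assumptions.
vit_kno_conv = kp.model.koopman_vit(
    decoder="Conv2d",
    resolution=(1440, 720),
    patch_size=(8, 8),
    in_chans=20, out_chans=20,
    head_num=16, head_depth=12,
    embed_dim=768,
    device=device,
)
```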

Note that some detailed model parameters are not covered by the above code because they are highly coupled during computation or less important in our theoretical derivations. Users are advised to adjust them after loading the source code of ViT-KNO.

To validate the proposed ViT-KNO sub-family, we implement a large-scale experiment on ERA5, one of the largest high-resolution datasets of global-scale multi-variate climate fields.32 This dataset has been extensively applied in weather forecasting tasks (e.g., see FourCastNet23,24), ensuring the reproducibility and comparability of our results.

Twenty important climate variables are considered in our research, including mean sea level pressure (MSLP), relative humidity at 500 hPa (R500), relative humidity at 850 hPa (R850), surface pressure (SP), 2 m temperature (T2M), temperature at 500 hPa (T500), temperature at 850 hPa (T850), total column water vapor (TCWV), the 10 m u-component of wind (U10), the u-component of wind at 500 hPa (U500), the u-component of wind at 850 hPa (U850), the u-component of wind at 1000 hPa (U1000), the 10 m v-component of wind (V10), the v-component of wind at 500 hPa (V500), the v-component of wind at 850 hPa (V850), the v-component of wind at 1000 hPa (V1000), the geopotential at 50 hPa (Z50), the geopotential at 500 hPa (Z500), the geopotential at 850 hPa (Z850), and the geopotential at 1000 hPa (Z1000).

In a long-term prediction task, we test a ViT-KNO whose decoder is a multi-layer perceptron and whose encoder contains a head structure with a depth of 4 to function as the mapping Q in Eq. (72). This model is provided as a default choice of ViT-KNO in the KoopmanLab module. For comparison, we implement the official version of FourCastNet provided in Refs. 23 and 24. Limited by computing resources (FourCastNet is reported to be trained on 3808 NVIDIA A100 GPUs with numerous computational optimization tricks23), we are unable to reproduce the memory-consuming optimization tricks introduced in Refs. 23 and 24 (e.g., the two-stage training procedure). Therefore, we compare ViT-KNO and FourCastNet in a fair setting where no extra optimization trick is applied to the training of either model.

Given the samples of initial conditions, the implemented ViT-KNO and FourCastNet are required to predict the future states of all 20 climate variables after t ∈ {6, 12, 18, …, 240} h. The training dataset includes the samples recorded from 1979 to 2015. The validation dataset includes the samples recorded during 2016 and 2017. The test dataset includes the samples recorded in 2018. The spatial resolution of all samples is set as 1440 × 720. All data are pre-processed in a standard manner following Refs. 23, 24, and 32, where a Z-transform is applied to normalize all variables. The training of our ViT-KNO and FourCastNet is implemented in a multi-GPU environment with 128 GPUs of 16 GB each. The testing of the trained ViT-KNO and FourCastNet is implemented in a single 16-GB GPU environment.

Key model settings are summarized below. For ViT-KNO, the batch size is set as 64, the learning rate is 5 × 10^{-4} and updated by a cosine annealing approach, the patch size is set as 8 × 8, the number of heads is 16, the depth of heads is 12, the embedded dimension l in Eq. (72) is 768, the relative weights of prediction and reconstruction in the loss function are λ_p = 0.8 and λ_r = 0.2, and the number of kept low-frequency modes after the fast Fourier transform is f = 32. For FourCastNet, all model settings are adopted from the official version.23,24 The defined ViT-KNO model has 73 905 428 parameters in total, while the official version of FourCastNet23,24 has 74 691 840 parameters. The optimizer of both models is FusedAdam, an Adam optimizer61 that supports mixed-precision and distributed training. All 20 climate variables are learned and predicted together rather than separately. Note that land-form information is not provided to either model; both models are required to learn the climate variables with no additional information. The maximum number of available epochs for training is set as 150 to explore when these models converge, which costs about 46.25 h in our environment.

In Fig. 4, we visualize several instances of the climate fields predicted by ViT-KNO after 70 epochs of training, accompanied by the corresponding true values. High consistency can be seen between these ground truths and their predicted counterparts derived by ViT-KNO. Quantitatively, the prediction accuracy of each climate variable during testing is measured by the anomaly correlation coefficient (ACC) in Figs. 5 and 6, where we compare ViT-KNO and FourCastNet after 70 and 115 epochs of training, respectively. In the same prediction task reported by Refs. 23 and 24, the trained ViT-KNO significantly outperforms the baseline state-of-the-art deep learning model for weather forecasting proposed in Ref. 67. According to our observations in Figs. 5 and 6, ViT-KNO generally achieves a higher testing accuracy than FourCastNet. The advantage of ViT-KNO over FourCastNet is enlarged as the prediction duration increases. Moreover, the implemented FourCastNet seems to over-fit the training dataset as the training epoch number increases while ViT-KNO does not (one can find this phenomenon by comparing the model performances reported in Figs. 5 and 6). These results suggest the potential for ViT-KNO to become a competitive alternative to FourCastNet, especially in learning tasks involving long-term and non-linear dynamics.
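For reference, the anomaly correlation coefficient reported in Figs. 5 and 6 measures the spatial correlation between predicted and observed anomalies relative to a climatological mean. The following is a generic, unweighted sketch of this metric (the evaluation in Refs. 23 and 24 additionally applies latitude weighting).

```python
import torch

def acc(pred: torch.Tensor, truth: torch.Tensor, climatology: torch.Tensor) -> torch.Tensor:
    """Unweighted anomaly correlation coefficient over a spatial field.

    pred, truth, climatology: tensors of shape (lat, lon) for one variable at one lead time.
    """
    p, t = pred - climatology, truth - climatology          # anomalies w.r.t. climatology
    return (p * t).sum() / torch.sqrt((p ** 2).sum() * (t ** 2).sum())

# Toy usage with random fields: a near-perfect prediction yields an ACC close to 1.
field = torch.randn(720, 1440)
print(acc(field + 0.1 * torch.randn_like(field), field, torch.zeros_like(field)))
```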

FIG. 4.

Visualization of three instances of the experimental results on the ERA5 dataset after 70 epochs of training. (a)–(c), respectively, show the time-dependent predicted results of the global wind speed, U10, and V10 variables at selected moments, accompanied by corresponding ground truths. Note that wind speed is not originally included in the 20 climate variables selected from ERA5. It is calculated as wind speed = sqrt(U10^2 + V10^2).

FIG. 5.

Comparison between ViT-KNO and FourCastNet on the 20 variables of ERA5 dataset after 70 epochs of training. (a)–(t) Time-dependent prediction accuracy of each model is measured by the anomaly correlation coefficient (ACC). The colored area denotes the interval of accuracy whose boundaries are fractiles. The dashed line denotes the average accuracy.

FIG. 6.

Comparison between ViT-KNO and FourCastNet on the 20 variables of ERA5 dataset after 115 epochs of training. (a)–(t) Time-dependent prediction accuracy of each model is measured by the anomaly correlation coefficient (ACC). The colored area denotes the interval of accuracy whose boundaries are fractiles. The dashed line denotes the average accuracy.


Moreover, the time costs of a single prediction by ViT-KNO and FourCastNet are observed as ≃0.8519729 and ≃0.73430149 s, respectively, on a single 16-GB GPU. Compared with classic numerical weather forecasting systems (e.g., the ECMWF integrated forecasting system), whose prediction inevitably requires a multi-GPU environment (e.g., more than 1000 NVIDIA Selene nodes where each node consists of 8 NVIDIA A100 GPUs),68–70 ViT-KNO is orders of magnitude faster in application (e.g., the integrated forecasting system L91 18 km model is expected to cost about 9840 node seconds for prediction on an NVIDIA Selene node70). Therefore, our ViT-KNO has the potential to serve as a unit in ensemble weather forecasting frameworks to realize an efficient prediction of global weather. Although we cannot directly implement the ECMWF integrated forecasting system for comparison due to limited computing resources, one can refer to Refs. 23 and 24 for the comparison between FourCastNet and the ECMWF integrated forecasting system, which may provide an indirect verification of the competitive performance of the ViT-KNO model. More detailed comparisons remain for future explorations.

In this paper, we have presented KoopmanLab, an efficient module of the Koopman neural operator family, for solving partial differential equations. The models included in this module, such as the compact KNO sub-family and the ViT-KNO sub-family, are provided with mathematical foundations, computational designs, and validations in solving concrete PDEs or predicting intricate dynamic systems governed by unknown coupled PDEs. All models are suggested to be competitive with other state-of-the-art approaches in the corresponding tasks. Compared with classic numerical and neural-network-based PDE solvers, the proposed KNO variants can achieve significant acceleration, more robust mesh-independence, higher generalization capacity under changed conditions, more flexibility in characterizing latent PDEs with unknown forms, and a better balance between accuracy and efficiency. Therefore, we suggest the potential of KoopmanLab to be applied in diverse downstream tasks related to PDE solving. Users can download this module via https://github.com/Koopman-Laboratory/KoopmanLab.73

Several important questions remain for future studies. First, one may consider more specialized computational optimization of the models in KoopmanLab (e.g., consider the multi-stage training suggested in Refs. 23 and 24 or multi-objective balancing by the Pareto theory71,72). Second, one can explore a more detailed comparison between the ViT-KNO sub-family and the ECMWF Integrated Forecasting System under equivalent hardware conditions. Third, one can analyze the errors of our models caused by the potential continuous spectrum of the Koopman operator or the absence of the ergodic property in real cases.

Y.T. and P.S. appreciate the support from the Artificial and General Intelligence Research Program of Guo Qiang Research Institute at Tsinghua University (Grant No. 2020GQG1017), the Tsinghua University Initiative Scientific Research Program, and the Huawei Innovation Research Program (Grant No. TC20221109044). X.H. was supported by the National Natural Science Foundation of China (Grant No. 42125503) and the National Key Research and Development Program of China (Grant Nos. 2022YFE0195900, 2021YFC3101600, 2020YFA0608000, and 2020YFA0607900).

The authors have no conflicts to disclose.

Wei Xiong: Conceptualization (supporting); Data curation (lead); Investigation (equal); Methodology (supporting); Software (lead); Validation (lead); Visualization (supporting); Writing – review & editing (supporting). Muyuan Ma: Data curation (supporting); Investigation (supporting); Methodology (supporting); Software (equal); Validation (equal); Visualization (equal). Xiaomeng Huang: Data curation (equal); Project administration (supporting); Resources (equal); Validation (equal); Writing – review & editing (equal). Ziyang Zhang: Resources (supporting). Pei Sun: Conceptualization (equal); Funding acquisition (lead); Project administration (lead); Resources (equal); Supervision (equal); Writing – review & editing (equal). Yang Tian: Conceptualization (lead); Formal analysis (lead); Methodology (lead); Software (supporting); Supervision (supporting); Validation (lead); Writing – original draft (lead); Writing – review & editing (equal).

The code and data that support the findings of this study are available via https://github.com/Koopman-Laboratory/KoopmanLab.73 

1. M. S. Gockenbach, Partial Differential Equations: Analytical and Numerical Methods (SIAM, 2005), Vol. 122.
2. H. Tanabe, Functional Analytic Methods for Partial Differential Equations (CRC Press, 2017).
3. L. Debnath, Nonlinear Partial Differential Equations for Scientists and Engineers (Springer, 2005).
4. R. M. Mattheij, S. W. Rienstra, and J. T. T. Boonkkamp, Partial Differential Equations: Modeling, Analysis, Computation (SIAM, 2005).
5. Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Neural operator: Graph kernel network for partial differential equations,” arXiv:2003.03485 (2020).
6. J. N. Reddy, Introduction to the Finite Element Method (McGraw-Hill Education, 2019).
7. K. Lipnikov, G. Manzini, and M. Shashkov, “Mimetic finite difference method,” J. Comput. Phys. 257, 1163–1227 (2014).
8. R. Eymard, T. Gallouët, and R. Herbin, “Finite volume methods,” Handb. Numer. Anal. 7, 713–1018 (2000).
9. X. Guo, W. Li, and F. Iorio, “Convolutional neural networks for steady flow approximation,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Association for Computing Machinery, New York, 2016), pp. 481–490.
10. Y. Zhu and N. Zabaras, “Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification,” J. Comput. Phys. 366, 415–447 (2018).
11. S. Bhatnagar, Y. Afshar, S. Pan, K. Duraisamy, and S. Kaushik, “Prediction of aerodynamic flow fields using convolutional neural networks,” Comput. Mech. 64, 525–545 (2019).
12. W. E and B. Yu, “The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems,” Commun. Math. Stat. 6, 1–12 (2018).
13. M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” J. Comput. Phys. 378, 686–707 (2019).
14. L. Bar and N. Sochen, “Unsupervised deep learning algorithm for PDE-based forward and inverse problems,” arXiv:1904.05417 (2019).
15. S. Pan and K. Duraisamy, “Physics-informed probabilistic learning of linear embeddings of nonlinear dynamics with guaranteed stability,” SIAM J. Appl. Dyn. Syst. 19, 480–509 (2020).
16. L. Lu, P. Jin, and G. E. Karniadakis, “DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators,” arXiv:1910.03193 (2019).
17. K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart, “Model reduction and neural networks for parametric PDEs,” SIAM J. Comput. Math. 7, 121–157 (2021).
18. N. H. Nelsen and A. M. Stuart, “The random feature model for input-output maps between Banach spaces,” SIAM J. Sci. Comput. 43, A3212–A3243 (2021).
19. Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Fourier neural operator for parametric partial differential equations,” in International Conference on Learning Representations (OpenReview, 2021).
20. N. Kovachki, S. Lanthaler, and S. Mishra, “On universal approximation and error bounds for Fourier neural operators,” J. Mach. Learn. Res. 22(290), 1–76 (2021).
21. Z. Li, D. Z. Huang, B. Liu, and A. Anandkumar, “Fourier neural operator with learned deformations for PDEs on general geometries,” arXiv:2207.05209 (2022).
22. J. Guibas, M. Mardani, Z. Li, A. Tao, A. Anandkumar, and B. Catanzaro, “Efficient token mixing for transformers via adaptive Fourier neural operators,” in International Conference on Learning Representations
,
2021
.
23.
J.
Pathak
,
S.
Subramanian
,
P.
Harrington
,
S.
Raja
,
A.
Chattopadhyay
,
M.
Mardani
,
T.
Kurth
,
D.
Hall
,
Z.
Li
,
K.
Azizzadenesheli
et al, “
FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators
,” arXiv:2202.11214 (
2022
).
24.
T.
Kurth
,
S.
Subramanian
,
P.
Harrington
,
J.
Pathak
,
M.
Mardani
,
D.
Hall
,
A.
Miele
,
K.
Kashinath
, and
A.
Anandkumar
, “
FourCastNet: Accelerating global high-resolution weather forecasting using adaptive Fourier neural operators
,”
Proceedings of the Platform for Advanced Scientific Computing Conference
1
11
(
2023
).
25.
E.
Tadmor
, “
A review of numerical methods for nonlinear partial differential equations
,”
Bull. Am. Math. Soc.
49
,
507
554
(
2012
).
26.
D. J.
Stensrud
,
Parameterization Schemes: Keys to Understanding Numerical Weather Prediction Models
(
Cambridge University Press
,
2009
).
27.
P.
Bauer
,
A.
Thorpe
, and
G.
Brunet
, “
The quiet revolution of numerical weather prediction
,”
Nature
525
,
47
55
(
2015
).
28.
S. L.
Brunton
,
M.
Budisic
,
E.
Kaiser
, and
J. N.
Kutz
, “
Modern Koopman theory for dynamical systems
,”
SIAM Rev.
64
,
229
340
(
2022
).
29.
D.
Cao
,
Y.
Wang
,
J.
Duan
,
C.
Zhang
,
X.
Zhu
,
C.
Huang
,
Y.
Tong
,
B.
Xu
,
J.
Bai
,
J.
Tong
et al, “
Spectral temporal graph neural network for multivariate time-series forecasting
,” in
Advances in Neural Information Processing Systems (NeurIPS Foundation
, 2020), Vol. 33, pp.
17766
17778
.
30.
F.
Aminian
,
E. D.
Suarez
,
M.
Aminian
, and
D. T.
Walz
, “
Forecasting economic data with neural networks
,”
Comput. Econ.
28
,
71
88
(
2006
).
31.
W.
Xiong
,
X.
Huang
,
Z.
Zhang
,
R.
Deng
,
P.
Sun
, and
Y.
Tian
, “
Koopman neural operator as a mesh-free solver of non-linear partial differential equations
,” arXiv:2301.10022 (
2023
).
32.
H.
Hersbach
,
B.
Bell
,
P.
Berrisford
,
S.
Hirahara
,
A.
Horányi
,
J.
Muñoz-Sabater
,
J.
Nicolas
,
C.
Peubey
,
R.
Radu
,
D.
Schepers
et al, “
The ERA5 global reanalysis
,”
Q. J. R. Metereol. Soc.
146
,
1999
2049
(
2020
).
33.
R.
Taylor
,
J. N.
Kutz
,
K.
Morgan
, and
B. A.
Nelson
, “
Dynamic mode decomposition for plasma diagnostics and validation
,”
Rev. Sci. Instrum.
89
,
053501
(
2018
).
34.
C. W.
Rowley
,
I.
Mezić
,
S.
Bagheri
,
P.
Schlatter
, and
D. S.
Henningson
, “
Spectral analysis of nonlinear flows
,”
J. Fluid Mech.
641
,
115
127
(
2009
).
35.
I.
Abraham
and
T. D.
Murphey
, “
Active learning of dynamics for data-driven control using Koopman operators
,”
IEEE Trans. Rob.
35
,
1071
1083
(
2019
).
36.
B. W.
Brunton
,
L. A.
Johnson
,
J. G.
Ojemann
, and
J. N.
Kutz
, “
Extracting spatial–temporal coherent patterns in large-scale neural recordings using dynamic mode decomposition
,”
J. Neurosci. Methods
258
,
1
15
(
2016
).
37.
B. O.
Koopman
, “
Hamiltonian systems and transformation in Hilbert space
,”
Proc. Natl. Acad. Sci. U. S. A.
17
,
315
318
(
1931
).
38.
R.
Abraham
,
J. E.
Marsden
, and
T.
Ratiu
,
Manifolds, Tensor Analysis, and Applications
(
Springer Science & Business Media
,
2012
), Vol.
75
.
39.
C.
Chicone
and
Y.
Latushkin
,
Evolution Semigroups in Dynamical Systems and Differential Equations
(
American Mathematical Society
,
1999
), Vol.
70
.
40.
P. D.
Lax
, “
Integrals of nonlinear equations of evolution and solitary waves
,”
Commun. Pure Appl. Math.
21
,
467
490
(
1968
).
41.
J. P.
Parker
and
J.
Page
, “
Koopman analysis of isolated fronts and solitons
,”
SIAM J. Appl. Dyn. Syst.
19
,
2803
2828
(
2020
).
42.
H.
Nakao
and
I.
Mezić
, “
Spectral analysis of the Koopman operator for partial differential equations
,”
Chaos
30
,
113131
(
2020
).
43.
C.
Gin
,
B.
Lusch
,
S. L.
Brunton
, and
J. N.
Kutz
, “
Deep learning models for global coordinate transformations that linearise PDEs
,”
Eur. J. Appl. Math.
32
,
515
539
(
2021
).
44.
J.
Page
and
R. R.
Kerswell
, “
Koopman analysis of Burgers equation
,”
Phys. Rev. Fluids
3
,
071901
(
2018
).
45.
N.
Takeishi
,
Y.
Kawahara
, and
T.
Yairi
, “
Learning Koopman invariant subspaces for dynamic mode decomposition
,” in
Advances in Neural Information Processing Systems
,
30
(
NeurIPS Foundation
,
2017
).
46.
O.
Azencot
,
N. B.
Erichson
,
V.
Lin
, and
M.
Mahoney
, “
Forecasting sequential data using consistent Koopman autoencoders
,” in
International Conference on Machine Learning
(
PMLR
,
2020
), pp.
475
485
.
47.
S. E.
Otto
and
C. W.
Rowley
, “
Linearly recurrent autoencoder networks for learning dynamics
,”
SIAM J. Appl. Dyn. Syst.
18
,
558
593
(
2019
).
48.
D. J.
Alford-Lago
,
C. W.
Curtis
,
A. T.
Ihler
, and
O.
Issan
, “
Deep learning enhanced dynamic mode decomposition
,”
Chaos
32
,
033116
(
2022
).
49.
B.
Lusch
,
J. N.
Kutz
, and
S. L.
Brunton
, “
Deep learning for universal linear embeddings of nonlinear dynamics
,”
Nat. Commun.
9
,
4950
(
2018
).
50.
Y.
Saad
,
Numerical Methods for Large Eigenvalue Problems: Revised Edition
(
SIAM
,
2011
).
51.
H.
Arbabi
and
I.
Mezic
, “
Ergodic theory, dynamic mode decomposition, and computation of spectral properties of the Koopman operator
,”
SIAM J. Appl. Dyn. Syst.
16
,
2096
2126
(
2017
).
52.
N.
Črnjarić-Žic
,
S.
Maćešić
, and
I.
Mezić
, “
Koopman operator spectrum for random dynamical systems
,”
J. Nonlinear Sci.
30
,
2007
2056
(
2020
).
53.
S. L.
Brunton
,
B. W.
Brunton
,
J. L.
Proctor
,
E.
Kaiser
, and
J. N.
Kutz
, “
Chaos as an intermittently forced linear system
,”
Nat. Commun.
8
,
19
(
2017
).
54.
M.
Korda
and
I.
Mezić
, “
On convergence of extended dynamic mode decomposition to the Koopman operator
,”
J. Nonlinear Sci.
28
,
687
710
(
2018
).
55.
M.
Li
and
L.
Jiang
, “
Data-driven reduced-order modeling for nonautonomous dynamical systems in multiscale media
,” arXiv:2204.13180 (
2022
).
56.
I. P.
Cornfeld
,
S. V.
Fomin
, and
Y. G.
Sinai
,
Ergodic Theory
(
Springer Science & Business Media
,
2012
), Vol.
245
.
57.
N.
Park
and
S.
Kim
, “
How do vision transformers work?
,” in
International Conference on Learning Representations
(
OpenReview
,
2021
).
58.
C. Y.
Wang
, “
Exact solutions of the steady-state Navier-Stokes equations
,”
Annu. Rev. Fluid Mech.
23
,
159
177
(
1991
).
59.
E. R.
Benton
and
G. W.
Platzman
, “
A table of solutions of the one-dimensional burgers equation
,”
Q. Appl. Math.
30
,
195
212
(
1972
).
60.
T. K.
Sengupta
,
P.
Sundaram
,
A.
Sengupta
et al, “
Analysis of pseudo-spectral methods used for numerical simulation of turbulence
,” arXiv:2109.00255 (
2021
).
61.
D. P.
Kingma
and
J.
Ba
, “
Adam: A method for stochastic optimization
,” arXiv:1412.6980 (
2014
).
62.
Z.
Li
,
N.
Kovachki
,
K.
Azizzadenesheli
,
B.
Liu
,
A.
Stuart
,
K.
Bhattacharya
, and
A.
Anandkumar
, “
Multipole graph neural operator for parametric partial differential equations
,” in
Advances in Neural Information Processing Systems (NeurIPS Foundation
, 2020), Vol. 33, pp.
6755
6766
.
63.
A.
Dosovitskiy
,
L.
Beyer
,
A.
Kolesnikov
,
D.
Weissenborn
,
X.
Zhai
,
T.
Unterthiner
,
M.
Dehghani
,
M.
Minderer
,
G.
Heigold
,
S.
Gelly
et al, “
An image is worth 16 × 16 words: Transformers for image recognition at scale
,”
International Conference on Learning Representations (OpenReview
, 2020), (n.d.).
64.
J.
Guibas
,
M.
Mardani
,
Z.
Li
,
A.
Tao
,
A.
Anandkumar
, and
B.
Catanzaro
, “
Adaptive Fourier neural operators: Efficient token mixers for transformers
,” arXiv:2111.13587 (
2021
).
65.
A.
Radford
,
L.
Metz
, and
S.
Chintala
, “
Unsupervised representation learning with deep convolutional generative adversarial networks
,” arXiv:1511.06434 (
2015
).
66.
R.
Tibshirani
, “
Regression shrinkage and selection via the lasso
,”
J. R. Stat. Soc.: Ser. B
58
,
267
288
(
1996
).
67.
J. A.
Weyn
,
D. R.
Durran
,
R.
Caruana
, and
N.
Cresswell-Clay
, “
Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models
,”
J. Adv. Model. Earth Syst.
13
,
e2021MS002502
(
2021
).
68.
C. D.
Roberts
,
R.
Senan
,
F.
Molteni
,
S.
Boussetta
,
M.
Mayer
, and
S. P.
Keeley
, “
Climate model configurations of the ECMWF integrated forecasting system (ECMWF-IFS cycle 43r1) for HighResMIP
,”
Geosci. Model Dev.
11
,
3681
3712
(
2018
).
69.
P.
Lopez
, “
A lightning parameterization for the ECMWF integrated forecasting system
,”
Mon. Weather Rev.
144
,
3057
3075
(
2016
).
70.
P.
Bauer
,
T.
Quintino
,
N.
Wedi
,
A.
Bonanni
,
M.
Chrust
,
W.
Deconinck
,
M.
Diamantakis
,
P.
Düben
,
S.
English
,
J.
Flemming
et al,
The ECMWF Scalability Programme: Progress and Plans
(
European Centre for Medium Range Weather Forecasts
,
2020
).
71.
O.
Sener
and
V.
Koltun
, “
Multi-task learning as multi-objective optimization
,” in
Advances in Neural Information Processing Systems
,
31
(
NeurIPS Foundation
,
2018
).
72.
M.
Momma
,
C.
Dong
, and
J.
Liu
, “
A multi-objective/multi-task learning framework induced by Pareto stationarity
,” in
International Conference on Machine Learning
(
PMLR
,
2022
), pp.
15895
15907
.
73.
W.
Xiong
,
M.
Ma
,
X.
Huang
,
Z.
Zhang
,
P.
Sun
, and
Y.
Tian
(
2023
). “
KoopmanLab: A module of Koopman neural operator family for solving partial differential equations
,”
Github
. https://github.com/Koopman-Laboratory/KoopmanLab