Numerous physics theories are rooted in partial differential equations (PDEs). However, increasingly intricate physics equations, especially those lacking analytic solutions or closed forms, have impeded the further development of physics. Computationally solving PDEs with classic numerical approaches suffers from a trade-off between accuracy and efficiency and is not applicable to empirical data generated by unknown latent PDEs. To overcome this challenge, we present KoopmanLab, an efficient module of the Koopman neural operator (KNO) family, for learning PDEs without analytic solutions or closed forms. Our module consists of multiple variants of the KNO, a kind of mesh-independent, neural-network-based PDE solver developed following the dynamic system theory. The compact variants of the KNO can accurately solve PDEs with small model sizes, while the large variants of the KNO are more competitive in predicting highly complicated dynamic systems governed by unknown, high-dimensional, and non-linear PDEs. All variants are validated by mesh-independent and long-term prediction experiments implemented on representative PDEs (e.g., the Navier–Stokes equation and the Bateman–Burgers equation in fluid mechanics) and ERA5 (i.e., one of the largest high-resolution global-scale climate datasets in earth physics). These demonstrations suggest the potential of KoopmanLab to become a fundamental tool in diverse physics studies related to equations or dynamic systems.

## I. INTRODUCTION

### A. The rise of partial differential equation solvers

Consider the problem of deriving a solution operator **F** that relates $\Phi = \Phi(D; \mathbb{R}^{d_\varphi})$, a Banach space of inputs (i.e., initial values), with $\Gamma = \Gamma(D; \mathbb{R}^{d_\gamma})$, a Banach space of solutions (i.e., target values), for a typically time-dependent PDE family defined on a bounded open set $D \subset \mathbb{R}^d$, whose differential operator $\mathbf{L}_\varphi$ is characterized by a parameter *ϕ*. Mapping $\kappa(\cdot)$ is a function that lives in a function space determined by $\mathbf{L}_\varphi$. Mapping $\gamma(\cdot)$ is the solution of the PDE family that we attempt to obtain. The boundary and initial conditions are denoted by *γ*_{B} and *γ*_{I}, respectively. Mathematically, deriving an accurate solution operator $\mathbf{F}: (\varphi, \gamma_B, \gamma_I) \mapsto \gamma$ is the key step to obtain the PDE solution $\gamma(\cdot)$. However, even in the case where the boundary and initial conditions are constant [i.e., the solution operator $\mathbf{F}: (\varphi, \gamma_B, \gamma_I) \mapsto \gamma$ reduces to $\mathbf{F}: \varphi \mapsto \gamma$], deriving an analytic expression of the solution operator **F** can be highly non-trivial.^{1,2}

The absence of analytic solutions of various important PDEs in science and engineering naturally calls for the rapid development of computational solvers, which attempt to approximate a parametric counterpart **F**_{θ} ≃ **F** parameterized by *θ* to derive solution $\gamma \u22c5$.^{1,3,4} To date, the joint efforts of physics, mathematics, and computer science have given birth to two mainstream families of PDE solvers:^{5}

The first family of solvers are classic numerical ones. Typical instances of these solvers include the finite element (FEM),^{6} finite difference (FDM),^{7} and finite volume (FVM)^{8} methods. In general, these methods discretize space and time domains following specific mesh designs and solve parameterized PDEs on meshes by certain iterative algorithms. Specifically, FEM subdivides the original domain into a set of sub-domains defined by a collection of element equations and recombines these element equations to derive the global solution.^{6} FDM approximates derivatives as finite differences measured on local values.^{7} FVM transforms the original problem into a series of surface flux calculations on local volumes.^{8}

The second family of solvers are neural-network-based ones. In pursuit of accelerating PDE solving and improving applicability to real data, three kinds of neural-network-based solvers have been proposed:

(a) The first kind of solvers discretize domains *D* and *T* into *x* and *y* meshes and approximate a finite-dimensional, mesh-dependent solution operator **F**_{θ} by a parameterized neural network between finite Euclidean spaces, i.e., $F_\theta: \mathbb{R}^x \times \mathbb{R}^y \times \Theta \to \mathbb{R}^x \times \mathbb{R}^y$ (e.g., see Refs. 9–11). Given an arbitrary input $\gamma(x_t)$, the trained neural network can function as a solution operator to predict $\gamma(x_{t+\tau}) = F_\theta(\gamma(x_t))$ for a certain time difference *τ*.

(b) The second kind of solvers directly parameterize the equation solution $\gamma(\cdot)$ as a neural network, i.e., $F_\theta: D \times T \times \Theta \to \mathbb{R}$ (e.g., see Refs. 12–15). These solvers are mesh-independent and accurate in learning a given PDE because they directly transform arbitrary domain and parameter settings to the target equation solution $\gamma(\cdot)$.

(c) The last kind of solvers, including neural operators, attempt to parameterize a mesh-independent and infinite-dimensional solution operator with neural networks, i.e., **F**_{θ}: Φ × Θ → Γ (e.g., see Refs. 5 and 16–21). These mesh-independent solvers can be flexibly implemented on different discretization schemes and only need to be trained once for a given PDE family. The equation solution $\gamma(\cdot)$ of different instances of the PDE family can be generated by a computationally reusable forward pass of the network,^{5,19} which can be further accelerated by the fast Fourier transform.^{19} Representative demonstrations of this kind of solver are the Fourier neural operator^{19} and its variants (e.g., the adaptive Fourier neural operator^{22} and FourCastNet^{23,24}). These frameworks not only solve PDEs with known expressions but are also able to predict complex dynamic systems governed by unknown PDEs on real datasets (e.g., climate systems^{23,24}).

### B. The limitation of previous partial differential equation solvers

Although substantial progress has been accomplished by existing PDE solvers from various perspectives, there remain critical challenges in this booming direction.

In practice, the mesh-dependent property of classic numerical solvers implies an inevitable trade-off between computational accuracy and efficiency, i.e., fine-grained meshes ensure accuracy, yet coarse-grained meshes are favorable for efficiency.^{19,25} However, in many cases, the applications of PDE solving (e.g., numerical weather forecasting^{26,27}) require timely and accurate computation. To ensure both accuracy and speed, every single computation in the downstream applications supported by classic numerical solvers frequently costs large amounts of computing resources. In cases with limited computing power, a significant time delay may occur. Moreover, all numerical solvers require the explicit definition of the target PDEs as *a priori* knowledge and are less applicable to predicting real data generated by unknown PDEs.^{19}
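The accuracy–efficiency trade-off of mesh refinement can be made concrete with a toy classic solver (our illustration, not part of KoopmanLab): an explicit finite-difference scheme for the 1D heat equation run on a coarse and a fine periodic mesh, compared against the analytic decay of a sine mode.

```python
# Toy illustration of the mesh trade-off in classic solvers: explicit FDM
# for u_t = nu * u_xx with periodic boundaries; a finer mesh is more
# accurate but costs proportionally more work per time step.
import numpy as np

def heat_fdm(n, nu=0.01, dt=1e-4, steps=1000):
    """March the heat equation on an n-point mesh; return the final error
    against the analytic decay exp(-nu*(2*pi)**2*T) of the sine mode."""
    x = np.linspace(0.0, 1.0, n, endpoint=False)
    u, dx = np.sin(2 * np.pi * x), 1.0 / n
    for _ in range(steps):
        # second-order central difference for u_xx, explicit Euler in time
        u = u + dt * nu * (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
    T = dt * steps
    exact = np.exp(-nu * (2 * np.pi) ** 2 * T) * np.sin(2 * np.pi * x)
    return float(np.max(np.abs(u - exact)))

coarse_err, fine_err = heat_fdm(16), heat_fdm(128)
print(fine_err < coarse_err)  # finer mesh, smaller error, ~8x the work per step
```

Both runs are stable (the diffusion number `dt*nu/dx**2` stays well below 1/2), so the error gap is purely the spatial discretization error that mesh refinement buys down.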

Type (b) solvers are limited to learning a concrete instance of the PDE rather than the entire family; consequently, they require restarted training given a different instance and cannot handle data generated by unknown PDEs.^{5}

Although type (c) solvers can learn an entire PDE family in a mesh-independent manner,^{5,19} they may face challenges in characterizing the long-term behavior of the equation solution $\gamma(\cdot)$. To understand these challenges, let us consider the iterative update strategy of neural operators for any *x*_{t} ∈ *D* × {*t*},^{5}

(4) $\hat\gamma(x_{t+\epsilon}) = \eta\left(W\hat\gamma(x_t) + \int_{D\times\{t\}} \kappa_\theta(x_t, y_t)\,\hat\gamma(y_t)\,\mathrm{d}y_t\right),$

where *η* denotes a non-linear activation function, *W* denotes a linear transform parameterized by *θ*, and mapping $\hat\gamma: D\times T\to\mathbb{R}^{d_{\hat\gamma}}$ denotes the parameterized counterpart of equation solution *γ* generated by the neural network (e.g., by embedding).^{5} In Eq. (4), the integral term associated with *κ*_{θ} defines a kernel integral operator to parameterize the Green function $J_\varphi: D\times T\times D\times T\to\mathbb{R}$,

(5) $\int_{D\times\{t\}} \kappa_\theta(x_t, y_t)\,\hat\gamma(y_t)\,\mathrm{d}y_t \simeq \int_{D\times\{t\}} J_\varphi(x_t, y_t)\,\hat\gamma(y_t)\,\mathrm{d}y_t,$

which is characterized by *ϕ* as well. One can see a similar form of Eq. (5) in Ref. 5. Computationally, the iteration of Eq. (4) can be significantly accelerated by the Fourier transform, which leads to the well-known Fourier neural operator.^{19}

However, Eq. (4) characterizes the update locally, while the real dynamics of equation solution *γ* acts on all elements in set *D* × {*t*}. Mathematically, the dynamics is defined by an evolution mapping $\zeta(\cdot,\cdot)$ of the system state, and the update strategy in Eq. (4) is expected to approximate the flow parameterized by *θ* associated with $\zeta(\cdot,\cdot)$ according to the modern dynamic system theory.^{28} When the latent PDE is non-linear, even a small time difference *ɛ* may correspond to a highly non-trivial evolution process of $\gamma(\cdot)$, making $\hat\gamma(x_{t+\epsilon})$ less predictable during iterative updating and reducing the precision of Eq. (4) significantly. This phenomenon inevitably impedes the accurate prediction of the long-term dynamics (i.e., *ɛ* → *∞*) of diverse non-linear PDE families (e.g., those in epidemic prevention,^{29} economic modeling,^{30} and weather forecasting^{23,24}). To overcome this obstacle, existing models are forced to improve accuracy at the cost of efficiency.
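The Fourier acceleration of the kernel integral mentioned above can be illustrated in a few lines (our toy example, not from KoopmanLab): for a translation-invariant kernel on a periodic mesh, the integral operator is a circular convolution, so the convolution theorem replaces an $O(n^2)$ quadrature with an $O(n\log n)$ pointwise product of Fourier coefficients.

```python
# Why the Fourier transform accelerates the kernel integral: for a
# translation-invariant kernel kappa(x - y), the integral over the mesh is a
# circular convolution, i.e., a pointwise product in Fourier space.
import numpy as np

n = 256
x = np.linspace(0.0, 1.0, n, endpoint=False)
gamma = np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)  # a toy state
kernel = np.exp(-(np.minimum(x, 1 - x) ** 2) / 0.01)         # periodic kernel

# direct quadrature of the integral operator: O(n^2)
direct = np.array(
    [np.sum(kernel[(i - np.arange(n)) % n] * gamma) / n for i in range(n)]
)

# spectral evaluation via the convolution theorem: O(n log n)
spectral = np.fft.ifft(np.fft.fft(kernel) * np.fft.fft(gamma)).real / n

print(np.allclose(direct, spectral, atol=1e-10))
```

Truncating the retained Fourier modes, as the neural operators above do, further shrinks the learned weights from $n$ kernel values to a handful of low-frequency coefficients.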

### C. Our contributions to partial differential equation solvers

In this paper, we build on the Koopman neural operator (KNO), one of our latest works,^{31} to develop an efficient module for PDE solving and overcome the limitation in characterizing the long-term behaviors of complicated PDE families. As a study on computational physics programs, our research makes the following contributions compared with our previous work.^{31}

First, we generalize the original KNO to four kinds of variants. Beyond the original KNO, these differentiated variants offer more possibilities for data-specific and task-oriented solver designs. Specifically, the compact variants of the KNO, realized by multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs), can accurately solve PDEs with small model sizes. The large variants of the KNO, implemented with vision transformers, can predict highly intricate dynamic systems governed by unknown, high-dimensional, and non-linear PDEs (e.g., climate systems).

Second, we propose KoopmanLab, a PyTorch module of the Koopman neural operator family, as a self-contained and user-friendly platform for PDE solving. All necessary tools, such as those for data loading, model construction, parameter manipulation, output visualization, and performance quantification, are offered to support customized applications.

Third, we offer comprehensive validation of the proposed module on representative datasets, including those generated by important PDEs in fluid mechanics (e.g., the Navier–Stokes equation and the Bateman–Burgers equation) or obtained by global meteorological recording research (e.g., atmospheric, land, and oceanic climate fields in the ERA5 dataset).^{32} By measuring accuracy, quantifying efficiency, and comparing all KNO variants with other state-of-the-art alternatives (e.g., the Fourier neural operator^{19} and FourCastNet^{23,24}), we suggest the potential of our module to serve as an ideal choice for PDE solving and dynamic system prediction.

## II. THE INITIAL VERSION OF KOOPMAN NEURAL OPERATOR

### A. The original Koopman neural operator: Objective

The objective of the original KNO is to characterize the dynamics of the equation solution by a linear operator **A**. In the modern dynamic system theory,^{28} this pursuit may be achieved if we can develop an approach to characterize the Koopman operator **K**, an infinite-dimensional linear operator governing all possible observations of the dynamic system of equation solution $\gamma(\cdot)$. Acting on the flow mapping associated with *θ*, the Koopman operator linearizes the dynamics of $\gamma(\cdot)$ in an appropriate observation space. This idea has been extensively applied in plasma physics,^{33} fluid dynamics,^{34} robot kinetics,^{35} and neuroscience.^{36}

Following Refs. 28, 37–39, and 40, the Koopman operator can be related to a pair of linear operators **M** and **N**, where **M** is defined with the *n*th total derivative operator and *I* is an identity operator. Equation (18) in this formulation denotes an eigenvalue problem at moment *t*. A relationship between the linear operators **M** and **N** can be identified if we calculate the time derivative of Eq. (18), which relates **N** and $K_t^{t+\epsilon}$.^{40} Note that similar ideas have been comprehensively explored in mathematics and physics.^{41–44} In computational practice, one works with a restriction of **K** that acts on $\mathcal{K} = \mathrm{span}(\hat G)$, a finite invariant sub-space spanned by $\hat G = \{g_1, \ldots, g_r\} \subset G(\mathbb{R}^{d_\gamma} \times T)$, whenever **K** can span a finite invariant sub-space.
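The notion of a finite invariant sub-space can be checked numerically on a textbook system (our construction, not from the paper): for the non-linear map $x_1 \mapsto a x_1$, $x_2 \mapsto b x_2 + c x_1^2$, the observables $\hat G = \{x_1, x_2, x_1^2\}$ evolve exactly linearly, so the restricted Koopman operator is a 3 × 3 matrix.

```python
# Worked example of a finite Koopman-invariant sub-space: on span{x1, x2,
# x1**2}, the non-linear map below acts as an exact 3x3 linear operator.
import numpy as np

a, b, c = 0.9, 0.5, 0.2

def step(x):
    """One step of the non-linear dynamics."""
    return np.array([a * x[0], b * x[1] + c * x[0] ** 2])

def g(x):
    """Observation function spanning the invariant sub-space."""
    return np.array([x[0], x[1], x[0] ** 2])

# restricted Koopman operator on span{x1, x2, x1^2}:
# x1 -> a*x1, x2 -> b*x2 + c*(x1^2), x1^2 -> a^2*(x1^2)
K = np.array([[a, 0.0, 0.0],
              [0.0, b, c],
              [0.0, 0.0, a ** 2]])

x = np.array([1.3, -0.7])
lhs = g(step(x))  # observe after evolving non-linearly
rhs = K @ g(x)    # evolve linearly in observation space
print(np.allclose(lhs, rhs))
```

Observing after evolving and evolving the observations linearly agree exactly here; a continuous spectrum (discussed below for the general case) is precisely the situation where no such finite set of observables exists.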

### B. The original Koopman neural operator: Mathematics

Numerous previous studies seek to characterize the Koopman operator by machine-learning-based approaches. Some approaches are highly practical but limited to the autonomous case [e.g., the case in Eq. (12)].^{45–48} Other approaches are more general in application but critically depend on *a priori* knowledge about the eigenvalue spectrum (e.g., the numbers of real and complex eigenvalues) of the Koopman operator to deal with the continuous spectrum problem.^{49}

In practice, a balance should be reached between mathematical completeness and computational practicality. An ideal Koopman-operator-based PDE solver should accommodate both autonomous and non-autonomous cases and limit the dependence on *a priori* knowledge as much as possible (even though these restraints inevitably reduce mathematical completeness). To explore such a balance, we introduced the Koopman neural operator (KNO), a flexible approach, in our previous work.^{31}

The KNO builds on the Krylov sequence^{50} of the observable defined by a unit time step $\epsilon \in (0, \infty)$, a construction used in the Krylov subspace approach for computing the eigenvalues of large matrices.^{50} One can see its application in Koopman-operator-related algorithms such as Hankel-DMD,^{51} sHankel-DMD,^{52} and HAVOK.^{53} Specifically, the Krylov sequence is given as

$H_m = \left[g(\gamma_0), \mathbf{K}g(\gamma_0), \ldots, \mathbf{K}^{m-1}g(\gamma_0)\right] = \left[g(\gamma_0), g(\gamma_\epsilon), \ldots, g(\gamma_{(m-1)\epsilon})\right],$

i.e., a sequence of *m* observations defined by **K** and $g(\gamma_0)$ for *t* ∈ *T*. Mathematically, matrix $\hat K$ is required to satisfy the Galerkin projection relation, i.e., its projection error on the sampled sub-space vanishes.^{54,55} If the target Koopman operator is bounded and $H_{m,n}$ spans its invariant subspace, the approximation can be realized by

$\hat K = \arg\min_{K}\int \left\Vert K\,g(\gamma_t) - g(\gamma_{t+\epsilon})\right\Vert_F \,\mathrm{d}\mu,$

where *μ* is a measure on $G(\mathbb{R}^{d_\gamma} \times T)$ and ‖·‖_{F} denotes the Frobenius norm. Once a restricted Koopman operator is derived, we can obtain the iterative dynamics $g(\gamma_{t+\epsilon}) = \hat K\,g(\gamma_t)$, which can be applied column-wise to **H**_{m×n}, where **H**_{m×n}[*k*] denotes the *k*th column of **H**_{m×n}.

As for the case where the Koopman operator has a continuous spectrum (e.g., has no eigenvalue), there is no finite invariant subspace to support computational approximation. Such an ill-posed situation remains for future exploration.
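When a finite invariant subspace does exist, the least-squares restriction above reduces to a DMD-style computation. The following sketch (our illustration, with a synthetic ground-truth operator) stacks a Krylov sequence of snapshots, solves the least-squares problem, and recovers the spectrum.

```python
# Minimal DMD-style estimation of the restricted Koopman operator: collect
# snapshot pairs from a Krylov sequence, solve K_hat = argmin ||K P - F||_F
# in closed form via the pseudo-inverse, and check the recovered eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
K_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])  # ground-truth linear evolution in G

# Krylov sequences g(γ_0), K g(γ_0), K^2 g(γ_0), ... from 16 trajectories
g0 = rng.standard_normal((2, 16))
snapshots = [g0]
for _ in range(20):
    snapshots.append(K_true @ snapshots[-1])

past = np.hstack(snapshots[:-1])   # columns g(γ_t)
future = np.hstack(snapshots[1:])  # columns g(γ_{t+ε})

# least-squares (Galerkin-type) restriction of the Koopman operator
K_hat = future @ np.linalg.pinv(past)

eigs = np.sort(np.linalg.eigvals(K_hat).real)
print(np.allclose(eigs, [0.8, 0.9]))
```

In the KNO, this closed-form fit is replaced by an $o \times o$ linear layer trained jointly with the encoder and decoder, but the role of the layer is exactly that of `K_hat` here.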

The definition of the expected Koopman operator relies on the ergodic property^{51,56} of the dynamic system of *γ*_{t}, i.e., *γ*_{t} ultimately visits every possible system state as *t* → *∞*. Under this assumption, the proportion of retention time of *γ*_{t} at a certain system state is equivalent to the probability of this state in the state space, making time-averaging equivalent to the actual expectation at the limit of infinite time. Based on this property, we can define an expectation of the restricted Koopman operator associated with *ɛ*: given *ɛ*, the expected Koopman operator $\bar K_\epsilon: G(\mathbb{R}^{d_\gamma} \times T) \to \mathcal{K}$ is a time-average of $\hat K_t^{t+\epsilon}$ that can be learned during offline optimization.

### C. The original Koopman neural operator: Convergence

*m*→

*∞*, we can ensure the convergence of the eigenvalues and eigenfunctions of $K\u0302$ to those of

**K**under the assumption of ergodic property. Similar conclusions can be seen in Ref. 54. To understand this convergence, we need to indicate several important properties. First, as we have mentioned, there exists an equivalence relation between time-averaging and the real expectation as the time approaches to infinity (i.e., the Birkhoff ergodic theorem

^{51,56})

**V**associated with the inner product space $K$,

**H**

_{m×n}serves as the sampling of

**R**

_{n}. Meanwhile, the left side of Eq. (35) coincides with the empirical Gramian matrix $V\u0302$ associated with matrix

**H**

_{m×n},

*r*<

*n*element of

**R**

_{n},

**C**denotes the coordinate of $K\u0304\epsilon rg\gamma 0$ defined by the basis, which should be calculated as

*k*-order principal minors of the corresponding matrix. Now, let us consider the characteristic polynomials of $C\u0302$ and

**C**,

*m*→

*∞*can be measured as

**C**share the same set of eigenvalues at the limit of

*m*→

*∞*since their characteristic polynomials converge to the same. Moreover, the convergence of $C\u0302$ to

**C**and the convergence of the eigenvalues $C\u0302$ to those of

**C**eventually imply the convergence of the eigenfunctions.

### D. The original Koopman neural operator: Computation

In Ref. 31, we have proposed an architecture to implement the original Koopman neural operator on neural networks. The details of architecture designs are presented below:

**Part 1: Observation.** An encoder [e.g., a single non-linear layer with $\tanh(\cdot)$ activation function in the original Koopman neural operator] serves as observation function $g(\cdot)$ to transform $\varphi_t = \varphi(D \times \{t\})$, an arbitrary input of the PDE family (e.g., *ϕ*_{t} can be directly set as *γ*_{t}), into $g(\hat\gamma_t) \in G(\mathbb{R}^{d_{\hat\gamma}} \times T)$:

(52) $g(\hat\gamma_t) = \mathrm{Encoder}(\varphi_t), \quad \forall t \in T.$

See Fig. 1 for illustrations.

**Part 2: Fourier transform.** Similar to the Fourier neural operator,^{19} the original Koopman neural operator also utilizes the Fourier transform during the iterative update of the Green function in Eq. (5). Given $g(\hat\gamma_t)$, we derive the Fourier transform $g_F(\cdot) = \mathcal{F} \circ g(\cdot)$, where we truncate the Fourier series at *ω*, a maximum frequency:

(53) $g_F(\xi) = \chi_{[0,\omega]}(\xi) \int_{D \times \{t\}} g(\hat\gamma(x_t)) \exp(-2\pi i \langle x_t, \xi \rangle)\,\mathrm{d}x_t.$

Note that $\chi_\cdot(\cdot)$ denotes the indicator function [i.e., $\chi_A(a) = 1$ if *a* ∈ *A*; otherwise, $\chi_A(a) = 0$]. Computationally, the above transform is implemented by the fast Fourier transform. For convenience, we mark

(54) $g_F(\hat\gamma_t) = \mathcal{F} \circ g(\hat\gamma_t) = \{g_F(\xi)\,|\,\xi \in [0, \infty)\}$

as the transformed result of $\hat\gamma_t$. Different from Ref. 19, our main motivation for using the truncated Fourier transform is to extract the low-frequency information (i.e., main system components) of the represented equation solution $g(\hat\gamma_t)$. Certainly, frequency truncation inevitably causes the loss of high-frequency information (i.e., high-frequency perturbations or edges). In the original Koopman neural operator, **Part 5** is designed to complement the lost high-frequency information. See Fig. 1 for more details.

**Part 3: Hankel representation and offline Koopman operator.** Once we have derived $g_F(\hat\gamma_t)$ for every $t \in \epsilon\mathbb{N}^+$, a Hankel matrix $\hat H_{m \times n}$ of $g_F(\hat\gamma_t)$ is generated following $m \in \mathbb{N}$, a dimension of delay-embedding (note that $n \in \mathbb{N}$ is the number of all accessible samples):

(55) $\hat H_{m \times n} = \begin{bmatrix} g_F(\hat\gamma_0) & g_F(\hat\gamma_\epsilon) & \cdots & g_F(\hat\gamma_{n\epsilon}) \\ \vdots & \vdots & \ddots & \vdots \\ g_F(\hat\gamma_{(m-1)\epsilon}) & g_F(\hat\gamma_{m\epsilon}) & \cdots & g_F(\hat\gamma_{(m+n-1)\epsilon}) \end{bmatrix}.$

We train an *o* × *o* linear layer to learn a neural network representation of the Koopman operator $\bar K_\epsilon: G(\mathbb{R}^{d_{\hat\gamma}} \times T) \to \hat{\mathcal{K}}$ following Eqs. (32) and (33), where $\hat{\mathcal{K}}$ is spanned by $\hat H_{m \times n}$. The learned $\bar K_\epsilon$ can be used to predict the future state of $g_F(\hat\gamma_{(m+n-1)\epsilon})$ as

(56) $g_F(\hat\gamma_{(m+n+r-1)\epsilon}) = \bar K_\epsilon^r \big(\hat H_{m \times n}[n]\big)^{\mathbf{T}}, \quad r \in \mathbb{N}^+,$

where notion **T** denotes the transpose of a matrix and $\hat H_{m \times n}[n]$ denotes the *n*th column of $\hat H_{m \times n}$. See Fig. 1 for illustrations.

**Part 4: Inverse Fourier transform.** After $g_F(\hat\gamma_{(m+n+r-1)\epsilon})$ is predicted in **Part 3**, it is transformed from the Fourier space to $G(\mathbb{R}^{d_{\hat\gamma}} \times T)$ by an inverse Fourier transform:

(57) $g(\hat\gamma(x_t)) = \frac{1}{(2\pi)^{d_{\hat\gamma}}} \int_{-\infty}^{\infty} g_F(\xi) \exp(2\pi i \langle x_t, \xi \rangle)\,\mathrm{d}\xi,$

where $t = (m+n+r-1)\epsilon$. For convenience, we mark

(58) $g(\hat\gamma_{(m+n+r-1)\epsilon}) = \mathcal{F}^{-1} \circ g_F(\hat\gamma_{(m+n+r-1)\epsilon}).$

See Fig. 1 for instances of **Part 4**.

**Part 5: High-frequency information complement.** In the original Koopman neural operator, we use a convolutional layer to extract the high-frequency information of $g(\hat\gamma_t)$ because convolutional layers can amplify high-frequency components according to Fourier analysis.^{57} Therefore, we train a convolutional layer **C** on the outputs of **Part 1** to extract their high-frequency information. As a complement to **Parts 2–4**, the convolutional layer realizes a forward prediction of high-frequency information:

(59) $\big[g_C(\hat\gamma_{(j+r-1)\epsilon}), \ldots, g_C(\hat\gamma_{(j+m+r-1)\epsilon})\big]^{\mathbf{T}} = C\big(\big[g(\hat\gamma_{j\epsilon}), \ldots, g(\hat\gamma_{(j+m)\epsilon})\big]^{\mathbf{T}}\big), \quad \forall j = 1, \ldots, n.$

See Fig. 1 for illustrations.

**Part 6: Inverse observation.** Once two future states, $g(\hat\gamma_{(m+n+r-1)\epsilon})$ and $g_C(\hat\gamma_{(m+n+r-1)\epsilon})$, are predicted by **Parts 2–4** and **Part 5**, they are unified in a linear manner:

(60) $g_U(\hat\gamma_{(m+n+r-1)\epsilon}) = g(\hat\gamma_{(m+n+r-1)\epsilon}) + g_C(\hat\gamma_{(m+n+r-1)\epsilon}).$

Given $g_U(\hat\gamma_{(m+n+r-1)\epsilon})$, a non-linear decoder [e.g., a single non-linear layer with $\tanh(\cdot)$ activation function in the original Koopman neural operator] is trained to approximate the inverse of the observation function,

(61) $g^{-1}(\cdot) \simeq g_U^{-1}(\cdot),$

and derive

(62) $\hat\gamma_{(m+n+r-1)\epsilon} = \mathrm{Decoder}\big(g_U(\hat\gamma_{(m+n+r-1)\epsilon})\big)$

as the target state of the equation solution in space $\mathbb{R}^{d_{\hat\gamma}}$. See Fig. 1 for illustrations.

**Parts 1–6** define the iterative update strategy of Eq. (23). For any $t' > t \in \epsilon\mathbb{N}$, the iterative dynamics is realized by applying the learned Koopman operator $r = \frac{t'-t}{\epsilon}$ times in the observation space before decoding.

The one-unit architecture of the original Koopman neural operator is visualized in Fig. 1. A multi-unit architecture can be readily constructed by cascading the copy of **Parts 2–5** multiple times.

## III. THE KOOPMAN NEURAL OPERATOR FAMILY

Beyond the original Koopman neural operator (KNO),^{31} we generalize it to four kinds of variants to fit different application demands.

### A. The compact KNO sub-family: Definition

The compact KNO sub-family is derived from the original KNO by varying the encoder and decoder designs, i.e., $\mathrm{Encoder} \in \{\eta \circ W_e,\ \eta \circ C_e\}$ and $\mathrm{Decoder} \in \{\eta \circ W_d,\ \eta \circ C_d\}$, where *η* denotes a non-linear activation function [e.g., we use $\tanh(\cdot)$ in our research], notions **W**_{e} and **W**_{d} are two weight matrices of the corresponding sizes, and **C**_{e} and **C**_{d} are two convolutional layers. The multi-layer perceptron choice defines the MLP-KNO, and the convolutional choice defines the CNN–KNO. Our motivation for proposing different types of encoders and decoders is to enable the defined model to fit diverse application demands. In general, a multi-layer perceptron encoder or decoder can be efficiently and flexibly implemented on any shape of data, while a convolutional encoder or decoder may be better at extracting the local information (e.g., boundaries) of high-resolution data.

Our proposed KoopmanLab module offers user-friendly tools to customize an MLP-KNO. A customized instance is presented below.
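The exact KoopmanLab call signature is not reproduced here. As a stand-in, the following self-contained sketch (all class and parameter names are hypothetical, and the per-mode structure of the real operator is simplified) shows what an MLP-KNO with operator size *o*, frequency mode number *f*, and operator power *r* computes per Eqs. (52)–(62):

```python
# Hypothetical, self-contained sketch of an MLP-KNO forward pass (NOT the
# KoopmanLab API): tanh encoder -> truncated FFT -> r applications of a
# learned Koopman matrix -> inverse FFT -> conv high-frequency complement
# -> tanh decoder. In KoopmanLab the Koopman operator is an o x o linear
# layer; here a small f x f matrix over the retained modes stands in for it.
import numpy as np

class MLPKNOSketch:
    def __init__(self, n_grid, o=32, f=10, r=12, seed=0):
        rng = np.random.default_rng(seed)
        self.f, self.r = f, r
        self.W_e = rng.standard_normal((o, n_grid)) * 0.01  # encoder weights
        self.W_d = rng.standard_normal((n_grid, o)) * 0.01  # decoder weights
        self.K = np.eye(f, dtype=complex)                   # "learned" Koopman operator
        self.c = np.array([0.25, 0.5, 0.25])                # conv complement kernel

    def forward(self, phi):
        g = np.tanh(self.W_e @ phi)                         # Part 1: observation
        gf = np.fft.rfft(g)[: self.f]                       # Part 2: truncate at f modes
        gf = np.linalg.matrix_power(self.K, self.r) @ gf    # Part 3: apply K^r
        full = np.zeros(len(g) // 2 + 1, dtype=complex)
        full[: self.f] = gf
        low = np.fft.irfft(full, n=len(g))                  # Part 4: inverse transform
        high = np.convolve(g, self.c, mode="same")          # Part 5: high-freq path
        return np.tanh(self.W_d @ (low + high))             # Part 6: unify and decode

model = MLPKNOSketch(n_grid=128, o=32, f=10, r=12)
gamma_t = np.sin(2 * np.pi * np.linspace(0, 1, 128, endpoint=False))
gamma_next = model.forward(gamma_t)
print(gamma_next.shape)
```

The hyper-parameters $(o, f, r)$ mirror the operator size, frequency mode number, and operator power varied in the experiments of Sec. III B.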

Similarly, a CNN–KNO can be customized using the following code, where we present one-dimensional and two-dimensional versions.

### B. The compact KNO sub-family: Validation

To validate the proposed compact KNO sub-family in PDE solving tasks, we design mesh-independent and long-term prediction tasks on representative PDEs (e.g., the two-dimensional Navier–Stokes equation^{58} and the one-dimensional Bateman–Burgers equation^{59}). The numerical datasets of these two equations are provided by Ref. 19.

For the two-dimensional Navier–Stokes equation, the viscosity coefficient is set as *ν* ∈ {10^{−3}, 10^{−4}} in our research. Given the data with the highest mesh resolution, one can further generate data with lower resolution by direct down-sampling.^{19} The data with the highest mesh resolution has 2^{13} grids.^{19} Our KoopmanLab module offers a function to load the data of the incompressible two-dimensional Navier–Stokes equation.

For the one-dimensional Bateman–Burgers equation, *γ*_{I} stands for a periodic initial condition $\gamma_I \in L^2_{\mathrm{periodic}}((0,1); \mathbb{R})$ and parameter $\nu \in (0, \infty)$ is the viscosity coefficient. We set *ν* = 100 in our research. The data with the highest mesh resolution has 2^{16} grids.^{19} To load this dataset, one can use the following function:
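The loader shipped with KoopmanLab is not reproduced here; the following hypothetical sketch shows the down-sampling operation such a loader wraps, applied to a stand-in array with the shape of the Bateman–Burgers archive.

```python
# Illustrative sketch (not the KoopmanLab loader): generating a lower
# spatial resolution from the highest-resolution mesh by direct
# down-sampling, i.e., keeping every sub-th grid point.
import numpy as np

def downsample(field, sub):
    """Keep every `sub`-th spatial grid point of a (samples, grid) array."""
    return field[:, ::sub]

# stand-in for the archive of Ref. 19: 8 samples on a 2**13-point mesh
fine = np.random.default_rng(0).standard_normal((8, 2**13))
coarse = downsample(fine, sub=8)
print(coarse.shape)  # (8, 1024)
```

Because the compact KNO is mesh-independent, a model trained on `coarse` can be evaluated directly on `fine` without retraining.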

In Fig. 2(a), we validate the mesh-independent property of the proposed compact KNO sub-family adopting the same setting used in our earlier work.^{31} The mesh-independent property, as suggested by Refs. 5 and 16–21, arises from the fact that the neural operator is expected to learn the solution operator of an entire PDE family rather than be limited to a concrete parameterized instance. Specifically, we conduct the experiments on the data of the one-dimensional Bateman–Burgers equation associated with different mesh granularity conditions (i.e., spatial resolution of meshes). Different versions of the compact KNO sub-family are defined by changing hyper-parameters [e.g., operator size *o*, frequency mode number *f*, and the power of the Koopman operator $r = \frac{t'-t}{\epsilon}$ in Eq. (56)]. These models are trained on 1000 randomly selected samples with the lowest spatial resolution and conduct 1-s forward prediction on 200 samples associated with different resolutions. The batch size is set as 10, the learning rate is initialized as 0.001 and halved every 100 epochs, and the weights of prediction and reconstruction in Eq. (64) are set as $(\lambda_p, \lambda_r) = (5, 0.5)$. All models are trained by the Adam optimizer^{61} on a single NVIDIA Titan V graphics processing unit (GPU). As shown in Fig. 2(a), the prediction errors of all versions of the compact KNO sub-family remain constant across different spatial resolutions, demonstrating that the compact KNO sub-family is mesh-independent. Mesh-independence is important for PDE solving because it allows one to train a neural-network-based PDE solver on data with low spatial resolution and directly apply the solver to data with high spatial resolution, which helps balance accuracy and efficiency in PDE solving.
In our earlier work,^{31} one can further see a detailed comparison between the original KNO and FNO^{19} in the mesh-independent prediction task, where the original KNO outperforms FNO with a much smaller model size (e.g., a size of 5 × 10^{3} for KNO and a size of 2 × 10^{7} for FNO). Other neural operator models, such as the graph neural operator (GNO)^{5} and the multipole graph neural operator (MGNO),^{62} are no longer considered because they have been demonstrated to be less accurate than FNO, as reported by Ref. 19.
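The training configuration above can be sketched compactly (our illustration of the stated settings; `combined_loss` and `learning_rate` are hypothetical helper names):

```python
# Sketch of the stated training setup: a loss weighting prediction and
# reconstruction errors with (lambda_p, lambda_r) = (5, 0.5), and a
# learning rate starting at 0.001 and halved every 100 epochs.
import numpy as np

def combined_loss(pred, target, recon, inputs, lam_p=5.0, lam_r=0.5):
    """Weighted sum of prediction and reconstruction mean-squared errors."""
    pred_err = np.mean((pred - target) ** 2)
    recon_err = np.mean((recon - inputs) ** 2)
    return lam_p * pred_err + lam_r * recon_err

def learning_rate(epoch, base=1e-3):
    """Halve the base rate every 100 epochs."""
    return base * 0.5 ** (epoch // 100)

print(learning_rate(0), learning_rate(250))  # 0.001 0.00025
```

The reconstruction term keeps the encoder/decoder pair close to an identity on observed states, which is what lets the Koopman operator act in a consistent observation space.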

In Figs. 2(b)–2(e), we validate the compact KNO sub-family on a long-term prediction task designed on the two-dimensional Navier–Stokes equation datasets with viscosity coefficients *ν* = 10^{−3} [Fig. 2(b)] and *ν* = 10^{−4} [Fig. 2(c)]. A down-sampling scaling factor of 2 is defined to generate the datasets with 2^{12} grids. For comparison, a one-unit FNO is defined following the default setting introduced in Ref. 19. A 40-time-interval prediction task is conducted on the dataset with *ν* = 10^{−3}, where models are trained on 1000 samples of $\gamma([0,1]^2 \times [0,10])$ and tested on 200 samples of $\gamma([0,1]^2 \times [10,50])$. Similarly, a more challenging 10-time-interval prediction task is conducted on the dataset with *ν* = 10^{−4}, in which models are trained on 8000 samples of $\gamma([0,1]^2 \times [0,10])$ and tested on 200 samples of $\gamma([0,1]^2 \times [10,20])$. Figures 2(b) and 2(c) report the prediction performance of all models as a function of increasing prediction duration. Figures 2(d) and 2(e) visualize predicted instances and errors in the cases with *ν* = 10^{−3} [Fig. 2(d)] and *ν* = 10^{−4} [Fig. 2(e)] during testing. All experimental results suggest the strong potential of the compact KNO sub-family in characterizing the long-term evolution of PDE solutions. Combining these results with the model sizes measured in Table I, we suggest that the compact KNO sub-family realizes a better balance between accuracy and efficiency because a KNO variant with a smaller model size can still outperform FNO significantly. Meanwhile, although KNO costs more time in training and testing than FNO, it is significantly faster than the pseudo-spectral method,^{60} a representative numerical approach for simulating fluids. Because the datasets of the two-dimensional Navier–Stokes equation^{58} and the one-dimensional Bateman–Burgers equation^{59} used in our experiments are generated by the pseudo-spectral method^{60} following Ref. 19, our results suggest the potential of a trained KNO model to achieve an acceleration of multiple orders of magnitude while reproducing the outputs of the pseudo-spectral method^{60} with small errors.

Models | Settings | Parameter number | Training time cost | Testing time cost
---|---|---|---|---
FNO^{19} | Default settings, one-unit | 233 897 | 16.386 s/epoch | 0.0567 s/sample
MLP-KNO | $(o, f, r) = (32, 10, 12)$ | 206 538 | 89.082 s/epoch | 0.2389 s/sample
MLP-KNO | $(o, f, r) = (32, 16, 8)$ | 526 026 | 64.964 s/epoch | 0.1634 s/sample
MLP-KNO | $(o, f, r) = (48, 10, 12)$ | 464 170 | 86.740 s/epoch | 0.2380 s/sample
CNN–KNO | $(o, f, r) = (32, 10, 12)$ | 206 538 | 89.928 s/epoch | 0.2360 s/sample
CNN–KNO | $(o, f, r) = (32, 16, 8)$ | 526 026 | 67.814 s/epoch | 0.1736 s/sample
CNN–KNO | $(o, f, r) = (48, 10, 12)$ | 464 170 | 89.042 s/epoch | 0.2287 s/sample
Pseudo-spectral method^{60} | Calculates $\gamma([0,1]^2 \times [10,50])$ | ⋯ | ⋯ | 511.09 s/sample


### C. The ViT-KNO sub-family: Definition

Different from the compact KNO sub-family, the ViT-KNO sub-family is proposed for dealing with more intricate situations (here, ViT stands for Vision Transformer^{63}). Numerous applications of PDE solving (e.g., global climate forecasting) require the solver to capture the underlying patterns of ultra-large datasets that may be related to certain unknown PDEs. Meanwhile, there may exist multiple variables of interest governed by a group of coupled PDEs. To fit these situations, we follow the main idea of the compact KNO sub-family to develop a kind of transformer-based PDE solver. The mechanism underlying the proposed ViT-KNO sub-family is not completely the same as Eqs. (52)–(62) because some mathematical details are modified to improve model applicability on noisy real datasets. We suggest the benefits of our modifications based on an experiment on ERA5, one of the largest datasets of global atmospheric, land, and oceanic climate fields.^{32} Nevertheless, more in-depth mathematical analyses of these modifications remain for future studies.

Let us consider a case where there exist *h* coupled variables, {*γ*^{1}, …, *γ*^{h}}, defined on domain *D* × *T*. The dynamics of these variables are governed by a group of PDEs with unknown expressions. The objective is to learn the equation solutions of these latent PDEs such that the dynamics of {*γ*^{1}, …, *γ*^{h}} can be accurately characterized.

The architecture of the ViT-KNO sub-family consists of seven parts. Below, we present a detailed computational implementation of each part.

**Part 1: Observation.** Similar to the encoder design in the compact KNO sub-family, an encoder component is implemented in ViT-KNO to serve as the observation function $g(\cdot)$ and transform $\varphi_{t}^{s} = \varphi(s, D \times \{t\})$ into $g(\hat{\gamma}_{t}^{s})$ for each *s* ∈ {1, …, *h*}. Specifically, the encoder is realized by the token embedding layer in the Vision Transformer (ViT).^{63} Given a joint input $(\varphi_{t}^{1}, \ldots, \varphi_{t}^{h}) \in \mathbb{R}^{d_{\varphi} \times h}$, we first transform it into a three-dimensional token tensor by a convolutional layer $\mathbf{C}_{e}$ and an arbitrary non-linear transform $\mathbf{Q}$,

$$\mathbf{Q} \circ \mathbf{C}_{e}(\varphi_{t}^{1}, \ldots, \varphi_{t}^{h}) = g(\hat{\Gamma}_{t}) \in \mathbb{R}^{u \times v \times l}, \tag{72}$$

where domain *D* is reorganized into *u* × *v* patches (i.e., tokens). A patch is a kind of square and non-overlapping macro-mesh. If domain *D* has already been discretized into multiple meshes, then the size of a patch equals the number of meshes it covers. Parameter *l* denotes a customized embedding dimension, which is not necessarily the same as *h*. In real cases, the definition of the non-linear transform $\mathbf{Q}$ can be selected according to different application demands. The derived tensor $g(\hat{\Gamma}_{t})$ denotes the joint representation of $(g(\hat{\gamma}_{t}^{1}), \ldots, g(\hat{\gamma}_{t}^{l}))$. See Fig. 3 for illustrations.

**Part 2: Fourier transform.** Similar to the adaptive Fourier neural operator,^{23,24,64} a truncated Fourier transform is applied on the first two dimensions of $g(\hat{\Gamma}_{t})$ to derive the Fourier series of each embedded variable *s* ∈ {1, …, *l*},

$$g_{\mathcal{F}}^{s}(\xi) = \chi_{[0, \omega]}(\xi) \int_{u \times v \times \{t\}} g(\hat{\gamma}^{s}(x, t)) \exp(-2\pi i \langle (x, t), \xi \rangle) \, \mathrm{d}(x, t), \tag{73}$$

where, with a slight abuse of notation, $u = \{1, \ldots, u\}$. For convenience, we mark

$$g_{\mathcal{F}}(\hat{\gamma}_{t}^{s}) = \mathcal{F} \circ g(\hat{\gamma}_{t}^{s}) = \{ g_{\mathcal{F}}^{s}(\xi) \mid \xi \in [0, \infty) \}, \tag{74}$$

$$g_{\mathcal{F}}(\hat{\Gamma}_{t}) = \mathcal{F} \circ g(\hat{\Gamma}_{t}) = (g_{\mathcal{F}}(\hat{\gamma}_{t}^{1}), \ldots, g_{\mathcal{F}}(\hat{\gamma}_{t}^{l})), \tag{75}$$

in which *s* ∈ {1, …, *l*}. Similar to the compact KNO sub-family, frequency truncation leads to the loss of high-frequency information. In the ViT-KNO sub-family, **Part 5** is designed for complementing high-frequency information. See Fig. 3 for details.

**Part 3: Koopman-operator-associated component.** After deriving $g_{\mathcal{F}}(\hat{\Gamma}_{t})$ for every $t \in \epsilon \mathbb{N}^{+}$, a Koopman-operator-associated component is designed to act on the third dimension of every token in $g_{\mathcal{F}}(\hat{\Gamma}_{t})$ and realize the iterative dynamics

$$g_{\mathcal{F}}(\hat{\Gamma}_{t+\epsilon}) = \bar{\mathcal{K}}_{\epsilon} \circ \mathbf{S} \circ g_{\mathcal{F}}(\hat{\Gamma}_{t}) = (g_{\mathcal{F}}(\hat{\gamma}_{t+\epsilon}^{1}), \ldots, g_{\mathcal{F}}(\hat{\gamma}_{t+\epsilon}^{l})), \tag{76}$$

in which the Koopman operator $\bar{\mathcal{K}}_{\epsilon}$ is learned by a linear transform and layer $\mathbf{S}$ is constructed by a non-linear activation function *η* acting on a linear layer $\mathbf{W}$,

$$\mathbf{S} = \eta \circ \mathbf{W}. \tag{77}$$

Although $\mathbf{S}$ is not a part of the original Koopman neural operator,^{31} including it can efficiently enhance the capacity of this component to characterize intricate large-scale data. In KoopmanLab, the leaky rectified linear unit (Leaky ReLU)^{65} is suggested as a default choice of $\mathbf{S}$, which can also reduce to the ReLU function as a special case. See Fig. 3 for illustrations of **Part 3**.

**Part 4: Inverse Fourier transform.** Once $g_{\mathcal{F}}(\hat{\Gamma}_{t+\epsilon})$ is derived in **Part 3**, the inverse Fourier transform is applied on $g_{\mathcal{F}}(\hat{\Gamma}_{t+\epsilon})$ to transform it back to the observation space,

$$g(\hat{\gamma}^{s}(x, t+\epsilon)) = \frac{1}{(2\pi)^{d_{\hat{\gamma}}}} \int_{-\infty}^{\infty} g_{\mathcal{F}}^{s}(\xi) \exp(2\pi i \langle (x, t+\epsilon), \xi \rangle) \, \mathrm{d}\xi, \tag{78}$$

based on which we can define

$$g(\hat{\Gamma}_{t+\epsilon}) = \mathcal{F}^{-1} \circ g_{\mathcal{F}}(\hat{\Gamma}_{t+\epsilon}) = (g(\hat{\gamma}^{1}(x, t+\epsilon)), \ldots, g(\hat{\gamma}^{l}(x, t+\epsilon))). \tag{79}$$

See instances in Fig. 3.

**Part 5: High-frequency information complement.** Same as in the compact KNO sub-family, there is a component for complementing high-frequency information in ViT-KNO. This component is also realized by a convolutional layer $\mathbf{C}$ that acts on the outputs of **Part 1** to learn the dynamics of high-frequency information,

$$g_{\mathbf{C}}(\hat{\Gamma}_{t+\epsilon}) = \mathbf{C} \circ g(\hat{\Gamma}_{t}). \tag{80}$$

See Fig. 1 for illustrations.

**Part 6: Variable coupling.** Given the two predicted states $g(\hat{\Gamma}_{t+\epsilon})$ and $g_{\mathbf{C}}(\hat{\Gamma}_{t+\epsilon})$ derived by **Parts 2–4** and **Part 5**, we combine them in a linear form,

$$g_{U}(\hat{\Gamma}_{t+\epsilon}) = g(\hat{\Gamma}_{t+\epsilon}) + g_{\mathbf{C}}(\hat{\Gamma}_{t+\epsilon}). \tag{81}$$

Because ViT-KNO is designed to learn multi-variate systems governed by unknown coupled PDEs, we need to characterize the coupling relation among variables. Because we lack *a priori* knowledge about these underlying PDEs, we suggest capturing the coupling relation by optimizing a non-linear layer $\mathbf{M}$,

$$g_{\mathbf{M}}(\hat{\Gamma}_{t+\epsilon}) = \mathbf{M} \circ g_{U}(\hat{\Gamma}_{t+\epsilon}). \tag{82}$$

Following the idea of the adaptive Fourier neural operator,^{23,24,64} we use the Gaussian Error Linear Unit (GELU) as the activation function in this non-linear layer. See Fig. 3 for illustrations.

**Part 7: Inverse observation.** Given $g_{\mathbf{M}}(\hat{\Gamma}_{t+\epsilon})$, a decoder is implemented to function as the inverse of the observation function,

$$(\hat{\gamma}_{t+\epsilon}^{1}, \ldots, \hat{\gamma}_{t+\epsilon}^{h}) \simeq g^{-1}(g_{\mathbf{M}}(\hat{\Gamma}_{t+\epsilon})) = \mathrm{Decoder}(g_{\mathbf{M}}(\hat{\Gamma}_{t+\epsilon})). \tag{83}$$

Similar to the compact KNO sub-family, there are two kinds of decoders included in our proposed KoopmanLab module,

$$\mathrm{Decoder} \in \{\mathbf{W}_{d}, \mathbf{C}_{d}\}, \tag{84}$$

in which $\mathbf{W}_{d}$ and $\mathbf{C}_{d}$ denote linear and convolutional layers, respectively. These two kinds of decoder designs distinguish between two variants of the ViT-KNO sub-family. See Fig. 3 for illustrations.

**Parts 1–7** define the iterative update strategy of Eq. (23) in a multi-variate case. For any *t* ∈ *T*, the iterative dynamics is given as

$$(\hat{\gamma}_{t+\epsilon}^{1}, \ldots, \hat{\gamma}_{t+\epsilon}^{h}) \simeq g^{-1} \circ \mathbf{M} \circ \big( \mathcal{F}^{-1} \circ \bar{\mathcal{K}}_{\epsilon} \circ \mathbf{S} \circ \mathcal{F} + \mathbf{C} \big) \circ g(\varphi_{t}^{1}, \ldots, \varphi_{t}^{h}), \tag{85}$$

which composes Eqs. (72)–(84): the observation (**Part 1**), the truncated Fourier transform and Koopman dynamics (**Parts 2–4**), the high-frequency complement (**Part 5**), the variable coupling (**Part 6**), and the inverse observation (**Part 7**).

Several computational tricks can be considered in applications. First, a LASSO regularization^{66} can be included to improve the robustness and sparsity of the Koopman operator $\bar{\mathcal{K}}_{\epsilon}$ in Eq. (76). This trick has been demonstrated as effective in the adaptive Fourier neural operator^{23,24,64} and is applicable to the ViT-KNO sub-family as well. Second, the transformer architecture supports a parallel design of the ViT-KNO sub-family. Specifically, the third dimension of the output of **Part 1** can be subdivided into multiple parts [e.g., $g(\hat{\Gamma}_{t}) \in \mathbb{R}^{u \times v \times l}$ is subdivided into *k* parts such that each part is an element of $\mathbb{R}^{u \times v \times \frac{l}{k}}$]. Then, **Parts 2–6** are copied *k* × *j* times, where each group of *j* copies is organized into a sequence. Each sequence of *j* copies is referred to as a head in the transformer; it processes a corresponding $\frac{1}{k}$ part of $g(\hat{\Gamma}_{t})$ and shares parameters during optimization (see Fig. 3). Computationally, parameters *k* and *j* are referred to as the number and the depth of heads, respectively. The processed outputs of these *k* parallel heads are unified by **Part 7** to derive the final prediction result. In our proposed KoopmanLab, these two tricks are included to improve computational efficiency.
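To make **Parts 1–7** concrete, the following minimal NumPy sketch traces one prediction step of a single-head ViT-KNO with random stand-in weights. It is an illustration of the seven-part pipeline, not the KoopmanLab implementation: the tensor sizes, the `tanh` token embedding, the application of $\mathbf{S}$ and $\bar{\mathcal{K}}_{\epsilon}$ separately to the real and imaginary parts of the Fourier coefficients, the linear high-frequency complement, and the linear decoder are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v, h, l, f = 8, 8, 3, 16, 4  # patch grid, input variables, embedding dim, kept modes

# Random stand-ins for trained parameters
W_enc = 0.1 * rng.standard_normal((h, l))  # Part 1: observation g (token embedding)
W_S = 0.1 * rng.standard_normal((l, l))    # Part 3: linear layer W inside S = eta . W
K = 0.1 * rng.standard_normal((l, l))      # Part 3: Koopman operator (linear map)
W_C = 0.1 * rng.standard_normal((l, l))    # Part 5: high-frequency complement C
W_M = 0.1 * rng.standard_normal((l, l))    # Part 6: coupling layer M
W_dec = 0.1 * rng.standard_normal((l, h))  # Part 7: decoder g^{-1} (linear variant)

leaky_relu = lambda x: np.where(x > 0, x, 0.01 * x)  # default eta in Eq. (77)
gelu = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def vit_kno_step(phi_t):
    """One iteration phi_t -> phi_{t+eps}; phi_t has shape (u, v, h)."""
    g_t = np.tanh(phi_t @ W_enc)                    # Part 1: encode into u x v x l tokens
    G = np.fft.fft2(g_t, axes=(0, 1))               # Part 2: Fourier transform on patch grid
    G[f:, :, :] = 0.0                               # ... truncate to f x f low-frequency modes
    G[:, f:, :] = 0.0
    SK = lambda x: leaky_relu(x @ W_S) @ K          # Part 3: K_eps . S on the third dimension
    G_next = SK(G.real) + 1j * SK(G.imag)           # (applied per real/imag part for simplicity)
    g_low = np.fft.ifft2(G_next, axes=(0, 1)).real  # Part 4: inverse Fourier transform
    g_high = g_t @ W_C                              # Part 5: high-frequency complement
    g_m = gelu((g_low + g_high) @ W_M)              # Part 6: coupling, Eqs. (81) and (82)
    return g_m @ W_dec                              # Part 7: decode back to the h variables

phi_t = rng.standard_normal((u, v, h))
phi_next = vit_kno_step(phi_t)
print(phi_next.shape)  # (8, 8, 3): same shape as the input, so the step can be iterated
```

Because the output shape matches the input shape, `vit_kno_step` can be applied repeatedly to roll out a long-term prediction, mirroring Eq. (85); the parallel multi-head design then simply runs *k* copies of this step on slices of the embedding dimension.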

Our KoopmanLab module supports customizing ViT-KNO frameworks. Below, we present an instance of ViT-KNO with a multi-layer perceptron as the decoder

Similarly, a ViT-KNO whose decoder is a convolutional layer can be defined as the following:

Note that some detailed model parameters are not covered by the above code because they are highly coupled during computation or are less important in our theoretical derivations. We suggest that users adjust them after loading the source code of ViT-KNO.

### D. The ViT-KNO sub-family: Validation

To validate the proposed ViT-KNO sub-family, we implement a large-scale experiment on ERA5, one of the largest high-resolution datasets of global-scale multi-variate climate fields.^{32} This dataset has been extensively applied in weather forecasting tasks (e.g., see FourCastNet^{23,24}), ensuring the reproducibility and comparability of our results.

Twenty important climate variables are considered in our research: mean sea-level pressure (MSLP), relative humidity at 500 hPa (R500), relative humidity at 850 hPa (R850), surface pressure (SP), 2 m temperature (T2M), temperature at 500 hPa (T500), temperature at 850 hPa (T850), total column water vapor (TCWV), the 10 m u-component of wind (U10), the u-component of wind at 500 hPa (U500), at 850 hPa (U850), and at 1000 hPa (U1000), the 10 m v-component of wind (V10), the v-component of wind at 500 hPa (V500), at 850 hPa (V850), and at 1000 hPa (V1000), and the geopotential at 50 hPa (Z50), at 500 hPa (Z500), at 850 hPa (Z850), and at 1000 hPa (Z1000).

In a long-term prediction task, we test a ViT-KNO whose decoder is a multi-layer perceptron and whose encoder contains a head structure with a depth of 4 to function as mapping **Q** in Eq. (72). This model is provided as a default choice of ViT-KNO in the KoopmanLab module. For comparison, we implement the official version of FourCastNet provided in Refs. 23 and 24. Limited by computing resources (FourCastNet is reported to be trained on 3808 NVIDIA A100 GPUs with numerous computational optimization tricks^{23}), we are unable to reproduce the memory-consuming optimization tricks introduced in Refs. 23 and 24 (e.g., the two-stage training procedure). Therefore, we compare ViT-KNO and FourCastNet in a fair setting where no extra optimization trick is applied to the training of either model.

Given the samples of initial conditions, the implemented ViT-KNO and FourCastNet are required to predict the future states of all 20 climate variables after *t* ∈ {6, 12, 18, …, 240} h. The training dataset includes the samples recorded from 1979 to 2015. The validation dataset includes the samples recorded during 2016 and 2017. The test dataset includes the samples recorded in 2018. The spatial resolution of all samples is 1440 × 720. All data are pre-processed in a standard manner following Refs. 23, 24, and 32, where a *Z*-transform is applied to normalize each variable. The training of our ViT-KNO and FourCastNet is implemented in a multi-GPU environment with 128 GPUs of 16 GB memory each. The testing of the trained models is implemented in a single 16-GB GPU environment.
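The *Z*-transform pre-processing mentioned above is ordinary per-variable standardization: statistics are computed on the training data and reused at validation/test time. A minimal sketch (the field shape and variable name are illustrative):

```python
import numpy as np

def z_normalize(field, mean=None, std=None):
    """Standardize a climate field to zero mean and unit variance.

    Statistics should be computed once on the training data and then
    passed in explicitly when normalizing validation or test samples.
    """
    if mean is None:
        mean = field.mean()
    if std is None:
        std = field.std()
    return (field - mean) / std, mean, std

rng = np.random.default_rng(1)
# Synthetic T850-like field: 4 time steps on a 720 x 1440 latitude-longitude grid
t850 = 273.15 + 10.0 * rng.standard_normal((4, 720, 1440))
t850_norm, mu, sigma = z_normalize(t850)
# t850_norm now has mean close to 0 and standard deviation close to 1
```

At inference time, predictions are mapped back to physical units with `pred * sigma + mu` using the training-set statistics.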

Key model settings are summarized below. For ViT-KNO, the batch size is set as 64, the learning rate is 5 × 10^{−4} and updated by a cosine annealing approach, the patch size is set as 8 × 8, the number of heads is 16, the depth of heads is 12, the embedding dimension *l* in Eq. (72) is 768, the relative weights of prediction and reconstruction in the loss function are *λ*_{p} = 0.8 and *λ*_{r} = 0.2, and the number of kept low-frequency modes after the fast Fourier transform is *f* = 32. For FourCastNet, all model settings are adopted from the official version.^{23,24} The defined ViT-KNO model has 73 905 428 parameters in total, while the official version of FourCastNet^{23,24} has 74 691 840 parameters. The optimizer of both models is FusedAdam, an Adam optimizer^{61} that supports mixed-precision and distributed training. All 20 climate variables are learned and predicted together rather than separately. Note that land-form information is not provided to either model; both models are required to learn the climate variables with no additional information. The maximum number of training epochs is set as 150 to explore when these models converge, which costs about 46.25 h in our environment.
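For reference, the cosine annealing of the learning rate mentioned above can be sketched as follows; the base rate 5 × 10^{−4} and the 150-epoch horizon come from the settings listed, while the decay to a zero floor is an assumption:

```python
import math

def cosine_annealed_lr(epoch, total_epochs=150, base_lr=5e-4, min_lr=0.0):
    """Cosine annealing from base_lr down to min_lr over total_epochs epochs."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(cosine_annealed_lr(0))    # 0.0005 at the start of training
print(cosine_annealed_lr(75))   # roughly half the base rate midway through
```

The schedule decreases smoothly, which is the usual motivation for cosine annealing: large early steps for fast convergence and small late steps for fine-tuning.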

In Fig. 4, we visualize several instances of the climate fields predicted by ViT-KNO after 70 epochs of training, accompanied by the corresponding true values. High consistency can be seen between these ground truths and their predicted counterparts. Quantitatively, the prediction accuracy of each climate variable during testing is measured by the anomaly correlation coefficient (ACC) in Figs. 5 and 6, where we compare ViT-KNO and FourCastNet after 70 and 115 epochs of training, respectively. On the same prediction task reported in Refs. 23 and 24, the trained ViT-KNO significantly outperforms the baseline state-of-the-art deep learning model for weather forecasting proposed in Ref. 67. According to our observations in Figs. 5 and 6, ViT-KNO generally achieves a higher testing accuracy than FourCastNet, and its advantage over FourCastNet is enlarged as the prediction duration increases. Moreover, the implemented FourCastNet seems to over-fit the training dataset as the number of training epochs increases, while ViT-KNO does not (this phenomenon can be found by comparing the model performances reported in Figs. 5 and 6). These results suggest the potential of ViT-KNO to become a competitive alternative to FourCastNet, especially in learning tasks involving long-term and non-linear dynamics.
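The ACC used above is the standard centered anomaly correlation, computed against a climatological mean field. A minimal sketch (field shapes are illustrative; operational implementations typically also apply latitude weighting, which is omitted here for brevity):

```python
import numpy as np

def acc(pred, truth, climatology):
    """Anomaly correlation coefficient between a predicted and a true field.

    Anomalies are deviations from the climatological mean; the ACC is their
    spatially accumulated cosine similarity, so a perfect forecast gives 1
    and an uncorrelated forecast gives a value near 0.
    """
    p = pred - climatology
    t = truth - climatology
    return float(np.sum(p * t) / np.sqrt(np.sum(p**2) * np.sum(t**2)))

rng = np.random.default_rng(2)
clim = rng.standard_normal((720, 1440))          # synthetic climatological mean
truth = clim + rng.standard_normal((720, 1440))  # synthetic observed field
print(acc(truth, truth, clim))  # equals 1 for a perfect forecast (up to rounding)
```

In the experiments above, the ACC of each variable is tracked as a function of lead time, so a slower decay toward 0 indicates a better long-term forecast.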

Moreover, the time costs of a single prediction by ViT-KNO and FourCastNet are observed as ≃0.852 and ≃0.734 s on a single 16-GB GPU, respectively. Compared with classic numerical weather forecasting systems (e.g., the ECMWF Integrated Forecasting System), whose prediction inevitably requires a multi-GPU environment (e.g., more than 1000 NVIDIA Selene nodes, where each node consists of 8 NVIDIA A100 GPUs),^{68–70} ViT-KNO is orders of magnitude faster in application (e.g., the Integrated Forecasting System L91 18 km model is expected to cost about 9840 node seconds for a prediction on an NVIDIA Selene node^{70}). Therefore, our ViT-KNO has the potential to serve as a unit in ensemble weather forecasting frameworks to realize efficient prediction of global weather. Although we cannot directly implement the ECMWF Integrated Forecasting System for comparison due to limited computing resources, one can refer to Refs. 23 and 24 for a comparison between FourCastNet and the ECMWF Integrated Forecasting System, which may provide an indirect verification of the competitive performance of the ViT-KNO model. More detailed comparisons remain for future exploration.

## IV. CONCLUSION

In this paper, we have presented KoopmanLab, an efficient module of the Koopman neural operator family for solving partial differential equations. The models included in this module, such as the compact KNO sub-family and the ViT-KNO sub-family, are provided with mathematical foundations, computational designs, and validations in solving concrete PDEs or predicting intricate dynamic systems governed by unknown coupled PDEs. All models are suggested to be competitive with other state-of-the-art approaches in their corresponding tasks. Compared with classic numerical and neural-network-based PDE solvers, the proposed KNO variants achieve significant acceleration, more robust mesh-independence, higher generalization capacity under changed conditions, more flexibility in characterizing latent PDEs with unknown forms, and a better balance between accuracy and efficiency. Therefore, we suggest the potential of KoopmanLab to be applied in diverse down-stream tasks related to PDE solving. Users can download this module via https://github.com/Koopman-Laboratory/KoopmanLab.

Several important questions remain for future studies. First, one may consider more specialized computational optimization of the models in KoopmanLab (e.g., multi-stage training as suggested in Refs. 23 and 24 or multi-objective balancing by Pareto theory^{71,72}). Second, one can explore a more detailed comparison between the ViT-KNO sub-family and the ECMWF Integrated Forecasting System under equivalent hardware conditions. Third, one can analyze the errors of our models caused by the potential continuous spectrum of the Koopman operator or the absence of the ergodic property in real cases.

## ACKNOWLEDGMENTS

Y.T. and P.S. appreciate the support from the Artificial and General Intelligence Research Program of Guo Qiang Research Institute at Tsinghua University (Grant No. 2020GQG1017), the Tsinghua University Initiative Scientific Research Program, and the Huawei Innovation Research Program (Grant No. TC20221109044). X.H. was supported by the National Natural Science Foundation of China (Grant No. 42125503) and the National Key Research and Development Program of China (Grant Nos. 2022YFE0195900, 2021YFC3101600, 2020YFA0608000, and 2020YFA0607900).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Wei Xiong**: Conceptualization (supporting); Data curation (lead); Investigation (equal); Methodology (supporting); Software (lead); Validation (lead); Visualization (supporting); Writing – review & editing (supporting). **Muyuan Ma**: Data curation (supporting); Investigation (supporting); Methodology (supporting); Software (equal); Validation (equal); Visualization (equal). **Xiaomeng Huang**: Data curation (equal); Project administration (supporting); Resources (equal); Validation (equal); Writing – review & editing (equal). **Ziyang Zhang**: Resources (supporting). **Pei Sun**: Conceptualization (equal); Funding acquisition (lead); Project administration (lead); Resources (equal); Supervision (equal); Writing – review & editing (equal). **Yang Tian**: Conceptualization (lead); Formal analysis (lead); Methodology (lead); Software (supporting); Supervision (supporting); Validation (lead); Writing – original draft (lead); Writing – review & editing (equal).

## DATA AVAILABILITY

The code and data that support the findings of this study are available via https://github.com/Koopman-Laboratory/KoopmanLab.^{73}

## REFERENCES

- *Partial Differential Equations: Analytical and Numerical Methods*
- *Functional Analytic Methods for Partial Differential Equations*
- *Nonlinear Partial Differential Equations for Scientists and Engineers*
- *Partial Differential Equations: Modeling, Analysis, Computation*
- *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*
- *International Conference on Learning Representations* (OpenReview, 2021)
- *Parameterization Schemes: Keys to Understanding Numerical Weather Prediction Models*
- *Advances in Neural Information Processing Systems* (NeurIPS Foundation)
- *Manifolds, Tensor Analysis, and Applications*
- *Evolution Semigroups in Dynamical Systems and Differential Equations*
- *Advances in Neural Information Processing Systems*
- *Numerical Methods for Large Eigenvalue Problems: Revised Edition*
- *Ergodic Theory*
- *International Conference on Learning Representations*
- *Advances in Neural Information Processing Systems* (NeurIPS Foundation)
- *The ECMWF Scalability Programme: Progress and Plans*
- *Advances in Neural Information Processing Systems*