A Swin-Transformer-based Model for Efficient Compression of Turbulent Flow Data

This study proposes a deep-learning-based method for generating reduced representations of turbulent flows that enables efficient storage and transfer while maintaining high accuracy after decompression. A Swin-Transformer network combined with a physical-constraints-based loss function is used to compress turbulent flows at high compression ratios and then restore the data together with its underlying physical properties. Forced isotropic turbulence is used to demonstrate the Swin-Transformer-based (ST) model, where the instantaneous and statistical results show that the model recovers the flow data with high accuracy. Furthermore, the ST model is compared with a typical convolutional-neural-network-based auto-encoder (CNN-AE) using turbulent channel flow at two friction Reynolds numbers, $Re_\tau$ = 180 and 550. The results generated by the ST model are significantly more consistent with the direct numerical simulation (DNS) data than those recovered by the CNN-AE, indicating the superior ability of the ST model to compress and restore turbulent flow. This study also compares the compression performance of the ST model at different compression ratios (CR) and finds that the error remains low even at very high CR. Additionally, the effect of transfer learning (TL) is investigated, showing that TL reduces the training time by 64% while maintaining high accuracy. The results illustrate for the first time that a Swin-Transformer-based model incorporating a physically constrained loss function can compress and restore turbulent flows with the correct physics.


I. INTRODUCTION
Turbulence, characterized by chaotic interactions among multiple spatial and temporal flow scales, has a significant impact on various fields such as aerospace 1 , environment 2 , wind energy 3,4 , and combustion 5 . With the development of measurement technologies and computing power, high-quality turbulence data can be obtained through experiments or simulations. In terms of experiments, hot-wire anemometry 6,7 , Particle Image Velocimetry (PIV) 8 , and Particle-Tracking Velocimetry (PTV) 9 can measure the instantaneous velocity fields of turbulent flows with high accuracy and high spatial and temporal resolution.
In terms of simulations, several computational fluid dynamics approaches, such as Reynolds-Averaged Navier-Stokes (RANS) models 10 , Large Eddy Simulation (LES) 11 , and Direct Numerical Simulation (DNS) 12 , make it possible to generate large amounts of data quickly and accurately. The advancement of experimental and simulation techniques and the increasing demand for high-quality turbulence data have led to large amounts of high-dimensional data, posing great challenges in storage and transmission. Therefore, efficient and accurate data compression techniques are necessary to reduce storage requirements, facilitate data transfer, and extract the main features of the flow field. Efficient storage and transmission methods are critical to turbulence research and help to understand the complex behavior of turbulence.
Typically, data compression techniques extract the most critical features in the data while eliminating redundant or irrelevant information. Several techniques have been developed for the efficient storage and transfer of data. Singular value decomposition (SVD), a classic matrix decomposition technique, has been applied for data dimensionality reduction, feature extraction, and dynamic mode analysis 13,14 . Principal component analysis (PCA), usually termed proper orthogonal decomposition (POD) in the fluid dynamics community 15-18 , is an unsupervised linear compression method based on SVD that maps high-dimensional data to a lower-dimensional representation. Dynamic mode decomposition (DMD) is also based on SVD and computes a low-rank representation of spatio-temporal flow data 19 . The above methods are all linear techniques, which makes them sensitive to outliers in the data. Another limitation is that they cannot handle translation, rotation, and scaling of the data 19 .
Many nonlinear methods have also been developed to capture complicated nonlinear structures in data. Kernel Principal Component Analysis (KPCA), proposed by Schölkopf et al. 20 , efficiently computes principal components in high-dimensional spaces by using integral operator kernel functions. Lee et al. 21 compared two nonlinear projection algorithms, Isomap and Curvilinear Distance Analysis (CDA), and showed that Isomap is faster and theoretically more robust than CDA, while CDA is slower but more robust in practical applications. Hinton and Roweis 22 introduced a probabilistic approach, called stochastic neighbor embedding, for mapping high-dimensional representations or pairwise differences to a lower-dimensional space while preserving the neighborhood relations. Sakai et al. 23 proposed a wavelet-based method incorporating a block-structured Cartesian mesh for the compression of flow simulation data. Sifuzzaman et al. 24 compared the wavelet transform with the Fourier transform, revealing that the former took less response time. These methods provide more flexibility than linear compression methods but can result in high computation time and cost, especially for large datasets.
Thanks to big data, computing power, and algorithm development, machine learning has received extensive attention in recent decades and has been applied in various fields, such as computer vision 25,26 , speech recognition 27 , natural language translation 28 , weather forecasting 29 , and autonomous driving 30 . In fluid dynamics, machine learning has been applied to several problems, such as flow denoising and reconstruction 31-38 , flow prediction 39,40 , active flow control 41,42 , and turbulent inflow generation 43,44 . These studies demonstrate the potential of deep learning to efficiently handle complex spatiotemporal data. Furthermore, deep-learning-based techniques have shown great promise in compressing fluid flow data efficiently while preserving its main features. Liu et al. 45 presented a data compression model using a generative adversarial network (GAN), where the discriminative network compresses data and the generative network reconstructs data. They verified the performance of the GAN-based model on 3D flow past a cylinder, separated flow on the leeward side of a double-delta wing, and shock-wave/vortex interaction. The results showed that the GAN-based model could save compression time and provide acceptable reconstruction quality. Glaws et al. 46 proposed a fully convolutional autoencoder to compress decaying homogeneous isotropic turbulence, Taylor-Green vortex, and turbulent channel flow. The study demonstrated that the autoencoder outperformed a variant of SVD at a similar compression ratio and generalized well. Further autoencoder-based compression studies include those of Olmo et al. 47 and Yousif et al. 43 , whose multiscale convolutional auto-encoder with a subpixel convolution layer (MSCSP-AE) could capture the crucial features of the flow field; the compressed data were then fed to an LSTM so that the model predicted the key patterns of the flow. In the papers mentioned above, the compression models use stacked convolutional layers as their basis, where finite-size filters capture the spatial correlation between neighboring points, creating a more compact representation.
The convolutional layer plays a vital role in deep learning due to its ability to capture adjacent spatial information and its nonlinear approximation capability. However, convolutional layers rely on the kernel, or receptive field, which acquires only local spatial correlations within the kernel field, making it challenging to recognize complex patterns 48,49 . The padding operation, an important part of the convolutional layer, keeps the feature map size the same as the original input. Still, it may cause artifacts at the edges of the input data, potentially affecting the model's performance in various applications, including turbulent boundary layer reconstruction (Yousif et al., 2023).
Additionally, the convolutional layer was originally designed for pixel prediction and reconstruction in images, where pixels are distributed uniformly in a rectangular or square region. When processing non-uniform flow data in fluid mechanics, however, the convolutional layer requires the data to be pre-processed onto a uniform Cartesian mesh, which is often impractical 50 . Moreover, the convolutional layer may lose flow details and consequently give wrong results for complex geometries 51 .
Recently, the Transformer 52 has achieved notable success in sequence prediction and natural language processing (NLP) 44,53-56 , as its attention mechanism can discover long-term dependencies in data, which has also sparked interest in its potential for computer vision applications, for example in the work of Carion et al.
The encoder plays a critical role in reducing the input data size for efficient storage and transmission while maintaining the important features. The decoder is responsible for restoring the original data from the reduced representations with high accuracy. Figure 1 (a) shows that the encoder starts and ends with a dense layer, with a series of Swin Transformer blocks (SwinT-blocks) and patch-merging operations sandwiched in between. The decoder structure is symmetrical to that of the encoder, with patch-splitting replacing patch-merging.
Here, the dense layers at the beginning project the data to an arbitrary dimension C, while the dense layers at the end project the data back to the original dimension. The SwinT-block captures the main features of the data and is described in detail later.
The patch-merging operation performs a function similar to the downsampling layer in a CNN, reducing the number of patches as the network deepens, while the patch-splitting operation can be considered an upsampling layer that increases the number of patches. It is worth noting that the entire architecture has no convolutional layers.
The MLP in the block is preceded by a LayerNorm layer and followed by a residual connection that links its output with its input. The ViT uses global self-attention to calculate relationships between all tokens, which increases the computational cost when the number of tokens is very large. In contrast, as Figure 2 (a) shows, the ST model uses local self-attention computed within each non-overlapping local window, where each window contains M×M patches (with M set to 8 in this study). The computational complexity Ω of the global multi-head self-attention (MSA) and the window-based MSA (W-MSA) for input data of size h×w can be expressed as follows:
$$\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C,$$
$$\Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC.$$
Here, the only difference is the last term: the global MSA is quadratic in the input size hw, whereas the W-MSA is linear in hw when the value of M is fixed. Therefore, W-MSA is more cost-effective, especially for larger input sizes.
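To make the cost comparison concrete, the following minimal sketch evaluates both complexity expressions, assuming the standard forms given for the Swin Transformer by Liu et al. 60 (a term 4hwC² common to both, plus 2(hw)²C for global MSA and 2M²hwC for W-MSA). The function names and the embedding dimension C = 96 are illustrative assumptions and are not taken from this study.

```python
def msa_cost(h: int, w: int, C: int) -> int:
    """Cost of global multi-head self-attention on an h x w grid of tokens."""
    hw = h * w
    return 4 * hw * C**2 + 2 * hw**2 * C


def wmsa_cost(h: int, w: int, C: int, M: int = 8) -> int:
    """Cost of window-based self-attention with M x M tokens per window."""
    hw = h * w
    return 4 * hw * C**2 + 2 * M**2 * hw * C


if __name__ == "__main__":
    C = 96  # hypothetical embedding dimension, not a value from the paper
    for side in (32, 64, 128):
        ratio = msa_cost(side, side, C) / wmsa_cost(side, side, C)
        print(f"{side} x {side} tokens: global / windowed cost = {ratio:.1f}")
```

When the grid contains exactly one window (hw = M²) the two costs coincide; the ratio then grows with the number of tokens, which is the linear-versus-quadratic behavior described in the text.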
Furthermore, the lack of cross-window information, that is, of connections across the boundaries of each window, can be resolved by using shifted-window multi-head self-attention (SW-MSA), shown in Figure 2 (b).
Self-attention in W-MSA and SW-MSA is a function that maps a query and a set of key-value pairs to an output:
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V,$$
where the weight matrices $W^Q$, $W^K$ and $W^V$ that produce Q, K and V are shared among all windows.
In the loss functions, the quantities marked with "ˆ" are the outputs of the ST model.

A. Forced isotropic turbulence flow data
For the demonstration case, the forced isotropic turbulence dataset obtained from the JHTDB at a Taylor-scale Reynolds number $Re_\lambda = \lambda u_{rms}/\nu = 418$ is used to train and test the proposed ST model, where $\lambda = (15\nu u_{rms}^{2}/\varepsilon)^{1/2}$ is the Taylor microscale, $u_{rms} = (\langle u_i u_i \rangle/3)^{1/2}$ is the root-mean-square velocity, ν is the kinematic viscosity, and ε is the dissipation rate. This dataset was generated from DNS using a pseudo-spectral parallel code. The governing equations used for the simulation were the incompressible Navier-Stokes equations. The velocity vector is u = (u, v, w), where u, v, and w are the streamwise, wall-normal, and spanwise components, respectively, with the corresponding directions x, y, and z. The grid points are uniformly distributed in all directions. The detailed parameters for the forced isotropic turbulence are shown in Table I. Further information regarding the simulation and the database can be found in Perlman et al. 63 .
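The definitions of the Taylor microscale and the Taylor-scale Reynolds number given above can be evaluated directly. The sketch below is only an illustration of those two formulas; the function names are hypothetical, and the values of u_rms and ε in the demo are placeholders rather than the JHTDB values (only ν = 0.000185 is taken from Table I).

```python
import math


def taylor_microscale(nu: float, u_rms: float, eps: float) -> float:
    """lambda = (15 * nu * u_rms^2 / eps)^(1/2)."""
    return math.sqrt(15.0 * nu * u_rms**2 / eps)


def taylor_reynolds(nu: float, u_rms: float, eps: float) -> float:
    """Re_lambda = lambda * u_rms / nu."""
    return taylor_microscale(nu, u_rms, eps) * u_rms / nu


if __name__ == "__main__":
    nu = 0.000185   # kinematic viscosity from Table I
    u_rms = 0.7     # placeholder, not the database value
    eps = 0.1       # placeholder, not the database value
    print(f"lambda = {taylor_microscale(nu, u_rms, eps):.4f}")
    print(f"Re_lambda = {taylor_reynolds(nu, u_rms, eps):.1f}")
```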

B. Turbulent channel flow
The turbulent channel flow data at $Re_\tau$ = 180 and 550 are used as datasets for the proposed model. The flow data are produced through DNS of the incompressible momentum and continuity equations, which are expressed as:
$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu\nabla^{2}\mathbf{u},$$
$$\nabla\cdot\mathbf{u} = 0.$$
In the equations above, u = (u, v, w) denotes the velocity vector, where u, v and w represent the streamwise, wall-normal and spanwise components in the x, y and z directions. The channel top and bottom are subject to no-slip conditions. The grid points are uniformly distributed in the x and z directions, while a non-uniform distribution is used in the y direction. DNS data obtained from Moser et al. 64 have been used to validate the turbulence generated by the simulation, and it was verified that the simulated data had similar statistical characteristics. The simulation uses the pressure-implicit split-operator (PISO) algorithm to solve the coupled pressure-momentum system. A second-order accurate linear upwind scheme is used to discretize the convective fluxes. All other discretization schemes used in the simulation also have second-order accuracy.
The training dataset contains 16,000 snapshots of a single (y − z) plane extracted from the turbulent channel flow simulation, split evenly between $Re_\tau$ = 180 and $Re_\tau$ = 550, with 8,000 snapshots in each subset. The test dataset for each case consists of another 1,000 snapshots. To apply transfer learning to the data at
In addition to qualitative assessments, a detailed analysis of flow statistics is conducted to evaluate the performance of the ST model. The demonstration results above provide clear evidence that the ST model can compress and decompress the uniformly distributed turbulent flow effectively while maintaining the same instantaneous and statistical results as the ground truth data. In the next section, the ability of the ST model to reconstruct the non-uniformly distributed turbulent flow is verified.

B. Turbulent channel flow
In this section, the compression and decompression capabilities of the ST model are verified.
The turbulent statistics of the reconstructed velocity fields are compared with those of the DNS turbulent channel flow at $Re_\tau$ = 180 and 550 in Figure 8 (a) and (b), respectively. The mean streamwise velocity ($U^+$) profiles of the flow decompressed by the ST model and the CNN-AE at both Reynolds numbers align accurately with the profiles from the DNS data over the entire $y^+$ range. The comparison of the root-mean-square (r.m.s.) profiles of the velocity components ($u^+_{rms}$, $v^+_{rms}$ and $w^+_{rms}$) reveals a different picture. The r.m.s. profiles of the reconstructed flow obtained using the ST model fit well with the DNS data at both $Re_\tau$ = 180 and 550. In contrast, the CNN-AE produces less accurate results, particularly for the flow at $Re_\tau$ = 550. The Reynolds shear stress profiles show the same behavior as the r.m.s. profiles. This can be attributed to the fact that at higher $Re_\tau$ the flow becomes more complex and chaotic, making it more challenging for the CNN-AE to reconstruct the boundary region accurately.
To further confirm the capability of the ST model to reproduce the spatial spectra of the restored velocity fields, the premultiplied spanwise wavenumber energy spectra of the three velocity components, denoted as $k_z\phi_{\xi\xi}$, are examined. Here, $\phi_{\xi\xi}$ denotes the spanwise wavenumber spectrum, ξ is the velocity component, and $k_z$ is the spanwise wavenumber.
Figure 10 shows $k_z^+\phi_{\xi\xi}^+$ as a function of the wall-normal distance $y^+$ and the spanwise wavelength $\lambda_z^+$. The spectra of the velocity components obtained from the ST model conform to the spectra from the DNS data, with a small discrepancy observed at high wavenumbers, while the spectra obtained from the CNN-AE are less accurate.
Here, $\hat{\xi}_i$ and $\xi_i$ denote the velocity fields decompressed by each model and the DNS data, respectively, and I represents the total number of test snapshots, which is set to 1,000.
The superior performance of the ST model can be attributed to its ability to capture long-distance spatial correlation, making it more suitable for non-uniformly distributed data. These results give confidence that the ST model can be applied to flow data in complex geometries, such as pipe flow, by adjusting the window segmentation strategy and masking mechanism, whereas for the CNN-AE the padding operation can result in significant errors at the boundaries.
In addition to the CR = 64 used earlier in this section, two more CR values are considered here to validate the ability of the ST model. The errors of the three velocity components increase as the CR increases, which aligns with the trade-off between CR and
Olmo et al. 47 improved Glaws's work by leveraging the physical properties inherent in the CFD, which led to shorter training time and less training data for the same reconstruction quality. Yousif et al. 43 applied a multiscale convolutional auto-encoder with a subpixel convolution layer (MSCSP-AE) to obtain a compact representation of the turbulent channel flow and used a Long Short-Term Memory (LSTM) network as a sequence-learning model to predict the flow field over time.
Carion et al. 57 introduced the Detection Transformer (DETR) for object detection. Dosovitskiy et al. 58 proposed the Vision Transformer (ViT) for image classification tasks and demonstrated that ViT outperforms CNNs. Han et al. 59 proposed the Transformer in Transformer (TNT) for visual recognition tasks, demonstrating better preservation of local information than ViT. Liu et al. 60 introduced the Swin Transformer with the shifted window scheme to address the window artifact problems encountered in the ViT model and found that the Swin Transformer achieves advanced performance on object detection and semantic segmentation. Owing to this performance, a large number of papers have used the Swin Transformer to tackle various vision problems. Liang et al. 48 restored high-quality images from low-quality images using Swin Transformers as deep feature extraction blocks and convolutional layers as shallow feature extraction blocks. Liu et al. 61 extended the Swin Transformer model from image recognition to video recognition and performed well on the Kinetics-400, Kinetics-600, and Something-Something v2 benchmarks. Lu et al. 49 developed an image compression model using the variational autoencoder (VAE) architecture and the Swin Transformer. Their study indicated that the Swin Transformer model requires significantly fewer model parameters than other advanced methods such as CNN-based learnt image encoding.
Inspired by the success of Swin Transformer-based models in the computer vision field, this study proposes an efficient Swin-Transformer (ST)-based model incorporating the physical properties of the flow field for turbulent data storage and transmission. The ST model uses no convolutional layers, which avoids their limitations, such as artifacts caused by the padding operation, the local spatial restriction imposed by the finite-size kernel, and the inapplicability to non-uniform grid data. The remainder of this paper is organized as follows. Section II introduces the methodology of compressing and decompressing flow data using the proposed ST model. The direct numerical simulation (DNS) datasets used for training and testing the ST model are described in Section III. In Section IV, the results from testing the ST model are discussed, and Section V provides a summary of the conclusions drawn from this study.

II. METHODOLOGY
The Transformer 52 was originally proposed for NLP problems, but the ViT 58 adapted it for computer vision by splitting input images into patches, similar to NLP tokens. The correlation between patches can therefore be captured through the self-attention operation in the Transformer, addressing the limitation of CNN kernels in capturing only local information. The Swin Transformer 60 improves upon the ViT model and incorporates shifted windows to avoid window artifact issues. The proposed ST model is based on the Swin Transformer: it divides the input flow field data into multiple patches, groups them into several windows, and employs shifted windows to overcome the lack of window boundary information. The architecture of the ST model is shown in Figure 1 (a). The model consists of an encoder and a decoder.

FIG. 1. The architecture of (a) the ST model and (b) the SwinT-block.

FIG. 2. The window partitioning method for (a) W-MSA and (b) SW-MSA. Here, each red block denotes one window in which the local self-attention is calculated.
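The window partitioning in Figure 2 can be sketched with plain array reshaping. The example below is a minimal illustration, not the implementation used in this study: it assumes the patch grid is stored as an (H, W, C) array, and it assumes the cyclic shift of M/2 patches used in the Swin Transformer of Liu et al. 60 for the shifted configuration. The attention mask that normally accompanies the cyclic shift is omitted, and all function names are hypothetical.

```python
import numpy as np


def window_partition(x: np.ndarray, M: int) -> np.ndarray:
    """Split an (H, W, C) patch grid into non-overlapping M x M windows.

    Returns an array of shape (num_windows, M * M, C).
    """
    H, W, C = x.shape
    assert H % M == 0 and W % M == 0, "grid must be divisible by the window size"
    x = x.reshape(H // M, M, W // M, M, C)
    x = x.transpose(0, 2, 1, 3, 4)  # gather the two window indices first
    return x.reshape(-1, M * M, C)


def shifted_window_partition(x: np.ndarray, M: int) -> np.ndarray:
    """Cyclically shift the grid by M // 2 patches, then partition it."""
    shift = M // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, M)


# Toy 16 x 16 patch grid with one channel and M = 8 as in this study.
grid = np.arange(16 * 16, dtype=np.float32).reshape(16, 16, 1)
windows = window_partition(grid, M=8)
shifted_windows = shifted_window_partition(grid, M=8)
print(windows.shape, shifted_windows.shape)
```

Self-attention would then be evaluated independently on each row of `windows`, so that patches next to a window boundary in one layer share a window in the shifted layer.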
Q, K and V are the query, key and value matrices, respectively; d is the dimension of the query; $B \in \mathbb{R}^{M^{2} \times M^{2}}$ is the learnable relative positional encoding. The attention function is typically calculated multiple times, with the number of calculations equal to the number of attention heads used (referred to as h). The outputs of the attention calculations are then concatenated to form the final multi-head attention output.
The proposed ST model incorporates physical principles to guide its learning process, facilitating the capture of the underlying physical behavior of turbulent flow and achieving a better fit to the training data. The first physical loss employed in the proposed ST model is the gradient error loss $L_{gradient}$, which is computed from the gradient of the flow. This loss term helps the model to accurately reconstruct turbulent flow with a non-uniform grid distribution, particularly in the wall-normal direction of the turbulent channel flow in this study. The Reynolds stress error $L_{Reynolds\ stress}$ and the spectrum error $L_{spectrum}$ quantify the difference in the Reynolds stress tensor of the velocity fields and the difference in the spectral content of the flow parameters, respectively. Incorporating these loss terms enhances the model's ability to reconstruct the Reynolds stress components and the energy spectra of the flow. In addition, the reconstructed velocity field error $L_{velocity}$ is included as the basic loss in this model. The loss functions for the proposed ST model are defined as follows:

Here, $\|\cdot\|_1$ and $\|\cdot\|_2$ are the $L_1$ and $L_2$ norms; T denotes the Reynolds stress tensor; E(k) is the energy spectrum and k is the wavenumber; S is the batch size. The balance coefficients of the loss terms, denoted as $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$, have been empirically determined as 0.01, 80, $10^{-5}$, and 300 for isotropic turbulent flow, respectively. For turbulent channel flow, they are set as 5, 100, $10^{-5}$, and 200, respectively.

III. DATA DESCRIPTION AND PRE-PROCESSING
In this study, we investigate two different types of flows: forced isotropic turbulence obtained from the Johns Hopkins turbulence databases (JHTDB), which serves as a demonstration case, and turbulent channel flow at $Re_\tau$ = 180 and 550 generated by performing DNS, which is used as a systematic test of the model capability. In both cases, the ST model is trained using the adaptive moment estimation (Adam) optimization algorithm 62 with a batch size S = 8 and an initial learning rate η = 0.0001. The model is implemented with the open-source library TensorFlow 2.2.3. Additionally, early stopping is employed as a regularization technique to terminate the training.
TABLE I. The detailed parameters for the forced isotropic turbulence. Here, L is the domain dimension and N is the number of grid points. ν and ∆t represent kinematic viscosity and simulation time-step, respectively.
L      N                      ν          ∆t
2π     1024 × 1024 × 1024     0.000185   0.0002

The velocity dataset applied as input to the ST model contains 200 snapshots of the x − y plane (where z = 0). The dataset spans approximately two large-eddy turnover times. The training dataset consists of 100 snapshots, and the test dataset is another 100 snapshots that are completely separate from the training dataset. The time interval between snapshots in the training and testing datasets is 0.02. To reduce computational costs, the entire domain is divided into 64 parts, which changes the data size from the original $N_x \times N_y$ = 1024 × 1024 in the x − y plane to 128 × 128. Consequently, the training dataset comprises 6400 sub-snapshots, which are randomly shuffled before being fed into the model.
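The sub-snapshot count follows from the tiling: each 1024 × 1024 plane yields (1024/128)² = 64 tiles of 128 × 128, so 100 training snapshots give 6400 sub-snapshots. A minimal sketch of such a tiling is shown below; it assumes non-overlapping tiles and a (Ny, Nx, components) array layout, neither of which is specified in the text, and it uses random numbers and only two snapshots as a lightweight stand-in for the velocity data.

```python
import numpy as np


def split_into_tiles(snapshot: np.ndarray, tile: int = 128) -> np.ndarray:
    """Cut a (Ny, Nx, C) snapshot into non-overlapping tile x tile blocks.

    Returns an array of shape (num_tiles, tile, tile, C).
    """
    ny, nx, c = snapshot.shape
    assert ny % tile == 0 and nx % tile == 0, "plane must be divisible by tile"
    blocks = snapshot.reshape(ny // tile, tile, nx // tile, tile, c)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(-1, tile, tile, c)


rng = np.random.default_rng(0)
# Stand-in data: 2 snapshots (instead of 100) with 3 velocity components.
snapshots = rng.standard_normal((2, 1024, 1024, 3)).astype(np.float32)
tiles = np.concatenate([split_into_tiles(s) for s in snapshots])
rng.shuffle(tiles)  # shuffle the sub-snapshots along the first axis
print(tiles.shape)  # 64 tiles per snapshot
```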

FIG. 3. Instantaneous spanwise (a) vorticity field and (b) velocity field for the case of forced isotropic turbulence.

FIG. 4. Probability density function plot of the velocity gradient field for the case of isotropic turbulent flow.
The verification uses turbulent channel flow at $Re_\tau$ = 180 and $Re_\tau$ = 550. To establish a baseline for comparison, the channel flow snapshots were compressed and reconstructed using a CNN-based autoencoder (CNN-AE) with an architecture similar to the ST model, in which convolutional layers, downsampling, and upsampling are used instead of SwinT-blocks, patch-merging, and patch-splitting. Both the ST model and the CNN-AE have the same CR of 64 and the same hyperparameters. In addition, this section evaluates the performance of the ST model at different CR to verify the robustness of the model.

Figures 6 and 7 display the instantaneous streamwise velocity field ($u^+$) and vorticity field ($\omega_x^+$) of the DNS and ST-decompressed results for three different time steps at $Re_\tau$ = 180 and $Re_\tau$ = 550, respectively. The ST model successfully compresses and decompresses the flow data at $Re_\tau$ = 180, yielding results that are consistent with the DNS data. At $Re_\tau$ = 550, there are some visual disparities in the decompressed turbulent channel flow, particularly in the representation of small-scale structures, while the dominant flow features and flow patterns are well preserved.

FIG. 8. Turbulent statistics for the turbulent channel flow at (a) $Re_\tau$ = 180 and (b) $Re_\tau$ = 550. Mean streamwise velocity profile (left), r.m.s. profiles for the three velocity components (middle), and Reynolds shear stress profile (right).

FIG. 9. Probability density function plots of the three velocity components (streamwise velocity on the left, wall-normal velocity in the middle, and spanwise velocity on the right) as a function of the wall-normal distance for the turbulent channel flow at (a) $Re_\tau$ = 180 and (b) $Re_\tau$ = 550. Shaded contours indicate the p.d.f. from the DNS data, while black contours and grey contours represent reconstruction results of the ST model and the CNN-AE, respectively. The contour levels are 20%, 40%, 60% and 80% of the maximum p.d.f.

Figure 11 presents the $L_2$-norm relative error of the reconstructed flow at (a) $Re_\tau$ = 180 and (b) $Re_\tau$ = 550. The ST model achieves lower errors than the CNN-AE at both Reynolds numbers for the same CR, which further confirms that the ST model outperforms the CNN-AE.

FIG. 10. Premultiplied spanwise wavenumber energy spectra of the three velocity components (streamwise velocity on the left, wall-normal velocity in the middle, and spanwise velocity on the right) as a function of the wall-normal distance and the spanwise wavelength for the turbulent channel flow at (a) $Re_\tau$ = 180 and (b) $Re_\tau$ = 550. Shaded contours show DNS data, while black and grey contours represent reconstruction results of the ST model and the CNN-AE, respectively. The contour levels are set at 10% increments, ranging from 10% to 90% of the maximum premultiplied spanwise wavenumber energy spectra.
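A premultiplied spanwise spectrum of the kind plotted in Figure 10 can be estimated with a discrete Fourier transform along z. The sketch below is one common way to do this and is not necessarily the procedure used in this study: it assumes velocity fluctuations on a uniform, periodic spanwise grid stored as (snapshots, Ny, Nz), uses a one-sided spectrum normalized so that its integral over $k_z$ equals the variance, and leaves out the scaling to wall units. The function name and the test signal are hypothetical.

```python
import numpy as np


def premultiplied_spanwise_spectrum(fluct: np.ndarray, Lz: float):
    """Return (kz, kz * phi) for fluctuations of shape (n_snap, Ny, Nz).

    phi has one row per wall-normal position; the kz = 0 mode is dropped.
    """
    n_snap, ny, nz = fluct.shape
    coeffs = np.fft.rfft(fluct, axis=-1) / nz       # Fourier coefficients in z
    energy = np.mean(np.abs(coeffs) ** 2, axis=0)   # average over snapshots
    energy[:, 1:] *= 2.0                            # fold negative wavenumbers
    if nz % 2 == 0:
        energy[:, -1] /= 2.0                        # Nyquist mode is not doubled
    dk = 2.0 * np.pi / Lz
    kz = dk * np.arange(energy.shape[1])
    phi = energy / dk                               # spectral density
    return kz[1:], kz[1:] * phi[:, 1:]


# Toy check: a single cosine with spanwise wavenumber 3.
Lz = 2.0 * np.pi  # hypothetical domain width
z = np.linspace(0.0, Lz, 64, endpoint=False)
demo = np.cos(3.0 * z)[None, None, :].repeat(2, axis=1)  # shape (1, 2, 64)
kz, k_phi = premultiplied_spanwise_spectrum(demo, Lz)
print(kz[np.argmax(k_phi[0])])
```

Plotting `k_phi` against $y^+$ and the wavelength $2\pi/k_z$ for the DNS and the decompressed fields would give contours comparable to those in the figure.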

FIG. 11. Relative $L_2$-norm error of the decompressed velocity fields at (a) $Re_\tau$ = 180 and (b) $Re_\tau$ = 550. Cases 1, 2, and 3 correspond to the decompressed results from the ST model with CR = 16, 64, and 256, while Case 4 represents the decompressed results from the CNN-AE with CR = 64.
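The error metric in Figure 11 can be sketched as follows. The exact definition is not reproduced in the text, so this example assumes a common form, the average over the I test snapshots of $\|\hat{\xi}_i - \xi_i\|_2 / \|\xi_i\|_2$; the function name and the array layout (snapshot index first) are likewise assumptions.

```python
import numpy as np


def relative_l2_error(decompressed: np.ndarray, reference: np.ndarray) -> float:
    """Mean over snapshots of ||xi_hat_i - xi_i||_2 / ||xi_i||_2.

    Both arrays have shape (I, ...), with the snapshot index first.
    """
    n_snap = reference.shape[0]
    diff = (decompressed - reference).reshape(n_snap, -1)
    ref = reference.reshape(n_snap, -1)
    ratios = np.linalg.norm(diff, axis=1) / np.linalg.norm(ref, axis=1)
    return float(np.mean(ratios))


# Toy check: a uniform 10% overshoot gives a relative error of 0.1.
reference = np.ones((4, 8, 8))
decompressed = 1.1 * reference
print(relative_l2_error(decompressed, reference))
```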

TABLE II. Simulation parameters of turbulent channel flow at $Re_\tau$ = 180 and 550. Here, L is the domain dimension and N is the number of grid points. The superscript "+" denotes that the quantity is made dimensionless by using $u_\tau$ and ν. $\Delta y^+_w$ refers to the grid spacing near the wall and $\Delta y^+_c$ refers to the spacing in the center of the channel.
Here, t, ρ, p, and ν are time, density, pressure, and kinematic viscosity, respectively. The open-source computational fluid dynamics (CFD) finite-volume code OpenFOAM-5.0x is used to perform the simulations. The simulation parameters for each friction Reynolds number are shown in Table II. The streamwise and spanwise directions are subject to periodic boundary conditions.