Developing data-driven subgrid-scale (SGS) models for large eddy simulations (LESs) has received substantial attention recently. Despite some success, particularly in *a priori* (offline) tests, challenges remain, including numerical instabilities in *a posteriori* (online) tests and generalization (i.e., extrapolation) of trained data-driven SGS models, for example, to higher Reynolds numbers. Here, using stochastically forced Burgers turbulence as the test-bed, we show that deep neural networks trained using properly pre-conditioned (augmented) data yield stable and accurate *a posteriori* LES models. Furthermore, we show that transfer learning enables accurate/stable generalization to a flow with $10\times$ higher Reynolds number.

Due to its high computational cost, direct numerical simulation (DNS) of turbulent flows will remain out of reach for many real-world applications in the foreseeable future. As a result, the need for parameterization of subgrid-scale (SGS) processes in coarse-resolution models such as large eddy simulation (LES) continues in various areas of science and engineering.^{1,2} In recent years, there has been substantial interest in applications of deep learning for data-driven modeling of turbulent flows,^{3–13} including for developing data-driven SGS parameterization (DDP) models.^{14–27} In many of these studies, the goal is to learn the relationship between the filtered variables and SGS terms in high-fidelity data (e.g., DNS data) and to use this DDP model in LES. *A priori* tests in some of these studies^{17–19,25} have shown that such a non-parametric approach can yield DDP models that capture important physical processes (e.g., energy backscatter^{28,29}) beyond the simple diffusion process represented in canonical physics-based SGS models such as Smagorinsky and dynamic Smagorinsky (DSMAG).^{30–32} However, these studies have also reported that *a posteriori* (i.e., online) LES tests, in which the DDP model is coupled to a coarse-resolution Navier-Stokes solver, show numerical instabilities or lead to physically unrealistic flows.^{17–19,25,26} As a remedy, *ad hoc* post-processing steps of the DDP models' outputs are often introduced, e.g., to remove backscattering or to attenuate the SGS feedback into the numerical solver. Such post-processing steps usually diminish much of the advantage gained from using deep learning. As a result, numerical instabilities remain a major obstacle to broadening the applications of LES with DDP models.

Another major concern with DDP models is their (in)ability to accurately generalize beyond the flow they are trained on, particularly to flows with higher Reynolds numbers (*Re*); such extrapolations are known to be challenging for neural networks.^{23,33} Some degree of generalization is essential for building robust and trustworthy LES models with DDP. Furthermore, given that high-fidelity data from often-expensive simulations (e.g., DNS) are needed to train DDP models, some capability to extrapolate to higher *Re* makes DDP models much more practically useful.

In this paper, with a particular focus on the issues of stability and generalization, we use a deep artificial neural network (ANN) to develop a DDP model for stochastically forced Burgers turbulence. The forced Burgers equation is^{34}

$$\frac{\partial u}{\partial t} + \frac{1}{2} \frac{\partial u^2}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2} + F, \qquad (1)$$
where *u* is velocity, $\nu = 1/Re$, and *F* is a stochastic forcing (defined later). The domain is periodic with length *L*. Despite being one-dimensional, the presence of strongly nonlinear local regions in the form of shocks, often multiple shocks [Fig. 1(a)], makes Burgers turbulence a complex and challenging system, which has been used as the test-bed in various SGS and reduced-order modeling studies.^{34–40} *F*(*x*, *t*) is defined as^{34}

$$F(x,t) = \sqrt{\frac{2A}{\Delta t}} \sum_{k} \left[ \alpha_k \cos\!\left(\frac{2\pi k x}{L}\right) + \Phi_k \sin\!\left(\frac{2\pi k x}{L}\right) \right], \qquad (2)$$
where *k*, $\Delta t$, and *A* are the wavenumber, time step, and forcing amplitude, respectively. $\Phi_k$ and $\alpha_k$ are real, random numbers. To develop the LES model, we spatially filter Eq. (1) to obtain

$$\frac{\partial \bar{u}}{\partial t} + \frac{1}{2} \frac{\partial \bar{u}^2}{\partial x} = \nu \frac{\partial^2 \bar{u}}{\partial x^2} + \bar{F} + \Pi, \qquad (3)$$

with SGS term

$$\Pi = \frac{1}{2} \left( \frac{\partial \bar{u}^2}{\partial x} - \frac{\partial \overline{u^2}}{\partial x} \right). \qquad (4)$$
Here, we use a box filter;^{1} explorations with Gaussian and sharp spectral filters yield the same findings and conclusions. Overbars indicate filtered (and coarse-grained to LES resolution) variables. Note that the difference between *F* and $\bar{F}$ is negligible. Our aim is to train an ANN to learn Π as a function of $\bar{u}$ in the DNS data and then use this DDP model as a closure in Eq. (3).
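
The filtering and SGS-term computation above can be sketched as follows. This is a minimal *a priori* illustration with assumed grid sizes and helper names (not the authors' code): a DNS snapshot is box-filtered and coarse-grained to the LES grid, and the SGS term Π defined above is evaluated with spectral derivatives.

```python
import numpy as np

L = 100.0
N_dns, N_les = 1024, 128
r = N_dns // N_les                            # coarse-graining ratio (8x)

def box_filter(f, width):
    """Local (box) average over `width` DNS points, periodic in x."""
    kernel = np.zeros_like(f)
    kernel[:width] = 1.0 / width
    kernel = np.roll(kernel, -(width // 2))   # center the box at x = 0
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(kernel)))

def ddx(f, L):
    """Spectral derivative on a periodic grid of length L."""
    ik = 2j * np.pi * np.fft.fftfreq(f.size, d=L / f.size)
    return np.real(np.fft.ifft(ik * np.fft.fft(f)))

rng = np.random.default_rng(0)
u = rng.standard_normal(N_dns)                # stand-in for one DNS snapshot
ubar = box_filter(u, r)[::r]                  # filtered, coarse-grained velocity
u2bar = box_filter(u * u, r)[::r]             # filtered u^2
Pi = 0.5 * (ddx(ubar**2, L) - ddx(u2bar, L))  # SGS term on the LES grid
print(ubar.shape, Pi.shape)                   # (128,) (128,)
```

Any sufficiently smooth low-pass filter would serve the same role here; the periodic FFT-based convolution is used only for brevity.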

We define a setup, referred to as “control” and indicated with subscripts “*c*,” with the following parameters (identical to those used by Dolaptchiev *et al.*^{34}): *L* = 100, $\nu = 0.02$, and $A = 2/100$. $\Phi_k$ and $\alpha_k$ are drawn randomly from $N(0,1)$ every $20\Delta t$ to update *F*. To obtain the DNS data, which are treated as the “truth,” Eq. (1) is integrated using a pseudo-spectral solver with 1024 Fourier modes and time step $\Delta t = 0.01$. (The second-order Adams-Bashforth and Crank-Nicolson methods are used for time integration.) Figure 1 shows a sample profile of *u*(*x*) as well as the kinetic energy (KE) spectrum and power spectral density (PSD) of the flow. To perform LES, Eq. (3) with the DDP model of $\Pi(\bar{u})$ is integrated using the same pseudo-spectral solver but with 128 Fourier modes and time step $20\Delta t$. At this spatial resolution, the filtered velocity field accounts for $\sim 81\%$ of the total KE of the flow (from DNS), conforming with the commonly used ratio.^{1} Also, note that the spatial and temporal resolutions of the LES solver are, respectively, $8\times$ and $20\times$ lower than those of the DNS solver, substantially reducing the computational cost.
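
A minimal sketch of one such time step is given below: pseudo-spectral in space, second-order Adams-Bashforth for the nonlinear term, and Crank-Nicolson for the viscous term. The stochastic forcing and the exact solver details are omitted; the variable names and the smooth initial condition are illustrative.

```python
import numpy as np

N, L, nu, dt = 1024, 100.0, 0.02, 0.01
ik = 2j * np.pi * np.fft.fftfreq(N, d=L / N)    # i*k for spectral d/dx

def nonlinear(uhat):
    """Spectral form of the advection term -(1/2) d(u^2)/dx."""
    u = np.real(np.fft.ifft(uhat))
    return -0.5 * ik * np.fft.fft(u * u)

def step(uhat, N_prev):
    """One AB2 (advection) / Crank-Nicolson (diffusion) step."""
    N_cur = nonlinear(uhat)
    rhs = (1 + 0.5 * dt * nu * ik**2) * uhat + dt * (1.5 * N_cur - 0.5 * N_prev)
    return rhs / (1 - 0.5 * dt * nu * ik**2), N_cur

# integrate a smooth, unforced initial condition for 100 steps
uhat = np.fft.fft(np.sin(3 * 2 * np.pi * np.arange(N) / N))
N_prev = nonlinear(uhat)                         # first step reduces to Euler
for _ in range(100):
    uhat, N_prev = step(uhat, N_prev)
u = np.real(np.fft.ifft(uhat))
print(u.shape)                                   # (1024,)
```

The implicit treatment of diffusion removes the stiff viscous stability limit, which is the usual motivation for the Crank-Nicolson choice.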

The schematic of LES with DDP is shown in Fig. 2(a). Next, we present the details of the ANN and the training data/procedure. We use a multilayer perceptron ANN^{41} to develop the DDP model. This ANN is unidirectional (information only passes in one direction, from input to output) and is fully connected between the layers. The ANN is trained, i.e., all learnable parameters of the network (weights and biases, collectively represented by *θ*) are computed, by minimizing the mean squared error $MSE = \frac{1}{M} \sum_{i=1}^{M} \| \mathrm{ANN}(\tilde{\bar{u}}_i; \theta) - \tilde{\Pi}_i \|_2^2$. Here, *M* is the number of training samples, $\| \cdot \|_2$ is the $L_2$ norm, $\bar{u}$ and Π are calculated from DNS data, and $\tilde{\cdot}$ indicates pre-conditioned (augmented) training data (discussed shortly). The best network architecture, found through extensive trial and error using the *MSE*, consists of an input layer, six hidden layers with 250 nodes each, and a linear output layer. The swish activation function^{42} is used on all but the final layer. Overall, the ANN has 394 640 trainable parameters.
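
The forward pass of such a network can be sketched in plain NumPy: six fully connected hidden layers of 250 nodes with swish activations and a linear output layer. The input/output sizes (matching the 128-point LES grid) and the initialization are assumptions for illustration; the actual framework and training loop are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [128] + [250] * 6 + [128]             # input, six hidden layers, output
params = [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def swish(z):
    return z / (1.0 + np.exp(-z))             # swish(z) = z * sigmoid(z)

def ann(ubar_tilde, theta):
    a = ubar_tilde
    for W, b in theta[:-1]:
        a = swish(a @ W + b)
    W, b = theta[-1]                          # linear output layer
    return a @ W + b

Pi_pred = ann(rng.standard_normal(128), params)
print(Pi_pred.shape)                          # (128,)
```

In practice, *θ* would be fitted by minimizing the *MSE* above with a stochastic-gradient optimizer, which is omitted here.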

Our first attempts to train the DDP model with $M = O(10^5)$ samples resulted in inaccurate Π terms in *a priori* tests and unstable LES with DDP in *a posteriori* tests. Further analysis showed that the problem stems from the fact that the SGS dynamics, and thus the Π terms, in Burgers turbulence are highly localized around the shocks,^{37} which, as explained below, leads to overfitting, i.e., poor generalization of the ANN (at the same *Re*) beyond the training set. Shocks are persistent and can remain fairly stationary for many time steps, which can leave small or near-zero Π terms in regions of the domain that do not experience shocks throughout the training set. An ANN trained on such a dataset will predict $\Pi \approx 0$ in those regions no matter what $\bar{u}$ is input during (*a priori* or *a posteriori*) tests. Note that by design, the flow during training can be very different, in terms of the location of shocks and their evolution, from the flow during testing. (Though the training and testing sets have the same *Re*, the latter is chosen from an independent DNS run or from a time window far from that of the training set.) Of course, this overfitting problem could be resolved by using a much larger training set that contains a sufficient number of samples of shock waves occurring in all regions of the domain; however, such large training sets are often unavailable. Here, we propose a simple strategy, based on pre-conditioning the training samples, to overcome this problem without the need for a larger dataset.

As shown in Fig. 2(b), a random shift *η*, drawn from the uniform distribution $U(0,L)$, is added to *x* for each input-output pair $(\bar{u}, \Pi)$:

$$\tilde{\bar{u}}(x) = \bar{u}(x - \eta), \qquad \tilde{\Pi}(x) = \Pi(x - \eta). \qquad (5)$$
The periodicity in *x* is used when $x - \eta < 0$. Note that artificially enriching the training set in this way is common in the machine learning community and is called data augmentation.^{43} For example, in the processing of natural images, data augmentation typically involves enlarging the training set by rotating, mirroring, or cropping images. Here, we have exploited the periodicity of *x* to introduce a physically meaningful augmentation, which enriches the information about the localized flow and SGS terms around shock waves in the training set without the need for a longer DNS dataset. Finally, as is common practice in machine learning, the input $\tilde{\bar{u}}$ and output $\tilde{\Pi}$ samples are separately normalized (by removing the mean and dividing by the standard deviation).
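
On the discrete periodic grid, this augmentation reduces to a circular shift by a random integer number of grid points, which `np.roll` applies with automatic wrap-around. The sketch below (function names illustrative; normalization shown per field for brevity) shifts an input-output pair by the same offset so the $\bar{u}$–Π correspondence is preserved.

```python
import numpy as np

rng = np.random.default_rng(1)

def augment(ubar, Pi):
    """Circularly shift an input-output pair by the same random offset."""
    shift = int(rng.integers(ubar.size))      # eta ~ U(0, L) on the grid
    return np.roll(ubar, shift), np.roll(Pi, shift)

def normalize(f):
    """Remove the mean and divide by the standard deviation."""
    return (f - f.mean()) / f.std()

x = 2 * np.pi * np.arange(128) / 128
ubar, Pi = np.sin(x), np.cos(x)               # toy input-output pair
ubar_t, Pi_t = augment(ubar, Pi)
ubar_n = normalize(ubar_t)                    # final pre-processing step
print(ubar_t.shape)                           # (128,)
```

Because the shift only relabels grid points, the augmented samples contain exactly the same physics as the originals, merely relocated within the domain.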

The pre-conditioned input-output pairs $(\tilde{\bar{u}}, \tilde{\Pi})$ are used to train the ANN. As shown next, the DDP model with an ANN trained using augmented data leads to accurate Π terms in *a priori* tests and stable and accurate LES models in *a posteriori* tests, without the need for any post-processing of the trained ANN or its output [with the exception of de-normalizing the predicted Π; see Fig. 2(a)]. We have used $M = 5\times 10^5$ samples for training and another (independent) $5\times 10^4$ samples for validation from a DNS run at *Re* = *Re _{c}*. For testing, we have used data from the same run but separated by $5\times 10^4 \Delta t$ from the training/validation sets, as well as data from two other independent DNS runs at *Re* = *Re _{c}*.

We examine the performance of the LES with DDP in *a posteriori* (online) tests to assess both the accuracy (of the SGS modeling) and the stability of the hybrid model. Given that the numerical solution of Eq. (3) blows up without any SGS modeling (i.e., with Π = 0), we use a conventional SGS scheme, DSMAG,^{44} as the baseline. Figures 3(a) and 3(b) show the spectrum and the probability density function (PDF) of the Π terms predicted by DDP and DSMAG, compared against those of the filtered DNS (FDNS), which is treated as the truth. Both panels show that the statistics of Π predicted by DDP closely follow those of the truth at any *k* and even at the tails of the PDF. Furthermore, both panels show that DDP outperforms DSMAG in modeling the statistics of the SGS term (Π). The better performance of DDP is clearly seen at high and low *k* in (a) and beyond ±1 standard deviation in (b). Note that the difference between the PDFs of Π from FDNS and DSMAG (DDP) is (is not) statistically significant at the 95% confidence level based on both the Kolmogorov-Smirnov (KS) test and the Kullback-Leibler (KL) divergence.^{45}

To examine the statistics of the resolved flow, Figs. 3(c) and 3(d) show the spectrum of KE and the PDF of $\bar{u}$. Both LES with DDP and LES with DSMAG capture the KE spectrum up to near the maximum resolved *k* ($=64$), although DDP does slightly better: it agrees with the KE spectrum of the FDNS up to $k \approx 60$, while DSMAG does so up to $k \approx 50$. Furthermore, as shown in panel (d), LES with DDP outperforms LES with DSMAG in capturing the PDF's tails, which correspond to shocks. Note that the differences between the PDFs of DDP, FDNS, and DSMAG are not statistically significant (at the 95% confidence level) based on the KS or KL test, but that is because such tests mainly assess similarities in the bulk rather than the tails of the PDFs. A closer visual inspection shows that the difference between the tails of the PDFs from FDNS and DDP (DSMAG) is within (outside) the uncertainty range, indicating that DDP (DSMAG) accurately captures (does not capture) the statistics of the rare events.
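
The two diagnostics used in these comparisons can be sketched as follows: a kinetic-energy spectrum from the FFT of the velocity and an empirical (histogram-based) PDF. The binning and normalization conventions here are illustrative assumptions, not the authors' exact post-processing.

```python
import numpy as np

def ke_spectrum(u):
    """E(k) = 0.5 |u_hat(k)|^2 with the FFT normalized by N."""
    uhat = np.fft.rfft(u) / u.size
    return 0.5 * np.abs(uhat) ** 2

def pdf(samples, bins=50):
    """Empirical PDF: bin centers and densities."""
    hist, edges = np.histogram(samples, bins=bins, density=True)
    return 0.5 * (edges[:-1] + edges[1:]), hist

x = np.linspace(0, 2 * np.pi, 128, endpoint=False)
ubar = np.sin(3 * x)                          # toy field with energy at k = 3
E = ke_spectrum(ubar)
centers, density = pdf(ubar)
print(int(np.argmax(E)))                      # 3
```

For the tail comparison in panel (d), long time series of $\bar{u}$ would be accumulated before histogramming so that rare, shock-related values are sampled.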

So far, we have discussed the results with an LES resolution of 128 Fourier modes, which, as mentioned before, conforms with the commonly used criterion for LES resolution based on the KE of the filtered flow. With a lower resolution of 96 modes, the DDP model still leads to a stable LES that outperforms LES with DSMAG. Further lowering the resolution to 64 modes leads to an unstable LES with DDP. While LES with DSMAG remains stable at this resolution (due to the purely diffusive nature of this SGS model), its accuracy is poor. At this resolution, the $\bar{u}$ field accounts for only $\sim 40\%$ of the total KE, and LES (with any SGS model) is not expected to be used at such low resolutions.

In summary, the DDP model that uses an ANN trained with augmented data (from *Re* = *Re _{c}*) leads to a stable LES model (at reasonably high resolution) in *a posteriori* tests (at *Re* = *Re _{c}*) that is more accurate than LES with DSMAG. Next, we examine whether a DDP model trained with augmented data from a given *Re* can be used for LES of a flow that has a higher *Re*.

Figure 4 shows the statistics of the resolved flow and of Π calculated using results from *a posteriori* tests at *Re* = *Re _{c}* but with a DDP model that uses an ANN trained on data from $Re = Re_c/10$ (see the dashed blue lines). It is clear that this DDP model *does not* generalize, as the spectrum and PDF of Π and the spectrum of KE all deviate from those of the FDNS. These results are not surprising, as data-driven models are known to often have difficulty generalizing to a different (especially more complex) system. For example, using a multi-scale Lorenz 96 system, we^{23} showed that ANN- and recurrent neural network-based data-driven SGS models do not accurately generalize when the system is forced to become more chaotic. However, we also showed that transfer learning (TL)^{47} provides an effective way to address this challenge, at least for a simple chaotic toy model. Below, we show the effectiveness of TL in making DDP generalizable to higher *Re* in a turbulent flow.

Figure 5 shows the schematic of TL applied to the ANN of a DDP model. In general, the weights of an ANN are randomly initialized and then updated through training on *M* samples from a given data distribution (here, data from a flow with $Re = Re_c/10$). The test in Fig. 4 showed that this ANN does not work accurately for *Re* = *Re _{c}*. The idea of TL is to re-train this ANN (starting with its current weights rather than random initializations) and update the weights only in the deeper layers, using a smaller number of samples (e.g., $M_{TL} = M/10$) from the new data distribution (i.e., the flow with *Re* = *Re _{c}*). The underlying rationale is that in deep networks, the initial layers learn general features of the data, while the deeper layers learn features that are specific to a particular data distribution.^{47} Thus, for generalization, we only need to re-train the deeper layers, which can be done using a small amount of data from the new distribution.
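
This re-training step can be sketched with a toy example: the initial layer is kept frozen and only the last layer is updated by gradient descent on a small set of samples from the new distribution. A two-layer linear net stands in for the full six-hidden-layer ANN; the sizes, learning rate, and synthetic data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.standard_normal((8, 16)) * 0.3       # frozen (trained at Re_c / 10)
W2 = rng.standard_normal((16, 8)) * 0.3       # re-trained at Re = Re_c

X = rng.standard_normal((64, 8))              # small M_TL of new samples
Y = X @ (rng.standard_normal((8, 8)) * 0.5)   # stand-in targets at the new Re

mse0 = float(np.mean((X @ W1 @ W2 - Y) ** 2))
for _ in range(2000):                         # update only W2; W1 stays fixed
    H = X @ W1
    grad_W2 = H.T @ (H @ W2 - Y) / X.shape[0]
    W2 -= 0.05 * grad_W2
mse1 = float(np.mean((X @ W1 @ W2 - Y) ** 2))
print(mse1 < mse0)                            # True
```

In a deep-learning framework, the same idea is expressed by marking the initial layers as non-trainable before resuming optimization on the $M_{TL}$ new samples.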

Figure 4 shows that the DDP model with TL (solid blue lines) accurately generalizes to the flow with *Re* = *Re _{c}*, as the spectrum and PDF of Π and the spectrum of KE closely match those of the FDNS. In fact, the accuracy of the DDP model with TL in Fig. 4 (which uses only $M_{TL} = 5\times 10^4$ training samples from *Re* = *Re _{c}*) is comparable with the accuracy of the DDP model in Fig. 3 (which uses $M = 5\times 10^5$ training samples from *Re* = *Re _{c}*). Furthermore, Fig. 6 shows how gradually increasing $M_{TL}$ improves the generalization capability of the DDP model. Finally, a figure in the supplementary material further demonstrates the effect of the number of re-trained layers (as well as $M_{TL}$), showing that at large enough $M_{TL}$, re-training more than one layer yields accurate generalization.

In conclusion, we have investigated ANN-based data-driven SGS modeling of Burgers turbulence, with a particular focus on the stability of *a posteriori* LES models and generalization to higher *Re*. We show that developing a DDP model for Burgers turbulence is particularly challenging due to the presence of shocks, which localize the SGS term (Π), resulting in ANNs that overfit in the absence of a large training set. Such overfitted ANNs lead to inaccurate/unstable DDP models. To overcome this challenge, we introduce a pre-conditioning step in which, exploiting periodicity, training samples are randomly shifted, thus enriching and augmenting the training set. The DDP model trained on this augmented dataset leads to stable and accurate *a posteriori* LES models. These results suggest that similar data augmentation strategies that exploit symmetries, invariances, and other physical properties (see Xie *et al.*,^{48} Pan and Duraisamy,^{49} and Formentin *et al.*^{50} for more examples) should be considered in developing DDP models for more complex flows when large training sets are unavailable, not only to improve accuracy but also to improve the stability of *a posteriori* LES runs.

We have also shown that the DDP model does not generalize (i.e., extrapolate) to a flow with $10\times$ higher *Re*. However, we demonstrate, for the first time to the best of our knowledge, the application of TL to make a DDP model generalizable in a turbulent flow. Transfer learning enables the development of DDP models for high-*Re* flows with most of the training data provided by high-fidelity simulations at lower *Re*, which is highly appealing for practical purposes because the computational cost of simulating turbulent flows rapidly increases with *Re*.

In future work, we will investigate the application of TL and data augmentation to developing accurate, stable, and generalizable DDP models for more complex 2D and 3D turbulent flows.

See the supplementary material for a figure showing the effects of the number of re-trained layers and the number of samples used for re-training on the generalization accuracy.

We thank Romit Maulik, Rambod Mojgani, and Ebrahim Nabizadeh for insightful discussions. We are grateful to two anonymous reviewers for helpful comments and suggestions that improved the quality of the manuscript. This work was supported by an award from the ONR Young Investigator Program, N00014-20-1-2722, and by NSF Grant No. OAC-2005123 (to P.H.). A.C. thanks the Rice University Ken Kennedy Institute for Information Technology for a BP HPC Graduate Fellowship. Computational resources were provided by NSF XSEDE (Allocation No. ATM170020) and by the Rice University Center for Research Computing.

## DATA AVAILABILITY

The data that support the findings of this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.4316338 and in GitHub at https://github.com/envfluids/Burgers_DDP_and_TL.