We study the 2D Orszag–Tang vortex magnetohydrodynamics (MHD) problem through the use of physics-constrained convolutional neural networks (PCNNs) for forecasting the density, *ρ*, and the magnetic field, B, as well as the prediction of B given the velocity field v of the fluid. In addition to translation equivariance from the convolutional architecture, other physics constraints were embedded: absence of magnetic monopoles, non-negativity of *ρ*, use of only relevant variables, and the periodic boundary conditions of the problem. The use of only relevant variables and the hard constraint of non-negative *ρ* were found to facilitate learning greatly. The divergenceless condition $ \u2207 \xb7 B = 0$ was implemented as a hard constraint up to machine precision through the use of a magnetic potential to define $ B = \u2207 \xd7 A$. Residual networks and data augmentation were also used to improve performance. This allowed for some of the residual models to function as surrogate models and provide reasonably accurate simulations. For the prediction task, the PCNNs were evaluated against a physics-informed neural network, which had the ideal MHD induction equation as a soft constraint. Several models were able to generate highly accurate fields, which are visually almost indistinguishable and have low mean squared error. Only methods with built-in hard constraints produced physical fields with $ \u2207 \xb7 B = 0$. The use of PCNNs for MHD has the potential to produce physically consistent real-time simulations to serve as virtual diagnostics in cases where inferences must be made with limited observables.

## I. INTRODUCTION

Magnetohydrodynamics (MHD) describes the low-frequency dynamics of a plasma, which is modeled as an electrically conducting fluid influenced by electromagnetic fields.^{1} MHD is used to describe a wide range of systems including turbulence in plasmas,^{2,3} quantum plasmas,^{4} Tokamaks,^{5–9} plasma detachment in magnetic nozzles,^{10} liquid metals,^{11,12} and intense ion beams in particle accelerators.^{13} MHD simulations are complex, describing systems with many evolving nonlinear fields and can be computationally expensive.

In this work, we utilize physics constrained neural networks (PCNNs) via a novel neural operator approach for building hard physics constraints into convolutional neural networks (CNNs),^{14} for the study of MHD. We focus on the Orszag–Tang vortex,^{15} a widely used test case for 2D MHD simulations.^{16–18} We generate synthetic MHD data using physics simulations and test various neural network-based methods for creating surrogate models of the MHD simulation, which are able to quickly predict the evolution of the data, such as using magnetic field and charge density values at one time step to predict those fields at subsequent time steps. We also test incorporating neural network-based methods as solvers within the physics model to speed up computationally expensive parts of the MHD simulation.

The use of PCNNs has the potential to provide physically consistent and accurate surrogates of MHD simulations, which run orders of magnitude faster than traditional approaches. Complex plasma processes could benefit from real-time data processing^{19} and real-time data analysis.^{20} The PCNN-based methods studied in this work, although not as accurate as physics models, are incredibly fast and therefore may be useful for real-time applications such as finding approximate solutions to inverse problems for operational systems with limited direct observations by providing a good initial guess, which is refined by computationally expensive physics models. Such ML-based methods, combined with robust model-independent control algorithms,^{21} as was demonstrated for intense electron beams in the LCLS free electron laser,^{22} have the potential to provide real-time feedback control for complex plasma physics systems.

### A. Ideal MHD and the Orszag–Tang vortex

^{1}Mocz

*et al.*,

^{17}Jackson,

^{23}and Mocz

*et al.*

^{24}

^{,}

*p*. The variables

*ρ*,

*p*,

**v**, and

**B**are the mass density, total pressure, velocity, and magnetic fields, respectively, while

^{25,26}it is common in 2D MHD to take it as a pseudo-vector in order to have a better analogy to the 3D MHD equations and we do so here (see the Appendix A for details of generalizing the ideal 3D MHD equations to arbitrary dimensions).

For what follows, we consider a monatomic, perfect gas.^{27} In this case, the equation of state becomes $ p fluid = ( \gamma \u2212 1 ) \rho u$, where *u* is the gas's thermal energy density per mass and *γ* is the gas adiabatic index. We follow the example of previous works^{16,17} and let $ \gamma = 5 3$ to have better analogy to the 3D case.

*ρ*and

*p*are initially constant. We use the initial conditions at

*t*= 0 of

For this work, Orszag–Tang vortex data were generated using a physics simulation that utilizes a finite volume method together with constrained transport on a static mesh as described in Refs. 17 and 28. The physics code is used to generate large data sets, which include the temporal evolution of the fields $ ( \rho , v , B , p )$ in Eqs. (1)–(4). We then study various neural network-based approaches for reproducing these fields and for capturing their dynamics, such as using field values at one time $ B ( x , y , t )$ to predict their values at a subsequent time step $ B ( x , y , t + \Delta t )$.

For most of this work, we utilize convolutional neural networks (CNN), which take 128 × 128 pixel images as inputs. CNNs are used for this image-based ML because they have been demonstrated to have the best performance for image processing and image generation. When generating a 2D image, such as the mass density $ \rho ( t , x , y )$ at time *t* for a 128 × 128 grid of values of (*x*, *y*), the CNN generates the entire image in one shot, and therefore, when referring to CNN-generated values such as $ \rho ( t , x , y )$, we simply write $ \rho ( t )$. In the one case where we use a dense neural network to map out individual values at individual (*x*, *y*) positions, we write out $ \rho ( t , x , y )$.

## II. SUMMARY OF RESULTS

For accurate predictions, one would expect low mean squared error (MSE) and high structural similarity (SSIM) (SSIM values are between −1 and 1 with a value of 1 for images that are exactly the same) between true and predicted images. In all of the results in this work, it was found that the MSE and SSIM were consistent; that is, as MSE increased, SSIM decreased and models with lower MSE also had higher SSIM values relative to each other. However, the sensitivity of MSE and SSIM was, as expected, very different. For example, Fig. 3 shows the true and predicted mass density *ρ* as predicted by several of the ML models at various time steps. Visually, the blue *ρ* predictions of all of the models look similar, but they clearly have very different levels of error as shown in the red sub-plots. In Fig. 4, this is numerically summarized by the MSE and SSIM. While all SSIM values remain within roughly 10% of each other, the MSE values differ by up to 1000%.

For the relatively simple problem of predicting mass density *ρ*, the networks having a simple positivity constraint of $ \rho \u2265 0$ enforced with a ReLU output function had the best performance. For the more challenging problem of predicting the magnetic field **B,** some trade-offs were found between how easy it was to train a network in terms of MSE or SSIM accuracy of the **B** field prediction vs constraint violation. Networks with no constraints or only soft constraints were able to minimize a given cost function faster (in fewer training steps) than those with hard physics constraints, but produced nonphysical fields. This result is not very surprising because unconstrained networks (including PINNs, networks where the constraint is soft as part of the cost function) are free to generate completely arbitrary functions over the largest space of possibilities. On the other hand, the networks with the hard physics constraint of $ \u2207 \xb7 B = 0$, which was strictly enforced by generating $ B = \u2207 \xd7 A$, were the most restricted in terms of the families of functions that they could generate. Of all of the approaches considered, only the networks with built-in hard constraints generated physical fields (within numerical precision).

## III. SIMULATED DATA

The simulated Orszag–Tang vortex data were generated using the finite volume method and constrained transport on a static mesh.^{17,28} The fluid is simulated over a unit square subject to spatial periodic boundary conditions. The procedure makes use of conserved quantities while also maintaining a discretization of the divergence-free condition, $ \u2207 \xb7 B = 0$. Furthermore, details of the simulation can be found in Appendix B.

Some models that appear later will mimic the constrained transport of the simulation; thus, we discuss briefly how it functions.

### A. Constrained transport

^{28}Each grid point represents the center of a cell, but now we utilize a staggered grid representation to also represent quantities that exist on the face centers and corners of cells. The constrained transport setup is shown in Fig. 1. The magnetic field is computed via the cell-face averaged magnetic field,

**b**, which is related to the cell volume-averaged magnetic field,

**B**, by

**b**are updated using the electric field at the corners of each cell in a way that preserves the following discretization of $ \u2207 \xb7 B$:

### B. Adding physics to neural networks

It is well known that deep neural networks can represent both functions and operators with arbitrary accuracy.^{29–31} Deep neural networks are now being utilized for a wide range of scientific research due to the availability of incredibly powerful high-performance computing,^{32} combined with their ability to represent complex functions over compact sets with the representation being learned directly from high dimensional data, such as large images.^{33} For example, in the field of plasma physics, deep neural networks have been applied to a wide range of problems,^{34,35} including the use of neural networks as surrogates for equations of state.^{36} Neural networks can take advantage of transfer learning for the design optimization of internal confinement fusion,^{37} and deep surrogate models can take advantage of intelligent iterative sampling of expensive simulations for faster training.^{38}

Several methods have been developed in an effort to incorporate physics within neural networks. The most general and most popular approach by far is that of physics informed neural networks (PINNs). In a PINNs approach, additional terms are added to the cost function used to train a neural network, typically in the form of a PDE, which describes the underlying physics such as electrodynamics^{39} and fluid mechanics.^{40}

Although very general, the main limitations of PINNs are a lack of any guarantee of respecting the physics as they use only soft constraints. A soft constraint is a condition the network attempts to fulfill. A hard constraint is one in which the NN automatically satisfies. Furthermore, in a PINNs approach, there is usually a trade-off between satisfying the physics constraint and in accurately generating the desired fields. For example, we consider the well-known physics constraint $ \u2207 \xb7 B = 0$. In a PINNs approach, a cost function term such as $ | | \u2207 \xb7 B | |$ is trivially satisfied by $ B = C$ for any arbitrary constant vector **C**, which has absolutely nothing to do with the true magnetic field.

Recently, work has been done on incorporating hard physics constraints within PCNNs, such as the approach applied here to satisfy Maxwell's equations for electrodynamics,^{14} methods have been developed for physics-constrained Gaussian processes that enforce thermodynamic consistency by limiting the space of functions that are sampled,^{41} and PCNNs have also been implemented for fluid mechanics.^{42,43} Although PCNNs have now been applied for electrodynamics and fluid mechanics separately, comparatively little has been done to study their use for MHD.^{44} We carry out a proof-of-principal study of applying PCNNs to the 2D Orszag–Tang Vortex for predicting both the fluid density and the associated magnetic fields.

## IV. PHYSICS-CONSTRAINED NEURAL NETWORKS

**B**, typical un-constrained approaches utilize supervised learning in which the neural network's predictions are compared to the correct answer via a cost function such as

*w*

_{2}penalizes the network for generating fields with non-zero divergence. With soft constraints, there is no guarantee that the physics will be respected, and as described above, there is an unfortunate trade-off between

*w*

_{1}and

*w*

_{2}as increasing

*w*

_{2}usually leads the network to predict incorrect constant field values, which trivially have zero divergence.

Another benefit of using CNNs is that the translational symmetry of the physics, formally spatial translational equivariance, can be approximated through the discrete translational symmetry of the CNN architecture.^{45,46} We can also impose relatively simple hard physics constraints, such as $ \rho \u2265 0$ by taking the activation function in the final layer of our network to be a ReLU [i.e., ReLU $ ( x ) = max ( 0 , x ) \u2265 0$]. For periodic boundary conditions, padding to enforce spatial periodicity can be used. A summary of physics constraints and how they are enforced in this paper can be found in Table I.

Constraint . | Implementation . | Hard/soft . |
---|---|---|

Divergence-free | $ B ( r , t ) = \u2207 \u22a5 A ( r , t )$ | Hard |

Translation equivariance | CNN architecture | Hard |

Non-negativity (e.g., $ \rho \u2265 0$) | Final layer ReLU | Hard |

Periodic BC's | Padding | Soft |

Partial differential equation | Term in cost function | Soft |

Constraint . | Implementation . | Hard/soft . |
---|---|---|

Divergence-free | $ B ( r , t ) = \u2207 \u22a5 A ( r , t )$ | Hard |

Translation equivariance | CNN architecture | Hard |

Non-negativity (e.g., $ \rho \u2265 0$) | Final layer ReLU | Hard |

Periodic BC's | Padding | Soft |

Partial differential equation | Term in cost function | Soft |

*D*at time

*t*to new values over that domain at $ t + \Delta t$,

*D*. When using NNs to map individual points to values of interest at those points, the NN is acting as a function that is trained over the entire domain

*D*,

The major advantages of using CNNs as operators that map entire images to other images are faster inference and greater generalizability. A trained CNN can map incredibly large images (or volumes) to their subsequent states within microseconds, whereas a function-approximating CNN would have to work its way through many individual points one batch at a time, which greatly slows down both inference and training. Because the CNN sees the entire global image, it can interpolate between various examples, whereas the function-approximating NN memorizes values at individual points and has to re-train for even small variations. The major advantages of using NNs for point-wise function approximation is auto-differentiation and smaller memory requirements because they can be relatively small dense networks that do not need to be able to handle large image or volume tensors.

### A. NN architecture and training

Most of our NNs have a convolutional encoder–decoder architecture. The baseline architecture along with a description of the layers and parameters used can be seen in Fig. 2. This setup is based on the well-known VGG-style approach, which has been established for deep networks, in which several sets of two convolutional layers with 3 × 3 filters are followed by a max pool layer that is used for the encoder network.^{47} For some models, modifications to the initial and final layers were made and will be discussed later.

*L*

_{2}regularization with a cost function of the form

*y*is the

_{i}*i*th data point for the variable being predicted, $ y \u0302 i$ the model prediction for the variable (we will use the hat $ \u0302$ over a variable to denote it is a prediction),

*w*is the

_{j}*j*th weight or bias, and

*λ*is a hyperparameter.

_{r}*N*and

_{d}*N*are the number of data points and weights, respectively. The first term in Eq. (15) measures how accurate the model predictions are, while the second term is acting as a

_{w}*L*

_{2}regularizer for the network's weights. In the Bayesian framework, it is possible to interpret the second term as coming from having a Gaussian prior on the weights and

*λ*being related to its variance. All the CNNs were created, trained, and tested using Tensorflow.

_{r}^{48}For all training sessions, the Adam optimizer was used.

^{49}Detailed model parameters, such as numbers of filters per layer, were chosen by quickly testing various setups on a small subset of the training data.

### B. Cases examined

We had the models perform both forecasting and predicting tasks. For forecasting, the task was given certain input fields at *t* forecast a certain output field at $ t + \Delta t$. First, this was done for *ρ* and then **B**. For predicting, we looked at given the **v** field predicts the **B** field at the same time step, after which the models' predictions were compared to that of a PINN.

A summary of networks used, constraints, inputs, outputs, and training data is given in Table III.

## V. RESULTS

For this study, several different datasets and ML approaches were used, which are summarized in Table II. Throughout the paper, to quantify the accuracy of our ML-based predictions, we utilized mean squared error (MSE) and structural similarity index (SSIM), as each has its own strengths and limitations. Our formulations of MSE and SSIM are discussed in more detail below. MSE is a popular metric for comparing numerical data, and MSE-based optimization problems are popular because of the many analytical results (such as least squares fits). A limitation of MSE is its sensitivity to small perturbations and to outliers, which becomes especially important for images. Regarding outliers, if even a single pixel of a generated high-resolution image is corrupted, while all other pixels are identical to the true values, a large MSE value will result. Regarding small perturbations, we consider a generated image that is identical to the true image, but the entire image is shifted by just a single pixel. While this difference may be unnoticed by the human eye, it will result in an incredibly high MSE.

Name . | Constraints . | Training data . | Inputs . | Outputs . |
---|---|---|---|---|

ρ-CNN-rel/all | none | 1 MHD | $ { \rho , p , B , v} /$ $ { \rho , v}$ images (t) | ρ images $ ( t + \Delta t )$ |

ρ-PCNN-rel/all | $ \rho = ReLU ( \u2009 y ) \u2265 0$ | 1 MHD | $ { \rho , p , B , v} /$ $ { \rho , v}$ images (t) | ρ images $ ( t + \Delta t )$ |

B-CNN-all | none | 1 MHD | $ { B , v}$ images | B images $ ( t + \Delta t )$ |

B-PCNN-all | $ \u2207 \xb7 B = \u2207 \xb7 ( \u2207 \xd7 A ) = 0$ | 1 MHD | $ { B , v}$ images (t) | A images $ ( t + \Delta t )$ |

B-CNN-res/res-rand | none | 1/50 MHD | $ { B , v}$ images (t) | $ \Delta B$ images $ ( t + \Delta t )$ |

B-PCNN-res/res-rand | $ \u2207 \xb7 B = \u2207 \xb7 ( \u2207 \xd7 A ) = 0$ | 1/50 MHD | $ { B , v}$ images (t) | $ \Delta A$ images $ ( t + \Delta t )$ |

$ B ( v )$-CNN | none | 1 MHD | $ { v}$ images (t) | B images (t) |

$ B ( v )$-PCNN | $ \u2207 \xb7 B = \u2207 \xb7 ( \u2207 \xd7 A ) = 0$ | 1 MHD | $ { v}$ images (t) | A images (t) |

$ B ( v )$-PINN | soft | 1 MHD | $ { x , y , v ( x , y )}$ points (t) | $ B ( x , y )$ points (t) |

Name . | Constraints . | Training data . | Inputs . | Outputs . |
---|---|---|---|---|

ρ-CNN-rel/all | none | 1 MHD | $ { \rho , p , B , v} /$ $ { \rho , v}$ images (t) | ρ images $ ( t + \Delta t )$ |

ρ-PCNN-rel/all | $ \rho = ReLU ( \u2009 y ) \u2265 0$ | 1 MHD | $ { \rho , p , B , v} /$ $ { \rho , v}$ images (t) | ρ images $ ( t + \Delta t )$ |

B-CNN-all | none | 1 MHD | $ { B , v}$ images | B images $ ( t + \Delta t )$ |

B-PCNN-all | $ \u2207 \xb7 B = \u2207 \xb7 ( \u2207 \xd7 A ) = 0$ | 1 MHD | $ { B , v}$ images (t) | A images $ ( t + \Delta t )$ |

B-CNN-res/res-rand | none | 1/50 MHD | $ { B , v}$ images (t) | $ \Delta B$ images $ ( t + \Delta t )$ |

B-PCNN-res/res-rand | $ \u2207 \xb7 B = \u2207 \xb7 ( \u2207 \xd7 A ) = 0$ | 1/50 MHD | $ { B , v}$ images (t) | $ \Delta A$ images $ ( t + \Delta t )$ |

$ B ( v )$-CNN | none | 1 MHD | $ { v}$ images (t) | B images (t) |

$ B ( v )$-PCNN | $ \u2207 \xb7 B = \u2207 \xb7 ( \u2207 \xd7 A ) = 0$ | 1 MHD | $ { v}$ images (t) | A images (t) |

$ B ( v )$-PINN | soft | 1 MHD | $ { x , y , v ( x , y )}$ points (t) | $ B ( x , y )$ points (t) |

*N*is the number of components in

_{c}*x*and

_{i}*y*. The second metric is the structural similarity index (SSIM) defined as

_{i}*μ*and

_{x}*μ*are the means of

_{y}*x*and

*y*, respectively; $ \sigma x 2$ and $ \sigma y 2$ are the variances of

*x*and

*y*, respectively;

*σ*is the covariance of

_{xy}*x*and

*y*; and $ c 1 = c 2 = 10 \u2212 4$ (their presence is meant to avoid issues that occur if $ u x 2 + \mu y 2$ = 0 or $ \sigma x 2 + \sigma y 2 = 0$). The SSIM produces values in the range of $ [ \u2212 1 , 1 ]$. An SSIM of 1 means that

*x*and

*y*match exactly, while lower values of the SSIM correspond to tensors that are less structurally similar.

^{50}

Compared to MSE, SSIM captures a higher level more wholistic view of images, similarly to what we do with our eyes. In fact, interesting examples have been generated in which the MSE between two images is kept constant, while the SSIM is either maximized or minimized, resulting in images that are completely different or visually almost identical.^{51} A much more thorough comparison of MSE and SSIM, as well as an interesting discussion and set of examples, has been published.^{52} We present MSE and SSIM for all of the generated quantities to simultaneously summarize both the pixel-by-pixel and the neighborhood-dependent features of the predictions.

*a*periodic padding would produce

*a*,

_{p}### A. Forecasting density

*t*to predict the mass density at a subsequent time step $ \rho ( t + \Delta )$ via CNNs. We compared the use of a linear and ReLU output activation functions with ReLU output providing a hard physics constraint of $ \rho \u2265 0$. The two models operated according to

To create training data, the physics-based MHD simulation ran from *t* = 0 to *t* = 1 with $ \Delta t = 8 \xd7 10 \u2212 4$ producing 1250 snapshots. The first 936 snapshot pairs or $ t \u2208 [ 0 , 0.7488 ]$ were used for training of the neural networks and the last 313 snapshot pairs or $ t \u2208 [ 0.7488 , 1 ]$ were used for testing. All ML models used early stopping and were trained using a learning rate of $ 5 \xd7 10 \u2212 4$. Models were trained for 240 epochs with a batch size starting at 2 and doubling every 60 epochs, as recommended in Ref. 53.

It should be noted that we are training on a turbulent part of the Orszag–Tang vortex and then trying to predict the evolution of a smoother, more uniform part. Predictions of the models and their absolute errors at various times are shown in Fig. 3. The SSIMs and the MSEs of the model predictions on the test set are shown in Fig. 4.

### B. Forecasting the magnetic field

The next problem was the use of the **B** and **v** field images at a time steps *t* to forecast the magnetic field at the next time step $ B ( t + \Delta )$. The generated magnetic fields should satisfy physics constraints. It is known that not preserving $ \u2207 \xb7 B = 0$ in numerical simulations leads to magnetic monopole numerical artifacts and the Lorentz force no longer being orthogonal to the magnetic field.^{17}

**B**-CNN, is very similar to

*ρ*-CNN-rel with two major differences: one,

**B**-CNN takes in

**v**and

**B**at one time step and output

**B**at the next time step. Two,

**B**-CNN predicts

**b**, the face-averaged magnetic field, at the next time step and then takes the average of

**b**to produce

**B**,

**b**as an intermediate step because the discretization of the divergence in Eq. (11) is with respect to

**b**, so it is easiest to compare the divergence of the models' predictions to the simulation's divergence if we predict

**b**in our models. As an added benefit, it performed more accurately than when it was not included.

**B**-PCNN, has a significantly different structure. The

**B**-PCNN predicts the vector potential

*A*, where $ b = \u2207 \u22a5 A$, and then numerically takes the curl of

*A*with a custom made Tensorflow layer (also known as a lambda layer) and averages the result to produce

**B,**

*ρ*in the previous section. Comparisons of both models' predictions for

*B*and

_{x}*B*are shown in Figs. 5 and 6, respectively.

_{y}In Fig. 7, the models' adherence to the divergence-free condition is seen in two ways. First, we look at $ \u2207 \xb7 B$, as computed as in Eq. (11), for the models at certain time steps. In terms of $ \u2207 \xb7 B$, only the **B**-PCNN would generate physical fields (by construction), while the **B**-CNN created wildly nonphysical fields (by many orders of magnitude). Therefore, we also used a modified less stringent measure of divergence, referred to as relative divergence.^{17,54}

The denominator in Eq. (21) comes from the idea that the effective magnetic field of interest is $ B eff = 2 p$ (recall the magnetic pressure is given by $ B 2 / 2$). In both cases, the divergence of the PCNN was found to be many orders of magnitude lower than that of the CNN, with the lower bound on the PCNN's divergence given by the numerical precision used, which in our case was $ \u223c 10 \u2212 9$ for a float32, as shown in Fig. 8. Note that for both cases, we are using Eq. (10) to estimate the divergences in order for them to be directly comparable to the constraint transport simulation's notion of the divergence-free condition. Also, we let $ \u27e8 \u03f5 r d \u27e9$ denote the mean of *ϵ _{rd}* taken over the whole grid at fixed time. A comparison of the MSE and SSIM is shown in Fig. 9.

#### 1. Residual networks

**B**-CNN-res, is as follows:

**B**and $ \Delta E z$ are related by Faraday's law.

**B**-PCNN-res, by again predicting a potential,

**B**in the simulation computed as in Eq. (21) with the exception that now we compute $ \u2207 \xb7 B$ as

**b**as an intermediate step in our model architecture.

The hyperparameters for both models were different from previous sections. The models were trained with a learning rate of $ 5 \xd7 10 \u2212 3$. The loss function was changed to be the squared error (SE), which is just the MSE without any normalizing constant. Training was run for 90 epochs. The batch size started at 2 and doubled every 30 epochs for a final batch size of 8. Early stopping was not utilized.

Since the change in **B** from one time step to the next tends to be small, we also made sure that the networks were doing significantly better than predicting zero for $ F ( t + \Delta t )$ [i.e., predicting $ B \u0302 ( t + \Delta t ) = B ( t )$]. This is indeed the case. Predictions of the models and their absolute errors at various times are shown in Figs. 10 and 11. The SSIMs and the MSEs of the model predictions over time on the test set are shown in Fig. 12. The absolute error of the relative divergences of these models is displayed in Fig. 13.

#### 2. Data augmentation

NNs often require much data to learn properly; thus, often data augmentation is used in their training. Thus, to improve learning and see how well the NNs could generalize, we trained the ResNets on 50 ideal MHD simulations that started with random initial fluid velocity fields and magnetic vector potentials, but kept the same initial pressure and density as in the Orszag–Tang vortex. The NNs were then tested on the Orszag–Tang vortex, which they had never encountered before. Here, we take advantage of the fact that the neural operators work for a wide range of conditions, unlike PINNs which just approximate a particular solution of a PDE.

The 50 physics-based MHD simulations started with random initial **v** and *A* fields, where $ b = \u2207 \u22a5 A$, and **B** is computed from **b** via the same averaging operation as in Eq. (10), created by starting with a Gaussian noise field at *t* = 0 and then applying a Gaussian filter (i.e., a convolution with a normal distribution) to smooth things out, similar to what has been done in other works.^{55} Wrapping was used in the Gaussian filter to maintain spatial periodicity of the initial conditions. The parameters for the random field generation were chosen so that the values of fields in the initial conditions were of the same order of magnitude as that of the Orszag–Tang vortex's (on average). The simulation was run from *t* = 0 to 1 and the Orszag–Tang vortex test ran on the same interval. All of the simulations were combined into one training dataset.

Two neural network models were trained on these data: **B**-CNN-res-rand, which has the same architecture as **B**-CNN-res, and **B**-PCNN-res-rand, which has the same architecture as **B**-PCNN-res. These models were trained for 100 epochs and used a learning rate of $ 1 \xd7 10 \u2212 4$. The SE was used again as the loss function. The batch size started at 16 and doubled every 25 epochs, with a final batch size of 128. Early stopping was not used.

#### 3. Surrogate modeling

Our last task for forecasting **B** was to embed our neural network models inside of the physics-based MHD Orszag–Tang simulation. Specifically, we ran the simulation using constrained transport and finite volume methods until we reached the test set. Then, we switched over to using one of our forecasting models to advance **B** instead of using constrained transport.

*f*is meant to denote these quantities are being taken at the face of a cell. This extrapolation uses finite differences, and this can lead to errors where Eq. (30) becomes negative (which should not be physically possible). This does not happen with the normal simulation, but always happens when we try to introduce surrogate models into the simulation. The main issue here being that the calculation of a stabilizing diffusive term uses the square root of Eq. (30), which can lead to “not a number” (NaN) errors. Our fix for this was to take the absolute value of Eq. (30).

For this setup, we allowed the simulation to produce data up until where the test set starts. Then, we deactivated the part of the simulation that uses constrained transport to advance **B** and instead used **B**-CNN-res-rand or **B**-PCNN-res-rand. These models would take in $ v ( t )$ and $ B ( t )$ and then produce $ B ( t + \Delta t )$. The simulation would then be used to advance all other variables, incorporating $ B ( t + \Delta t )$ where relevant.

The **B**-CNN-res and **B**-PCNN-res models failed rapidly, quickly making predictions that resulted in the simulation becoming plagued with NaN errors even with the fix discussed above. One of the difficulties faced by this preliminary study of embedding the CNNs within the physics model was that we were using both $ v ( t )$ and $ B ( t )$ as NN inputs, but only predicting $ B ( t + \Delta t )$, while $ v ( t + \Delta t )$ was generated by the physics model itself. This disconnect resulted in the build-up of nonphysical numerical errors, but was greatly improved by the use of residual networks, which only predict small perturbations. Both the **B**-CNN-res-rand and **B**-PCNN-res-rand models were able to carry out the whole simulation without producing NaN errors. These models still ended up predicting some negative values for the cell-centered version of Eq. (30), but this value is not necessary to advance the simulation, so it did not lead to additional problems. The result of this in terms of the MSE and SSIM for all primitive variables is shown in Fig. 14. How the relative divergences differ is shown in Fig. 15. A qualitative comparison of the primitive variables *ρ*, *p*, *v _{x}*, and

*B*is shown in Fig. 16.

_{y}### C. Predicting the magnetic field

*ρ*. They were tested against

**B**-PINN, a feedforward NN with a leaky ReLU activation ( $ \alpha = 0.1$) and 12 hidden layers with 256 neurons per layer. The model takes time and spatial co-ordinates as inputs and outputs the magnetic field

**B**-PINN, the following term is added to the cost function at Eq. (15)

*λ*is a hyperparameter. Note that in addition to the PDE-based term in the cost, this

_{PDE}**B**-PINN network is also relying on $ B ( t )$ field values produced by the physics-based MHD simulation for its training, as they are part of the cost function defined by Eq. (15). The network was trained with 120 epochs, starting with a learning rate of $ 2 \xd7 10 \u2212 3$ and halving every 40 epochs. During training, the network attempts to obey the ideal MHD induction equation while also fitting the data and gets punished for large weights. Note that Eq. (3) implies

**B**-PINN is weakly driven to be divergenceless from both fitting the data and satisfying the PDE.

One advantage of PINNs is that by automatic differentiation, you can compute time/spatial derivatives efficiently and accurately, hence their use in looking for PDE solutions. They are also independent of meshes. On the other hand, unlike the CNNs, there is no translation equivariance built into **B**-PINN. Also, by approximating a particular solution of a PDE, it is less general than the neural operator, which can, in principle, approximate the solution operator of the PDE, thus solve for any given initial conditions.

The MSE of the models on the test data and $ \u27e8 \u03f5 r d \u27e9$ is shown in Figs. 17 and 18, respectively. Both the $ B ( v )$-CNN and $ B ( v )$-PCNN have low $ \u27e8 \u03f5 r d \u27e9$ on the training data, with $ B ( v )$-PCNN matching the simulation nearly exactly from about *t* = 0.15 to before end of the training data at *t* = 0.75, then rising for the test data. In contrast, **B**-PINN had a steady relative divergence error of slightly less than 1% (Fig. 18).

### D. Other configurations

We investigated ways to improve both the convolutional models and the PINNs. For the CNN and PCNN, we used data augmentation, with the models training on 50 simulations like earlier. Also, **v** was added as an input to the PINN, with the idea that access to it on the test data may improve predictions. These approaches only increased the error on the test set.

## VI. DISCUSSION

We briefly summarize the above results with a list of all of the various neural networks that were used and the names with which we refer to them, as shown in Table I. A summary of quantitative MSE and SSIM values for the various architectures is given in Table III.

Forecasting ρ
. | |||
---|---|---|---|

Model . | Total MSE ( $ \xd7 10 \u2212 2 )$ . | . | . |

ρ-CNN-all | 8.97 | ||

ρ-CNN-rel | 9.87 | ||

ρ-PCNN-rel | $ 1.47$ | ||

ρ-PCNN-all | 3.08 |

Forecasting ρ
. | |||
---|---|---|---|

Model . | Total MSE ( $ \xd7 10 \u2212 2 )$ . | . | . |

ρ-CNN-all | 8.97 | ||

ρ-CNN-rel | 9.87 | ||

ρ-PCNN-rel | $ 1.47$ | ||

ρ-PCNN-all | 3.08 |

Forecasting B . | |||
---|---|---|---|

Model . | Total MSE ( $ \xd7 10 \u2212 5$) . | Total $ ( 1 / 128 2 ) | | \u2207 \xb7 B | | 1$ . | Total $ \u27e8 \u03f5 r , d \u27e9$ ( $ \xd7 10 \u2212 4 )$ . |

B-CNN | 35, 600 | $ 2.67 \xd7 10 3$ | ⋯ |

B-PCNN | 73, 700 | $ 1.84 \xd7 1 0 \u2212 5$ | ⋯ |

B-CNN-res | 4.06 | ⋯ | 7.706 |

B-PCNN-res | 5.21 | ⋯ | 7.712 |

B-CNN-res-rand | $ 1.58$ | ⋯ | 6.93 |

B-PCNN-res-rand | 1.66 | ⋯ | 5.387 |

Forecasting B . | |||
---|---|---|---|

Model . | Total MSE ( $ \xd7 10 \u2212 5$) . | Total $ ( 1 / 128 2 ) | | \u2207 \xb7 B | | 1$ . | Total $ \u27e8 \u03f5 r , d \u27e9$ ( $ \xd7 10 \u2212 4 )$ . |

B-CNN | 35, 600 | $ 2.67 \xd7 10 3$ | ⋯ |

B-PCNN | 73, 700 | $ 1.84 \xd7 1 0 \u2212 5$ | ⋯ |

B-CNN-res | 4.06 | ⋯ | 7.706 |

B-PCNN-res | 5.21 | ⋯ | 7.712 |

B-CNN-res-rand | $ 1.58$ | ⋯ | 6.93 |

B-PCNN-res-rand | 1.66 | ⋯ | 5.387 |

Predicting B . | |||
---|---|---|---|

Model . | Total MSE . | Total $ \u27e8 \u03f5 r d \u27e9$ . | . |

$ B ( v )$-CNN | 5.64 | 3.55 | |

$ B ( v )$-PCNN | 4.76 | $ 1.28$ | |

B-PINN | $ 1.99$ | 2.58 |

Predicting B . | |||
---|---|---|---|

Model . | Total MSE . | Total $ \u27e8 \u03f5 r d \u27e9$ . | . |

$ B ( v )$-CNN | 5.64 | 3.55 | |

$ B ( v )$-PCNN | 4.76 | $ 1.28$ | |

B-PINN | $ 1.99$ | 2.58 |

### A. Forecasting density

We first studied of the use of neural networks to forecast the mass density *ρ* of a subsequent time step $ ( t + \Delta t )$ based on field measurements from the previous time step (*t*). In this case, we compared using a CNN without any physics constraints, which takes all fields as input images at time *t* to predict $ \rho ( t + \Delta t )$. We also incorporate a hard physics constraint in enforcing $ \rho \u2265 0$ by utilizing a ReLu activation function on the network's output, referred to as *ρ*-PCNN-all. For density forecasting, we also take into account Eq. (1) and train each of the above networks with only the relevant fields $ \rho ( t )$ and $ v ( t )$. These networks using only relevant data are referred to as *ρ*-CNN-rel and *ρ*-PCNN-rel.

As can be seen from Figs. 3 and 4, the models that performed best according to both the MSE and SSIM were the ones that used ReLU activation functions in their final layer. This is likely due to the fact that the ReLU function is enforcing the hard constraint $ \rho \u2265 0$; thus, the model does not have to learn this, which facilitates the model learning quicker. We also tested enforcing non-negativity by using a linear output layer and then squaring its output. The MSE on the test set was nearly the same as with using a ReLU final layer. In terms of both MSE and SSIM, it is found that utilizing only the relevant data only slightly improves performance, while adding the hard physics constraint results in a dramatic improvement.

### B. Forecasting the magnetic field

Figures 5 and 6 show the *B _{x}* and

*B*fields generated by the

_{y}**B**-CNN and

**B**-PCNN relative to the true field values at several time steps. While the predictions of both models are very accurate and almost indistinguishable by eye, only the

**B**-PCNN with hard physics constraints based on the approach in Ref. 14 generated physical fields with zero divergence, while the fields generated by the

**B**-CNN had divergence values higher by a factor of 10

^{8}, as seen in Figs. 7 and 8.

Zooming in on the absolute differences between true values and predictions in Figs. 5 and 6, a slightly higher error can be seen for the **B**-PCNN approach, which is quantified in terms of both MSE and SSIM in Fig. 9, which shows that the SSIM of both methods is very high $ \u2265 0.8$, while the MSE of both methods is very low $ < 5 \xd7 10 \u2212 3$. The slightly higher SSIM and slightly smaller MSE of the **B**-CNN predictions relative to those of the **B**-PCNN, given the same number of training steps, is due to two major factors.

First, a well-known issue with the **B**-PCNN approach on a mesh is that boundaries have to be handled carefully and even masked out, which was not done here, as can be seen by the visual artifacts that appear near the boundary of the simulation. This is due to taking finite-difference derivatives at a boundary which sample beyond the boundary (requiring periodic or zero padding). Using periodic padding in the encoder part of the **B**-PCNN architecture helped somewhat alleviate this, but would have to be carefully applied on the output of the PCNN so that when **B** is computed as the curl of *A* artificial numerical errors are not introduced. Second, while the **B**-CNN has no built in constraints and can generate arbitrary outputs, the **B**-PCNN is very highly constrained in that in can only generate fields that by construction satisfy the hard physics constraint of $ \u2207 \xb7 B = 0$.

While the accuracy of both model's predictions is very high in terms of both MSE and SSIM, only the **B**-PCNN generates physical fields (within numerical precision), as can be seen in Figs. 7 and 8. In Fig. 7, the pixel-wise $ | \u2207 \xb7 B |$ values of the predicted fields are shown for several times steps on a fixed color scale. On that scale, while large $ | \u2207 \xb7 B |$ values are clearly visible for **B**-CNN predictions, they are almost impossible to see for the **B**-PCNN predictions. This difference is quantified in Fig. 8, which shows the image-averaged $ | \u2207 \xb7 B |$ values for all of the predicted test value time steps. While the **B**-PCNN $ | \u2207 \xb7 B |$ values are $ \u223c 10 \u2212 8$, with a lower bound dictated by the numerical precision of the Tensorflow floating point 32 data type, the **B**-CNN physics constraint violation is catastrophic at $ \u223c 1$, eight orders of magnitude higher.

As discussed above, one way to make this problem much easier for the CNN is to utilize a residual CNN, which takes into account our knowledge that this is a continuous dynamic process. Therefore, the magnetic field at a subsequent time step, $ B ( t + \Delta )$, should be close to its previous value $ B ( t )$ for small values of Δ. The new field value should be a small perturbation of the form $ B ( t + \Delta ) = B ( t ) + \Delta B ( t )$. It was found that by giving the network this strong hint and having it generate $ \Delta B ( t )$ directly, the accuracy of both the **B**-CNN and **B**-PCNN was greatly improved as seen in terms of MSE in Fig. 12, which decreased by four orders of magnitude.

**B**-PCNN could generate physical fields. However, because now only small perturbations are being generated by the network and are being added to model-based fields with zero divergence, the overall divergence of the subsequent fields became smaller for both networks as seen by considering the divergence of the ResNet's predicted magnetic field

*ϵ*is useful in that the relative divergence is much smaller and is a unit-less quantity, which is less sensitive to the magnitude of the field being generated. Taking into account that the

_{rd}**B**-PCNN is the only network that was able to generate physical fields, in the remainder of the paper, we focus on this relative divergence measure, which is of interest for numerical simulations in which the hard physics constraint $ \u2207 \xb7 B = 0$ is not necessarily maintained, but for which some measure of divergence is still important.

In terms of *ϵ _{rd}*, the residual

**B**-CNN-res and

**B**-PCNN-res were found to have very similar values (besides a few random outliers) as each network easily found a correct small perturbation of the divergence-less

**B**field from the previous time step provided by the simulation, as shown in Fig. 13. The ResNet setup is only predicting small perturbations, which are added to the correct simulation-based values whose divergence is zero, from previous states. Therefore, the overall divergence of the predicted fields will have contributions from only the small perturbations generated by the ResNets. That is why in this case, the “relative divergence” of the PCNN and CNN becomes very similar.

Next, it was shown that training on 50 random simulations, each of which was a slight perturbation of the original Orszag–Tang vortex, and then predicting on data from the Orszag–Tang vortex, yielded models that preformed even better than just training on data from the Orszag–Tang vortex alone, as also seen in Fig. 12. In this case, the networks had a much more rich set of data to learn from. In this residual approach, again with most of the output dominated by the known previous step's divergence-less fields, the relative divergences of both approaches were relatively low.

These results imply that this approach can be useful for a wide range of unseen initial conditions as long as the network is trained over a sufficiently large dataset with a large enough variety of field distributions. Furthermore, such a trained network should be useful for a completely different set of initial conditions if it is first quickly re-trained on just a few of the initial steps.

### C. Surrogate modeling

The only two models we were able to successfully apply to surrogate modeling were **B**-CNN-res-rand and **B**-PCNN-res-rand. The use of surrogate models had the greatest effect on **B** and **v** with errors increasing rapidly with time as can be seen in Fig. 14. The same trend is reflected in the SSIM. While *ρ* and *p* do start to diverge from the simulation as time goes on, this divergence is on a scale two orders of magnitude less than that of **v** and **B**.

In terms of relative divergences, **B**-PCNN-res-rand preformed the best overall.

For this more challenging problem of recursively applying the ML on its own outputs, the **B**-CNN-res-rand and the **B**-PCNN models both had high SSIM and low MSE, but in terms of relative divergences, the **B**-PCNN performance was much better, although after many steps both models diverged. Note that in this case, the use of relative rather than actual divergence makes artificially makes the **B**-PCNN look bad because, as seen in the bottom three rows of Fig. 16, as the MHD simulation becomes more diffuse over time, the magnitude of the magnetic field decreases significantly, which decreases the denominator or Eq. (37), while the numerator is relatively fixed in the **B**-PCNN case limited by numerical precision.

### D. Predicting the magnetic field

The final test of these approaches was to predict the magnetic field at a given time step based only on velocity field measurements from the same time step. Although it would have been possible to do this with a residual network approach using the previous time step's magnetic field, $ B ( t \u2212 \Delta )$, this was not done in order to fully test the network's ability to match both field accuracy and the divergence-less condition, which as mentioned above is otherwise dominated by the correct **B** field values provided by the simulation. In this case, in addition to the CNNs, a point-wise PINN-based network was also tested.

All three networks performed well with low MSE as shown in Fig. 17, and visually indistinguishable field maps. As before, only the **B**-PCNN was able to produce divergence-less fields that matched the simulation, as shown in Fig. 18. For the convolutional models, $ B ( v )$-PCNN outperformed $ B ( v )$-CNN in both accuracy and upholding the divergence-less condition. The **B**-PINN performed better than the convolutional models with respect to MSE on the test data, but generated un-physical data that resulted in the worst violation of the physics constraints as can be seen in Fig. 18, despite having the explicit PDE constraint. Again the only approach able to generate physical field values was the PCNN. Furthermore, the PINN approach is slower to train and slower to generate images as it works on a point-by-point basis rather than seeing and generating entire images at once.

## VII. CONCLUSIONS

We showed that convolutional neural networks can be used for MHD, how certain physics constraints can be embedded and which ones led to improvements in performance. It was also seen that using ResNets and data augmentation could make a big difference. The forecasting and surrogate modeling tasks show that PCNNs could potentially be helpful to speed up simulations. The prediction task demonstrated that they may be of use in analyzing MHD phenomena when access to certain observables is not possible. Something similar has already been done with PINNs to study the solar coronal magnetic field.^{56} Deep generative encoder–decoder convolutional neural networks are also being combined with adaptive model-independent feedback to make robust surrogate models that generate all of the 15 2D projections of the 6D phase space of charged particle beams based only on a limited view of a single 2D measurement.^{57} Adaptive feedback methods have been combined with simulations to provide virtual views of complex dynamics based on limited data,^{58} and there is potential to develop such an approach for plasma systems by using fast MHD surrogates based on CNNs.

It is worth pointing out that for a fixed neural network architecture, a slightly different result can occur every time that the network is trained using a new random seed and new random initial weights and biases, which are sampled from probability distributions. In this work, we did re-train networks many times and none of the results were significantly changed. Furthermore, it is also true that various architecture choices such as the number of layers, the width of each layer, and the activation functions being used can all slightly change a network's performance. Despite this possibility of small variation, when our networks were trained with various activation functions or levels of regularization, the overall results still held. The overall main point is that the different approaches, independent of the details of the architecture used, have certain limitations, and the **B**-PCNN approach with hard physics constraints built into the architecture, based on the methods developed in Ref. 14, was the only method that generated physical fields that respected constraints.

Future work could go beyond ideal MHD to incorporating magnetic diffusion and fluid viscosity into the simulation. Also, the spatial resolution could be increased beyond 128 × 128. It was observed that the Orszag–Tang vortex picks up interesting detailed structures at higher resolutions. A systematic investigation into how robust the MHD CNNs/PCNNs and PINN models are to noise would be worth undertaking and inform their use in real world applications. Relatedly, the lack of interpretability of NNs makes uncertainty quantification crucial. The challenge is that many standard techniques would be prohibitively slow given the number of parameters the NNs have, and thus, other methods have been developed.^{59} Given their different strengths, finding effective ways of combining PCNNs and PINNs for MHD problems may prove quite fruitful. One such approach already investigated was the application of Physics Informed Neural Operators in the form of Fourier neural operators to MHD.^{44}

^{60}CD schemes preserve the following form of the divergence free condition:

**b**as an intermediary step or having to use relative divergence to quantify error. Finally, the forecasting tasks could likely be improved by using by using a recurrent structure, like a LSTM or GRU, at the beginning of the network architectures.

Employing PCNNs to study real-world MHD situations can help speed up simulations, as well as be an effective approach in cases with incomplete information or with problems that are not well-posed. Taking into account the trade-off between ease of training and generating physical fields, the choice of ML with or without hard physics constraints likely depends on the application of the user. If the goal is to simply create a fast-executing surrogate model that provides a fast qualitative view of a system's state, such as a display in a control room that provides intuition for an experimenter, then the use of ML without hard physics constraints may be sufficient with the benefit of faster training. If on the other hand the goal is to develop a physically consistent surrogate model with the goal of speeding up computationally expensive calculations in a physics model while generating physical fields that numerically satisfy all of the constraints, then ML with hard physics constraints should be used.

## ACKNOWLEDGMENTS

This work was supported by the Los Alamos National Laboratory Directed Research and Development (LDRD) Program Directed Research (DR) project “Charged Particle Beam Control and Diagnostics using Adaptive Machine Learning,” Project No. 20220074DR.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Ari Bormanis:** Conceptualization (equal); Data curation (lead); Formal analysis (equal); Investigation (equal); Writing – original draft (equal); Writing – review & editing (equal). **Christopher Leon:** Data curation (equal); Formal analysis (equal); Investigation (equal); Writing – original draft (equal); Writing – review & editing (equal). **Alexander Scheinker:** Conceptualization (lead); Data curation (equal); Formal analysis (equal); Funding acquisition (lead); Investigation (equal); Methodology (lead); Supervision (equal); Writing – original draft (supporting); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX A: MHD EQUATIONS IN ARBITRARY DIMENSIONS

### APPENDIX B: CONSTRAINED TRANSPORT SIMULATIONS

##### 1. Finite volume method

**U**is a vector of conservative variables and

**F**a vector that is some function of them. The conservative variables in this case are

*ρ*, $ \rho v$,

*ρe*, and

**B,**where

*e*is the total specific energy given by

*ρ*, $ \rho v$, and

*ρe*. The cells update their cell-averaged values by exchanging conservative quantities via fluxes through their faces. This scheme ensures that these conservative quantities are maintained to machine error. However, special care must be taken when advancing

**B**in time to ensure that Gauss's law for magnetism, $ \u2207 \xb7 B = 0$, is also maintained. This necessitates the need of constrained transport (see Sec. III A).

##### 2. Understanding the simulation data

To ensure the simulation is accurate, it must obey the Courant–Friedrichs–Lewy (CFL) condition, which enforces a maximum time step at each update in the simulation. Heuristically, this is the restriction that changes cannot propagate faster than the fastest possible wave speed, here the fast magnetosonic speed. We have modified the simulation to use a constant time step of $ \Delta t = 0.0008$ and the reason this time step is so small is to ensure that it never violates the restriction placed on it by the CFL condition for the whole simulation duration.

### APPENDIX C: RESNETS AND CONSTRAINED TRANSPORT

**b**:

*E*is the electric field at the corners of a cell. Note that we use a slightly more compact notation than in the rest of the paper. The time step is now the second superscript. This allows us to rigorously asses the update equation for

^{z}**B**. First for

*B*,

_{x}*i*,

*j*) in the mesh:

**F**is not divergence free in the discretized sense when taken as a forward difference

## REFERENCES

*β*plasmas