A large number of magnetohydrodynamic (MHD) equilibrium calculations are often required for uncertainty quantification, optimization, and real-time diagnostic information, making MHD equilibrium codes vital to the field of plasma physics. In this paper, we explore a method for solving the Grad–Shafranov equation by using physics-informed neural networks (PINNs). For PINNs, we optimize neural networks by directly minimizing the residual of the partial differential equation as a loss function. We show that PINNs can accurately and effectively solve the Grad–Shafranov equation with several different boundary conditions, making them more flexible than traditional solvers. The method is flexible in that it requires no mesh or basis choice, thereby streamlining the computational process. We also explore the parameter space by varying the size of the model, the learning rate, and the boundary conditions to map various tradeoffs, such as between reconstruction error and computational speed. Additionally, we introduce a parameterized PINN framework, expanding the input space to include variables such as pressure, aspect ratio, elongation, and triangularity in order to handle a broader range of plasma scenarios within a single network. Parameterized PINNs could be used in future work to solve inverse problems such as shape optimization.

## I. INTRODUCTION

The goal of magnetic confinement fusion research is to establish a hot, dense, steady-state plasma with enough fusion occurring to be economically viable. In fusion parameter regimes, the plasma is often well described by ideal magnetohydrodynamics (MHD), and 2D equilibria are described by the Grad–Shafranov equation (GSE).^{1,2} For diagnostic and real-time control purposes on current quasi-steady-state fusion devices, accurate plasma equilibria must be computed in real time from a sparse set of measurement data. These types of calculations can be improved and sped up with neural network surrogate models. Neural network surrogates provide flexibility since the solution is differentiable and meshless. Once these surrogate models are trained, inference is very fast, and ongoing improvements in GPUs and neural network methods can further improve performance. Moreover, surrogates can be useful for uncertainty quantification (UQ), as their evaluation post-training is very inexpensive.

In a physics-informed neural network (PINN), the weights and biases of the network are chosen by directly minimizing the residual of a partial differential equation (PDE) as a loss function.^{3} Accordingly, PINNs do not need training data obtained through potentially expensive physical experiments or computer simulations.^{3} PINNs can be used to solve many types of PDEs, including high-dimensional^{4} and fractional^{5} partial differential equations. PINNs can also be used to solve various inverse problems that derive from PDEs,^{8} such as shape optimization.^{6,7} In this paper, we demonstrate the effectiveness of PINNs in solving the Grad–Shafranov equation, employing both hard boundary conditions for exact enforcement and soft boundary conditions for approximate enforcement through penalty methods, achieving accurate solutions with each variation.

More than two decades ago, neural networks were proposed and briefly studied for solving the ideal MHD force balance equation for tokamaks and stellarators.^{9} However, recently, there has been a resurgence of interest in neural network surrogates for MHD equilibria.^{10–13} Most of this work has not used physics-informed neural networks (i.e., not directly minimizing the residual of the ideal MHD force balance) or has used substantial experimental data to build a fast surrogate for a specific device. There are also cases where the force balance residual is used as physics regularization^{14} or the MHD force balance is solved with a pseudospectral method via a fully spectral polynomial-Fourier basis instead of a typical neural network basis.^{15}

### A. Contributions of this work

In the present work, we generate fully data-free and mesh-free surrogate models of the Grad–Shafranov equation via physics-informed neural networks (PINNs). The intent of this work is not to replace optimized GS solvers but to develop a solver optimized for speed in scenarios where lower accuracy is sufficient, leveraging automatic differentiation (AD) to compute analytic derivatives. This approach eliminates the need for a mesh while requiring the equations to be in a symbolic form, as facilitated by neural networks. Additionally, we introduce a parameterized framework to further broaden the method's applicability, allowing a diverse range of plasma scenarios to be addressed within a single model. We illustrate the strength of this approach by showing solutions for a class of fixed-boundary tokamak configurations from Cerfon and Freidberg,^{16} including single and double null configurations. Configurations are implemented with both soft and hard constraints to compare their performance. We also perform hyperparameter scans and other systematic investigations and conclude that the network produces robust results for a very broad range of network parameters. We finish by demonstrating preliminary work on *parametric* PINNs that are parameterized by the magnitude of the pressure and three shape parameters associated with the "elongated-D" cross section used for tokamak shapes. Unlike standard PINNs, in which the models need to be retrained for each boundary shape and choice of profiles, parametric-PINNs only need to be trained once for various configurations. After a modest training period, a parametric-PINN can infer equilibria in tens of milliseconds for different geometries, from negative triangularity tokamaks to field-reversed configurations. For the cases explored here, the solution accuracy is likely sufficient to use a parametric PINN as a surrogate in inverse problems and optimization, although we have not yet explored these applications.
Implementing PINNs in this work was done through a modified version of DeepXDE, a Python library for physics-informed learning.^{17}

## II. MHD EQUILIBRIA

The static ideal MHD equilibrium of a plasma is described by the force balance $\nabla p = \mathbf{J} \times \mathbf{B}$ together with Ampère's law.^{18,19} In short, for an axisymmetric configuration, a poloidal magnetic flux function $\psi(r, z)$ is introduced, in terms of which the magnetic field can be written

$$ \mathbf{B} = \frac{1}{r} \nabla \psi \times \mathbf{e}_\varphi + \frac{F(\psi)}{r} \mathbf{e}_\varphi, $$

where $r$ is the major radial coordinate in cylindrical coordinates $(r, \varphi, z)$, $\mathbf{e}_\varphi$ is the unit vector in the toroidal direction, and $F(\psi) = r B_\varphi$. Substituting this representation into the force balance yields the Grad–Shafranov equation,

$$ r \frac{\partial}{\partial r}\!\left(\frac{1}{r}\frac{\partial \psi}{\partial r}\right) + \frac{\partial^2 \psi}{\partial z^2} = -\mu_0 r^2 \frac{dp}{d\psi} - F \frac{dF}{d\psi}. \quad (5) $$

The pressure profile $p(\psi)$ and poloidal current profile $F(\psi) = r B_\varphi$ are free functions that must be prescribed to solve the PDE. In other words, given the pressure and $F$ profiles, solving Eq. (5) provides the MHD equilibrium.

### A. Solov'ev equilibria

Although the general solution of the nonlinear partial differential equation in Eq. (5) must be computed numerically, certain choices of $p(\psi)$ and $F(\psi)$ allow for an analytical solution: Solov'ev profiles,^{20} in which $p(\psi)$ and $F^2(\psi)$ are both linear functions of $\psi$,

$$ -\mu_0 \frac{dp}{d\psi} = C, \qquad -F \frac{dF}{d\psi} = A, $$

where $A$ and $C$ are constants. After rescaling with the major radius of the plasma, $R_0$, and an arbitrary constant, $\psi_0$, via the definitions $R = r/R_0$, $Z = z/R_0$, $\psi = \psi_0 \Psi$, and $\psi_0 = R_0^2 (A + C R_0^2)$, Eq. (5) can be written^{16}

$$ \frac{\partial^2 \Psi}{\partial R^2} - \frac{1}{R}\frac{\partial \Psi}{\partial R} + \frac{\partial^2 \Psi}{\partial Z^2} = (1 - \bar{A}) R^2 + \bar{A}, \qquad \bar{A} \equiv \frac{A}{A + C R_0^2}. $$
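Because the Solov'ev problem has a closed-form particular solution, the normalized equation can be checked numerically. The sketch below (a minimal check, not part of the paper's code) verifies by finite differences that the Cerfon–Freidberg particular solution $\Psi_p = R^4/8 + \bar{A}\,(R^2 \ln R/2 - R^4/8)$ satisfies the normalized equation $\Delta^*\Psi = (1-\bar{A})R^2 + \bar{A}$ for an arbitrarily chosen value of the dimensionless parameter, denoted here `ABAR`:

```python
import numpy as np

ABAR = 0.3  # arbitrary illustrative value of the dimensionless profile parameter

def psi_p(R, Z):
    # Particular solution of the normalized Grad-Shafranov equation
    return R**4 / 8 + ABAR * (0.5 * R**2 * np.log(R) - R**4 / 8)

def residual(R, Z, h=1e-4):
    # Delta* Psi - RHS, with Delta* Psi = Psi_RR - Psi_R / R + Psi_ZZ,
    # approximated by central finite differences
    psi_RR = (psi_p(R + h, Z) - 2 * psi_p(R, Z) + psi_p(R - h, Z)) / h**2
    psi_R = (psi_p(R + h, Z) - psi_p(R - h, Z)) / (2 * h)
    psi_ZZ = (psi_p(R, Z + h) - 2 * psi_p(R, Z) + psi_p(R, Z - h)) / h**2
    rhs = (1 - ABAR) * R**2 + ABAR
    return psi_RR - psi_R / R + psi_ZZ - rhs

# The residual vanishes (to finite-difference accuracy) across the domain
assert all(abs(residual(R, Z)) < 1e-5
           for R, Z in [(0.7, 0.2), (1.0, 0.0), (1.3, -0.4)])
```

The same finite-difference residual, applied to the network output instead of $\Psi_p$, is essentially what the PINN loss measures, except that the PINN uses automatic differentiation rather than finite differences.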

### B. Fixed boundary condition

We consider fixed-boundary equilibria with $\Psi = 0$ imposed on a prescribed boundary curve characterized by the inverse aspect ratio $\epsilon = a/R_0$, elongation $\kappa$, and triangularity $\delta$, with lengths normalized to the major radius, $R_0$, as in Fig. 1. The boundary can be defined by the following parametric equations:^{21}

$$ R(\tau) = 1 + \epsilon \cos(\tau + \alpha \sin \tau), \qquad Z(\tau) = \epsilon \kappa \sin \tau, \quad (8) $$

where $\kappa$ is the elongation, $\delta = \sin \alpha$ is the triangularity, and $\tau \in [0, 2\pi)$.
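As a concrete illustration, the boundary curve of Eq. (8) can be sampled directly; the sketch below (with an ITER-like parameter set from Table I) is one plausible way to generate boundary collocation points:

```python
import numpy as np

def boundary_points(eps, kappa, delta, n=100):
    # Sample Eq. (8): R = 1 + eps*cos(tau + alpha*sin(tau)), Z = eps*kappa*sin(tau),
    # with alpha = arcsin(delta) and tau in [0, 2*pi)
    tau = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    alpha = np.arcsin(delta)
    R = 1.0 + eps * np.cos(tau + alpha * np.sin(tau))
    Z = eps * kappa * np.sin(tau)
    return np.column_stack([R, Z])

pts = boundary_points(eps=0.32, kappa=1.7, delta=0.33)  # ITER-like shape
```

A quick sanity check: the curve stays within $1 \pm \epsilon$ in $R$ and within $\pm \epsilon \kappa$ in $Z$.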

### C. Analytic solutions for Solov'ev equilibria

With the prescribed boundary conditions and Solov'ev profiles, the Grad–Shafranov equation in Eq. (5) can be solved analytically, as shown by Cerfon and Freidberg,^{16} using a sum of a particular solution and homogeneous solutions,

$$ \Psi(R, Z) = \Psi_p(R, Z) + \sum_{i=1}^{7} c_i \Psi_i(R, Z), \quad (10) $$

where the homogeneous solutions $\Psi_i$ are polynomial (up to logarithmic factors) in $(R, Z)$ and can be found in Cerfon and Freidberg^{16} or Zheng *et al.*^{22} The coefficients $c_i$ are determined from seven point-wise boundary conditions specified in Cerfon and Freidberg.^{16} With the seven geometric constraints, we have seven requirements that uniquely determine $c_1, \ldots, c_7$. In principle, for the $\Psi = 0$ contour to coincide exactly with a specific boundary curve such as Eq. (8), an infinite number of basis functions would need to be included, rather than seven. However, in practice, this truncated analytic solution is very accurate for many choices of the boundary conditions. To be more precise, the relative error between the boundary shape prescribed by Eq. (8) and the boundary shape of the Cerfon analytic solutions is on the order of $10^{-5}$ for the examples considered here. Given $|\nabla \Psi| \sim 0.1$ in typical solutions, the difference of $\sim 10^{-5}$ in the boundary location translates to a difference of $\sim 10^{-5} \times 0.1 \approx 10^{-6} = 10^{-4}\,\%$ in $\Psi$ locally. We conclude that the boundary of the Cerfon analytic solution is close enough to Eq. (8) for Eq. (10) to be used as an analytic reference, and the difference between the boundary shapes is subdominant to other solution errors discussed in this paper.

## III. PHYSICS-INFORMED NEURAL NETWORKS

PINNs use a neural network as a universal function approximator^{23} to calculate the solution of a PDE given the independent variables of the PDE. Using automatic differentiation, residuals at each collocation point are calculated from the governing equation, boundary conditions, and (possibly) data. The residuals are consolidated into a weighted loss function.^{3} Figure 3 describes the full PINN architecture used for this work.

A feed-forward neural network (FNN) is composed of layers $z_i$ of widths $n_i$, $i = 0, \ldots, L$; when $L > 1$ the network is considered "deep." For the standard PINNs employed in this work, $z_0 = (R, Z)$, $n_0 = 2$, $n_L = 1$, and $z_L$ is the neural network approximation of the poloidal flux function, $\bar{\Psi}$. The layers between the input and output layers are denoted hidden layers $z_i$, $i = 1, \ldots, L-1$, and each layer is computed from the previous one as

$$ z_i = \sigma_i\!\left(W_i z_{i-1} + b_i\right), $$

where $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ is a weight matrix, $b_i \in \mathbb{R}^{n_i}$ is a bias vector, $i$ denotes the index of the layer, and $\sigma_i(\cdot)$ is an activation function acting element-wise. The activation function can be any of the common varieties, e.g., sigmoids, rectified linear units (ReLU), and tanh functions. Different activation functions are explored in this work, and we found that ReLU does not work for solving the Grad–Shafranov equation since the solution needs to be twice differentiable. During training, both PDE and boundary condition residuals in a loss function are calculated by applying automatic differentiation to our differentiable surrogate model $\bar{\Psi}$. After the training, the weights, biases, and activation function at each layer are determined, and the output prediction $\bar{\Psi} = z_L$ can be efficiently calculated for any given input vector $z_0$ (i.e., coordinates).
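For concreteness, the forward pass can be sketched in a few lines of NumPy. This toy network (randomly initialized, not trained) mirrors the $z_i = \sigma_i(W_i z_{i-1} + b_i)$ recursion with the layer sizes used later in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Widths n_0..n_L: input (R, Z), three hidden layers of 20 nodes, scalar output
widths = [2, 20, 20, 20, 1]
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
          for n, m in zip(widths[:-1], widths[1:])]

def forward(z0):
    # z_i = tanh(W_i z_{i-1} + b_i) on hidden layers; linear output layer
    z = z0
    for W, b in params[:-1]:
        z = np.tanh(W @ z + b)
    W, b = params[-1]
    return (W @ z + b)[0]

psi_bar = forward(np.array([1.0, 0.0]))  # network approximation at (R, Z) = (1, 0)
```

In the actual PINN, the weights and biases are adjusted by gradient-based optimization of the residual loss, with derivatives of `forward` obtained by automatic differentiation rather than by hand.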

The loss function is a weighted sum of the mean-squared PDE residual, evaluated at collocation points in the domain, and the mean-squared boundary condition residual, evaluated at points on the boundary,

$$ L = L_{pde} + \lambda_1 L_{bc}, $$

where $\lambda_1$ is a hyperparameter weighing the trade-off between matching the PDE in the volume and matching the boundary conditions.

Penalty-based (soft) enforcement does not guarantee that the boundary conditions are satisfied exactly, and the appropriate weight $\lambda_1$ is problem-dependent.^{24,25} Therefore, both soft constraints and hard constraints for boundary conditions are explored. The hard constraint can be effectively imposed by using a function $G(R, Z)$ that adheres strictly to the boundary condition. The function $G(R, Z)$ can be any smooth function that goes to zero at the boundary. It can be found with two different methods: we can derive an analytical form from a given geometry, or train a separate neural network that vanishes at the boundary and multiply the two network outputs. Either way, the final neural network with hard constraints, $\bar{\Psi}_{hc}(R, Z)$, becomes

$$ \bar{\Psi}_{hc}(R, Z) = G(R, Z)\, \bar{\Psi}(R, Z). \quad (17) $$

The drawbacks of employing a hard constraint approach include potentially lower accuracy for certain problems and the limitation that it can only be implemented when there is a functional representation of the boundary.
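A minimal sketch of the hard-constraint construction, $\bar{\Psi}_{hc}(R, Z) = G(R, Z)\,\bar{\Psi}(R, Z)$, using a hypothetical circular boundary so that $G$ has a simple closed form (the paper's actual $G$ is built from Eq. (8) or from a second network):

```python
import numpy as np

eps = 0.32  # minor radius of the illustrative circular boundary

def G(R, Z):
    # Smooth function that vanishes on the circle (R - 1)^2 + Z^2 = eps^2
    return eps**2 - (R - 1.0)**2 - Z**2

def psi_bar(R, Z):
    # Stand-in for the raw network output (any smooth function works here)
    return 0.1 * (1.0 + R + Z**2)

def psi_hc(R, Z):
    # Hard-constrained prediction: exactly zero on the boundary by construction
    return G(R, Z) * psi_bar(R, Z)

# On the boundary the prediction vanishes regardless of what psi_bar returns
theta = np.linspace(0.0, 2.0 * np.pi, 50)
Rb, Zb = 1.0 + eps * np.cos(theta), eps * np.sin(theta)
```

Because the boundary condition is satisfied identically, no boundary loss term (and hence no boundary collocation points) is needed when training with this form.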

A data loss term can also be included, penalizing the mismatch between the network output and known values $\tilde{\Psi}$ at $N_{data}$ locations $(R, Z)$; in the case of an experiment, $\tilde{\Psi}$ should be understood to be an approximation of the GSE solution corrupted by measurement noise and any non-equilibrium dynamics. Although we solve a number of fixed-boundary problems in Sec. V, we omit any data points to illustrate that these methods can reproduce approximate MHD equilibria from the minimization of the PDE residual and boundary conditions alone. While including data points is known to enhance PINN training outcomes, for the scenarios examined in this study, collocation points alone sufficed to achieve a training accuracy of $1 \times 10^{-4}$. Unlike data points, which come from either a simulation or an experiment, collocation points are points in input space where the residuals are calculated; no experiment or simulation is needed to collect and label them. An example of collocation points used in this work is shown in Fig. 2.
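The distinction can be made concrete: collocation points are just coordinates, with no attached labels, at which the residuals are evaluated. A minimal sketch (the bounding box and counts here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Domain collocation points: pseudo-random coordinates in a box around the
# plasma cross section. No Psi values are attached to them; they are not data.
domain_pts = np.column_stack([rng.uniform(0.6, 1.4, size=500),
                              rng.uniform(-0.6, 0.6, size=500)])

def total_loss(pde_residuals, bc_residuals, lam1=1.0):
    # Weighted sum of mean-squared residuals; no data term is required
    return np.mean(pde_residuals**2) + lam1 * np.mean(bc_residuals**2)
```

In a full implementation, `pde_residuals` would be the Grad–Shafranov residual evaluated at `domain_pts` via automatic differentiation, and `bc_residuals` the deviation of $\bar{\Psi}$ from zero at the boundary points.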

## IV. IMPLEMENTATION AND TRAINING

The general architecture used for this work is shown in Fig. 3. We first use a standard FNN with three hidden layers and 20 nodes in each hidden layer. The inputs are the coordinates (*R*, *Z*), and the output is $\bar{\Psi}$. To try to avoid local optima, the Adam optimizer^{26} with learning rate $\alpha_l = 10^{-3}$ was used for the first 100 iterations, after which L-BFGS-B^{27} is used. We used 100 boundary and 500 domain collocation points. In the Appendix, the parameter space is explored by varying the size of the model, activation functions, and optimizers in order to map various tradeoffs, such as between reconstruction error and computational speed. We demonstrate that the solution to the two-dimensional Grad–Shafranov equation with linear profiles can be approximated with absolute errors of no more than one percent. We observe no significant effect on our results when using larger networks or adjusting the hyperparameters. In the subsections that follow, we explain how we use PINNs in various scenarios.
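The two-stage optimization strategy can be illustrated on a toy loss: a short first-order phase to move away from poor initializations, followed by L-BFGS-B for accurate convergence. This sketch uses plain gradient descent with a numerical gradient as a stand-in for Adam, and the quartic `loss` below is a hypothetical placeholder, not the PINN loss itself:

```python
import numpy as np
from scipy.optimize import minimize

def loss(w):
    # Toy nonconvex scalar loss standing in for the PINN loss
    return float(np.sum((w**2 - 1.0)**2) + 0.1 * np.sum(np.sin(5.0 * w)**2))

def num_grad(w, h=1e-6):
    # Central-difference gradient (the real code uses automatic differentiation)
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (loss(w + e) - loss(w - e)) / (2.0 * h)
    return g

w = np.array([2.0, -1.7])
loss0 = loss(w)

# Stage 1: 100 first-order steps (stand-in for Adam with learning rate 1e-3)
for _ in range(100):
    w -= 1e-3 * num_grad(w)

# Stage 2: second-order refinement with L-BFGS-B
result = minimize(loss, w, method="L-BFGS-B")
```

The first-order phase is cheap and robust to initialization; the quasi-Newton phase then converges quickly once the iterate is in a good basin.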

### A. Simple equilibria

### B. Equilibria with a divertor/X-point

Equilibria with an X-point can be calculated similarly to the simple equilibria examples. Given the absence of a convenient parametric equation for the boundary akin to Eq. (8), we instead obtain the boundary from the analytical solution, as described in Sec. II C. We can then generate collocation points both in the domain and at the boundary.

### C. Hard constraints

In the method shown in Fig. 3, we apply soft constraints on the boundary conditions through the loss *L _{bc}* shown in Eq. (21). Soft constraints on the boundary condition were used for the simple- and divertor-equilibria cases. Hard constraints on the boundary conditions can be advantageous for several reasons. For some problems, hard constraints have been shown to produce faster and more accurate solutions of a given PDE. It may also be important, in various diagnostic scenarios or in shape optimization, to match the boundary condition exactly.

For the configurations without an X-point, we construct an analytic function *G*(*R*, *Z*) that vanishes on the boundary curve defined by Eq. (8).

### D. Parametric-PINN

So far, each trained PINN model had fixed shape parameters, so the PINN can only produce solutions for one specific boundary configuration. However, it can be helpful to be able to interpolate and extrapolate for various boundary configurations with one pre-trained model. This can allow us to solve inverse problems, such as shape optimization for reactor design. When expanding to the parametric-PINN framework, the input space includes additional variables: strength of the pressure profile *P*, inverse aspect ratio *ε*, elongation *κ*, and triangularity *δ*. The expanded input space allows PINNs to vary the boundary condition loss term from Eq. (13). This extended input space requires modifications to the training and implementation procedures, as depicted in Fig. 4. Instead of only the spatial (*R*,*Z*) coordinates, the input to the PINN is now a six-dimensional vector [*R*, *Z*, *P*, *ε*, *κ*, *δ*] and the neural network output is $ \Psi \xaf ( R , Z , P , \epsilon , \kappa , \delta )$. In principle, one could include other parameters that affect the equilibria as additional variables in the input layer.

For the PDE residuals, we use the Grad–Shafranov equation with varying *P*. Soft boundary conditions are used, in the same way as in the non-parametric case. The implementation of the trained parametric-PINN model can be realized by feeding it with the required six-dimensional inputs, thereby facilitating accurate and efficient solutions to the Grad–Shafranov equation under different plasma scenarios characterized by varied pressure profiles, aspect ratios, elongations, and triangularities.
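A sketch of how the six-dimensional boundary inputs can be assembled as a tensor product of parameter grids and boundary points; the grids below are deliberately coarser than the nine-point grids used in Sec. V D:

```python
import numpy as np

# Coarse illustrative parameter grids (the paper uses 9 points per dimension)
P_grid = np.linspace(0.0, 1.0, 3)
eps_grid = np.linspace(0.12, 0.52, 3)
kappa_grid = np.linspace(1.25, 2.75, 3)
delta_grid = np.linspace(-0.5, 0.5, 3)

tau = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)

blocks = []
for P in P_grid:
    for eps in eps_grid:
        for kappa in kappa_grid:
            for delta in delta_grid:
                alpha = np.arcsin(delta)
                R = 1.0 + eps * np.cos(tau + alpha * np.sin(tau))  # Eq. (8)
                Z = eps * kappa * np.sin(tau)
                # Each row is an input vector [R, Z, P, eps, kappa, delta]
                blocks.append(np.column_stack([
                    R, Z,
                    np.full_like(tau, P), np.full_like(tau, eps),
                    np.full_like(tau, kappa), np.full_like(tau, delta)]))

X_bc = np.vstack(blocks)  # 100 boundary points per parameter combination
```

The tensor-product structure is what drives the rapid growth in the number of boundary collocation points discussed in Sec. V D.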

## V. RESULTS

We first demonstrate the accuracy and effectiveness of PINNs for solving the GSE under two different Solov'ev profiles for four different devices: International Thermonuclear Experimental Reactor (ITER), National Spherical Torus Experiment (NSTX), a spheromak, and an FRC (field-reversed configuration). In addition to these configurations, we also investigated variations with divertors, further expanding the applicability of our method. We benchmark the PINN-based solutions against the analytical solutions in Eq. (10), which serve as a reliable reference for our study.

In addition, we examine the impact of using hard constraints instead of soft constraints in the PINN model. Our findings show that hard constraints decrease both computational cost and relative error: since the boundary condition is satisfied by construction through an analytical representation of the boundary, there is no need to evaluate collocation points on the boundary.

Finally, we present preliminary results exploring parametric PINN, where a single neural network can be trained for various devices with different geometries. Our model suffered from the curse of dimensionality, as the model's larger input space was associated with an exponentially increased number of boundary collocation points. Nevertheless, we were able to train a reasonably effective surrogate for MHD equilibria parameterized by the shape parameters, with promise for inverse problems.

### A. Baseline equilibria with different configurations

The seven different configurations considered in this study are displayed in Table I. By assessing the performance of PINNs in these diverse scenarios, our objective is to demonstrate the robustness and versatility of our approach to solving the GSE in various magnetic confinement devices. As one example, the NSTX geometry with finite beta is shown in Fig. 5.

**TABLE I.** Normalized pressure and shape parameters for the seven configurations.

| Devices | *P* | *ε* | *κ* | *δ* |
|---|---|---|---|---|
| ITER | 0.0 | 0.32 | 1.7 | 0.33 |
| NSTX | 0.0 | 0.78 | 2.0 | 0.35 |
| Spheromak | 0.0 | 0.95 | 1.0 | 0.2 |
| ITER | 1.0 | 0.32 | 1.7 | 0.33 |
| NSTX | 1.0 | 0.78 | 2.0 | 0.35 |
| Spheromak | 1.0 | 0.95 | 1.0 | 0.2 |
| FRC | 1.0 | 0.99 | 10.0 | 0.7 |


The reconstruction errors and timings for each configuration are summarized in Table II. The accuracy of the PINN solutions does not match that of mature numerical equilibrium solvers,^{29} so PINNs should be used for problems where computational speed or inverse solves are more important than obtaining equilibria to high numerical precision.

**TABLE II.** Reconstruction error and timing for each configuration.

| Devices | *P* | Average error (%) | Max error (%) | Training time (s) | Inference time (ms) |
|---|---|---|---|---|---|
| ITER | 0 | 0.015 | 0.162 | 20 | 10 |
| NSTX | 0 | 0.032 | 0.142 | 22 | 15 |
| Spheromak | 0 | 0.027 | 0.204 | 36 | 12 |
| ITER | 1 | 0.021 | 0.136 | 17 | 11 |
| NSTX | 1 | 0.026 | 0.186 | 32 | 13 |
| Spheromak | 1 | 0.075 | 0.325 | 37 | 10 |
| FRC | 1 | 0.057 | 0.214 | 27 | 17 |


### B. Configurations with a divertor

For a diverted configuration with X-point(s) in the separatrix, the same neural network architecture is used, with some modification in the boundary conditions. Similar to the simple equilibria case, we implemented these configurations with a fixed boundary condition. Since we do not have an explicit analytic expression for the boundary shape, we numerically located points on the boundary of the analytic solution for $\Psi $ and used the resulting points for the boundary loss term. As shown in Fig. 6, the accuracy and training time were comparable to the baseline equilibria configurations without an X-point.

### C. Hard vs soft constraints

PINNs for configurations without an X-point were trained using both hard and soft constraints, and the figures of merit mentioned earlier were compared: equilibrium reconstruction error (both average and maximum) and computational time (both training and inference). The results with hard constraints show 2–10 times more accurate equilibrium reconstructions with faster training times. Qualitatively, the error at the boundary is zero by construction of Eq. (17), which can be visually confirmed in Fig. 7. As mentioned before, even though the Cerfon analytic solution does not align exactly with Eq. (8), using it as a reference is justified because the relative error between the two boundary shapes is subdominant, i.e., $\sim 10^{-4}\,\%$.

### D. Parametric-PINNs

We present preliminary work on parametric-PINNs in which a single neural network is trained for various devices with different geometries. A single trained model can interpolate and even extrapolate across boundary configurations. For training, we varied each parameter over the following ranges: $P \in [0.00, 1.00]$, $\epsilon \in [0.12, 0.52]$, $\kappa \in [1.25, 2.75]$, and $\delta \in [-0.5, 0.5]$. For each parameter, we sampled nine points in the given range. The model had four layers with 30 nodes in each layer.

The training for the parametric-PINNs took longer than for the regular PINNs (i.e., four hours vs 20 seconds), and the reconstruction error increased by up to an order of magnitude, as shown in Fig. 8. The inference times are comparable to regular PINNs at around 10 ms. With a 6D input space instead of a 2D input space, the number of collocation points increased exponentially. The number of collocation points scales as *N ^{D}*, where *D* is the number of parameter dimensions and *N* is the number of points sampled in each dimension (assuming, for simplicity, uniform sampling). For example, we have four parameter dimensions (i.e., *P*, *ε*, *κ*, *δ*) with nine points sampled in each dimension (i.e., for triangularity: −0.5, −0.375, −0.25, −0.125, 0.0, 0.125, 0.25, 0.375, and 0.5). So, if we have 100 boundary collocation points (*R*, *Z*) for each configuration (i.e., at a fixed *P*, *ε*, *κ*, *δ*), the total number of boundary collocation points for parametric-PINNs will be $100 \times 9^4 = 656\,100$. For the domain collocation points, we also used 656 100 pseudo-randomly generated points.
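The scaling above is simple to verify:

```python
n_per_dim = 9        # samples per parameter dimension
n_param_dims = 4     # P, eps, kappa, delta
n_rz_boundary = 100  # boundary (R, Z) points per configuration

# N^D configurations, each carrying the same set of boundary points
total_boundary_pts = n_rz_boundary * n_per_dim**n_param_dims
print(total_boundary_pts)  # 656100
```

Adding a fifth parameter at the same resolution would multiply this count by another factor of nine, which is the curse of dimensionality referred to above.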

A preliminary result from the trained parametric-PINN is shown in Fig. 8, which shows the variation in *δ* obtained from a single trained model instead of three different models. The three other input parameters (i.e., *P*, *ε*, and *κ*) can also be varied in the same model without re-training, but those results are not shown here as the trend is similar. The error gradually worsens as we extrapolate further from the trained region.

Parametric PINNs have the potential to be as accurate as regular PINNs with further hyperparameter tuning and other techniques such as hard constraints and adaptive sampling methods.^{30,31} Exploring hard constraints for boundary conditions in parametric PINNs is beyond the scope of this work, but hard constraints could be advantageous by providing a workaround for the exponentially increasing number of collocation points. Furthermore, given the high-dimensional input space, strategies for efficient sampling of collocation points (both interior and boundary points) are also imperative in future work.

## VI. SUMMARY AND CONCLUSIONS

In this study, we have illustrated a comparison between the analytical solution and PINN in various fusion devices, such as ITER, NSTX, a spheromak, and an FRC. Although the PINNs approach may not be the optimal choice for achieving the highest precision compared to modern numerical MHD equilibrium codes, it offers several advantages. Specifically, the PINN approach (1) eliminates the need for complicated PDE discretization and solution schemes, requiring only the coding of the PDE in symbolic language and (2) provides flexibility through automatic differentiation, enabling its use on various geometries and plasma devices without mesh generation.

The potential of PINNs to incorporate physical constraints into the loss function, as well as its flexibility, implies that it could be a valuable tool for MHD equilibrium reconstruction and optimization in a variety of devices with further improvements. PINNs are also known to excel in addressing inverse problems, a topic that should be explored in future work.

Our work makes initial progress in employing PINNs for plasma physics applications, but we acknowledge that there is still room for improvement and further exploration. Future research should concentrate on refining the model by improving sampling methods for collocation points and broadening its applicability to a wider range of magnetic confinement devices. Although our preliminary parametric PINN approach encounters obstacles such as the curse of dimensionality, future work can investigate potential solutions, including separable PINNs,^{32} hard constraints, and adaptive sampling techniques.^{30,31} This work can also be extended to general pressure and current profiles instead of the linear profiles used here, and from there to free-boundary equilibria, shape optimization for tokamaks, and 3D equilibria for stellarators, which have more complicated features such as magnetic islands.

## ACKNOWLEDGMENTS

We would like to acknowledge valuable discussions with Chris Hansen, Andrea Merlo, Timo Thun, and Thomas Sunn Pederson. This work was supported by the U.S. Department of Energy under Contract No. DE-FG02-93ER54197. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at the Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231 using NERSC award FES-ERCAP-mp217-2023.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Byoungchan Jang:** Conceptualization (equal); Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (lead); Software (lead); Validation (lead); Visualization (lead); Writing—original draft (lead). **Alan A. Kaptanoglu:** Conceptualization (equal); Data curation (supporting); Investigation (supporting); Methodology (supporting); Project administration (equal); Software (equal); Supervision (equal); Writing—review & editing (equal). **Rahul Gaur:** Investigation (supporting); Methodology (supporting); Writing—review & editing (supporting). **Shaowu Pan:** Conceptualization (supporting); Writing—review & editing (supporting). **Matt Landreman:** Conceptualization (equal); Funding acquisition (equal); Project administration (equal); Supervision (equal); Writing—review & editing (equal). **William Dorland:** Funding acquisition (equal); Supervision (equal); Writing—review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX: HYPERPARAMETER ANALYSIS

The parameter space of the PINN architecture is explored by varying the activation functions, learning rate, number of iterations of Adam optimizer before using L-BFGS-B, and the size of the model, in order to map various tradeoffs such as between reconstruction error and computational speed. For each set of hyperparameters, we repeated the training ten times with different random seeds for initialization.

The sigmoid function returns values between 0 and 1; it is a classical choice of activation function, although its saturating output can slow training through small gradients.^{33,34}

The hyperbolic tangent (tanh) function returns values that range from −1 to 1. The tanh function is similar to the sigmoid function, but the mean of its output is closer to zero, which helps the model converge during gradient descent.^{33}

The rectified linear unit (ReLU) returns the input directly if it is positive; otherwise, it evaluates to zero.^{35,36} ReLU has become a default choice for many problems due to its simplicity and performance across multiple datasets and tasks. Its major benefit is that it mitigates the vanishing gradient problem, allowing models to learn faster and perform better. The issue with ReLU for this work is that it is not differentiable at the origin and, being piecewise linear, has a second derivative that vanishes everywhere else. Figure 9 clearly shows that models with ReLU do not converge.

Swish attempts to combine the best properties of ReLU and sigmoid functions. It is non-monotonic, and this can help the model learn more complex patterns. It is smooth and differentiable, and unlike ReLU, it does not nullify negative input values.^{37}

PINNs were trained with these four activation functions with a learning rate of 2 × 10^{−2}, with 100 Adam iterations followed by L-BFGS-B, a network size of three hidden layers with 20 nodes each, 100 boundary points, and 1000 domain points. As shown in Fig. 9, swish and tanh perform significantly better than the sigmoid function, and it is inconclusive whether tanh or swish is optimal. Models had a hard time converging with ReLU. This is likely because the Grad–Shafranov residual requires second derivatives of the model, and the second derivative of a piecewise-linear ReLU network is zero almost everywhere.
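The failure mode can be demonstrated directly: away from the kink at zero, ReLU is linear, so the second derivative that the Grad–Shafranov residual depends on is identically zero, while tanh retains curvature. A small finite-difference check:

```python
import numpy as np

def second_derivative(f, x, h=1e-3):
    # Central-difference approximation of f''(x)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

relu = lambda x: max(x, 0.0)

# ReLU: zero second derivative away from the origin, so the PDE residual
# built from it carries no curvature information
assert abs(second_derivative(relu, 1.5)) < 1e-8
assert abs(second_derivative(relu, -1.5)) < 1e-8

# tanh: nonzero second derivative, suitable for second-order PDE residuals
assert abs(second_derivative(np.tanh, 1.5)) > 1e-2
```

The same reasoning applies to any piecewise-linear activation; smooth activations such as tanh and swish avoid the problem.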

Next, the learning rate was varied from 7 × 10^{−1} to 2 × 10^{−4}. The learning rate of the Adam optimizer is considered an important hyperparameter for effective neural network training.^{38} Depending on each model's loss landscape, a well-chosen learning rate enables effective training: if the learning rate is too high, the model may diverge, whereas if it is too low, training may take too long to converge. As shown in Fig. 10, for 1000 Adam steps, the optimal learning rate is around 2 × 10^{−2}, and the model starts to diverge at 7 × 10^{−2}. Note that the specific learning rate values sampled are arbitrary.

Next, we vary the number of training steps taken by the Adam optimizer before L-BFGS-B is applied. PINNs are typically trained with two optimizers in succession: Adam and then L-BFGS-B.^{29,39} Since L-BFGS-B is a second-order optimization method, it converges to minima more accurately. However, L-BFGS-B can rapidly converge to, and get stuck in, local minima; this is why we use Adam before L-BFGS-B. In Fig. 11, our runs show that Adam is helpful for robust training, since a low number of Adam training steps results in high variation depending on the initialization.

Next, we vary the size of the network, first the depth of the network (i.e., the number of hidden layers) and then the width of the network (i.e., the number of nodes for each hidden layer). We find that anything more than three hidden layers does not show a significant increase in the performance of the model as shown in Fig. 12. For the width, anything more than 20 nodes does not show a significant increase in the performance as shown in Fig. 13. This is probably due to the smooth and relatively well-behaved nature of the Grad–Shafranov solutions with linear profiles.

## REFERENCES

*Neural Networks: Tricks of the Trade*