Two existing particle-in-cell gyrokinetic codes, GEM for the core region and XGC for the edge region, have been successfully coupled with a spatial coupling scheme at the interface in a toroidal geometry. A mapping technique is developed for transferring data between GEM's structured and XGC's unstructured meshes. Two examples of coupled simulations are presented to demonstrate the coupling scheme. The optimization of GEM for graphics processing unit is also presented.

## I. INTRODUCTION

The study of burning plasmas will be the next frontier for fusion research with the upcoming ITER in the next decade. Based on several fusion experiments as well as theory and simulation, ITER is expected to reach plasma regions beyond present and previous experiments. It is, thus, important to make theory and simulation reliable to predict the plasma behavior of ITER so as to help ITER fulfill its potential through well-confined, disruption-free, and burning operation scenarios. Since the behavior of a fusion plasma in tokamaks involves multiple spatial and time scales, a Whole Device Model (WDM) is critical for understanding and predicting the operation of the existing fusion devices and future facilities such as ITER.

Recently, an ambitious simulation project on high-fidelity whole device modeling of magnetically confined fusion plasma was launched by the Department of Energy (DOE) Exascale Computing Program (ECP). It aims at delivering a first-principles-based computational tool that simulates both the neoclassical transport and anomalous turbulent transport from the core to the edge of a tokamak. To enable such simulations, it has been proposed that the existing gyrokinetic codes^{1} be coupled together in order to improve computational efficiency. This strategy takes advantage of the complementary nature of different applications to build a more advanced and efficient whole device kinetic transport kernel. A core gyrokinetic code, such as GEM^{2,3} or GENE,^{4,5} is well optimized to model small amplitude, weak-gradient driven turbulence for plasmas within the closed-flux surface, while an edge gyrokinetic code such as XGC^{6–9} is well designed to model large amplitude fluctuations, strong gradient driven turbulence, and neoclassical physics for plasmas crossing the magnetic separatrix and in the open magnetic field region up to the divertor plates. Prior to the core-edge coupling with different codes, some prerequisite steps have already been made, such as the cross-verification^{10} of the global gyrokinetic codes GENE, ORB5,^{11} and XGC, as well as developing the spatial coupling scheme^{1} by the XGC(core)–XGC(edge) coupling. Cross-verification of GEM and XGC is also a prerequisite for the target coupling physics problem.

In this paper, we present a numerical scheme for coupling GEM and XGC, where GEM is used for the core and XGC is used for the edge. GEM is optimized for the simulation of core micro-turbulence and transport in two main aspects. First, the *δf*-particle-in-cell (PIC) algorithm is used to take advantage of the low fluctuation amplitude in the core. Second, searching particles is easier due to the structured grids, which are constructed based on the field line following the feature of the closed flux surface in a tokamak. On the other hand, XGC is originally designed for the simulation of edge turbulence and transport. Due to the complexity of the magnetic field configuration near the X-point, the limiter, and the divertor, the cylindrical coordinates $(R,Z,\zeta )$ and unstructured triangular mesh are used to describe the particle motion. Due to the high fluctuation amplitude in the edge, the required number of particles in the edge is also much larger than in a core simulation.

In principle, XGC alone can be used to simulate the whole volume. In such a simulation, the unstructured grids in the core region need to obey similar complex rules as the edge region requires, and the core simulation needs to adopt the same time-consuming algorithm as in the edge to deal with the complexity of the grids. Furthermore, it is difficult to reduce the particle number in the core region, as the $\delta f$-PIC algorithm requires that the core and edge regions in a XGC-alone simulation to have an equivalent number of marker particles for the same volume in the spatial space. Because a large number of particles are required to resolve the large amplitude in the edge, the total number of particles in the core region would far exceed what is necessary for resolving the low-amplitude fluctuations in the core. This will greatly enhance the required computing resources. A scheme based on coupling of two different codes should be pursued. Such a coupling scheme would fully utilize code features that are specialized to the core and edge plasma conditions, and this is the motivation of the present work.

Here, we present an implementation of the spatial coupling scheme,^{1} which has been previously demonstrated with the XGC–XGC coupling. In the XGC–XGC coupling, two XGC simulations are coupled as if they are independent codes. The simulations are coupled only through an overlap region, where charge density and potential are exchanged. The same mesh is used in both simulations for the purpose of developing the coupling scheme. No data mapping is needed. In the coupling scheme for the GEM–XGC coupling, the data mapping becomes essential. The poloidal cross section is divided into three regions, the core region of GEM, the edge region of XGC, and an overlapping region included in both codes. Time-stepping of the distribution functions is achieved by pushing the composite distribution function independently in each code, while the electric potential is solved solely in XGC and then passed to GEM. Specifically, in the gyrokinetic electrostatic GEM–XGC coupling simulations, the complete process is split into four steps: (1) GEM and XGC push particles independently; (2) GEM's charge density is sent to XGC; (3) then, the XGC's field solver will solve the gyrokinetic Poisson equation for the potential field using the composite charge density; and (4) GEM will receive the potential field from XGC, and so, GEM and XGC can push particles with the updated field.

Since GEM uses the field-line-following magnetic-flux coordinates and XGC adopts field-line-following cylindrical coordinates $(R,Z,\zeta )$, a parallelized mapping technique between GEM and XGC is developed to convert between these different coordinates. Within the coupling framework,^{12} the software ADIOS^{13} is used to exchange data through files or memory, which enables the cost of data exchange being only a very small fraction of the total cost of simulations. As a part of the ECP project, the GPU optimization for GEM in Summit is also achieved, which shows a good scaling for the particle evolution.

We present two examples to demonstrate the coupling scheme. In the first example with the Cyclone Base Case (CBC),^{14} the simulation domains of both codes are within the separatrix. A reference simulation is first carried out with XGC alone, to be used as a gauge for the coupled simulation. Good agreement is obtained between the coupling and reference case for both linear and nonlinear simulations. In the second example, the edge region outside the separatrix is also included in a DIII-D-like plasma, with the edge region simulated by XGC and the core region by GEM. Good agreement between the reference simulation and the coupled simulation is achieved.

This paper is organized as follows. A brief description of the established coupling scheme^{1} is given in Sec. II. Data mapping between GEM and XGC is presented in Sec. III. Two examples of coupled simulation are presented in Sec. IV. The GPU optimization of GEM on Summit is presented in Sec. V, and conclusion is given Sec. VI.

## II. SPATIAL COUPLING SCHEME

In this section, we briefly describe the coupling scheme. We refer readers to Ref. 1 for more details. The radial domain is divided into three regions, the core, the edge, and an overlapping region. Let $\delta fC$ denotes the perturbed particle distribution function in the core code (GEM), which is also defined in the overlapping region. The radial size of the overlap region should be larger than the radial turbulence correlation length. Once initial values are assigned to $\delta fC$, it is evolved in GEM with the $\delta f$-PIC method. Similarly, $\delta fE$ denotes the distribution in the edge region and is evolved in XGC. The two codes share the same gyrokinetic set of equations. In this study, XGC is operated in the $\delta f$ mode. The two distribution functions are combined into a composite one,

where $\varpi =\varpi (\psi gc)$ is a continuous connection function varying between 0 and 1 with respect to the gyrocenter's radial position *ψ _{gc}*, with $\varpi =1$ in the core, $0<\varpi <1$ in the overlapping region, and $\varpi =0$ in the edge. The electrostatic potential $\varphi \u02c7$ is obtained by solving the global Poisson equation over the whole simulation domain, with the ion charge density obtained from the composite distribution $\delta f\u02c7$,

where the ion polarization density and the adiabatic electron response are denoted by the operator $\xa3$. The ion charge density is given by

where ** X** is the gyrocenter position, $\rho $ is the Larmor vector,

**is the particle position vector in the configuration space, $v||$ is the particle speed scalar parallel to the magnetic field,**

*x**μ*is the magnetic moment,

*α*is the gyroangle,

*B*is the magnetic field,

*m*is the particle mass, and $\delta [\u2026]$ is the Dirac operator. For consistency, it is crucial to have ϖ defined in gyrocenter space, $\varpi =\varpi (X)$, before applying the guiding-center pull back operation, $\delta [X+\rho \u2212x]$. Equation (3) has important consequences (see more details in Ref. 1). In addition, GEM uses the usual Dirichlet boundary condition for $\varphi $. However, in the coupled simulations, the potential is provided by XGC. No boundary condition is applied at the boundary of the overlap region. At the boundary of the buffer region,

^{1}calculation of the electric field from the provided potential is limited to grid points inside the buffer. For grid points on the buffer boundary, the fields are set to zero, and particles crossing this boundary are put back inside the buffer. Our first step is to ensure that the real-time data exchange between the two codes is accurate, thereby verifying the particle pushing routines in both codes, and sufficiently fast so that the coupling scheme is practical. Even though the two codes are based on the same set of gyrokinetic equations, they use different discretization methods. Not only simulations from individual codes should be verified, but also the coupled simulation and the reference simulation should be cross-verified.

We use the $\delta f$ method to evolve the perturbed distribution, $\delta f=f\u2212f0$, with *f*_{0} being the Maxwellian distribution with radius-dependent density and temperature. It is to be noted that, in the present work, sources and collisions are not included. The marker particle distribution *g* is usually chosen to be different from *f*_{0}. In this paper, two methods are used to evolve the weight distribution: the conventional *δf* scheme^{15} and the direct *δf* scheme.^{7} In the conventional *δf* scheme, the weight is obtained by advancing the ordinary differential equation for the weight, which is derived from the gyrokinetic equation for *δf*. In the direct *δf* method, use is made of the fact that the total distribution is constant along the particle trajectory between collisions; hence, *δf* is simply the difference between the total distribution at the initial particle phase-space location and the equilibrium distribution evaluated at the current location.

We define the weights $wg\u2261\delta f/g$ and $w\u2261\delta f/f$ at the particle phase space coordinate. Defining $w0\u2261f/g$, the weight for the conventional *δf* scheme evolves according to

Here, $w0\u2261f0(Xt=0,t=0)/g(Xt=0,t=0)$, where $Xt=0$ is the initial position. In Eq. (4), the invariance of *f* and *g* has been used. This is the consequence of the collisionless assumption. In a collisionless plasma, the total distribution is constant along the particle trajectory.

In the direct weight evolution method for the *δf* scheme, the weight is given by

where Δ represents the finite difference along the particle trajectory. This scheme is used for both GEM and XGC when the coupling simulations include the open field line region.

## III. MAPPING BETWEEN GEM AND XGC

In this section, we describe the numerical procedure for data exchange. A general equilibrium magnetic field in a tokamak is given by

where *ψ* is the poloidal magnetic flux, *ζ* is the toroidal angle, $\zeta \u0302$ is the basic vector along the *ζ* direction, *R* is the major radius, and $f(\psi )=RB\zeta $, with $B\zeta $ the toroidal component of the magnetic field. GEM uses the field-line-following magnetic-flux coordinates (*x*, *y*, *z*),

where $r=(Rmax\u2212Rmin)/2$ is a flux surface label, *R _{max}* and

*R*are the major radii of the two points where the flux surface intersects the horizontal plane containing the magnetic axis,

_{min}*r*

_{0}is a reference radius,

*R*

_{0}is the major radius of the magnetic axis,

*q*is the safety factor and $q0=q(r0)$,

*θ*is the usual poloidal angle, and $tan\u2009\theta =Z/(R\u2212R0)$ such that the cylindrical coordinates $(R,Z,\zeta )$ are right-handed,

*θ*is the straight-field-line poloidal angle,

_{f}$q\u0302(r,\theta )=B\xb7\u2207\zeta B\xb7\u2207\theta $, and $q(r)=B\xb7\u2207\zeta B\xb7\u2207\theta f$. The simulation domain in GEM is $r\u2208[rmin,rmax],\u2009\theta \u2208[\u2212\pi ,\pi )$. The outer boundary *r _{max}* lies inside the separatrix. The inner boundary

*r*cannot be the magnetic axis, as the Jacobian of the field-line-following coordinates becomes singular at

_{min}*r*= 0. The simulation domain in the toroidal direction is often chosen to be a toroidal wedge, $\zeta \u2208[0,2\pi /nw)$, and correspondingly, the exact

*y*-coordinate becomes

Then, the simulation domain is given by $(0,Lx)\xd7(0,Ly)\xd7(0,Lz)$ with $Lx=rmax\u2212rmin,\u2009Ly=r0q02\pi nw,\u2009Lz=2\pi q0R0$, which is divided into $nx\xd7ny\xd7nz$ equally sized cells in (*x*, *y*, *z*) coordinates.

Given a numerical equilibrium such as the files from the the tokamak equilibrium reconstruction code EFIT (Equilibrium Fitting),^{16} GEM first constructs two numerical arrays to represent the arbitrary flux-surface shapes. These are the major radius $R(r,\theta )$ and the cylindrical coordinate $Z(r,\theta )$, on a set of structured $(r,\theta )$ grids. A typical grid number for this purpose is a few hundred grids in each dimension. Quantities such as the equilibrium magnetic field magnitude $B(r,\theta )$, the straight-field-line poloidal angle *θ _{f}*, and geometric quantities such as $\zeta \u0302\xb7\u2207r\xd7\u2207\theta $ are, then, computed and represented with the same $(r,\theta )$ grids. This approach of representing an axisymmetric tokamak equilibrium within the separatrix is completely general and is used even if the equilibrium allows for an analytic representation, as in the case of a model equilibrium with circular concentric flux-surfaces. Whenever the value of a quantity at an arbitrary point (e.g., the value of

*θ*corresponding to an XGC grid) is needed, it is obtained with a 2D linear interpolation in the $(r,\theta )$-plane.

_{f}XGC uses the field-line following cylindrical coordinates $(R,Z,\zeta )$ with regular grids in *ζ* and unstructured triangular mesh in the (*R*, *Z*) plane that is twisted along the magnetic field lines to be identical at each *ζ* grid. The node points of the poloidal mesh reside in a set of magnetic flux surfaces and follow the magnetic field lines. Within one closed flux surface, each node point for the (*R*, *Z*) grid corresponds to a specific *θ* or *θ _{f}*. The mapping from (

*R*,

*Z*) to GEM's $(r,\theta )$ coordinates is obtained by first finding

*θ*from $tan\u2009\theta =Z/(R\u2212R0)$ and then finding

*r*with a searching algorithm. Hence, $(R,Z,\zeta )$ can be written as $(x,\theta ,\zeta )$ or $(x,\theta f,\zeta )$. The same toroidal wedge is used in both XGC and GEM.

To simplify data exchange, the radial grids in GEM are chosen to coincide with those magnetic surfaces used in XGC's mesh, which reside within the GEM's radial domain (radial overlap domain). When, for instance, the potential $\varphi $ of XGC is sent to GEM and re-defined on the GEM grids, interpolation is needed only within the flux-surface, not in the radial direction. Data exchange, then, becomes a two-dimensional interpolation problem for each flux-surface. We can use the coordinates either $(\theta f,\zeta )$ or $(y,\theta f)$ for the 2D interpolation. Although GEM's *z*-grids are more directly related to angle *θ*, we find *θ _{f}* to be more convenient for mapping purposes, as it simplifies the relation between

*y*and

*ζ*.

In the remainder of this section, the subscript “$GEM$” is used to indicate the value of a coordinate at a $GEM$ grid, and the subscript “$XGC$” is used to indicate an $XGC$ grid. Thus, a grid in GEM is written as $(x,yGEM,\theta f_GEM)$ or, equivalently, $(x,\theta f_GEM,\zeta GEM)$. A grid in XGC is written as $(x,\theta f_XGC,\zeta XGC)$ or, equivalently, $(x,yXGC,\theta f_XGC)$. The subscript for *x* is suppressed, as the *x*-grids of the two codes coincide in the GEM domain.

The 2D interpolation is accomplished as two consecutive 1D interpolations. In the case of the ion charge density, which is to be transferred from GEM to XGC, one starts with $n\xaf(x,yGEM,\theta f_GEM)$ and first performs a 1D cubic interpolation in the *θ _{f}*-dimension to obtain $n\xaf(x,yGEM,\theta f_XGC)$ and then performs a linear interpolation in the

*y*-dimension to obtain $n\xaf(x,yXGC,\theta f_XGC)$. We have verified that the cubic interpolation in the first step can be replaced with linear interpolation without changing the result. The order of the two interpolations is mandated by the requirement that, in constructing the intermediate array, no information is lost. The grids in both GEM and XGC take advantage of the field-aligned feature of micro-turbulence but differ in the choice of the field-aligned coordinate. In GEM,

*θ*(or

*θ*) is used as the field-line-following direction such that the number of

_{f}*y*-grids in GEM is far greater than the number of

*θ*-grids, $Ny_GEM\u226bN\theta _GEM$. In XGC, the toroidal angle is chosen to be the field-line-following direction such that $N\theta _XGC\u226bN\zeta _XGC$. By performing the first interpolation in the

*θ*-direction, i.e., from GEM's

*θ*-grids to the more dense XGC's

*θ*-grids while holding the y-coordinate fixed, one simply assigns values to more points along the field line. The size of the intermediate array is much larger than the original array, by a factor of $N\theta _XGC/N\theta _GEM\u226b1$. No information is lost. If we perform the first interpolation in the

*y*-direction while holding

*θ*fixed, the intermediate array will be $n\xaf(x,yXGC,\theta f_GEM)$, where $yXGC$ are the values of y corresponding to XGC's

_{f}*ζ*-grids, much fewer than the

*y*-grids in GEM. Data information in the

*y*-direction is lost, and the second interpolation in the

*θ*-direction will be meaningless.

The interpolation procedure is slightly more complicated in the case of $\varphi $, which is transferred from XGC to GEM. Starting from $\varphi (x,\theta f_XGC,\zeta XGC)$, the first step is a 1D interpolation in the field-line-following direction (to be explained below), by which we obtain the potential defined on the same $\theta f_XGC$ but a set of new *ζ*-grids, $\varphi (x,\theta f_XGC,\zeta (yGEM,\theta f_XGC))$, with the new *ζ* grids given by

Now, we regard $\varphi (x,\theta f_XGC,\zeta (yGEM,\theta f_XGC))$ as $\varphi (x,yGEM,\theta f_XGC)$, as both refer to $\varphi $ at the same spatial location. The second step is a cubic interpolation in the *θ _{f}*-direction to obtain $\varphi (x,yGEM,\theta f_GEM)$, the desired representation for GEM.

The first step of interpolation along the field line can be understood as follows. Starting from $\varphi (x,\theta f_XGC,\zeta XGC)$, a naive 1D interpolation in the *ζ*-direction while holding *θ _{f}* fixed cannot be used to obtain $\varphi (x,\theta f_XGC,\zeta (yGEM,\theta f_XGC))$, as the number of the

*ζ*-grids in XGC is not sufficient for resolving the mode structure along a line of constant

*θ*. The mode structure at a fixed poloidal angle can vary strongly between two neighboring toroidal grids, especially for high-n modes. The mode is adequately resolved by the

_{f}*ζ*-grids of XGC only if viewed along the field line. Consequently, interpolation along the field line must be used to obtain $\varphi $ at the arbitrary toroidal angle. The procedure is schematically shown in Fig. 1. For brevity, we write the

*θ*grids in XGC as

_{f}*θ*

_{0},

*θ*

_{1}, and

*θ*

_{2}in Fig. 1. The field line passing through the point $(x,yGEM,\theta 1)$ crosses the two nearby poloidal planes in the XGC mesh, at $\theta left,\u2009\theta right$, respectively, with

*θ*between

_{left}*θ*

_{0}and

*θ*

_{1}and

*θ*between

_{right}*θ*

_{1}and

*θ*

_{2}. The potential at $(x,yGEM,\theta 1)$ will be obtained with linear interpolation, using the potential values at $(\theta left,\zeta 0)$ and $(\theta right,\zeta 1)$, which are in turn obtained with linear interpolation using values on the XGC grids. For instance, $\varphi (\theta left,\zeta 0)$ is obtained from $\varphi (\theta 0,\zeta 0)$ and $\varphi (\theta 1,\zeta 0)$. The combined linear field-line-following interpolation is given by

where $w\zeta 0=(\zeta 1\u2212\zeta y)/(\zeta 1\u2212\zeta 0),\u2009w\zeta 1=1\u2212w\zeta 0,\u2009w00=(\theta left\u2212\theta 1)/(\theta 1\u2212\theta 0),\u2009w01=1\u2212w00$, and $w10=(\theta right\u2212\theta 2)/(\theta 2\u2212\theta 1)$, respectively.

In the mapping operation, the meshes of both GEM and XGC are generated based on the EFIT files,^{16} where the Grad–Shafranov equilibrium equations are solved for both the cyclone-based case and the more realistic DIII-D equilibrium. In addition, the coefficients for linear field-line-following interpolation $(x,\theta f,\zeta )\u2192(x,y,\theta f)$ and for the linear interpolation $(x,y,\theta f)\u2192(x,\theta f,\zeta )$ are calculated once at the beginning of the coupled simulation and saved for subsequent time steps. The calculation for all the flux surfaces proceeds independently and is fully parallelized.

It is worth noting that conventional PIC simulations require the same grid interpolation scheme for both the particle deposition and the field calculation for good conservation properties, e.g., no particle self-force. So, some caution is in order in picking and verifying the best interpolation schemes in the two codes and in the mapping algorithm.

## IV. GEM–XGC COUPLED SIMULATIONS

In this section, we present two examples of coupled GEM–XGC simulation that demonstrate the feasibility of the coupling scheme. In each example, a reference XGC-alone simulation is performed for comparison. Collisions and particle/energy source terms are not included in these simulations. Electrons are assumed to be adiabatic. Since the time histories of the heat flux and the growth rate are compared, the same initial condition must be implemented in both codes. Expressed in the GEM coordinates, the initial condition for the ion weights is chosen to be

where A is the amplitude for each toroidal mode and $nky$ is the number of toroidal modes for initialization. The initial condition for the perturbed distribution is $\delta f=wf0(Xt=0,t=0)$. In XGC, the weight of a marker is initialized accordingly, taking into account XGC's marker distribution.

### A. Cyclone base case

For simplicity, we first adopt a simple equilibrium without the open field region. This equilibrium, defined as case V in Ref. 17, is based on the experimental DIII-D discharge underlying the Cyclone Base Case (CBC).^{14} In this case, the major radius $R0=1.68\u2009m$, the minor radius $a=0.59\u2009m$, and the magnetic field on axis $B0=2.09\u2009T$. A deuterium plasma is considered with the same temperature and density profile (Fig. 2) for both ions and electrons.

We first perform a benchmark of GEM and XGC with a series of single-mode linear simulations. We use the plasma equilibrium previously used to benchmark GENE and XGC.^{10} The results for mode frequency and growth rate from all three codes are plotted in Fig. 3, with data from GEM, GENE, and XGC shown in blue, red, and green, respectively. Good agreement is obtained among the three codes. Data from GENE and XGC in Fig. 3 are obtained from the benchmark of Merlo *et al.*^{10} For the linear growth rate, the three codes agree to within 8% for all modes considered. The frequencies agree to within 3.5%. The remaining difference is likely due to different ways of discretizing the quasi-neutrality equation. GEM starts with a spectral form of the ion polarization density in the *x*–*y* plane, which is Fourier transformed in *y* and then discretized with a pseudospectral method,^{3} resulting in a 1D equation in *x* for each $(ky,z)$. XGC, on the other hand, uses a Padé approximation for the ion polarization density, which is discretized directly in the 2D poloidal cross section for each *ζ*-grid. As long as the gyrokinetic ordering is valid, both methods are good approximations, but they are not identical. Better agreement can be expected if the same Poisson solver is used in both GEM and XGC. We have tested this hypothesis for the *n* = 24 mode, for which the difference in the growth rates between GEM and XGC is 1.1%. If the XGC Poisson solver is used in the GEM simulation, the difference in the growth rate decreases to 0.2%. In order to eliminate the difference due to the Poisson solver, in the following coupled simulations, only the XGC Poisson solver is used.

In the coupled simulations, the “core” region occupied by GEM is $\psi /\psi x=[0.0191,0.897]$ $\u2009r/a=[0.1,0.9]$; the “edge” region occupied by XGC is $\psi /\psi x=[0.0191,1]$ or $r/a=[0.1,1]$; the overlap region is $\psi /\psi x=[0.333,0.462]$ or $r/a=[0.45,0.55]$, where the density and temperature gradients are the steepest and the ion-temperature-gradient (ITG) modes are most unstable; the buffer regions are $r/a=[0.55,0.9]$ in GEM and $r/a=[0.1,0.45]$ in XGC. The purpose of the buffer region^{18} is to reduce the effect of an artificial boundary condition for the particles at the edge of the overlap region. For convenience, the sizes of the buffer region in this paper are chosen to be large. In a realistic coupled simulation, they should be smaller, e.g., on the order of particle orbit width. We choose the instability growth region to reside in the overlap region to demonstrate the robustness of our coupling scheme. The continuous connection function in the coupling region is chosen as a linear function $\varpi (r/a)=(0.55\u2212r/a)/0.1$.

The GEM simulations described in this section use a grid resolution of $nx\xd7ny\xd7nz=240\xd7256\xd764$ and 80 marker particles per cell; the four-point nearest-grid-point averaging technique^{19} is employed.

For the XGC simulation presented here, the spatial grid is composed of $nRZ\xd7n\zeta =355\u2009890\xd732$, where *n _{RZ}* represents the number of grid points in a poloidal plane and $n\zeta $ is the number of the poloidal plane in the toroidal direction. There are 50 maker particles per mesh, and a four-point adaptive gyroaveraging matrices

^{18}in the $\mu $ grid is used.

We select toroidal number *n* = 24 for the coupling exercises of the linear simulations. For initial conditions, $A=10\u22126,\u2009nky=1$. The frequency is identical comparing the XGC reference and the GEM–XGC coupling case. The growth rate shows only 1% difference (Fig. 4). The mode structure of electrostatic potential also shows a good agreement (Fig. 5).

In the nonlinear simulations, a toroidal wedge *n _{w}* = 3 is adopted, so the corresponding toroidal numbers $n=3,\u2009\u20096,\u2009\u20099,\u2009\u200912\u2026$ are selected. For initial conditions, $A=10\u22124,\u2009nky=16$. The poloidal structures agree well in the linear stage (Fig. 6). The difference is apparent in the nonlinear stage (Fig. 7). Such differences are expected to arise due to the accumulation of small differences in the numerical methods on the grid scale, such as the mesh size and shape and the interpolation methods in the two codes. Despite the differences in the details of the mode structures, the statistical macroscopic quantities, such as the heat flux, should show better agreement and can be used for verification. This is shown in Fig. 8. The peak magnitude of the radially averaged heat flux (Fig. 9) has only 4% difference between the reference simulation and the coupled simulation.

### B. DIII-D like geometry

The second example is a deuterium plasma in DIII-D like geometry. Artificially made-up temperature and density profiles are shown in Fig. 10 with the steep edge gradient. The direct *δf* scheme^{7} is used in both GEM and XGC. In the direct *δf* scheme, the deviation of the distribution from the local Maxwellian from ion banana excursion is included. This effect is important in the edge. In principle, neoclassical transport can be simulated, provided that collisions are also included in simulations. Here, for demonstrating the coupling scheme, collisions are again omitted.

In this case, the core region of GEM is $\psi /\psi x=[0.0104,0.945]$, the edge region of XGC is $\psi /\psi x=[0,1.11]$, and the overlap regions are $\psi /\psi x=[0.652,0.803]$ ($r/a=[0.76,0.86]$), which is around the pedestal top and $\psi /\psi x=[0.042,0.096]$ ($r/a=[0.2,0.3]$) in the axis region; the buffer regions are $\psi /\psi x=[0.803,0.945]$ and $\psi /\psi x=[0.0104,0.042]$ in GEM and $\psi /\psi x=[0.096,0.652]$ in XGC. The connection function in the coupling region is $\varpi (r/a)=(0.86\u2212r/a)/0.1$ in the pedestal region and $\varpi (r/a)=(r/a\u22120.2)/0.1$ on the axis region. GEM uses a grid of $nx\xd7ny\xd7nz=258\xd7512\xd7256$ and 512 marker particles per cell. An eight-point regular gyroaveraging technique^{15} is employed. XGC uses a grid of $nRZ\xd7n\zeta =503\u2009016\xd732$, 1024 maker particles per mesh vertex, and an eight-point adaptive gyroaveraging matrices in the $\mu $ grid. Here, more points are used for gyroaveraging in both codes, as the flux-surfaces are strongly shaped.

Toroidal modes with $n=2,4,6,8\u2026$ are selected for nonlinear simulations with the toroidal wedge number *n _{w}* = 2. For initial conditions, $A=10\u221212,\u2009nky=1$. The radially averaged heat flux (Fig. 11, left) of XGC reference and GEM–XGC coupling shows similar evolution shapes with a decay toward the end due to the lack of the heat source. The time-accumulated turbulent heat flux, $\u222bdt\u27e8Qi\u27e9$, is also shown (Fig. 11, right). At the end of the simulation, there is less than 4% difference in the time-accumulated heat flux. We consider the agreement to be satisfactory between two gyrokinetic simulations with different discretization methods. The time evolution of heat flux (Fig. 12) through the radial range of XGC reference simulation and the coupled simulation indicates that the heat flux starts from the steep gradient region where the ion-temperature-gradient (ITG) driven turbulence originates and propagates across the overlap region.

## V. OPTIMIZATION OF GEM FOR GPUS

As a PIC code, GEM mainly consists of three parts: particle push and shift, deposition, and field solving. The primary domain decomposition in GEM is along the field line, the z-direction in the field-line-following coordinates. The box length in *z* is divided into *n _{z}* equally spaced grids, and each $nx\xd7ny$ grid is assigned one or multiple message passing interface (MPI) tasks. An MPI task holds only those particles with the z-coordinate within its domain. After each particle pushing step, particles are sorted into the corresponding domains; this is the shift operation. Deposition is the operation whereby the density and current are obtained by depositing the corresponding particle quantities onto the grids. Particle push and deposition consist of loops over all the particles. At present, GEM optimization focuses on these operations. The starting point is the original pure MPI version.

### A. Porting strategy

Optimization on the GPU hardware architecture is achieved using the OpenACC directive-based heterogeneous parallel programming model, with few modifications to the original source code. In PIC codes, a significant amount of time is spent on indexing operations (integer arithmetic) and data movement (shift). Consequently, the main strategy of GPU-optimization using OpenACC is to offload the calculation processes from the central processing unit (CPU) to GPU and minimize data transfer. Particle arrays (for spatial coordinates, parallel velocity, weight, etc.) are moved from CPU memory to GPU memory, and all particle loops are performed on GPUs. Particle pushing and deposition all proceed continuously on GPU without data movement. This requires that the field arrays also reside in GPU memories.

After the push step, some particles will move across the domain boundaries. The shift operation transfers such particles to the new domains. At present, such data movement must be accomplished on CPUs and entails particle data movement between the GPU and CPU. Such data movement must be minimized. The shift comprises of two steps: the initialization step and the actual data movement. Initialization constructs sorted pointers to particle holes, as well as buffers for the sending and receiving processes. It is called only once per shift operation. The actual movement performs nonblocking MPI communication, removes holes, and restructures the particle arrays. Particle arrays need to be updated between the host and device during this procedure and are called for each particle array. In optimizing the particle shift, it is found unnecessary to construct the pointers to particle holes in a strictly increasing order. Only the search for nonhole indices from the end of the array needed to be in sequential order. Consequently, the initialization step has been modified to allow more loops to run on GPU, and only the data of the actually transferred particles need to be updated between the CPU and GPU.

### B. Performance results

The performance of the optimized code has been tested on Summit's multi-GPU nodes.^{20} Each Summit compute node has six GPUs and 42 physical CPU cores, and GEM uses 42 MPI ranks per node, where the Multi-Process Service (MPS) is enabled. In reporting the performance data, we exclude the execution time on code initialization (memory allocation, particle loading, etc.) and focus on the time-stepping loop that evolves the particle coordinates and the electrostatic fields.

Since the main focus of the present GEM code optimization is particle push and the density and current deposition, our starting point for the performance tests is the case with fixed grid size $(128\xd764\xd742)$ and a number of particles, which increase proportionally to the number of MPI tasks (in the case of the CPU version). The execution time is given in Table I. The speedup of the MPI + OpenACC version over the original MPI-only version is about 40 times. The scalability of different computational modules in both MPI + OpenACC and MPI-only codes is shown in Fig. 13. One can infer from Fig. 13(a) that a satisfactory scalability of the MPI + OpenACC code has been obtained in these cases.

Number of nodes . | Number of particles . | Original MPI-only . | OpenACC + MPI . | Speedup . |
---|---|---|---|---|

128 | 34 406 400 × 128 | 70.228 | 1.731 | 40.571 |

256 | 34 406 400 × 256 | 77.423 | 1.764 | 43.891 |

512 | 34 406 400 × 512 | 75.418 | 1.752 | 43.047 |

1024 | 34 406 400 × 1024 | 78.194 | 1.842 | 42.451 |

Number of nodes . | Number of particles . | Original MPI-only . | OpenACC + MPI . | Speedup . |
---|---|---|---|---|

128 | 34 406 400 × 128 | 70.228 | 1.731 | 40.571 |

256 | 34 406 400 × 256 | 77.423 | 1.764 | 43.891 |

512 | 34 406 400 × 512 | 75.418 | 1.752 | 43.047 |

1024 | 34 406 400 × 1024 | 78.194 | 1.842 | 42.451 |

On the other hand, when both the number of grids and the total particle number are changed, we find that scalability of the MPI + OpenACC code degrades in the large node number regime (see Table II). In Fig. 14, a detailed analysis of the scalability of different modules is shown. The results suggest that the poor scalability comes from the field solver. Nevertheless, the figure clearly shows the excellent scaling of particle push and deposition. The poor scaling of the field solvers can be easily understood. The gyrokinetic Poisson equation is first Fourier transformed in the toroidal direction and then discretized in radius using a pseudospectral method,^{3} leading to $\u223cO(nynz)$ dense linear systems. With domain decomposition in the *z*-direction, each MPI task solves $\u223cO(ny)$ linear systems, with the size of each matrix being about $(nx\xd7nx)$. Thus, the execution time on the field solvers increases rapidly with the number of grids. To solve this bottleneck, the cuFFT^{21} and cuBLAS^{22} library routines will be used in the future. Notice that for the present XGC–GEM coupling scheme, the field is solved only in the XGC code.

Number of nodes . | Number of particles . | Grid size . | OpenACC + MPI . |
---|---|---|---|

1 | 17 203 200 × 1 | 4 × 16 × 42 | 0.781 |

128 | 17 203 200 × 128 | 256 × 32 × 42 | 1.156 |

256 | 17 203 200 × 256 | 256 × 64 × 42 | 1.433 |

512 | 17 203 200 × 512 | 256 × 128 × 42 | 1.911 |

1024 | 17 203 200 × 1024 | 256 × 256 × 42 | 3.319 |

Number of nodes . | Number of particles . | Grid size . | OpenACC + MPI . |
---|---|---|---|

1 | 17 203 200 × 1 | 4 × 16 × 42 | 0.781 |

128 | 17 203 200 × 128 | 256 × 32 × 42 | 1.156 |

256 | 17 203 200 × 256 | 256 × 64 × 42 | 1.433 |

512 | 17 203 200 × 512 | 256 × 128 × 42 | 1.911 |

1024 | 17 203 200 × 1024 | 256 × 256 × 42 | 3.319 |

## VI. SUMMARY

We have presented a numerical scheme to couple the two existing particle-in-cell codes GEM and XGC through a configuration space interface region. In the coupling scheme, time stepping of the distribution functions is achieved by pushing the composite distribution function independently in each code, while the electrostatic potential is solved over the whole volume using the gyrokinetic Poisson solver in XGC. The two codes share the 3D density and field data across a spatial interface. A grid mapping technique is developed for data exchange. The GEM–XGC coupling scheme is demonstrated with two examples, which verify the accuracy of the mapping and the applicability of the spatial coupling scheme. However, in all the simulations presented here, the heat flux tends to zero because no heat source is included in these simulations. Future work will involve including sources and sinks to maintain a steady state. Optimization of GEM for architectures with GPUs is also presented. Similarly, the spatial coupling of GENE and XGC with this spatial coupling scheme will be reported in a separate publication.^{23}

In the present work, the coupling scheme requires only the exchange of ion density in the overlapping region, and no kinetic particle distribution-function (PDF) information is exchanged. The scheme works well for ITG turbulence in which there is no radial particle flux across the coupling interface. A more complete coupling scheme with kinetic electrons may require the exchange of kinetic PDF information between the coupled codes. Although the particle distribution function contains all the kinetic information in simulation, it is usually not explicitly constructed in PIC simulation. Rather, moments of the distribution function, such as the density and current, are needed for solving the field equations. They are obtained from the discrete particles directly via deposition to the mesh nodes. Exchange of kinetic information between XGC and GEM can take the form of exchange of marker particles. Depending on the marker density in each code, re-sampling^{24} of the exchanged particles might be needed. Based on the re-sampling, a new coupling scheme,^{25} where the marker particle information is exchanged by 5D grids, has been developed and will be reported for publishing. Meanwhile, XGC simulates the whole domain in current coupled code implementation, which is for convenience. In future implementations, the XGC simulations can still include a core buffer region but then have no particles in the core, which is demonstrated in the new coupling scheme.^{25} Thus, the GEM–XGC coupling will earn gains in computation.

## ACKNOWLEDGMENTS

This research was supported by the Exascale Computing Project (No. 17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory (ORNL) and resources of the National Energy Research Scientific Computing Center (NERSC), which are supported by the Office of Science of the U.S. Department of Energy under Contract Nos. DE-AC05–00OR22725 and DE-AC02–05CH11231, respectively.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.