We present a scheme that spatially couples two gyrokinetic codes using first principles. The coupled equations are presented, and a necessary and sufficient condition for ensuring accuracy is derived. This new scheme couples both the field and the particle distribution function. The coupling of the distribution function is performed only once every few time steps, using a five-dimensional (5D) grid to communicate the distribution function between the two codes. This 5D grid interface enables the coupling of different types of codes and models, such as particle and continuum codes, or delta-f and total-f models. Transferring information from the 5D grid to the marker particle weights is achieved using a new resampling technique. The coupling scheme is demonstrated using two XGC gyrokinetic simulations, one for the core and one for the edge. We also apply the coupling scheme to two continuum simulations of a one-dimensional advection–diffusion problem.

## I. INTRODUCTION

The Exascale High-Fidelity Whole-Device-Modeling (WDM) project aims at delivering an application, WDMApp, composed of multiple physics components coupled using first principles. The long-term goal of the WDM project is to combine the physics needed to understand and predict the performance of ITER and future fusion reactor experiments. This paper describes a first step toward developing the high-fidelity WDMApp, by coupling two first-principles gyrokinetic codes simulating the physics in the core and edge of a tokamak plasma. To guarantee that the system of coupled codes provides a solution based on first principles, the coupling between these components must itself be based on first principles. This paper extends our previous study in Ref. 1 to kinetic coupling and illustrates our approach to achieving a first-principles coupling of a core and an edge gyrokinetic code.

The first step addressed in this project is the spatial coupling of two gyrokinetic codes, one for the core (such as GENE^{4,5} or GEM^{6}) and one for the edge (XGC^{7}), computing on separate spatial domains, see Fig. 1. The domain of a coupled simulation is decomposed into three main regions: the core, the edge, and the overlap region, where the core and edge simulations co-exist, see Fig. 1(b). Beyond the overlap region, each simulation has a buffer region of variable width, which ends at an artificial boundary. The overlap region mitigates the error caused by numerical differences between the core and edge simulations in the coupling region.

We present a generalized scheme that enables the coupling of core and edge gyrokinetic simulations. This new scheme combines the coupling of the field (Poisson equation^{1}) with the coupling of the particle distribution function in the buffer regions. The tight coupling of the Poisson equation ensures that the core and edge simulations time advance the same gyrokinetic Vlasov equation. The coupling of the distribution function is necessary for correctly evolving the distribution function near the artificial boundary at the end of the buffer region. This coupling of the distribution function is the main contribution of this new work, going beyond the field coupling in Ref. 1.

This study provides basic insights into the problem of first-principles spatial code coupling. The principles developed here can be applied to other code coupling problems. For instance, we illustrate the scheme in the simpler context of the coupling of one-dimensional (1D) advection–diffusion equations. The scheme can also be applied and adapted to the coupling of 5D and even 6D codes.

To enable the coupling of different codes, we have implemented an abstract interface that communicates the particle distribution function information on a 5D grid. For particle-in-cell (PIC) codes, this requires the transfer of information between a grid and a set of marker particles. This transfer from the 5D grid to the marker particles is based on a moment preserving constrained resampling (MPCR) algorithm.^{8} Verification is performed with the core–edge coupled simulation using the XGC code in both domains.

In comparison with Ref. 1, where we only coupled the Poisson equation, the addition of the coupling of the distribution function in the buffers enables more complex simulations, for example, simulations including a heat source in the core, neutral recycling in the edge, or simply kinetic electrons and the associated turbulent fluxes. The new kinetic coupling scheme allows *f* on one side (core or edge) to be transferred to the other side via the coupling of *f* in the buffer.^{9}

Finally, as we will discuss, this new core–edge coupling scheme enables one to use a much less computationally intensive model in the core (e.g., delta-f, or with fewer impurities) than in the edge. Such an optimization would lead to the speed-up of a coupled core–edge gyrokinetic simulation over one global code calculation. This comes at the price of addressing the computational load balancing problem between the core and edge simulations, since the core–edge coupled simulation requires the use of fewer compute nodes for the simpler delta-f core code, as compared to the total-f edge code.

The remainder of this paper is organized as follows. In Sec. II, we illustrate the problem of coupling *f* in a simple 1D advection–diffusion problem. In Sec. III, we present the full system of core–edge coupled gyrokinetic equations, i.e., the coupling of both the field and the particle distribution function. In Sec. IV, we introduce how we use a resampling technique to transfer the particle distribution function information between core and edge simulations. In Sec. V, we present a core–edge coupled simulation where we use this new coupling algorithm that employs the resampling of the particle distribution function. In Sec. VI, we discuss two optimization scenarios that exercise the core–edge coupling algorithm to achieve a computational speed-up by $2\times$ to $10\times$. Finally, we close with conclusions and future work in Sec. VII.

## II. SPATIAL COUPLING OF 1D ADVECTION–DIFFUSION EQUATIONS AND CONTINUUM SIMULATIONS

In this section, we present the spatial coupling of a 1D advection–diffusion equation,

$$\frac{\partial f}{\partial t} + V\,\frac{\partial f}{\partial x} = D\,\frac{\partial^2 f}{\partial x^2}, \qquad (1)$$

with *V* the advection velocity and *D* the diffusion coefficient.

The reader only interested in the spatial coupling of gyrokinetic codes can directly go to Sec. III. However, this section provides valuable insight into the underlying principles of the scheme, which are more easily elucidated in a simplified context.

The simulations are performed with a continuum code on a 1D radial grid. The spatial derivatives are computed with centered stencils, such that the discrete advection–diffusion equation reads

$$f_i^{t+1} = f_i^t - \frac{V\,\Delta t}{2h}\left(f_{i+1}^t - f_{i-1}^t\right) + \frac{D\,\Delta t}{h^2}\left(f_{i+1}^t - 2 f_i^t + f_{i-1}^t\right), \qquad (2)$$

with $f^0$ being the initial condition, *t* the time index, *i* the spatial index, $\Delta t$ the time step, and *h* the width of a spatial interval. We have verified that the mass of the 1D system, $M=\int \mathrm{d}x\, f(x) := \sum_n f_n$, is conserved to round-off error.
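For concreteness, a minimal sketch of the update of Eq. (2) is given below, assuming periodic boundaries as in the tests of Sec. II D; names and parameters are illustrative.

```python
import numpy as np

def step(f, V, D, dt, h):
    """One explicit Euler step of Eq. (2): df/dt + V df/dx = D d2f/dx2,
    with centered stencils and periodic boundaries."""
    fp, fm = np.roll(f, -1), np.roll(f, 1)  # f_{i+1} and f_{i-1}
    return (f - V * dt / (2 * h) * (fp - fm)
              + D * dt / h**2 * (fp - 2 * f + fm))
```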

### A. Coupling of core and edge advection–diffusion equations

The coupling of core and edge advection–diffusion equations through the overlap region is based on the composition of *f* (see Ref. 1), defined as

$$\check{f} = \varpi f_C + (1-\varpi)\, f_E, \qquad (3)$$

where $f_C$ and $f_E$ are the core and edge “distribution functions,” respectively.

The function ϖ is a monotonic function of the radial coordinate, see Fig. 1(b). It is typically 1 in the core, 0 in the edge, and it varies continuously from 1 to 0 in the overlap. The overlap is the region where both $fC$ and $fE$ co-exist. Beyond this overlap are regions called buffers, which are used to provide suitable boundary conditions to the overlap. We discuss the subtleties of the ϖ function for gyrokinetic simulations in Sec. III B.
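A minimal sketch of a linear blending function and of the composite solution of Eq. (3) follows; the overlap bounds x0 and x1 are illustrative parameters.

```python
import numpy as np

def blending(x, x0, x1):
    """Monotonic w(x): 1 in the core (x < x0), 0 in the edge (x > x1),
    and a linear ramp from 1 to 0 across the overlap [x0, x1]."""
    return np.clip((x1 - x) / (x1 - x0), 0.0, 1.0)

def composite(f_core, f_edge, w):
    """Composite solution of Eq. (3): f_check = w*f_C + (1 - w)*f_E."""
    return w * f_core + (1.0 - w) * f_edge
```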

### B. Necessary and sufficient condition

Given that $f_C$ and $f_E$ are two independent solutions of the advection–diffusion equation (1), it is simple to show by direct substitution that their combination $\check{f}=\varpi f_C+(1-\varpi) f_E$ is also a solution of Eq. (1) if and only if the following necessary and sufficient condition^{10} is fulfilled:

$$\frac{\partial \varpi}{\partial t}\,(f_C - f_E) + V\,\frac{\partial \varpi}{\partial x}\,(f_C - f_E) - D\left(\frac{\partial^2 \varpi}{\partial x^2}\,(f_C - f_E) + 2\,\frac{\partial \varpi}{\partial x}\,\frac{\partial (f_C - f_E)}{\partial x}\right) = 0. \qquad (4)$$

Note that ϖ is a function of the radial direction only, so $\partial\varpi/\partial t=0$. Relaxing this constraint is left for future studies, for example, when the position of the overlap will move in time.

There are two trivial solutions to Eq. (4). The first trivial solution consists in taking $f_C=f_E$ everywhere. The second solution consists in taking ϖ as a fixed constant.

An alternate solution consists in using either one of these two conditions at each given position *x*. In our case, and as shown in Fig. 1(b), $\varpi=1$ in the core and $\varpi=0$ in the edge, such that $\partial\varpi/\partial x=0$ in these regions. Therefore, the condition $f_C=f_E$ is only necessary in the overlap region, where ϖ is varying and $\partial\varpi/\partial x\neq 0$, see Sec. III D for gyrokinetic simulations.

Note that the third term in parenthesis in Eq. (4) requires

$$\frac{\partial^2 \varpi}{\partial x^2}\,(f_C - f_E) + 2\,\frac{\partial \varpi}{\partial x}\,\frac{\partial (f_C - f_E)}{\partial x} = 0,$$

which means that we do not need $f_C - f_E = 0$ in the overlap but only $\partial(f_C - f_E)/\partial x = 0$, which is less restrictive, as it holds for any constant difference, including 0.

Obviously, the coupling of different codes will lead to interpolation errors or marker particle noise effects, such that we will only be able to ensure $f_C - f_E = \epsilon$ with $\epsilon \simeq 0$. In Ref. 1, we discussed (in the case of gyrokinetic simulations) that a larger overlap width, *L*, mitigates the effect of particle noise because it decreases the nonphysical term

$$\epsilon\,\frac{\partial \varpi}{\partial x}.$$

As ϖ is linear, one has

$$\frac{\partial \varpi}{\partial x} = \frac{1}{L},$$

such that this nonphysical term scales as $\epsilon/L$. In the case of the advection–diffusion equations, using a linear ϖ implies $\partial^2\varpi/\partial x^2=0$, which leads to a similar conclusion regarding the width of the overlap. Overall, *ϵ* acts like a random error with similar amplitude at all positions of the overlap. Hence, we take $\partial\varpi/\partial x$ constant. Intuitively, one can understand that if ϖ is not linear, it should at least be a monotonic function. A detailed study of the effect of *ϵ* on the correctness of the solution, i.e., on Eq. (4), is left for a future study.

### C. Synchronizing *f* in buffer regions to provide correct boundary conditions near the overlap

To keep the coupled solution $\check{f}$ correct in the overlap, we need to determine its evolution accurately. For this, we need to correctly push $f_C$ and $f_E$ with the advection–diffusion equation. The difficulty of pushing correctly in the buffers is related to the spatial derivatives that connect $f_i$ to $f_{i+1}$ and $f_{i-1}$ in the discrete advection–diffusion equation (2).

During the first push, the value of *f* at the artificial boundary of the core simulation, which we denote $f_n^C$ with *n* the position of the artificial boundary, cannot be evolved correctly because the core simulation does not have the knowledge of $\check{f}_{n+1}$. The quantity $f_n^C$ thus becomes corrupted after the first push, i.e., $(f^C)_n^{t+1}$ is incorrect. During the second push, the update of $f_{n-1}^C$ is incorrect, because $(f^C)_n^{t+1}$ is corrupted. The error propagates by one interval per time step, because of our choice of stencil.

In this example, the error reaches the overlap boundary after *N* + 1 time steps, given that the buffer extends over *N* radial intervals. Therefore, if we synchronize the buffer every *N* time steps, the error never reaches the overlap, and the coupled solution remains correct.^{11}

The width of the buffer is flexible and should be chosen wisely. In standard parallel domain decomposition within a single code, one would typically choose *N* = 1. However, inter-code communication is subject to latency that may become a bottleneck for the entire simulation. To minimize this latency, it makes sense to send larger messages less frequently, i.e., *N* > 1. The larger the buffer, the less frequent the synchronization. Moreover, the total volume of information exchanged is independent of *N*, because transferring *N* grid points every *N* time steps corresponds to transferring one grid point per time step, on average. More details about the buffer synchronization will be discussed in Sec. III E.
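The following toy loop sketches this synchronization logic, reusing step() from the sketch above. For simplicity, both sides hold the full grid here, whereas in practice each code only owns its own subdomain; all index bounds are illustrative.

```python
def run_coupled(f0, V, D, dt, h, n_steps, N, core_end, edge_start, w):
    """Evolve core and edge copies and re-synchronize the N-cell buffers
    every N steps, so the artificial-boundary error never reaches the
    overlap. Returns the composite solution of Eq. (3)."""
    fC, fE = f0.copy(), f0.copy()
    for t in range(n_steps):
        fC, fE = step(fC, V, D, dt, h), step(fE, V, D, dt, h)
        if (t + 1) % N == 0:
            # The core buffer (its outermost N cells) is reset from the
            # edge solution, and vice versa for the edge buffer.
            fC[core_end - N:core_end] = fE[core_end - N:core_end]
            fE[edge_start:edge_start + N] = fC[edge_start:edge_start + N]
    return w * fC + (1.0 - w) * fE
```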

### D. Numerical 1D simulation and tests

In the following tests, we chose $V=-1/2$, the system size is *L* = 1, a grid with 100 intervals is used in *x*, and a Dirac function is initialized in the edge, $f(x,t=0)=\delta(x-1)$. A timeline of the simulations is shown in Fig. 2, where the reference (black) and coupled (red) simulations are compared in the left column. The Dirac function rapidly diffuses into a “blob”-like structure, which moves inward from the edge to the core and crosses the coupling region. The overlap and buffer regions are identified in the top right subplot of Fig. 2. The buffers are periodically synchronized with the period $T_{\rm sync}$.

We test the influence of the synchronization period by comparing simulations with different synchronization periods, $T_{\rm sync}$. The accuracy of the coupling is estimated in Fig. 3, where we plot, in (a), the time evolution of the system mass,

$$M(t) = \sum_n \check{f}_n(t),$$

and, in (c) and (d), the time evolution of the error,

$$\mathrm{Err}(t) = \sum_n \left|\check{f}_n(t) - f_n(t)\right|,$$

with *f* being the reference solution. Conservation of mass is a necessary but not a sufficient condition. Therefore, we also look at the error, which is a rigorous way of measuring the correctness of the coupled solution.

Let us analyze the results (in blue in Fig. 3) of the coupled simulation carried out with a small enough period of synchronization, $T_{\rm sync}=N\Delta t=0.2$. The mass is conserved, as shown in (a), and the difference between the reference and coupled simulations is zero or round-off error, as shown in (c). Note that the round-off error is the largest when the blob is in the overlap region—this is sensible as the overlap is the only place in which the computational methods differ.

Then, to show that too large a period leads to a bigger error, we gradually increase the synchronization period in three simulations plotted with yellow, red, and purple curves. The mass of the system (a) is not conserved when using too large a synchronization period $T_{\rm sync}$, and the error (d) is big. Note that the error rapidly increases and then slowly decreases in time (to a nonzero value), because the sharp structures due to an inappropriate buffer synchronization diffuse in the system.

For illustration, we plot with a black dashed line in (a) the expected result of a purely advective problem (*D* = 0) during which we do not synchronize the buffer ($T_{\rm sync}=\infty$), and which is also initialized with a Dirac at the edge, $\check{f}(t=0)=\delta(1-x)$. The solution of this incorrectly coupled system reads $\check{f}=(1-\varpi)\,\delta(x-1-Vt)$. The rate of mass loss of *f* inside the overlap region is

$$\frac{\mathrm{d}M}{\mathrm{d}t} = -V\,\frac{\partial \varpi}{\partial x}.$$

The mass moving at constant velocity and without diffusion is linearly decreasing when crossing the overlap, as we chose ϖ linear in *x* in the overlap. This incorrect coupling acts as if a perfect sink was present.

In comparison, the purple curve is the result of the 1D advection–diffusion simulation with $Tsync=\u221e$. The agreement with the black dashed line is very good. The slight difference between the black dashed line and the purple curve is due to the effect of diffusion. The initial Dirac perturbation tends to converge toward a symmetric Gaussian with a front tail that is faster than *V* and a back tail that is slower than *V*.

The results in Fig. 4 estimate the error on the system mass conservation (1 means 100% loss) at the final simulation time $T_{\rm end}=2$ (blue), 8 (red), and 20 (black). The blob passes many times through the overlap because the system is made periodic, with $f(x=0)=f(x=1)$, for the purpose of our tests.

In this figure, the error, “Loss of mass,” is plotted with respect to the period of synchronization $T_{\rm sync}$. The shorter the period, the better the system conserves mass. Three regions can be identified in this figure:

1. $T_{\rm sync} < N\Delta t$ (exact conservation),
2. $N\Delta t < T_{\rm sync} < 0.4$,
3. $T_{\rm sync} > 0.4$.

We recall that *N* is the radial width of the buffer (in number of grid intervals).

When $T_{\rm sync} > 0.4$, region (3), the synchronization period is so large that the blob can pass almost entirely through the buffer without being synchronized. The error is big, and the system rapidly loses most of the blob mass, as shown in (b) after 4 or 10 “turns.”

In between, when $N\Delta t < T_{\rm sync} < 0.4$, the accuracy depends on the diffusion. Increasing the diffusion coefficient *D* increases the loss of mass caused by the too slow synchronization of the buffer, as shown in Fig. 5. When the diffusion is low, we can even use a period larger than $T_{\rm sync}=N\Delta t$ without losing mass (see the yellow curve).

## III. SPATIAL COUPLING OF GYROKINETIC MODELS

In this section, we report on the spatial coupling of gyrokinetic models. For simplicity, we consider a single gyrokinetic species represented by the gyrocenter distribution function *f*. The generalization to multiple species is trivial.

### A. Set of gyrokinetic equations

The gyrocenter distribution function *f* is evolved with the 5D gyrokinetic equation

$$\frac{\partial f}{\partial t} + \dot{\mathbf{X}}\cdot\frac{\partial f}{\partial \mathbf{X}} + \dot{v}_\parallel\,\frac{\partial f}{\partial v_\parallel} = C[f] + S, \qquad (5)$$

where $\mathbf{X}$ is the gyrocenter position, $v_\parallel$ is the parallel velocity, $C$ is the collision operator, and $S$ is a source term. The electrostatic potential $\varphi$ is solved with the gyrokinetic Poisson equation

$$L\varphi = \bar{n}[f], \qquad (6)$$

where the right-hand side is computed with

$$\bar{n}[f](x) = \int \mathrm{d}\mathbf{Z}\;\delta(\mathbf{X}+\boldsymbol{\rho}-x)\,f(\mathbf{X},v_\parallel,\mu). \qquad (7)$$

The gyrokinetic distribution function $f(\mathbf{X},v_\parallel,\mu)$ is defined in gyrocenter space, while the fields $\varphi(x)$ and $\bar{n}(x)$ are defined in particle position space; recall $x=\mathbf{X}+\boldsymbol{\rho}$. This discrepancy needs to be taken care of carefully when spatially coupling two gyrokinetic models. A review of gyrokinetic theory can be found in Ref. 12.

Note that, here, we do not explicitly describe the implementation of the Poisson equation, but we assume that $L$ is fixed in time and does not depend on *f*. This is usually the case in the various implementations of the gyrokinetic Poisson equation, which can have differential or integral forms.^{13–15} Generalization to more elaborate field equations (where the left-hand side is updated) is left for future work, but should be relatively straightforward, as any required coupled field information on the left-hand side should be constructed in the same manner as we do for the right-hand side.

### B. Composite gyrocenter distribution function

Just as in Sec. II, the coupling of core and edge gyrokinetic simulations is based on the use of a composite distribution function^{1}

$$\check{f} = \varpi f_C + (1-\varpi)\,f_E, \qquad (8)$$

with $f_C$ and $f_E$ being the core and edge distribution functions, respectively. The function $\varpi=\varpi(\mathbf{X})$, shown in Fig. 7, is a monotonic function of the gyrocenter position $\mathbf{X}$, because we couple two gyrocenter distribution functions and not two particle distribution functions. Obviously, $\varpi=1$ in the core region and 0 in the edge region. In between these two regions, there is an overlap where $0<\varpi<1$. In addition to the core, overlap, and edge regions, there are two buffer regions. As we will discuss in Sec. IV, these buffer regions are necessary for the consistency of the particle distribution function coupling.

From the definition of the blended distribution function in Eq. (8), we can obtain the coupled equations. For example, the gyroaveraging of *f* in Eq. (7) is computed with

$$\bar{n}[\check{f}](x) = \int \mathrm{d}\mathbf{Z}\;\delta(\mathbf{X}+\boldsymbol{\rho}-x)\left[\varpi\,f_C(\mathbf{X},\vec{v}) + (1-\varpi)\,f_E(\mathbf{X},\vec{v})\right],$$

where we define $\vec{v}=(v_\parallel,\mu)$ to lighten the notation.

The set of coupled equations is summarized in Table I. We will detail those equations in Secs. III C and III D. We use the “check” symbol, $\check{\ }$, to identify coupled quantities. For instance, $\check{f}$ is the coupled distribution function and $\check{\varphi}$ the coupled field.

| Distribution function | $\check{f}=\varpi f_C+(1-\varpi)f_E$ with $\varpi=\varpi(\mathbf{X})$ monotonic |
|---|---|
| Collision (or source $S$) | $C[\check{f}]$ is a linear local operation on $\check{f}$ |
| Charge density (linear and nonlocal) | $\bar{n}[\check{f}]=\int \mathrm{d}\mathbf{Z}\;\delta(\mathbf{X}+\boldsymbol{\rho}-x)\left[\varpi f_C+(1-\varpi)f_E\right]$ |
| GK Poisson (depends on the charge density) | $L\check{\varphi}=\bar{n}[\check{f}]$ is linear in $\check{f}$ |
| GK Vlasov operator (depends on $\check{\varphi}$) | $\frac{D}{Dt}=\frac{\partial}{\partial t}+\dot{\mathbf{X}}[\check{\varphi}]\cdot\frac{\partial}{\partial \mathbf{X}}+\dot{v}_\parallel[\check{\varphi}]\frac{\partial}{\partial v_\parallel}$ is nonlinear ($\varphi\cdot f$) and nonlocal (calculation of $\dot{\mathbf{X}}\cdot\partial f/\partial\mathbf{X}$) |


### C. Coupling of Poisson equation, a unique field

The same electrostatic potential $\check{\varphi}$ is used in both core and edge simulations. This unique electrostatic potential $\check{\varphi}$ is solved by using the composite distribution function, Eq. (8), in the right-hand side of the Poisson equation.^{1} This coupled gyrokinetic Poisson equation reads, in our case,

$$L\check{\varphi} = \bar{n}[\check{f}], \qquad (9)$$

with

$$\bar{n}[\check{f}] = \int \mathrm{d}\mathbf{Z}\;\delta(\mathbf{X}+\boldsymbol{\rho}-x)\left[\varpi f_C + (1-\varpi) f_E\right].$$

A unique field guarantees that the core and edge simulations push the same nonlinear terms in the Vlasov equation, $D f/D t=0$, with

$$\frac{D}{Dt}=\frac{\partial}{\partial t}+\dot{\mathbf{X}}[\check{\varphi}]\cdot\frac{\partial}{\partial \mathbf{X}}+\dot{v}_\parallel[\check{\varphi}]\,\frac{\partial}{\partial v_\parallel}.$$

Solving for a unique coupled field avoids turbulence suppression by nonlinear phase-mixing, see Ref. 1.

The field solve is performed at each Runge–Kutta stage, but it only requires the exchange of 3D information, $n_C$ and $\check{\varphi}$, between the core and edge, see Ref. 1. Therefore, even if these communications are frequent, their computational cost is mostly negligible. The main steps of this coupling algorithm are:

1. The core and edge compute their charge densities.
2. The core simulation sends its density to the edge.
3. The edge simulation solves the field and sends it to the core simulation.
4. Both sides push the Vlasov equation.

All communications are performed using ADIOS technologies.^{16,17}
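A toy sketch of this per-stage data flow is given below. The put() and get() helpers stand in for the ADIOS streams, and the Poisson solver is passed as a callable, so none of the names here correspond to the actual WDMApp or ADIOS API.

```python
import numpy as np

stream = {}  # stand-in for an ADIOS stream between the two executables
def put(tag, data): stream[tag] = np.array(data, copy=True)
def get(tag): return stream[tag]

def field_stage(n_core, n_edge, w, poisson_solve):
    """Steps 1-4: compose the coupled charge density, solve once for the
    unique field on the edge side, and hand the same field to both pushes."""
    put("core_density", n_core)                         # core -> edge
    n_check = w * get("core_density") + (1.0 - w) * n_edge
    put("field", poisson_solve(n_check))                # edge -> core
    return get("field")                                 # used by both sides
```

For instance, `field_stage(nC, nE, w, lambda n: n)` exercises the data flow with an identity stand-in for the field solve.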

### D. Ensuring that the composite distribution function is the solution of the gyrokinetic equation

Now that we are guaranteed to use the same gyrokinetic Vlasov operator $\frac{D}{Dt}$ in the core and edge simulations, we want to be sure that if $f_C$ and $f_E$ are solutions of the gyrokinetic equation, then their composition $\check{f}$ is also a solution of the gyrokinetic equation, such that

$$\frac{D\check{f}}{Dt} = C[\check{f}] + S. \qquad (10)$$

To prove this, we inject the definition $\check{f}=\varpi f_C+(1-\varpi)f_E$ into the gyrokinetic equation, leading to

$$\frac{D}{Dt}\left[\varpi f_C+(1-\varpi)f_E\right] = C\left[\varpi f_C+(1-\varpi)f_E\right] + S,$$

which can be rewritten

$$\varpi\,\frac{D f_C}{Dt} + (1-\varpi)\,\frac{D f_E}{Dt} + (f_C - f_E)\,\frac{D\varpi}{Dt} = \varpi\,C[f_C] + (1-\varpi)\,C[f_E] + S.$$

Having $\frac{Df_C}{Dt}=C[f_C]+S$ and $\frac{Df_E}{Dt}=C[f_E]+S$, then $\check{f}$ is a solution of Eq. (10) if and only if the following necessary and sufficient condition is fulfilled:

$$(f_C-f_E)\left(\frac{\partial \varpi}{\partial t}+\dot{\mathbf{X}}\cdot\frac{\partial \varpi}{\partial \mathbf{X}}+\dot{v}_\parallel\,\frac{\partial \varpi}{\partial v_\parallel}\right)=0,$$

where we used the definition $\frac{D}{Dt}=\frac{\partial}{\partial t}+\dot{\mathbf{X}}\cdot\frac{\partial}{\partial \mathbf{X}}+\dot{v}_\parallel\frac{\partial}{\partial v_\parallel}$.

This necessary and sufficient condition has two trivial solutions: $f_C=f_E$ or ϖ a fixed constant. In our case, we wind up using both conditions, just using either one or the other in different parts of the domain, as we will see now.

For coupling core and edge simulations, the monotonic function $\varpi=\varpi(\mathbf{X})$ is a function of the radial coordinate, *X*, only. Therefore, $\partial\varpi/\partial t=0$ and $\partial\varpi/\partial v_\parallel=0$, which leads to the simplified condition

$$(f_C-f_E)\,\dot{\mathbf{X}}\cdot\frac{\partial \varpi}{\partial \mathbf{X}}=0.$$

This condition was mentioned in Ref. 1 without further details. We now study the implication of this condition in the context of the coupling of core and edge distribution functions.

By design, we have $\varpi=1$ in the core and $\varpi=0$ in the edge, such that $\partial\varpi/\partial\mathbf{X}=0$ in these regions. Therefore, the condition $f_C=f_E$ is only necessary in the overlap region, where ϖ varies with respect to the radial direction and $\partial\varpi/\partial\mathbf{X}\neq 0$.

In the overlap, one actually has $f_C-f_E\simeq\epsilon$ and $\partial\varpi/\partial X|_{\rm overlap}=1/L$, such that one needs to minimize the quantity

$$\epsilon\,\frac{\partial \varpi}{\partial X} = \frac{\epsilon}{L},$$

by minimizing either *ϵ*, with more particles, or by using a large overlap width *L*, see Sec. II B and Ref. 1.

This design allows the distribution function to be incorrect in the buffer, where $\partial\varpi/\partial X=0$, as long as it does not contaminate the overlap region. Indeed, the buffer must provide information that is accurate enough for correctly evolving the distribution function in the overlap region. Mathematically, this means that the derivative of *f* must be well defined on both sides of the interface between the overlap and the buffer. Numerically, this translates into different constraints depending on whether *f* is represented on a grid or with marker particles, as we will discuss now.

### E. The artificial boundary of the buffer region

The problem due to an artificial boundary can be easily pictured with the presence of a heat source in the region outside of the edge simulation domain, as proposed in Fig. 7. Without a proper boundary condition at the artificial boundary, the heat will not enter the edge simulation. A similar issue would occur if a strong flux of particles were generated in the core [during an edge localized mode (ELM) crash, for example]. Note that a particle flux requires kinetic electrons.

In our study, both core and edge simulations have a buffer near the overlap. So, let us focus on the core simulation and its artificial boundary at the end of the buffer. It is trivial to reach the same conclusions for the edge simulation.

To understand how we need to handle the boundary condition, we study the evolution of the distribution function, *f*, at the position of the artificial boundary, see Fig. 8. More particularly, we look at how to handle the term $\dot{\mathbf{X}}\cdot\nabla f$ when pushing the coupled gyrokinetic Vlasov equation. In the following, we first treat the case of a continuum code, and then the case of a PIC code.

#### 1. Continuum code

In a continuum code, pushing *f* with the gyrokinetic Vlasov equation requires a finite-difference approximation for estimating the derivative of *f*, for example, the centered stencil

$$\frac{\partial f_n}{\partial X} \simeq \frac{f_{n+1}-f_{n-1}}{2h},$$

where *n* is the position of the artificial boundary, $f_n$ is the value of the distribution at this radial position, and *h* is the width of a radial interval, see Fig. 8.

In our case, the core code does not have the information $f_{n+1}^C$. Therefore, it cannot estimate $\partial f_n^C/\partial X$ correctly, and it will incorrectly update $f_n^C$ during the first push operation. The information $f_n^C$ in the buffer is then corrupted. During the next push operation, the core code cannot estimate $\partial f_{n-1}^C/\partial X$ correctly because $f_n^C$ is corrupted. It will then incorrectly update $f_{n-1}^C$, which becomes corrupted after two push operations. The error originally located at $f_{n+1}^C$ propagates inward by one radial interval per push operation. If the buffer width is *N* intervals, it takes *N* + 1 push operations to corrupt the overlap region. In this case, the solution consists in updating the buffer distribution function with the one of the edge simulation every *N* time steps. See Sec. II for an example of spatial coupling of continuum codes (1D advection–diffusion equation), and more particularly Sec. II C, where we discuss the synchronization of the buffer of continuum codes.

#### 2. PIC code

In a PIC code, things are more complicated. A PIC code evolves marker particles defined by their position $Z_p$ in the 5D gyrocenter space and their weight $w_p$.^{18} The evolution of the distribution function near the artificial boundary is computed through the motion of the marker particles. The evolution of the distribution in a cell of the artificial boundary is the result of the evolution of incoming and outgoing marker particles, as well as the evolution of marker particles remaining inside. The problem at the artificial boundary is that it misses the particles outside the simulation that move into the cell; see Fig. 8. Their contribution to the evolution of *f* is missing. Therefore, *f* becomes corrupted in this cell.

The propagation of this error is also more delicate in a PIC code. There are multiple ways for this error to propagate radially:

1. The collision operations radially couple the system, because markers have neoclassical orbits spanning multiple radial surfaces with an orbit width *δr*. Therefore, by local collisions along their neoclassical orbits, they propagate the error radially. The corrupted markers, in turn, also propagate this error radially along their neoclassical orbits. We typically compute the collisions every 10 ion time steps in XGC, so synchronizing *f* every 10 time steps should be enough, as long as the buffer width is large enough compared to the largest neoclassical orbit widths *δr*.

2. The turbulent $E\times B$ drift, $\mathbf{v}_E$, which is the same for all species, moves the marker particles radially. Assuming this field is mostly constant between two updates of the buffer, i.e., over a few time steps *N*, the radial width of the buffer needs to be larger than $l=\left(\frac{\nabla\psi}{|\nabla\psi|}\cdot\mathbf{v}_E\right) N\Delta t$, with $\Delta t$ the simulation time step, *ψ* the poloidal magnetic flux that labels the magnetic surfaces, and $\nabla\psi$ its gradient (a numerical sketch of this constraint is given after this list). The fact that the *E* field is gyroaveraged with a Bessel factor $J_0<1$ does not change this conclusion. Even in a configuration of zero (or small) particle flux, the error due to the artificial boundary can propagate radially, because hot and cold particles can have large fluxes in opposite directions.

3. The parallel nonlinearity^{14} can change the orbit of a marker and thus its radial excursion. This can enhance the radial propagation of the error.
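As a back-of-the-envelope illustration of the second constraint, the sketch below evaluates the minimal buffer width *l* for assumed values of the radial $E\times B$ velocity and time step; the numbers are illustrative, not taken from the simulations of Sec. V.

```python
v_E_radial = 1.0e3   # assumed radial E x B drift speed [m/s]
dt = 1.0e-7          # assumed ion time step [s]
N = 10               # time steps between two buffer synchronizations

l = v_E_radial * N * dt  # minimal radial buffer width, l = v_E * N * dt [m]
print(f"buffer must be wider than {l * 1e3:.1f} mm")  # -> 1.0 mm
```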

In the future, one could imagine tracking a few marker particles, initiated near the boundary or outside the domain, in order to detect when they reach the overlap region (or a region just before it). This would permit automatically triggering the buffer synchronization before the overlap is corrupted. This is left for a future study.

### F. Coupling of the distribution functions in the buffer regions

The coupling of the core and edge distribution functions requires their synchronization in the buffers once every several time steps. Indeed, as we have shown in Secs. III D and III E, the distribution function does not need to be correct in the buffer, as long as the error it contains does not reach the overlap. The artificial boundary corrupts the buffer distribution function, but it takes time for this error to propagate through the buffer and reach the overlap. The solution we described consists in synchronizing the distribution function in the buffer region once every *N* time steps, before the error penetrates the overlap region. The period of synchronization, $T=N\Delta t$, needs to be small enough that the error coming from the artificial boundary does not reach the overlap. Obviously, the correct period of synchronization depends on the buffer radial width, as it takes more time for the error to cross a large buffer. A cartoon of the buffer synchronization is shown in Fig. 9. Note that the core and edge buffers can have different widths and periods of synchronization.

We will now discuss, in Sec. IV, how one can couple the distribution function using a resampling technique. Such a technique uses an intermediate 5D grid for representing the information that is transferred between the core and edge simulations. Thanks to this intermediate grid, one can couple different models, or even different types of codes, like PIC and continuum, together.

## IV. COUPLING THE PARTICLE DISTRIBUTION FUNCTIONS WITH A RESAMPLING TECHNIQUE

We now describe how to use the resampling technique for coupling core and edge particle distribution functions in the buffer regions. This resampling technique allows information to be transferred between core and edge simulations. More interestingly, it enables the coupling of codes using different models (delta-f and total-f) or different numerical schemes (PIC and continuum), because an abstract interface is used for communicating the 5D data. Using an intermediate 5D grid is possible because the resampling technique transfers the information from a 5D grid to marker particle weights, and vice versa. The 5D grid data are communicated between the two codes using ADIOS.^{16,17} Note that if we couple two continuum codes, the resampling of marker particles is unnecessary, and it suffices to interpolate the information between the grids (if they are different).

### A. Resampling technique

The original use of the resampling algorithm is to reduce the particle noise by reducing the variance of the marker weights while preserving the correct physics. Indeed, the distribution function represented by marker particles contains a noise component, $\tilde{f}$, such that the marker particles supposed to represent *f* actually represent $F=f+\tilde{f}$. Many marker particles are used to maximize the signal-to-noise ratio $r=f/\tilde{f}$. Nonetheless, the noise component $\tilde{f}$ increases through time, until the simulation is dominated by noise. The resampling scheme addresses this issue by decreasing the noise component $\tilde{f}$ while preserving the physics component *f*.

Preserving the physics contained in *f* while minimizing the weight variance is ensured by solving a quadratic programming (QP) optimization problem under constraints (moment preserving constrained resampling^{8,19}). The quadratic optimization decreases the variance, and the constraints preserve the physics. Note that in Ref. 8 the resampling was implemented for a full-f scheme where the marker weights are fixed. In our work, we use an implementation for the total-f scheme of XGC, where the marker weights vary (like in a delta-f scheme). This was introduced in Ref. 19 and will be presented in detail in a forthcoming publication. We will now describe the resampling algorithm.

The resampling algorithm groups the marker particles into velocity bins $(v_{\parallel i},\mu_j)$ inside each Voronoi cell. A Voronoi cell is defined around each vertex of the XGC 3D position grid. Once the marker particles are sorted on this 5D grid, the algorithm can perform three actions:

If there are not enough markers per bin (excessive noise), new markers are created prior to resampling their weights. This is called “up-sampling.” The QP optimization problem also needs a high enough number of marker particles to be successful.

If there are too many markers per bin, many of them are destroyed and the weights of the remaining markers are resampled. This is called “down-sampling.”

Finally, if the level of noise, measured with the variance of the weights inside the bin, is higher than a given parameter, the weights are resampled. This permits maintaining a “good” signal-to-noise ratio.
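Schematically, the per-bin decision logic reads as follows, with purely illustrative thresholds (not XGC's actual settings):

```python
def bin_action(n_markers, weight_variance,
               n_min=20, n_max=200, variance_max=1.0):
    """Choose among the three actions for one velocity bin."""
    if n_markers < n_min:
        return "up-sample, then resample the weights"
    if n_markers > n_max:
        return "down-sample and resample the weights"
    if weight_variance > variance_max:
        return "resample the weights"
    return "leave the bin unchanged"
```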

When creating or destroying marker particles, one has to be careful to ensure conservation of the phase-space volume by renormalizing the weights, because each marker represents a subvolume of this phase space, and blindly changing their number would be an issue for the total-f scheme (as well as for the classic delta-f scheme).

When resampling marker particles, the algorithm minimizes the variance (noise reduction) of the marker particle weights inside a velocity bin. For this purpose, the algorithm solves a constrained quadratic programming optimization problem (Goldfarb–Idnani, see Ref. 8), with the optimization target being the minimization of the marker variance. Currently, the constraint consists in preserving five moments of the distribution function, which are the mass $M[1]$, the parallel linear momentum $M[v_\parallel]$, the magnetic moment $M[\mu]$, the perpendicular energy $M[\mu B]$, and the parallel kinetic energy $M[v_\parallel^2]$, with

$$M[g] = \sum_{p\,\in\,P(i,j)} w_p\, g(v_{\parallel p},\mu_p),$$

where $P(i,j)$ is the projection on the velocity bin $(v_{\parallel i},\mu_j)$. These bins are used to discretize the velocity space. A minimum number of marker particles per bin is necessary to guarantee the success of the optimization problem. For this purpose, the algorithm up-samples the bins if necessary. More details will be given in a future publication dedicated to the resampling technique for the total-f scheme.
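A minimal sketch of such a moment-preserving resampling for a single velocity bin is shown below. It replaces the Goldfarb–Idnani active-set solver of Ref. 8 with the closed-form solution of the equality-constrained problem (its KKT system), which is sufficient when no inequality constraints are imposed; all array names are illustrative.

```python
import numpy as np

def resample_weights(w, vpar, mu, B):
    """Minimize the variance of the weights in one bin, subject to
    preserving M[1], M[v_par], M[mu], M[mu*B], and M[v_par^2]."""
    A = np.vstack([np.ones_like(w), vpar, mu, mu * B, vpar**2])
    b = A @ w                  # the five moments of the original weights
    n, m = w.size, A.shape[0]
    # Minimize ||w_new - mean(w)||^2 subject to A @ w_new = b:
    # KKT system [[I, A^T], [A, 0]] @ [w_new, lambda] = [mean(w)*1, b].
    K = np.block([[np.eye(n), A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([np.full(n, w.mean()), b])
    return np.linalg.lstsq(K, rhs, rcond=None)[0][:n]
```

Because the mass constraint fixes the mean of the weights, minimizing the distance to that mean is equivalent to minimizing the weight variance.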

### B. Using the resampling technique for coupling and extra difficulty

When coupling the core and edge distribution functions, the algorithm may up-sample, down-sample, or resample the bins. By default, the algorithm resamples all bins without testing their variance, because we want to import the entire distribution function information. Indeed, the goal of resampling for coupling is to transfer the particle distribution function information. We can deactivate the down-sampling option, but the up-sampling option is necessary for increasing the success rate of the resampling operation.

As we will discuss now, applying the resampling technique for coupling the core and edge distribution functions brings extra difficulties compared to its original use for reducing the particle noise.

First of all, the algorithm requires communications between the core and edge simulations. For instance, the resampling of the core distribution function $f_C$ in the core buffer region can be described in three steps: (1) we compute the five moments $M$ by projecting the edge marker particle weights on a 5D grid; (2) we transfer these 5D moments $M$ from the edge to the core; and (3) we use these moments to resample the core marker particles in the core buffer. This communication algorithm is illustrated in the cartoon in Fig. 10. A simple verification test is shown in subplots (a) and (b). This test illustrates the transfer of 5D information and the synchronization of the buffers. In this test, we purposely use different initial conditions in the two simulations shown in (a) and (b), in order to more clearly visualize the buffers after their synchronization. The two buffer rings clearly stand out in (a) and (b).

There is a second extra difficulty when resampling the marker particles of a PIC code while using the 5D information from another PIC code, because the marker particles have different positions and thus a different noise in the two simulations. Indeed, the distribution function represented by particles contains a noise component, $\tilde{f}$, such that the particles supposed to represent *f* actually represent $F=f+\tilde{f}$. If we consider two different sets of particles, labeled 1 and 2, they will have different noise components, $\tilde{f}_1$ and $\tilde{f}_2$, respectively. Then, when resampling the buffer of the set of marker particles 1 with the information of the set of marker particles 2, we will resample set 1 with the information $F_2=f+\tilde{f}_2$. In the end, if successful, set 1 will represent $F_1=(f+\tilde{f}_2)+\tilde{f}_1'$, where $\tilde{f}_1'$ is the noise of marker set 1 on the information $F_2=f+\tilde{f}_2$. This noise is different from the noise $\tilde{f}_2$, which is the marker particle noise on the information *f*. Two problems arise: (1) this extra noise makes the optimization problem more difficult, because of nonphysical grid-scale information, and (2) the noise of the two sets of marker particles somehow accumulates in the buffer. To avoid these two problems, we use a large number of particles per cell for resampling in the buffer, and a large number of particles per cell to reduce the noise on the 5D grid. Studying the use of noise filtering techniques to mitigate the error due to particle noise is left for future work. Filtering can be based on a field-aligned Fourier filter^{20} or on a smoothing function.

### C. Advantages of coupling with the resampling technique

With our abstract interface, the coupling of different models, such as delta-f and total-f, is possible. In a delta-f model like GEM (or the delta-f version of XGC), the distribution is represented with $f=f_a+f_p$, where $f_a$ is an analytical function and $f_p$ is the particle-in-cell representation. In a total-f code like XGC, the distribution function is represented with $f=f_a+f_g+f_p$, where $f_g$ is a grid component that contains the slow relaxation of the distribution function. Coupling these different representations is not a problem, because the moments $M$ can be computed with *f*. Therefore, the component $f_p$, which is the one we want to represent with particles when resampling, is recovered with $f_p=f-f_a$ or $f_p=f-f_a-f_g$, depending on the model we use (a minimal sketch of this bookkeeping is given below). The coupling of codes using a different number of weights or different weight evolution equations is, in principle, also possible with this coupling scheme.

This scheme also enables the coupling of PIC (XGC) and continuum (GENE) codes, because it exchanges distribution function information (the five moments $M$) on a 5D grid. A continuum code like GENE can take the 5D grid data from XGC and update its gyrokinetic grid data. Then it can send its gyrokinetic information on the interface 5D grid to XGC, which in turn resamples its marker particle weights. The bins used for resampling can be represented on any velocity grid with any coordinate system, if it facilitates the coupling. The only constraint is that there needs to be enough marker particles in the bins for reducing noise or enabling the success of the optimization while resampling. Note that different velocity grids can be used for the core buffer and for the edge buffer. Such a study is left for future work.
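The bookkeeping for recovering the particle component under either model can be sketched as follows; the array names are illustrative.

```python
def particle_component(f_total, f_a, f_g=None):
    """Recover the particle-represented component f_p on the 5D grid:
    f_p = f - f_a for a delta-f model, or
    f_p = f - f_a - f_g for a total-f model with grid component f_g."""
    return f_total - f_a if f_g is None else f_total - f_a - f_g
```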

### D. Computational cost of the coupling scheme

There are three aspects of the coupling algorithm to consider when estimating the cost of coupling core and edge simulations:

1. the cost of the Poisson coupling and communications,
2. the cost of the *f*-coupling and resampling operation, and
3. the ratio between the simulation volume of a core–edge coupled simulation and that of a reference simulation.

The cost of the Poisson coupling is mostly negligible. The main constraint is that the coupling scheme needs the communication of the density and potential between the core and edge simulations. Such information is already communicated within a single code using MPI communications between different tasks. The difference in a coupled simulation is that we use ADIOS communications to transfer this information with files or in-memory communications between two different executables.

The cost of the *f*-coupling with resampling is also almost negligible, as long as it is well configured. Indeed, resampling all the velocity bins of all Voronoi cells in the buffer takes a significant fraction of one time step, i.e., roughly $20\%-40\%$, but we resample only once every *N* = 10 time steps, which decreases the total relative cost of the resampling operation down to $2\%-4\%$. The computational time taken for resampling a large buffer is the same as for resampling a narrow buffer, because the XGC code is efficiently parallelized in the radial direction.

In core–edge coupled simulations, part of the plasma volume is simulated twice, because the scheme uses overlap and buffer regions at the interface between the two codes. At first sight, the coupling scheme makes a coupled simulation more expensive, but there are at least two ways of speeding up a simulation thanks to this new scheme. One way is to use different models in the core and edge, with a cheaper model in the core. Another way is to take advantage of the fact that the core and edge turbulence transport timescales are different. We will discuss how to take advantage of these aspects in Sec. VI.

## V. SPATIALLY COUPLED SIMULATIONS WITH THE XGC GYROKINETIC CODE

We now test the new spatial coupling scheme described in Sec. III, by carrying out coupled core–edge gyrokinetic simulations of a simple circular geometry. We couple a core simulation with an edge simulation, both carried out with XGC. The core domain covers the radial region going from $\psi_n=0$ to $\psi_n=0.65$. The edge domain covers the radial region going from $\psi_n=0.3$ to $\psi_n=0.9$. The synchronization of the distribution function in the buffer uses the resampling technique described in Sec. IV. The communication of the 5D grids is performed using the ADIOS^{16,17} framework, which enables both file-based and in-memory communication. Both the core and edge simulations are run with XGC, and we use the same delta-f representation of the perturbation, i.e., $f=f_a+f_p$. Nonetheless, we use a different density of marker particles in the two simulations, and two different radial domains, as already mentioned when commenting on Fig. 7.

A successful coupled simulation is shown in Fig. 11, where we compare a reference simulation (a) with a coupled simulation (b), and we plot their difference (c). We show the $E\times B$ flux of (gyrocenter) particles, Γ, which is a combination of the field $\varphi$ and the distribution function *f*, and is defined by

$$\Gamma = \int \mathrm{d}\vec{v}\; f\;\frac{\mathbf{b}\times\nabla\langle\varphi\rangle_\alpha}{B}\cdot\frac{\nabla\psi}{|\nabla\psi|},$$

where $\nabla\langle\varphi\rangle_\alpha=\nabla\langle\varphi\rangle_\alpha(\mathbf{X})$ is the gradient of the gyroaveraged potential.

An excellent agreement is found between the reference and coupled simulations, with an error of a few percent. Let us emphasize that during this simulation, the buffers have been resampled 80 times (every 10 time steps). This is a difficult problem, because we use significantly narrower buffers compared to our previous study.^{1} This choice makes the coupling of the distribution function in the buffer more critical, and it demonstrates the efficiency of the coupling technique. The buffer regions are shown in Fig. 12.

We will now look in more detail, in Fig. 12, at the role of the dynamic boundary conditions, i.e., the synchronization of the buffer distribution function. We will show that it prevents the corrupted data from penetrating the overlap region, where the solution has to be correct.

In Fig. 12, the core–edge coupled simulation (c) is decomposed into its core simulation (a) and its edge simulation (b). Each simulation is split into four radial regions:

1. the core (respectively, the edge),
2. the overlap,
3. the buffer, and
4. the outside (which is mostly empty).

In this simulation, the outside is not exactly empty, because we let the marker particles follow their neoclassical orbits beyond the simulation boundary. So even if they are created inside the artificial boundary, they may follow an orbit crossing the artificial boundary, or they may simply drift into this region. Research on how to handle this boundary differently is left for a future study. This region is nonphysical and, as we can see, a huge nonphysical flux grows there.

The role of the buffer is to actively apply correct boundary conditions to the overlap, so that the nonphysical dynamics in the empty region does not penetrate the overlap. We discussed the coupling of these equations in detail in Sec. III.

In the bottom left subplot, we zoom in on the buffer of the core simulation. As we can see, the buffer data are reset every 10 time steps; indeed, in this simulation we resample the buffer every 10 time steps. The blue structure in the outside region does not have time to contaminate the overlap region.

To visualize the effect of the resampling, in subplot (e), we zoom in on the flux at surface 118, which is inside the core buffer. In blue, we plot the flux from the reference edge simulation, and with red markers, we plot the flux in the core buffer. Overall, the two curves agree pretty well, thus showing that the synchronization of the distribution function keeps the buffer close to the correct solution. It is also clear, in the zoom of subplot (e), that the core's buffer diverges from the edge simulation, but that the resampling brings it back to the correct solution every 10 time steps.

Finally, in Fig. 13, we show the fluxes of coupled simulations in the middle of the overlap region. If we do not synchronize the buffers, as in (b), the core and edge simulations significantly deviate from each other, whereas with the synchronization of the buffers, as in (a), the two simulations are in very good agreement in the overlap region. Moreover, the fact that the gyrocenter $E\times B$ fluxes of the core and edge simulations deviate significantly from each other in (b), despite both simulations using the exact same potential and thus the same $E\times B$ drift, means that $f_C$ and $f_E$ become significantly different when the narrow buffers are not synchronized.

## VI. DISCUSSION ABOUT SIMULATION SPEED-UP ENABLED BY THE COUPLING SCHEME

We are now interested in discussing how one can use this new core–edge coupling scheme to speed up a simulation. This is an important discussion, as we have shown in Sec. IV D that the XGC–XGC coupled simulation is more expensive if nothing is done. We propose two simple solutions, illustrated in the cartoon in Fig. 14.

The first simple optimization (b) consists in using different models for the core and edge simulations, with a much cheaper gyrokinetic model in the core. This could be achieved for simulations that include impurities, because impurity physics is computationally more expensive in the edge than in the core. For instance, tungsten impurities are modeled with several bundles in XGC, because the ionization state of tungsten ions varies strongly in the edge pedestal. These many bundles, which are necessary in the edge, increase the computational cost of a simulation because of inter-species collisions, with a complexity proportional to the square of the number of kinetic species. When using seven tungsten bundles with deuterium and electrons, the simulation is almost 10 times more expensive than when simulating one tungsten bundle with deuterium and electrons ($9^2/3^2=9$). Therefore, taking advantage of the fact that only one tungsten bundle is necessary in the core, one can speed up a whole device simulation by using different numbers of species in the core and edge simulations. The cartoon in Fig. 14(b) assumes a $\sim 2\times$ factor, as we split the simulation volume in two.

A second optimization consists in taking advantage of the different timescales of the turbulence saturation in the core and edge regions. This optimization is illustrated in the cartoon shown in Fig. 14(c). If the most expensive part of the simulation is the edge, which saturates faster than the core, then we can saturate the core simulation first with the optimized core model and switch on the expensive edge simulation later, by coupling the edge simulation to an already saturated core simulation. For instance, we could run a core simulation for $\sim 9$ ms and then run a coupled simulation for $\sim 1$ ms. This would make the edge simulation $10\times$ cheaper. It is nonetheless not certain that a core–edge simulation would saturate in only 1 ms, as the core and edge physics may interact on a longer timescale. A dedicated study is necessary. More complex algorithms, with multiple iterations between core and core–edge simulations, may be necessary.

By applying these two optimizations enabled by the coupling scheme, we could reach a $10\times$ speed-up when running a coupled core–edge simulation instead of a classical single-code simulation. These scenarios will be investigated in future studies.

## VII. SUMMARY

We have presented a new scheme for the spatial kinetic coupling of two gyrokinetic simulations. The details of the coupled equations were given. Additionally, we presented a necessary and sufficient condition for ensuring accurate coupling. The scheme presented here extends our previous work^{1} to include the sharing of kinetic information between codes. The distribution function is shared in the buffer regions^{21} by using a recently developed resampling technique.^{8,19}

The new coupling scheme is more generally applicable to kinetic problems, because we have now addressed the coupling of both the fields and the particle distribution function. The transfer of the distribution function information between the core and edge simulations is done on an intermediate 5D grid using resampling. This way, the scheme and its implementation can be used to couple different models together, such as delta-f and total-f, and also to couple codes based on different representations of the distribution function, such as PIC (XGC) and continuum (GENE) codes.

The flexibility of the buffer width presents an opportunity for code optimization, because there is a trade-off between the frequency of synchronization and the size of the buffer region where the distribution function is resampled. The resampling of the distribution function can be computationally expensive, so it is advantageous to decrease the frequency of resampling. More work is needed to pick optimal parameters for coupled core and edge gyrokinetic simulations.

Because of XGC's parallelization, the wall clock time taken for synchronizing does not vary much with the buffer width in our preliminary tests. Therefore, using a larger buffer to decrease the frequency of synchronization decreases the overall computational cost. For the simulation we performed, we synchronized the buffer every 10 time steps, and the overall cost of resampling was only a few percent of the total computing time. This optimization problem will be carefully studied in the framework of the CODAR co-design project.^{22} Future work also includes the coupling of electromagnetic simulations, by applying the same approach to Ampère's equation as is used for the gyrokinetic Poisson equation.

Finally, we discussed how the coupling scheme can enable the speed-up of a whole volume simulation. A first optimization consists in using different models for the core and edge simulations, with a much cheaper gyrokinetic model in the core. In such a scenario, there is an important load balancing problem between the core and edge simulations. A second optimization consists in taking advantage of the different saturation times of turbulence in the core and edge regions. If the most expensive part of the simulation is the edge, whose physics saturates faster than the core, then we can saturate the core simulation first with the optimized core model and switch on the expensive edge simulation later, by coupling the edge to an already saturated core simulation.

A next step in the project concerns the coupling of GENE and XGC, as well as GEM and XGC, with this new scheme for coupling the distribution function. GENE and XGC have been carefully cross-verified,^{23} and recent progress on their coupling based on the Poisson equation coupling is reported in Ref. 24. Similarly, progress on the Poisson coupling of GEM and XGC is reported in Ref. 25.

## ACKNOWLEDGMENTS

This work was supported by the U.S. DOE under the Exascale Computing Project (No. 17-SC-20-SC).

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05–00OR22725.

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02–05CH11231.

This manuscript has been authored by Princeton University under U.S. Department of Energy Contract No. DE-AC02–09CH11466. The publisher, by accepting the article for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

## DATA AVAILABILITY

Data used in preparing this article will be made available in the PPPL Theory Department ARK, see Ref. 26.