Spherical tokamaks have many desirable features that make them an attractive choice for a future fusion power plant. Power-plant viability is intrinsically related to plasma heat and particle confinement, and this is often determined by the level of microinstability-driven turbulence. Accurate calculation of the properties of turbulent microinstabilities is therefore critical for tokamak design; however, the evaluation of these properties is computationally expensive. The considerable number of geometric and thermodynamic parameters and the high resolutions required to accurately resolve these instabilities make repeated use of direct numerical simulations in integrated modeling workflows extremely computationally challenging and create the need for fast, accurate, reduced-order models. This paper outlines the development of a data-driven reduced-order model, often termed a surrogate model, for the properties of microtearing modes (MTMs) across a spherical tokamak reactor-relevant parameter space, utilizing Gaussian process regression and classification, techniques from machine learning. These two components are used in an active learning loop to maximize the efficiency of data acquisition, thus minimizing computational cost. The high-fidelity gyrokinetic code GS2 is used to calculate the linear properties of the MTMs (the mode growth rate, frequency, and normalized electron heat flux), which are core components of a quasilinear transport model. Data cross-validation and direct validation on unseen data are used to ascertain the performance of the resulting surrogate models.
I. INTRODUCTION
Spherical tokamaks possess a number of potential advantages when compared to conventional aspect ratio devices. These include improved vertical stability properties as well as the ability to access high β regimes of operation, allowing a more compact and ultimately cheaper reactor design. However, in these scenarios, electromagnetic instabilities, such as kinetic ballooning modes (KBM)^{1} and microtearing modes (MTM),^{2} are typically excited, both of which can drive radial heat propagation away from the plasma core and potentially degrade plasma performance.
In the case of MTMs, the turbulence distorts and reconnects the equilibrium magnetic field, resulting in microscopic magnetic island structures; these can overlap and cause field stochasticity, which is believed to be responsible for large radial electron heat transport.^{3–6} Consequently, being able to reliably predict the critical gradients at which MTMs are expected to be driven unstable has considerable value for the design of performant reactor scenarios.
The most appropriate framework for describing turbulent modes in the core of a tokamak is the gyrokinetic framework.^{7,8} Running nonlinear gyrokinetic codes routinely within integrated models for design space exploration and turbulence modeling is very costly, particularly for electromagnetic modes like MTMs, which exhibit extremely fine radial structure and consequently require very high numerical resolution. Even local linear gyrokinetic simulations are currently prohibitively expensive in the context of integrated modeling over confinement timescales, being many orders of magnitude slower than currently used reduced turbulence models such as TGLF.
An alternative to repeated direct use of high-fidelity models in integrated workflows is to use a training dataset to build a data-driven reduced-order model of the simulation outputs using machine-learning-based techniques. This is an especially attractive approach in the case of MTMs due to the lack of sufficiently accurate physics-based reduced-order models valid for the highly shaped plasmas found in spherical tokamaks (STs); however, progress is being made in the development of such reduced models.^{9,10}
Machine learning (ML) and uncertainty quantification (UQ) techniques have already been extensively used in transport and turbulence modeling,^{11–16} including with fully nonlinear plasma turbulence simulations in a limited way,^{17,18} to create predictive surrogate models. Furthermore, neural networks trained on experimental databases^{19} and on reduced microtearing mode models^{20,21} have been developed showing speedups of almost two orders of magnitude over conventional models, and active learning pipelines have been shown to be able to create new training data efficiently.^{22,23} However, the physical simplifications and restricted parameter spaces used in these works, which help reduce the number of simulations required and the cost of those simulations, make them unsuitable for use in the predictive design of a spherical tokamak power plant. Consequently, new reduced-order models are required that are suitable for STs, both in terms of the parameter space explored and the fidelity of the model used.
Local linear gyrokinetic codes^{24–26} permit evaluation of the frequency, growth rates, normalized radial fluxes, and the eigenfunctions of unstable modes for a given set of input parameters characterizing the geometry and profiles at a particular flux surface. These are the building blocks of a reduced transport model.^{27} In this paper, we aim to emulate these quantities using a database of gyrokinetic simulations of spherical tokamak power-plant-relevant microtearing modes. To do this, Gaussian process regression (GPR) is performed on an initially unexplored seven-dimensional parameter space. A Gaussian-process-based classifier is constructed, which assists in the expansion of the training set by learning the stability manifold of the parameter space, allowing active learning techniques^{22,28–30} to be employed, which focus sample points on regions with high probability of instability to MTMs. The combined model is designed to investigate the parameter space as efficiently as possible while building an accurate regression-based model, harnessing the advantage of GPs that a prediction comes with an easily understandable confidence interval (CI) derived from the process variance.
The remainder of the paper is outlined as follows: Sec. II briefly introduces the gyrokinetic model and the simulations used to generate training data; Sec. III defines the parameter space over which models will be built and Sec. IV describes the modes of interest and the selection criteria used to identify them. An overview of Gaussian processes is given in Sec. V prior to a description of how they are used to build efficient models for MTMs in Sec. VI. Validation of the resulting models is performed in Sec. VII, and finally, conclusions and ideas for future development are provided in Sec. VIII.
II. GYROKINETICS
The gyrokinetic Vlasov–Maxwell system of equations is widely used to describe the turbulent component of the electromagnetic fields and particle distribution functions of the species present in a plasma. Here, we will provide a short summary for understanding, but for a more comprehensive description of the model, see Refs. 7, 31, and 32. To reduce the number of dimensions and the resolutions required to solve the equations, the following ordering assumptions are imposed, which have been observed to be valid for turbulence in the core of tokamak plasmas:

The Larmor radius $\rho_a = v_\perp/\Omega_a = v_\perp m_a/(q_a B)$ for species a is small compared to the length scale over which the background kinetic profiles and magnetic equilibrium vary. The Larmor radius normalized to the system size (characterized by the minor radius a), $\rho_* = \rho_a/a \ll 1$, is thus considered a small parameter.

The turbulent fluctuations in a confined plasma are highly anisotropic due to the presence of a strong magnetic field, i.e., the perpendicular wavelengths of interest are significantly smaller than those parallel to the magnetic field line, $k_\perp \gg k_\parallel$.

The fluctuations of the field perturbations against the background profiles are small: the electrostatic potential $\varphi \approx (T/q_a)\rho_*$ and the electromagnetic potential $A_\parallel \approx (T/(q_a v_{th}))\rho_*$.

The frequencies of interest of the turbulence and transport (ω) are small compared to the ion cyclotron frequency, $\omega \ll \Omega_a = q_a B/m_a$.
Solution of the gyrokinetic Vlasov–Maxwell system enables an understanding of the stability of tokamak plasmas and, in the nonlinear case, the saturated turbulent state and the resulting heat, particle, and momentum fluxes. Nonlinear gyrokinetics represents one of the highest-fidelity plasma turbulence modeling tools accessible on current supercomputers and has been extensively validated against experiment^{33–36} in a variety of regimes. Its large computational expense, however, makes it impractical for use in integrated modeling simulations over confinement timescales, and as such, quasilinear models such as TGLF^{37,38} and QuaLiKiz,^{39} among others, are used for this purpose. These models infer the nonlinear regime from the properties of the underlying linear modes, which are then combined with estimates of the mode saturation amplitudes, based on the underlying saturation mechanism, to estimate the nonlinear fluxes. However, the parameter regime of interest in this paper lies outside the range of validity of such models, which consequently do not encapsulate the MTM-driven transport thought to be of primary importance in high-β spherical tokamaks.
A. GS2
GS2 is an implicit gyrokinetic code, which enables solution of the gyrokinetic Vlasov–Maxwell system of equations using HPC resources, and is used in the generation of training data for this study.^{26} It was specifically developed to study low-frequency turbulence in magnetized plasmas. The code incorporates many physics effects in the local limit, including kinetic electrons, electromagnetic effects (both $A_\parallel$ and $B_\parallel$ effects), collisions,^{40,41} full general geometry, and rotation.
It is typically used to assess the microstability of plasmas produced in the laboratory and to calculate key properties of the turbulence that results from instabilities. Linear microinstability growth rates, frequencies, and normalized fluxes can be calculated on a wavenumber-by-wavenumber basis using a flux tube geometry in the ballooning limit.^{42}
GS2 has the option of being run as an initial value solver where the fastest growing mode for a given parameter set is found, or as an eigenvalue calculation where all the eigenvalue pairs for a parameter set can be found. The latter method has the advantage that all unstable modes can be found, including those that are subdominant. This is particularly useful for MTMs as they are in many cases subdominant to other microinstabilities, while still having a significant effect on heat transport. However, for ease of computation, the former option is chosen, and usage of the eigenvalue solver is left for a future expansion of the model. The usage of an eigenvalue solver adds significant complexity to the classifier component of the model as will be described further in Sec. V B.
III. INPUT PARAMETER SPACE DEFINITION
In this study, a seven-dimensional parameter space is modeled, comprising the variables considered to be those to which the MTMs are most sensitive. These parameters are as follows:

The safety factor q: The ratio of toroidal to poloidal turns a field line of interest makes as it winds around a flux surface.

The magnetic shear $\hat{s} = \frac{r}{q}\frac{dq}{dr}$: Parameterizes the radial variation of the safety factor, where r is the radial coordinate.

The normalized radial gradient of the electron density $a/L_{n_e} = -\frac{a}{n_e}\frac{\partial n_e}{\partial r}$, where a is the minor radius of the last closed flux surface.

The normalized radial gradient of the electron temperature $a/L_{T_e} = -\frac{a}{T_e}\frac{\partial T_e}{\partial r}$.

The ratio of the plasma pressure to the magnetic pressure $\beta = \frac{2\mu_0 n k_B T}{B^2}$. Here, the magnetic field is the toroidal magnetic field at the center of the flux surface at the height of the magnetic axis.

The electron–ion collision frequency $\nu_{ei}$: The characteristic rate at which electrons scatter off ions.

The normalized binormal wavevector $k_y$ (in units of $1/\rho_s$).
The ranges of the seven parameters are described in Table I and represent plausible limits within which a future spherical tokamak power plant may be expected to operate. Other parameters, for instance those defining the flux surface shaping, are fixed at representative values based on STEP scenario SPR008,^{43} except for the ion temperature gradient, which is set to 0 to suppress ion temperature gradient (ITG) and trapped electron (TEM) modes; these are outside the scope of this study, and the ion temperature gradient is known to have a negligible effect on the MTM.^{44} The values of the fixed variables are given in Table II. The binormal mode wavevector is restricted to the range $0 \le k_y \rho_s \le 1$ for this initial model, as this is the range of modes that dominate the heat flux spectra. An analysis of the mode stability for some of the flux surfaces considered here can be found in Refs. 45 and 46.
TABLE I. Ranges of the seven varied input parameters.

Variable | Name | Min. | Max.
q | Safety factor | 2 | 9
$\hat{s}$ | Magnetic shear | 0 | 5
$a/L_{n_e}$ | Electron density gradient | 0 | 10
$a/L_{T_e}$ | Electron temperature gradient | 0.5 | 6
β | Ratio of plasma to magnetic pressure | 0 | 0.3
$\nu_{ei}$ | Electron–ion collision frequency ($c_s/a$) | 0 | 0.1
$k_y$ | Binormal mode wavevector (units of $1/\rho_s$) | 0 | 1
TABLE II. Values of the fixed parameters.

Variable | Name | Value
R | Normalized major radius (m) | 1.84
a | Minor radius (m) | 1.47
κ | Elongation | 2.66
$d\kappa/d\rho$ | Radial elongation gradient | −0.25
τ | Triangularity | 0.34
$d\tau/d\rho$ | Radial triangularity gradient | 0.25
Δ | Shafranov shift | −0.44
$T_e/T_i$ | Electron/ion temperature ratio | 0.944
$\beta'$ | Radial gradient of β | −0.64
$\theta_0$ | Ballooning angle | 0.0
IV. MICROTEARING MODES AND SELECTION RULES
One of the characteristics of MTMs is their parity with respect to the coordinate along the magnetic field line. Specifically, the parallel magnetic vector potential ($A_\parallel$) has even parity about the outboard midplane (θ = 0), while the electrostatic potential ($\varphi$) has odd parity. Examples of this are seen in Fig. 1.
In order to ensure that GS2 predominantly finds microtearing modes, which in many cases will be subdominant to ITGs, TEMs, and KBMs, the desired parity can be enforced on the solution during the simulation. This is done by setting the solution at negative θ to be minus the solution at positive θ. Even-parity solutions can be forced in a similar way by setting the solution at negative θ to be identical to that at positive θ. This filters out modes with the parity seen in the lower panel of Fig. 2 and allows subdominant microtearing modes to be found even when running the code as an initial value solver. In general, for electromagnetic modes such as these, the magnitude of the magnetic perturbation $A_\parallel$ is comparable to or larger than that of its electrostatic counterpart $\varphi$ ($|A_\parallel| \ge |\varphi|$), as seen in Fig. 1.
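As an illustration, this parity enforcement amounts to a mirror operation on the eigenfunction array. The following is a minimal NumPy sketch, not GS2's internal implementation; it assumes the eigenfunction is stored on a θ grid symmetric about θ = 0 with an odd number of points:

```python
import numpy as np

def enforce_parity(f: np.ndarray, parity: str) -> np.ndarray:
    """Mirror the positive-theta half of an eigenfunction onto the
    negative-theta half: f(-theta) = +f(+theta) for even parity,
    f(-theta) = -f(+theta) for odd parity. The theta = 0 midpoint is
    left untouched in this simple sketch."""
    out = f.copy()
    mid = len(f) // 2                       # index of theta = 0
    sign = 1.0 if parity == "even" else -1.0
    out[:mid] = sign * out[mid + 1:][::-1]  # overwrite negative-theta half
    return out

phi = np.random.default_rng(0).normal(size=9)   # toy eigenfunction on 9 theta points
phi_odd = enforce_parity(phi, "odd")            # odd parity, as for the MTM potential
```

Applying the odd version to $\varphi$ and the even version to $A_\parallel$ reproduces the tearing-parity structure described above.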
Figure 3 shows the distribution of a subset of the converged simulations in normalized mode frequency and tearing parameter space. The horizontal lines represent the lower and upper bounds of the normalized frequency within which an MTM is expected to sit, and the vertical dashed line represents the lower cutoff on the tearing parameter. A final visual inspection of the structure of the eigenfunctions removes points that are unconverged or that lack sufficient length along the field line to fully resolve the mode; this rejects the non-MTM (blue) points within the acceptance limits in Fig. 3. These physics-based selection rules are used to set a logical parameter indicating whether a converged instability is considered to be an MTM or not. It is upon this parameter that the classifier component of the model is trained.
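In code, these selection rules reduce to simple threshold tests. The sketch below is hypothetical: the frequency band and tearing-parameter cutoff (`freq_band` and `tearing_min` here) are placeholder values, since the actual bounds are shown in Fig. 3 rather than stated numerically in the text:

```python
def is_mtm(frequency: float, tearing_parameter: float,
           freq_band: tuple = (-3.0, -0.5), tearing_min: float = 0.9) -> bool:
    """Label a converged mode as an MTM if its normalized frequency lies
    within the expected band AND its tearing parameter exceeds the lower
    cutoff. All threshold values are illustrative placeholders."""
    lo, hi = freq_band
    return lo <= frequency <= hi and tearing_parameter >= tearing_min
```

The boolean this returns plays the role of the logical parameter on which the classifier is trained (before the final visual inspection of the eigenfunctions).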
A. GS2 simulation setup
The MTMs will be examined using local linear gyrokinetics on a single core flux surface with $\rho = r/a = 0.67$, where a is the radius of the last closed flux surface. A Miller parameterization^{48} was used to model the equilibrium, with the parameters outlined in Table II. Three kinetic species are modeled: deuterium, tritium, and electrons ($n_e = n_D + n_T$, with deuterium and tritium in a 50/50 mix). The simulations are run for a single binormal mode, $k_y$. We neglect compressional magnetic field ($B_\parallel$) effects in order to reduce the computational complexity of the model and suppress some types of kinetic ballooning mode, which are destabilized by compressional magnetic field perturbations. Microtearing modes are essentially unaffected by their inclusion.
MTMs require significant numerical resolution due to being particularly elongated along the magnetic field line in ballooning space, equivalent to having a very fine radial structure. A single set of numerical inputs is used in this work for toolchain simplicity; these are summarized in Table III. $n_{\mathrm{period}}$ defines the domain of the simulation by setting the number of times the flux tube winds around the device poloidally; the number of $2\pi$ segments is given by $2n_{\mathrm{period}} - 1$. GS2 uses the pitch angle and particle energy as velocity space coordinates. $n_\lambda$ determines the number of untrapped pitch angle grid points, while the number of trapped pitch angle points is set by the resolution of the θ grid as $n_\theta/2 + 1$.
TABLE III. Numerical resolution parameters.

Variable | Description | Value
$n_\epsilon$ | Number of energy grid points | 16
$n_\lambda$ | Number of untrapped pitch angle grid points in each direction along the field line | 16
$n_{\mathrm{trapped}}$ | Number of grid points in the trapped region of velocity space | 65
$n_\theta$ | Number of parallel grid points along the magnetic field line per poloidal turn | 128
$n_{\mathrm{period}}$ | Number of times the flux tube winds around the device poloidally | 9
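The derived grid quantities quoted above follow directly from $n_\theta$ and $n_{\mathrm{period}}$; a small sanity check against the Table III values:

```python
def gs2_derived_resolution(ntheta: int, nperiod: int) -> tuple:
    """Return (number of 2*pi poloidal segments, number of trapped
    pitch-angle grid points) implied by the parallel-grid settings
    described in the text."""
    n_segments = 2 * nperiod - 1   # 2*pi segments in the ballooning domain
    n_trapped = ntheta // 2 + 1    # trapped pitch-angle grid points
    return n_segments, n_trapped

# Table III values: ntheta = 128, nperiod = 9
print(gs2_derived_resolution(128, 9))  # -> (17, 65)
```

The trapped pitch-angle count of 65 matches the $n_{\mathrm{trapped}}$ entry in Table III.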
The run time of the simulation is approximately linear in the value of $n_{\mathrm{period}}$, so a value of 9 was chosen for this work as a compromise between accuracy and required computational resources; however, there will be a systematic uncertainty resulting from the choice of this parameter, especially in parts of parameter space where the mode is particularly elongated in the ballooning angle. The simulations where this effect is particularly acute are removed from the training set to prevent such systematic errors from entering the regression model. The resolutions used here are influenced by, and consistent with, the resolutions used in previously published studies of the linear structure of unstable modes in spherical tokamaks.^{49} It is noted here that this resolution is approximately an order of magnitude higher than that required for the KBMs or trapped electron modes (TEMs) that would also be found in this parameter space, for which two to three poloidal turns are usually sufficient and the modes tend to be more strongly growing.
B. Model outputs
The outputs of the model are the properties of the microtearing modes that are important for building a quasilinear heat transport model: namely, the growth rates, frequencies, and the mode contribution to the radial electron heat flux driven by the interaction of the turbulent field and the turbulent pressure fluctuations. These are outlined in Table IV along with their units.
TABLE IV. Model outputs and their units.

Variable | Name | Units
γ | Mode growth rate | $c_s/a$
ω | Mode frequency | $c_s/a$
$Q_{em,e}$ | Radial electromagnetic electron heat flux | $\rho_*^2 n_{\mathrm{ref}} c_s T_e$
The heat flux is defined as $Q_{em,e} = \left\langle \int d^3v\, \frac{m v^2}{2}\, \tilde{v}_{\delta A_\parallel} \cdot \nabla r\, f \right\rangle$, normalized to $\langle |A_\parallel|^2 \rangle$, where $\tilde{v}_{\delta A_\parallel} = \hat{b} \times \nabla (v_\parallel \tilde{A}_\parallel)/B$ is the flutter velocity, i.e., the parallel velocity along the perturbed field lines, and the angle brackets denote a flux surface average (where, for a quantity F, $\langle F \rangle = \int F\, dl$). $\hat{b}$ denotes a unit vector along the equilibrium magnetic field and B is the magnetic field strength. This is the predominant radial heat transport driven by MTMs.
In addition to GS2, our toolchain incorporates Pyrokinetics,^{50} a Python library that aims to standardize gyrokinetic analysis by providing a single interface for reading and writing input and output files from different gyrokinetic codes, normalizing to a common standard. This enables future interoperability with other gyrokinetic codes. All variables in this paper utilize the standard normalizations used by Pyrokinetics.^{50}
V. GAUSSIAN PROCESSES
Our data-driven surrogate model of the linear properties of MTMs is built on Gaussian process (GP) regression; an introduction can be found in Ref. 51. GP regression is a Bayesian approach that has received significant interest in recent years and has been used in the modeling of a diverse range of physical systems. A Gaussian process defines a prior over a function space, which can be conditioned on observations of the function to be emulated. Once trained, it provides a probability distribution over the function value at any unobserved location, allowing prediction with a quantified uncertainty.
Gaussian process regression is used in preference to neural networks because GPs are readily interpretable and, importantly when performing uncertainty quantification, handle uncertainty and noise in a rigorous fashion, two properties that do not always hold for neural networks. Due to the expense of running a gyrokinetic code, even linearly, the dataset is expected to be small, and as such, Gaussian processes have an advantage due to their data efficiency.
A. Gaussian process regression
A Gaussian process is a stochastic process (a collection of random variables) such that every finite subset has a joint probability distribution that is Gaussian. In the context of function emulation, one can consider the value of the function f at each location in the input space to be given by a random variable, such that the function as a whole is composed of an infinite collection of random variables obeying the above property. This effectively yields a probability distribution over the function, which can then be conditioned on observations.
If there is prior knowledge of the overall structure of the function to be emulated, then this can be expressed via the choice of parameterization of the mean function, but in many cases, this is simply set to zero. In this work, we use a constant mean function, unless otherwise stated.
In relation to our model, the mean $\mu$ corresponds to the growth rate, mode frequency, and fluxes for a specific input parameter set $x^\star$ (the seven input parameters as outlined in Table I). The variance $\Sigma$ represents the uncertainty in those predicted mean values.
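For reference, with training inputs $X$, observed outputs $y$, covariance kernel $k$, and observation noise variance $\sigma_n^2$, the GP posterior at a test point $x^\star$ takes the standard form (written here with a zero mean function for brevity; a constant mean is used in practice):

```latex
\begin{align}
\mu(x^\star) &= k(x^\star, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} y, \\
\Sigma(x^\star) &= k(x^\star, x^\star) - k(x^\star, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} k(X, x^\star),
\end{align}
```

where $K(X, X)$ is the kernel matrix evaluated on the training inputs. The confidence intervals quoted throughout derive from $\Sigma(x^\star)$.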
1. Kernels
The squared exponential in the RBF kernel is infinitely differentiable, which can lead to unrealistically smooth function realizations. The Matérn kernel is only $\lceil \nu \rceil - 1$ times differentiable and thus allows more flexible modeling of the degree of smoothness of the underlying function. Note that the Matérn kernel simplifies for $\nu = p + \frac{1}{2},\ p \in \mathbb{N}_0$, yielding particularly convenient closed forms for $\nu \in \{1/2, 3/2, 5/2\}$. When $\nu \ge 7/2$, the kernel is so smooth that it is hardly distinguishable from the RBF kernel.^{51} The Matérn 5/2 kernel is a common choice for emulation and is the default kernel used in the experiments below.
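The kernel choice can be illustrated on a toy one-dimensional problem. This scikit-learn sketch (not the emulator itself; the data and length scales are invented for illustration) fits GPs with Matérn 5/2 and RBF kernels to noisy samples of a smooth function:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 5.0, size=(20, 1))                  # training inputs
y = np.sin(2.0 * X[:, 0]) + 0.05 * rng.normal(size=20)   # noisy targets

for kernel in (Matern(length_scale=1.0, nu=2.5), RBF(length_scale=1.0)):
    gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-3,
                                  normalize_y=True).fit(X, y)
    # Posterior mean and standard deviation at an unobserved point.
    mean, std = gp.predict(np.array([[2.5]]), return_std=True)
```

Both kernels fit this smooth toy function; the practical difference is that the Matérn 5/2 posterior makes weaker smoothness assumptions away from the data.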
2. Fitting to the data
B. Gaussian process classification
After conditioning on training data, the $\Phi$-transformed posterior mean of the latent function can be interpreted as the probability of a test point $x^\star$ being an MTM, i.e., $p(\mathrm{MTM} = 1 \mid x^\star) = \Phi(\mu(x^\star))$, and conversely, $p(\mathrm{MTM} = 0 \mid x^\star) = 1 - \Phi(\mu(x^\star))$. This classifier will enable efficient expansion of the training set by predicting the regions of parameter space where an MTM may be found and thus allowing a targeted use of compute resources. The classifiers were evaluated using standard metrics, namely, accuracy, precision, recall, and F-score (see, e.g., Ref. 54).
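A minimal GP classifier in this spirit can be sketched with scikit-learn on synthetic stand-in labels. Note that scikit-learn's `GaussianProcessClassifier` uses a logistic rather than a probit link with a Laplace approximation, so it approximates rather than reproduces the Bernoulli–probit model described above; the stability boundary here is an invented toy:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import Matern

# Synthetic stand-in for the 7D input space: label a point "MTM" when
# the first two (normalized) parameters sum to more than 1.
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(200, 7))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = GaussianProcessClassifier(kernel=Matern(nu=2.5)).fit(X, y)
p_mtm = clf.predict_proba(X[:5])[:, 1]   # estimate of p(MTM = 1 | x)
```

In the surrogate workflow, `p_mtm` is the quantity used both to map the stability manifold and to drive the rejection sampler of Sec. VI.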
VI. ACTIVE LEARNING WORKFLOW
A. Parameter space exploration
The microtearing mode is stabilized by some of the parameters investigated here, and as such, there are large volumes of parameter space where GS2 may not return a positive result, as is evident in the locations of the simulation points seen in Fig. 4. To demonstrate this, Fig. 5 shows a contour plot of the calculated growth rate across a two-dimensional parameter space of magnetic shear $\hat{s}$ and electron temperature gradient $a/L_{T_e}$, calculated using a Latin hypercube with 100 samples. Stability boundaries, where the growth rate of the MTMs tends to zero, are evident in the bottom right-hand corner and additionally at low electron temperature gradient. This submanifold characterizing marginal stability is of great importance for determining the parameter values beyond which turbulence can form.
As GS2 is being used as an initial value code, in cases where the simulation is run in areas of stability, it will run to a predetermined maximum time without finding an instability, thus using up significant computational resources without producing useful data upon which to train the regression model. To mitigate this, a classifier is trained to predict the probability that a given point in parameter space will be unstable to MTMs, effectively learning the marginal stability manifold and allowing targeted exploration of the parameter space. To build such a classifier, an initial dataset is required.
1. Initial dataset generation
An initial dataset was produced by generating a 300-point maximin Latin hypercube design over the parameter space. A Latin hypercube is a space-filling quasi-Monte Carlo sampling scheme in which each parameter range is partitioned into N divisions and N points are sampled such that, for every parameter, each division contains exactly one sampled point. A maximin design additionally attempts to maximize the minimum separation between points in order to generate a point set with good coverage and minimal clustering.
Out of the 300 points sampled, 293 converged, and of these, 99 passed the selection criteria for MTMs discussed in Sec. IV; in total, these simulations cost approximately 9000 CPU hours. The classifier was trained on both the samples identified as MTMs and those identified as stable (no MTM), while the regression models were trained solely on the MTMs. The training of the classifier takes approximately 1 h on a single CPU. The low fraction ($\sim 30\%$) of MTMs produced by the Latin hypercube, a uniform sampling approach, and the high associated computational overhead highlight the potential benefit of using an active learning approach.
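A maximin Latin hypercube of this kind can be sketched with `scipy.stats.qmc` by drawing several random Latin hypercubes and keeping the one with the largest minimum pairwise separation, a simple stand-in for a dedicated maximin optimizer (the candidate count is an illustrative choice; the bounds below are the Table I min/max columns):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import qmc

def maximin_lhd(n: int, d: int, n_candidates: int = 50, seed: int = 0):
    """Best-of-many approximation to a maximin Latin hypercube on
    [0, 1]^d: keep the candidate design with the largest minimum
    pairwise point separation."""
    rng = np.random.default_rng(seed)
    best, best_dmin = None, -np.inf
    for _ in range(n_candidates):
        sample = qmc.LatinHypercube(d=d, seed=rng).random(n)
        dmin = pdist(sample).min()        # smallest pairwise distance
        if dmin > best_dmin:
            best, best_dmin = sample, dmin
    return best

# 300 points over the seven parameters, scaled to the Table I ranges.
unit = maximin_lhd(n=300, d=7)
design = qmc.scale(unit, [2, 0, 0, 0.5, 0, 0, 0],
                         [9, 5, 10, 6, 0.3, 0.1, 1])
```

Each row of `design` is then one GS2 input parameter set for the initial scan.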
B. Additional data points for GP regression
To improve the predictions of the GP regression, more samples were drawn iteratively from areas of the parameter space with high posterior probability of finding an MTM as determined by the classifier. The algorithm is summarized as follows:

Draw $n = 2^k$ low-discrepancy candidate data points from the input space using Sobol sequences^{55} to yield an initial sample $X = \{x_i\}_{i=1}^{n}$.

Evaluate the posterior predictive probability of being an MTM for each candidate data point using the Bernoulli GP classifier, $p(\mathrm{MTM} = 1 \mid x_i) = \Phi(\mu(x_i))$, where $\Phi(\cdot)$ is the standard normal CDF and $\mu(x_i)$ is the predictive mean of the latent function at data point $x_i$.

Accept/reject each candidate data point using a rejection sampler on the posterior predictive probability of being an MTM, i.e.,

Draw a random number from a uniform distribution, $\rho_i \sim U(0, 1)$.

If $\rho_i < p(\mathrm{MTM} = 1 \mid x_i)$, accept the candidate data point $x_i$.

Otherwise, reject the candidate data point x_{i}.

This algorithm yields a final set $X^\star$ of accepted samples, distributed according to the posterior predictive probability of finding an MTM, $p(\mathrm{MTM} = 1 \mid x_i)$.
Retrain the classifier and regressors using the expanded data set.
Iterate until the desired performance is achieved or the available compute resources are expended.
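The accept/reject steps above can be sketched as follows, assuming a trained classifier exposing a scikit-learn-style `predict_proba`; the function and parameter names are illustrative, not from the authors' toolchain:

```python
import numpy as np
from scipy.stats import qmc

def propose_batch(classifier, lower, upper, m: int = 10, seed: int = 0):
    """Draw 2**m Sobol candidates over the box [lower, upper] and keep
    each one with probability equal to the classifier's predicted
    p(MTM = 1 | x), i.e., a rejection sampler on the posterior
    predictive probability."""
    rng = np.random.default_rng(seed)
    sobol = qmc.Sobol(d=len(lower), scramble=True, seed=seed)
    candidates = qmc.scale(sobol.random_base2(m=m), lower, upper)
    p_mtm = classifier.predict_proba(candidates)[:, 1]
    accept = rng.uniform(size=len(candidates)) < p_mtm   # rejection step
    return candidates[accept]
```

The accepted points are then run through GS2, and the classifier and regressors are retrained on the expanded dataset before the next iteration.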
Two batches of new data were generated using this procedure. The first batch consisted of 1362 accepted samples, of which 971 were subsequently identified as MTMs, corresponding to a hit rate of 71%. The second batch consisted of 1317 accepted samples, of which 983 were subsequently identified as MTMs, corresponding to a hit rate of 75%. This is a significant increase compared to the initial parameter space sampling, representing a substantial saving in computational time. A summary of the batch statistics can be found in Table V, and the locations of the simulation points in parameter space are shown in Fig. 6.
TABLE V. Summary of the data generation batches.

No. | Name | Description | Total num. simulations | Number of MTMs | Hit rate (%)
0 | Initial LHD | Original LHD distributed parameter scan | 293 | 99 | 33
1 | Original | First expansion of dataset using rudimentary classification | 1000 | 513 | 51.3
2 | Targeted1 | General classifier directed database expansion | 1361 | 971 | 71.2
3 | Targeted2 | General classifier directed database expansion | 1316 | 983 | 74.6
4 | HighFlux1 | Dataset targeted to high-flux region ($Q_{em,e} > 0.2$) | 300 | 237 | 79
5 | HighFlux2 | Dataset targeted to high-flux region ($Q_{em,e} > 0.5$) | 400 | 349 | 87
The model trained on this expanded dataset was observed to lack accuracy at high heat fluxes, and so further iterations targeting high-flux points were performed. To do this, the existing data were labeled according to whether they were below or above a specified heat flux threshold, and a Bernoulli GP classifier was trained on these labels. Sample points with a high predicted heat flux value were then proposed using the rejection sampling algorithm outlined above, highlighting the flexibility of the classifier-driven sampling approach. Using this approach, two further iterations targeting $Q_{em,e} > 0.2$ and $Q_{em,e} > 0.5$ were performed.
VII. VALIDATION
A. Diagnostics
B. Classification
The Bernoulli GP (BGP) classifier was benchmarked against standard reference classifiers, namely, logistic regression (LR) (see, e.g., Refs. 59 and 60) and a gradient-boosting classifier (GB).^{60–62} The BGP was tested using three different covariance kernels, namely, Matérn 1/2, 3/2, and 5/2 (M12, M32, and M52). Table VI shows the mean and standard deviation of the accuracy, precision, recall, and F1 score for each classifier. For definitions of these terms, the reader is directed to the Appendix. Here, the mean and standard deviation were computed from a five-fold cross-validation. Note that the Bernoulli GP classifiers have slightly lower accuracy and precision, and slightly higher recall, than the reference classifiers.
| Classifier | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| LR | 0.827 ± 0.027 | 0.813 ± 0.019 | 0.907 ± 0.036 | 0.857 ± 0.024 |
| GB | 0.879 ± 0.036 | 0.868 ± 0.046 | 0.935 ± 0.024 | 0.900 ± 0.026 |
| BGP (M12) | 0.875 ± 0.016 | 0.847 ± 0.024 | 0.954 ± 0.008 | 0.898 ± 0.014 |
| BGP (M32) | 0.888 ± 0.026 | 0.865 ± 0.029 | 0.955 ± 0.018 | 0.907 ± 0.020 |
| BGP (M52) | 0.890 ± 0.017 | 0.864 ± 0.032 | 0.961 ± 0.013 | 0.909 ± 0.015 |
Figure 7 shows the predictions of the Bernoulli GP classifier with the Matérn 5/2 covariance kernel, which was the bestperforming BGP classifier, across each of the twodimensional subspaces of the sevendimensional space. For each marginal parameter scan, all the remaining parameters were fixed at the center point of their respective domain. These indicate where in the parameter space MTMs are likely to be unstable.
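A minimal sketch of this kind of classification is given below, using scikit-learn's `GaussianProcessClassifier` (a Laplace-approximated Bernoulli-likelihood GP) with a Matérn 5/2 kernel as a stand-in for the implementation used here; the data are synthetic, not GS2 output.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)

# synthetic stand-in for the labeled dataset: points in a 2D subspace,
# label 1 where an (artificial) instability criterion is met
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] + 0.3 * X[:, 1] > 0.0).astype(int)

# GP classifier with a Bernoulli likelihood (Laplace approximation in
# scikit-learn) and a Matern 5/2 covariance kernel
clf = GaussianProcessClassifier(kernel=Matern(nu=2.5)).fit(X, y)

# predicted instability probability over a coarse grid of the subspace,
# analogous to the 2D maps of Fig. 7
g = np.linspace(-1.0, 1.0, 11)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
p_mtm = clf.predict_proba(grid)[:, 1]
```

Thresholding `p_mtm` at 0.5 gives the predicted stability boundary in that subspace.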
C. Regression
GP regression was employed to predict the mode frequencies, growth rates, and normalized electron heat fluxes across the unstable regions of the parameter space, using the simulations identified as MTMs as training data. Independent GPs were trained for each output. Initial experiments also included multi-task GP models (where correlations between outputs can yield higher accuracy) with an intrinsic coregionalization model (ICM) kernel on the outputs; however, these did not yield any significant improvement over the independent predictors for our data and are more expensive to train.
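The independent-GP arrangement might be sketched as follows, assuming scikit-learn as a stand-in GP library and synthetic data in place of the GS2 outputs; the weights, sizes, and kernel choice are illustrative only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

rng = np.random.default_rng(2)

# synthetic stand-in for the MTM-positive dataset: 7D inputs and three
# outputs (growth rate, mode frequency, quasilinear flux)
X = rng.uniform(0.0, 1.0, size=(120, 7))
W = rng.uniform(size=(7, 3))
Y = np.sin(X @ W) + 0.01 * rng.normal(size=(120, 3))

# one independent GP per output: constant mean (via normalize_y), a Matern
# covariance, and a white-noise term for the inferred observation noise
models = []
for i in range(3):
    kernel = ConstantKernel() * Matern(nu=1.5) + WhiteKernel()
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    models.append(gp.fit(X, Y[:, i]))

# each model returns a predictive mean and standard deviation
mean, std = models[0].predict(X[:5], return_std=True)
```

Swapping the `nu` argument of `Matern` (or substituting an `RBF` kernel) reproduces the kernel comparison described below.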
Four different covariance functions [Matérn 1/2 (M12), Matérn 3/2 (M32), Matérn 5/2 (M52), and radial basis function (RBF)] were investigated for each output variable. Each model used a constant mean function. Initial experiments additionally included corresponding models with a linear mean function, but this provided no predictive improvement. The results of a five-fold cross-validation across the entire dataset of MTM-positive datapoints are shown in Table VII. The MSLL metrics show that every GP model performs better than the trivial model $y_\star \sim \mathcal{N}(\mu_{\mathcal{D}}, \sigma_{\mathcal{D}}^2)$. The coverage shows that there is good agreement with the theoretical 95% credible interval for every model. There is no appreciable difference between the kernels with respect to the chosen metrics. However, there is a trade-off between the predictive variance and the inferred observation noise with respect to the kernel smoothness (not shown). As the kernels become smoother, the uncertainty is pushed from the predictive variance to the observation noise. This is reflected in the calibration error, which decreases slightly but monotonically with the smoothness of the kernel (RBF being the smoothest and M12 the roughest).
| Model | MSE | SMSE | MSLL | CE | $C_{0.95}$ |
| --- | --- | --- | --- | --- | --- |
| Growth rate | | | | | |
| M12 | 0.069 ± 0.008 | 0.068 ± 0.016 | −1.325 ± 0.106 | 0.092 ± 0.011 | 0.952 ± 0.015 |
| M32 | 0.070 ± 0.008 | 0.062 ± 0.014 | −1.369 ± 0.116 | 0.083 ± 0.011 | 0.943 ± 0.017 |
| M52 | 0.070 ± 0.008 | 0.064 ± 0.015 | −1.363 ± 0.117 | 0.080 ± 0.011 | 0.939 ± 0.019 |
| RBF | 0.071 ± 0.008 | 0.066 ± 0.014 | −1.350 ± 0.107 | 0.081 ± 0.011 | 0.934 ± 0.020 |
| Mode frequency | | | | | |
| M12 | 3.338 ± 0.518 | 0.030 ± 0.011 | −1.692 ± 0.185 | 0.148 ± 0.011 | 0.957 ± 0.015 |
| M32 | 3.344 ± 0.517 | 0.029 ± 0.011 | −1.742 ± 0.187 | 0.138 ± 0.012 | 0.953 ± 0.019 |
| M52 | 3.336 ± 0.516 | 0.031 ± 0.012 | −1.722 ± 0.176 | 0.133 ± 0.012 | 0.948 ± 0.019 |
| RBF | 3.320 ± 0.504 | 0.038 ± 0.012 | −1.651 ± 0.154 | 0.122 ± 0.010 | 0.940 ± 0.024 |
| Quasilinear flux | | | | | |
| M12 | 0.088 ± 0.014 | 0.039 ± 0.013 | −1.485 ± 0.130 | 0.165 ± 0.008 | 0.963 ± 0.018 |
| M32 | 0.088 ± 0.014 | 0.041 ± 0.013 | −1.493 ± 0.140 | 0.159 ± 0.009 | 0.960 ± 0.022 |
| M52 | 0.088 ± 0.014 | 0.043 ± 0.013 | −1.482 ± 0.138 | 0.156 ± 0.009 | 0.956 ± 0.023 |
| RBF | 0.088 ± 0.014 | 0.047 ± 0.013 | −1.451 ± 0.128 | 0.147 ± 0.008 | 0.952 ± 0.025 |
Figure 8 shows the MSLL and SMSE of the GP model using the Matérn 1/2 covariance kernel as more data points from the additional batches described in Sec. VI B are cumulatively added. The classifier and regression are retrained after each iteration. Here, k-fold cross-validation with k = 10 is performed, with the testing data for each fold kept fixed at N_{test} = 200 data points from the first batch. Broadly, each new batch improves the performance of the growth rate and quasilinear heat flux predictors with respect to both metrics. However, for the mode frequency, the variance of the MSLL across the k = 10 folds first increases and then decreases again. Notably, as the variance increases, some folds perform better than any previous iteration and some worse. This shows that some of the testing data are highly sensitive to the new data points from the MTM-targeted batches (batches 2 and 3, see Table V). As the high-flux-targeted batches (batches 4 and 5, see Table V) are added, the variance of the MSLL contracts again, indicating that the sensitive test points are resolved. There is some correlation between the mode frequency and electron flux, which may explain why the high-flux batches assist in resolving the sensitive mode frequency data points. It is also worth noting that the MSLL of the mode frequency is consistently low.
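For reference, the SMSE and MSLL metrics used in these comparisons follow the standard definitions (e.g., Rasmussen and Williams); a sketch, assuming Gaussian predictive distributions with per-point mean `mu` and variance `var`:

```python
import numpy as np

def smse(y_true, y_pred):
    """Mean squared error standardized by the variance of the test targets."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def msll(y_true, mu, var, mu_train, var_train):
    """Mean standardized log loss: the mean negative log predictive density
    of the GP minus that of the trivial Gaussian fitted to the training
    targets. Negative values mean the GP beats the trivial model."""
    nlpd = 0.5 * np.log(2.0 * np.pi * var) + (y_true - mu) ** 2 / (2.0 * var)
    nlpd_trivial = (0.5 * np.log(2.0 * np.pi * var_train)
                    + (y_true - mu_train) ** 2 / (2.0 * var_train))
    return np.mean(nlpd - nlpd_trivial)
```

A perfect predictor with small predictive variance yields an SMSE of zero and a strongly negative MSLL, consistent with the values in Table VII.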
D. Individual parameter scans
The validation plots in Fig. 9, showing the results of the k-fold validation of the whole dataset, give an indication of the global performance of the model but do not show how well it can reproduce specific local parametric dependencies, which are important in integrated modeling. Figures 10 and 11 depict scans in the k_{y} parameter comparing the predictions of the model with the values given by GS2. Two scans are shown: one far from marginal stability (Fig. 10), characterized by large growth rates, and one close to marginal stability (Fig. 11), with lower growth rates and a characteristic peak in the growth rate spectrum. In both cases, the model outputs are in good agreement with the GS2 predictions within the predictive uncertainties of the model.
The growth rate predictions (left, Fig. 11) show two points of interest. First, the growth rate prediction, and correspondingly the heat flux prediction, diverges from the actual values at the upper boundary of our k_{y} range, indicating a lack of data in this region that requires model improvement. Second, the confidence intervals are significantly larger than the deviation of the model from the GS2 values, and more generally larger than the mean across this k_{y} range, indicating the difficulty of capturing the zero-growth-rate manifold. The error bars here represent the predictive standard deviation, i.e., $\sqrt{\mathbb{V}(f(x_\star)) + \sigma_\epsilon^2}$, where $\mathbb{V}(f(x_\star))$ is the variance of $f$ and $\sigma_\epsilon^2$ is the noise variance. The latter is practically zero, so the variance of $f$ dominates.
Figure 12 shows a comparison of the model with GS2 output values over a scan in the safety factor, q. It is evident that there are two unstable branches in this part of parameter space, at low and high q. Both branches are captured by the model, with the growth rate predicted to be negative at the values of q where GS2 does not find an unstable MTM ($3 < q < 6$). This highlights the complex nature of the outputs, particularly the growth rate. The model does not reproduce the frequency between the two unstable branches well because this is a stable region containing no data on which the GP model can train; the model therefore reverts toward its mean. Both branches maintain a dominant electromagnetic component ($|A_\parallel| \geq |\varphi|$) and a large value of the tearing parameter, confirming that both branches are microtearing modes; however, the mode structures in the higher branch are generally narrower in ballooning space, indicating that the two branches contain different types of MTM. An example of a lower-q and a higher-q mode can be seen in Fig. 1.
A further crucial parameter exhibiting critical behavior is β. Figure 13 shows that the model captures the stability boundary with high precision, again with agreement significantly better than the error bars suggest. In both Figs. 12 and 13, the classification of the data points in the scan, compared with the MTM probability predicted by the GP-based classifier, shows the expected dip in probability below the 0.5 threshold at low β, and a corresponding dip between the branches (within the confidence interval) in the q scan. In both cases, this corresponds to the growth rate dropping below zero, indicating a stable region.
VIII. CONCLUSIONS AND FUTURE DEVELOPMENT
This paper described the development of Gaussian process-based models of the linear parameters of microtearing modes, trained on data generated by a high-fidelity gyrokinetic code. Gaussian processes have advantages over neural-network-based models for surrogate modeling: they are typically more accurate at low data volumes, and they provide transparent, interpretable confidence intervals.
The model consists of two components: a classifier, which predicts the probability of a given point in parameter space being unstable to MTMs, and a regressor, which predicts the quantities of interest in the unstable regions identified by the classifier. Combining these in an active learning loop, it was possible to significantly increase the efficiency of training-data generation on a previously unexplored parameter space. An accurate model was created using approximately 5000 data points in a seven-dimensional parameter space, requiring approximately one million CPU hours.
A wider expansion of the model is planned, including:

Expansion of the parameter set to include all the geometric parameters listed in Table II. Increasing the dimensionality of the problem will require significantly expanding the data set.

Harnessing the confidence intervals intrinsic in Gaussian processes to inform the targeting of new simulations in parts of parameter space where the most improvement can be made.

Generalization of the model to account for other types of mode, which may be dominant, e.g., KBM, hybrid KBM, trapped electron modes (TEM), etc. This would benefit from the use of a gyrokinetic eigenvalue solver as this would allow the tracking of subdominant stable and unstable modes.

Expanding the current parameters to cover larger ranges. For example, it was apparent from Fig. 10 that the current k_{y} range is insufficient to resolve the turbulent spectrum peak in all cases. It is thought that an expansion of this parameter to $k_y \rho_s \approx 7$ would be sufficient to resolve the peak.

Expanding the number of outputs to include the flux-surface-averaged $\langle k_\perp^2 \rangle$, a further component required for quasilinear models. This could be further refined to predict the full eigenfunctions of the perturbed fields.

Integration of the model into the JINTRAC^{63} integrated modeling toolchain. This will include the development of bespoke saturation rules required to predict the total electromagnetic heat flux via full nonlinear turbulence simulations.
ACKNOWLEDGMENTS
The authors would like to thank David Dickinson and the GS2 team for many discussions about the intricacies of the GS2 code.
This work has been (part) funded by the EPSRC Energy Programme (Grant No. EP/W006839/1). To obtain further information on the data and models underlying this paper, please contact PublicationsManager@ukaea.uk.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
William Anestis Hornsby: Conceptualization (equal); Data curation (lead); Investigation (lead); Methodology (lead); Software (equal); Validation (equal); Writing – original draft (lead). Nikolaos Papadimas: Investigation (supporting); Methodology (supporting); Software (supporting). Ben Fourcin: Investigation (supporting); Methodology (supporting); Software (supporting). Jordan Hart: Investigation (supporting); Methodology (supporting); Software (supporting); Validation (supporting); Visualization (supporting). Ander Gray: Investigation (equal); Methodology (equal); Software (equal); Validation (equal). James Christopher Buchanan: Conceptualization (lead); Investigation (equal); Methodology (equal); Project administration (lead); Software (equal); Validation (equal); Writing – review & editing (equal). Bhavin S. Patel: Conceptualization (equal); Investigation (equal); Methodology (equal); Project administration (equal); Software (equal); Writing – review & editing (equal). Daniel Kennedy: Conceptualization (supporting); Investigation (equal); Methodology (equal). Francis Casson: Conceptualization (equal); Investigation (equal); Methodology (equal); Writing – review & editing (equal). Colin M. Roach: Conceptualization (equal); Investigation (equal); Methodology (equal); Writing – review & editing (equal). Mikkel B. Lykkegaard: Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (lead); Writing – original draft (equal). Huy Nguyen: Investigation (supporting); Methodology (supporting); Software (supporting); Validation (supporting); Visualization (supporting).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
APPENDIX: GLOSSARY OF TERMS
When determining the accuracy of the GP classifier, the following terms are used; they are listed here for the reader:

TP—True positive—The fraction of points correctly classified as growing MTMs.

FP—False positive—The fraction of points incorrectly classified as a growing MTM.

TN—True negative—The fraction of points correctly classified as a stable point.

FN—False negative—The fraction of points incorrectly classified as a stable point.

Accuracy—The fraction of true positive and true negative points—$(\mathrm{TP} + \mathrm{TN})/(\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN})$.

Precision—The fraction of true positives over true positives plus false positives—$\mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$.

Recall—The fraction of true positives over true positives plus false negatives—$\mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$.

F1 score—A metric that accounts for both the precision and the recall, defined as their harmonic mean: $F_1 = 2 \times \mathrm{Precision} \times \mathrm{Recall}/(\mathrm{Precision} + \mathrm{Recall})$.
An ideal classifier would classify points only as true positives or true negatives. Multiple metrics are used to analyze a classifier on unbalanced datasets (when the number of MTM points is significantly greater or smaller than the number of non-MTM points). An ideal classifier would give values of one for accuracy, precision, recall, and F1.
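The definitions above translate directly into code; a minimal sketch (variable names illustrative), with label 1 denoting a growing MTM and 0 a stable point:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels, following the
    TP/FP/TN/FN definitions above (1 = growing MTM, 0 = stable)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```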

MSE—Mean squared error: $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2$.

SMSE—Standardized MSE: $\mathrm{SMSE} = \mathrm{MSE}/\sigma_y^2$, the MSE standardized by the variance of the test data.
Coverage—The coverage $C_n$ measures the coverage of the $n$th predictive credible interval with respect to the test data, with $n \in [0, 1]$, i.e.,
$$C_n = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\left(y_i \in \mathrm{CI}_n(x_i)\right),$$
where $\mathbb{I}(\cdot)$ is the indicator function and $\mathrm{CI}_n(x)$ is the $n$ credible interval of testing datapoint $x$. For example, for the 0.95 credible interval, the coverage should be 0.95 for a well-calibrated model. If $C_n > n$, the model is underconfident with respect to that particular credible interval, and vice versa.
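For a Gaussian predictive distribution, the coverage can be computed from the predictive mean and standard deviation; a sketch (assuming SciPy for the normal quantile):

```python
import numpy as np
from scipy.stats import norm

def coverage(y_true, mu, sigma, n=0.95):
    """Fraction of test targets falling inside the central n-credible
    interval of a Gaussian predictive distribution N(mu, sigma^2); this
    should be close to n for a well-calibrated model."""
    half = norm.ppf(0.5 + n / 2.0) * sigma   # half-width of the interval
    return np.mean(np.abs(y_true - mu) <= half)
```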