Bayesian statistics offers a powerful framework through which plasma physicists can infer knowledge from the heterogeneous data types they encounter. To illustrate this power, a simple example, Gaussian process regression, and the application of Bayesian statistics to inverse problems are presented. The likelihood is the key distribution because it contains the data model, or theoretical predictions, of the desired quantities. By combining the likelihood with prior knowledge, the distribution of the inferred quantities of interest given the data can be obtained. Because the result is a distribution of inferred quantities given the data, not a single prediction, uncertainty quantification is a natural consequence of Bayesian statistics. The benefits of machine learning in developing surrogate models for solving inverse problems are discussed, as well as progress in quantitatively understanding the errors that such models introduce.
I. INTRODUCTION
Inferring knowledge from data is a critical issue across all scientific disciplines. In plasma physics, this issue is particularly challenging because of the complexities in directly measuring fundamental properties like density and temperature. Such measurements are indirect and rely heavily on theoretical development for even a basic level of understanding. For example, understanding Langmuir probe physics necessitates two pages of explanation in the most commonly used plasma physics textbook.^{1} Experts in Langmuir probes might spend their entire careers delving into the intricacies of this area.^{2}
Bayesian statistics offers a powerful framework for data interpretation and is especially valuable in challenging areas like plasma physics. This approach necessitates the creation of a data model that accounts for the unknowns, enabling the application of complex data models, such as those found in plasma physics. It also facilitates natural quantification of uncertainty. The explicit nature of the data model in Bayesian statistics also ensures that assumptions are clearly stated and integrated into the analysis.
Despite its advantages, Bayesian statistics has only recently begun to gain wider appreciation, partly for historical reasons. Bayes's theorem was first published in 1763^{3} and further developed by Laplace.^{4} However, it initially fell into disrepute due to errors in its application.^{5} A prominent early 20th-century statistician, Ronald Fisher, even dismissed it, stating that “(Bayes Theorem) is founded upon error, and must be wholly rejected.”^{6} These errors were rectified in the 1950s and 1960s, securing a solid theoretical foundation for the theorem. Harold Jeffreys likened Bayes's theorem to the Pythagorean theorem in geometry, underscoring its fundamental importance.^{7} Today, while modern statisticians do not necessarily view Bayesian methods as distinct from general statistics, there remains a lack of awareness in the broader scientific community, fostering a growing number of Bayesian advocates. Two notable textbooks that emerged in the early 2000s, authored by Jaynes^{8,9} and by Sivia and Skilling,^{10} have been instrumental in introducing these concepts to scientists. These books, as well as a textbook by Tarantola,^{11} are highly recommended.
The conditional probabilities for each of the factors in Bayes's equation have names, as shown, but we defer the discussion of those names because much of this tutorial is devoted to providing intuition for what each of these factors represents. We start with a simple example that is easy both to solve analytically and to visualize graphically. We then use the mathematics and concepts from the simple example to explain Gaussian process regression, a Bayesian profile-fitting technique whose usage is growing within the plasma physics community. Next, we discuss integrated data analysis and present a state-of-the-art usage of Bayesian statistics. Finally, we briefly touch upon the interplay of Bayesian statistics and machine learning.
II. SIMPLE EXAMPLE: TWO PEOPLE, ONE SCALE
We begin by using Bayesian statistics to solve a simple problem: guess the weights of two people named Ann and Bob, standing behind a curtain on a single scale reading 320 lbs.^{12} (see Fig. 1). This is clearly an ill-posed problem, but because either underfitting or overfitting the unknowns from data is common in science, it serves as a good example for illustrating the concepts.
Rather than providing a single guess, the Bayesian approach emphasizes the uncertainty and the need to formulate a probability of answers given the limited data. To be concrete, let $x_1$ be the weight of Ann, $x_2$ be the weight of Bob, and $d = 320$ lbs. be the measurement. Our goal is to find the posterior distribution, $P(X = x_1, x_2 \mid D = d, I)$, i.e., the probability of Ann and Bob's weights given the measurement. To do so, we will use Bayes's theorem, Eq. (1), and calculate it using the likelihood, $P(D \mid X, I)$, and prior, $P(X \mid I)$, distributions. The evidence, $P(D \mid I)$, is not used because it acts as a normalization constant for the posterior distribution. As stated earlier, the background information $I$ is explicitly included to demonstrate the use of knowledge not contained within $X$ or $D$. In this case, one can write $P(x_i < 0) = 0$ as the first and obvious piece of information that we will use.
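To make the roles of these factors concrete, the posterior can be computed numerically on a grid: multiply the likelihood by the prior and normalize, the evidence being nothing more than that normalization constant. In the sketch below, the prior spreads and the scale error $\sigma_\epsilon$ are assumed values for illustration only:

```python
import numpy as np

# Grid of candidate weights (lbs) for Ann (x1) and Bob (x2); the positivity
# prior P(x_i < 0) = 0 is enforced by the grid itself.
x1 = np.linspace(50.0, 350.0, 301)
x2 = np.linspace(50.0, 350.0, 301)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")

d, sigma_eps = 320.0, 5.0   # scale reading and an assumed scale error

# Likelihood P(D | X, I): the scale reads the total weight plus Gaussian noise.
likelihood = np.exp(-0.5 * ((X1 + X2 - d) / sigma_eps) ** 2)

# Prior P(X | I): independent normals around the national-average weights;
# the spreads (30 and 40 lbs) are assumed for illustration.
prior = (np.exp(-0.5 * ((X1 - 170.8) / 30.0) ** 2)
         * np.exp(-0.5 * ((X2 - 199.8) / 40.0) ** 2))

# Posterior = likelihood * prior / evidence; the evidence P(D | I) is just
# the normalization constant, so we divide by the grid sum.
posterior = likelihood * prior
posterior /= posterior.sum()

# Marginalizing over x2 projects the joint posterior onto the x1 axis.
marginal_x1 = posterior.sum(axis=1)
ann_mean = float((x1 * marginal_x1).sum())
```

Because the grid excludes negative weights, the positivity information in $I$ is built in automatically.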
The prior distribution offers considerably more freedom in choosing and making use of the background information. There are two common choices used for the prior:
Laplace's choice: This is a uniform prior and is simple because in this case the posterior becomes the likelihood. It is known as the principle of insufficient reason because one conservatively assumes there is no reason to use background information to make further progress.
Jaynes's choice: This assumes normal distributions as a way of exploiting the background knowledge, which in this context satisfies the principle of maximum entropy because the mean and deviation can be used as constraints. See Chapter 5 of Sivia^{10} for the derivation of other priors satisfying the maximum entropy principle and Chapter 7 of Jaynes^{8} for the use of the normal distribution.
Historically, most of the confusion around Bayesian statistics lies in the subtleties of setting the prior, especially for cases of timevarying data where one wants to update the posterior distribution. Other choices of priors can be made of course. For example, Jeffreys prior^{13} is an important prior that we could choose in this case.
In this tutorial, we will use Jaynes's choice as the most straightforward and make the following assumptions:

Ann and Bob are American, as befitting a tutorial presented at an American Physical Society meeting.

Ann and Bob have gender-normative names.

Ann and Bob's weights are independent of each other.
The next limit to consider is the unreliable scale, $\sigma_\epsilon \to \infty$. In this case, the Kalman gain values all go to zero, and therefore the posterior mean is the same as the mean of the prior. That is, when we do not trust the data, our best guess is to say that Ann and Bob's weights are each the national average for their gender. The difference in predictions between $\sigma_\epsilon = 0$ and $\sigma_\epsilon \to \infty$ is that Ann's weight goes from 157.4 to 170.8 lbs. and Bob's weight from 162.6 to 199.8 lbs. Although our uncertainties were arbitrarily chosen, the results are perhaps less sensitive to large variations than one would expect. This will also be seen in Sec. III.
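These limits can be verified directly with the Kalman-gain form of the Gaussian posterior update. In this sketch, the prior means are the national-average weights quoted above, while the prior spreads are assumed values chosen only for illustration (so the posterior means differ slightly from the numbers in the text):

```python
import numpy as np

mu_prior = np.array([170.8, 199.8])        # national-average prior means (lbs)
Sigma_prior = np.diag([30.0**2, 40.0**2])  # assumed prior variances

H = np.array([[1.0, 1.0]])   # the scale measures x1 + x2
d = np.array([320.0])        # scale reading (lbs)

def gaussian_posterior(sigma_eps):
    """Posterior mean/covariance for d = H x + eps, eps ~ N(0, sigma_eps^2)."""
    R = np.array([[max(sigma_eps, 1e-9) ** 2]])   # floor avoids divide-by-zero
    S = H @ Sigma_prior @ H.T + R
    K = Sigma_prior @ H.T @ np.linalg.inv(S)      # Kalman gain
    mu_post = mu_prior + (K @ (d - H @ mu_prior)).ravel()
    Sigma_post = (np.eye(2) - K @ H) @ Sigma_prior
    return mu_post, Sigma_post

# Perfect scale (sigma_eps -> 0): the posterior means must sum to the reading.
mu0, Sig0 = gaussian_posterior(0.0)
# Useless scale (sigma_eps -> infinity): the posterior collapses to the prior.
mu_inf, _ = gaussian_posterior(1e6)
```

The diagonal of `Sig0` gives the marginal variances for Ann and Bob individually, since marginalizing a Gaussian simply reads off the corresponding diagonal entry.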
Bayesian statistics is not about predicting the maximum a posteriori value but about getting the distributions of the predictions. Although the total distribution is obtained, what is often desired is the distribution of our inference for Ann and Bob's weights individually. To do so, we integrate out the other value in a process called marginalization. This may be viewed as a projection of the correlated posterior distribution onto the x_{1} axis.
Bayes's theorem can be visually understood through the example shown in Fig. 2. The global likelihood distribution, shown in blue, is multiplied by the localized prior distribution, shown in red, yielding a more concentrated posterior distribution, shown in purple. The prior distribution is uncorrelated, as evidenced by its ellipsoid's axes being parallel to the coordinate axes. In contrast, the posterior distribution becomes highly correlated due to the stringent constraints imposed by the data through the likelihood distribution. The MAP estimate, marked at the center of the posterior ellipsoids, is situated above the most probable likelihood line because the statistical American weights imply that the scale is most likely underweighing Ann and Bob, as discussed earlier when discussing the special limits. Marginalization is then applied to project the posterior distribution onto each axis, revealing the inferred individual weight distributions for Ann and Bob. In this simple example, the posterior distribution remains symmetric such that the MAP value is also the mean. This is discussed further in Sec. III.
In conclusion, the extensive mathematical framework of Bayes's theorem ultimately leads to an intuitive outcome: utilizing known data about average weights to make an educated guess in a poorly defined problem. This aligns with Laplace's assertion in 1819 that “probability theory is nothing but common sense reduced to calculation.” Jaynes's textbook^{8} expands on this idea, portraying the derivation of Bayes's theorem as the simplest mathematical model for human inference. While this basic example highlights the theorem's alignment with common sense, Sec. III will apply it to curve fitting, showcasing how these foundational concepts are extended to more complex applications.
III. INTRODUCTION TO GAUSSIAN PROCESS
Gaussian process regression has emerged as a common, and growing, method for overcoming these limitations within the plasma physics community. It benefits from an excellent textbook by Rasmussen and Williams.^{17} The first plasma physics use that we are aware of is an EFDA report by Svensson,^{18} in which Gaussian processes were used to infer density profiles from interferometers and the current distribution from magnetic diagnostics in the JET tokamak. A notable and influential paper is Ref. 19, and a recent paper notable for discussing some of the pitfalls of applying GPR is Ref. 20. Rather than a comprehensive review of the uses of this method, a highlight of its key features is given here. To do so, Ref. 21 is followed because it uses analytic profiles with synthetic data to obtain quantitative measurements of the accuracy; more details may be found in that paper.
Our goal with the Bayesian approach to curve fitting is to find the posterior distribution of curves; i.e., the goal is not to find a single curve that fits the data but a distribution that describes the probability of the given curve. This is discussed more below. The basic procedure proceeds the same as the simple example in Sec. II. The likelihood is determined using a data model, and the prior is developed using our prior knowledge to determine the posterior distribution using Bayes's theorem.
As an illustration, consider a minimal dataset of six points spaced equally in x. The underlying curve is a smooth function, but unlike parameterized fitting, performing a fit with GPR does not require any knowledge of this function. Figure 3 shows the MAP estimate and 95% confidence intervals for the example dataset, assuming zero errors in the data. The distribution of possible functions is wide between data points, leading to a larger error in the fit in those regions. This example demonstrates the power of Gaussian process regression: even with few data points and no hyperparameter optimization, the resulting MAP estimate is smooth and reasonable. Additionally, this shows the role that measurements play in the GPR process, constraining the fit. Error estimations on these data points can also be provided naturally to the GPR algorithm, which loosens the constraint these points provide. This will be shown next.
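A noise-free GPR fit of this kind takes only a few lines of linear algebra. In the sketch below, the six data points, the underlying sine curve, and the squared-exponential kernel's length scale are all illustrative assumptions; the posterior mean interpolates the data, while the posterior standard deviation grows between the points:

```python
import numpy as np

def sq_exp_kernel(xa, xb, amp=1.0, ell=0.2):
    """Squared-exponential covariance k(r, r') = amp^2 exp(-(r - r')^2 / (2 ell^2))."""
    return amp**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell**2)

# Six equally spaced, noise-free points from a smooth curve (unknown to GPR).
x_data = np.linspace(0.0, 1.0, 6)
y_data = np.sin(2.0 * np.pi * x_data)

x_star = np.linspace(0.0, 1.0, 101)   # prediction grid
jitter = 1e-10                        # numerical regularization only

K = sq_exp_kernel(x_data, x_data) + jitter * np.eye(x_data.size)
K_s = sq_exp_kernel(x_star, x_data)
K_ss = sq_exp_kernel(x_star, x_star)

# Posterior mean curve and pointwise standard deviation of the fitted curves.
mean = K_s @ np.linalg.solve(K, y_data)
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

Adding measurement errors amounts to replacing `jitter` with the measurement variances on the diagonal of `K`, which loosens the constraint the data impose.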
A. GPR for Thomson scattering
In tokamaks, the Thomson scattering (TS) diagnostic provides measurements of temperature and density at a series of physical locations along laser beamlines within the plasma. These measurements are often noisy, as they come from scattered photon spectra whose intensity is related to the electron density and whose width is related to the electron temperature. The beamlines are designed such that the measurements span a large range of flux surfaces, so density, temperature, and pressure profiles can be obtained. These profiles are important inputs into equilibrium reconstruction codes, such as EFIT;^{22} however, it is important to fit these data to obtain smooth profiles. Many parameterized techniques are popular for this purpose, but recent work^{19,21,23} has instead utilized GPR due to its many benefits: error estimation, nonparametric fitting, and more.
L-mode profiles are easily fit by GPR, but H-mode profiles require a more careful choice of the kernel since the correlation length scale varies as a function of $\psi$ (i.e., the profile changes slowly in the core, with a sharp gradient in the edge). The standard squared-exponential kernel shown in Eq. (24) is not sufficient for such a case, and instead one must choose a nonstationary kernel, that is, a kernel that is not solely a function of $r - r'$, as in Eq. (24), but can also depend on $r$ directly. One such kernel is the Gibbs kernel,^{24} a slightly modified version of the squared-exponential kernel. Another, which is what will be shown below, is the change-point kernel,^{25} a piecewise combination of multiple kernels (squared-exponential or otherwise) with smooth exponential transitions between them. For the H-mode, a three-region change-point kernel with a separate squared-exponential kernel in each region is sufficient, giving a separate length scale for the core, pedestal, and edge. The locations of the transition regions are hyperparameters that can be optimized concurrently with the length scales.
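A change-point kernel of this kind is straightforward to sketch. The version below assumes only two regions for brevity (core and pedestal; a three-region version adds one more sigmoid transition), and the transition location, width, and length scales are hypothetical hyperparameters:

```python
import numpy as np

def sek(r, rp, amp, ell):
    """Squared-exponential kernel."""
    return amp**2 * np.exp(-0.5 * (r - rp) ** 2 / ell**2)

def sigmoid(r, loc, width):
    """Smooth 0 -> 1 transition centered at loc."""
    return 1.0 / (1.0 + np.exp(-(r - loc) / width))

def changepoint_kernel(r, rp, loc=0.95, width=0.02,
                       ell_core=0.3, ell_ped=0.02, amp=1.0):
    """Blend a long core length scale into a short pedestal length scale.

    Each region's SEK is weighted by the transition function evaluated at
    both arguments, which keeps the combination a valid (positive
    semidefinite) covariance because it is a sum of scaled kernels.
    """
    w_r, w_rp = sigmoid(r, loc, width), sigmoid(rp, loc, width)
    k_core = (1.0 - w_r) * (1.0 - w_rp) * sek(r, rp, amp, ell_core)
    k_ped = w_r * w_rp * sek(r, rp, amp, ell_ped)
    return k_core + k_ped

# Deep in the core, points 0.1 apart remain strongly correlated; in the
# pedestal, even half that separation decorrelates almost completely.
k_core_pair = changepoint_kernel(0.20, 0.30)
k_ped_pair = changepoint_kernel(0.99, 1.04)
```

In a fit, `loc`, `width`, and the two length scales would be optimized (or sampled) along with the other hyperparameters.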
Figure 5 demonstrates this by showing the same dataset, containing outliers, fit with a Gaussian likelihood and then with a Student's t-distribution likelihood. In the Gaussian likelihood case, the fit is pulled away from the true profile by these outliers, while the Student's t-distribution likelihood automatically accounts for them. In this case, the fit remains close to the true profile, and the error estimation of the fit is also left unaffected. The plots in the figure show the mean, or expectation value, in bold, as well as confidence intervals of the curves. Again, uncertainty quantification is natural within a Bayesian framework.
It is interesting to contrast this with a spline plus hyperbolic tangent fit, which is traditionally used to fit H-mode profiles and is a form of generalized least squares, as discussed above. For the figure with the Gaussian likelihood [Fig. 5(a)], the outliers are kept in the fit and, as seen in the green curve, affect the quality of the fit. For these fits, six knots were used, with the positioning chosen to give the best fit and the width chosen manually to fit the data. As seen in Fig. 5(a), manually throwing away the outliers greatly improves the quality of the fit but requires additional manual intervention. In both cases, however, GPR gives a superior fit without any of the parameter tweaking these fits require.
However, the use of a non-Gaussian likelihood comes at a computational cost. An analytic MAP estimate is no longer possible, and so a more expensive numerical technique must be used. This is a transition from an empirical Bayesian to a full Bayesian approach. The empirical Bayesian approach involves estimating the hyperparameters from the data, and if the likelihood is Gaussian and the kernel is a squared exponential, this is analytically tractable. The full Bayesian approach fixes the prior distribution before data are observed and then explores the full hyperparameter space numerically, integrating over the hyperparameters. There are multiple common tools for the exploration of a multidimensional parameter space, including nested sampling, which is good for smaller-sized problems, and Markov chain Monte Carlo (MCMC) sampling. In the latter, which we use, the hyperparameter space is sampled in such a way that the density of the resulting samples represents the distribution of the hyperparameters. This technique can take more than an order of magnitude longer to compute the fit than the MAP estimate, so one must weigh the pros and cons of the non-Gaussian likelihood. MCMC can also be used in the case of a Gaussian likelihood, though one might ask why this would be done when the significantly less expensive MAP estimate is possible.
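A toy version of the MCMC step can be sketched with a random-walk Metropolis sampler over the hyperparameters of a Gaussian-likelihood GP. The synthetic data, the flat prior in log space, and the proposal scale are all assumptions here; production work would use a mature sampler (e.g., from PyMC) rather than this minimal loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1D profile data (assumed) with Gaussian noise of known level.
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

def log_marginal_likelihood(log_ell, log_amp, sigma_n=0.1):
    """GP log evidence for squared-exponential hyperparameters (Gaussian likelihood)."""
    ell, amp = np.exp(log_ell), np.exp(log_amp)
    K = amp**2 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell**2)
    K += sigma_n**2 * np.eye(x.size)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * x.size * np.log(2.0 * np.pi))

# Random-walk Metropolis over (log ell, log amp) with a flat prior in log space.
theta = np.array([np.log(0.3), 0.0])
logp = log_marginal_likelihood(*theta)
samples = []
for _ in range(3000):
    proposal = theta + 0.15 * rng.standard_normal(2)   # assumed proposal scale
    logp_prop = log_marginal_likelihood(*proposal)
    if np.log(rng.uniform()) < logp_prop - logp:       # Metropolis accept/reject
        theta, logp = proposal, logp_prop
    samples.append(theta)

# Discard burn-in; the sample density approximates the hyperparameter posterior.
samples = np.exp(np.array(samples[500:]))
ell_mean = float(samples[:, 0].mean())
```

The full sample histogram, not just its peak, is available, which is exactly what distinguishes the full Bayesian treatment from the MAP estimate.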
Figure 6 shows the interesting result that, even in this case, MCMC provides a significantly better (lower-error) estimate than the MAP estimate. This is due to the underlying assumption in the MAP estimate that the hyperparameter distributions are Gaussian, so that the expectation value coincides with the maximum. If the hyperparameter distributions are nonsymmetric, however, then the MAP estimate and the expectation value diverge. MCMC provides not only a better fit but also an insight into the qualitative features of the distribution. For example, in a statistical analysis of a large number of profiles, it was found that the length scale distributions of the pedestal region often contain two peaks, and thus the Gaussian assumption is poor. From this, it is evident that care should be taken in using MAP estimates. A good approach is to use MCMC initially to ensure accuracy and then analyze these results to determine whether a faster MAP estimate is appropriate.
It is beyond the scope of this tutorial to fully explain MCMC in detail, but there are many excellent tutorials online. In particular, PyMC^{26} is a widely used Python package with excellent documentation for learning Bayesian modeling in general and MCMC in particular. Inference tools^{27} is a package developed to meet the needs of fusion-specific problems and has Python notebooks and tutorials. The work presented here is based on unbaffeld,^{28} which uses inference tools.
IV. BAYESIAN STATISTICS IN PLASMA PHYSICS
A. Inverse problems in plasma physics
Until now, the focus of this review has not been specifically on plasma physics but rather on issues general to all sciences. Why should plasma physicists especially learn Bayesian statistics? The primary reason is that plasma physics encompasses a wide variety of data types:
Local measurements: Data measurement is highly localized; e.g., Langmuir probes and Thomson scattering.
Line average measurements: Data measurement comes from spectroscopic measurements and results in line averaging of quantities; e.g., soft x-ray, interferometry.
Global measurements: Data measurements give the results of an area or volumetric averaging; e.g., magnetic probes and diamagnetic loop.
For each of these measurements, theoretical understanding is essential to interpret the relationship between the observed data and the key quantities of interest, particularly electron density and temperature. Moreover, plasma physics stands out due to the complexity of its data analysis.
Using theory to deduce quantities of interest from data has traditionally been cast as an inverse problem. This concept is illustrated with two examples in Fig. 7. In the case of Langmuir probes, the voltage is adjusted, and the resultant current is measured. The theoretical model, known as the forward problem, provides the relationship between voltage and current as functions of density and temperature. Although not the most efficient approach when the forward problem is a single equation, one method of solution involves making initial guesses for density and temperature, computing the predicted current, comparing it with the experimental data, and iteratively refining the guesses until the results converge.
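That guess-compare-refine loop can be sketched with a deliberately simplified probe characteristic; the single-exponential model and every numerical value below are illustrative assumptions, not the full probe theory:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)

def forward_model(V, I_sat, T_e):
    """Idealized electron-retardation branch of a probe I-V curve
    (a sketch; real probe theory is far richer)."""
    return I_sat * (np.exp(V / T_e) - 1.0)

# Synthetic "measurement": assumed true I_sat = 2 mA, T_e = 5 eV, plus noise.
V = np.linspace(-20.0, 0.0, 40)
I_meas = forward_model(V, 2e-3, 5.0) + 2e-5 * rng.standard_normal(V.size)

def residual(params):
    """Mismatch between the forward-model prediction and the measured current."""
    I_sat, T_e = params
    return forward_model(V, I_sat, T_e) - I_meas

# Iteratively refine the initial guesses until the synthetic data converge
# to the measurement (a least squares solver automates the iteration).
fit = least_squares(residual, x0=[1e-3, 3.0],
                    bounds=([1e-6, 0.1], [1.0, 50.0]))
I_sat_fit, T_e_fit = fit.x
```

The Bayesian generalization replaces the single converged answer with the posterior distribution over `(I_sat, T_e)`.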
Inverse problems often rely on complex theories and sophisticated numerical methods. A notable example is equilibrium reconstructions in tokamaks, as described in Ref. 22. In this case, the poloidal flux, pressure, and toroidal flux function throughout the tokamak's volume are deduced from magnetic measurements, although other types of data can also be incorporated, as we will discuss later. An iterative process is typically employed to minimize the discrepancy between the synthetic (or modelgenerated) data and the actual observations. This iterative process requires the solution of the Grad–Shafranov equation and the calculation of the synthetic diagnostic at each iteration.
Beyond equilibrium reconstruction, other codes also utilize inverse problem-solving approaches, such as the TRANSP^{29} code for tokamak data analysis and the LASNEX^{30} code for inertial confinement studies. The application of inverse methods extends beyond plasma physics, with seismic analysis and medical imaging techniques like magnetic resonance imaging (MRI) representing prominent examples in broader scientific fields. These applications underscore the widespread relevance and critical importance of inverse problem solving across various disciplines.
Generalizing inverse problems into a Bayesian framework is natural. For instance, in the context of a Langmuir probe, the objective is to determine the posterior probability of electron density and temperature based on the measured current and voltage, $p(n_e, T_e \mid I_m, V_m)$. Similarly, for equilibrium reconstruction in tokamaks using only magnetic data, the aim is to ascertain the posterior distribution of poloidal flux, pressure, and toroidal flux function given the magnetic measurements, $p(\psi, p, F \mid D_{\mathrm{mag}})$.
Within the plasma physics community, the clear leader in developing Bayesian techniques has been the European magnetic fusion community. Two prescient references are a 2003 paper by Fischer,^{31} which discusses integrating multiple diagnostics, and Svensson,^{32} which discusses multiple diagnostics with consistency of equilibrium reconstruction. These papers offer two key insights:

Inverse problems are ubiquitous in plasma physics and best done with a Bayesian approach.

Instead of solving inverse problems individually, they should be solved simultaneously in what Fischer terms integrated data analysis.^{33}
Although both the LASNEX and TRANSP codes did a type of integrated data analysis long before this term was developed, IDA generally refers to the idea of treating plasma data analysis as one big inverse problem: from raw diagnostic signals to quantities of interest. Two software frameworks, Minerva^{34} and IDA,^{35} have been developed by the European community to address the practical aspects of performing integrated data analysis.
At this stage of the tutorial, the benefits of the integrated data analysis (IDA) method should be clear to the reader. However, the question arises: if IDA is so beneficial, why has it not seen wider adoption? The main obstacles lie in its complexity and the high computational costs involved. To better understand the advantages and drawbacks of the IDA approach, two stateoftheart recent studies from the literature are discussed next.
B. Example of Bayesian analysis for multiple diagnostics
In this section, an example of integrated data analysis is shown based on two papers by Kwak et al.^{23,36} In Ref. 23, the inverse problems associated with far infrared interferometry (FIR) and Thomson scattering (TS) signals are solved simultaneously.
Thomson scattering is a popular diagnostic because it gives localized electron density and temperature measurements. While the temperature can be determined from the physics of Thomson scattering itself, the density requires calibration. This is because the number of detected photons is proportional to the number of electrons that scatter them, but there is nothing to automatically determine that proportionality constant. Far infrared interferometry can also measure electron density. However, it is a line average measurement and thus requires a tomographic inversion to calculate a profile, as mentioned in Sec. III.
Combining the two methods then involves two steps: (1) determining the calibration constant for the density as measured by TS and then (2) inferring the electron density and temperature profiles. This is more accurate than manually calibrating the density periodically because the calibration occurs at precisely the same time as the measurement. The number of detected photons can often vary because of material deposits on the lenses or because of variations in the detectors themselves. Combining the two diagnostics thus gives a measurement more accurate than traditional methods, with the additional benefits of being more self-consistent and robust and less labor intensive, since manual TS calibration is not needed.
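The two steps can be sketched as one joint least squares problem. The parabolic profile shape, the chord geometry, and all constants below are hypothetical; the point is that the FIR line integral pins down the absolute density, which in turn fixes the TS calibration constant:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)

x = np.linspace(-0.9, 0.9, 12)   # TS scattering locations (normalized radius)

def density(x, n0):
    """Hypothetical parabolic density profile, n0 in m^-3."""
    return n0 * (1.0 - x**2)

n0_true, c_true = 3.0e19, 5.0e16   # assumed truth: peak density, m^-3 per count
counts = density(x, n0_true) / c_true * (1 + 0.03 * rng.standard_normal(x.size))

# FIR line-integrated density along a central chord from x = -1 to 1:
# integral of n0 (1 - x^2) dx = (4/3) n0.
fir = (4.0 / 3.0) * n0_true * (1 + 0.01 * rng.standard_normal())

def residual(params):
    """Joint mismatch: calibrated TS counts vs profile, and FIR vs line integral."""
    log_n0, log_c = params
    n0, c = np.exp(log_n0), np.exp(log_c)
    r_ts = (c * counts - density(x, n0)) / 1e18
    r_fir = ((4.0 / 3.0) * n0 - fir) / 1e18
    return np.append(r_ts, r_fir)

fit = least_squares(residual, x0=[np.log(1e19), np.log(1e17)])
n0_fit, c_fit = np.exp(fit.x)
```

In the full Bayesian treatment of Refs. 23 and 36, this becomes a joint posterior over the calibration constant and the profiles, with proper noise models replacing the simple scaling used here.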
In this case study, the equilibrium configuration was assumed to be static, leading to potential discrepancies between the pressure gradients inferred from the diagnostics and those derived from equilibrium reconstructions. This issue is addressed in Ref. 36, where a more comprehensive Bayesian framework is introduced. This framework incorporates data from a variety of diagnostics, including magnetic probes, polarimeters, interferometers, Thomson scattering, and lithium beam spectroscopy. The approach facilitates the generation of equilibria that are consistent across all diagnostic inputs and also allows for the determination of the full range of possible equilibria.
Quantifying the full uncertainty in simulations, such as those run by magnetohydrodynamic and gyrokinetic codes, presents a challenge. The primary source of uncertainty often stems from variations in the equilibrium, which must be aligned with experimental diagnostics. An example of such uncertainty quantification is provided by Ref. 37, where the peeling–ballooning stability boundary is assessed by varying the equilibrium through two parameters. However, developing more generalized methods for this kind of analysis has been difficult. This work represents a pioneering effort in deriving such a distribution from fundamental principles.
Instead of directly solving the Grad–Shafranov equation, the equilibrium code in Ref. 36 employs a current-carrying beam model. This model, while simplified, offers the advantage of computational speed, crucial for a Bayesian analysis that requires numerous iterations to produce a full distribution. As computational demands increase, machine learning has emerged as a potent tool for creating these reduced-order, or surrogate, models, as explored further in the subsequent discussion.
C. Machine learning for inverse problems
Machine learning has many deep connections to Bayesian statistics. When viewed as a method of nonlinear curve fitting, the machine learning community was a pioneer^{17} in the use of Gaussian processes, as discussed in Sec. III. Statistics are key when judging the quality of a given model, and any modern statistical method has Bayesian statistics as its basis. Here, the issues of machine learning in inverse problems are discussed, including its applications to equilibrium reconstruction as a continuation of Sec. IV A. A recent review of the role of Bayesian statistics and machine learning in nuclear fusion may be found in Ref. 38.
There are two methods for using machine learning for inverse problems. The first is to use a surrogate for the likelihood in a way similar to the currentcarrying beam model in Ref. 36. However, if there is already a method for solving the inverse problem, then another approach is to use that method to train on the posterior. The disadvantage of this approach is that one does not obtain the selfconsistency inherent in the integrated data analysis approach; however, this method is generally faster and easier to implement while still providing an approximation of the posterior distribution. In addition, it can provide insight into the strengths and weaknesses of using machine learning as a surrogate model for the Grad–Shafranov equation.
EFIT solves the inverse problem via a least squares minimization. Without going into too much detail, and referring to the discussion in Sec. III, least squares fitting can be viewed as a linear curve fit obtained by calculating the maximum a posteriori estimate using a uniform prior. Thus, EFIT's solution of the inverse reconstruction process can be viewed as the calculation $\max[p(\psi, p, F \mid D_{\mathrm{mag}})]$. The max converts the distribution into a function that is then suitable for finding a neural network surrogate. That is, here we consider $y = f(d)$, where $y = (\psi, J_{\mathrm{tor}})$ and $d$ is the magnetics data, as the function to be fit with a neural network model.
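As a toy illustration of fitting a function $y = f(d)$ with a neural network, the sketch below trains a one-hidden-layer perceptron on pairs generated by an arbitrary smooth forward map. The map, the network size, and the training settings are all assumptions, not EFIT's actual surrogate:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic training pairs standing in for (magnetics data d, reconstruction y);
# the "forward map" is an arbitrary smooth function, assumed for illustration.
d_train = rng.uniform(-1.0, 1.0, size=(500, 4))
y_train = np.tanh(d_train @ np.array([0.8, -0.5, 0.3, 1.1]))[:, None]

# One-hidden-layer perceptron y = f(d), trained by full-batch gradient descent.
W1 = 0.5 * rng.standard_normal((4, 16)); b1 = np.zeros(16)
W2 = 0.5 * rng.standard_normal((16, 1)); b2 = np.zeros(1)
lr = 0.1

def predict(d):
    h = np.tanh(d @ W1 + b1)
    return h @ W2 + b2, h

mse_init = float(np.mean((predict(d_train)[0] - y_train) ** 2))
for _ in range(5000):
    pred, h = predict(d_train)
    err = pred - y_train                      # gradient of (1/2) mean squared error
    gW2 = h.T @ err / len(d_train); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1.0 - h**2)          # backpropagate through tanh
    gW1 = d_train.T @ dh / len(d_train); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
mse = float(np.mean((predict(d_train)[0] - y_train) ** 2))
```

Once trained, evaluating `predict` is far cheaper than solving the Grad–Shafranov equation at every iteration, which is the appeal of the surrogate approach.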
Most machine-learning-based surrogate models are developed using standard neural architectures or ad hoc network choices; this includes similar work on EFIT-based neural nets.^{39–41} While versatile, such models may not be specifically tuned to a problem's unique aspects, and they often miss out on quantifying prediction uncertainties, which is crucial for reliable decision-making and risk assessment. Here, a technique known as neural architecture search (NAS) is used to give models with improved prediction accuracy. Incorporating uncertainty quantification provides a more complete understanding of predictions, combining accuracy with confidence measures. NAS is a technique that effectively optimizes both the architecture (e.g., recurrent neural network, convolutional neural network, multilayer perceptron) and the model hyperparameters (e.g., number of layers, weights). This involves two levels of optimization: an outer level for the architecture parameters and an inner level for the model parameters, based on the chosen architecture. This approach ensures a thorough exploration of both parameter spaces.
Unique to our NAS approach is the method by which each model is optimized.
In the future, a true Bayesian framework will ideally include the selfconsistent uncertainty quantification of the diagnostic errors as in Ref. 36 as well as the prediction errors of a fast surrogate. Such a path forward would enable complete uncertainty quantification for the complexity inherent in plasma physics.
V. CONCLUSIONS
Bayesian statistics is powerful and broadly applicable. It is not a prescription, but rather a framework for, or way of thinking about, approaching problems. Its main advantages are that (1) everything is a probability, (2) uncertainty quantification is natural, and (3) it encourages being explicit in background information and assumptions. Given the complexity of plasma physics in both models and data acquisition, it is a natural approach.
For inferring quantities from data, the key pieces of Bayes's theorem are as follows:
Likelihood distribution, $P(D \mid X)$: This distribution contains the data mismatch; it is the distribution through which the data tell us how likely a given unknown value is. It has embedded within it a data model that, in inverse problems, comes from a potentially complex forward model. Because the data model is a type of prediction, this is also known as the predictive distribution.
Prior distribution, $P(X)$: This distribution allows us to use our prior knowledge and background information to weight the posterior distribution toward a particular answer. Common priors are Laplace's choice (a uniform distribution, which gives no weight) and Jaynes's choice (a normal distribution that can be viewed as weighting the posterior distribution toward maximum entropy). In GPR, for example, it is the prior that gives the continuity of the fit.
Posterior distribution, $P(X \mid D)$: This is the desired distribution: the probability of our inferred quantity given the data.
One of the growing uses of Bayesian statistics is Gaussian process regression (GPR) for curve fitting. Within the context of curve fitting, it has the advantage of robustness, since it more easily prevents underfitting and overfitting. It also provides uncertainty quantification by having the variance of the curves built into its formulation.
Integrated data analysis is a Bayesian approach for accurately inferring plasma parameters from the multiple, heterogeneous diagnostics required to understand plasma experiments. Each diagnostic poses an inverse problem, and treating them all as a single inverse problem offers many advantages, including consistency, accuracy, and uncertainty quantification. The approach can also be generalized to include more advanced analysis, such as equilibrium reconstruction in tokamaks. While that example was chosen because it is arguably the most advanced application of integrated data analysis, the underlying problem is general to all of plasma physics. A major disadvantage of integrated data analysis is the increased computation time required to sample the entire distribution space; however, modern computer hardware and mathematical techniques, especially machine learning, have made it more tractable.
The combined use of Bayesian statistics and machine learning has a long history. If Bayesian inference is the simplest mathematical model of human inference, and the goal of artificial intelligence is to mimic the human mind, the relationship is obvious. More concretely, if machine learning is viewed as a type of nonlinear curve fitting in which a set of inputs produces a set of outputs, then Gaussian processes can also be used to provide robust fitting. More recently, ensemble averages have been shown to give a robust inference of the posterior distribution directly. In the future, a completely Bayesian technique that includes uncertainty in the measurements as well as in the surrogate model offers an attractive method for uncertainty quantification of complex data analysis like that found in plasma physics.
ACKNOWLEDGMENTS
The authors thank Professor James Hanson for his strong advocacy of Bayesian statistics. The authors also thank Dr. Severin Denk for his educational conversations on integrated data analysis. Dr. Fenton Glass explained the many issues with Thomson scattering, and Dr. Ted Strait explained the subtleties of magnetic diagnostics; both conversations helped us understand many of the issues in inferring knowledge from diagnostics. The simple example used in this tutorial is based on work by Professor R. P. Dwight of the Delft University of Technology. This material is based upon work supported by the U.S. Department of Energy, Office of Fusion Energy Science under Award Nos. DE-SC0021203, DE-SC0021380, DE-FC02-04ER54698, and DE-FG02-95ER54309. The data used in the machine learning are based upon work supported by the U.S. Department of Energy, Office of Science, Office of Fusion Energy Sciences, using the DIII-D National Fusion Facility, a DOE Office of Science user facility, under Award No. DE-FC02-04ER54698. Disclaimer: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
S. E. Kruger: Conceptualization (lead); Data curation (supporting); Methodology (supporting); Project administration (equal); Software (supporting); Supervision (equal); Validation (supporting); Writing—original draft (lead); Writing—review & editing (lead). S. P. Smith: Data curation (equal); Project administration (supporting); Supervision (supporting). X. Sun: Methodology (supporting); Software (supporting); Writing—review & editing (supporting). A. Samaddar: Methodology (supporting); Writing—review & editing (equal). A.Y. Pankin: Data curation (supporting); Software (supporting). J. Leddy: Conceptualization (equal); Methodology (equal); Software (equal); Writing—original draft (supporting). E. C. Howell: Conceptualization (equal); Methodology (equal); Writing—review & editing (supporting). S. Madireddy: Conceptualization (equal); Methodology (equal); Software (equal); Writing—review & editing (supporting). C. Akcay: Software (equal); Writing—original draft (supporting); Writing—review & editing (supporting). T. Bechtel Amara: Data curation (lead); Software (equal); Writing—review & editing (supporting). J. McClenaghan: Data curation (supporting); Software (supporting); Writing—review & editing (supporting). L. L. Lao: Funding acquisition (lead); Project administration (lead); Writing—review & editing (equal). D. Orozco: Data curation (supporting); Validation (supporting).
DATA AVAILABILITY
The data used in the training of the machine learning model will be made available in the future under FAIR practices after an article describing the data is published.