Bayesian statistics offers a powerful technique for plasma physicists to infer knowledge from the heterogeneous data types encountered. To explain this power, a simple example, Gaussian process regression, and the application of Bayesian statistics to inverse problems are presented. The likelihood is the key distribution because it contains the data model, or theoretical predictions, of the desired quantities. By combining the likelihood with prior knowledge, the distribution of the inferred quantities of interest given the data can be determined. Because the result is a distribution of inferred quantities given the data and not a single prediction, uncertainty quantification is a natural consequence of Bayesian statistics. The benefits of machine learning in developing surrogate models for solving inverse problems are discussed, as well as progress in quantitatively understanding the errors that such a model introduces.

Inferring knowledge from data is a critical issue across all scientific disciplines. In plasma physics, this issue is particularly challenging because of the complexities in directly measuring fundamental properties like density and temperature. Such measurements are indirect and rely heavily on theoretical development for even a basic level of understanding. For example, understanding Langmuir probe physics necessitates two pages of explanation in the most commonly used plasma physics textbook.1 Experts in Langmuir probes might spend their entire careers delving into the intricacies of this area.2 

Bayesian statistics offers a powerful framework for data interpretation and is especially valuable in challenging areas like plasma physics. This approach necessitates the creation of a data model that accounts for the unknowns, enabling the application of complex data models, such as those found in plasma physics. It also facilitates natural quantification of uncertainty. The explicit nature of the data model in Bayesian statistics also ensures that assumptions are clearly stated and integrated into the analysis.

Despite its advantages, Bayesian statistics has only recently begun to gain wider appreciation, partly for historical reasons. Bayes's theorem was first published in 1763 (Ref. 3) and further developed by Laplace.4 However, it initially fell into disrepute due to errors in its application.5 A prominent early 20th-century statistician, Ronald Fisher, even dismissed it, stating that “(Bayes Theorem) is founded upon error, and must be wholly rejected.”6 These errors were rectified in the 1950s and 1960s, securing a solid theoretical foundation for the theorem. Harold Jeffreys likened Bayes's theorem to the Pythagorean theorem in geometry, underscoring its fundamental importance.7 Today, while modern statisticians do not necessarily view Bayesian methods as distinct from general statistics, there remains a lack of awareness in the broader scientific community, fostering a growing number of Bayesian advocates. Two notable textbooks that emerged in the early 2000s, authored by Jaynes8,9 and Sivia and Skilling,10 have been instrumental in introducing these concepts to scientists. These books, as well as a textbook by Tarantola,11 are highly recommended.

This tutorial continues in this tradition. Although there are many tutorials available on Bayes's theorem, many of them focus on applications such as games of chance or spam filters. While these attest to the power of Bayesian statistics, they are not necessarily helpful for a plasma physicist looking to understand the essence of the Bayesian approach for their own needs. In this tutorial, we present only the specific problem of inferring knowledge from data. In this case, we write Bayes's theorem as follows:
$$\underbrace{p(X|D,I)}_{\text{posterior}} = \frac{\overbrace{p(D|X,I)}^{\text{likelihood}}\,\overbrace{p(X|I)}^{\text{prior}}}{\underbrace{p(D|I)}_{\text{evidence}}}. \tag{1}$$
Here, $p(X|D,I)$ is the conditional probability, i.e., the probability that X is true given D. Our goal is to infer the probability of our unknowns X given the data D. The background information I is included to be clear that we will explicitly include other knowledge not contained within X or D.

The conditional probabilities for each of the factors in Bayes's equation have names as shown, but we defer discussing those names because the rest of the tutorial provides intuition on what each of these factors represents. We start with a simple example that is easy both to solve analytically and to visualize graphically. We use the mathematics and concepts from the simple example to explain Gaussian process regression, a Bayesian profile-fitting technique that is increasing in usage within the plasma physics community. Next, we discuss integrated data analysis and present a state-of-the-art usage of Bayesian statistics. Finally, we briefly touch upon the interplay of Bayesian statistics and machine learning.

We begin with using Bayesian statistics to solve a simple problem: guess the weight of two people named Ann and Bob, standing behind a curtain on a single scale reading 320 lbs.12 (see Fig. 1). This is clearly an ill-posed problem, but because either under-fitting or over-fitting the unknowns from data is common in science, it serves as a good example for illustrating the concepts.

FIG. 1.

The likelihood distribution (a) of Eq. (4) is global, demonstrating the ill-posed nature of the simple example. Here, d = 320 lbs. and the standard deviation is taken to be 5% of d, i.e., $\sigma_\epsilon = 16$ lbs. The prior distribution (b) of Eq. (7) shows that Ann and Bob's weights are independent because the axes of the resultant ellipsoid are parallel to the figure axes.


Rather than providing a single guess, the Bayesian approach emphasizes the uncertainty and the need to formulate a probability of answers given the limited data. To be concrete, let $x_1$ be the weight of Ann, $x_2$ be the weight of Bob, and d = 320 lbs. be the measurement. Our goal is to find the posterior distribution, $P(X = x_1, x_2\,|\,D = d, I)$, i.e., the probability of Ann and Bob's weights given the measurement. To do so, we will use Bayes's theorem, Eq. (1), and calculate it using the likelihood, $P(D|X,I)$, and prior, $P(X|I)$, distributions. The evidence, $P(D|I)$, is not used because it acts as a normalization constant for the posterior distribution. As stated earlier, the background information I is explicitly included to demonstrate the use of knowledge not contained within X or D. In this case, one can write $P(x_i < 0) = 0$ as the first and obvious piece of information that we will use.

The likelihood is the distribution that contains the data model. For this simple example, the data model is simply
$$d = x_1 + x_2 + \epsilon, \tag{2}$$
where ϵ is the error of the measurement. We assume that this error is unbiased and has a normal distribution:
$$\epsilon \sim \mathcal{N}(0, \sigma_\epsilon), \tag{3}$$
where $\mathcal{N}$ represents the normal distribution as a function of the mean (zero in this case because the error is unbiased) and the standard deviation. The key step for developing the likelihood is the realization that it represents the probability of the error:
$$P(D = d\,|\,X_1 = x_1, X_2 = x_2) = P(\epsilon) = \exp\left[-\frac{1}{2}\frac{\epsilon^2}{\sigma_\epsilon^2}\right] = \exp\left[-\frac{1}{2}\frac{\left(d - (x_1 + x_2)\right)^2}{\sigma_\epsilon^2}\right]. \tag{4}$$
The term $d - (x_1 + x_2)$ is known as the data mismatch, or residual, and it is the main feature of the likelihood distribution, even for more complex models. In words, we are saying that the most likely values occur when the error is minimized, and the error is minimized when the data mismatch is zero. The likelihood is the key place where one sees that uncertainty quantification is natural to a Bayesian view because including error in the measurement is central to the development of the statistical data model.
It is useful to express the likelihood in generalized matrix form to understand later discussions. In matrix form, this can be written as
$$P(D|X) = \exp\left[-\frac{1}{2}(d - Ax)^T R^{-1}(d - Ax)\right], \tag{5}$$
where x represents a 2 × 1 vector with components $x_1$ and $x_2$, A is the 1 × 2 matrix $[1\;\;1]$, and R is a 1 × 1 matrix with component $\sigma_\epsilon^2$. Graphically, this distribution can be seen in Fig. 1. In this graph, Ann's weight is on the x axis and Bob's weight is on the y axis. The peak of the distribution occurs when their weights add to 320 lbs., giving a straight line. There is no localization in the values for Ann and Bob's weights, illustrating the ill-posed nature of this problem. In this figure, we have used $\sigma_\epsilon = 16$ lbs., which is 5% of d.
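As a minimal illustration, the unnormalized likelihood of Eq. (5) can be evaluated on a grid of candidate weights; the grid range and resolution below are arbitrary choices for this sketch, not values taken from the figure.

```python
import numpy as np

# Minimal sketch: evaluate the (unnormalized) likelihood of Eq. (5) on a grid
# of candidate weights, which produces the diagonal ridge of Fig. 1(a).
d = 320.0          # scale reading (lbs.)
sigma_eps = 16.0   # 5% of d

x1 = np.linspace(100.0, 250.0, 301)   # candidate weights for Ann (illustrative range)
x2 = np.linspace(100.0, 250.0, 301)   # candidate weights for Bob
X1, X2 = np.meshgrid(x1, x2)

residual = d - (X1 + X2)                          # data mismatch
likelihood = np.exp(-0.5 * residual**2 / sigma_eps**2)

# Every (x1, x2) pair along the line x1 + x2 = d is equally likely,
# which is the ridge seen in Fig. 1(a): the problem is ill-posed.
print(likelihood.max())   # 1.0, attained all along the ridge
```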

The prior distribution offers considerably more freedom in choosing and making use of the background information. There are two common choices used for the prior:

  • Laplace's choice: This is a uniform prior and is simple because in this case the posterior becomes the likelihood. It is known as the principle of insufficient reason because one conservatively assumes there is no reason to use background information to make further progress.

  • Jaynes's choice: This assumes normal distributions as the means of exploiting the background knowledge, which in this context satisfies the principle of maximum entropy because we can use the mean and deviation as constraints. See Chapter 5 of Sivia10 for the derivation of other priors satisfying the maximum entropy principle and Chapter 7 of Jaynes8 for the use of the normal distribution.

Historically, most of the confusion around Bayesian statistics lies in the subtleties of setting the prior, especially for cases of time-varying data where one wants to update the posterior distribution. Other choices of priors can, of course, be made. For example, the Jeffreys prior13 is an important prior that we could choose in this case.

In this tutorial, we will use Jaynes's choice as the most straightforward and make the following assumptions:

  • Ann and Bob are American, as befitting a tutorial presented at an American Physical Society meeting.

  • Ann and Bob have gender-normative names.

  • Ann and Bob's weights are independent of each other.

As stated previously, the Bayesian approach encourages being explicit in assumptions and use of background information. Each of these assumptions may be suspect; for example, Ann and Bob could live together, in which case their weights would be correlated. However, by being explicit in the assumptions, one is more easily able to understand the effects on the posterior distribution; i.e., the effect of assumptions on our best guess. With our choice of principle of maximum entropy and the assumptions of independent weights, we can write
$$x_1 \sim \mathcal{N}(\mu_1, \sigma_1); \quad x_2 \sim \mathcal{N}(\mu_2, \sigma_2). \tag{6}$$
With assumptions 1 and 2, we can use data from the Centers for Disease Control14 and find $\mu_1 = 170.8$ lbs. and $\mu_2 = 199.8$ lbs. The standard deviations are not provided, so for convenience we will assume $\sigma_1 = 15$ lbs. and $\sigma_2 = 25$ lbs. This choice will be discussed later. Our prior distribution then is
$$P(X_1 = x_1, X_2 = x_2) = \exp\left[-\frac{1}{2}\left(\frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2}\right)\right] \tag{7}$$
$$= \exp\left[-\frac{1}{2}(x - \mu)^T V^{-1}(x - \mu)\right], \tag{8}$$
where on the second line the more abstract form is represented. The variance matrix V is a 2 × 2 diagonal matrix, because the weights are independent, with the variances $\sigma_1^2$ and $\sigma_2^2$ on the diagonal. This distribution is shown graphically in Fig. 1 along with the likelihood distribution.
With our likelihood and prior distributions, the posterior distribution can now be calculated. Because both the likelihood and prior are normal distributions, it is easiest to see the algebra by considering the log of the posterior. Multiplying two exponentials gives the sum of the exponents such that the log posterior is
$$\log P(X|d) + \text{constant} = -\frac{1}{2}(d - Ax)^T R^{-1}(d - Ax) - \frac{1}{2}(x - \mu)^T V^{-1}(x - \mu). \tag{9}$$
This abstract form is the sum of two quadratic forms. Recall that by completing the square, a simplified quadratic form may be written. After a fair amount of algebra, one may write
$$P(X|d) \propto \exp\left[-\frac{1}{2}(x - \hat{\mu})^T \Sigma^{-1}(x - \hat{\mu})\right], \tag{10}$$
where
$$K = V A^T \left(R + A V A^T\right)^{-1}, \tag{11}$$
$$\hat{\mu} = \mu + K(d - A\mu), \tag{12}$$
$$\Sigma = (I - KA)V, \tag{13}$$
and I is the identity matrix. In this form, it is clear that the maximum of the posterior occurs when $x = \hat{\mu}$; that is, $\hat{\mu}$ is the maximum a posteriori, or MAP, estimate. The new variance matrix, Σ, has off-diagonal terms, indicating that the posterior distribution predicts that Ann and Bob's weights are correlated. This is expected since their weights most likely add to something close to 320 lbs. The matrix K is known as the Kalman gain matrix and is central to the widely used Kalman filter.15,16 It plays a key role in producing the MAP estimate and new variance matrix. Specifically, the new mean is given by the mean of the prior shifted by the data mismatch, with the amount of that shift determined by the Kalman gain matrix. Likewise, the new variance matrix is the old variance with a modification proportional to the Kalman gain matrix.
Using the definitions above, the values for this simple example are
$$K = \begin{bmatrix} \dfrac{\sigma_1^2}{\sigma_\epsilon^2 + \sigma_1^2 + \sigma_2^2} \\[2ex] \dfrac{\sigma_2^2}{\sigma_\epsilon^2 + \sigma_1^2 + \sigma_2^2} \end{bmatrix}, \tag{14}$$
$$\hat{\mu} = \begin{bmatrix} \mu_1 + \dfrac{\sigma_1^2}{\sigma_\epsilon^2 + \sigma_1^2 + \sigma_2^2}\,(d - \mu_1 - \mu_2) \\[2ex] \mu_2 + \dfrac{\sigma_2^2}{\sigma_\epsilon^2 + \sigma_1^2 + \sigma_2^2}\,(d - \mu_1 - \mu_2) \end{bmatrix}, \tag{15}$$
$$\Sigma = \begin{bmatrix} \sigma_1^2\left(1 - \dfrac{\sigma_1^2}{\sigma_\epsilon^2 + \sigma_1^2 + \sigma_2^2}\right) & -\dfrac{\sigma_1^2\sigma_2^2}{\sigma_\epsilon^2 + \sigma_1^2 + \sigma_2^2} \\[2ex] -\dfrac{\sigma_1^2\sigma_2^2}{\sigma_\epsilon^2 + \sigma_1^2 + \sigma_2^2} & \sigma_2^2\left(1 - \dfrac{\sigma_2^2}{\sigma_\epsilon^2 + \sigma_1^2 + \sigma_2^2}\right) \end{bmatrix}. \tag{16}$$
Using the values above ($\sigma_\epsilon = 16$ lbs., $\sigma_1 = 15$ lbs., $\sigma_2 = 25$ lbs.), the values $\hat{\mu}_1 = 160.5$ lbs. and $\hat{\mu}_2 = 170.5$ lbs. are obtained.
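The update of Eqs. (11)-(13) is short enough to check numerically. The following is a minimal sketch, using the numbers quoted in the text; small differences from the quoted posterior means may arise from rounding.

```python
import numpy as np

# Minimal sketch of Eqs. (11)-(13) for the Ann/Bob example.
d = np.array([320.0])                 # scale reading
A = np.array([[1.0, 1.0]])            # data model: d = x1 + x2 + eps
R = np.array([[16.0**2]])             # measurement variance, sigma_eps = 16 lbs.
mu = np.array([170.8, 199.8])         # prior means (CDC averages)
V = np.diag([15.0**2, 25.0**2])       # prior variances

K = V @ A.T @ np.linalg.inv(R + A @ V @ A.T)   # Kalman gain, Eq. (11)
mu_hat = mu + K @ (d - A @ mu)                 # MAP estimate, Eq. (12)
Sigma = (np.eye(2) - K @ A) @ V                # posterior covariance, Eq. (13)

# The marginal distributions discussed below are read off the diagonal of Sigma.
print("MAP estimate:", mu_hat)
print("marginal std:", np.sqrt(np.diag(Sigma)))
```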
Before discussing the results in further detail, it is helpful to take limits to gain intuition on these forms. The first limit considered is the perfect scale; i.e., σ ϵ = 0. In this case, the expected result that the means add up to d is recovered:
$$\hat{\mu}_1 + \hat{\mu}_2 = \mu_1 + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}(d - \mu_1 - \mu_2) + \mu_2 + \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}(d - \mu_1 - \mu_2) \tag{17}$$
$$= \mu_1 + \mu_2 + \frac{\sigma_1^2 + \sigma_2^2}{\sigma_1^2 + \sigma_2^2}(d - \mu_1 - \mu_2) \tag{18}$$
$$= d. \tag{19}$$
In this case, the likelihood distribution is a straight line and the prior serves to localize it along that line. The exact values calculated in this limit are $\hat{\mu}_1 = 157.4$ lbs. and $\hat{\mu}_2 = 162.6$ lbs. The effect of uncertainty in the measurement is to naturally pull the estimates closer to the national average weights.

The next limit to consider is the unreliable scale, $\sigma_\epsilon \to \infty$. In this case, the Kalman gain matrix elements all go to zero; therefore, the posterior mean is the same as the mean of the prior. That is, when we do not trust the data, our best guess is to say that Ann's and Bob's weights are each the national average weight for their gender. The difference in predictions between $\sigma_\epsilon = 0$ and $\sigma_\epsilon \to \infty$ for Ann's weight is to go from 157.4 to 170.8 lbs. and for Bob's weight from 162.6 to 199.8 lbs. Although our uncertainties were arbitrarily chosen, the results are perhaps less sensitive to large variations than one would expect. This will also be seen in Sec. III.

Bayesian statistics is not about predicting the maximum a posteriori value but about getting the distributions of the predictions. Although the total distribution is obtained, what is often desired is the distribution of our inference for Ann and Bob's weights individually. To do so, we integrate out the other value in a process called marginalization. This may be viewed as a projection of the correlated posterior distribution onto the x1 axis.

Bayes's theorem can be visually understood through the example shown in Fig. 2. The global likelihood distribution, shown in blue, is multiplied by the localized prior distribution, shown in red, yielding a more concentrated posterior distribution, shown in purple. The prior distribution is uncorrelated, as evidenced by its ellipsoid's axes being parallel to the coordinate axes. In contrast, the posterior distribution becomes highly correlated due to the stringent constraints imposed by the data through the likelihood distribution. The MAP estimate, marked at the center of the posterior ellipsoids, is situated above the most probable likelihood line because the statistical American weights imply that the scale is most likely under-weighing Ann and Bob, as discussed earlier when discussing the special limits. Marginalization is then applied to project the posterior distribution onto each axis, revealing the inferred individual weight distributions for Ann and Bob. In this simple example, the posterior distribution remains symmetric such that the MAP value is also the mean. This is discussed further in Sec. III.

FIG. 2.

From Bayes's theorem, the likelihood distribution (blue) is multiplied by the prior distribution (red) to give the posterior distribution (purple). Marginalization is used to project the posterior distribution onto each axis to show the inferred weight distribution for Ann and Bob.


In conclusion, the extensive mathematical framework of Bayes's theorem ultimately leads to an intuitive outcome: utilizing known data about average weights to make an educated guess in a poorly defined problem. This aligns with Laplace's assertion in 1819 that “probability theory is nothing but common sense reduced to calculation.” Jaynes's textbook8 expands on this idea, portraying the derivation of Bayes's theorem as the simplest mathematical model for human inference. While this basic example highlights the theorem's alignment with common sense, Sec. III will apply it to curve fitting, showcasing how these foundational concepts are extended to more complex applications.

Curve fitting, determining the “best curve” through a set of data, is ubiquitous throughout the sciences. The most common method for doing this is the generalized linear regression method. In this method, the curve y is determined by the type of basis function:
$$y(r; w) = \sum_{j=0}^{m-1} w_j \phi_j(r), \tag{20}$$
where r is the independent variable (r is used instead of x to avoid confusion with the notation used in Sec. II), and w represents the weights, which are determined by minimizing the least squares error between the data and the function values at the data points. If $\phi_j = r^j$, then the curve fitting is linear regression. Polynomial fits, spline fits, and Fourier fits are all contained within this general framework. While commonly used, these methods all suffer from common problems. Under-fitting and over-fitting are common in practice because it is unclear how many basis functions to use and which basis functions to choose without looking at the data, and because of some arbitrariness in picking parameters (e.g., spline knot locations). In addition, the approach does not naturally include uncertainty quantification or measurement error.
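As a point of comparison for what follows, a minimal least squares sketch with a polynomial basis illustrates the basis-function choice problem; the data and basis sizes below are made up for illustration.

```python
import numpy as np

# Minimal sketch of generalized linear regression, Eq. (20), with a polynomial
# basis phi_j(r) = r**j.  The data are synthetic and purely illustrative.
rng = np.random.default_rng(0)
r_data = np.linspace(0.0, 1.0, 8)
d = np.sin(2.0 * np.pi * r_data) + 0.1 * rng.normal(size=r_data.size)

def fit_poly(r, d, m):
    """Least squares weights for m basis functions phi_j(r) = r**j."""
    Phi = np.vander(r, m, increasing=True)        # design matrix, shape (n, m)
    w, *_ = np.linalg.lstsq(Phi, d, rcond=None)
    return w

# The choice of m is the crux: too few basis functions under-fits, too many
# over-fits (m equal to the number of points gives a zero residual), and
# nothing in the method itself says which m to pick.
for m in (2, 4, 8):
    w = fit_poly(r_data, d, m)
    resid = d - np.vander(r_data, m, increasing=True) @ w
    print(m, np.sum(resid**2))
```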

Gaussian process regression (GPR) has emerged as a common, and growing, method for overcoming these limitations within the plasma physics community. It benefits from an excellent textbook by Rasmussen and Williams.17 The first plasma physics use that we are aware of is an EFDA report by Svensson,18 where Gaussian processes were used for inferring density profiles from interferometers and the current distribution from magnetic diagnostics in the JET tokamak. A notable and influential paper is Ref. 19. A recent paper notable for discussing some of the pitfalls of applying GPR is Ref. 20. Rather than a comprehensive review of the uses of this method, a highlight of the key features is given here. To do so, Ref. 21 is followed because it uses analytic profiles with synthetic data to obtain quantitative measurements of the accuracy. More details may be found in that paper.

Our goal with the Bayesian approach to curve fitting is to find the posterior distribution of curves; i.e., the goal is not to find a single curve that fits the data but a distribution that describes the probability of the given curve. This is discussed more below. The basic procedure proceeds the same as the simple example in Sec. II. The likelihood is determined using a data model, and the prior is developed using our prior knowledge to determine the posterior distribution using Bayes's theorem.

Recall that the likelihood, $p(D|X)$, is the distribution of the error between the data and the inferred quantities. Let d represent the data points that are being fit, y be the curve being fit, and A represent the operator that evaluates, or interpolates, the curve y at the data locations. That is, if $d_i$ is located at the point $r_i$, $(Ay)_i$ is the value of y at $r_i$. Similar to before, the likelihood then is
$$p(D = d\,|\,X = y) = \exp\left[-\frac{1}{2}(d - Ay)^T R^{-1}(d - Ay)\right], \tag{21}$$
where, as before, R is a diagonal matrix containing the uncertainty of the data. The term in the exponent is the error and if one were to assume a uniform prior (principle of insufficient reason), this could be used to derive least squares regression from the Bayesian perspective.
Instead, the Gaussian process regression method utilizes the assumption of continuity to develop its prior. Recall that the simple example used the principle of maximum entropy with $x_1 \sim \mathcal{N}(\mu_1, \sigma_1)$, where $\mathcal{N}$ is the normal distribution for a single number, as befits $x_1$. Assuming a continuous function requires a generalization of this distribution. The generalization of the normal distribution to curves is known as a Gaussian process, denoted as $y \sim \mathcal{GP}$, such that y is in the family of continuous functions ($y \in C$). Similar to the simple example, the prior is written as
$$p(X = y) = \mathcal{GP}\left(\mu(r), V(r, r')\right) \tag{22}$$
$$= \exp\left[-\frac{1}{2}(y - \mu)^T V^{-1}(y - \mu)\right], \tag{23}$$
where the covariance matrix V is also known as the kernel. In Ref. 21 and most plasma physics references, this is denoted by K for kernel, but V is used here to avoid confusion with the notation in Sec. II. V gives the correlation between two points, including self-correlations or uncertainties. Multiple kernel functions can be used, but in the plasma physics community, the most common kernel is the squared-exponential kernel (SEK), also known as the radial basis function:
$$V_{\mathrm{sek}}(r, r') = \theta_v \exp\left[-\frac{(r - r')^2}{2\theta_l^2}\right]. \tag{24}$$
The parameters controlling the uncertainty and correlations, $\theta_v$ and $\theta_l$, are the hyperparameters.
The likelihood, Eq. (21), and the prior, Eq. (22), have the same functional forms as the likelihood and prior of the simple example [Eqs. (4) and (7), respectively]. The posterior will thus have a similar form:
$$P(X = y\,|\,D = d) \propto \exp\left[-\frac{1}{2}(y - \hat{\mu})^T \Sigma^{-1}(y - \hat{\mu})\right], \tag{25}$$
with the definitions in Eqs. (11)-(13) taking the same form once the factors are updated with the definitions above. As before, $y = \hat{\mu}$ gives the maximum a posteriori or MAP estimate.

As an illustration, consider a minimal dataset of six points spaced equally in x. The underlying curve is a smooth function, but unlike parameterized fitting, performing a fit with GPR does not require any knowledge of this function. Figure 3 shows the MAP estimate and 95% confidence intervals for the example dataset, assuming zero errors in the data. The distribution of possible functions is wide between data points, leading to a larger error in the fit in those regions. This example demonstrates the power of Gaussian process regression: even with few data points and no hyperparameter optimization, the resulting MAP estimate is smooth and reasonable. Additionally, this shows the role that measurements play in the GPR process, constraining the fit. Error estimations on these data points can also be provided naturally to the GPR algorithm, which loosens the constraint these points provide. This will be shown next.
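A minimal sketch of this GPR update is given below. It assumes a zero prior mean, an arbitrary test function, and arbitrary hyperparameter values; it is not the code or the settings used for Fig. 3.

```python
import numpy as np

# Minimal GPR sketch with the squared-exponential kernel of Eq. (24) and the
# Kalman-style update of Eqs. (11)-(13), assuming a zero prior mean.
def sek(r1, r2, theta_v=1.0, theta_l=0.2):
    return theta_v * np.exp(-0.5 * (r1[:, None] - r2[None, :])**2 / theta_l**2)

r_data = np.linspace(0.0, 1.0, 6)            # six equally spaced points
d = np.sin(2.0 * np.pi * r_data)             # illustrative "measurements"
r_fit = np.linspace(0.0, 1.0, 200)           # locations where the curve is wanted

R = 1e-8 * np.eye(r_data.size)               # (near) zero measurement error
V_dd = sek(r_data, r_data)                   # prior covariance at the data
V_fd = sek(r_fit, r_data)                    # cross covariance
V_ff = sek(r_fit, r_fit)

K = V_fd @ np.linalg.inv(V_dd + R)           # gain matrix for the curve
mu_hat = K @ d                               # MAP curve (zero prior mean)
Sigma = V_ff - K @ V_fd.T                    # posterior covariance
ci95 = 1.96 * np.sqrt(np.clip(np.diag(Sigma), 0.0, None))   # 95% band

# The band collapses at the data points and widens between them,
# mirroring the behavior described for Fig. 3.
print(mu_hat[:3], ci95[:3])
```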

FIG. 3.

Data provided as training points to GPR provide a constraint at their locations, and the fit is allowed to vary between these points.


In tokamaks, the Thomson scattering (TS) diagnostic provides measurements of temperature and density at a series of physical locations along laser beamlines within the plasma. These measurements are often noisy, as they come from scattered photon spectra whose intensity is related to the electron density and whose width is related to the electron temperature. The beamlines are designed such that the measurements span a large range of flux surfaces, so density/temperature/pressure profiles can be obtained. These profiles are important inputs into equilibrium reconstruction codes, such as EFIT;22 however, it is important to fit these data to obtain smooth profiles. Many parameterized techniques are popular for this purpose, but recent work19,21,23 has utilized GPR due to its many benefits: error estimation, non-parametric fitting, and more.

Various kernels and likelihoods can be chosen, and to test the quality of these it is useful to generate synthetic TS data for which the true underlying profile is known. These data can be generated by
$$y(\psi) = C_1\left(1 - \psi^\alpha\right)^\beta + C_2 \tanh\left(\frac{\psi_{\mathrm{ped}} - \psi}{w_{\mathrm{ped}}}\right), \tag{26}$$
where $C_1$ is the core profile height, ψ is the flux coordinate, α determines the core gradient, β determines the edge gradient, $C_2$ is the pedestal height, $\psi_{\mathrm{ped}}$ is the pedestal location, and $w_{\mathrm{ped}}$ is the pedestal width. This function can be used to generate L-mode or H-mode profiles by setting $C_2 = 0$ or $C_2 > 0$, respectively. After this, either homoskedastic (uniform) or heteroskedastic (varying with each data point) noise can be added to the generated data points, and a number of outliers can also be added.
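A minimal sketch of generating such synthetic data is given below; all parameter, noise, and outlier values are arbitrary illustrative choices rather than those of Ref. 21.

```python
import numpy as np

# Minimal sketch of the synthetic profile of Eq. (26) with added noise and outliers.
rng = np.random.default_rng(1)

def profile(psi, C1=1.0, alpha=2.0, beta=1.5, C2=0.3, psi_ped=0.95, w_ped=0.02):
    """H-mode-like profile; set C2=0 for an L-mode-like profile."""
    return C1 * (1.0 - psi**alpha)**beta + C2 * np.tanh((psi_ped - psi) / w_ped)

psi = np.linspace(0.0, 1.0, 60)
y_true = profile(psi)

# Heteroskedastic noise: error bars that vary from point to point.
sigma = 0.02 + 0.05 * rng.random(psi.size)
y_obs = y_true + sigma * rng.normal(size=psi.size)

# A few outliers, standing in for failed spectral fits.
idx = rng.choice(psi.size, size=3, replace=False)
y_obs[idx] += 0.5 * rng.choice([-1.0, 1.0], size=3)
```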

L-mode profiles are easily fit by GPR, but H-mode profiles require a more careful choice of the kernel since the correlation length scale varies as a function of ψ (i.e., the profile changes slowly in the core, with a sharp gradient in the edge). The standard squared-exponential kernel shown in Eq. (24) is not sufficient for such a case, and instead one must choose a non-stationary kernel, that is, a kernel that is not only a function of $r - r'$, as in Eq. (24), but that can also depend on r itself. One such kernel is the Gibbs kernel,24 which is a slightly modified version of the squared-exponential kernel. Another, which is what will be shown below, is the change-point kernel,25 which is a piecewise combination of multiple kernels (SEK or other) with smooth exponential transitions between them. For H-mode profiles, a three-region change-point kernel with a separate SEK for each region is sufficient, giving a separate length scale for each of the core, pedestal, and edge. The locations of the transition regions are hyperparameters that can be optimized concurrently with the length scales.

In addition to the multiple length scales, one must also consider the effect of outliers on the fitting process since the TS diagnostic is noisy and can contain failed spectral fits (resulting in outliers). With standard, basic fitting techniques, these outliers must be identified and removed from the dataset to prevent them from skewing the fit. With GPR, however, it is possible to use other methods that will automatically detect outliers during the hyperparameter optimization. One such method is the use of a student's t-distribution for the likelihood, rather than the standard Gaussian. The student's t-distribution is given by
$$P(x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\pi\nu}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu}\right)^{-(\nu + 1)/2}, \tag{27}$$
where ν is the degrees-of-freedom and Γ is the Gamma function. This distribution has the interesting and useful property of becoming Gaussian as $\nu \to \infty$, as shown in Fig. 4. A Gaussian likelihood assumes normally distributed errors, which is often correct, and the standard deviation of this Gaussian can be provided as input from the experimental measurement error estimation. However, if this error estimation is poor, then the resulting fit will suffer from potential over-fitting. For example, if the error estimation is set to zero, the fit will be constrained to go exactly through every data point. By instead using a student's t-distribution in the likelihood, another hyperparameter is added, the degrees-of-freedom, which allows the error distribution to become more or less heavy-tailed depending on the distribution of the errors on the data points. If no outliers are present, the degrees-of-freedom hyperparameter will naturally be optimized to a large value, and thus the likelihood approaches a Gaussian. However, when outliers are present (and their errors are poorly estimated), the heavy tails allow for the widening of the distribution without requiring an increase in the error estimation, and therefore the fit remains unaffected.
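A short numerical check of this limiting behavior, using SciPy's standard distributions rather than the fitting code described here, is

```python
import numpy as np
from scipy import stats

# Minimal sketch: the student's t-distribution of Eq. (27) approaches the
# normal distribution as the degrees-of-freedom nu grows.
x = np.linspace(-5.0, 5.0, 7)
for nu in (1, 4, 30, 1000):
    diff = np.max(np.abs(stats.t.pdf(x, df=nu) - stats.norm.pdf(x)))
    print(nu, diff)   # the difference shrinks as nu increases

# Heavy tails at small nu: a 5-sigma "outlier" is far more probable under
# the t-distribution than under the Gaussian.
print(stats.t.pdf(5.0, df=2) / stats.norm.pdf(5.0))
```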
FIG. 4.

The student's t-distribution has heavy tails when the degrees-of-freedom, ν, is small. It approaches a normal distribution as $\nu \to \infty$.


Figure 5 demonstrates this by showing the same dataset, containing outliers, fit with a Gaussian likelihood and then using the student's t-distribution likelihood. The fit is pulled away from the true profile by these outliers in the Gaussian likelihood case, while the student's t-distribution likelihood allows for these outliers to automatically be accounted for. In this case, the fit remains close to the true profile and the error estimation of the fit is also left unaffected. The plots in the figure show the mean, or expectation value, in bold, as well as confidence intervals of the curves. Again, uncertainty quantification is natural with a Bayesian framework.

FIG. 5.

The GPR fit with a standard Gaussian likelihood (a) is negatively affected by outliers. When using a student's t-distribution likelihood (b), the fit is better and the error smaller.


It is interesting to contrast this with a spline plus hyperbolic tangent fit, which is traditionally used to fit H-mode profiles and is a form of generalized least squares, as discussed above. For the case with the Gaussian likelihood [Fig. 5(a)], the outliers are kept in the fit and, as seen in the green curve, affect the quality of the fit. For these fits, six knots were used, with the positioning chosen to give the best fit and the width chosen manually to fit the data. As seen in Fig. 5(a), manually throwing away the outliers greatly improves the quality of the fit but requires additional manual intervention. In both cases, however, GPR gives a superior fit without any of the parameter tweaking required by these fits.

However, the use of a non-Gaussian likelihood comes at a computational cost. An analytic MAP estimate is no longer possible, and so a more expensive numerical technique must be used. This is a transition from an empirical Bayesian to a full Bayesian approach. The empirical Bayesian approach involves estimating the hyperparameters from the data, and if both the kernel and likelihood are squared exponentials (i.e., Gaussian), this is analytically tractable. The full Bayesian approach fixes the prior distribution before the data are observed and then explores the full hyperparameter space numerically, integrating over the hyperparameters. There are multiple common tools for exploring a multi-dimensional parameter space, including nested sampling, which is good for smaller-sized problems, and Markov chain Monte Carlo (MCMC) sampling. In the latter, which we use, the hyperparameter space is sampled in such a way that the density of the resulting samples represents the distribution of the hyperparameters. This technique can take more than an order of magnitude longer to compute the fit than the MAP estimate, so one should weigh the pros and cons of the non-Gaussian likelihood. MCMC can also be used in the case of a Gaussian likelihood, though one might ask why this would be done when the significantly less expensive MAP estimate is possible.
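As an illustration of the sampling idea, a minimal random-walk Metropolis sketch over a single hyperparameter (the SEK length scale) is shown below. It is a toy stand-in for the machinery in packages such as PyMC or inference-tools, with made-up data, a flat positive prior, and arbitrary proposal settings.

```python
import numpy as np

# Minimal Metropolis-Hastings sketch for sampling one GPR hyperparameter,
# the SEK length scale theta_l, under a Gaussian likelihood.
rng = np.random.default_rng(2)

r = np.linspace(0.0, 1.0, 12)
d = np.sin(2.0 * np.pi * r) + 0.1 * rng.normal(size=r.size)
R = 0.1**2 * np.eye(r.size)

def log_marginal_likelihood(theta_l, theta_v=1.0):
    """log p(d | theta): Gaussian marginal likelihood of the GP (up to a constant)."""
    V = theta_v * np.exp(-0.5 * (r[:, None] - r[None, :])**2 / theta_l**2)
    C = V + R
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * d @ np.linalg.solve(C, d) - 0.5 * logdet

def log_posterior(theta_l):
    if theta_l <= 0.0:
        return -np.inf                           # prior: theta_l must be positive
    return log_marginal_likelihood(theta_l)      # flat prior otherwise

samples, theta = [], 0.3
logp = log_posterior(theta)
for _ in range(5000):
    prop = theta + 0.05 * rng.normal()           # random-walk proposal
    logp_prop = log_posterior(prop)
    if np.log(rng.random()) < logp_prop - logp:  # accept/reject step
        theta, logp = prop, logp_prop
    samples.append(theta)

# The sample density approximates the hyperparameter posterior.
print(np.mean(samples[1000:]), np.std(samples[1000:]))
```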

Figure 6 shows the interesting result that even in this case, MCMC provides a significantly better (lower error) estimate than the MAP estimate. This is due to the underlying assumption in the MAP estimate that the hyperparameter distributions are Gaussian and thus that the expectation value is the maximum value. However, if the hyperparameter distributions are non-symmetric, then the MAP estimate and the expectation value diverge. MCMC provides not only a better fit but also insight into the qualitative features of the distribution. For example, in a statistical analysis of a large number of profiles, it was found that the length scale distributions of the pedestal region often contain two peaks, and thus the Gaussian assumption is poor. From this, it is evident that care should be taken in using MAP estimates. A good approach is to use MCMC initially to ensure accuracy and then analyze these results to determine whether a faster MAP estimate is appropriate.

FIG. 6.

The average root mean square error (RMSE) using three methods: empirical Bayesian with Gaussian likelihood, full Bayesian with Gaussian likelihood, and full Bayesian with student's t-distribution likelihood. Full Bayesian gives lower error in both cases when there are no outliers, and the student's t-distribution likelihood has a constant low error for up to ten outliers.


It is beyond the scope of this tutorial to fully explain MCMC in detail, but there are many excellent tutorials online. In particular, PyMC26 is a widely used Python package with excellent documentation for learning Bayesian modeling in general and MCMC in particular. Inference tools27 is a package developed for meeting the needs of fusion-specific problems and has Python notebooks and tutorials. The work presented here is based on unbaffeld,28 which uses inference tools.

Until now, the focus of this tutorial has not been specifically on plasma physics but rather on issues general to all sciences. Why should plasma physicists especially learn Bayesian statistics? The primary reason is that plasma physics encompasses a wide variety of data types:

  • Local measurements: Data measurement is highly localized; e.g., Langmuir probes and Thomson scattering.

  • Line average measurements: The data come from spectroscopic measurements and result in line-averaged quantities; e.g., soft x-ray, interferometry.

  • Global measurements: The data give the result of an area or volume averaging; e.g., magnetic probes and the diamagnetic loop.

For each of these measurements, theoretical understanding is essential to interpret the relationship between the observed data and the key quantities of interest, particularly electron density and temperature. Moreover, plasma physics stands out due to the complexity of its data analysis.

Using theory to deduce quantities of interest from data has traditionally been cast as an inverse problem. This concept is illustrated with two examples in Fig. 7. In the case of Langmuir probes, the voltage is adjusted, and the resultant current is measured. The theoretical model, known as the forward problem, provides the relationship between voltage and current as functions of density and temperature. Although not the most efficient approach, because the forward problem is a single equation, one method to solve this involves making initial guesses for density and temperature, computing the predicted current from the forward problem, comparing it with the experimental data, and iteratively refining the guesses until the results converge.

FIG. 7.

Inferring knowledge from data where there is a physics equation constraining the data is traditionally referred to as an inverse problem. (a) A simple example in plasma physics is determining the density and temperature from a Langmuir probe based on the measured current and voltage. (b) A more complex example is that of equilibrium reconstruction in tokamaks.


Inverse problems often rely on complex theories and sophisticated numerical methods. A notable example is equilibrium reconstructions in tokamaks, as described in Ref. 22. In this case, the poloidal flux, pressure, and toroidal flux function throughout the tokamak's volume are deduced from magnetic measurements, although other types of data can also be incorporated, as we will discuss later. An iterative process is typically employed to minimize the discrepancy between the synthetic (or model-generated) data and the actual observations. This iterative process requires the solution of the Grad–Shafranov equation and the calculation of the synthetic diagnostic at each iteration.

Beyond equilibrium reconstruction, other analysis workflows also rely on inverse problem solving, such as the TRANSP29 code for tokamak data analysis and the LASNEX30 code for inertial confinement studies. The application of inverse methods extends beyond plasma physics, with seismic analysis and medical imaging techniques like magnetic resonance imaging (MRI) representing prominent examples in broader scientific fields. These applications underscore the widespread relevance and critical importance of inverse problem-solving across various disciplines.

Generalizing inverse problems into a Bayesian framework is natural. For instance, in the context of a Langmuir probe, the objective is to determine the posterior probability of electron density and temperature based on the measured current and voltage, $p(n_e, T_e\,|\,I_m, V_m)$. Similarly, for equilibrium reconstruction in tokamaks using only magnetic data, the aim is to ascertain the posterior distribution of poloidal flux, pressure, and toroidal flux function given the magnetic measurements, $p(\psi, p, F\,|\,D_{\mathrm{mag}})$.

As discussed in Secs. II and III, the likelihood probability distribution is the distribution of the error arising from the data mismatch. The data mismatch for the inverse problems is the same as the criterion used for convergence in solving the inverse problem. Therefore, the likelihood distribution for inverse problems typically takes the following general form:
$$p(d|X) = \exp\left[-\frac{1}{2}\left(d - d_{\mathrm{syn}}\right)^T R^{-1}\left(d - d_{\mathrm{syn}}\right)\right], \tag{28}$$
where $d_{\mathrm{syn}}$ refers to the synthetic diagnostic based on the forward problem. In this context, the likelihood distribution is also known as the predictive distribution because the data model comes from a prediction from the forward problem.
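A minimal sketch of such a likelihood is given below. The forward model (chord integrals through an assumed density profile shape) and all parameter names are hypothetical stand-ins for a real synthetic diagnostic, not a model taken from the references.

```python
import numpy as np

# Minimal sketch of the inverse-problem likelihood of Eq. (28).  Here the
# unknowns X = (n0, p) parameterize a hypothetical profile n(r) = n0*(1 - r^2)^p,
# and the "diagnostic" measures integrals along horizontal chords.
def forward_model(X, chords, npts=200):
    """Hypothetical synthetic diagnostic: chord integrals of the assumed profile."""
    n0, p = X
    x = np.linspace(-1.0, 1.0, npts)
    d_syn = []
    for c in chords:
        r2 = x**2 + c**2                               # squared radius along the chord
        d_syn.append(np.trapz(n0 * np.clip(1.0 - r2, 0.0, None)**p, x))
    return np.array(d_syn)

def log_likelihood(X, d, chords, sigma):
    """log of Eq. (28) with a diagonal R = diag(sigma**2)."""
    mismatch = d - forward_model(X, chords)            # data mismatch
    return -0.5 * np.sum(mismatch**2 / sigma**2)

# Usage: score how well a candidate (n0, p) explains measured chord data.
chords = np.array([0.0, 0.3, 0.6])
d_meas = forward_model((1.0, 2.0), chords) + 0.01 * np.random.default_rng(3).normal(size=3)
print(log_likelihood((1.0, 2.0), d_meas, chords, sigma=0.01))
```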

Within the plasma physics community, the clear leader in developing Bayesian techniques has been the European magnetic fusion community. Two prescient references are a 2003 paper by Fischer,31 which discusses integrating multiple diagnostics, and Svensson,32 which discusses multiple diagnostics with consistency of equilibrium reconstruction. These papers offer two key insights:

  • Inverse problems are ubiquitous in plasma physics and best done with a Bayesian approach.

  • Instead of solving inverse problems individually, they should be solved simultaneously in what Fischer terms integrated data analysis.33 

Although both the LASNEX and TRANSP codes performed a type of integrated data analysis long before this term was coined, IDA generally refers to the idea of treating plasma data analysis as one big inverse problem: from raw diagnostic signals to quantities of interest. Two software frameworks, Minerva34 and IDA,35 have been developed by the European community to address the practical aspects of performing integrated data analysis.

At this stage of the tutorial, the benefits of the integrated data analysis (IDA) method should be clear to the reader. However, the question arises: if IDA is so beneficial, why has it not seen wider adoption? The main obstacles lie in its complexity and the high computational costs involved. To better understand the advantages and drawbacks of the IDA approach, two state-of-the-art recent studies from the literature are discussed next.

In this section, an example of integrated data analysis is shown based on two papers by Kwak et al.23,36 In Ref. 23, the inverse problems associated with far infrared interferometry (FIR) and Thomson scattering (TS) signals are solved simultaneously.

Thomson scattering is a popular diagnostic because it gives localized electron density and temperature measurements. While the temperature can be determined from the physics of Thomson scattering itself, the density requires calibration. This is because the number of detected photons is proportional to the number of electrons that scatter them, but there is nothing to automatically determine that proportionality constant. Far infrared interferometry can also measure electron density. However, it is a line average measurement and thus requires a tomographic inversion to calculate a profile, as mentioned in Sec. III.

Combining the two methods then involves two steps: (1) determining the calibration constant for the density as measured by TS and (2) inferring the electron density and temperature profiles. This gives a more accurate approach than manually calibrating the density periodically because the calibration occurs at precisely the same time as the measurement is taken. Often, the number of detected photons can vary because of deposits on the lenses or because of variations in the detectors themselves. Combining the two diagnostics gives a measurement more accurate than traditional methods, with the additional benefits of being more self-consistent and robust and less labor intensive, since manual TS calibration is not needed.

In this case study, the equilibrium configuration was assumed to be static, leading to potential discrepancies between the pressure gradients inferred from the diagnostics and those derived from equilibrium reconstructions. This issue is addressed in Ref. 36, where a more comprehensive Bayesian framework is introduced. This framework incorporates data from a variety of diagnostics, including magnetic probes, polarimeters, interferometers, Thomson scattering, and lithium beam spectroscopy. The approach facilitates the generation of equilibria that are consistent across all diagnostic inputs and also allows for the determination of the full range of possible equilibria.

Quantifying the full uncertainty in simulations, such as those run by magnetohydrodynamic and gyrokinetic codes, presents a challenge. The primary source of uncertainty often stems from variations in the equilibrium, which must be aligned with experimental diagnostics. An example of such uncertainty quantification is provided by Ref. 37, where the peeling–ballooning stability boundary is assessed by varying the equilibrium through two parameters. However, developing more generalized methods for this kind of analysis has been difficult. This work represents a pioneering effort in deriving such a distribution from fundamental principles.

Instead of directly solving the Grad–Shafranov equation, the equilibrium code in Ref. 36 employs a current-carrying beam model. This model, while simplified, offers the advantage of computational speed, crucial for a Bayesian analysis that requires numerous iterations to produce a full distribution. As computational demands increase, machine learning has emerged as a potent tool for creating these reduced-order, or surrogate, models, which is explored further in the subsequent discussion.

Machine learning has many deep connections to Bayesian statistics. When viewed as a method of nonlinear curve fitting, the machine learning community was a pioneer17 in the use of Gaussian processes, as discussed in Sec. III. Statistics are key when judging the quality of a given model, and any modern statistical method has Bayesian statistics as its basis. Here, the issues of machine learning in inverse problems are discussed, including its applications to equilibrium reconstruction as a continuation of Sec. IV A. A recent review of the role of Bayesian statistics and machine learning in nuclear fusion may be found in Ref. 38.

There are two methods for using machine learning for inverse problems. The first is to use a surrogate for the likelihood in a way similar to the current-carrying beam model in Ref. 36. However, if there is already a method for solving the inverse problem, then another approach is to use that method to train on the posterior. The disadvantage of this approach is that one does not obtain the self-consistency inherent in the integrated data analysis approach; however, this method is generally faster and easier to implement while still providing an approximation of the posterior distribution. In addition, it can provide insight into the strengths and weaknesses of using machine learning as a surrogate model for the Grad–Shafranov equation.

EFIT solves the inverse problem via a least squares minimization. Without going into too much detail, and referring to the discussion in Sec. III, least squares fitting can be viewed as a linear curve fit obtained by calculating the maximum a posteriori estimate using a uniform prior. Thus, EFIT's solution of the inverse reconstruction process can be viewed as the calculation $\max\left[p(\psi, p, F\,|\,D_{\mathrm{mag}})\right]$. The max converts the distribution into a function that is then suitable for finding a neural network surrogate. That is, here we consider $y = f(d)$, where $y = (\psi, J_{\mathrm{tor}})$ and d is the magnetics data, as the function to be fit with a neural net model.

Most machine-learning-based surrogate models are developed using standard neural architectures or ad hoc network choices. This includes similar work on EFIT-based neural nets.39–41 While versatile, such models may not be specifically tuned to a problem's unique aspects. They often miss out on quantifying prediction uncertainties, which is crucial for reliable decision-making and risk assessment. Here, a technique known as neural architecture search (NAS) is used to obtain models with improved prediction accuracy. Incorporating uncertainty quantification provides a more complete understanding of the predictions, combining accuracy with confidence measures. NAS is a technique that effectively optimizes both the architecture (e.g., recurrent neural network, convolutional neural network, multi-layer perceptron) and the model hyperparameters (e.g., number of layers, weights). This involves two levels of optimization: the outer level for architecture parameters and the inner level for model parameters, based on the chosen architecture. This approach ensures a thorough exploration of both sets of parameters.

Unique to our NAS approach is the method by which each model is optimized.

Although this will be discussed in Ref. 42, we present the salient features here. For each model that is trained, the following loss function is used:
$$l(x, y; \theta) = \frac{1}{2}\log\sigma_\theta^2(d) + \frac{\left(y - \mu_\theta(d)\right)^2}{2\sigma_\theta^2(d)}, \tag{29}$$
where d are the data inputs, y are the outputs (ψ and $J_{\mathrm{tor}}$), θ are the hyperparameters, and $\mu_\theta$ and $\sigma_\theta^2$ are the mean and variance of the model. That is, as the model is trained, it learns the mean and variance of its prediction.
After the NAS is complete, the resulting models are ranked and the best are combined into an ensemble; in the work shown here, the three best models are kept. It can be shown that the mean and variance of an ensemble of K models can be written as follows:
$$\mu_E = \frac{1}{K}\sum \mu_\theta, \tag{30}$$
$$\sigma_E^2 = \underbrace{\frac{1}{K}\sum \sigma_\theta^2}_{\text{Aleatoric Uncertainty}} + \underbrace{\frac{1}{K-1}\sum\left(\mu_\theta - \mu_E\right)^2}_{\text{Epistemic Uncertainty}}, \tag{31}$$
where the aleatoric uncertainty refers to the inherent noise in the model, and the epistemic uncertainty refers to the uncertainty of the model. Our ensemble thus gives three predictions of ψ and Jtor as given by the mean, the aleatoric uncertainty, and the epistemic uncertainty. The result of this effort is a prediction that is more robust and accurate and also gives insight into its prediction.
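A minimal sketch of the loss of Eq. (29) and the ensemble aggregation of Eqs. (30) and (31), with made-up per-model predictions for a single output, is

```python
import numpy as np

# Minimal sketch of the Gaussian negative-log-likelihood loss of Eq. (29) and
# the ensemble aggregation of Eqs. (30) and (31).
def nll_loss(y, mu, var):
    """Per-sample loss of Eq. (29): the model learns both a mean and a variance."""
    return 0.5 * np.log(var) + (y - mu)**2 / (2.0 * var)

# Illustrative (mean, variance) predictions from K = 3 models for the same input.
mu_k = np.array([0.95, 1.00, 1.02])
var_k = np.array([1e-4, 2e-4, 1.5e-4])

mu_E = np.mean(mu_k)                                     # Eq. (30)
aleatoric = np.mean(var_k)                               # inherent noise
epistemic = np.sum((mu_k - mu_E)**2) / (mu_k.size - 1)   # model disagreement
var_E = aleatoric + epistemic                            # Eq. (31)

print(mu_E, aleatoric, epistemic, var_E)
```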
The neural net was trained on approximately 150 000 equilibria, validated on approximately 19 000, and then tested on another 19 000 equilibria. The data come from a curated database from the 2019 DIII-D experimental campaign. To quantify the results of the neural net, the $R^2$ statistical measure is used. It is defined as
$$R^2\left(y_{\mathrm{True}}, y_{\mathrm{NN}}\right) = 1 - \frac{\sum_i \left(y_{\mathrm{True},i} - y_{\mathrm{NN},i}\right)^2}{\sum_i \left(y_{\mathrm{True},i} - \bar{y}_{\mathrm{True}}\right)^2}, \tag{32}$$
such that $R^2 = 1$ indicates that the neural net gives a perfect prediction. The histogram of $R^2$ for the test cases is shown in Fig. 8. The accuracy of the neural net prediction is excellent, as more than 99% of the equilibria have $R^2 > 0.995$. A log plot is needed to see the distribution accurately. The median case from the distribution is shown in Fig. 9(a). The ψ contours are visually identical, the aleatoric uncertainty is on the order of $10^{-4}$, and the epistemic uncertainty is on the order of $10^{-5}$. That is, not only is the flux accurately predicted, but the uncertainty indicates that it is accurate. The worst case in the distribution, with $R^2 = 0.932$, is shown in Fig. 9(b). The contours are visually different, although the overall contour shapes are accurate. The aleatoric uncertainty is on the order of $10^{-2}$ and the epistemic uncertainty is on the order of $10^{-3}$. This gives an example of what is desired of a surrogate model: robust and accurate predictions of the desired quantity, along with a self-prediction of whether the desired accuracy has been reached.
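A minimal sketch of the $R^2$ measure of Eq. (32), with made-up arrays standing in for the true and predicted fields, is

```python
import numpy as np

# Minimal sketch of Eq. (32) for a flattened field such as psi on the grid.
def r_squared(y_true, y_nn):
    ss_res = np.sum((y_true - y_nn)**2)
    ss_tot = np.sum((y_true - np.mean(y_true))**2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(4)
y_true = rng.normal(size=1000)
y_nn = y_true + 0.05 * rng.normal(size=1000)   # a near-perfect "prediction"
print(r_squared(y_true, y_nn))                 # close to 1
```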
FIG. 8.

The distribution of $R^2$ values shows high accuracy, given the over 10 000 equilibria in the highest bin. The worst case is at $R^2 = 0.932$ and is an outlier compared to the overall distribution. The accuracy of the predicted flux surfaces is still good.

FIG. 9.

(a) The median case shows that the flux contours are well predicted by the neural network with low uncertainties. (b) The worst case shows visible differences in the contours, with uncertainties roughly two orders of magnitude higher (though still low).


In the future, a true Bayesian framework will ideally include the self-consistent uncertainty quantification of the diagnostic errors as in Ref. 36 as well as the prediction errors of a fast surrogate. Such a path forward would enable complete uncertainty quantification for the complexity inherent in plasma physics.

Bayesian statistics is powerful and broadly applicable. It is not a prescription, but rather a framework for, or way of thinking about, approaching problems. Its main advantages are that (1) everything is a probability, (2) uncertainty quantification is natural, and (3) it encourages being explicit in background information and assumptions. Given the complexity of plasma physics in both models and data acquisition, it is a natural approach.

For inferring quantities from data, the key pieces of Bayes's theorem are as follows:

  • Likelihood distribution, $P(D|X)$: This distribution contains the data mismatch. It is the distribution through which the data tell us how likely a given value of the unknowns is. It has embedded within it a data model that, in inverse problems, comes from a potentially complex forward model. Because the data model is a type of prediction, this is also known as the predictive distribution.

  • Prior distribution, P(X): This distribution allows us to use our prior knowledge and background information to weight the posterior distribution toward a particular answer. Common priors are Laplace's choice (a uniform distribution, which gives no weight) and Jaynes's choice (a normal distribution that can be viewed as weighting the posterior distribution toward maximum entropy). In GPR, for example, it is the prior that provides the continuity of the fit.

  • Posterior distribution, $P(X|D)$: This is the desired distribution: the probability of our inferred quantity given the data.

One of the increasing uses of Bayesian statistics is the use of Gaussian process regression (GPR) for curve fitting. Within the context of curve fitting, it has the advantage of robustness by more easily being able to prevent under- and over-fitting. It also gives uncertainty quantification by having the variance of the curves built into its formulation.

Integrated data analysis is a Bayesian approach for accurately inferring data from the multiple, heterogeneous diagnostics required for understanding plasma experiments. Each diagnostic is an inverse problem, and treating it all as a single inverse problem gives many advantages, including consistency, accuracy, and uncertainty quantification. This approach can also be generalized to include more advanced analysis, such as equilibrium reconstruction in tokamaks. While this example was chosen because it is arguably the most advanced example of integrated data analysis, it should be realized that this problem is a general problem for all of plasma physics. A major disadvantage of the integrated data analysis approach is the increased computation time required to sample the entire distribution space. However, modern computer hardware and mathematical techniques have made this more attractive, especially the development of machine learning.

The use of Bayesian statistics and machine learning has a long history. If Bayesian inference is the simplest mathematical model of human inference, and the goal of artificial intelligence is to mimic the human mind, the relationship is obvious. More specifically, if machine learning is viewed as a type of nonlinear curve fitting whereby a set of inputs provides a set of outputs, then Gaussian processes can also be used to provide robust fitting. More recently, the use of ensemble averages has been shown to give a robust inference of the posterior distribution directly. In the future, a completely Bayesian technique that includes uncertainty of the measurements as well as uncertainties in the surrogate model offers an attractive method for uncertainty quantification of complex data analysis like that found in plasma physics.

The authors thank Professor James Hanson for his strong advocacy of Bayesian statistics. The authors also thank Dr. Severin Denk for his educational conversations on integrated data analysis. Dr. Fenton Glass explained the many issues with Thomson scattering, and Dr. Ted Strait explained the subtleties of magnetic diagnostics. Both helped us understand many of the issues in inferring knowledge from diagnostics. The simple example used in this tutorial is based on work by Professor R. P. Dwight of the Delft University of Technology. This material is based upon work supported by the U.S. Department of Energy, Office of Fusion Energy Science under Award Nos. DE-SC0021203, DE-SC0021380, DE-FC02-04ER54698, and DE-FG02-95ER54309. The data used in the machine learning are based upon work supported by the U.S. Department of Energy, Office of Science, Office of Fusion Energy Sciences, using the DIII-D National Fusion Facility, a DOE Office of Science user facility, under Award(s) DE-FC02-04ER54698. Disclaimer: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

The authors have no conflicts to disclose.

S. E. Kruger: Conceptualization (lead); Data curation (supporting); Methodology (supporting); Project administration (equal); Software (supporting); Supervision (equal); Validation (supporting); Writing—original draft (lead); Writing—review & editing (lead). S. P. Smith: Data curation (equal); Project administration (supporting); Supervision (supporting). X. Sun: Methodology (supporting); Software (supporting); Writing—review & editing (supporting). A. Samaddar: Methodology (supporting); Writing—review & editing (equal). A.-Y. Pankin: Data curation (supporting); Software (supporting). J. Leddy: Conceptualization (equal); Methodology (equal); Software (equal); Writing—original draft (supporting). E. C. Howell: Conceptualization (equal); Methodology (equal); Writing—review & editing (supporting). S. Madireddy: Conceptualization (equal); Methodology (equal); Software (equal); Writing—review & editing (supporting). C. Akcay: Software (equal); Writing—original draft (supporting); Writing—review & editing (supporting). T. Bechtel Amara: Data curation (lead); Software (equal); Writing—review & editing (supporting). J. McClenaghan: Data curation (supporting); Software (supporting); Writing—review & editing (supporting). L. L. Lao: Funding acquisition (lead); Project administration (lead); Writing—review & editing (equal). D. Orozco: Data curation (supporting); Validation (supporting).

The data used in the training of the machine learning model will be made available in the future, following FAIR practices, after an article on the data is published.

1. F. Chen, Introduction to Plasma Physics and Controlled Fusion, 2nd ed. (Plenum Press, New York, NY, 1984).
2. E. V. Shun'ko, Langmuir Probe in Theory and Practice (Universal-Publishers, 2009).
3. T. Bayes, "LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S. communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S.," Philos. Trans. R. Soc. 53, 370–418 (1763).
4. P. S. Marquis de Laplace, Théorie analytique des probabilités (Courcier, 1820), Vol. 7.
5. Sections 2.6 and 4.6.1 in Ref. 8 briefly discuss this history and its implications for how probability theory is learned.
6. R. A. Fisher, Statistical Methods for Research Workers, 5 (Oliver and Boyd, 1928).
7. H. Jeffreys, Scientific Inference (Cambridge University Press, 1973).
8. E. T. Jaynes, Probability Theory: The Logic of Science (Cambridge University Press, 2003).
9. Jaynes, a physicist instrumental in developing Bayesian statistics and well known for linking statistical mechanics to information theory, authored a posthumously published book that is particularly recommended for its derivation of Bayes's theorem.
10. D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial (OUP Oxford, 2006).
11. A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation (SIAM, 2005).
12. For non-U.S. readers, this is 145 kg.
13. H. Jeffreys, "An invariant form for the prior probability in estimation problems," Proc. R. Soc. London A 186, 453–461 (1946).
14. C. D. Fryar, M. D. Carroll, Q. Gu, J. Afful, and C. L. Ogden, Vital and Health Statistics, Series 3, Analytical and Epidemiological Studies, No. 46 (U.S. Dept. of Health and Human Services, Public Health Service, National Center for Health Statistics, 2021).
15. R. Kalman, J. Basic Eng. 82, 35 (1960).
16. S. Särkkä and L. Svensson, Bayesian Filtering and Smoothing (Cambridge University Press, 2023), Vol. 17.
17. C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning (Springer, 2006), Vol. 1.
18. J. Svensson, Non-Parametric Tomography Using Gaussian Processes (EFDA, 2011).
19. M. Chilenski, M. Greenwald, Y. Marzouk, N. Howard, A. White, J. Rice, and J. Walk, Nucl. Fusion 55, 023012 (2015).
20. C. Michoski, T. A. Oliver, D. R. Hatch, A. Diallo, M. Kotschenreuther, D. Eldon, M. Waller, R. Groebner, and A. O. Nelson, Nucl. Fusion 64, 035001 (2024).
21. J. Leddy, S. Madireddy, E. Howell, and S. Kruger, Plasma Phys. Controlled Fusion 64, 104005 (2022).
22. L. Lao, H. St. John, R. Stambaugh, A. Kellman, and W. Pfeiffer, Nucl. Fusion 25, 1611 (1985).
23. S. Kwak, J. Svensson, S. Bozhenkov, J. Flanagan, M. Kempenaars, A. Boboc, Y.-C. Ghim, and JET Contributors, Nucl. Fusion 60, 046009 (2020).
24. M. N. Gibbs, "Bayesian Gaussian processes for regression and classification," Ph.D. thesis, Citeseer, 1998.
25. Y. Saatçi, R. Turner, and C. E. Rasmussen, in Proceedings of the 27th International Conference on Machine Learning, ICML'10 (Omnipress, Madison, WI, USA, 2010), pp. 927–934.
26. O. Abril-Pla, V. Andreani, C. Carroll, L. Dong, C. J. Fonnesbeck, M. Kochurov, R. Kumar, J. Lao, C. C. Luhmann, O. A. Martin et al., PeerJ Comput. Sci. 9, e1516 (2023).
27. C. Bowman, P. Hill, and J. Leddy, "C-bowman/inference-tools: 0.13.0 release," 2023.
28. J. Leddy, S. Kruger, E. Howell, and The EFIT-AI Team, "unbaffeld: Unified Bayesian analysis framework for fusion experimental data," 2021.
29. R. Hawryluk, in Physics of Plasmas Close to Thermonuclear Conditions (Elsevier, 1981), pp. 19–46.
30. J. Nuckolls, L. Wood, A. Thiessen, and G. Zimmerman, "Laser compression of matter to super-high densities: Thermonuclear (CTR) applications," Nature 239, 139–142 (1972).
31. R. Fischer, A. Dinklage, and E. Pasch, Plasma Phys. Controlled Fusion 45, 1095 (2003).
32. J. Svensson, A. Dinklage, J. Geiger, A. Werner, and R. Fischer, Rev. Sci. Instrum. 75, 4219 (2004).
33. This terminology has gained currency within the U.S. fusion community, but we note that it is easily confused with the IDA software framework of the same name. The term also does not distinguish the Bayesian approach from traditional integrated data analysis approaches, such as that provided, for example, by TRANSP.29
34. J. Svensson and A. Werner, in 2007 IEEE International Symposium on Intelligent Signal Processing (IEEE, 2007), pp. 1–6.
35. R. Fischer, C. Fuchs, B. Kurzan, W. Suttrop, E. Wolfrum, and ASDEX Upgrade Team, Fusion Sci. Technol. 58, 675 (2010).
36. S. Kwak, J. Svensson, O. Ford, L. Appel, Y.-C. Ghim, and JET Contributors, Nucl. Fusion 62, 126069 (2022).
37. P. Snyder, K. Burrell, H. Wilson, M. Chu, M. Fenstermacher, A. Leonard, R. Moyer, T. Osborne, M. Umansky, W. West, and X. Xu, Nucl. Fusion 47, 961 (2007).
38. A. Pavone, A. Merlo, S. Kwak, and J. Svensson, Plasma Phys. Controlled Fusion 65, 053001 (2023).
39. B. P. van Milligen, V. Tribaldos, and J. A. Jiménez, Phys. Rev. Lett. 75, 3594 (1995).
40. S. Joung, J. Kim, S. Kwak, J. Bak, S. Lee, H. Han, H. Kim, G. Lee, D. Kwon, and Y.-C. Ghim, Nucl. Fusion 60, 016034 (2019).
41. S. Joung, Y.-C. Ghim, J. Kim, S. Kwak, D. Kwon, C. Sung, D. Kim, H.-S. Kim, J. G. Bak, and S. W. Yoon, Sci. Rep. 13, 15799 (2023).
42. C. Akcay, S. Madireddy et al., "EFIT-Prime: Probabilistic and physics-constrained reduced-order neural network model for equilibrium reconstruction in DIII-D," Phys. Plasmas (submitted, 2024).