Bayesian analysis enables flexible and rigorous definition of statistical model assumptions with well-characterized propagation of uncertainties and resulting inferences for single-shot, repeated, or even cross-platform data. This approach has a strong history of application to a variety of problems in physical sciences ranging from inference of particle mass from multi-source high-energy particle data to analysis of black-hole characteristics from gravitational wave observations. The recent adoption of Bayesian statistics for analysis and design of high-energy density physics (HEDP) and inertial confinement fusion (ICF) experiments has provided invaluable gains in expert understanding and experiment performance. In this Review, we discuss the basic theory and practical application of the Bayesian statistics framework. We highlight a variety of studies from the HEDP and ICF literature, demonstrating the power of this technique. Due to the computational complexity of multi-physics models needed to analyze HEDP and ICF experiments, Bayesian inference is often not computationally tractable. Two sections are devoted to a review of statistical approximations, efficient inference algorithms, and data-driven methods, such as deep-learning and dimensionality reduction, which play a significant role in enabling use of the Bayesian framework. We provide additional discussion of various applications of Bayesian and machine learning methods that appear to be sparse in the HEDP and ICF literature constituting possible next steps for the community. We conclude by highlighting community needs, the resolution of which will improve trust in data-driven methods that have proven critical for accelerating the design and discovery cycle in many application areas.

## I. INTRODUCTION

In recent years, there has been a tremendous increase in the development and application of the so-called *advanced analysis* techniques to problems in Inertial Confinement Fusion (ICF)^{1–3} and High Energy Density Physics (HEDP).^{4} Experiments in this field are conducted using some of the most complex facilities on the planet with an array of diagnostics that stretch the limits of measurement in exceedingly harsh environments to study matter in the most extreme conditions produced on Earth. Facilities such as the National Ignition Facility (NIF)^{5} at Lawrence Livermore National Laboratory (LLNL), the Z Machine^{6,7} at Sandia National Laboratories (SNL), the OMEGA laser^{8} at the Laboratory for Laser Energetics (LLE), the Laser Mega-Joule (LMJ)^{9} at the Commissariat à l’Énergie Atomique (CEA), and the Linac Coherent Light Source (LCLS)^{10} at the National Stanford Linear Accelerator Center (SLAC) provide scientists with the ability to study dense fusion plasmas, planetary physics, astrophysical phenomena, material properties, shock physics, and more. The new generation of ultra-intense lasers such as those found throughout the international Extreme Light Infrastructure (ELI),^{11} planned Zettawatt-Equivalent Ultrashort pulsed laser System (ZEUS)^{12} at the University of Michigan, and others provide access to unprecedented power and repetition rates to study high-field physics, radiation production, and more. Numerous university-scale laser and pulsed power facilities support these large efforts by training students and scientists and enabling additional studies. These facilities provide unique opportunities for studying matter in extreme conditions and unique challenges in collecting, analyzing, and interpreting data.

The reason for the recent growth in advanced analysis methodologies originates from the realization that few, if any, quantities-of-interest can be directly measured in experiments. This comes from two distinct but related realities of the field. First, HEDP and ICF experiments, by their very nature, operate in a transient regime where spatial and temporal gradients are rarely possible to avoid and often dominate the interpretation of our measurements. This means that all measurements we make are best described as weighted measurements, where the weight is provided by the messenger used to make the measurement (neutron emission from a fusion plasma, spatial and temporal character of an x-ray pulse used to perform Thomson scattering, etc.). The interpretation of such weighted quantities then requires that we use a physical model of the system that can account for gradients. The second realization is that many of the most important quantities we desire to understand from the experiment are not directly measurable at all, even in the sense established above. Instead, we must combine multiple measurements obtained from different instruments through a physical model to obtain a constraint on the quantity of interest.

*χ*, which characterizes how close a given fusion experiment is to the ideal ignition threshold,

^{13}

*χ*and the second line shows a simplified version applicable to a clean, 1D, isobaric plasma. The burning fuel, or hotspot, volume is

*V*

_{HS},

*ɛ*

_{α}is the energy of the emitted

*α*particle, and ⟨

*σv*⟩

_{DT}is the DT fusion reactivity. The key quantities of interest are the ion temperature

*T*, the hotspot pressure

*P*

_{HS}, and the energy confinement time

*τ*

_{E}. The ion temperature is directly measurable in the first sense above, where an instrument such as a neutron time of flight (nTOF) detector can measure a signal that is sensitive to a spatiotemporal integration of the ion temperature weighted by the local neutron emissivity. The pressure and confinement time are not directly measurable at all, and so we use proxy quantities, such as the full-width at half-maximum (FWHM) of the nuclear burn history, which are informed by simulation. Alternatively, simulation and theory can be used to recast quantities such as

*χ*in terms of quantities that are measurable, implicitly invoking a model assumption to constrain these relationships.

The picture above further assumes that we are able to extract reliable and useful features directly from our raw data and assign meaningful uncertainties to them. Unfortunately, for the HEDP and ICF experimentalist, our instruments exist in harsh environments where significant noise, background, and other artifacts are present. Perhaps more unique to these experiments is the fact that our instruments are routinely moved, modified, and otherwise adjusted. This makes common tasks such as calibrations and flat-fielding difficult and often impractical. The ability to extract important information directly from noisy data and quantify our confidence in such a process is the crucial first step in our ability to use these data to constrain models and improve our understanding. Furthermore, the process of reducing data is labor intensive and often requires user input, which can compromise reproducibility. The repetitive and laborious nature of data reduction, which often requires defining and removing backgrounds from signals and images, calibrating spectra, etc., often means that significant quantities of data remain unprocessed and, subsequently, unused.

It is with these challenges in mind that we define what we mean by the term *advanced analysis*. We seek methods to analyze experimental data that are robust to complex artifacts in the data, quantify uncertainties in both the data and inferred quantities, and enable the self-consistent integration of multiple sources of information to constrain quantities of interest, some of which may not be observable. Historically, the approach to this in HEDP and ICF has been to measure as many quantities directly as possible and compare these individual quantities to simulations of the experiment, looking for consistency. Alternatively, the measured quantities are often used as input to another equation to estimate a secondary important quantity. Unfortunately, this approach ignores shared information between different instruments and does not admit the inclusion of additional information if a new instrument provides constraining information. It also makes the propagation of uncertainties challenging and generally overestimates the resulting uncertainty.

The preferred method to circumvent the shortcomings of a more traditional approach relies on Bayesian inference. Bayesian methods provide a consistent statistical framework for the extraction of physical insights from experimental data. These methods are able to elegantly handle noisy and even incomplete data, providing a means to regularize ill-posed problems. While Bayesian methods have been popular in many fields, such as neuroscience,^{14–16} biology,^{17,18} astronomy,^{19–22} and others,^{23–25} the ICF and HEDP community has been slow to adopt them. However, the rapid acceleration of computational capabilities and the proliferation of open source, general use software packages for probabilistic programming have started to change this. In the last decade, a small, but growing number of applications have seen the use of Bayesian methods for data reduction, inference of experimental quantities, and calibration of complex physics models, which will be discussed in this Review.

Furthermore, the rise in popularity of Machine Learning (ML) methods in physical sciences has opened a wide array of new tools for use in the processing, reduction, and integration of experimental data. These tools offer the potential for automation of many common tasks, increasing the throughput of data so that vastly more experimental data can be used to inform models and falsify hypotheses than ever before.

We begin in Secs. II A–II D by reviewing the foundations of Bayesian inference. In Sec. II A, we explore the example of inferring the ion temperature from a neutron spectrum that allows different approaches and approximations to be explored. As a companion to illustrate the methodology, we provide a Jupyter notebook with all of the code necessary to run the problem and explore the different methods.^{26} This is followed by practical advice for the application of Bayesian methods to problems of interest. In Sec. II E, we discuss examples from the recent literature that have employed Bayesian inference to improve the quality of analysis over what was possible before or to access fundamentally new information. In Sec. III, we discuss the topic of Bayesian data assimilation and review relevant examples in the literature. This extension of Bayesian inference allows for the use of multiple disparate sources of diagnostic information to jointly constrain model parameters. In Sec. IV, we discuss the application of a variety of methods to accelerate the computationally expensive task of statistical inference. This section will introduce the key concepts of a variety of ML methods for producing efficient surrogate models that can be employed and a variety of statistical approximations that can be used to improve efficiency with examples from the literature. In Sec. V, we discuss methods for quantifying and dealing with noise and background, extracting useful reduced features from raw data, and quantifying uncertainty from these processes. We highlight multiple examples from the literature where these techniques have been successfully applied. In Sec. VI, we discuss ways to extend the demonstrated use of these techniques to improve our ability to integrate experiment and theory, test hypotheses, and intelligently guide the design of experiments. In Sec. VII, we discuss actions that could be taken to accelerate the adoption of these methods across the ICF and HEDP community. Finally, in Sec. VIII, we summarize the current state of advanced data analysis in ICF and HEDP and the outlook for the future.

The methods and examples we chose to highlight are generally more mature and advanced outside of the field of ICF and HEDP. We do not consider this Review to be exhaustive, and as such, the interested reader will need to supplement the material herein with textbooks, reviews, and articles outside of this domain. We make no attempt to be comprehensive in this regard, but do provide some useful references, and the cited articles are likely to hold a trove of valuable resources. We strongly encourage the reader to use the accompanying Jupyter notebook as a resource to aid in the development of intuition and applicability of the methods discussed. We hope that this Review provides the interested reader with a sense of the power and potential of these methods and enough information to start them on their journey.

## II. BAYESIAN INFERENCE OVERVIEW

Bayes’ theorem provides experimental scientists and data analysts with a framework that allows for the solution of non-trivial, and often ill-posed, inverse problems with rigorously defined uncertainties by casting the problem in probabilistic terms. Bayes’ theorem specifies the relationship between the probability of a hypothesis being correct given an observation and the probability of observing a particular outcome given the hypothesis. Therefore, the Bayesian framework provides us with a way to transform an inverse problem to a forward problem, reducing the complexity of the analysis and improving the robustness of the solution. Many texts have been written on this subject,^{27–30} to which we refer the reader for a more comprehensive discussion of Bayesian statistics, their implications, and their use in a range of disciplines.

**, with associated uncertainties**

*D***, and a model describing the system of interest**

*σ**f*(

**), where**

*θ***is a vector of model parameters, we wish to find**

*θ***that best describes the observations. Bayes’ theorem provides us with a self-consistent framework with which to solve this problem. To see this, we first write Bayes’ theorem,**

*θ**likelihood*, which describes the probability of a particular observation

**D**given a specific set of model parameters

**and any background information we have regarding the system of interest**

*θ**I*. This background information can be viewed as physical constraints, assumptions made in the formulation of the model, or even previous observational or theoretical evidence that informs our task. The mathematical form of the likelihood depends on the type of observation being considered. The second term in the numerator is known as the

*prior*probability, which encodes any information or belief we have about the model parameters

**prior to seeing the observations conditioned on our background information**

*θ**I*. Hard constraints or unphysical values can be excluded from the solution using a prior distribution with bounds. Additionally, priors can be used to provide stronger guidance for the solution by for example, imposing correlations on the model parameters. Generally, it is preferred to use maximally uninformative priors so that they do not unduly bias the inference. We will explore the topic of selecting priors in more detail in Sec. II D. The denominator in Eq. (3) is known as the

*evidence*and is a normalization constant for the posterior. As we can see, it does not depend on the model parameters at all and therefore is often omitted when one is only interested in estimating the parameters of a given model. However, this term becomes essential when one wishes to evaluate which model among a library of competing models best matches the available data. Finally, the left-hand side (LHS) is known as the

*posterior*probability, which encodes the probability distribution of model parameters

**, given the observed data**

*θ***D**and our background information

*I*. This is the quantity of interest we seek when performing Bayesian analysis.

In what follows, we will use an example motivated by ICF research to elaborate on these concepts and the practical implementation of Bayesian inference.

### A. An example

An important task in ICF research is the measurement of the ion temperature. This is because in a thermonuclear plasma, the temperature of the ions uniquely determines the fusion reactivity. Several instruments have been devised to inform this parameter; here, we will focus on a simplified process where we have access to the neutron energy spectrum directly. This is akin to a simplified version of the Magnetic Recoil Spectrometer (MRS)^{31,32} fielded on the NIF and OMEGA 60. The MRS works by converting neutrons to knock-on protons or deuterons by interacting the incident neutrons with a CH or CD foil. The knock-on charged particles are dispersed using a magnet system where they are detected by CR-39 foils. The energy of the incident particles is determined by the position on CR-39. Myriad effects contribute to blurring of the spectrum, determining the spectrometer resolution, as well as the absolute calibration. In what follows, we will ignore these effects for simplicity and treat the measurement as a series of counts collected in known energy bins. We simulate the counts as a Poisson process with a model for the neutron spectrum determining the rate parameter (specifying average counts per unit time), including a fixed background rate.

To generate the test data, we make use of the relativistic neutron spectrum model from Ballabio.^{33} We do not include any downscatter, making this spectrum appropriate for application to exploding pushers, which have negligible areal density.^{34} We further assume that the neutrons are produced by a single temperature deuterium/tritium (DT) plasma. This is not necessary, and arbitrary temperature and density distributions can be used to generate data with ease; however, using a single temperature makes comparison of inferred values to the known true values less ambiguous. In order to apply Bayesian methods, we require uncertainties on the data. Because our data generation process is Poissonian, the expected variance in any data point is just the recorded counts, giving the square root of the variance $\sigma i=Di$, where *D*_{i} are the counts measured in each bin. We assume that the recorded counts are the sum of the data and a fixed background process. Figure 1 shows the data generated for this example in blue along with the true neutron spectrum in orange. We note that we have deliberately chosen a count rate that is low enough such that proper treatment of counting statistics is important in order to obtain useful results.

*θ*_{o}and the covariance matrix is

**Σ**=

**Λ**

^{−1}, which defines the width of the distribution, as well as linear correlations between the parameters.

**Λ**is often referred to as the precision matrix. As an example, the covariance matrix describing a multivariate normal distribution with three parameters has the form

*i*th parameter and Σ

_{ij}is the covariance between the

*i*th and

*j*th parameters. The matrix is symmetric such that Σ

_{ij}= Σ

_{ji}. This form is trivially extended to arbitrary dimension.

*θ*_{o}and

**Σ**. According to Bayes’ theorem, we must specify the likelihood and the priors. We define our priors to be unbounded uniform distributions. Accordingly, they show up as multiplicative constants. The likelihood is specified according to the measurement at hand. Our problem deals with counting particles on a series of detectors, and so a Poisson distribution would be most appropriate. However, if we assume a large number of counts, the Poisson distribution converges to a normal distribution, simplifying our task. The RHS of Eq. (3) is then given as

*D*

_{i},

*i*= 1…

*N*, being the measured data points,

*σ*

_{i}being the measurement uncertainty associated with each point, and $C\theta j$ being the constant scale factor for the prior on each parameter. The forward model used to generate the data for comparison is given by

*f*(

*E*

_{i}|

**), where**

*θ**E*

_{i}are the energies at each bin where data are collected. Because we are only concerned with a proportionality, we can drop any scale factors that do not depend explicitly on

**, allowing for a simplified expression. Additionally, the log of this expression is typically used for numerical convenience, giving**

*θ**χ*

^{2}expression often used in data analysis. In fact, a Bayesian approach to data analysis with the assumption of a normal likelihood and unbounded uniform priors is exactly equivalent to the familiar

*χ*

^{2}minimization procedure. This is an important point, as it illustrates that the concepts used in Bayesian inference are not really all that unfamiliar, but a generalization of the special case we are more comfortable with. Since the LHS and RHS of Eq. (3) are equivalent, finding the peak in our posterior amounts to finding the peak in

*L*, also known as the Maximum Likelihood Estimate (MLE) or alternatively the minimum of

*χ*

^{2}. More generally, with arbitrary prior and posterior forms, this is called the Maximum A Posteriori (MAP) solution. We can, however, do more than find the optimum, we can determine the covariance as well, which, under the stated assumptions, determines the uncertainties in our inference. We begin by expanding

*L*in a Taylor series about the optimum point

*θ*_{o},

*L*is the Hessian matrix. In other words, determination of the posterior has been cast as a simple optimization problem to obtain the optimum

*θ*_{o}, followed by the computation of the Hessian. This approximation where the posterior is assumed to be a multivariate normal distribution with mean located at the mode of the true distribution

*θ*_{o}and covariance determined by the Hessian computed at the mode is called the Laplace approximation.

^{35}Intuitively, the Hessian describes how narrow or broad

*L*is about the optimum. The terms along the diagonal determine the width of the distribution for each parameter. If we imagine the equal probability contours of a two-dimensional Gaussian distribution as a series of ellipses, we see that the principal axes of the ellipse need not be aligned with the coordinates of our model. The off-diagonal terms give the rotation, if any, of the distribution. In terms of distributions, this rotation means that two or more parameters are correlated with each other. This approximation is able to capture correlations up to first order, referred to as linear correlations. Higher order correlations imply curvature of the distribution and will violate the assumptions employed in this example, requiring more terms to be retained in the Taylor series expansion of

*L*.

*f*(

*E*

_{i}|

**) to make further progress. We have been able to proceed analytically up to this point; however, taking derivatives of**

*θ**L*analytically requires that

*f*be analytically differentiable. Fortunately, we can make one final assumption that allows this for the neutron spectrum. Physically, the peak location and width of the neutron spectrum are related to the ion temperature; however, the shift in the peak location with temperature is small and can be difficult to resolve, so we will ignore this dependence for the moment. Following the work of Brysk,

^{36}we can specify the neutron spectrum as a Gaussian distribution with variance proportional to the ion temperature,

*θ*

_{1}is a scale factor,

*θ*

_{2}is the ion temperature,

*θ*

_{3}is the central energy, and

*θ*

_{4}is an additive background term. The factor $w=2mnmn+m\alpha \u27e8E\u27e9$, where

*m*

_{n}is the neutron mass and

*m*

_{α}is the

*α*particle mass. Because the Brysk model assumes that the average neutron energy is independent of ion temperature, we fix it to ⟨

*E*⟩ = 14.021 MeV. This will introduce a bias in our estimate of the ion temperature, though we expect it to be small. Using the dataset from Fig. 1, we can use any common optimization package to solve for

*θ*_{o}by maximizing Eq. (7) with Eq. (10) substituted for

*f*(

*E*

_{i}|

**), from which we find that the best fit ion temperature is 4.5 keV and the best central energy is 14.04 MeV. The central energy is found almost exactly, while the ion temperature is not far from the true value of 4.75 keV, but we must determine the covariance and evaluate the posterior to see how much the data constrain our inference.**

*θ*Given the optimum found above, we now compute the Hessian to fully determine the posterior distribution. Because of the approximations we made in this example, this can be computed analytically using a symbolic mathematics software or an automatic differentiation software tool. Once we have the Hessian, we take its inverse to obtain the covariance matrix. The posterior obtained in this way is shown as the red contours in the plot in Fig. 2. This kind of plot is known as a corner plot, where each plot shows the pairwise distribution among two parameters. From this plot, we can visualize the mode of each parameter (marked with a black square and black solid lines), the width of the distribution, and any correlations present. We see that the ion temperature is positively correlated with both the amplitude and the energy shift. The constant offset is negatively correlated with both the ion temperature and amplitude.

Since the ion temperature is the only parameter of real interest in this example, we can use this distribution to determine the uncertainty of our estimate of 4.5 keV from the optimization above. The other parameters, while necessary to produce a fit, are not of physical interest and are referred to as nuisance parameters. Because we have assumed a Gaussian distribution, quoting the standard deviation fully characterizes the distribution, giving a measurement of 4.5 ± 0.75 keV. As we can see, the true value of 4.75 keV is within one standard deviation of the inferred value.

### B. Markov chain Monte Carlo

As stated, it is not often possible to compute the derivatives analytically, and so another means to solve this problem is desired. Additionally, even for this simple example, we had to make significant approximations to enable our analytic solution. The Laplace approximation can still be used without analytic derivatives by applying a finite difference method instead, but the posterior is not always guaranteed to be well approximated as a normal distribution. In order to relax this assumption and allow the posterior to take whatever form the problem dictates, we must have a method by which we can sample an arbitrary probability distribution. For this problem, the RHS of Bayes’ theorem is most generally written as in Eq. (6), where we maintain uniform priors. The technique of choice for sampling this distribution is known as Markov Chain Monte Carlo (MCMC).^{28,29,37–40} Here, we present the basic theory of Markov chains and stationary distributions needed to understand MCMC. We additionally present a specific example of an MCMC algorithm to clarify these concepts. We present the formulation using a continuous state-space notation as this is typically more appropriate for problems in HEDP and ICF. Additionally, many presentations of the theory in discrete formulation are readily available and correspond to intuitive conversions of integrals to summations from our presentation.^{29}

*j*to the next state at

*j*+ 1 only depends on the current state. Mathematically, this is expressed as

*p*(

**x**

_{j+1}|

**x**

_{j},

**x**

_{j−1}, …,

**x**

_{0}) =

*p*(

**x**

_{j+1}|

**x**

_{j}). Simply stated, a Markov process has no memory. To emphasize the sequential nature of transitions from one state

**x**=

**x**

_{j}to another

**y**=

**x**

_{j+1}in a Markov process, we define the notation

*p*(

**x**→

**y**) ≡

*p*(

**y**=

**x**

_{j+1}|

**x**=

**x**

_{j}). Note that

*p*(

**x**→

**y**) must satisfy

*π*(

**x**) satisfying

*π*is referred to as the stationary distribution for the Markov chain with transition probability density

*p*(

**x**→

**y**). Beginning from some state

**x**

_{0}and sequentially drawing samples accepting the transition with probability given by the transition density to build the chain of samples {

**x**

_{0},

**x**

_{1}, …,

**x**

_{n}}, eventually, the sampling converges to drawing from the stationary distribution

*π*, i.e.,

**x**

_{i}∼

*π*. A stronger condition than the Markov property for

*p*(

**x**→

**y**) that will guarantee the existence of the stationary distribution

*π*is that of detailed balance. Detailed balance requires that

For an MCMC algorithm, one wishes to estimate the posterior distribution via sampling. It is sufficient for such an algorithm to choose an appropriate transition probability density so that the stationary distribution corresponds to the posterior distribution in Eq. (3). The Metropolis–Hastings algorithm^{41} achieves this by specifying a proposal distribution *g*(**x**, **y**) and a probability of acceptance *A*(**x**, **y**) such that the transition kernel *p*(**x** → **y**) = *A*(**x**, **y**)*g*(**x**, **y**) satisfies detailed balanced. The proposal transition distribution *g* is not specified *a priori* and must be chosen by the user (see Refs. 28 and 30 for detailed discussion of the properties and requirements for the proposal distribution). Often, *g* is chosen to be symmetric *g*(**x**, **y**) = *g*(**y**, **x**) or independent of the current sample *g*(**x**, **y**) = *g*(**y**), with a uniform or normal distribution often selected for computational convenience.

*g*(state-space must contain the support of

*π*), we see that to satisfy detailed balance, the acceptance probability must satisfy

Propose

**y**by drawing from*g*(**x**,**y**).Compute

*A*(**x**,**y**).Accept

**y**with probability*A*(**x**,**y**); else, set a new step to current step**x**.Return to 1.

A couple of notes are in order. First, note that the transition probability being factored as a product of *A* and *g* allows us to first propose **y** ∼ *g*(**x**, **y**) and then accept or reject with probability *A*(**x**, **y**). We also note that, intuitively, if a symmetric proposal distribution is used *g*(**x**, **y**) = *g*(**y**, **x**), then for *π*(**y**) > *π*(**x**), the new point will always be accepted. That is, the chain will always take a proposed move to a region of higher probability density unless such a move is otherwise penalized by the proposal distribution *g*. Finally, we remind the reader that by satisfying detailed balance, the stationary distribution is guaranteed to be given by *π*.

As previously mentioned, in our problem of interest, *π* is the posterior distribution, and we can generate an individual sample for a given ** θ** by evaluating the prior probability and the likelihood at that point. However, even though we can evaluate the posterior, we resort to MCMC to sample it because any problem with more than a few dimensions will be exceedingly expensive to sample by brute force on a grid, and we do not

*a priori*know where the posterior will have the highest weight. MCMC sampling solves this problem for us by sampling from the posterior distribution and providing us a chain of samples. However, we do have to evaluate the posterior at each point proposed by the sampler, and many thousands of samples are often required to provide a converged chain. This necessitates that the forward models used to emulate the observed data be computationally efficient. We will address a variety of data-driven approaches that can enable computational tractability for this problem in Sec. IV.

To compare the results of the analytic Laplace approximation from Sec. II A, we sample the same model using the Metropolis–Hastings algorithm, allowing us to relax the assumption that the posterior is Gaussian. We can also take advantage of the flexibility of the algorithm to impose more informative priors. First, we observe that the Laplace approximation to the posterior shown in Fig. 2 allows the constant offset and energy shift terms to assume negative values. As negative backgrounds are unphysical in this case, we restrict the uniform prior to be positive. We also impose our knowledge that the mean energy shift in the neutron spectrum is positive by placing a bound on the prior for this parameter. Next, we apply realistic bounds to the ion temperature of [0.1,10] keV. Finally, we restrict the amplitude to be within a sensible range as well. We use the Metropolis–Hastings algorithm to sample the posterior, which is shown as the blue contours overlaid in Fig. 2. We can see very good agreement between the posteriors inferred with both methods, indicating that our assumptions made in the analytic model were not unfounded. We note very small shifts in inferred mean values of ion temperature and amplitude that are well within the uncertainty of the inference. In general, we see that the distribution inferred using MCMC is slightly tighter than the approximate version. This is desirable, as we would not want our approximate posterior to give us a reduced credible range in our inferences. However, we note that this is not a general feature of the Laplace approximation that can be guaranteed, so caution is warranted when using approximate inference methods. Finally, we note the cutoff in the distribution of the constant offset and energy shift parameters (most noticeable in the fully marginalized distributions along the diagonal), which are due to the bounds placed on the prior and are not informed by the data.

Now that we have multiple different inference methodologies at our disposal that can handle physical models of varying complexity with different assumptions made about the statistical model, we can explore how these choices impact our inference. As stated above, the problem of interest is a counting problem. We assumed that counting statistics were sufficient to warrant a normal likelihood distribution, but this will not always be true. Both the Laplace approximation and MCMC sampling can utilize a Poisson likelihood. Additionally, we can relax our assumptions on the physical mode and use one of higher fidelity for *f*(*E*|** θ**), such as the Ballabio model used to generate the data. Because the Ballabio model computes the shift in the mean neutron energy with ion temperature, we are also able to remove a parameter from our model. Our new model requires an ion temperature, a scale factor, and a background level.

Figure 3 shows the posterior distribution of the ion temperature inferred with different physical and statistical models, as well as different inference methodologies. The top panel shows inferences all using a normal likelihood with the Laplace approach and the Brysk model in blue, MCMC and the Brysk model in red, and MCMC with the Ballabio model in green. The true value is shown as a vertical dashed line, and the mean of each inference is shown as a colored dot. We see that the Laplace approach results in a slightly wider posterior with the mean a bit further from the true value relative to the other methods. However, the mean is well within one standard deviation of the true value, so this answer is still a good one. Using MCMC brings the mean a bit closer to the truth with a narrower posterior. Finally, using MCMC with the Ballabio model produces a negligible change in the posterior. This is due to the fact that, given the low count rate, the Brysk model is easily able to explain the data, so the improved physical model does not produce a tighter posterior. The bottom panel shows two cases using the Ballabio model and a Poisson likelihood. The posterior found using the Laplace approximation is shown in orange, and that found using MCMC is shown in cyan. Both approaches produce a noticeably narrower posterior than with the normal likelihood, and the MCMC approach shows a slight improvement relative to the Laplace method. We expect this to be the case since the use of the Poisson likelihood better matches the measurement mechanism and the use of the Ballabio model in our forward physics model better matches the true physics of the observation. We provide the details of these cases as additional examples in the Jupyter notebook that accompanies this Review.

In summary, there are many choices one can make when constructing a statistical inference model. The inference methodology is largely independent of the physical and statistical models implemented in the inference, but the choices will have an impact on the result. Improving the fidelity of the physical and statistical models will generally improve the quality of the inference, and a poor physical model can produce highly biased results with misleading uncertainties. Care must be taken to ensure that a valid model is used. The inference methodology is largely determined by the complexity of the problem and the accuracy required in the results. If done properly, the Laplace approach and other approximate inference methods (e.g., variational inference^{37–40}) can provide excellent results more efficiently than MCMC. However, as these methods rely on traditional optimizations, they can be sensitive to the starting point and the usual warnings apply. If the posterior is not well approximated by a multivariate normal distribution with nearly linear correlations or if more accurate results are required, MCMC approaches should be considered. Fortunately, the software tools available today make it fairly straightforward to test these different approaches.

### C. Summarizing the posterior

Various analytic distributions have known solutions to the integral above. For the multivariate normal distribution obtained via the Laplace method, we can easily obtain a univariate distribution and the necessary summary statistics using the optimal parameter location *θ*_{o} and the covariance matrix **Σ**. This is accomplished simply by using the mean and variance of the parameter of interest $p(\theta i)=N(\theta i,o,\sigma i2)$. As a direct result, the expectation value and variance of the *i*th parameter are *θ*_{i,o} and $\sigma i2$, respectively. Other inference methods that provide an analytic description of the posterior, for example variational inference,^{37} can use similar closed form solutions to marginalize the posterior and obtain summary statistics.

*θ*

_{n},

*n*= 1, 2, …,

*N*, are the MCMC samples from the chain. In fact, the expectation value of any function

*f*can be estimated from samples

*θ*

_{n}as

If the posterior is reasonably Gaussian, the moments above are useful descriptors. However, if there is significant skew in the distribution, the median may be a more appropriate descriptor. Additionally, quantiles are often used instead of the variance to quote credible intervals for an inference. This is because quantiles are able to accurately capture skewness in the distribution where a significant amount of the density is on one side or the other of the mode, therefore encoding more information, but still being amenable to the use of an error bar, albeit potentially asymmetric, when plotting inferred values.

The plot in Fig. 4 shows commonly used quantiles represented on a skewed distribution. The distribution is log normal with a mean of 1 and a standard deviation of 0.5. Quantiles are computed by finding the point in the cumulative distribution function (black line) that exceeds a specified fraction of the total probability. The vertical dashed line marked 50% shows the *p*50 or median point. This is the location where half the weight of the distribution is on either side. Next to this is a small tick mark showing the mean, illustrating that for a skewed distribution, the mean and median are not the same and the median will be closer to the peak. On either side of the *p*50 point are the *p*25 and *p*75 quantiles. These indicate the location where 25% or 75% of the weight of the distribution is to the left of the line, respectively. Between these two locations is the 50% credible interval (CI) or the symmetric region of the distribution about the median that contains 50% of the probability. Going further out, we have the *p*2.5 and *p*97.5 quantiles, between which lies the 95% CI. This interval is equivalent to a 2*σ* interval for a normal distribution. Any combination of quantiles can be used to specify a credible interval, but these are commonly used ones. Above the histogram is a so-called box and whisker plot. The line in the middle of the box shows the *p*50 point, and the edges of the box show the *p*25 and *p*75 points on the left and right, respectively, denoting the 50% CI. Finally, the whiskers extending from the box region indicate the *p*2.5 and *p*97.5 quantiles, giving the 95% CI. This form of plot is a convenient way to visually represent a distribution. Here, we note that there is a subtle difference between Bayesian credible intervals and frequentist confidence intervals. The credible interval conveys the range of parameter values, which are consistent with the observations and the prior. The frequentist confidence interval is a statement on the long-run performance of the method under repeated sampling of data. For further details on the interpretation of credible and confidence intervals, we refer the reader to one of the many textbooks or articles on the subject.^{28,39}

Even though these metrics are useful for describing a distribution function, they do fail to convey important information in certain circumstances. The most common example is a multi-modal distribution, i.e., one that contains more than one peak, or mode. In this situation, the quantiles and, in particular, the median will fail to convey that there are multiple regions of the distribution that have high probability. The other common scenario is when significant non-linear correlations exist between parameters. In this scenario, the pair-wise distribution shown in Fig. 2 will be strongly distorted, which has implications for interpreting various probabilities of interest. In these situations, it is generally best to show the full posterior rather than reduced quantities.

### D. Advice for the practitioner

The example shown in Sec. II A, along with the companion Jupyter notebook,^{26} illustrates how to perform Bayesian inference for an idealized problem using different methods, approximations, and likelihood functions. As we have seen, these approximations have an impact on the quality of the inference. In general, the higher the fidelity of the physics model used in the inference and the better the match of the likelihood function to the true measurement process, the better the quality of the inference. In our example, using a Poisson likelihood in conjunction with the Ballabio model for the neutron spectrum produced the tightest posterior distribution for the ion temperature, which is the parameter of interest. We caution that a better quality inference does not always equate to a tighter posterior. In fact, as previously mentioned, using poor quality models can often give highly biased results with a very tight posterior about the biased values. This illustrates that many choices can be made when formulating a Bayesian inference problem, and these choices will have an effect on the quality of the resulting inference, and so caution must be exercised when interpreting results. We encourage the reader to change various parameters in the example code and study the effect each has on the posterior and how data quality impacts these results. Below, we will discuss some general principles to help the practitioner navigate these choices and develop an understanding to guide them.

The choice of prior is a critical input to any Bayesian inference problem and a very common conceptual hurdle for the novice. In general, we want to use the least informative prior possible. This is often, but not always, a uniform distribution with appropriate bounds placed to avoid non-physical parameters. The reasoning is that we wish to provide enough constraining information to ensure a useful solution without unduly influencing the posterior distribution. There are three major exceptions when a uniform distribution is not the appropriate prior. The first scenario is when important information is known before the experiment about a parameter (or subset of parameters) that warrants a more informative prior. This could be, for example, when target metrology or engineering tolerances are able to constrain certain initial conditions or when constraints such as conservation of mass can be employed. Additionally, correlations known about the model parameters *a priori* can be enforced by choosing a multivariate normal distribution with a covariance matrix that is informed by known physics. It is important to remember that only information known prior to observing the data should be included and not information that can be gleaned from the data. As an example, some diagnostics are sensitive to the product of two parameters, e.g., the areal density *ρR*, and thus, the posterior distribution of these parameters is strongly correlated. This correlation should not be built into the prior, even if it is expected, because this is informed by the data and, thus, is a posterior quantity. However, if, for example, we are inferring the density at two different spatial locations and physics arguments, e.g., a high sound speed, tell us that these two locations should be correlated, we can build this knowledge into the prior.

The second common situation is when a uniform distribution is genuinely not the least informative distribution. This can often occur when a change of variables is employed in the model. We may use a uniform distribution in the original variable system, but variable transformation requires a transformation be applied to the distribution to maintain the desired properties. The classic lighthouse problem illustrates this well, where a transformation is made from angular coordinates to position. We desire a uniform distribution in the angular variable, but this turns out to be non-uniform in position. We encourage the reader to see Ref. 30 for more details. Additionally, there are some common statistical models where a uniform prior on the parameters is not appropriate. Detailed discussion of these cases can be found in Refs. 28–30, though they tend to be uncommon in our application.

The final reason to use a more informative prior is when the algorithm being used requires it. Some approximations that are useful for describing the posterior require that the likelihood and priors be normal,^{42} though some methods can employ distributions with bounds.^{43} Common practice is to use very broad normal distributions with bounds to avoid non-physical or extreme values, thereby approaching a bounded uniform distribution. These approaches are often used for computational convenience and can be avoided if an efficient forward model is used when computing the likelihood. However, methods such as these can be used to find a linearized normal posterior, which can then be used to condition the proposal distribution for a subsequent MCMC sampling step, providing more efficient sampling than the general unconditioned case. Additionally, at least one example of an iterative method exists to remove bias in the posterior introduced by an informative prior.^{44} These methods will be discussed further in Sec. IV D.

Choosing the likelihood is the next significant decision that the practitioner must make. The likelihood distribution should be determined by the measurement method at hand. For the classic problem of determining the fairness of a coin using a series of coin flips, the binomial distribution is the appropriate likelihood. This is because each flip has only two possible outcomes and the probability of each is determined by a binomial distribution. A Poisson likelihood is most appropriate for a counting problem, which can be approximated by a normal likelihood in the limit of high counts. When using features extracted from data in the likelihood, it is imperative to establish the statistics of the feature as derived from the data. If they can be approximated as a normal distribution, then a normal likelihood could be appropriate. If not, a different distribution should be chosen.

^{45}

*σ*is the uncertainty band on the data. This likelihood gives equal weight to any

*f*that matches

**D**within the uncertainty band and 0 weight to any

*f*that is outside providing a uniform distribution centered about

**. One can imagine other likelihood functions that act similarly. For example, a super-Gaussian would provide a similar flat-top response and rapid decay to negligible weight outside of the uncertainty band. As we saw in our example, the choice of likelihood can strongly impact the results of the inference. Because of this, it is important to choose a likelihood that matches the physics of the measurement in question.**

*D*Once our priors and likelihood have been specified, along with our forward model *f*(** θ**), we must choose the inference method that we will use to find the posterior. For some well-behaved cases, we may find an analytic solution; for an estimate of the posterior, we can use the Laplace method, variational inference, or other approximate methods, or if the problem is more complex and we require an accurate posterior, we can use MCMC methods. A wide range of software packages exist to perform this step. Popular options are Dakota,

^{46}Stan,

^{47}emcee,

^{48}

^{,}

*pymc*,

^{49,50}JAGS,

^{51}and TensorFlow probability.

^{52}Our example notebook uses

*pymc*. Each of these packages has built-in MCMC sampling algorithms. The Metropolis–Hastings algorithm is a widely used and reliable workhorse with a small number of hyperparameters to be chosen by the user. Another benefit of this algorithm is that it does not require the calculation of gradients during sampling, so the physics model being used does not need to be differentiable making it amenable to most any scientific computation with little to no modifications.

MCMC samplers require a period of time to establish a stationary distribution. This time period, known as burn-in, is needed so that the samples in the chain do not retain memory of their starting point. MCMC sampling is a kind of diffusion process, so this knowledge can take some time to leave the system. Once the chain has settled to its stationary state, these burn-in samples should be discarded. No hard and fast rules exist to establish the burn-in phase. Generally, $\u223c1000$ samples are required, but this will depend on the dimensionality and properties of the individual problem. Best practice is to monitor the evolution of the chain for the problem of interest and determine the appropriate number of samples. We refer the reader to Refs. 28, 30, and 37 for more information. Additionally, some algorithms produce samples that have some degree of correlation between neighbors, meaning that the Markov chain has some degree of residual memory. In this situation, a practice called thinning is employed. Essentially, only every *x* samples are retained from the chain, where *x* is an integer specified by the user. The remaining samples are discarded. Alternatively, all samples can be retained, but the effective number of samples will be less than the total. This has implications for the Monte Carlo error of an estimate, which is $\u221d1/Neff$, where *N*_{eff} is the effective number of samples, not the total number.

Many of the available packages provide additional samplers that can be exploited to improve efficiency or avoid pathologies that are specific to certain classes of problems. Two popular samplers based on Hamiltonian dynamics are Hamiltonian Monte Carlo (HMC)^{37,53} and No U-turn Sampler (NUTS).^{54} These algorithms improve efficiency by enabling larger jumps in model parameter space while maintaining a high probability of acceptance due to the energy preserving nature of their methods. Details of these algorithms can be found in Refs. 37 and 53.

Additional care must be taken for high-dimensional problems, ≳20. This is because in a high-dimensional space, standard MCMC exploration of the probability density becomes inefficient, referred to as the curse of dimensionality. Regions of high density can be numerous, small in extent, and widely separated. This makes it ever more difficult to map the probability distribution fully. Various ideas from global optimization theory have been adapted to MCMC sampling, such as differential evolution,^{55} graduated homotopy,^{56} and mode-hopping,^{57,58} which seek to allow the chain to explore a broad region of the parameter space, enabling the full distribution to be sampled efficiently. Still, this is an active area of research, with the goal of a general and reliable algorithm for solving strongly non-convex problems remaining unsolved.

### E. Applications in ICF and HEDP

The benefits of Bayesian inference have been accepted and exploited in a variety of fields for several decades now. However, ICF and HEDP have been slow to catch up. This is, in part, due to the complexity of the measurements and physics models involved in inferring important quantities. Recently, computational capabilities and the availability of easy to use inference software have improved to the point where these tools are being applied to a range of problems in the field. An important benefit that Bayesian inference provides is the evaluation of rigorously quantified uncertainties, which can consistently incorporate all the relevant experimental and model uncertainties and constraining information from any measurements that are provided. Rigorously quantified uncertainties are crucial when comparing experimental inferences to models or attempting to discriminate between competing models.

Figure 5 shows four different examples where Bayesian inference was applied in ICF and HEDP research. The left panel from Ref. 59 demonstrates a method where very limited data are used to constrain the energy coupling and flow between different sources and sinks throughout the course of a direct drive exploding pusher implosion^{34} on the OMEGA 60 laser. In this work, a thin SiO_{2} shell is filled with 10 mg/cm^{3} of pure deuterium gas. The shell is directly illuminated with intense UV light, causing the shell to ablate outward, producing a rocket effect that drives the remaining shell material inward and a strong shock into the deuterium fill. These implosions have been shown to be reproducible and fairly symmetric,^{65,66} making them ideal for the investigation of detailed physics processes. Time resolved self-emission images are obtained as the capsule is exploding after the shock has coalesced on axis and struck the shell. These images show a bright ring of emission as the shell material is heated and accelerated outward, from which a radius as a function of time is extracted. The rate of this expansion can be used to constrain the efficiency and prior dynamics of the implosion. With knowledge of the initial radius and laser drive, a simplified model of the ablation, acceleration, and rebound of the shell is constructed that matches the expansion phase of the experiment and simultaneously constrains the unobserved implosion and shock reflection phases. The model consists of an ordinary differential equation (ODE) describing the radius vs time as a function of ten model parameters, five of which have strong priors informed by independent measurements or physical constraints. The other parameters take the form of broad normal distributions with bounds (such as positivity) being enforced where appropriate. The measured radius vs time is compared to the modeled values using a normal likelihood function, and MCMC sampling is used to constrain all of the model parameters. The result is the plot on the bottom of the left panel, showing the posterior distribution of the shell trajectory and pressure of the fuel. The posterior distribution of the model parameters can then be fed back into the ODE to obtain a wealth of information, such as the ablation kinetic energy, fuel energy, and shell kinetic energy, among other quantities, as a function of time with credible intervals. This dataset provides valuable constraints with quantified uncertainties on radiation hydrodynamics simulations, helping to provide consistency checks with the equation of state (EOS) models in untested regimes and improve the code’s predictive capability.

The center panel shows an experiment designed to study the growth of the Rayleigh–Taylor instability (RTI) in a converging geometry. These experiments, conducted on both the NIF and the OMEGA 60 lasers, drive a cylindrical implosion with a pre-imposed perturbation, which grows due to the acceleration driven Rayleigh–Taylor instability; see Refs. 62 and 67 for details. The growth of the perturbation is modified by the convergence of the shell as it implodes, a process that is important for ICF and has seen extensive theoretical study over the years, but little experimental investigation until recently. X-ray radiography is used to measure the amplitude of the perturbation and the trajectory of the interface and the converging shock wave. Initial studies of this platform arrived at relatively large uncertainties in the positions and amplitudes of interest primarily due to instrumental effects. The radiographs are formed by a large area backlighter^{68,69} that illuminates a pinhole array. Each pinhole then forms a quasi-point source that illuminates the target and projects an image onto a time resolved detector. Because of this configuration, each pinhole projects through the target at a slightly different angle creating a position-dependent parallax effect that distorts each image differently. Furthermore, the locations of the backlighter source, pinhole array, target, and detector are not known exactly, and so the parallax cannot be removed directly from the data. To circumvent this, a ray tracing model was developed as the forward model for a Bayesian inference performed using the Bayes Inference Engine (BIE).^{70} The locations of each of the elements in the experiment are able to vary within the known uncertainties of the setup. Rays are passed through a model of the target, which is constructed of primitive objects whose positions vary, creating a mask for the rays in the model, shown in the top left image in the center panel. Backlighter intensity variations and background are also included to form a synthetic radiograph, shown in the upper right. This image is compared to the measured image (bottom left), and a residual is formed (bottom right). The residual is minimized using a normal likelihood, accounting for all the uncertainties in the instrumental setup, producing a posterior distribution of the amplitudes and positions of various features of interest. The plot on the bottom of the center panel shows the comparison of the inferred quantities (×’s) to the simulated trajectories (solid lines). The uncertainties derived from the posterior are invisible on the plot, demonstrating the significant improvement provided by the Bayesian technique. With this analysis, we can clearly see that the simulation fails to reproduce some important aspects of the experiment. Looking forward, this level of confidence in the measurement will easily be able to distinguish between competing hydrodynamic models or different methods of running the same model (choice of EOS, flux limiters, drive multipliers, etc.).

The panel on the far right of Fig. 5 illustrates the inference of the equation of state parameters from velocimetry data in dynamic compression experiments. Illustrated in the top image of the far right panel of Fig. 5, Brown and Hund developed a Bayesian method to calibrate an equation of state model for tantalum at high pressure.^{63} The Z machine^{6,7} is used to drive an aluminum electrode, which supplies a continuously ramped pressure drive to the tantalum sample.^{71} The velocity of the sample is measured on the opposite side where the free surface of the tantalum is mated to a lithium fluoride window. Typically, an inverse Lagrangian analysis is used to turn the free surface velocity information into the equation of state quantities of interest. This method uses the magnetic field drive as a boundary condition and performs a forward model to produce simulated velocities, which are compared with the measurement via a normal likelihood. Priors are placed on the parameters of a Vinet equation of state model^{72} and other experimental parameters, such as the magnetic field drive and the sample thickness, allowing for the incorporation of a wide range of experimental uncertainties. The posterior distribution of the Vinet fit parameters, shown in the bottom plot of Fig. 5, is used to produce an experimentally calibrated equation of state for tantalum. For efficiency, this work uses a Gaussian process^{73}^{,} $(GP)$ surrogate of the forward model. This calibration provides more accurate constraints on the equation of state than the inverse Lagrangian method with more complete accounting of the experimental and model uncertainties. Due to its success, this methodology has been extended and applied to a wide range of equation of state experiments in a range of regimes and numerous drivers.^{74–76}

The final example we will discuss, shown in the bottom panel of Fig. 5, demonstrates a Bayesian technique for inferring magnetization of the fuel at stagnation in Magnetized Liner Inertial Fusion (MagLIF)^{77–79} implosions using secondary DT neutron spectra. The method was developed using a kinetic model that tracks tritons born from DD fusion reactions in a cylindrical fuel configuration and computes the resulting secondary DT neutron spectrum. It was shown that this signature is extremely sensitive to the magnetic field-radius product *BR* during burn, which is an important confinement parameter for magneto-inertial fusion concepts.^{80,81} This analysis was expensive, requiring thousands of evaluations of the kinetic model for every experiment analyzed, and did not provide rigorously defined uncertainties on either the data or the inferred values. The authors of Ref. 64 discuss a fully Bayesian pipeline to extract features from the raw waveforms utilizing a background fit and hand-engineered features that characterize the width and asymmetry of the neutron time of flight signals and infer *BR* from these featurized data in conjunction with the DD and DT neutron yields. To circumvent the expense of running the kinetic model for each MCMC step, a neural network surrogate model was trained to map from model parameter inputs to the extracted data features. The data featurization pipeline is shown on the top of the flow chart in the lower panel of Fig. 5, and the inference pipeline, including the neural network surrogate model, is shown on the bottom. The use of surrogate models and their implications will be discussed in Sec. IV. The Bayesian model implemented in this work is parameterized with five parameters, of which *BR* is of primary interest. The work demonstrated the technique using broad uniform priors on the nuisance parameters, but other measurements could be used to inform these quantities more tightly. This method was applied to an ensemble of four experiments. The results are shown in the plot on the right-hand side of the bottom panel as *BR* vs preheat energy deposited into the fuel. These data clearly show that the inclusion of the so-called Nernst effect is critical for capturing the remaining magnetic flux at stagnation.

These examples illustrate the power and flexibility of using a Bayesian framework for the inference of quantities of interest in ICF and HEDP experiments. Other works have utilized these methods to infer the equation of state parameters from radiographs,^{82} the inference of performance metrics in ICF experiments in the alpha-heating regime,^{83,84} and extracting electron temperature from K-shell spectra.^{85} Additionally, the authors of Ref. 45 used Bayesian inference to demonstrate the advantages and potential pitfalls of these methods using a reanalysis of two datasets from the literature. This work aptly demonstrates the potential for bias and misleading uncertainties when using maximum-likelihood methods that do not fully account for uncertainties and priors.

## III. DATA ASSIMILATION

Going beyond the examples outlined above, where a small number of scalar-valued or a single vector-valued observation is used as the basis for an inference, we quickly enter the realm of *data assimilation*,^{25,86} sometimes called *data fusion*. Data assimilation is the process by which multiple disparate sources of data are used to jointly constrain an inference, typically through a physical model that links the different observables. The same basic principles as above apply, but now have to be extended to allow for the combination of a variety of different likelihood functions, each of which must be constructed prudently such that one does not dominate the solution more than is warranted by the true information content of the data. This necessity arises from the fact that it is often difficult or impossible to infer realistic correlation models between observations both within, and between, instruments.^{87,88} Computational tractability often dictates that we assume independence between measurements, but this can dramatically overstate the true information content of the observations when there are structured or correlated noise effects. Downweighting the likelihoods by rescaling is an effective substitute for the reduced statistical power of the data that would result from using a likelihood with realistic covariances if the latter were easily inferred. This strategy is particularly helpful when very low-dimensional measurements are used in conjunction with data such as images or spectra, which may be represented with a very high sample/voxel count. If treated naively, images can easily outweigh a scalar or vector-valued datum due to the sheer number of data points in the image. Fortunately, the number of pixels does not equate to the true number of data points in an image, though determining the true information content of images can be a challenge. In this section, we will review the construction of a statistical model for data assimilation and highlight a variety of dimensionality reduction techniques used in the literature to help solve the weighting problem.

Despite this challenge, the benefits of this approach are tremendous. Let us suppose that we are making an inference using a model that has a number of nuisance parameters, i.e., parameters that are necessary in order to fit the data but are not of physical interest. Revisiting our example in Sec. II A, fitting the neutron spectrum to extract an ion temperature requires three parameters, a background, an energy shift, and a scale factor, in addition to the parameter of interest *T*. If we have measurements that can inform these nuisance parameters, we can improve our inference by incorporating this knowledge in some way. In our example, a yield measurement could help to constrain the scale factor. Perhaps the simplest way is to use a separate measurement to independently constrain the nuisance parameter and then place an informative prior on that parameter in our inference. A more elegant and powerful way to accomplish this is to use both measurements to jointly constrain the inference. In order to do this, our forward model must account for both measurements simultaneously, and so may require more physics be added to accomplish this. Additionally, we now require two likelihood functions that are combined to perform the inference. These additions are not trivial, but they are extremely powerful. The resulting posterior is now informed by both measurements, which generally improves the quality of inference for all parameters, not just the one of interest. If our nuisance parameter is not really a nuisance, but something we are interested in knowing, then by taking this approach, we have improved our ability to infer multiple important quantities by combining them into a single statistical model.

By way of providing a concrete example, it is useful to revisit Eq. (6), which specifies a normal likelihood and uniform priors. For the moment, we ignore the priors, which need not be uniform, and focus on the normal likelihood. In our example in Sec. II A, the function *f*(*E*_{i}|** θ**) takes as inputs the model parameters

**and returns a neutron spectrum. In the more general case, we can split**

*θ**f*into two parts: one that models the physics of interest and another that models the diagnostic observables. The physics model will generally produce as its output a variety of field quantities (density, temperature, velocity, etc.). These fields are not directly observable to us, but are related to the diagnostic signatures we do observe. The diagnostic model will take these fields and produce any number of observables that we may wish to compare to (x-ray radiographs, emission spectra, neutron spectra, x-ray emission power, etc.), all of which are self-consistent. Here, it is useful to introduce the concept of an inference network,

^{37,89}an example of which is shown in Fig. 6. An inference network is a directed graph that visually depicts the connections between all of the important pieces in our model: the priors, the physics model, the diagnostic models, the observations, and unobserved, or unobservable, quantities of interest.

The graph in Fig. 6 contains three different types of nodes: open circles, which denote stochastic quantities; triangles, which denote deterministic quantities; and filled circles, which denote observed quantities. At the top of our graph, we have open circles representing the prior probability for each of the physics model parameters *θ*_{i}, *i* = 1, 2, …, *N*. These nodes are connected to the triangle *f*(** x**;

**), which represents our physics model. Here,**

*θ***is the independent variable or variables, such as position, time, and frequency. Samples are drawn from**

*x***and fed into this node, producing an output**

*θ**y*. To account for uncertainty due to a surrogate model or other approximation, we can include an additive stochastic parameter

*ɛ*. In the limit of no uncertainty,

*y*becomes a deterministic variable. This output is then directed into each of the diagnostic models

*g*

_{j}(

*y*;

*z*

_{j}),

*j*= 1, 2, …,

*M*, where

*z*

_{j}represent any diagnostic model parameters that have priors placed on them, such as calibration quantities, detector response, and alignment uncertainties. The output of the diagnostic models is compared to the observations

*D*

_{j},

*j*= 1, 2, …,

*M*, through the likelihood distribution specified for each. Finally, through the process of sampling the posterior distribution of the model parameters, we can also accumulate statistics on unobserved quantities. In this example, we have another function

*Y*=

*h*(

*y*) that takes the output of the physics model and returns some quantity of interest. This could be, for example, the quantity

*χ*discussed in the introduction, where the field quantities of pressure and temperature can be used to compute a value for a given sample of model parameters. The posterior samples of this new quantity

*Y*now give us a probability distribution

*p*(

*Y*|

**) of a fundamentally unobservable quantity that is consistent with all of the data.**

*D**f*and

*g*and depend on the model parameters

**and the diagnostic model parameters**

*θ**z*

_{j}. We note that this form assumes independence of the different likelihoods. It is, in principle, possible to relax this assumption by estimating the covariance between the different measurements, but this is rarely ever possible, in practice.

^{87,88}Even though this likelihood is derived from a relatively simple graph, the equation is very difficult to interpret, making the graph itself more useful for understanding the flow of information. Typically, as before, the log of the above equation is used, producing a sum over the individual likelihoods. It is the terms in this summation that must be appropriately weighted for a meaningful composite likelihood to be used. We note that, in general, these graphs can be far more complicated, with multiple models interacting in more complex ways.

Although the application of Bayesian data assimilation in HEDP and ICF is fairly new, a few examples of this approach are available. The authors of Ref. 90 demonstrated an analysis of MagLIF stagnation data using a combination of six different diagnostics (neutron yield, neutron time-of-flight, filtered x-ray power measurements, filtered x-ray pinhole camera, and a monochromatic x-ray imaging system). A simplified isobaric model of the stagnation plasma was used to produce spatially varying neutron and x-ray emissivities. The quasi-cylindrical plasma column was coarsely divided into three axial regions to attempt to capture low-frequency variations in the stagnation parameters. Each axial slice is characterized by a pressure, peak temperature, liner areal density, radius, and mix fraction. A schematic representation of this model parameterization is shown in Figs. 7(a) and 7(b). The computed emissivities were then used to produce synthetic versions of each of the instruments in the analysis. Normal likelihood distributions were used for all observable quantities. The posterior was found using the Laplace approximation discussed above. In this work, the optimum was found using Levenberg–Marquardt minimization, and the Hessian was computed numerically via finite difference to describe the Gaussian posterior. Priors were specified using very broad bounded normal distributions to minimize impact on the posterior. For a subset of cases analyzed, an MCMC sampling was done to confirm that the Gaussianized posterior was an unbiased and conservative estimate of the true posterior. The individual likelihood distributions were scaled by considering two independent factors. First, the data from the imaging instruments were reduced to correspond to only the three axial zones, ensuring repeated identical information was not being included. The images were further reduced by matching the number of samples along the *x*-dimension to the real resolution of the instrument, ensuring the traces were not over-sampled. The neutron time-of-flight signal was also downsampled to reflect the true temporal resolution of the instrument. Finally, an estimate was made of the number of degrees of freedom or truly independent data points in the signal. This effective number of data points was then used to scale the individual likelihoods. Before application to experimental data, this model was extensively validated using a broad ensemble of data from a simplified model and against results of 1D and 3D resistive magneto-hydrodynamics (MHD) calculations.

The validated model was used to infer the model parameters over an ensemble of 35 MagLIF experiments. The posterior distributions of each of the model parameters inferred for each axial zone were then processed to give emissivity weighted values of pressure, temperature, fuel volume, beryllium mix fraction, and liner areal density. Finally, an inferred distribution of *χ* was computed using these values. The results of this ensemble analysis are shown in Fig. 7(c) where the emissivity-weighted pressure is plotted as a function of temperature, and the inferred mean value of *χ* is shown as the color. This kind of plot summarizes the progress made on improving the performance and scalability of the MagLIF concept and provides key quantities for comparison to trends from integrated calculations. Importantly, this example illustrates how to use multiple disparate instruments to jointly constrain a quantity of interest *χ* that is fundamentally unobservable.

The authors of Ref. 91 developed a Bayesian model calibration approach that uses an ensemble of experimental data obtained during the BigFoot campaign on the NIF to constrain potential sources of performance degradation. The authors chose a form of the model calibration problem that transforms the experimental input parameters into a latent space that allows for the inclusion of these degradation mechanisms within the model, rather than the traditional method of including a probabilistic discrepancy term. This approach allows for data taken at a specific experimental configuration ** x** to be used to constrain the values of the latent variables

**that best describe the observations.**

*z*The model is parameterized to include six potential physical sources of degradation (variable hohlraum peak power and total energy multiplier, variable amplitude surface roughness on the capsule, variable width and amplitude of surrogate fill tube perturbation, variable preheat introduced into the DT layer, and a variable carbon mass premixed into the DT gas). The experimental observations to which the model is compared are the DT neutron yield, the ion temperature, the nuclear bangtime and burnwidth, the downscatter ratio (DSR), the mean radius of the hotspot from x-ray imaging, and the radiated x-ray energy at 22 keV. These quantities are compared through a normal likelihood, and the scaling has been accounted for by converting the measurements responsible for each of the observables to a single scalar feature. The forward physics model for this inference is based on the HYDRA radiation hydrodynamics code.^{92} However, running 2D HYDRA simulations for each MCMC sample is too expensive, so a deep neural network (DNN) surrogate model is trained by generating 70000 samples across the eight-dimensional parameter space. These points are then used to train a DNN using the DJINN^{93} approach, providing a mapping from the input parameters to experimental observables with a calibrated surrogate uncertainty. This surrogate can then be used to efficiently sample the posterior distribution of the parameters quantifying the various degradation mechanisms, an example of which is shown in the top four plots in Fig. 8. The analysis consistently shows that reduced compression due to artificial preheat is a significant contributor to degradation of the experiments compared to 2D HYDRA predictions. The analysis is then refined to test a variety of hypotheses to explain the reduced compression, revealing the fuel–ablator mix to be the most likely mechanism. Finally, the ensemble of analyzed shots is used to calibrate a scaling model, which accounts for the measured discrepancies to predict performance at larger delivered laser energy (bottom plot in Fig. 8). In this example, the model is used to predict what drive energy is required to recover the predicted capsule performance if no degradation is present. Recent work, which will be discussed in Sec. IV, has extended the deep learned surrogate model to incorporate vastly more diagnostic information than the handful of scalars used in this work.^{94}

A final example of data assimilation in the HEDP literature is using a hydrodynamic model with parametric representations of equations of state and strength models to integrate data from multiple different dynamic compression experiments.^{75} This framework has been used to infer a variety of parameters describing the equation of state, plasticity, and anelasticity of tantalum under ramp compression by simultaneously leveraging experimental data from the Z machine and a gas gun. The forward model is constructed to simulate the evolution of the tantalum sample under the measured drive conditions, which itself is inferred via the Bayesian method described previously,^{63} subject to parameterized strength and equation of state models. The forward model produces a simulated free surface velocity, which is compared to the measured free surface velocity using a normal likelihood distribution. As has become typical, the forward model is expensive, so a surrogate is used for efficiency. In this example, the surrogate is a $GP$ model that captures the mean of the observable and the uncertainty of the surrogate. An effective number of data points is determined for the measurements based on an auto-correlation analysis of the signals. The inference of the model parameters is performed in one case simultaneously over all Z-shots and including all Z-shots and gun-shots simultaneously. By incorporating the data from both facilities, which span different ranges of strain and strain rate, the method is able to constrain the model parameters with lower uncertainty and reveal correlations between parameters that were not present with Z data alone. The differences observed by the authors clearly demonstrate the power of simultaneously and self-consistently analyzing experiments from different facilities and in different regimes to constrain constitutive models of material properties. Figure 9(a) shows the resulting parameter values with credible intervals inferred from one of the Z experiments analyzed as a function of time illustrating when in the experiment different parameters are being accessed. Figure 9(b) shows the equation of state calibrated using the data in this study compared to existing Diamond Anvil Cell (DAC) data.

## IV. SAVING ON COMPUTATIONAL COSTS

As indicated in Sec. II, statistical inference typically requires many thousands or more evaluations of forward (and sometimes adjoint) physics models. In general, this will necessitate the use of some combination of simplified physics models,^{59–61,84,90,98–101} machine learning methods,^{91,95,102–106} analytic approximations in statistical inference,^{90,107,108} high-performance computing, and advanced inference algorithms in order to reduce computational costs. In this section, we review some applications of machine learning, analytic approximations, and advanced algorithms drawn from the HEDP, ICF, and adjacent literature to achieve this reduction. These examples will apply tools such as neural networks,^{37–39} Gaussian process regression,^{73} projection-based reduced order models (ROMs),^{109} and Bayesian optimization.^{39,40} However, it is important to point out that many of the examples mentioned in this section are not currently leveraged to conduct statistical inference and uncertainty quantification (UQ), which is the central theme of this article. To further clarify the role that such methods can play in enabling Bayesian inference, we will provide some additional context in Sec. IV E. While we highlight a wide variety of methods, an extensive exposition of the veritable zoo of machine learning and surrogate modeling approaches goes well beyond the scope of this work, with many methods having yet to see adoption in the HEDP community. Thus, we recommend the interested reader see one of the many excellent reviews^{110–114} or textbooks^{37–40,109} for more details on model architectures as well as associated data-types and tasks that various approaches can handle.

In Fig. 10, we indicate a coarse categorization of computational reduction approaches into four different levels of granularity with examples of each taken from the literature. Panel (a) of Fig. 10 represents the lowest level category, where physics/diagnostic submodules that may be placed within a larger simulation framework are replaced by a fast emulator. The second categorization in panel (b) of Fig. 10 represents cases in which various physical quantities of interest, such as field variables or observables, are evolved dynamically using a reduced model. We note that general (not necessarily UQ driven) applications using these first two approaches appear to be fairly sparse in the HEDP and ICF literature, but are beginning to be more widely adopted in a variety of physics and engineering applications, with a particularly rich application space in computational fluid dynamics (CFD).^{115–118} The pursuit of these approaches in the fields of HEDP and ICF represents an exciting and impactful area of research likely to see significant growth as practitioners in the field become more familiar with advances in machine learning. To reinforce this perspective, we have included two examples of such applications in Sec. IV B from the broader ML for CFD community. Although not directly applied to HEDP and ICF experiment or simulation, they are relevant to this community (e.g., shock propagation and hydrodynamic instabilities). We emphasize, however, that adopting these cutting-edge techniques to realistic problems in HEDP and ICF presents a basic research challenge that must be overcome. Panel (c) of Fig. 10 categorizes together methods in which a full calculation from model inputs to a set of outputs or observables is surrogated. Note that methods that predict time and space resolved measurements blur the line somewhat between this category and the previous one. Panel (d) of Fig. 10 groups together methods that enable simplifications in the statistical analysis to produce computational cost savings. We organize the discussion in the remainder of this section largely around Fig. 10, providing details of several examples from within each grouping. Where possible, we provide additional references for relevant work not covered in Fig. 10 for the reader to further explore.

### A. Emulating physics and diagnostic submodules

^{37–39,110,111}to enable fast surrogates. Mathematically, neural networks (NNs) have a relatively simple form. If we denote by

**x**

^{ℓ}the output of layer

*ℓ*, we may write

*f*

^{ℓ}is a (generally nonlinear) “activation” function of the layer evaluated element wise on its argument, $W\u2113\u2208Rn\u2113,n\u2113\u22121$ is a trainable weight matrix for the layer,

**b**

^{ℓ}is a trainable bias vector, and ◦ indicates linear composition of $W\u2113$ with

**x**

^{ℓ−1}(matrix multiplication in fully connected NNs or convolution in convolutional NNs). Typical activation functions include the rectified linear unit, sigmoid, and hyperbolic tangent.

^{38}

**x**

^{L}, then we may note that

**x**

^{L}(

**x**

^{0},

*θ*) depends on the input data and all of the weights and biases $\theta =[WL,bL,WL\u22121,bL\u22121,\u2026,W0,b0]$. The tuning of these parameters is referred to as training, with the goal of achieving

**x**

^{L}(

**x**

^{0},

*θ*) ≈

*F*(

**x**

^{0}) for some target function

*F*. Generally, training is formulated as a minimization problem,

*L*

^{p}norm, binary cross-entropy, etc.) between

**x**

^{L}and

*F*on a training data point $x0i$. The optimization is typically achieved via gradient descent.

^{38}We briefly discuss in Sec. IV E how this training may be promoted to a Bayesian framework, but assume that point estimates for

*θ** are obtained from training in what follows.

In Ref. 95, non-local thermal equilibrium (NLTE) opacity calculations are replaced with a NN [Fig. 10(a)] for use inline with an MHD simulation capability. Once trained, small to medium sized neural networks containing 10s–100s of thousands of parameters may typically be evaluated on millisecond timescales, offering potential for orders-of-magnitude speedup. For this case, a factor of $\u223c7\xd7$ speedup of the full simulation was obtained by replacing the physics-based opacity calculations with the NN version with $<10%$ relative error. Furthermore, it is anticipated that the NN solution may scale favorably as compared to the full NLTE calculations as the number of atomic levels included in the opacity calculation is increased [see Fig. 11(a)].

As a second example at this level of granularity, we highlight the work of Ref. 119 shown in Fig. 11(b). In this application, x-ray radiography measurements are inverted to obtain electron density profiles in CH/Be interface heating experiments conducted at the OMEGA 60 laser facility. In this work, the authors utilize a neural network architecture for diagnostic inversion. They begin by specifying a physics based diagnostic forward model, *d*(**z**), that operates on an underlying physics state **z**, here electron density profile, to produce a synthetic diagnostic x-ray radiograph, **y**. The authors proceed to construct *f*(**y**) ≡ *d*◦*e*_{NN}(**y**) ≈ *i*(**y**) = **y**, where *f*(·) approximates the identity map *i*(·) that operates on radiography signals and *e*_{NN} is a NN that will be trained to approximate *d*^{−1}, the diagnostic inverse. The neural network function *e*_{NN} is trained by minimizing the loss $L=|yobs\u2212f(yobs)|2$. As an aside, we note that it is also possible to train an approximation of the diagnostic inverse mapping, $e\u0303NN$, by providing pairs of points ${(yobsk,zsysk)}k=1Ntrain$, which must be obtained directly from forward modeling of the samples $zsysk$. Once the NN is trained using the forward modeled data, it can be applied to experimental data. However, this approach may risk being less robust to non-representative or out-of-training-sample data as compared to the self-consistent architecture utilized by Ref. 119. We also mention that this type of self-consistent learning architecture is similar to that used in Ref. 94 and can be useful for regularizing the trained model when an ill-posed inversion problem is being addressed.

**f**= [

*f*(

**x**

_{1}),

*f*(

**x**

_{2}), …,

*f*(

**x**

_{M})] evaluated on a set $X={xi}i=1M\u2286X$ with

*M*≥ 1 is jointly Gaussian. This joint Gaussian distribution has mean

*μ*= [

*m*(

**x**

_{1}),

*m*(

**x**

_{2}), …,

*m*(

**x**

_{M})] and covariance

**Σ**

_{i,j}=

*K*(

**x**

_{i},

**x**

_{j}), where

*m*is a mean function (typically taken to be zero) and

*K*is a kernel function (e.g., the square exponential described in Sec. IV E). It may be shown analytically that the posterior distribution of

*f*evaluated at points $X*={xj*}j=1N$ has the form

*K*

_{A,B}indicates the matrix formed by evaluating the kernel function on all pairs of points (

*a*,

*b*) for

*a*∈

*A*and

*b*∈

*B*. We pause to note from the structure of Eqs. (25)–(27) that $GP$’s have the attractive property that they exactly interpolate training data, making them a strong candidate for fitting deterministic data from physics simulations.

*f*

_{LF}(

**x**) and sparsely sampled expensive high-fidelity (HF) models

*f*

_{HF}(

**x**), where

**x**are the inputs (e.g., temperature and ion density). In the original Kennedy–O’Hagan framework,

^{121}a linear correlation between LF and HF models along with a non-linear discrepancy term

*δ*

_{HF}(

**x**) are specified so that

*ρ*is a proportionality constant used to capture an overall linear correlation between the LF and HF data. The authors of Ref. 106 consider both this “linear MF” approach and a more general framework allowing for a nonlinear mapping between LF and HF models given by

^{122},

*z*is, in general, a nonlinear mapping that may depend non-trivially on

**x**. A combination of relatively cost-effective to evaluate LF and HF model data may be used to fit $GP$’s for

*f*

_{LF}(

**x**),

*δ*

_{HF}(

**x**), and

*ρ*or

*z*(·) depending on the chosen framework. The authors find that the approach given by Eq. (29) generally outperforms the Kennedy–O’Hagan framework when significant nonlinear correlations between LF and HF models are present. Although the authors of Ref. 106 do not incorporate their $GP$ emulator into a larger simulation framework, it should be clear from context that this could constitute such a submodule of a larger calculation.

*f*

_{LF}(

**x**) and

*δ*

_{HF}(

**x**) need not be the same. This allows for fewer evaluations of the HF model, i.e.,

*N*

_{train,HF}≪

*N*

_{train,LF}, allowing for a reduced training computational budget. We also note that this approach allows for straightforward extension to a set of multiple hierarchical models through recursive application of Eq. (29),

*i*= 1, 2, …,

*i*

_{max}indexes the expected physical accuracy of the model. The MF approach additionally allows for the incorporation of experimental datasets as the highest fidelity results.

Interestingly, the use of $GP$’s naturally incorporates an estimate of uncertainty in the emulator, namely, the posterior predictive interval, which grows as one moves away from data points **x**_{i} on which $GP$’s for *f*_{LF} and *δ*_{HF} are trained. This uncertainty could be propagated by directly sampling from the $GP$ at runtime. However, interpretation of this and other forms of surrogate uncertainty requires caution. We give further details in Sec. IV E. We also wish to highlight that $GP$ regression can be extended to incorporate both functional (e.g., monotonic,^{123} bounded,^{124} and multi-output^{125}) and physics-based constraints that may be useful in certain applications (e.g., divergence-/curl-free^{126} and thermodynamic consistency for EOS modeling^{127}). Furthermore, we note that the work of Ref. 106 assumed that there is an inherent ordering to the accuracy of the models used, leading to a hierarchical structure where it makes sense to treat the high-fidelity case with a discrepancy from the low fidelity case. Recent efforts utilizing MF models to capture non-hierarchical structure between different models^{128,129} are capable of lifting this assumption. Incorporating non-hierarchical structures may be particularly useful in HEDP and ICF applications where different models are not necessarily ordered uniformly across their input domain, e.g., one model may work better at high temperature and another at low temperature.

Before proceeding, we emphasize that in choosing to cover primarily HEDP and ICF literature, we have biased the application space to some degree. In particular, we wish to emphasize that ML methods may also be used to accelerate or replace expensive algorithmic portions of computational solvers. For example, applications of machine learning to choose effective solver-preconditioner setups,^{130–134} efficiently generate matrix preconditioners,^{135} or to replace such solvers altogether with learned components^{136,137} have been explored. Neural networks have also found application in large eddy simulations (LESs). In LES, low-resolution simulations include subgrid scale source and sink terms to capture higher-resolution effects to improve computational cost while maintaining model fidelity. Accurate closure models of the subgrid scale terms are an important and active area of research. Recently, subgrid scale effects parameterized by machine learned models have been incorporated, for example, into numerical simulation of turbulent flows,^{138} the 1D Burger’s equation,^{139} and 2D MHD Kelvin–Helmholtz instability.^{140} In Ref. 141, direct numerical simulations (DNSs) are accelerated through ML enabled accurate coarse-gridding of algorithm components most affected by a decrease in resolution. This was achieved by allowing a convolutional neural network (CNN) to learn local flux interpolation and field correction rules from DNS training data. These methods may provide significant speedups in partial differential equation (PDE) solvers at the sub-component level.

### B. Reduced models for dynamics

At the next higher level of granularity presented in panel (b) of Fig. 10, one may attempt to surrogate the dynamical evolution of the system under consideration. If this results in sufficient computational speedup, such an endeavor may allow for the incorporation of simulations directly into forward modeling in Bayesian inference and/or experiment design frameworks. To be specific, suppose that we have a simulation code that operates on some history of the system state ${zti}ti=tktN$ to give a sequence of states at later times ${ztj}tj=tN+1tM$. We denote this by $H({zti}ti=tktN,\mu )={ztj}tj=tN+1tM$, where *μ* are any other parameters needed to specify the operator that evolves **z**_{t}. For continuous time dynamics, *H* typically takes the form of an integro-differential operator that can be solved using a discretization scheme. A goal of the work reviewed in this section is to replace the operator *H* and/or state **z** with a much faster approximation $H\u0302$ to evolve the state $z\u0302$, which may be the original state or some reduced-dimensional representation from which an estimate of **z** may be obtained.

The example from Ref. 96 shown in panel (b) of Fig. 10 uses a recurrent neural network (RNN) to surrogate the evolution of 1D temperature, pressure, and density fields. There, time evolution of the fields computed using a 1D radiation hydrodynamics simulation code is emulated using a sequence-to-sequence model based on gated recurrent units in an RNN architecture. The use of an RNN architecture allows the network to learn short- and long-time correlations in the evolution of the dynamics that cannot be effectively captured using a more standard ANN architecture lacking recurrent units. Particularly, the authors demonstrate the efficacy of the RNN approach over a simple state transition estimation by an ANN, showing a $\u223c30%$ reduction in the integrated absolute error between predicted and true values on a testing set for the RNN case compared to the ANN. Two additional examples are presented in Ref. 96. One example treats time evolution in a 1D diffusive problem. The second surrogates computation of time dependent x-ray emission images computed from 3D ICF implosion simulations. Both examples obtain similar findings in terms of ANN vs RNN performance.

*N*

_{G}coupled first order ODEs after applying spatial discretization and any operations needed to reduce the system to first order in time; $z\u2208RNs$ is the system state; and

*μ*are parameters describing initial conditions or other free model parameters (e.g., viscosity). We assume that the system state may be expanded as

*ϕ*

_{i}} spanning the trial solution space of chosen dimensionality,

*n*

_{s}, in which the dynamics are allowed to evolve (

*n*

_{s}≪

*N*

_{S}for ROM) and $z\u0302(t)$ represent the (time-dependent) mode amplitudes. We note that the expansion in Eq. (32) is referred to as Galerkin projection and is commonly used in CFD. We refer the reader to Chap. 11 of Ref. 109 and references therein for applications and extensions of Galerkin projection. With this ansatz, the equations of motion (after projecting onto the trial space) read

**f**into linear (

**L**) and nonlinear (

**N**) pieces resulting in

**N**= 0 above, then the matrix multiplications involving Φ,

**A**,

**L**, and

**z**

_{0}may be computed once and Eq. (34) clearly represents a reduction of the original

*N*

_{s}ODEs to

*n*

_{s}<

*N*

_{s}equations, providing computational gains. Unfortunately, the residual nonlinearity will still require evaluation of functions and inner products at each time step with sizes that scale with the original problem dimension

*N*

_{s}. The careful treatment of this term goes beyond the scope of our discussion here, but we refer the interested reader to Chap. 12 of Ref. 109 and references therein for more details.

Another natural question arising from projection methods is how to find the best set of basis functions, *ϕ*_{i}, on which to project? One common approach is to apply proper orthogonal decomposition (POD) to a simulation dataset with a representative input parameter point *μ*. POD decomposes the spatiotemporal simulation data into linear combinations of orthonormal spatially dependent basis functions with time dependent coefficients of ever decreasing energy content (c.f. discussion in Sec. V A for a more detailed description of POD). Typically, such representations are truncated at some rank, *r*. Geometrically, if we consider the solution to live on some manifold $M$, POD will find the best *r*-dimensional linear approximation to the portion of $M$ from which the solution snapshots are provided to the algorithm. This basis is then used to compute ROM evolution for parameters $\mu *\u2208B\mu $ in some neighborhood of *μ*. Alternately, one may attempt to develop interpolation schemes by including realizations of dynamics with different values of *μ* in the training data.

*r*. For example, in problems dominated by advection or with transient phenomena, POD may fail to capture a low-dimensional representation of the dynamics. However, Eq. (32) may be thought of as attempting to find a low-dimensional representation of the dynamics specified by $z\u0302$. In this context, we may think of $z\u0302$ as providing Euclidean coordinates $z\u0302\u2208Rr$ that map to the configuration space of the full dynamics

**z**given, more generally, by

**g**providing mapping from Euclidean coordinates $z\u0302$ into manifold coordinates $z\u2208M\u2282RNs$.

A more accurate approach then would be to construct an appropriate set of locally linear approximations of **g**, i.e., a set of ROM bases that depend on time ${\Phi (tj)}j=1N$ and which can be thought of as spanning the tangent space of $M$ near the point on $M$ corresponding to time *t*_{j}. Evolving the dynamics within one of these basis sets and projecting into the next appropriate basis at the appropriate time may result in accurately capturing the dynamics. While there is much mathematical detail we will not provide, this is the general idea behind the approach used to treat evolution of the classical Rayleigh–Taylor instability (RTI) in Ref. 142, as indicated in Fig. 12(a). In that work, different POD basis vectors are used based on how far the interfacial instability (bubble/spike) has evolved, allowing the authors to accurately capture single-mode RTI evolution well into the nonlinear regime.

On the other hand, the method in Ref. 143 attempts to circumvent such approximations by training an auto-encoder on realizations of the dynamics for different values of *μ*. The decoder portion then represents **g** in Eq. (35). Furthermore, with the use of a neural network, gradients are readily available for allowing straightforward and rapid computation of the Jacobian matrix that appears in the equation of motion derived in Ref. 143 for the reduced dimensional Euclidean variable $z\u0302$. This approach has been successfully applied, among other examples, for surrogating shock propagation simulated with the 1D Burger’s equation showing significant improvement in solution fidelity as compared to the linear ROM approach while also effectively capturing the intrinsic dimensionality of the solution manifold. The basic approach and shock propagation example are shown in Fig. 12(b).

Many significant details for these types of approaches go well beyond the scope of this Review and have been omitted. The goal was to provide only a high level outline to give the reader some intuition for approaches taken in the literature with application to a wide variety of fields. A number of other closely related tools should also be mentioned here. For example, sparse identification of nonlinear dynamics is a tool that has been developed to identify underlying dynamical systems models from measurement or simulation data and may be viewed as extracting a ROM directly from data.^{144,145} Similarly, operator regression has been developed to discover nonlinear operators in PDEs from data.^{146–149} Dynamic mode decomposition (DMD) is a matrix decomposition method that can identify spatiotemporal coherent structures in high-dimensional measurements/simulations having fixed linear behavior in time, i.e., damped or growing oscillatory structure.^{150} Recently, DMD has been applied to study magnetized plasmas^{151} and has been leveraged to accelerate computations of thermal radiation transfer for a Marshak wave test problem.^{152} Physics informed neural networks (PINNs) are another tool that have been introduced to enable data-driven solution of PDEs. PINNs have been applied, for example, to the 1D Burger’s equation^{153,154} and for PDE identification from data as demonstrated for incompressible 2D Navier–Stokes from data for vortex shedding behind a cylindrical obstruction.^{153}

We emphasize that all of the methods covered in this section will sacrifice physical accuracy and flexibility (to a greater-or-lesser extent depending on the particular application and method) *in lieu* of computation performance. ROM methods are actively in development that incorporate conservation laws,^{118,153,155,156} for example, of mass/momentum/energy, which may improve physical fidelity while preserving the performant nature of the surrogate. Where possible, effort should also be made to quantify and incorporate the uncertainty and bias introduced by the choice to use a surrogate model. The further development and adaptation of ROM methods with uncertainty quantification to realistic problems is an active area of research. Advocating for and contributing to research on realistic and relevant HEDP and ICF exemplars represent a significant community need.

### C. Learning input-to-observable maps

For the task of Bayesian parameter estimation, one typically only requires a mapping from the model parameters ** θ** to be estimated to the final quantities that will be compared to observations. Here, we refer to such a map as an “input-to-observable” map. As a somewhat conceptually simpler approach, these methods have proven popular despite being more “black-box” than attempting to surrogate full dynamical evolution. In particular, they have enabled a number of HEDP and ICF applications, including model calibration,

^{91,97}plasma parameter inferences,

^{64}post-shot analysis,

^{157}and optimization of experiment designs

^{102–104,158,159}and diagnostic configurations.

^{160}In many cases, these applications involve the use of a Bayesian statistical inference models, while those that do not could, in principle, be promoted to such a setting.

*N*= 104 simulations with uniformly distributed random inputs covering the domain of experimental configurations and values of

**considered possible. The chosen emulator model was a Bayesian extension of multivariate adaptive regression (BMARS). Rather than providing a single function estimating the output, fitting a BMARS model actually provides a distribution over MARS functions, each of which estimate the shock breakout time across the input domain. Example draws from this distribution are shown in panel (c) of Fig. 10. This distribution of regression functions allows for propagation of uncertainty in the MARS model fit through the calibration process via sampling of the posterior over the regression model parameters $p(k,\beta k|{y(xi,\theta i)}i=1Ntrain)$, where k and**

*θ**β*

_{k}represent parameters in a given MARS model being fit on

*N*

_{train}≈ 100 samples from HYADES. We note that Sec. IV E provides additional details and references for a variety of approaches to incorporating UQ arising from the surrogate modeling process.

In the remaining two examples, input-to-output emulators were developed to aid in optimizing the design of an experiment or diagnostic configuration. Broadly speaking, the critical component to such approaches is expert-motivated selection of a scalar valued loss for optimization (or multiple loss terms with Pareto optimization^{161}). Typically, the evaluation of such a loss across a broad range of design or diagnostic configurations will be costly to compute. As a result, one must typically aid the optimization process either by emulating the expensive forward model or surrogating the computation of the loss in a fashion that allows for an intelligent parameter search.

In Ref. 102, represented in Fig. 13(b), a random forest model is fit to the total energy yield from the HYDRA radiation hydrodynamics simulation code of NIF ICF capsule implosions. The authors developed the expert-motivated objective function for optimization given by $C=10P+ITFX$, where $P$ is a measure of the sensitivity of the yield output from the surrogate to small perturbations in the input and *ITFX* = *Y*(*ρR*)^{2} is an ignition threshold parameter. The factor of 10 in the first term ensures the two are of similar scale. The optimization results in proposal of an ICF experimental configuration, which may not belong to the data used to train the random forest surrogate. The new design point may be evaluated through a forward simulation. Provided that the original surrogate was trained with sufficient data to map the energy yield response accurately, the discovered experimental design will balance robustness to small, $O(10%)$, deviations from nominal input parameters and performance as measured through *ITFX*. As random forests are relatively fast to evaluate, an exhaustive search is then possible.

*x*. The width of $p(M|x)$ is related to the sparsity of known data in the particular region of input parameter space under consideration. Furthermore, the behavior of the average value of the $GP$ is governed by the observed data. Bayesian optimization works by proposing the next model evaluation point as that which maximizes the expected improvement given by

Interestingly, unlike the example in Ref. 102, Bayesian optimization does not necessarily require an accurate mapping of the response at the outset. It thereby reduces the need to generate large amounts of initial training data. However, there are practical considerations regarding the ability of Bayesian optimization to avoid being stuck in local optima (see Ref. 162). Furthermore, the robustness parameter $P$ in Ref. 102 requires *O*(1000) forward model evaluations to compute, necessitating the use of a surrogate with detailed knowledge of the simulation response across the input domain. In practice, one will need to carefully chose the approach utilized based on computation resources and complexity, expert insight, and the particular application space.

### D. Statistical algorithms and simplifications

The examples presented in Secs. IV A–IV C provide approaches to reducing the computational cost of a given physics model through the application of data-driven methods. This subsection is devoted to highlighting methods of statistical approximations and statistical algorithms that enable analytic calculations or reduce computational burden for more rapid and accurate inference irrespective of an underlying physics model. We begin with a discussion of some practical applications of the analytic approach discussed in Sec. II and variational inference and move on to a discussion of examples in the literature, demonstrating the role that differentiable programming and gradient based sampling algorithms plays in efficient inference.

The Laplace approximation was discussed in Sec. II A as a means to enable an analytic treatment and connection of Bayesian methods to potentially more familiar *χ*^{2} analysis. This approximation may also be useful as a first approximation or to marginalize out well understood uncertain parameters. For example, the Bayesian data assimilation of Ref. 90 utilizes the Laplace approximation to the full posterior distribution for emissivity weighted temperature, hotspot pressure, and hotspot volume, while also assessing the Lawson parameter *χ* consistent with these inferred values across a database of MagLIF experiments. As indicated in panel (d) of Fig. 10, for this specific application, the normal approximation tended to slightly overestimate the posterior variance while not tending to strongly bias the inference. At the same time, the Laplace approximation generally will require fewer forward model evaluations, provided that the initial parameter optimization step is not too costly.

As indicated in Fig. 14(a), the authors of Refs. 107 and 108 performed an analytical marginalization of *N*_{nuisance} = 29 simulation inputs of convergent ablator experiments on the NIF with well-characterized uncertainties. This is achieved by assuming a linear response of simulation outputs to variations in these nuisance parameters along with a normal likelihood function. The very high dimensional integral needed to marginalize these parameters would have required a prohibitively expensive number of simulations to evaluate numerically. Despite these assumptions, the posterior for the parameters of interest can be highly non-Gaussian as a result of the nonlinear nature of the simulations coupling these parameters to the nuisance parameters to generate simulation output. In addition to providing an analytic route to nuisance parameter marginalization, this work demonstrates the impact of leaving out these nuisance parameters and the importance of physically reasonable priors on microphysics models. The corresponding maximum likelihood estimate (MLE) differs significantly when including the well constrained priors on the microphysics informed by previous experiments. In particular, it is found that the prior uncertainty on microphysics model multipliers needs to be increased by about 1–2 orders of magnitude relative to recover the MLE in the uniformed prior case. The result demonstrates the critical need to incorporate prior knowledge into inferences when available. Complimentary to these observations, note that the authors of Refs. 107 and 108 highlighted sensitivity to prior information, which should be carefully considered when establishing a choice of prior to avoid biasing the result without strong prior experimental evidence. Before proceeding, we point out another approximate inference method known as variational inference (VI). In VI, the posterior distribution *p*(** θ**|

**D**) of Eq. (3) is approximated with a distribution $q\alpha *(\theta )$ from a family of distributions parameterized by a set of quantities

*α*that is more tractable (typically multi-variate normal with

*α*giving mean and covariance) such that $q\alpha *(\theta )$ minimizes some measure of discrepancy with the target distribution

*p*(

**|**

*θ***D**) (typically Kullback–Leibler divergence

^{163}). As we have not found significant application in the ICF and HEDP literature, we forgo further discussion and refer the reader to the textbooks of Refs. 37–40 and references therein for further details.

As mentioned in Sec. II D, gradient assisted MCMC is a useful way to improve efficiency of sampling algorithms. When available, the adjoint equations to a PDE allow for direct computation of derivatives of quantities of interest with respect to input parameters. For example, the study^{164} shown in Fig. 14(b) used analytic adjoint methods to conduct sensitivity analyses of flux-limited radiative diffusion. This work demonstrated that a full adjoint-based sensitivity analysis significantly outperformed even a single variable sensitivity analysis based on finite differencing of the forward model. In general, providing adjoint equations analytically by hand may be difficult. Recently, the authors of Ref. 165 demonstrated the use of *JAX*,^{166} an automatic-differentiation programming framework, to conduct fully differentiable kinetic simulations of the Vlasov–Poisson–Fokker–Plank equations with gradients. This code was paired with a neural network trained to tune a forcing function to minimize a physics cost function resulting in the discovery of a novel nonlinear physics response. Neural networks also allow for rapid gradient computations (using an approach commonly called back-propagation in the machine learning community^{167}) through the chain rule. This can be achieved as a result of using a series of simple operations with known derivatives. As an example, by replacing a physics forward model with an ANN trained on simulation data, the authors of Ref. 64 were able to use the gradient-based NUTS^{54} method for MCMC using the python packages *keras*^{168} and *pymc*.^{49,50}

Before concluding, we highlight that MF modeling ideas have also recently been applied to accelerate MCMC by reducing the number of high-fidelity model evaluations required. We refer the reader to Sec. 4 of the review in Ref. 169 and citations provided in that work for more details on such multi-level MCMC schemes. Given the range of model fidelities and associated computational costs, we believe these methods are an excellent candidate for application to problems in HEDP and ICF physics.

### E. Additional considerations

We note explicitly that this section has omitted a review of physics model simplifications that do not rely on the above surrogate modeling approaches. Although not covered here, such models will continue to play a significant role in enabling computationally tractable statistical inferences while allowing for detailed physical insights. Generally, the methods and examples highlighted in this section seek to balance the physical-fidelity computational-complexity trade-off incurred by the use of emulators to enable statistical inference. As such, it is important to consider when and how the resulting bias and uncertainty may be incorporated into the analysis to be conducted. However, we emphasize that the considerations presented here are not unique to data-driven modeling, but apply equally to physics-motivated approaches.

In regard to uncertainty quantification, models such as $GP$ regression and BMARS are formulated using a Bayesian framework. These models come equipped with a posterior distribution over regression functions (and thus a posterior predictive interval). This allows for direct incorporation of uncertainty through sampling from said posterior distribution. In the context of other machine learning methods, such as NN regression, one has multiple approaches available of varying degrees of rigor. For example, the model may be fit to training data using a Bayesian approach. In this Review, we view the loss function as being proportional to a log-likelihood, leading to Bayesian NNs (BNNs).^{171,172} BNNs come equipped with posterior distributions for the NN weights and biases, allowing for a direct sampling approach as with $GP$ regression. Alternately, NN weight dropout methods may be used to approximate the Bayesian approach.^{173} In Fig. 15(a), we indicate the approach to UQ taken in Refs. 64 and 170. In particular, the authors of those works utilize direct computation of out-of-sample error estimates from holdout data not used to train models. They then characterize variance of the employed emulator about ground truth model. However, this requires assumptions on statistics of the error term, assumed normal in the above two citations, so that $yemulated=ygroundtruth+N(0,\Sigma est)$, where Σ_{est} is estimated on holdout data not used for training. The authors of Ref. 105 developed a more flexible approach indicated in Fig. 15(b) by training two NNs on different loss functions to provide point and interval estimates. The authors are able to reduce the sensitivity to parametric error models by training directly on estimated distributions of the error for given input samples. In all cases, we note that caution must be applied in tasks involving model extrapolation (i.e., out-of-sample model evaluation) and interpreting uncertainty. The choice of model always imposes a bias. If this is not well matched to the data, then one cannot expect model extrapolation and any associated posterior predictive intervals to appropriately model the data or give reasonable estimates of uncertainty.

*ℓ*= 0.5 on different data shown in Fig. 16. In the first row of Fig. 16, data follow

*f*(

*x*) = sin(

*x*), while in the second row, data follow

*f*(

*x*) = sin(

*x*) + 3, and in the first column, we train on only the two leftmost data points, while in the right column, we train on the full dataset. In the first case, the data follow a functional form that has zero-mean with an output domain

*f*(

*x*) ∈ [−1, 1] and therefore matches the inductive bias imposed by the $GP$ model reasonably well. As a result, even when training on a small number of data points, the model provides reasonably meaningful (although still poor) point estimates and uncertainties, but this is not the case in the second row of Fig. 16. We see in the second row that the $GP$ regression is sufficiently flexible to interpolate the full dataset and provide reasonable uncertainty where the model is interpolative, but not extrapolative. Furthermore, with less data (lower left plot of Fig. 16), the posterior predictive interval does not share a particularly meaningful relation with the unseen data, providing only information that the model lacks knowledge of data in the region

*x*≳ 0. Of course, in our somewhat contrived example, if sufficient data are available to estimate the mean and variance of observations, standard normal scaling (subtraction by mean and division by standard deviation) can be applied to enforce that the data better match our model bias. Regardless, we wish to highlight that these considerations are universal and not unique to the $GP$ example represented in Fig. 16. Direct assessment of the bias and variance of out-of-sample data with respect to a model and strong inductive priors (e.g., physics constraints) is effectively the only safeguard against misinterpretation of model uncertainty. Methods such as cross-validation play an important role in understanding generalizability of models fit to data.

^{37–40,174}These techniques and others are also important in hypothesis testing and model selection, which are discussed briefly in Sec. VI A. We wish to emphasize that the particular choice of $GP$’s to demonstrate this feature was out of convenience. This is because $GP$’s allow for more direct access to the underlying model assumptions making it more straightforward to emphasize the importance of out-of-sample testing. However, these considerations are critical regardless of the underlying model. A detailed discussion of the benefits and drawbacks of different approaches to uncertainty quantification for general machine learning methods is well beyond the scope of this Review. Rather, the intent here is to indicate key ideas to the reader. For further reading, we recommend a variety of textbooks that treat machine learning in a probabilistic (i.e., Bayesian) framework.

^{37,39,40}For uncertainty quantification in deep learning, see the review in Ref. 175.

Various model calibration approaches may be appropriate to aid in understanding the model bias in the context of experimental data or high-fidelity simulation. Transfer learning, for example, has shown significant promise by leveraging multi-fidelity data ranging from simple simulations through complex integrated experiments to make models more predictive. Typically, transfer learning is achieved by modifying the output of a model directly either via a learned discrepancy or multi-fidelity terms as highlighted in Ref. 106 of Sec. IV A or by freezing all but the last few layers in a NN model and retraining to match data from a related target task.^{176–179} In these cases, the base task might be a simulation and the similar target to be learned could be a higher-fidelity simulation or experiment. As an alternate approach, calibration could be conducted on the inputs, including inputs that are, in principal, well controlled in the experiment. One may then attempt to learn a mapping from the known experimental inputs to the inputs needed to reproduce observed data. This approach, taken in Ref. 91 discussed in detail in Sec. III, additionally allows for the exploration of explanatory degradation mechanisms, with the potential for developing data-driven physics insights. Critically, we emphasize that these methods should not be viewed as a replacement for but rather a compliment to the toolsets currently employed across the range of theory, simulation, and experimental analysis to accelerate the design and discovery cycle. In particular, *post-facto* investigation of the learned models using, for example, feature importance metrics^{180} and correlation analysis^{181} may provide physicists with new insights into fundamental processes.

## V. DATA PROCESSING

As indicated in Secs. II and III, attempting to perform Bayesian inference in cases with large numbers of inputs will prove challenging due to the curse of dimensionality. This arises from the fact that volume grows exponentially quickly with dimensionality, making (probability) density estimation difficult. Furthermore, spurious background signals may contribute to significantly biased inferences, while noisy data may result in high-variance inferences. At the same time, high-dimensional experiment signals and synthetic simulation outputs can create memory usage and data storage problems, slowing or halting analysis. To alleviate these issues, a variety of expert-motivated and/or data-driven automated processing and featurization steps may be employed. Goals of these methods include reducing dimensionality of representations used for inference, accounting for instrument effects, filtering noise, subtracting background, and rejecting outliers. It is important to highlight that these methods may need to be applied to experimental or simulation derived data or both depending on the context to allow for efficient inference. In this section, we highlight a variety of applications demonstrating these approaches. We attempt to emphasize how these methods may enable or improve inference, although many examples presented stop short of performing a full inference with UQ. For clarity, we attempt to group the examples into two main categories. The first category is methods associated with dimensionality reduction. The second category is approaches to address background subtraction, noise reduction, and removal or assessment of instrumental effects that may adversely affect inferences. As has been the case throughout, we primarily restrict our consideration to HEDP and ICF relevant examples. However, many of the approaches discussed have been developed in diverse communities for applications ranging from digital signal processing to medical imaging. We again encourage the reader to refer to the wider literature for a more comprehensive perspective.

### A. Dimensionality reduction

Dimensionality reduction refers to any process by which important information for a particular application is distilled into a smaller set of quantities for use in further analysis or data storage. Uses of dimensionality reduction include enabling sparse measurements for high-fidelity state estimation and data compression to data visualization and avoidance of the curse of dimensionality in statistical analysis. As indicated in Fig. 17, dimensionality reduction methods can range from expert-motivated feature engineering approaches through purely data-driven feature or representation learning. A variety of methods are available to the practitioner, some of which are explicitly called out in Fig. 17.

*ν*) given by

*ϵ*

_{ν}(

*x*,

*y*,

*z*) from projections of the emission onto a number of lines-of-sight (LOS). Denoting the projection operator along the

*i*th LOS by $Pi$, one typically proceeds by minimizing a loss function characterizing the difference between the modeled and observed projection,

*D*pixel) basis, one then has to estimate the model inputs

*ϵ*

_{ν}(

*x*

_{i},

*y*

_{j},

*z*

_{k}), which has

*n*

_{i}

*n*

_{j}

*n*

_{k}∼

*N*

^{3}input parameters. For even modest dimensions of, say, 64 cells in each cardinal direction, one must infer emissivity for over 250 000 voxels, quickly making Bayesian posterior estimation unviable as resolution increases. Furthermore, in the HEDP and ICF experiment, a sparse number of projected views is generally available to help constrain the full morphology of an object. Often, such tomographic inversion problems are ill-posed due to the insufficient amounts of data. In Ref. 182, this is addressed by deriving 3

*D*electron temperature maps from 2

*D*projections using a voxel basis, as indicated in Fig. 18(a), while characterizing uncertainty using a bootstrapping

^{39}method rather than Bayesian inference. In this case, variance in the solution on a voxel-by-voxel basis is assessed by adding Poisson distributed noise to experimental images before performing repeated tomographic inversion (using $O(1000)$ realizations of images with noise added). The authors of Ref. 182 additionally assess systematic error incurred by using limited two and three lines-of-sight for reconstruction in this fashion. In this case, however, the rigorously defined uncertainty quantification from Bayesian inference is foregone

*in lieu*of a more flexible representation that may be able to better capture detailed features in the 3

*D*object. We note that inspection of 1

*D*lineouts from the bootstrapped data (as in Fig. 12 of Ref. 182) shows significant variance in the local structure so that it is not clear if the structure is associated with the physical quantity of interest, increased noise added to experiments for the bootstrapping procedure, or from a possibly multimodal structure in the loss landscape. As an alternate approach involving a high-dimensional formulation, we note the work of Ref. 183, discussed in more detail in Sec. V B. Here, a deep-learning approach with a synthetic model is used to perform tomographic inversion. This approach corresponds to an implicit regularization for the inversion based on the network structure and training data. However, in the context of Bayesian inference, this suffers from difficulties of characterizing the high-dimensional posterior distribution of neural network weights.

*s*

_{ℓ,m}(

*r*) reducing the input dimensionality from number of voxels ∝

*N*

^{3}→

*N*down to linear in the number of bins in the radial (or spectral) coordinate.

In general, all of these approaches make significant trade-offs in representational flexibility and dimensionality with considerations for how uncertainty is quantified. It will be critical for this application space to compare different algorithms to better understand performance and reliability of uncertainty estimates. More generally, methods that enable accurate uncertainty quantification in tomographic inversion incorporating a variety of effects are of significant interest to the broader community and have been an active area of research in a variety of application spaces for some time.^{185–193}

Basis expansion methods are also commonly used to provide useful summaries of data for comparison in a variety of other contexts. For example, the authors of Refs. 194 and 195 analyze nonlinear harmonic generation and mode–mode interactions in growth of the magneto-Rayleigh–Taylor instability using a spectral approach indicated in Fig. 19(a). Methods for extracting spectral features rely on regression, as in the least-squares spectral analysis^{196} approach used in Ref. 195. Such methods, therefore, have natural Bayesian extensions^{197} (recall connection between loss function and log-likelihood), providing a route to rigorous uncertainty quantification, for example, of assessing growth rates of different modes. In another example, the authors of Ref. 198 built an emulator of the code HYADES for laser deposition in a radiating shock experiment to be used as input for the radiation hydrodynamics code CRASH. Note that this is an example of surrogating a physics submodule as discussed in Sec. IV A. CRASH requires full space-time dependent field information (velocity, density, pressure, etc.) from the HYADES output. However, the authors noted that CRASH was insensitive to many of the details of the full HYADES output. As a result, the authors were able to select a set of 40 parameters (position, density, velocity, and pressure at each of ten locations specified by physical conditions), which could be used to reconstruct the HYADES field quantities using linear interpolation. This constitutes using a triangle basis function expansion, where the basis function location and width depend on the simulation input parameters. The authors found that the free parameters (coefficients, basis function locations, and widths) in this expansion could be found and used to emulate field variables with sufficient fidelity, as shown in Fig. 19(b), for use as input to the code CRASH.

*N*to obtain a reduced representation of dimension

*k*. The expert will generally wish to choose this mapping to give rise to a set of reduced dimensional quantities that are sensitive and constraining for the physics of interest. Physics intuition and synthetic data studies may both play a role in developing expert-motivated choices of

*F*for this purpose. In this case, the log-likelihood function would be formulated as

**to be inferred. The difference between this equation and Eq. (40) is that both the observed data and synthetic data have an additional transformation applied before comparison. Note that a basis function expansion of the observations may be chosen as a special case of**

*θ**F*, where the function

*F*now maps the observation to a set of basis function coefficients. While we do not have explicit need to be more general for our discussion, note that transformations may be simultaneously applied to both observations and model parameters.

We emphasize that the expert-engineered features discussed here provide highly interpretable results and will continue to serve useful for the community. For this reason, we now highlight several cases in the literature of choices for the function *F* before moving on to discuss details of data-driven dimensionality reduction. In the field of x-ray spectroscopy, spectral features, including linewidths, line-ratios, and continuum emission slope, constitute a small set of scalar quantities that are sensitive to and used for deriving estimates of various aspects of a plasma’s state and even geometry.^{199–202} Recent work has leveraged such summary features to obtain 2D electron temperature maps with the ratio of the energy-integrated Bremsstrahlung signal after attenuation by two different filters, providing the summary feature.^{203} Time dependent electron temperature of the ablating plasma in gas-filled holraums at the NIF has also been assessed using line-ratios from spectral features induced by inclusion of a mid-Z tracer layer^{204} [top panel of Fig. 20(a)]. Similarly, in inertial fusion experiments, fitting of neutron spectra in different contexts can be used to characterize ion temperature,^{205,206} *ρR* of the DT shell surrounding the hot-spot for implosions on OMEGA 60,^{207} magnetic flux entrained in a premagnetized fusion fuel,^{64,80,81,208,209} and residual kinetic flows in the hot-spot^{210–212} and, using neutron backscatter spectra, the dense fuel layer.^{213} Summary features have played a similar role in interpretation of neutron spectra. For example, the authors of Refs. 64 and 81 perform parameter inference directly on features characterizing the width and asymmetry of secondary DT neutron spectra and nToF measurements.

Image data are also amenable to such hand-picked featurization. For example, the authors of Ref. 216 summarized self-emission x-ray images from MagLIF experiments utilizing brightness contours to define a plasma volume, similar to the extraction of volume from neutron imaging diagnostics in Ref. 217. In Refs. 170 and 218, features for self-emission x-ray images from MagLIF experiments were computed using a geometric parameterization to mimic the observed image structure and Fourier analysis of axially resolved radially integrated intensity and brightness centroid, respectively. As another case, the blastwave radius may be extracted from multi-frame shadowgraphy images in experiments designed as surrogates to assess the laser preheat energy deposited in MagLIF^{214} as seen in Fig. 20(b).

Feature engineering of image data has also played a significant role in the study of hydrodynamic instabilities relevant in a variety of HEDP settings ranging from astrophysical phenomena, such as supernova explosions,^{215,219–221} to their role in degrading ICF performance.^{222} These reduced features have proved significant in astronomical observations, laboratory experiments, and in the study of numerical and analytic calculations. For example, bubble-spike amplitudes and mix-layer width in classical-/ablative-/magneto-Rayleigh–Taylor (RT) and Richtmyer–Meshkov (RM) instabilities are commonly used to assess growth rates,^{62,215,221,223–228} an example of which is shown in Fig. 20(c).

Whether it is some hand-chosen highly reduced features sensitive to a particular quantity or a hand-picked basis (generally chosen on physics-based grounds) for reconstruction, the types of features discussed may constitute a dramatic reduction in the overall dimensionality. When this dimensionality reduction applies to the model parameters themselves, as in the case for basis expansion methods in tomographic inversion, one may avoid difficulties with the curse of dimensionality in sampling from the Bayesian posterior distribution of those parameters. When applied to observables or simulation state variables, dimensionality reduction may aid in data storage and retrieval, reducing computational costs, and addressing the likelihood weighting problem discussed in Sec. III. Furthermore, in some cases, such as when dealing with image data, dimensionality reduction may be thought of as providing a metric for comparison under a likelihood function. This is useful as pixel-wise mean square error can be highly sensitive to small deformations of the image such as a slight shift in registration and are known to be poor choice in many cases.^{229} However, the effectiveness of such approaches will, in general, depend on the chosen representation. As a simple example, a propagating shock may be well described by the shock front location and density jump with just a few parameters, but a Fourier representation will require infinitely many modes to capture the discontinuous jump. For more complex systems, it may not be possible for an expert to discover a parsimonious description of the system. It is therefore advantageous to develop machine learning approaches that may be applied directly to the data to extract the most important features for the problem or task under consideration. We will refer to this approach as data-driven feature learning.

^{37,109}The basic algorithm for PCA is as follows: Given a dataset (assumed to be zero-mean) represented by a matrix $X\u2208Rn\xd7m$, where

*n*is the number of data samples and

*m*is the number of features (typically the number of pixels in an image or number of time-series data points), we may compute the eigenvectors and eigenvalues (known as principal values and principal vectors, respectively) of the covariance matrix

**X**

^{T}

**X**. It is not hard to show that the first principal vector associated with the largest principal value is oriented along the direction of maximum variance in the data, the second principal vector and value with the direction of second largest variance (after projecting out the first principal component), etc.

^{37}From a practical stand point, one often uses singular value decomposition (SVD), which computes the factorization

**X**=

**UΣV**

^{T}, where $U\u2208Rn\xd7n$ and $V\u2208Rm\xd7m$ are orthogonal matrices and $\Sigma \u2208Rn\xd7m$ is a diagonal matrix. The columns of

**V**are then the principal vectors, and diagonal entries of

**Σ**, referred to as singular values, are the square root of the principal values. By truncating the representation to the first

*r*principal vectors, one obtains $X\u2248Xr\u2261Ur\Sigma rVrT$, where $Ur\u2208Rn\xd7r$ are the first

*r*columns of

**U**, $\Sigma r\u2208Rr\xd7r$, and $Vr\u2208Rm\xd7r$.

**X**

_{r}provides the provably best (in the sense of minimizing the Frobenius norm of

**X**−

**X**

_{r}) rank

*r*approximation to

**X**. This is an

*r*-dimensional representation that explains a fraction

**X**. Provided that

*f*

_{r}≈ 1 with

*r*≪

*m*, a significant dimensionality reduction is achieved. In Fig. 21(a), we show the principal components of a dataset drawn from a 2D multi-variate normal distribution. The dark blue vector shows the direction of maximum variance in the data, and the perpendicular projection of data onto this vector will constitute a dimensionality reduction that captures a significant portion of the information contained in the data.

In Fig. 21(b), we show an example of the first (top) and third (bottom) principal components in the streamwise (left) and normal (right) directions for turbulent 3D fluid flow past a cylinder taken from Ref. 231. Since these modes are computed using snapshots of dynamical data, they are referred to as POD modes. Notice that the modes capture large scale coherent structures that are observed across time in the fluid. Furthermore, the flow is highly asymmetric on either side of the cylinder so that a Fourier decomposition would be a less appropriate descriptor of the data (i.e., require incorporating more components for the same level of accuracy). Despite being a very common starting point, explicit use of PCA in HEDP and ICF applications appears to be uncommon. We do, however, note the application in Refs. 232–235, where the authors compared inference of the properties of X-pinch plasmas generated from spectroscopy using a principal component decomposition of the non-local thermodynamic equilibrium spectra to a more traditional line-ratio analysis. We also note the work of Ref. 180, which utilized PCA and a sparse formulation of PCA to identify clusters of related design variables for ICF experiments on the NIF.

We pause to reflect on the fact that PCA/SVD may be thought of as representing a particular data sample from **X**, say, **x**_{n} by $xn\u2248\u2211i=1rai(n)vi$, where $ai(n)=Un,i\sigma i$ are the weights of the basis vectors $vi=Vi,:T$. Here, the subscript {*i*, :} indicates the vector given by taking the *i*th row of **V**^{T}. PCA guarantees that **v**_{i} · **v**_{j} = *δ*_{i,j} (by orthogonality of **V**). This decomposition, therefore, bears significant resemblance to a variety of methods such as Fourier decomposition. However, PCA may be more appropriate for certain applications as the linear basis vectors are learned directly from data to optimally explain variance. Matrix decomposition methods come in a variety of forms useful for different applications and have found a wealth of applications (c.f. discussion of POD and DMD in Sec. IV B). Given the ubiquity and success of these methods, the reader is encouraged to consult the wider literature and statistics and machine learning textbooks^{37,38,40,109} where a plethora of applications, extensions (such as probabilistic PCA^{236}), and limitations can be more fully appreciated. We especially highlight that the authors of Ref. 109 pointed to many excellent references, especially from the CFD community.

In general, matrix factorization approaches will fail to produce significant dimensionality reduction in cases with significant nonlinear correlations within a particular dataset. Strong nonlinearity of the correlations manifests as a singular value spectrum that decays very slowly with increasing rank *r* of the representation. Fortunately, there is strong empirical evidence for the “manifold hypothesis”^{237} that suggests that natural signals tend to cluster about surfaces or manifolds of significantly lower-dimension than the naive dimension of the data. To discover and exploit this structure, a vast number of “manifold learning” or non-linear dimensionality reduction methods have been developed, such as local linear embedding,^{238} t-distributed stochastic neighbor embedding,^{239} UMAP,^{240} kernel-PCA,^{241} Gaussian process latent variable models,^{242} auto-encoders,^{243} and many others.^{244,245} Figure 21(c) demonstrates the general goal behind manifold learning as a dimensionality reduction technique. In Fig. 21(c), the data appear to be two dimensional, yet a one-dimensional coordinate under a nonlinear mapping is sufficient to capture most of the structure in the data, with the residual structure corresponding to random fluctuations.

Auto-encoders have proven to be a particularly popular approach to learning underlying nonlinear structure owing, in part, to the fact that they can easily be constructed using auto-differentiable programming. This allows for their incorporation in reduced order modeling (c.f. discussion in Sec. IV B). Furthermore, they are commonly extended to serve as generative models^{37–40,246} for use in sampling and Bayesian inference. In Fig. 21(d), we highlight the work of Ref. 94, where the authors trained a multi-modal Wasserstein auto-encoder with a 32-dimensional latent space on a combination of image and scalar data of naive dimension given by four 64 × 64 pixel images and 15 scalars $\u224816\u2009400$. This constitutes a more than 500 × reduction in the data output size. The authors then utilized the learned latent representation to train an invertible emulator of a semi-analytic numerical simulator on the latent space with enforced cycle consistency improving the generalization error of the emulator.

It is beyond the scope of our review to pursue a rigorous and detailed explanation of the incorporation of the topics of Sec. V into a Bayesian inference framework. However, we note that experimental and synthetic data features are often extracted using an algorithm or model and, where possible, should be equipped with some form of uncertainty obtained, e.g., via Bayesian fitting or hyperparameter sensitivity analysis in extraction of features from data. See, for example, Ref. 247, which discusses calibration of uncertainty for the manifold and cyclically consistent (MaCC) architecture.^{94} We also observe that recent significant effort has been placed on the development of generative models (see the textbooks of Refs. 37–40 and 246 for more details and references) that provide tractable methods for treating the original data space and latent space in a rigorous probabilistic manner. This enables their direct incorporation into Bayesian inference. We anticipate that as these methods continue to mature, they will be more widely adopted in the HEDP and ICF community. We also remind the reader that there is an inherent connection between parameter optimization as in *χ*^{2} minimization and Bayesian inference via the maximum-likelihood estimation as discussed in Sec. II. We hope that knowledge of this intuitive connection will help to guide the reader in further investigation of this topic in these more advanced cases. We also advocate for readers to attempt to reformulate analyses they have conducted or observed in the literature in terms of the Bayesian theory framework to better understand these connections.

### B. Handling noise, background, and instrument response

*θ*. Indeed, even before applying the function

*F*, $Oimodel(\theta )$ and $Oiobs$ may not be comparable without properly accounting for background and instrument response. As addressing these aspects is integral to being able to perform accurate inference, we now present several approaches in the literature that can be used to avoid these issues. We roughly categorize these methods into those that deal with the noise term, the background term, and impacts of the instrument response on the signal term.

^{229}Here, we follow the work of Ref. 248, which makes use of a representation known as the Mallat Scattering Transform (MST). The MST has a number of desirable features as an image metric, for example, having been shown to be continuous to small image deformations.

^{249}This metric is used in Ref. 248 to directly compare MagLIF stagnation images and test sensitivity to a variety of realistic features, such as image noise and instrument effects. In this case, one may think of the function

*F*(here the function computing the MST) as mapping from spatial information to localized scale information, i.e., a space-scale representation. One may then ask if the application of

*F*to images sensitively depends on the signal-to-noise ratio (SNR). Using ideas of data augmentation from machine learning, the authors compared two different scans of the same image plate, with nominally the same morphology, differing only in the signal-to-noise ratio (SNR). Defining the SNR by

*σ*(·) computes the variance in the noise, the authors found that $F(Si1+Ni1)\u2260F(Si2+Ni2)$ for signals $Si1\u221dSi2$ when SNR

^{1}≠ SNR

^{2}. This is due to the fact that the MST is highly sensitive to statistical texture so that otherwise identical images could show significant discrepancy in this space when exhibiting different SNRs, an example of which is shown in Fig. 22. As a result, denoising before computing the image metric was needed to ensure a fair comparison between pairs of images taken from a combination of experiment and simulation using this metric in the presence of different SNRs. Ensuring such a fair comparison would be important for allowing Bayesian inference using this metric in the likelihood function in any future work. As another example where noise reduction plays a role, real-space filtering methods were applied in Ref. 250 to recover image features from gated x-ray detector data collected on NIF implosions. The improved SNR allowed for the accurate extraction of Legendre mode amplitudes fit to an intensity contour characterizing low mode asymmetry with uncertainty, which may then be used in a Bayesian inference.

Returning to Eq. (44), we notice that similar considerations apply for the background term. Note that when working directly with $Oiobs$, if one has a model for both *S*_{i} and *B*_{i}, it is possible to apply Bayesian inference directly, inferring posterior distributions for signal and background model parameters simultaneously. However, there are several reasons such an approach may not be desirable. First, referring back to Sec. V A, one sees that incorporating background model parameters will increase the problem dimensionality, potentially resulting in difficulties from the curse of dimensionality. Second, returning to the case of image data, it is often useful to work with a transformed quantity $F(Oiobs)$. As a result, steps must be taken to ensure that the background term does not introduce strong bias into the inference process. For this reason, we now discuss several different conceptual approaches to handling the background term to allow for accurate inference.

One common approach to addressing background due to spatial non-uniformities caused by instrument and *invariant* environmental effects is to perform direct measurements for use in a process called flat field calibration.^{251–253} Similar procedures can be followed to obtain spectral normalization to correct for spectral non-uniformities by fielding reference samples. In Refs. 254–256, this is achieved by fielding a reference sample in tandem with a sample of interest on opacity experiments conducted at Sandia’s Z Facility. The authors then performed a statistical analysis to obtain a spectral reference with uncertainty across many experiments that enabled characterization of shot-to-shot reproducibility. These approaches are attractive due to their effectively model-free approach to characterizing background, though we note that typically additive or multiplicative background is assumed depending on context. Furthermore, with a detailed understanding of statistics of noise sources, one may easily adapt such methods to a Bayesian context to propagate uncertainty from the background subtraction. This may be achieved either by parameterizing the uncertainty or sampling directly from experimental observations to mimic sampling from a prior distribution.

*θ*

_{B}to obtain $p(\theta S|D)$. Here, we assume $p(\theta S,\theta B|DX)=p(\theta S)p(\theta B|DX)\u2248p(\theta S)p(\theta B|D)$. In other words, we assume that the background is constrained by data in $X$. We then perform the marginalization as

^{37}We then employ the stated approximation going from line two to three and utilize MCMC samples from the posterior distribution $p(\theta B|DX)$ to approximate the integral obtaining the fourth line. In the final line, we utilize the Bayes rule while also writing out the explicit subtraction of the background resulting from the particular MCMC sample

*θ*

_{B,j}.

This is the type of approach that was used in Ref. 64, where secondary DT nToF data backgrounds were fit using Bayesian inference to find background model parameters as indicated in Fig. 23. By sampling from the posterior of background fit parameters, the authors of Ref. 64 obtained a distribution for the nToF shape quantities-of-interest, resulting in a significant dimensionality reduction in the observations to be fit (see Sec. V A). We note that this work made a further approximation that the likelihood function of the background subtracted data given the shape features of the signal *θ*_{S} was delta function distributed about the computation of the shape features from the background subtracted data. This amounts to assuming that the variance in shape features is dominated by the uncertainty in background fit so that noise in the computation of cumulative area under the nToF used to compute shape features is negligible. The authors used the experimental feature distributions as observations equipped with an approximate uncertainty in a final Bayesian inference of fuel magnetization from nToF and neutron yield data. This is achieved using a likelihood function form approximated from the experimental shape feature distribution with a multi-variate normal where the covariance matrix is estimated directly from MCMC samples of shape parameters obtained by the Bayesian background fitting procedure for a given experiment. Paired with a neural network surrogate of a physics forward model, this allowed for fast MCMC sampling and an assessment of the magnetic confinement parameter for the stagnated fuel in MagLIF experiments as discussed in Sec. II E. We highlight that this inference incorporates uncertainty from both the surrogate model fit (c.f. discussion in Sec. IV E) and the experimental feature extraction via Bayesian background fitting discussed here.

In general, the process specified in Eq. (47) requires identification of a subset of the observational data $X$ not containing the signal of interest. This may be done manually if needed, but may also be automated using signal processing and machine learning methods. As an example of a machine learning based approach, Fig. 24(a) shows results from Ref. 257, where the CNN-based pipeline was used to segment self-emission x-ray images from MagLIF experiments into fuel and background. The segmentation allowed for background subtraction and statistical analysis of the noise and background. The resulting experimental distributions of SNR and coefficients of low degree polynomials fit to slowly varying background are shown in Fig. 24(a). In particular, the analysis in Ref. 257 demonstrated that the assumption of additive Gaussian noise is well matched to the experimental data, while obtaining a multivariate normal approximation on background coefficients. We emphasize that this type of analysis can provide quantitative guides for the selection of parameterized distributions for data augmentation in machine learning applications or as priors in Bayesian background subtraction procedures that quantify uncertainty. Such data augmentation is commonly employed in machine learning applications such as those in Refs. 170, 257, and 258. Ensuring the data augmentation procedure is well matched to the experiment is critical. For example, it has been shown that convolutional neural network techniques for image analysis, such as classification and segmentation, may exhibit significant sensitivity to realistic noise.^{259} Training data augmentation methods are a common approach to avoid this feature. While it is common practice to assume normally distributed noise, it is important to consider rigorous approaches to effectively capture the noise distribution associated with experimental data.

Rather than attempting to remove background and noise to ensure that an inference on a signal or derived quantity is insensitive to those features, one may instead try to construct a mapping *F* that has the desired property that *F*(*S*_{i} + *B*_{i} + *N*_{i}) ≈ *F*(*S*_{i}). This is the type of approach undertaken in Ref. 183 and indicated in in Fig. 24(b). In that work, generative adversarial networks (GANs) are used to learn a function that may be applied to synthetic images to approximate the background and noise features seen in the experiment. That is, $Simodel\u2192Simodel+Bimodel+Nimodel$ (note that an additive form is not necessary here and is not assumed in Ref. 183). Broadly speaking, GANs operate using a split generator and discriminator architecture to learn the desired mapping. The generator, *G*(*x*), takes some input (in this case, a synthetic image) and generates an output, *y* (in this case, a modified synthetic image). The discriminator, $D(y\u0303)$, attempts to classify the output of the generator $y\u0303=G(x)$ as belonging either as an output of *G* or a member of the ground truth dataset. The networks are trained using feedback of one to inform the other so that *G* learns to better approximate the distribution of experimental data known to the discriminator. The learned generator function applies noise distributions to synthetic images that appear to arise from aspects of the measurement process that would be difficult or impossible to capture using an analytic forward model. The authors of Ref. 183 discuss the use of an additional neural network model applied to the GAN output to obtain high resolution 3D reconstruction of ICF capsule density profiles from a sparse number of views or even a single view. Although more work will be required to validate the reconstructions, the use of realistic looking GAN outputs in training the tomographic inversion is likely to reduce bias in the reconstruction. Before moving on, we note that, although not discussed explicitly in Ref. 183, generative models may be thought of as estimating the joint probability distributions over data and can serve as effective priors in Bayesian inference problems.^{260}

Finally, instrument response should also be appropriately accounted for to avoid biased inference. This is usually accounted for in one of two ways depending on the application. Perhaps the easier of the two methods involves convolving a model of the instrument response functions (IRFs) with a forward synthetic model. The output synthetic may be directly compared to observations, which naturally include these instrument response effects. This is the approach taken, for example, in Refs. 64, 205, and 261 to derive synthetic nToF data from a physics model that includes response. The authors of Ref. 261 went a step further by propagating uncertainty associated with imperfect information on the IRFs in a Bayesian framework, arriving at results for ion temperatures that accurately reflect that uncertainty. Alternately, one may attempt to solve the inverse problem to deconvolve experimental data using known responses such as point spread functions to remove instrument effects and obtain a ground truth experimental signal. This is the approach taken in Refs. 262 and 263, where an iterative maximum likelihood technique is used to remove non-stationary distortions caused by the aperture in pinhole neutron imager data from NIF experiments. If the IRF is parameterized, then a prior distribution over those parameters may be established by Bayesian fitting to calibration data. If it is directly measured but not parameterized, measurement uncertainty may be used to propagate the uncertainty through forward models using Monte Carlo sampling.

In this section, we have highlighted a unifying framework (for the case of additive signals) to understand the impact of realistic features in experimental data on dimensionality reduction and inference. We have presented a wide variety of methods spread across many subdisciplines that can be used to mitigate unwanted bias introduced from these features. It is our hope that the scope of applications is broad enough to provide good starting points for new practitioners regardless of their particular physics application.

## VI. NEXT STEPS

In this section, we highlight more advanced applications of the concepts presented in this Review that have been applied in other fields and could be adapted to enhance the power of ICF and HEDP experiments.

### A. Hypothesis testing

The examples outlined in Secs. II E and III almost exclusively analyzed data with a single model to infer parameters of importance from an experiment. However, one of the great powers of the Bayesian formalism is that it provides a mechanism to quantitatively compare models against each other. While estimating parameters from data and calibrating models is a critical task, scientific discovery moves forward when a candidate hypothesis can be ruled out by existing data. Hypothesis testing, or Bayesian model selection, is a well-established methodology applied in many fields.^{264–267}

In order to do this, we must evaluate which model among a library of candidate models has more explanatory power, given the available data. We quickly realize that our models may have different numbers of parameters and a model with more parameters could reasonably be expected to fit the data better than a model with fewer parameters, though this does not necessarily make it a *better* model. A naive example is observed when fitting a noisy dataset with two different polynomial models. You would expect a higher-order polynomial to intersect more points than a linear model, producing a “better” fit. However, a high-order polynomial is prone to over fitting and will not necessarily *explain* the data better than the linear model. Appealing to Ockham’s razor, all things being equal we prefer the simpler model. Bayesian model selection takes this effect into account when comparing models.

*h*

_{i}, each with their own parameter vectors

*θ*_{i}, we can write the posterior probability of each model as

*h*

_{i}is correct, given some observations and our background data. The likelihood on the RHS expresses the probability that we would observe the data

**D**, given the model and our background. We immediately notice that there is no dependence on

*θ*_{i}in this expression. This is because we are interested in the probability of a model being correct regardless of the particular parameter values, so

*θ*_{i}has been marginalized out. We can expand the likelihood in Eq. (49) to show the dependence as

*θ*_{i}, this term is often called the marginal likelihood. By inspection, we see that the integrand in Eq. (50) is equivalent to the numerator in Bayes’ theorem, where the hypothesis or model

*h*

_{i}can be viewed as part of the background information

*I*. Therefore, the marginal likelihood is nothing more than the evidence in Bayes’ theorem, acting as the normalization constant for the posterior.

*h*

_{1}over

*h*

_{2}, then we would expect that

*p*(

*h*

_{1}|

**) >**

*D**p*(

*h*

_{2}|

**), or alternatively the ratio of the posteriors**

*D**p*(

*h*

_{1}|

**)/**

*D**p*(

*h*

_{2}|

**) > 1. If we assume the prior probability of each model (how much we believe either model to be correct before seeing the data) is equal, determining the ratio of the posterior probabilities reduces to the ratio of the evidences, also known as the Bayes factor,**

*D*Evaluation of the Bayes factor requires marginalization of the likelihood over the entire space of the parameters, which tends to be a very complex and costly integration. Many methods have been developed to perform or approximate this integration, such as MCMC,^{268} nested sampling,^{269} thermodynamic integration,^{270} and the Laplace approximation,^{35} among others.

^{28,39,40,271}Both metrics use the negative log-likelihood at the maximum as a surrogate for the posterior weight and add a penalty term for the number of parameters. The AIC can be written as

*k*is the number of model parameters. In the BIC,

*N*is the number of data points used in the inference. When comparing multiple models, the model providing the smallest value of AIC or BIC will be preferred. These metrics only require the likelihood be evaluated at the MLE, obviating the need for the complex and costly integration of the evidence.

These methods are approximate, and importantly, they ignore all information about the posterior apart from the weight at the MLE. Other metrics have been developed to utilize more information provided by a chain of MCMC samples from the posterior. These methods, while still approximate, do take into account the full distribution of parameters provided by the chain, making them more accurate. Two of these methods are Leave One Out (LOO) cross-validation, and the Widely Applicable (or Watanabe–Akaike) Information Criteria (WAIC). We refer the reader to Refs. 174 and 272 for detailed information on the theory and implementation of these metrics. For our purposes, we note that they utilize a chain of samples from MCMC for each model and can be readily computed with many modern statistical packages. To compare multiple models, we compute either the LOO or WAIC from the posterior samples of each. The preferred model is the model that produces the largest value of the chosen metric.

To illustrate this, we revisit our example from Sec. II. In this example, we fit our neutron spectrum with multiple different models. The first was an approximate model for the neutron spectrum, which assumed that it took a Gaussian shape and used a normal likelihood. This model required four parameters to describe the spectrum. The second model used a normal likelihood but replaced the neutron spectrum with the Ballabio model, which reduced the number of model parameters from four to three. Finally, we implemented a Poisson likelihood in the inference with the Ballabio model. Figure 3 shows the posterior distribution of the ion temperature inferred with each model (as well as the Laplace approximation model). We can see that the Ballabio model with the normal likelihood produced a posterior for the ion temperature that was nearly identical to that produced using the Gaussian model, but required one fewer parameter to fit the data. The Poisson model produces a slightly narrow distribution. We can use LOO to quantify these differences and select the preferred model.

Figure 25 shows the LOO scores (circles) computed for each model along with the estimated standard error in the difference between each model and the highest ranked model (error bars). The Poisson model shows no error because it is the top-ranked model. We can see that the Ballabio model performs better than the Gaussian model, but the Poisson model is clearly superior. This agrees with our intuition, given the reduced uncertainty in the ion temperature demonstrated with this model and the fact that the data were generated using the Ballabio model as the ground truth with a Poisson process.

A prominent example applying Bayesian model selection in the literature is in the detection of gravitational waves by the LIGO observatory and the identification of the event that produced the observed signal.^{273,274} An inference pipeline was developed that applies a library of models to the observed data, estimating the relevant parameters for each candidate model. Multiple independent methods were used to sample the posterior and evaluate the evidence to check for consistency. The event was identified as a binary black hole merger because this is the event model that produced the largest evidence, by a significant margin. This process allowed the team to simultaneously identify the event and estimate the relevant parameters of the system (initial black hole masses, final mass, final black hole spin, etc.). Similar approaches have been applied to other gravitational wave observations^{275} and other large astronomical observational endeavors, such as WMAP3.^{266}

In HEDP and ICF, we are faced with many models that are used to explain experiments and predict trends. Sometimes, these models are theoretical models with small numbers of parameters, but often, these models take the form of multi-physics simulation codes, each of which employ different discretizations, physical approximations, and numerical choices that can impact the fidelity and applicability of the model to a specific domain or experiment. The example of Ref. 91 discussed in Sec. III developed multiple hypotheses for physical mechanisms, which were implemented in a simulation tool in an ad-hoc manner. The experimental data were then used to select which hypothetical process was most consistent. Although Bayes factors were not computed, this application is a clear example of the power of this methodology to guide scientific discovery by systematically ruling out hypotheses that are inconsistent with data. By applying this framework to a more diverse array of experiments and physical processes, we can expect rapid progress in our understanding.

### B. Optimal experiment design

Building from the foundations of inference and hypothesis testing, we can use these tools not only to analyze experimental data but to intelligently guide how we conduct our experiments. By combining these ideas, we can construct a framework that allows us to decide where to obtain experimental data to maximally inform our models or provide data that maximally discriminate between competing model predictions. A fully Bayesian experiment design formalism accounts for all known uncertainties and sensitivities when selecting design points. It has the added benefit that the formalism used to determine the design points is readily able to ingest new experimental data once it is obtained to produce refined model predictions and propose new points.

We begin by proposing a utility function *U*(** D**|

**) that quantifies the value of newly obtained data**

*x***at a given experiment design point**

*D***. This utility function is domain specific and must be engineered based on the problem at hand. However, some general metrics have been developed, which rely on various statistical approximations, most commonly the Laplace approximation. Two common metrics are A-optimality and D-optimality. The A-optimality criterion minimizes the trace of the covariance matrix, effectively minimizing the combined variance of each parameter in the model, ignoring correlations. The D-optimality criterion minimizes the determinant of the covariance matrix, allowing for the incorporation of linear correlations in the design. Variations on these criteria can be found in the literature. Alternatively, information gain can be maximized by using the Kullback–Leibler divergence as the utility function, a metric with roots in information theory. We refer the reader to Ref. 271 for more details on these and other criteria.**

*x*The experiment design points can be thought of as any parameters of the experiment that we have control over. This could be capsule dimensions, drive power or history, gas density, etc. These design points can also entail properties of the measurements being taken, such as the photon energy of a probe, pinhole diameter, the time at which images are taken, and filters used in an instrument. Because our task is to propose an experiment, we do not know what the data will be ahead of time, so we must marginalize over possible outcomes. The data itself depend on the parameters of our model ** θ**, which are also uncertain, so we must include this uncertainty by marginalizing over model parameters. This process gives the expected utility function or the expected value of

*U*(

**|**

*D***) as a function of experiment design parameters weighted by the likelihood of**

*x***and the prior distributions on the model parameters. The proposed design point is then**

*D***that maximizes the expected value. Finding this design point, however, is computationally very challenging. It involves a high-dimensional optimization that requires sampling and integrating over the likelihood at each iteration.**

*x*Often, our goal is to find design points that maximize the constraining power of our *diagnostic observations*, not the performance or dynamics of the experiment itself. However, this approach can be used to maximize some predicted output of an experiment. With traditional approaches to designing experiments in HEDP and ICF, we will often look for sensitivity to underlying conditions or processes in our models that are not directly accessible. This formalism requires us to then look at those underlying conditions through the lens of our instruments. In order to do this well, we must have high fidelity instrument models that account for instrumental response, noise, and background effects so that we can confidently say that our proposed experiment is sensitive to measurable quantities.

Although not Bayesian in nature, an important example in the literature is found in Ref. 276, where a statistical model of yield from direct drive capsules on the Omega laser was used to predict design changes that would lead to enhanced neutron yield. The statistical model in question used a large ensemble of 1D calculations to produce a baseline model of target performance dependent on experimental parameters, such as capsule dimensions and laser drive. An experimental database was used to correct, or calibrate, this model based on experimental observables such as neutron yield and fuel areal density. The calibrated model was then used to suggest new design points that would not have been intuitively discovered if the 1D predictions were naively used to optimize performance. The study ultimately succeeded in tripling the neutron yield from a baseline high performance design. An important benefit of this approach is that the calibrated model is able to ingest the output of each new design point once the experiment is conducted, enabling continual improvement of its predictive capability.

A common application of the Bayesian information based formalism is the design of instruments to minimize uncertainty in a specified quantity. As discussed in Sec. IV, Knapp *et al.* applied this formalism to the optimization of an x-ray detector configuration (detector sensitivity, x-ray filter material, and thickness) to minimize uncertainty and bias in specific inferred quantities.^{160} Bayesian optimization was used to minimize the utility function, which was defined as the mean-squared error of the inferred x-ray spectrum relative to the ground truth from simulation data. In the closely related field of magnetic confinement fusion, this approach has been used to optimize the configuration of an interferometry diagnostic subject to constraints enforced by the vacuum vessel and the geometry of the experiment.^{277} The goal was to configure the lines of sight of three interferometers to maximally inform the electron density profile, for which the information gain was quantified using the Kullback–Leibler divergence in the utility function. Due to the complexity of both the instruments and the underlying physics, applying this approach to the design of new instruments on HEDP facilities could significantly improve the value of new data in addition to providing a means to prioritize resources.

Although well established in other fields (e.g., observational astronomy^{278–280}), the use of this formalism to design the conditions and parameters of experiments on HEDP and ICF facilities is nascent. In this application, our utility function describes the value of new information in informing or constraining model parameters. This is an extremely promising approach in ICF and HEDP where experiments are rare and expensive as it provides a quantitative means to choose between proposed experiments and guide decisions that maximize each experiment’s value. Again, sampling our models over large spans of experimental parameter space enabling marginalization over model parameters is expensive, and so the surrogate modeling techniques and statistical simplifications previously discussed may be necessary.

HEDP and ICF experiments function in a wide range of parameter space, with important processes spanning many orders of magnitude in spatial and temporal scales, with wide ranges of plasma density and temperature. Because of this, it is not trivial to know *a priori* when and where a given model is valid or if other processes may have non-negligible impact on the outcome. By designing experiments with the goal of selecting between candidate physical models, we can readily reject hypotheses and update our understanding of the complex physical phenomena under study. The application of optimal experiment design techniques to model selection is less well developed, though examples do exist outside of HEDP and ICF that we can learn from.^{281,282} In this application, we wish to design experiments such that the observables are capable of discriminating between different models. Now, our utility function captures the discrepancy between predictions from different models, marginalized over the model parameters and likelihood. In the simplest case, one model is correct and the other is incorrect. In reality, most physical models do not obey this dichotomy. Most models are correct in some regime of applicability and fail when parameters stray outside of this regime, where another model is applicable. An example of this is the transition between Spitzer-like transport for weakly coupled plasmas and transport in the strongly coupled regime.^{283} Spitzer theory works extremely well in the weakly coupled limit, but breaks down as the Coulomb coupling parameter Γ → 1. Strongly coupled theories work well for large coupling, but break down as the coupling parameter decreases. Often, these two theories vary widely in the regime of Γ ∼ 1 because neither is valid in this transitional regime. In this and similar situations, we seek to optimize experiments that will delineate the regime of validity for each model. Additionally, the Bayesian framework admits model averaging, which can potentially produce a model that gracefully transitions between regimes of validity.

An important special case that often occurs in the physical sciences is that of *nested* models. Two or more models are said to be nested when they become degenerate in some region of parameter space. Physically, nested models can be viewed as a hierarchy, where one model is a simplification of another. This often occurs in plasma physics because simplified models are derived by making assumptions that reduce the complexity of more complete, but less tractable models. This situation is difficult from an optimization standpoint because in the regime where the two models are degenerate, the fitness of the models will be undetermined, as one cannot be favored over the other. Model selection applied in the case of nested models has been addressed to some extent in the literature though it is an active research area.^{284} By applying these ideas to experiments in ICF and HEDP, we provide a means to optimize the use of expensive resources with the goal of advancing our state of knowledge as efficiently as possible.

### C. Real-time inference and control

The advent of high repetition rate facilities, such as the new generation of high powered lasers^{11,12} and the LCLS,^{10} will push the frontier in both HEDP experiment and data-driven modeling and analysis. These facilities provide a wealth of high quality data at rates that are unprecedented in HEDP physics. The flagship large HEDP facilities (e.g., NIF and Z) operate at 1–3 shots per day. The OMEGA laser can provide up to 15 shots per day. Today’s high rep-rate facilities can operate in the range of 1 experiment/min. to $\u223c10$ experiments/s. The traditional model of operating HEDP experiments does not extend to this unprecedented scale, and a new way of thinking about operation and control at these facilities is needed. To this end, Hatfield *et al.* discussed opportunities for the use of machine learning to analyze data for real-time control and feedback.^{285} Clearly, real-time analysis and inference methods will need to leverage many of the methods presented in Secs. IV and V in order to be feasible on relevant timescales.

As examples of forward looking approaches, the authors of Refs. 286 and 287 show the use of Bayesian optimization for real-time control of a high power laser facility to optimize a wakefield accelerator. In Ref. 288, the authors developed a NN taking x-ray spectra as inputs and outputting spectrum amplitude, slope of the spectrum (a proxy for electron temperature), and total particle dose. The authors discussed how the approach may be appropriate for real-time control applications. As the HEDP/ICF community begins to develop these approaches, we encourage seeking existing solutions from fields where such methods have already reached significant maturity, for example, in magnetic confinement fusion, NN-based methods for real-time applications, including classifying Alfvén eigenmodes that may cause loss of confinement and wall damage^{289} as well as profile prediction.^{290,291} In Ref. 292, a reinforcement learned^{293} model implemented on a field-programmable gate array (FPGA) was applied for controlling an important subsystem of the Fermilab Booster accelerator.

A key consideration for real-time applications is satisfaction of latency requirements. One approach to improving latency is to utilize code that is optimized for run-time. To this end, the authors of Ref. 294 developed a library to convert NNs built and trained in *keras*^{168} to “real-time compatible C.” In the field of high-energy particle (HEP) physics, particularly for experiments at the Large Hadron Collider, event rates of $O(10MHz)$ with data size of $O(1MB/event)$ give rise to $O(1TB/s)$ data rates. In order to manage these data, much of which constitutes uninteresting background events, the HEP community has developed a number of tools that enable ML algorithms to be programmed on FPGAs to provide fast and accurate triggering. For example, the library *hls4ml*^{295} converts NNs trained in common ML libraries to high-level synthesis code for FPGAs. This has enabled a variety of ML algorithms to be programmed on FPGAs, including graph NNs,^{296} boosted decision trees,^{297} and autoencoders.^{298} Finally, we note that FPGA-based applications typically will require smaller NNs due to memory constraints. Various methods to compress NNs while preserving accuracy such as pruning as well as reduced numerical precision of weights and biases (e.g., binary and ternary) have been developed.^{299,300}

Before concluding this section, we note that application of these methods may enable not only observation- or point-estimate-based control, but also real-time posterior inference. This could allow for more general methods such as the optimal experiment design framework discussed in Sec. VI B to be brought to bear. At a high rep-rate facility, the typical targets for each shot can be fed in a continuous stream, such as metal foils, liquid jets, and gas jets. As a result, the parameters varied will be primarily parameters characterizing the drive and diagnostic settings, such as relative pump–probe timing. The same model could be applied to more traditional HEDP facilities, taking advantage of the additional time available between experiments. If we consider operations on OMEGA, where 45–60 min. are available between experiments, it is clear that only a select number of parameters can be varied on a per shot basis. The optimization algorithm would need to have knowledge of the inventory of targets available and the quantities that can be changed (diagnostic timing, laser power history, etc.). However, campaigns could be designed with this mode of operation in mind. With judicious use of surrogate models and automated workflows, it is entirely feasible to process data, infer important quantities, and propose a new design point in under an hour. Facilities such as NIF and Z allow for one or more days in between successive experiments. While this is still not enough time to build new targets, it is enough time to make some changes to the experimental configuration.

Such an approach has the potential to dramatically accelerate progress in HEDP science. Traditionally, campaigns on such facilities are designed in such a way that pre-chosen sets of data are collected and analyzed after the experimental run is over. More data can only be collected at the next opportunity for facility time, which often requires a new proposal submission. This introduces a significant lag between design iterations that could be circumvented with intelligent adaptive control algorithms. Significant investment in data handling, automated processing, and fast inference algorithms will be required to realize this vision.

## VII. COMMUNITY NEEDS

We have demonstrated broad interest within the HEDP and ICF communities in the application of advanced analysis tools to better constrain inferences and maximize the value of data obtained on expensive experiments. These tools have demonstrated success in improving inferences, allowing for access to unobserved physical processes, and the intelligent combination of disparate data across single and multiple experiments. These advances are partially enabled by the similarly recent explosion in the use of machine learning to provide efficient surrogates of complex models. One of the goals of this Review is to help proliferate these techniques across the fields of HEDP and ICF and to spur advancement of key technologies that will improve our ability to perform high quality measurements and efficiently integrate experiment with theory and modeling. In this section, we will discuss gaps and needs that could be filled to help realize these goals. Many of the items discussed strongly parallel recommendations from the work of Hatfield *et al.*^{285} and multiple recent community reports.^{301–303}

### A. Training and education

In order to spread these methodologies and increase their use across HEDP and ICF, the community requires training materials. Most tutorials regarding Bayesian inference available online focus on standard statistical problems. These examples are extremely important, but miss critical aspects of the problems encountered when analyzing experimental data in HEDP and ICF. Importantly, they generally use built-in statistical models and do not address the issues encountered when implementing physical forward models of the experimental system and instruments. This introduces a conceptual leap that must be taken and one that is not always trivial.

Similarly, the literature and databases available to the novice in machine learning are focused on standard problems of interest to industry. These often include data suitable for training a variety of classification tasks (e.g., cats vs dogs, identification of tumors, spam detection, and predicting user product preferences) and some regression tasks (e.g., real estate, weather, and traffic forecasting). While these problems illustrate important tools and techniques, and they leverage existing databases of well understood data, the applications are often subtly, but importantly, different from the physicists needs. Again, this introduces a conceptual barrier that is difficult to overcome.

To circumvent this shortcoming, it is imperative that the HEDP and ICF community develop examples and instructional material that is geared toward problems of interest. This will be the most efficient way to help students and early career researchers adopt these new methodologies that are often missing from graduate school curricula. This Review includes a Jupyter notebook working through the simplified Bayesian inference example from Sec. II, which is available on GitLab. This is intended to be a starting point around which other efforts can coalesce. The authors encouraged collaboration with the community to expand and improve this material. The LLNL open data initiative serves as a useful entry point for machine learning-ready datasets with applications in the physical sciences.^{304} This site provides databases in a variety of topical areas, most relevant to this Review being the JAG ICF dataset for machine learning.^{305}

We also recommend that experts in these methods develop instructional materials that can be used for summer schools and mini-courses at conferences. This development should be done in conjunction with academia to ensure proper pedagogy and integration with existing and potential future graduate school curricula. There are several examples within the HEDP community that could be used as templates for such activities. There are two HEDP summer schools, which operate on an alternating schedule.^{306,307} These schools have been operating for well over a decade at this point and are established within the community. Additionally, LLNL has partnered with the University of California San Diego to make several courses covering various topics in HEDP available to students and researchers across the country.^{308} These examples provide models for new courses or potential opportunities for collaboration that should be explored.

### B. Databases and standards

In addition to training materials, another means to help accelerate the adoption and advancement of these methods is to provide curated databases to the community. Many of the methods presented here require data with which to train algorithms. By making well organized and labeled databases available, others can leverage them to test new ideas, develop new algorithms, and build tools where application specific data are not available or extremely limited. Again, we refer to the JAG ICF database as an example from which the community can build. In cases where the application of interest differs substantially from the use case that generated the data, curated databases can still be useful for prototyping. Alternately, they may serve as a basis for training models that will be updated with transfer learning using a small dataset that is more targeted.

Bayesian methods can also benefit from the availability of data. These databases can be used to test for sensitivity of a particular inference model, develop new diagnostic concepts that can enhance experiments, test model calibration and selection methods, and, in general, to validate a particular inference. The use of data that are well vetted by the community to establish a new technique increases the acceptance of new methods and promotes reproducibility because others will have access to the data.

Beyond making data available to the public for testing and development, these datasets also provide an opportunity to establish a standard for storing raw data, metadata, and other important information. This applies not only to experimental data, but also to simulation data. The availability of a common format with tools for accessing and manipulating data enhances communication and collaboration across the entire community. It also provides a route to readily adapt methods developed at one institution to problems being investigated at another. This applies to ML algorithms and to the development of more conventional forward models for physics and diagnostics. Care must be taken when developing a standard for complex data like that generated in HEDP and ICF research so that the data are not only easy to access and manipulate but are also complete.

### C. Testing and validation

Finally, all of the tools and techniques discussed in this Review require some form of testing and validation before they are applied to experimental data. In addition to setting standards for the storage and dissemination of data, the community must also set standards for validation. The ML community has standard datasets for specific applications that can be used to train and test new algorithms. In ICF and HEDP, it will not always be the case that one can train an algorithm on one of the available standard datasets and then apply it to their specific problem of interest. In such cases, reporting validation scores on these datasets will not be meaningful. To circumvent this, it is recommended that the community develops methodologies and metrics that are meaningful for the tasks discussed here.

## VIII. CONCLUDING REMARKS

In this Review, we have provided a foundation for the researchers to begin their journey applying advanced methods to their work and a number of examples from recent literature illustrating the current status of these methods in ICF and HEDP research. We have provided enough background information to convey the power and potential of these methods to transform many aspects of experimental ICF and HEDP research. Many of the tools discussed here from inference, to experiment design, to automated data reduction have already been applied with great success. The future is bright for the further development and application of advanced analysis tools to greatly advance our understanding of matter in the most extreme conditions.

## ACKNOWLEDGMENTS

The authors would like to thank Dr. Andrew J. Porwitzky, Dr. Owen Mannion, Dr. Jeff Woolstrum, and Dr. Lucas Stanek for providing feedback on the full article greatly improving overall quality; Dr. Jeffrey R. Fein for providing feedback for Sec. V with particularly insightful comments on image analysis and tomography; Dr. Ravi G. Patel for reviewing Sec. IV with especially helpful references on CFD and physics informed machine learning and detailed comments throughout; Dr. Kathryn Maupin for feedback on Bayesian inference and model selection in Secs. II and VI; and Dr. James Gunning for insightful discussion on the topic of data assimilation. In addition, the authors would like to thank the data science working group within the Pulsed Power Sciences Center at SNL for providing a vibrant research environment for exploring and applying advanced methods to problems in HEDP and ICF. This group includes Dr. B. Klein, Dr. M.-A. Schaeuble, Dr. J. Fein. Dr. K. Maupin, Dr. T. Nagayama, Dr. C. Jennings, Dr. S. Hansen, Dr. D. Ampleford, Dr. K. Beckwith, Dr. S. Slutz, Dr. C. Tyler, Dr. R. Patel, Dr. K. Blaha, Dr. E. Harding, Dr. C. Tyler, Dr. S. Fields, Dr. J. Brown, and Dr. G. Vasey. Finally, P. F. Knapp would like to thank Dr. Michael Glinsky for his mentorship, without which this work would not have been possible.

This article was authored by an employee of National Technology and Engineering Solutions of Sandia, LLC under Contract No. DE-NA0003525 with the U.S. Department of Energy (DOE). The employee owns all right, title, and interest in and to the article and is solely responsible for its contents. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this article or allow others to do so, for United States Government purposes. The DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan, https://www.energy.gov/downloads/doe-public-access-plan.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**P. F. Knapp**: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). **W. E. Lewis**: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

## REFERENCES

*The Physics of Inertial Fusion*

*High Energy Density Physics: Fundamentals, Inertial Fusion, and Experimental Astrophysics*

*Probability Theory: The Logic of Science*

*Bayesian Data Analysis*

*Bayesian Probability Theory: Applications in the Physical Sciences*

*Maximum Entropy and Bayesian Methods*

*Gaussian Processes for Machine Learning*

*Probabilistic Graphical Models: Principles and Applications*